# Characterizing CpG Methylation

To describe general methylation trends, irrespective of pCO<sub>2</sub> treatment, in *C. virginica* gonad sequence data, I need to characterize individual CpG loci. Gavery and Roberts (2013) and Olson and Roberts (2014) define a CpG locus as methylated if at least half of the reads remained unconverted after bisulfite treatment. I will create a master `.cov` file to identify methylated CpG loci.

I will also identify methylation islands by replicating [Jeong et al. 2018](https://academic.oup.com/gbe/article/10/10/2766/5098531). I will use [their script](https://github.com/soojinyilab/Methylation-Islands) but modify parameters to reflect differences between insect and *C. virginica* methylation.

1. Create master coverage file
2. Limit to 10x coverage
3. Characterize methylation levels for loci
4. Characterize loci locations
5. Identify methylation islands

## 0. Prepare for analyses

### 0a. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-gigas-oa-meth/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-gigas-oa-meth/analyses


In [3]:
!mkdir 2020-02-11-Characterizing-CpG-Methylation

In [3]:
cd 2020-02-11-Characterizing-CpG-Methylation

/Users/yaamini/Documents/project-gigas-oa-meth/analyses/2020-02-11-Characterizing-CpG-Methylation


## 1. Create master coverage file

Coverage files were previously downloaded in [this Jupyter notebook](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/notebooks/2019-09-13-Generating-Coverage-Tracks.ipynb).

In [5]:
#See what the file looks like. 
#Columns: <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
!head -n 1 ../2019-09-13-IGV-Verification/YRVA_R1_001_bismark_bt2_pe.deduplicated.bismark.cov

C12722	104	104	33.3333333333333	1	2
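To make the column layout concrete, here is a minimal Python sketch that parses one line of the coverage file into named fields. The `CovRecord` type and `parse_cov_line` helper are hypothetical names introduced for illustration; they are not part of Bismark or of this workflow.

```python
from typing import NamedTuple


class CovRecord(NamedTuple):
    """One row of a Bismark-style .cov file (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    pct_meth: float
    count_meth: int
    count_unmeth: int


def parse_cov_line(line: str) -> CovRecord:
    # Columns are tab-delimited, in the order listed in the cell above
    chrom, start, end, pct, meth, unmeth = line.rstrip("\n").split("\t")
    return CovRecord(chrom, int(start), int(end), float(pct), int(meth), int(unmeth))


# The line printed by `head -n 1` above
record = parse_cov_line("C12722\t104\t104\t33.3333333333333\t1\t2")
print(record.count_meth + record.count_unmeth)  # total reads covering this locus
```

Total coverage at a locus is the sum of the last two columns, which is the quantity the later 10x filter relies on.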


In [6]:
%%bash

for f in ../2019-09-13-IGV-Verification/*cov
do
  awk '{print $1"-"$2"\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' ${f} \
  | sort -k1,1 \
  > ${f}.sorted
done

In [7]:
!head ../2019-09-13-IGV-Verification/YRVA_R1_001_bismark_bt2_pe.deduplicated.bismark.cov.sorted

C12722-104	C12722	104	104	33.3333333333333	1	2
C12722-105	C12722	105	105	0	0	1
C12722-134	C12722	134	134	33.3333333333333	1	2
C12722-135	C12722	135	135	0	0	1
C12722-154	C12722	154	154	0	0	3
C12722-155	C12722	155	155	0	0	1
C12722-164	C12722	164	164	0	0	1
C12734-137	C12734	137	137	0	0	2
C12734-178	C12734	178	178	0	0	1
C12734-179	C12734	179	179	0	0	2


In [8]:
#Join the first column in the first file with the first column in the second file
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
#Print unpairable lines in file 1 (-a1) and 2 (-a2) to simulate an outer join. Replace empty fields with 0 (-e) and only print the following fields (-o)
#Convert - to \t to uncouple the chromosome and start position
!join -1 1 -2 1 \
-t $'\t' \
-a1 -a2 -e 0 -o '0,1.5,1.6,1.7,2.5,2.6,2.7' \
../2019-09-13-IGV-Verification/YRVA_R1_001_bismark_bt2_pe.deduplicated.bismark.cov.sorted \
../2019-09-13-IGV-Verification/YRVL_R1_001_bismark_bt2_pe.deduplicated.bismark.cov.sorted \
| tr '-' "\t" > cgigas_gonad-10x_raw.cov

In [9]:
#Check join output after conversion
!head cgigas_gonad-10x_raw.cov

C12722	104	33.3333333333333	1	2	0	0	0
C12722	105	0	0	1	0	0	0
C12722	134	33.3333333333333	1	2	0	0	0
C12722	135	0	0	1	0	0	0
C12722	154	0	0	3	0	0	0
C12722	155	0	0	1	0	0	0
C12722	164	0	0	1	0	0	0
C12734	136	0	0	0	0	0	2
C12734	137	0	0	2	0	0	0
C12734	178	0	0	1	100	1	0
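The outer join performed by `join -a1 -a2 -e 0` can be sketched in Python as follows. This is an illustration under assumptions, not the method used above: each sample is assumed to be held in memory as a dictionary keyed by `(chromosome, position)`, and `outer_join` is a hypothetical helper. Rows are sorted with numeric positions here, whereas `sort -k1,1` orders the combined key as text.

```python
def outer_join(sample_a, sample_b):
    """Full outer join of two {(chrom, pos): (pct, meth, unmeth)} dicts.

    Loci missing from one sample are filled with zeros, mirroring `-e 0`.
    """
    missing = (0.0, 0, 0)
    rows = []
    for key in sorted(set(sample_a) | set(sample_b)):
        chrom, pos = key
        rows.append((chrom, pos) + sample_a.get(key, missing) + sample_b.get(key, missing))
    return rows


# Three loci taken from the output above
sample_a = {("C12722", 104): (33.3333333333333, 1, 2), ("C12734", 178): (0.0, 0, 1)}
sample_b = {("C12734", 136): (0.0, 0, 2), ("C12734", 178): (100.0, 1, 0)}

joined = outer_join(sample_a, sample_b)
for row in joined:
    print(*row, sep="\t")
```

Keeping the chromosome and position as a tuple avoids relying on `-` as a separator. The `tr '-' "\t"` step replaces every hyphen in a line, so it would also split any sequence name that itself contains a hyphen.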


In [23]:
#Calculate total count methylated and unmethylated for each locus
#Sum number of reads at each locus
#Calculating revised methylation percentage for each locus
#Multiply percentages by 100
!awk '{ print $1"\t"$2"\t"$2"\t"$4+$7"\t"$5+$8}' cgigas_gonad-10x_raw.cov \
| awk '{ print $1"\t"$2"\t"$3"\t"$4+$5"\t"$4"\t"$5}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$5/$4"\t"$5"\t"$6}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$4*100"\t"$5"\t"$6}' \
> cgigas_gonad-10x_concat.cov

In [24]:
#Check final concatenated file
!head cgigas_gonad-10x_concat.cov

C12722	104	104	33.3333	1	2
C12722	105	105	0	0	1
C12722	134	134	33.3333	1	2
C12722	135	135	0	0	1
C12722	154	154	0	0	3
C12722	155	155	0	0	1
C12722	164	164	0	0	1
C12734	136	136	0	0	2
C12734	137	137	0	0	2
C12734	178	178	50	1	1
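The arithmetic in the `awk` pipeline above pools read counts across the two samples and recomputes percent methylation from the pooled counts, rather than averaging the two per-sample percentages. A minimal Python sketch of that calculation follows. `pool_counts` is a hypothetical helper, and the guard against zero total coverage is an added assumption that the `awk` version does not include.

```python
def pool_counts(meth_a, unmeth_a, meth_b, unmeth_b):
    """Sum methylated and unmethylated read counts and recompute percent methylation."""
    meth = meth_a + meth_b
    unmeth = unmeth_a + unmeth_b
    total = meth + unmeth
    # Assumption: report 0.0 when a locus has no reads instead of dividing by zero
    pct = 100 * meth / total if total else 0.0
    return pct, meth, unmeth


# C12734 position 178: sample 1 has 0 methylated / 1 unmethylated read,
# sample 2 has 1 methylated / 0 unmethylated reads
print(pool_counts(0, 1, 1, 0))
```

Pooling weights each sample by its read depth, so a locus covered by many reads in one sample and few in the other is dominated by the deeper sample.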


In [25]:
#See how many loci have data
!awk '{if ($5+$6 >= 1) { print $1, $2-1, $3, $4, $5+$6}}' cgigas_gonad-10x_concat.cov \
| wc -l

 18793411


## 2. Limit to 10x coverage

In [26]:
#If total coverage (count methylated + unmethylated) is at least 10
#then print the chromosome, start pos -1, stop pos, percent methylation, and total coverage
#Save output as new file
!awk '{if ($5+$6 >= 10) { print $1, $2-1, $3, $4, $5+$6}}' cgigas_gonad-10x_concat.cov \
> 2020-02-11-All-10x-CpGs.bedgraph

In [27]:
#Check columns of the file: <chromosome> <start position> <stop position> <percent methylation> <coverage>
!head 2020-02-11-All-10x-CpGs.bedgraph

C12838 107 108 36.3636 11
C12838 155 156 20 10
C12838 60 61 27.2727 11
C12838 64 65 54.5455 11
C12838 82 83 45.4545 11
C12924 102 103 4.7619 21
C12924 127 128 5 20
C12924 136 137 14.2857 21
C12924 185 186 0 14
C12924 19 20 40 15
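The filter and coordinate shift above can be sketched in Python as below. `to_bedgraph` is a hypothetical helper, and the read counts in `rows` are illustrative values chosen to be consistent with the percentages shown, not counts read from the files. Subtracting 1 from the start converts the 1-based position reported in the coverage file into the 0-based, half-open interval that bedGraph and BED files use.

```python
def to_bedgraph(rows, min_coverage=10):
    """Keep loci with at least `min_coverage` reads and shift start to 0-based."""
    out = []
    for chrom, start, end, pct, meth, unmeth in rows:
        coverage = meth + unmeth
        if coverage >= min_coverage:
            out.append((chrom, start - 1, end, pct, coverage))
    return out


rows = [
    ("C12722", 104, 104, 33.3333, 1, 2),  # 3 reads: dropped
    ("C12838", 108, 108, 36.3636, 4, 7),  # 11 reads: kept (illustrative counts)
]
bedgraph = to_bedgraph(rows)
print(bedgraph)
```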


In [28]:
#Count loci with 10x coverage
!wc -l 2020-02-11-All-10x-CpGs.bedgraph

 12728174 2020-02-11-All-10x-CpGs.bedgraph


In [29]:
#Replace delimiters to save .bedgraph as .csv
!awk '{print $1","$2","$3","$4 }' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpGs.csv

In [30]:
#Confirm .csv creation
!head 2020-02-11-All-10x-CpGs.csv

C12838,107,108,36.3636
C12838,155,156,20
C12838,60,61,27.2727
C12838,64,65,54.5455
C12838,82,83,45.4545
C12924,102,103,4.7619
C12924,127,128,5
C12924,136,137,14.2857
C12924,185,186,0
C12924,19,20,40


## 3. Characterize methylation levels for loci

Olson and Roberts (2014) define the following categories for CpG methylation:

- Methylated (50% methylation and above)
- Sparsely methylated (0-50% methylated)
- Unmethylated (0% methylation)

I will modify this slightly:

- Methylated (50% methylation and above)
- Sparsely methylated (10-50% methylated)
- Unmethylated (10% methylation and below)
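The modified categories can be written as a single classification function. This is a minimal sketch with a hypothetical helper name, `classify_locus`; the boundaries follow the `awk` conditions used in sections 3a to 3c, where exactly 50% counts as methylated and exactly 10% counts as unmethylated.

```python
def classify_locus(pct_meth):
    """Assign a CpG locus to a methylation category from its percent methylation."""
    if pct_meth >= 50:
        return "methylated"
    if pct_meth > 10:
        return "sparsely methylated"
    return "unmethylated"


for pct in (54.5455, 36.3636, 10, 0):
    print(pct, classify_locus(pct))
```

Because every locus falls into exactly one branch, the three category counts should sum to the number of 10x loci, which is a quick consistency check on the three output files.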

### 3a. Methylated loci

In [None]:
#If percent methylation is greater than or equal to 50, then save the loci information
!awk '{if ($4 >= 50) { print $1, $2, $3, $4 }}' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph

In [None]:
#Confirm methylated loci were saved
!head 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph

In [None]:
#Count methylated loci
!wc -l 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph

In [None]:
#Replace delimiters to save .bedgraph as .csv
!awk '{print $1","$2","$3","$4 }' 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Methylated.csv

In [None]:
#Check .csv was saved
!head 2020-02-11-All-10x-CpG-Loci-Methylated.csv

### 3b. Sparsely methylated loci

In [33]:
%%bash
awk '{if ($4 < 50) { print $1, $2, $3, $4}}' 2020-02-11-All-10x-CpGs.bedgraph \
| awk '{if ($4 > 10) { print $1, $2, $3, $4 }}' \
> 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph

In [34]:
#Confirm sparsely methylated loci were saved
!head 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph

NC_007175.2 1506 1507 16.6666666666667
NC_007175.2 1820 1821 20
NC_007175.2 2128 2129 11.7647058823529
NC_007175.2 4841 4842 15
NC_007175.2 13069 13070 20
NC_035780.1 421 422 14.2857142857143
NC_035780.1 1101 1102 12.5
NC_035780.1 1540 1541 16.6666666666667
NC_035780.1 3468 3469 16.6666666666667
NC_035780.1 9254 9255 28.5714285714286


In [35]:
#Count sparsely methylated loci
!wc -l 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph

  481788 2019-04-09-All-5x-CpG-Loci-Sparsely-Methylated.bedgraph


### 3c. Unmethylated loci

In [36]:
!awk '{if ($4 <= 10) { print $1, $2, $3, $4 }}' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph

In [37]:
#Confirm unmethylated loci were saved
!head 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph

NC_007175.2 48 49 1.25
NC_007175.2 49 50 0
NC_007175.2 50 51 1.18343195266272
NC_007175.2 51 52 0
NC_007175.2 87 88 1.02459016393443
NC_007175.2 88 89 1.38888888888889
NC_007175.2 146 147 1.99115044247788
NC_007175.2 147 148 2.29885057471264
NC_007175.2 173 174 0
NC_007175.2 192 193 1.25786163522013


In [38]:
#Count unmethylated loci
!wc -l 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph

  640565 2019-04-09-All-5x-CpG-Loci-Unmethylated.bedgraph


## 4. Characterize loci locations

The next step is to characterize the location of each loci category in the genome. I will use `intersectBed` to find overlaps of all 10x CpGs, methylated loci, sparsely methylated loci, and unmethylated loci with exons, introns, genes, transposable elements, and putative promoter regions.
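As a rough illustration of what the `-u` option of `intersectBed` counts, the sketch below reports each locus at most once if it overlaps any feature. It is a simplified stand-in, not how bedtools works internally: `overlaps_any` and `count_unique_overlaps` are hypothetical helpers, all intervals are assumed to already be 0-based and half-open, and the scan is quadratic, so it only suits small examples.

```python
def overlaps_any(locus, features):
    """True if the locus overlaps at least one feature on the same chromosome."""
    chrom, start, end = locus
    return any(
        f_chrom == chrom and start < f_end and f_start < end
        for f_chrom, f_start, f_end in features
    )


def count_unique_overlaps(loci, features):
    # Each locus contributes at most 1, however many features it overlaps
    return sum(overlaps_any(locus, features) for locus in loci)


loci = [("NC_035780.1", 28992, 28993), ("NC_035780.1", 29412, 29413)]
exons = [("NC_035780.1", 28961, 29073), ("NC_035780.1", 30524, 31557)]
print(count_unique_overlaps(loci, exons))
```

By contrast, `-wb` writes one line per locus-feature pair, so a locus that overlaps two features appears twice in those output files.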

### 4a. Create `.bed` files

#### All 10x CpGs

In [69]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpGs.bed

In [70]:
#Confirm file creation
!head 2020-02-11-All-10x-CpGs.bed

NC_007175.2	48	49
NC_007175.2	49	50
NC_007175.2	50	51
NC_007175.2	51	52
NC_007175.2	87	88
NC_007175.2	88	89
NC_007175.2	146	147
NC_007175.2	147	148
NC_007175.2	173	174
NC_007175.2	192	193


#### Methylated loci

In [8]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Methylated.bed

In [4]:
#Confirm file creation
!head 2020-02-11-All-10x-CpG-Loci-Methylated.bed

NC_035780.1	9253	9254
NC_035780.1	9637	9638
NC_035780.1	9657	9658
NC_035780.1	10089	10090
NC_035780.1	10331	10332
NC_035780.1	11692	11693
NC_035780.1	11706	11707
NC_035780.1	11711	11712
NC_035780.1	12686	12687
NC_035780.1	12758	12759


#### Sparsely methylated loci

In [5]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed

In [6]:
#Confirm file creation
!head 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed

NC_007175.2	1506	1507
NC_007175.2	1820	1821
NC_007175.2	2128	2129
NC_007175.2	4841	4842
NC_007175.2	13069	13070
NC_035780.1	421	422
NC_035780.1	1101	1102
NC_035780.1	1540	1541
NC_035780.1	3468	3469
NC_035780.1	9254	9255


#### Unmethylated loci

In [7]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Unmethylated.bed

In [8]:
#Confirm file creation
!head 2020-02-11-All-10x-CpG-Loci-Unmethylated.bed

NC_007175.2	48	49
NC_007175.2	49	50
NC_007175.2	50	51
NC_007175.2	51	52
NC_007175.2	87	88
NC_007175.2	88	89
NC_007175.2	146	147
NC_007175.2	147	148
NC_007175.2	173	174
NC_007175.2	192	193


### 4b. Set variable paths

In [7]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [8]:
all10xCpGs = "2020-02-11-All-10x-CpGs.bed"

In [9]:
methylatedLoci = "2020-02-11-All-10x-CpG-Loci-Methylated.bed"

In [10]:
sparselyMethylatedLoci = "2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed"

In [11]:
unmethylatedLoci = "2020-02-11-All-10x-CpG-Loci-Unmethylated.bed"

In [12]:
exonList = "../2019-09-15-DML-Analysis/Cgigas_v9_exon.gff"

In [13]:
intronList = "../2019-09-15-DML-Analysis/Cgigas_v9_intron.gff"

In [14]:
geneList = "../2019-09-15-DML-Analysis/Cgigas_v9_gene.gff"

In [15]:
transposableElementsAll = "../2019-09-15-DML-Analysis/Cgigas_v9_TE.gff"

In [17]:
putativePromoters = "../2019-09-15-DML-Analysis/Cgigas_v9_1k5p_gene_promoter.gff"

### 4c. Exons

#### All 10x CpGs

In [16]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {exonList} \
| wc -l
!echo "all 10x CpG loci overlaps with exons"

 1366779
all 5x CpG loci overlaps with exons


In [17]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {exonList} \
> 2020-02-11-All10xCpGs-Exon.txt

In [18]:
!head 2020-02-11-All10xCpGs-Exon.txt

NC_035780.1	28992	28993	NC_035780.1	28961	29073
NC_035780.1	29001	29002	NC_035780.1	28961	29073
NC_035780.1	30723	30724	NC_035780.1	30524	31557
NC_035780.1	30765	30766	NC_035780.1	30524	31557
NC_035780.1	30811	30812	NC_035780.1	30524	31557
NC_035780.1	30906	30907	NC_035780.1	30524	31557
NC_035780.1	30932	30933	NC_035780.1	30524	31557
NC_035780.1	30935	30936	NC_035780.1	30524	31557
NC_035780.1	31017	31018	NC_035780.1	30524	31557
NC_035780.1	31018	31019	NC_035780.1	30524	31557


#### Methylated loci

In [19]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {exonList} \
| wc -l
!echo "methylated loci overlaps with exons"

 1013691
methylated loci overlaps with exons


In [20]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {exonList} \
> 2020-02-11-MethLoci-Exon.txt

In [21]:
!head 2020-02-11-MethLoci-Exon.txt

NC_035780.1	100558	100559	NC_035780.1	100554	100661
NC_035780.1	100559	100560	NC_035780.1	100554	100661
NC_035780.1	100575	100576	NC_035780.1	100554	100661
NC_035780.1	100576	100577	NC_035780.1	100554	100661
NC_035780.1	100581	100582	NC_035780.1	100554	100661
NC_035780.1	100582	100583	NC_035780.1	100554	100661
NC_035780.1	100634	100635	NC_035780.1	100554	100661
NC_035780.1	100635	100636	NC_035780.1	100554	100661
NC_035780.1	100643	100644	NC_035780.1	100554	100661
NC_035780.1	100644	100645	NC_035780.1	100554	100661


#### Sparsely methylated loci

In [22]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {exonList} \
| wc -l
!echo "sparsely methylated loci overlaps with exons"

  105871
sparsely methylated loci overlaps with exons


In [23]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {exonList} \
> 2020-02-11-SparseMethLoci-Exon.txt

In [24]:
!head 2020-02-11-SparseMethLoci-Exon.txt

NC_035780.1	31078	31079	NC_035780.1	30524	31557
NC_035780.1	85755	85756	NC_035780.1	85606	85777
NC_035780.1	94754	94755	NC_035780.1	94571	95254
NC_035780.1	106236	106237	NC_035780.1	106004	106460
NC_035780.1	204528	204529	NC_035780.1	204243	204795
NC_035780.1	207401	207402	NC_035780.1	207388	207743
NC_035780.1	207423	207424	NC_035780.1	207388	207743
NC_035780.1	207472	207473	NC_035780.1	207388	207743
NC_035780.1	223409	223410	NC_035780.1	223311	223637
NC_035780.1	223416	223417	NC_035780.1	223311	223637


#### Unmethylated loci

In [25]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {exonList} \
| wc -l
!echo "unmethylated loci overlaps with exons"

  247217
unmethylated loci overlaps with exons


In [26]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {exonList} \
> 2020-02-11-UnMethLoci-Exon.txt

In [27]:
!head 2020-02-11-UnMethLoci-Exon.txt

NC_035780.1	28992	28993	NC_035780.1	28961	29073
NC_035780.1	29001	29002	NC_035780.1	28961	29073
NC_035780.1	30723	30724	NC_035780.1	30524	31557
NC_035780.1	30765	30766	NC_035780.1	30524	31557
NC_035780.1	30811	30812	NC_035780.1	30524	31557
NC_035780.1	30906	30907	NC_035780.1	30524	31557
NC_035780.1	30932	30933	NC_035780.1	30524	31557
NC_035780.1	30935	30936	NC_035780.1	30524	31557
NC_035780.1	31017	31018	NC_035780.1	30524	31557
NC_035780.1	31018	31019	NC_035780.1	30524	31557


### 4d. Introns

#### All 10x CpGs

In [28]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {intronList} \
| wc -l
!echo "all 10x CpG loci overlaps with introns"

 1884429
all 5x CpG loci overlaps with introns


In [29]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {intronList} \
> 2020-02-11-All5xCpGs-Intron.txt

In [30]:
!head 2020-02-11-All5xCpGs-Intron.txt

NC_035780.1	29412	29413	NC_035780.1	29073	30523
NC_035780.1	31940	31941	NC_035780.1	31887	31976
NC_035780.1	44372	44373	NC_035780.1	44358	45912
NC_035780.1	45142	45143	NC_035780.1	44358	45912
NC_035780.1	45542	45543	NC_035780.1	44358	45912
NC_035780.1	46515	46516	NC_035780.1	46506	64122
NC_035780.1	47583	47584	NC_035780.1	46506	64122
NC_035780.1	47590	47591	NC_035780.1	46506	64122
NC_035780.1	47651	47652	NC_035780.1	46506	64122
NC_035780.1	47679	47680	NC_035780.1	46506	64122


#### Methylated loci

In [31]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {intronList} \
| wc -l
!echo "methylated loci overlaps with introns"

 1504791
methylated loci overlaps with introns


In [32]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {intronList} \
> 2020-02-11-MethLoci-Intron.txt

In [33]:
!head 2020-02-11-MethLoci-Intron.txt

NC_035780.1	29412	29413	NC_035780.1	29073	30523
NC_035780.1	87531	87532	NC_035780.1	85777	88422
NC_035780.1	87541	87542	NC_035780.1	85777	88422
NC_035780.1	87590	87591	NC_035780.1	85777	88422
NC_035780.1	87595	87596	NC_035780.1	85777	88422
NC_035780.1	100664	100665	NC_035780.1	100661	104928
NC_035780.1	100665	100666	NC_035780.1	100661	104928
NC_035780.1	100917	100918	NC_035780.1	100661	104928
NC_035780.1	100975	100976	NC_035780.1	100661	104928
NC_035780.1	101305	101306	NC_035780.1	100661	104928


#### Sparsely methylated loci

In [34]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {intronList} \
| wc -l
!echo "sparsely methylated loci overlaps with introns"

  211143
sparsely methylated loci overlaps with introns


In [35]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {intronList} \
> 2020-02-11-SparseMethLoci-Intron.txt

In [36]:
!head 2020-02-11-SparseMethLoci-Intron.txt

NC_035780.1	45142	45143	NC_035780.1	44358	45912
NC_035780.1	45542	45543	NC_035780.1	44358	45912
NC_035780.1	48914	48915	NC_035780.1	46506	64122
NC_035780.1	48928	48929	NC_035780.1	46506	64122
NC_035780.1	48940	48941	NC_035780.1	46506	64122
NC_035780.1	87599	87600	NC_035780.1	85777	88422
NC_035780.1	87607	87608	NC_035780.1	85777	88422
NC_035780.1	103272	103273	NC_035780.1	100661	104928
NC_035780.1	104332	104333	NC_035780.1	100661	104928
NC_035780.1	105767	105768	NC_035780.1	105614	106003


#### Unmethylated loci

In [37]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {intronList} \
| wc -l
!echo "unmethylated loci overlaps with introns"

  168495
unmethylated loci overlaps with introns


In [38]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {intronList} \
> 2020-02-11-UnMethLoci-Intron.txt

In [39]:
!head 2020-02-11-UnMethLoci-Intron.txt

NC_035780.1	31940	31941	NC_035780.1	31887	31976
NC_035780.1	44372	44373	NC_035780.1	44358	45912
NC_035780.1	46515	46516	NC_035780.1	46506	64122
NC_035780.1	47583	47584	NC_035780.1	46506	64122
NC_035780.1	47590	47591	NC_035780.1	46506	64122
NC_035780.1	47651	47652	NC_035780.1	46506	64122
NC_035780.1	47679	47680	NC_035780.1	46506	64122
NC_035780.1	48094	48095	NC_035780.1	46506	64122
NC_035780.1	48108	48109	NC_035780.1	46506	64122
NC_035780.1	48114	48115	NC_035780.1	46506	64122


### 4e. Genes

#### All 10x CpGs

In [40]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {geneList} \
| wc -l
!echo "all 10x CpG loci overlaps with genes"

 3255049
all 5x CpG loci overlaps with genes


In [41]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {geneList} \
> 2020-02-11-All5xCpGs-Genes.txt

In [42]:
!head 2020-02-11-All5xCpGs-Genes.txt

NC_035780.1	28992	28993	NC_035780.1	28961	33324
NC_035780.1	29001	29002	NC_035780.1	28961	33324
NC_035780.1	29412	29413	NC_035780.1	28961	33324
NC_035780.1	30723	30724	NC_035780.1	28961	33324
NC_035780.1	30765	30766	NC_035780.1	28961	33324
NC_035780.1	30811	30812	NC_035780.1	28961	33324
NC_035780.1	30906	30907	NC_035780.1	28961	33324
NC_035780.1	30932	30933	NC_035780.1	28961	33324
NC_035780.1	30935	30936	NC_035780.1	28961	33324
NC_035780.1	31017	31018	NC_035780.1	28961	33324


In [44]:
!cut -f6 2020-02-11-All5xCpGs-Genes.txt| sort | uniq -c | wc -l
!echo "unique genes represented in overlaps"

   33126
unique genes represented in overlaps
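The `cut -f6` command above counts distinct values in the sixth column, which in the output shown is the gene end coordinate. Two genes that share an end coordinate would be counted once. A minimal Python sketch of an alternative follows, assuming the six-column layout printed by `head` above; `count_unique_features` is a hypothetical helper that counts distinct (chromosome, start, end) combinations instead.

```python
def count_unique_features(lines):
    """Count distinct features using columns 4-6 (chromosome, start, end)."""
    features = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        features.add(tuple(fields[3:6]))
    return len(features)


# Lines in the format of the gene overlap files shown in this section
lines = [
    "NC_035780.1\t28992\t28993\tNC_035780.1\t28961\t33324",
    "NC_035780.1\t29001\t29002\tNC_035780.1\t28961\t33324",
    "NC_035780.1\t87531\t87532\tNC_035780.1\t85606\t95254",
]
print(count_unique_features(lines))
```

To run this on a real file, pass an open file handle in place of the list, since the function only iterates over lines.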


#### Methylated loci

In [45]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {geneList} \
| wc -l
!echo "methylated loci overlaps with genes"

 2521653
methylated loci overlaps with genes


In [47]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {geneList} \
> 2020-02-11-MethLoci-Genes.txt

In [48]:
!head 2020-02-11-MethLoci-Genes.txt

NC_035780.1	29412	29413	NC_035780.1	28961	33324
NC_035780.1	87531	87532	NC_035780.1	85606	95254
NC_035780.1	87541	87542	NC_035780.1	85606	95254
NC_035780.1	87590	87591	NC_035780.1	85606	95254
NC_035780.1	87595	87596	NC_035780.1	85606	95254
NC_035780.1	100558	100559	NC_035780.1	99840	106460
NC_035780.1	100559	100560	NC_035780.1	99840	106460
NC_035780.1	100575	100576	NC_035780.1	99840	106460
NC_035780.1	100576	100577	NC_035780.1	99840	106460
NC_035780.1	100581	100582	NC_035780.1	99840	106460


In [52]:
!cut -f6 2020-02-11-MethLoci-Genes.txt| sort | uniq -c | wc -l
!echo "unique genes represented in overlaps"

   25496
unique genes represented in overlaps


#### Sparsely methylated loci

In [53]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {geneList} \
| wc -l
!echo "sparsely methylated loci overlaps with genes"

  317249
sparsely methylated loci overlaps with genes


In [54]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {geneList} \
> 2020-02-11-SparseMethLoci-Genes.txt

In [55]:
!head 2020-02-11-SparseMethLoci-Genes.txt

NC_035780.1	31078	31079	NC_035780.1	28961	33324
NC_035780.1	45142	45143	NC_035780.1	43111	66897
NC_035780.1	45542	45543	NC_035780.1	43111	66897
NC_035780.1	48914	48915	NC_035780.1	43111	66897
NC_035780.1	48928	48929	NC_035780.1	43111	66897
NC_035780.1	48940	48941	NC_035780.1	43111	66897
NC_035780.1	85755	85756	NC_035780.1	85606	95254
NC_035780.1	87599	87600	NC_035780.1	85606	95254
NC_035780.1	87607	87608	NC_035780.1	85606	95254
NC_035780.1	94754	94755	NC_035780.1	85606	95254


In [57]:
!cut -f6 2020-02-11-SparseMethLoci-Genes.txt| sort | uniq -c | wc -l
!echo "unique genes represented in overlaps"

   26953
unique genes represented in overlaps


#### Unmethylated loci

In [58]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {geneList} \
| wc -l
!echo "unmethylated loci overlaps with genes"

  416147
unmethylated loci overlaps with genes


In [59]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {geneList} \
> 2020-02-11-UnMethLoci-Genes.txt

In [60]:
!head 2020-02-11-UnMethLoci-Genes.txt

NC_035780.1	28992	28993	NC_035780.1	28961	33324
NC_035780.1	29001	29002	NC_035780.1	28961	33324
NC_035780.1	30723	30724	NC_035780.1	28961	33324
NC_035780.1	30765	30766	NC_035780.1	28961	33324
NC_035780.1	30811	30812	NC_035780.1	28961	33324
NC_035780.1	30906	30907	NC_035780.1	28961	33324
NC_035780.1	30932	30933	NC_035780.1	28961	33324
NC_035780.1	30935	30936	NC_035780.1	28961	33324
NC_035780.1	31017	31018	NC_035780.1	28961	33324
NC_035780.1	31018	31019	NC_035780.1	28961	33324


In [61]:
!cut -f6 2020-02-11-UnMethLoci-Genes.txt| sort | uniq -c | wc -l
!echo "unique genes represented in overlaps"

   27753
unique genes represented in overlaps


### 4f. Transposable elements

#### All 10x CpGs

In [62]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {transposableElementsAll} \
| wc -l
!echo "all 10x CpG loci overlaps with transposable elements (all)"

 1011883
all 5x CpG loci overlaps with transposable elements (all)


In [63]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {transposableElementsAll} \
> 2020-02-11-All5xCpGs-TE-All.txt

In [64]:
!head 2020-02-11-All5xCpGs-TE-All.txt

NC_007175.2	263	264	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	264	265	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	265	266	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	266	267	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	295	296	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	331	332	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	332	333	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	366	367	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	367	368	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.

#### Methylated loci

In [65]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {transposableElementsAll} \
| wc -l
!echo "methylated loci overlaps with transposable elements (all)"

  755222
methylated loci overlaps with transposable elements (all)


In [66]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {transposableElementsAll} \
> 2020-02-11-MethLoci-TE-All.txt

In [67]:
!head 2020-02-11-MethLoci-TE-All.txt

NC_035780.1	9253	9254	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	19631	19632	NC_035780.1	RepeatMasker	similarity	19431	19866	23.3	-	.	Target "Motif:Crypton-N19_CGi" 580 1033
NC_035780.1	19741	19742	NC_035780.1	RepeatMasker	similarity	19431	19866	23.3	-	.	Target "Motif:Crypton-N19_CGi" 580 1033
NC_035780.1	37557	37558	NC_035780.1	RepeatMasker	similarity	37557	37890	12.9	+	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337
NC_035780.1	37581	37582	NC_035780.1	RepeatMasker	similarity	37557	37890	12.9	+	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337
NC_035780.1	37604	37605	NC_035780.1	RepeatMasker	similarity	37557	37890	12.9	+	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337
NC_035780.1	37611	37612	NC_035780.1	RepeatMasker	similarity	37557	37890	12.9	+	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337
NC_035780.1	37618	37619	NC_035780.1	RepeatMasker	similarity	37557	37890	12.9	+	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337
NC_035780.1	37622	37623	NC_035780.1	Repea

#### Sparsely methylated loci

In [68]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {transposableElementsAll} \
| wc -l
!echo "sparsely methylated loci overlaps with transposable elements (all)"

  155293
sparsely methylated loci overlaps with transposable elements (all)


In [69]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {transposableElementsAll} \
> 2020-02-11-SparseMethLoci-TE-All.txt

In [70]:
!head 2020-02-11-SparseMethLoci-TE-All.txt

NC_007175.2	1820	1821	NC_007175.2	RepeatMasker	similarity	1728	1947	26.1	-	.	Target "Motif:REP-6_LMi" 14320 14534
NC_007175.2	2128	2129	NC_007175.2	RepeatMasker	similarity	2129	2367	20.5	-	.	Target "Motif:REP-6_LMi" 13886 14118
NC_035780.1	9254	9255	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	9266	9267	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	9267	9268	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	9297	9298	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	9298	9299	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	9301	9302	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332
NC_035780.1	9302	9303	NC_035780.1	RepeatMasker	similarity	9223	9562	26.9	-	.	Target "Motif:DNA-19_CGi" 1 332


#### Unmethylated loci

In [71]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {transposableElementsAll} \
| wc -l
!echo "unmethylated loci overlaps with transposable elements (all)"

  101368
unmethylated loci overlaps with transposable elements (all)


In [72]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {transposableElementsAll} \
> 2020-02-11-UnMethLoci-TE-All.txt

In [73]:
!head 2020-02-11-UnMethLoci-TE-All.txt

NC_007175.2	263	264	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	264	265	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	265	266	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	266	267	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	295	296	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	331	332	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	332	333	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	366	367	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	367	368	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.

### 4g. Putative promoters

#### All 10x CpGs

In [18]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {putativePromoters} \
| wc -l
!echo "all 10x CpG loci overlaps with putative promoters"

  176156
all 5x CpG loci overlaps with putative promoters


In [19]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {putativePromoters} \
> 2020-02-11-All5xCpGs-Putative-Promoters.txt

In [20]:
!head 2020-02-11-All5xCpGs-Putative-Promoters.txt

NC_035780.1	27969	27970	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	27979	27980	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	28082	28083	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID

#### Methylated loci

In [21]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {putativePromoters} \
| wc -l
!echo "methylated loci overlaps with putative promoters"

  106111
methylated loci overlaps with putative promoters


In [22]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {putativePromoters} \
> 2020-02-11-MethLoci-Putative-Promoters.txt

In [23]:
!head 2020-02-11-MethLoci-Putative-Promoters.txt

NC_035780.1	27969	27970	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	27979	27980	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	28082	28083	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID

#### Sparsely methylated loci

In [24]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {putativePromoters} \
| wc -l
!echo "sparsely methylated loci overlap with putative promoters"

   22870
sparsely methylated loci overlap with putative promoters


In [25]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {putativePromoters} \
> 2020-02-11-SparseMethLoci-Putative-Promoters.txt

In [26]:
!head 2020-02-11-SparseMethLoci-Putative-Promoters.txt

NC_035780.1	95674	95675	NC_035780.1	Gnomon	mRNA	95255	96254	.	-	.	ID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
NC_035780.1	99251	99252	NC_035780.1	Gnomon	mRNA	98840	99839	.	+	.	ID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1
NC_035780.1	232223	232224	NC_035780.1	Gnomon	mRNA	231965	232964	.	-	.	ID

#### Unmethylated loci

In [27]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {putativePromoters} \
| wc -l
!echo "unmethylated loci overlap with putative promoters"

   47175
unmethylated loci overlap with putative promoters


In [28]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {putativePromoters} \
> 2020-02-11-UnMethLoci-Putative-Promoters.txt

In [29]:
!head 2020-02-11-UnMethLoci-Putative-Promoters.txt

NC_035780.1	28859	28860	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	28924	28925	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	46515	46516	NC_035780.1	Gnomon	mRNA	46507	47506	.	-	.	ID=rna3;Parent=gene2;Dbxref=GeneID
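
The three `-u` counts above can be turned into the share of each methylation category among promoter-overlapping loci. This is a minimal sketch that assumes the three categories are mutually exclusive, so each locus is counted once; the variable names are illustrative and not part of the original analysis.

```python
# Counts copied from the `wc -l` outputs above
# (loci overlapping putative promoters, one line per locus because of -u)
promoter_counts = {
    "methylated": 106111,
    "sparsely methylated": 22870,
    "unmethylated": 47175,
}

# Assumption: every locus belongs to exactly one methylation category
promoter_total = sum(promoter_counts.values())

# Percentage of promoter-overlapping loci in each category
promoter_percent = {
    category: 100 * count / promoter_total
    for category, count in promoter_counts.items()
}

for category, percent in promoter_percent.items():
    print(f"{category}: {percent:.1f}% of promoter-overlapping loci")
```

The same pattern could be reused for the other genomic features, provided the counts come from `-u` calls so that no locus is counted twice.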

### 4h. No overlaps

#### All 10x CpGs

In [30]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {all10xCpGs} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "all 10x CpG loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

  603597
all 10x CpG loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [31]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {all10xCpGs} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-All5xCpGs-NoOverlaps.txt

In [32]:
!head 2020-02-11-All5xCpGs-NoOverlaps.txt

NC_007175.2	48	49
NC_007175.2	49	50
NC_007175.2	50	51
NC_007175.2	51	52
NC_007175.2	87	88
NC_007175.2	88	89
NC_007175.2	146	147
NC_007175.2	147	148
NC_007175.2	173	174
NC_007175.2	192	193
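
For intuition, `intersectBed -v` reports only the `-a` entries that overlap nothing in any `-b` file, while `-u` reports each `-a` entry once if it overlaps anything. The toy example below mimics that logic in plain Python. It assumes 0-based, half-open intervals and uses made-up coordinates; it is a sketch of the idea, not a reimplementation of bedtools.

```python
def overlaps(a, b):
    """Return True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_by_overlap(loci, *feature_sets):
    """Split loci into those overlapping any feature (like -u) and none (like -v)."""
    # Pool all feature files, as happens when several files are passed to -b
    features = [f for feature_set in feature_sets for f in feature_set]
    with_overlap, without_overlap = [], []
    for locus in loci:
        if any(overlaps(locus, f) for f in features):
            with_overlap.append(locus)
        else:
            without_overlap.append(locus)
    return with_overlap, without_overlap


# Hypothetical coordinates for illustration only
toy_loci = [("chr1", 48, 49), ("chr1", 150, 151), ("chr1", 900, 901)]
toy_exons = [("chr1", 100, 200)]
toy_promoters = [("chr1", 0, 40)]

hits, no_hits = split_by_overlap(toy_loci, toy_exons, toy_promoters)
print("overlap at least one feature:", hits)
print("overlap no feature:", no_hits)
```

Only the locus at 150 falls inside a toy feature, so it is the single "hit"; the other two loci are what a `-v` call would keep.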


#### Methylated loci

In [33]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {methylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

  372047
methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [34]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {methylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-MethLoci-NoOverlaps.txt

In [35]:
!head 2020-02-11-MethLoci-NoOverlaps.txt

NC_035780.1	9637	9638
NC_035780.1	9657	9658
NC_035780.1	10089	10090
NC_035780.1	10331	10332
NC_035780.1	11692	11693
NC_035780.1	11706	11707
NC_035780.1	11711	11712
NC_035780.1	12686	12687
NC_035780.1	12758	12759
NC_035780.1	13486	13487


#### Sparsely methylated loci

In [36]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {sparselyMethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "sparsely methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

   84582
sparsely methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [37]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {sparselyMethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-SparseMethLoci-NoOverlaps.txt

In [38]:
!head 2020-02-11-SparseMethLoci-NoOverlaps.txt

NC_007175.2	1506	1507
NC_007175.2	4841	4842
NC_007175.2	13069	13070
NC_035780.1	421	422
NC_035780.1	1101	1102
NC_035780.1	1540	1541
NC_035780.1	3468	3469
NC_035780.1	9789	9790
NC_035780.1	9832	9833
NC_035780.1	9854	9855


#### Unmethylated loci

In [39]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {unmethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "unmethylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

  146968
unmethylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [40]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {unmethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-UnMethLoci-NoOverlaps.txt

In [41]:
!head 2020-02-11-UnMethLoci-NoOverlaps.txt

NC_007175.2	48	49
NC_007175.2	49	50
NC_007175.2	50	51
NC_007175.2	51	52
NC_007175.2	87	88
NC_007175.2	88	89
NC_007175.2	146	147
NC_007175.2	147	148
NC_007175.2	173	174
NC_007175.2	192	193
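
A quick consistency check on the 4h counts: if every 10x CpG locus is assigned to exactly one methylation category, the three per-category counts should add up to the count for all 10x CpGs. The sketch below uses the numbers reported above; the variable names are illustrative.

```python
# Counts copied from the `wc -l` outputs in section 4h
no_overlap_all = 603597
no_overlap_by_category = {
    "methylated": 372047,
    "sparsely methylated": 84582,
    "unmethylated": 146968,
}

# The categories should partition the full set of 10x CpG loci
category_sum = sum(no_overlap_by_category.values())
counts_agree = category_sum == no_overlap_all

print(f"Sum of categories: {category_sum}")
print(f"Matches all 10x CpGs: {counts_agree}")
```

A mismatch here would suggest that a locus was dropped or double-counted when the category files were created.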


## 5. Identify methylation islands

To identify methylation islands using the method from Jeong et al. (2018), I need to define:

- starting size of the methylation window: 500 bp
- minimum fraction of methylated CpGs required within the window to be accepted: 0.02
- step size to extend the accepted window as long as the mCpG fraction is met: 50 bp
- mCpG file: tab-delimited input with the chromosome and bp position of each mCpG
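
To make the roles of these parameters concrete, here is a simplified Python sketch of a sliding-window island search. It is my reading of the parameters, not the Perl script from Jeong et al. (2018): it assumes the fraction is the number of mCpGs divided by the window length in bp (so a 500 bp window at 0.02 needs at least 500 × 0.02 = 10 mCpGs), that windows start at an mCpG, and that accepted islands are trimmed to their first and last mCpG. The function name and toy positions are hypothetical.

```python
def find_methylation_islands(positions, window=500, min_fraction=0.02, step=50):
    """Return (start, end, n_mCpG) tuples for one chromosome.

    positions: sorted bp positions of methylated CpGs on a single chromosome.
    """
    islands = []
    n = len(positions)
    i = 0
    while i < n:
        start = positions[i]
        size = window
        # Index just past the last mCpG inside the starting window
        j = i
        while j < n and positions[j] < start + size:
            j += 1
        if (j - i) / size < min_fraction:
            i += 1  # window too sparse; try the next mCpG as a start
            continue
        # Extend by `step` bp while the mCpG fraction stays at or above the minimum
        while True:
            k = j
            while k < n and positions[k] < start + size + step:
                k += 1
            if (k - i) / (size + step) < min_fraction:
                break
            size, j = size + step, k
        # Assumption: trim the island to its first and last mCpG
        islands.append((start, positions[j - 1], j - i))
        i = j  # continue the search after the accepted island
    return islands


# Hypothetical positions: 13 mCpGs spaced 40 bp apart, then one isolated mCpG
toy_positions = list(range(1000, 1500, 40)) + [5000]
toy_islands = find_methylation_islands(toy_positions)
print(toy_islands)
```

In the toy data the 13 clustered mCpGs form one island and the isolated mCpG at 5000 is skipped because its window is too sparse.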

### 5a. Create mCpG input file

In [None]:
#Keep only the chromosome and start position columns; the end position is not needed for methylation island analysis
!awk '{print $1"\t"$2}' 2020-02-11-All-10x-CpG-Loci-Methylated.bed \
> 2020-02-11-All-10x-CpG-Loci-Methylated-Reduced.bed

In [5]:
#Confirm file only has chromosome and start bp for mCpG
!head 2020-02-11-All-10x-CpG-Loci-Methylated-Reduced.bed

NC_035780.1	9253
NC_035780.1	9637
NC_035780.1	9657
NC_035780.1	10089
NC_035780.1	10331
NC_035780.1	11692
NC_035780.1	11706
NC_035780.1	11711
NC_035780.1	12686
NC_035780.1	12758


### 5b. Create methylation islands

In [22]:
#Identify methylation islands using 0.02 mCpG fraction
! ./methyl_island_sliding_window.pl 500 0.02 50 2020-02-11-All-10x-CpG-Loci-Methylated-Reduced.bed \
> 2020-02-11-Methylation-Islands-500_0.02_50.tab

In [23]:
#Columns: chr, start, end, number of mCpG
#Preview the methylation islands, then count them
!head 2020-02-11-Methylation-Islands-500_0.02_50.tab
!wc -l 2020-02-11-Methylation-Islands-500_0.02_50.tab

NC_035780.1	23585	23723	13
NC_035780.1	36000	36358	11
NC_035780.1	100558	101923	30
NC_035780.1	102593	103702	37
NC_035780.1	115832	116304	11
NC_035780.1	211199	211544	11
NC_035780.1	239676	240134	13
NC_035780.1	245717	248838	63
NC_035780.1	250197	351003	2024
NC_035780.1	352791	353232	10
   63483 2020-02-06-Methylation-Islands-500_0.02_50.tab


In [21]:
#Find the maximum number of mCpG in an island
#Find the minimum number of mCpG in an island
!awk 'NR==1{max = $4 + 0; next} {if ($4 > max) max = $4;} END {print max}' \
2020-02-11-Methylation-Islands-500_0.02_50.tab
!awk 'NR==1{min = $4 + 0; next} {if ($4 < min) min = $4;} END {print min}' \
2020-02-11-Methylation-Islands-500_0.02_50.tab

24777
10


In [25]:
#Keep only methylation islands (MI) that are at least 500 bp long
!awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4}}' 2020-02-11-Methylation-Islands-500_0.02_50.tab \
> 2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab
! wc -l 2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab

   37063 2020-02-06-Methylation-Islands-500_0.02_50-filtered.tab


In [26]:
#Find the maximum number of mCpG in a length-filtered island
#Find the minimum number of mCpG in a length-filtered island
!awk 'NR==1{max = $4 + 0; next} {if ($4 > max) max = $4;} END {print max}' \
2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab
!awk 'NR==1{min = $4 + 0; next} {if ($4 < min) min = $4;} END {print min}' \
2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab

24777
11


### 5c. Create BED files for IGV

In [29]:
#Identify files that need BED files
!find *.tab

2020-02-06-Methylation-Islands-200_0.02_50.tab
2020-02-06-Methylation-Islands-200_0.03_50.tab
2020-02-06-Methylation-Islands-200_0.04_50.tab
2020-02-06-Methylation-Islands-200_0.05_50.tab
2020-02-06-Methylation-Islands-200_0.10_50.tab
2020-02-06-Methylation-Islands-200_0.15_50.tab
2020-02-06-Methylation-Islands-200_0.20_50.tab
2020-02-06-Methylation-Islands-200_0.25_50.tab
2020-02-06-Methylation-Islands-200_0.27_50.tab
2020-02-06-Methylation-Islands-200_0.30_50.tab
2020-02-06-Methylation-Islands-300_0.02_50.tab
2020-02-06-Methylation-Islands-300_0.03_50.tab
2020-02-06-Methylation-Islands-300_0.04_50.tab
2020-02-06-Methylation-Islands-300_0.05_50.tab
2020-02-06-Methylation-Islands-300_0.10_50.tab
2020-02-06-Methylation-Islands-300_0.15_50.tab
2020-02-06-Methylation-Islands-300_0.20_50.tab
2020-02-06-Methylation-Islands-300_0.25_50.tab
2020-02-06-Methylation-Islands-500_0.02_25-filtered.tab
2020-02-06-Methylation-Islands-500_0.02_25.tab
2020-02-06-Methylation-Islands-500_0.02_50-filtered

In [30]:
%%bash
#Keep chr, start, and end to create a three-column BED file for each island file
for f in *.tab
do
    awk '{print $1"\t"$2"\t"$3}' "${f}" > "${f}.bed"
done

In [33]:
#Check the file to ensure loop worked
!head 2020-02-11-Methylation-Islands-200_0.02_50-filtered.tab.bed

NC_035780.1	19901	20081
NC_035780.1	21693	21915
NC_035780.1	23585	23723
NC_035780.1	27826	28082
NC_035780.1	36000	36358
NC_035780.1	37557	37672
NC_035780.1	68011	68137
NC_035780.1	87531	87595
NC_035780.1	99242	99377
NC_035780.1	100558	101923
