# Characterizing CpG Methylation

To describe general methylation trends, irrespective of pCO<sub>2</sub> treatment, in *C. gigas* gonad sequence data, I need to characterize individual CpG loci. Gavery and Roberts (2013) and Olson and Roberts (2014) define a CpG locus as methylated if at least half of the reads remained unconverted after bisulfite treatment. I will create a master `.cov` file to identify methylated CpG loci.

I will also identify methylation islands by replicating [Jeong et al. 2018](https://academic.oup.com/gbe/article/10/10/2766/5098531). I will use [their script](https://github.com/soojinyilab/Methylation-Islands) but modify parameters to reflect differences between insect and *C. gigas* methylation.

1. Create master coverage file
2. Limit to 10x coverage
3. Characterize methylation levels for loci
4. Characterize loci locations
5. Identify methylation islands

## 0. Prepare for analyses

### 0a. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-gigas-oa-meth/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-gigas-oa-meth/analyses


In [3]:
!mkdir 2020-02-11-Characterizing-CpG-Methylation

In [3]:
cd 2020-02-11-Characterizing-CpG-Methylation

/Users/yaamini/Documents/project-gigas-oa-meth/analyses/2020-02-11-Characterizing-CpG-Methylation


## 1. Create master coverage file

Coverage files were previously downloaded in [this Jupyter notebook](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/notebooks/2019-09-13-Generating-Coverage-Tracks.ipynb).

In [5]:
#See what the file looks like. 
#Columns: <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
!head -n 1 ../2019-09-13-IGV-Verification/YRVA_R1_001_bismark_bt2_pe.deduplicated.bismark.cov

C12722	104	104	33.3333333333333	1	2
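As a minimal sketch of the six-column layout, the Python below parses the line shown above and recomputes its percent methylation from the read counts. `CovRecord` and `parse_cov_line` are hypothetical names introduced for illustration; they are not part of this notebook or of Bismark.

```python
from typing import NamedTuple


class CovRecord(NamedTuple):
    """One locus from a coverage file (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    pct_meth: float
    n_meth: int
    n_unmeth: int


def parse_cov_line(line: str) -> CovRecord:
    """Split a tab-delimited coverage line into typed fields."""
    chrom, start, end, pct, n_meth, n_unmeth = line.rstrip("\n").split("\t")
    return CovRecord(chrom, int(start), int(end), float(pct), int(n_meth), int(n_unmeth))


# The line printed by `head -n 1` above.
record = parse_cov_line("C12722\t104\t104\t33.3333333333333\t1\t2")

# Percent methylation is methylated reads over all reads at the locus.
recomputed = 100 * record.n_meth / (record.n_meth + record.n_unmeth)
print(record.chrom, record.start, round(recomputed, 4))
```

The recomputed value agrees with the fourth column, which is why later steps can rebuild percentages from the two count columns alone.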


In [6]:
%%bash

for f in ../2019-09-13-IGV-Verification/*cov
do
  awk '{print $1"-"$2"\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' ${f} \
  | sort -k1,1 \
  > ${f}.sorted
done

In [7]:
!head ../2019-09-13-IGV-Verification/YRVA_R1_001_bismark_bt2_pe.deduplicated.bismark.cov.sorted

C12722-104	C12722	104	104	33.3333333333333	1	2
C12722-105	C12722	105	105	0	0	1
C12722-134	C12722	134	134	33.3333333333333	1	2
C12722-135	C12722	135	135	0	0	1
C12722-154	C12722	154	154	0	0	3
C12722-155	C12722	155	155	0	0	1
C12722-164	C12722	164	164	0	0	1
C12734-137	C12734	137	137	0	0	2
C12734-178	C12734	178	178	0	0	1
C12734-179	C12734	179	179	0	0	2


In [8]:
#Join the first column in the first file with the first column in the second file
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
#Print unpairable lines in file 1 (-a1) and 2 (-a2) to simulate an outer join. Replace empty fields with 0 (-e) and only print the following fields (-o)
#Convert - to \t to uncouple the chromosome and start position
!join -1 1 -2 1 \
-t $'\t' \
-a1 -a2 -e 0 -o '0,1.5,1.6,1.7,2.5,2.6,2.7' \
../2019-09-13-IGV-Verification/YRVA_R1_001_bismark_bt2_pe.deduplicated.bismark.cov.sorted \
../2019-09-13-IGV-Verification/YRVL_R1_001_bismark_bt2_pe.deduplicated.bismark.cov.sorted \
| tr '-' "\t" > cgigas_gonad-10x_raw.cov

In [9]:
#Check join output after conversion
!head cgigas_gonad-10x_raw.cov

C12722	104	33.3333333333333	1	2	0	0	0
C12722	105	0	0	1	0	0	0
C12722	134	33.3333333333333	1	2	0	0	0
C12722	135	0	0	1	0	0	0
C12722	154	0	0	3	0	0	0
C12722	155	0	0	1	0	0	0
C12722	164	0	0	1	0	0	0
C12734	136	0	0	0	0	0	2
C12734	137	0	0	2	0	0	0
C12734	178	0	0	1	100	1	0
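The outer join can also be sketched in Python, assuming each sample is held as a dictionary keyed by scaffold and position. The `outer_join` helper and the dictionary layout are assumptions for illustration (the notebook itself uses `join`); the records are copied from the output above.

```python
def outer_join(sample_a: dict, sample_b: dict) -> dict:
    """Outer-join two {(chrom, pos): (pct, n_meth, n_unmeth)} maps.

    A locus missing from one sample is filled with zeros, mirroring `-e 0`.
    """
    empty = (0.0, 0, 0)
    joined = {}
    for key in sorted(set(sample_a) | set(sample_b)):
        joined[key] = sample_a.get(key, empty) + sample_b.get(key, empty)
    return joined


# Toy subsets of the two samples, taken from the join output above.
yrva = {("C12734", 137): (0.0, 0, 2), ("C12734", 178): (0.0, 0, 1)}
yrvl = {("C12734", 136): (0.0, 0, 2), ("C12734", 178): (100.0, 1, 0)}

joined = outer_join(yrva, yrvl)
for (chrom, pos), fields in joined.items():
    print(chrom, pos, *fields, sep="\t")
```

Position 136 appears only in the second sample and position 137 only in the first, so each keeps a block of zeros for the sample that lacks it.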


In [23]:
#Sum the methylated and unmethylated counts from both samples for each locus
#Add total coverage (count methylated + count unmethylated) at each locus
#Recalculate the proportion of methylated reads at each locus
#Multiply proportions by 100 to get percent methylation
!awk '{ print $1"\t"$2"\t"$2"\t"$4+$7"\t"$5+$8}' cgigas_gonad-10x_raw.cov \
| awk '{ print $1"\t"$2"\t"$3"\t"$4+$5"\t"$4"\t"$5}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$5/$4"\t"$5"\t"$6}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$4*100"\t"$5"\t"$6}' \
> cgigas_gonad-10x_concat.cov

In [24]:
#Check final concatenated file
!head cgigas_gonad-10x_concat.cov

C12722	104	104	33.3333	1	2
C12722	105	105	0	0	1
C12722	134	134	33.3333	1	2
C12722	135	135	0	0	1
C12722	154	154	0	0	3
C12722	155	155	0	0	1
C12722	164	164	0	0	1
C12734	136	136	0	0	2
C12734	137	137	0	0	2
C12734	178	178	50	1	1
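A short sketch of the pooling arithmetic, assuming counts from two samples at one locus. `pool_locus` is a hypothetical helper, and the second example uses made-up counts to show that the pooled percentage is weighted by reads rather than being the mean of the per-sample percentages.

```python
def pool_locus(n_meth_a, n_unmeth_a, n_meth_b, n_unmeth_b):
    """Sum counts across two samples and recompute percent methylation."""
    n_meth = n_meth_a + n_meth_b
    n_unmeth = n_unmeth_a + n_unmeth_b
    total = n_meth + n_unmeth
    # Guard against loci with no reads in either sample.
    pct = 100 * n_meth / total if total else 0.0
    return pct, n_meth, n_unmeth


# C12734 position 178 from the output above: (0 meth, 1 unmeth) and (1 meth, 0 unmeth).
print(pool_locus(0, 1, 1, 0))

# Hypothetical counts: 1 of 10 reads methylated in one sample, 1 of 1 in the other.
pooled_pct, _, _ = pool_locus(1, 9, 1, 0)
mean_of_pcts = (10.0 + 100.0) / 2
print(round(pooled_pct, 2), mean_of_pcts)
```

In the hypothetical case the pooled value is 2 methylated reads out of 11, far below the unweighted mean of 55, because the deeper sample contributes more reads.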


In [25]:
#See how many loci have data
!awk '{if ($5+$6 >= 1) { print $1, $2-1, $3, $4, $5+$6}}' cgigas_gonad-10x_concat.cov \
| wc -l

 18793411


## 2. Limit to 10x coverage

In [26]:
#If total coverage (count methylated + unmethylated) is at least 10
#then print the chromosome, start pos -1, stop pos, percent methylation, and total coverage
#Save output as new file
!awk '{if ($5+$6 >= 10) { print $1, $2-1, $3, $4, $5+$6}}' cgigas_gonad-10x_concat.cov \
> 2020-02-11-All-10x-CpGs.bedgraph

In [27]:
#Check columns in the file: <chromosome> <start position> <stop position> <percent methylation> <coverage>
!head 2020-02-11-All-10x-CpGs.bedgraph

C12838 107 108 36.3636 11
C12838 155 156 20 10
C12838 60 61 27.2727 11
C12838 64 65 54.5455 11
C12838 82 83 45.4545 11
C12924 102 103 4.7619 21
C12924 127 128 5 20
C12924 136 137 14.2857 21
C12924 185 186 0 14
C12924 19 20 40 15
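The filter and the coordinate shift can be sketched as follows. Coverage positions have identical start and end values, while the bedgraph output uses a start of `position - 1`, matching the zero-based, half-open convention of BED-style files. `to_bedgraph` is a hypothetical helper, and the read counts in the first call are an assumption chosen to be consistent with the 36.3636% and coverage of 11 shown above.

```python
MIN_COVERAGE = 10  # threshold applied in this section


def to_bedgraph(chrom, pos, pct, n_meth, n_unmeth, min_coverage=MIN_COVERAGE):
    """Return (chrom, start, end, pct, coverage), or None if coverage is too low."""
    coverage = n_meth + n_unmeth
    if coverage < min_coverage:
        return None
    # Shift the start by one so that [start, end) spans the single base.
    return (chrom, pos - 1, pos, pct, coverage)


# Assumed counts (4 methylated, 7 unmethylated) that reproduce the first output row.
print(to_bedgraph("C12838", 108, 36.3636, 4, 7))

# C12722 position 104 has only 3 reads, so it is dropped.
print(to_bedgraph("C12722", 104, 33.3333, 1, 2))
```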


In [28]:
#Count loci with 10x coverage
!wc -l 2020-02-11-All-10x-CpGs.bedgraph

 12728174 2020-02-11-All-10x-CpGs.bedgraph


In [29]:
#Replace delimiters to save .bedgraph as .csv
!awk '{print $1","$2","$3","$4 }' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpGs.csv

In [30]:
#Confirm .csv creation
!head 2020-02-11-All-10x-CpGs.csv

C12838,107,108,36.3636
C12838,155,156,20
C12838,60,61,27.2727
C12838,64,65,54.5455
C12838,82,83,45.4545
C12924,102,103,4.7619
C12924,127,128,5
C12924,136,137,14.2857
C12924,185,186,0
C12924,19,20,40


## 3. Characterize methylation levels for loci

Olson and Roberts (2014) define the following categories for CpG methylation:

- Methylated (50% methylation and above)
- Sparsely methylated (0-50% methylated)
- Unmethylated (0% methylation)

I will modify this slightly:

- Methylated (50% methylation and above)
- Sparsely methylated (more than 10% and less than 50% methylation)
- Unmethylated (10% methylation and below)
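The modified categories can be written as a small function. This is a sketch, assuming the boundaries used in the `awk` filters of this section (50 and above, strictly between 10 and 50, 10 and below); `classify_locus` is a hypothetical name.

```python
def classify_locus(pct_meth: float) -> str:
    """Assign a methylation category from percent methylation (0 to 100)."""
    if pct_meth >= 50:
        return "methylated"
    if pct_meth > 10:
        return "sparsely methylated"
    return "unmethylated"


# Values taken from the 10x bedgraph output earlier in the notebook.
for pct in (54.5455, 36.3636, 4.7619, 0):
    print(pct, classify_locus(pct))
```

A locus at exactly 10% falls in the unmethylated category, and a locus at exactly 50% is methylated, so every locus belongs to exactly one category.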

### 3a. Methylated loci

In [31]:
#If percent methylation is greater than or equal to 50, then save the loci information
!awk '{if ($4 >= 50) { print $1, $2, $3, $4 }}' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph

In [32]:
#Confirm methylated loci were saved
!head 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph

C12838 64 65 54.5455
C13892 73 74 75
C14282 69 70 69.2308
C14386 66 67 63.6364
C14454 42 43 60
C14614 106 107 53.8462
C14796 68 69 80
C14796 69 70 70.8333
C14868 141 142 52
C14940 160 161 53.8462


In [33]:
#Count methylated loci
!wc -l 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph

 1677041 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph


In [34]:
#Replace delimiters to save .bedgraph as .csv
!awk '{print $1","$2","$3","$4 }' 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Methylated.csv

In [35]:
#Check .csv was saved
!head 2020-02-11-All-10x-CpG-Loci-Methylated.csv

C12838,64,65,54.5455
C13892,73,74,75
C14282,69,70,69.2308
C14386,66,67,63.6364
C14454,42,43,60
C14614,106,107,53.8462
C14796,68,69,80
C14796,69,70,70.8333
C14868,141,142,52
C14940,160,161,53.8462


### 3b. Sparsely methylated loci

In [36]:
%%bash
#If percent methylation is less than 50 and greater than 10, then save the loci information
awk '{if ($4 < 50) { print $1, $2, $3, $4}}' 2020-02-11-All-10x-CpGs.bedgraph \
| awk '{if ($4 > 10) { print $1, $2, $3, $4 }}' \
> 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph

In [37]:
#Confirm sparsely methylated loci were saved
!head 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph

C12838 107 108 36.3636
C12838 155 156 20
C12838 60 61 27.2727
C12838 82 83 45.4545
C12924 136 137 14.2857
C12924 19 20 40
C12924 30 31 15.7895
C12924 38 39 14.2857
C12942 103 104 18.1818
C12942 114 115 18.1818


In [38]:
#Count sparsely methylated loci
!wc -l 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph

 2267700 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph


### 3c. Unmethylated loci

In [39]:
#If percent methylation is less than or equal to 10, then save the loci information
!awk '{if ($4 <= 10) { print $1, $2, $3, $4 }}' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph

In [40]:
#Confirm unmethylated loci were saved
!head 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph

C12924 102 103 4.7619
C12924 127 128 5
C12924 185 186 0
C12924 52 53 0
C12924 60 61 4.7619
C12924 94 95 5
C12942 71 72 9.09091
C12950 124 125 0
C12950 125 126 0
C12950 17 18 0


In [41]:
#Count unmethylated loci
!wc -l 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph

 8783433 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph
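As a quick consistency check, the three category counts should add up to the number of 10x loci counted in step 2. The sketch below uses only the `wc -l` values reported in this notebook; the variable names are mine.

```python
# Line counts reported by `wc -l` in sections 2 and 3.
total_10x = 12728174
category_counts = {
    "methylated": 1677041,
    "sparsely methylated": 2267700,
    "unmethylated": 8783433,
}

# Every 10x locus should fall in exactly one category.
categories_sum = sum(category_counts.values())
assert categories_sum == total_10x

for category, n_loci in category_counts.items():
    print(f"{category}: {n_loci} loci ({100 * n_loci / total_10x:.1f}% of 10x loci)")
```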


## 4. Characterize loci locations

My final step is to characterize the location of various loci categories in the genome. I will use `intersectBed` to find overlaps between all 10x CpGs, methylated loci, sparsely methylated loci, and unmethylated loci with exons, introns, mRNA coding regions, transposable elements, and putative promoter regions.

### 4a. Create `.bed` files

#### All 10x CpGs

In [42]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpGs.bedgraph \
> 2020-02-11-All-10x-CpGs.bed

In [43]:
#Confirm file creation
!head 2020-02-11-All-10x-CpGs.bed

C12838	107	108
C12838	155	156
C12838	60	61
C12838	64	65
C12838	82	83
C12924	102	103
C12924	127	128
C12924	136	137
C12924	185	186
C12924	19	20


#### Methylated loci

In [44]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpG-Loci-Methylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Methylated.bed

In [45]:
#Confirm file creation
!head 2020-02-11-All-10x-CpG-Loci-Methylated.bed

C12838	64	65
C13892	73	74
C14282	69	70
C14386	66	67
C14454	42	43
C14614	106	107
C14796	68	69
C14796	69	70
C14868	141	142
C14940	160	161


#### Sparsely methylated loci

In [46]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed

In [47]:
#Confirm file creation
!head 2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed

C12838	107	108
C12838	155	156
C12838	60	61
C12838	82	83
C12924	136	137
C12924	19	20
C12924	30	31
C12924	38	39
C12942	103	104
C12942	114	115


#### Unmethylated loci

In [48]:
%%bash
awk '{print $1"\t"$2"\t"$3}' 2020-02-11-All-10x-CpG-Loci-Unmethylated.bedgraph \
> 2020-02-11-All-10x-CpG-Loci-Unmethylated.bed

In [49]:
#Confirm file creation
!head 2020-02-11-All-10x-CpG-Loci-Unmethylated.bed

C12924	102	103
C12924	127	128
C12924	185	186
C12924	52	53
C12924	60	61
C12924	94	95
C12942	71	72
C12950	124	125
C12950	125	126
C12950	17	18


### 4b. Set variable paths

In [50]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [51]:
all10xCpGs = "2020-02-11-All-10x-CpGs.bed"

In [52]:
methylatedLoci = "2020-02-11-All-10x-CpG-Loci-Methylated.bed"

In [53]:
sparselyMethylatedLoci = "2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed"

In [54]:
unmethylatedLoci = "2020-02-11-All-10x-CpG-Loci-Unmethylated.bed"

In [55]:
exonList = "../2019-09-15-DML-Analysis/Cgigas_v9_exon.gff"

In [56]:
intronList = "../2019-09-15-DML-Analysis/Cgigas_v9_intron.gff"

In [57]:
geneList = "../2019-09-15-DML-Analysis/Cgigas_v9_gene.gff"

In [58]:
transposableElementsAll = "../2019-09-15-DML-Analysis/Cgigas_v9_TE.gff"

In [59]:
putativePromoters = "../2019-09-15-DML-Analysis/Cgigas_v9_1k5p_gene_promoter.gff"
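The sections that follow run the same `intersectBed -u` call for every combination of locus set and genome feature. A hedged sketch of generating those argument lists in one place is below; the paths are copied from the cells above, while the dictionary names and `build_count_commands` are hypothetical. The sketch only builds the commands and does not run bedtools.

```python
from itertools import product

# Paths copied from the variable cells above.
bedtools_directory = "/Users/Shared/bioinformatics/bedtools2/bin/"
locus_sets = {
    "all 10x CpG": "2020-02-11-All-10x-CpGs.bed",
    "methylated": "2020-02-11-All-10x-CpG-Loci-Methylated.bed",
    "sparsely methylated": "2020-02-11-All-10x-CpG-Loci-Sparsely-Methylated.bed",
    "unmethylated": "2020-02-11-All-10x-CpG-Loci-Unmethylated.bed",
}
features = {
    "exons": "../2019-09-15-DML-Analysis/Cgigas_v9_exon.gff",
    "introns": "../2019-09-15-DML-Analysis/Cgigas_v9_intron.gff",
    "genes": "../2019-09-15-DML-Analysis/Cgigas_v9_gene.gff",
    "transposable elements": "../2019-09-15-DML-Analysis/Cgigas_v9_TE.gff",
    "putative promoters": "../2019-09-15-DML-Analysis/Cgigas_v9_1k5p_gene_promoter.gff",
}


def build_count_commands():
    """Return {(locus set, feature): argument list} for every pairwise comparison."""
    commands = {}
    for (locus_name, bed), (feature_name, gff) in product(locus_sets.items(), features.items()):
        commands[(locus_name, feature_name)] = [
            bedtools_directory + "intersectBed", "-u", "-a", bed, "-b", gff,
        ]
    return commands


commands = build_count_commands()
print(len(commands), "comparisons")
print(" ".join(commands[("methylated", "exons")]))
```

Each argument list could be passed to `subprocess.run(..., capture_output=True, text=True)` and the lines of `stdout` counted in place of `| wc -l`.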

### 4c. Exons

#### All 10x CpGs

In [60]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {exonList} \
| wc -l
!echo "all 10x CpG loci overlaps with exons"

 1803226
all 10x CpG loci overlaps with exons


In [61]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {exonList} \
> 2020-02-11-All10xCpGs-Exon.txt

In [62]:
!head 2020-02-11-All10xCpGs-Exon.txt

C17212	165	166	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	169	170	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	234	235	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	249	250	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	291	292	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	321	322	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17316	100	101	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17316	101	102	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17316	207	208	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17316	208	209	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
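The two `intersectBed` calls answer different questions: `-u` reports each locus once if it overlaps any feature, while `-wb` writes one line per locus-feature pair. The sketch below illustrates the difference with toy data, assuming zero-based half-open BED intervals and one-based inclusive GFF features. The `overlaps` helper, the locus at position 400, and the second C17316 annotation are hypothetical.

```python
def overlaps(bed_start, bed_end, gff_start, gff_end):
    """True if a 0-based half-open BED interval overlaps a 1-based inclusive GFF feature."""
    # A GFF feature start..end covers the 0-based half-open interval [start - 1, end).
    return bed_start < gff_end and bed_end > gff_start - 1


loci = [
    ("C17212", 165, 166),   # from the output above
    ("C17212", 400, 401),   # hypothetical locus outside the annotated feature
    ("C17316", 100, 101),   # from the output above
]
exons = [
    ("C17212", 31, 363, "CGI_10000002"),
    ("C17316", 30, 257, "CGI_10000003"),
    ("C17316", 90, 120, "hypothetical_second_annotation"),
]

# One entry per overlapping locus-feature pair, like the lines written with -wb.
pairs = [
    (locus, exon)
    for locus in loci
    for exon in exons
    if locus[0] == exon[0] and overlaps(locus[1], locus[2], exon[1], exon[2])
]

# Each overlapping locus counted once, like the lines counted with -u.
unique_loci = {locus for locus, _ in pairs}

print(len(pairs), "pairs;", len(unique_loci), "unique loci")
```

When features overlap one another, the `-wb` file has more lines than the `-u` count, so the `-u` count is the one to use for the number of loci in a feature type.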


#### Methylated loci

In [63]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {exonList} \
| wc -l
!echo "methylated loci overlaps with exons"

  559020
methylated loci overlaps with exons


In [64]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {exonList} \
> 2020-02-11-MethLoci-Exon.txt

In [65]:
!head 2020-02-11-MethLoci-Exon.txt

C19100	225	226	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	264	265	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	279	280	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	302	303	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	364	365	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	372	373	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	396	397	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	476	477	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	516	517	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;
C19100	561	562	C19100	GLEAN	CDS	160	681	.	-	0	Parent=CGI_10000013;


#### Sparsely methylated loci

In [66]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {exonList} \
| wc -l
!echo "sparsely methylated loci overlaps with exons"

  199748
sparsely methylated loci overlaps with exons


In [67]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {exonList} \
> 2020-02-11-SparseMethLoci-Exon.txt

In [68]:
!head 2020-02-11-SparseMethLoci-Exon.txt

C17316	89	90	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17998	290	291	C17998	GLEAN	CDS	196	387	.	-	0	Parent=CGI_10000005;
C17998	382	383	C17998	GLEAN	CDS	196	387	.	-	0	Parent=CGI_10000005;
C18346	277	278	C18346	GLEAN	CDS	174	551	.	+	0	Parent=CGI_10000009;
C18346	296	297	C18346	GLEAN	CDS	174	551	.	+	0	Parent=CGI_10000009;
C18346	300	301	C18346	GLEAN	CDS	174	551	.	+	0	Parent=CGI_10000009;
C18346	321	322	C18346	GLEAN	CDS	174	551	.	+	0	Parent=CGI_10000009;
C18346	338	339	C18346	GLEAN	CDS	174	551	.	+	0	Parent=CGI_10000009;
C18964	537	538	C18964	GLEAN	CDS	203	658	.	-	0	Parent=CGI_10000011;
C19356	440	441	C19356	GLEAN	CDS	355	597	.	+	0	Parent=CGI_10000014;


#### Unmethylated loci

In [69]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {exonList} \
| wc -l
!echo "unmethylated loci overlaps with exons"

 1044458
unmethylated loci overlaps with exons


In [70]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {exonList} \
> 2020-02-11-UnMethLoci-Exon.txt

In [71]:
!head 2020-02-11-UnMethLoci-Exon.txt

C17212	165	166	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	169	170	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	234	235	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	249	250	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	291	292	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17212	321	322	C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17316	100	101	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17316	101	102	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17316	207	208	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17316	208	209	C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;


### 4d. Introns

#### All 10x CpGs

In [72]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {intronList} \
| wc -l
!echo "all 10x CpG loci overlaps with introns"

 3834047
all 10x CpG loci overlaps with introns


In [73]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {intronList} \
> 2020-02-11-All10xCpGs-Intron.txt

In [74]:
!head 2020-02-11-All10xCpGs-Intron.txt

C19392	243	244	C19392	subtractBed	intrn	184	451	.	+	.	Parent=CGI_10000015;
C19392	244	245	C19392	subtractBed	intrn	184	451	.	+	.	Parent=CGI_10000015;
C20334	785	786	C20334	subtractBed	intrn	524	867	.	-	.	Parent=CGI_10000028;
C20334	814	815	C20334	subtractBed	intrn	524	867	.	-	.	Parent=CGI_10000028;
C20412	243	244	C20412	subtractBed	intrn	215	409	.	-	.	Parent=CGI_10000029;
C20412	285	286	C20412	subtractBed	intrn	215	409	.	-	.	Parent=CGI_10000029;
C20412	498	499	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	499	500	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	568	569	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	569	570	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;


#### Methylated loci

In [75]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {intronList} \
| wc -l
!echo "methylated loci overlaps with introns"

  706876
methylated loci overlaps with introns


In [76]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {intronList} \
> 2020-02-11-MethLoci-Intron.txt

In [77]:
!head 2020-02-11-MethLoci-Intron.txt

C19392	243	244	C19392	subtractBed	intrn	184	451	.	+	.	Parent=CGI_10000015;
C19392	244	245	C19392	subtractBed	intrn	184	451	.	+	.	Parent=CGI_10000015;
C20334	785	786	C20334	subtractBed	intrn	524	867	.	-	.	Parent=CGI_10000028;
C20334	814	815	C20334	subtractBed	intrn	524	867	.	-	.	Parent=CGI_10000028;
C20462	673	674	C20462	subtractBed	intrn	577	822	.	+	.	Parent=CGI_10000030;
C20462	674	675	C20462	subtractBed	intrn	577	822	.	+	.	Parent=CGI_10000030;
C20462	731	732	C20462	subtractBed	intrn	577	822	.	+	.	Parent=CGI_10000030;
C20462	732	733	C20462	subtractBed	intrn	577	822	.	+	.	Parent=CGI_10000030;
C20476	647	648	C20476	subtractBed	intrn	228	792	.	+	.	Parent=CGI_10000031;
C20476	706	707	C20476	subtractBed	intrn	228	792	.	+	.	Parent=CGI_10000031;


#### Sparsely methylated loci

In [78]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {intronList} \
| wc -l
!echo "sparsely methylated loci overlaps with introns"

  768464
sparsely methylated loci overlaps with introns


In [79]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {intronList} \
> 2020-02-11-SparseMethLoci-Intron.txt

In [80]:
!head 2020-02-11-SparseMethLoci-Intron.txt

C20916	729	730	C20916	subtractBed	intrn	618	753	.	-	.	Parent=CGI_10000039;
C21064	332	333	C21064	subtractBed	intrn	115	386	.	+	.	Parent=CGI_10000043;
C21178	978	979	C21178	subtractBed	intrn	549	1034	.	-	.	Parent=CGI_10000045;
C21260	710	711	C21260	subtractBed	intrn	691	934	.	-	.	Parent=CGI_10000048;
C21260	718	719	C21260	subtractBed	intrn	691	934	.	-	.	Parent=CGI_10000048;
C21550	877	878	C21550	subtractBed	intrn	650	952	.	-	.	Parent=CGI_10000051;
C21550	878	879	C21550	subtractBed	intrn	650	952	.	-	.	Parent=CGI_10000051;
C21582	761	762	C21582	subtractBed	intrn	646	1158	.	-	.	Parent=CGI_10000054;
C21650	1014	1015	C21650	subtractBed	intrn	877	1015	.	-	.	Parent=CGI_10000058;
C21650	1125	1126	C21650	subtractBed	intrn	1076	1395	.	-	.	Parent=CGI_10000058;


#### Unmethylated loci

In [81]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {intronList} \
| wc -l
!echo "unmethylated loci overlaps with introns"

 2358707
unmethylated loci overlaps with introns


In [82]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {intronList} \
> 2020-02-11-UnMethLoci-Intron.txt

In [83]:
!head 2020-02-11-UnMethLoci-Intron.txt

C20412	243	244	C20412	subtractBed	intrn	215	409	.	-	.	Parent=CGI_10000029;
C20412	285	286	C20412	subtractBed	intrn	215	409	.	-	.	Parent=CGI_10000029;
C20412	498	499	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	499	500	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	568	569	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	569	570	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	618	619	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	619	620	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20412	704	705	C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20524	380	381	C20524	subtractBed	intrn	188	943	.	-	.	Parent=CGI_10000033;


### 4e. Genes

#### All 10x CpGs

In [84]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {geneList} \
| wc -l
!echo "all 10x CpG loci overlaps with genes"

 5637273
all 10x CpG loci overlaps with genes


In [85]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {geneList} \
> 2020-02-11-All10xCpGs-Genes.txt

In [86]:
!head 2020-02-11-All10xCpGs-Genes.txt

C17212	165	166	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	169	170	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	234	235	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	249	250	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	291	292	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	321	322	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17316	100	101	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17316	101	102	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17316	207	208	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17316	208	209	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;


In [100]:
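#Column 1 of the -wb output is the scaffold ID, so this counts unique scaffolds (gene IDs are in column 12)
!cut -f1 2020-02-11-All10xCpGs-Genes.txt | sort | uniq -c | wc -l
!echo "unique scaffolds represented in overlaps"

    3184
unique scaffolds represented in overlaps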
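In the `-wb` output, column 1 holds the scaffold name and column 12 holds the gene identifier (`ID=CGI_...;`), so the column passed to `cut` decides what is being counted. The sketch below shows both counts on a few lines; the first three lines are copied from the output above, the fourth is a hypothetical second gene on the same scaffold, and `unique_values` is a hypothetical helper.

```python
def unique_values(lines, column):
    """Collect distinct values from one 1-based, tab-delimited column."""
    return {line.rstrip("\n").split("\t")[column - 1] for line in lines}


lines = [
    "C17212\t165\t166\tC17212\tGLEAN\tmRNA\t31\t363\t0.999572\t+\t.\tID=CGI_10000002;",
    "C17212\t169\t170\tC17212\tGLEAN\tmRNA\t31\t363\t0.999572\t+\t.\tID=CGI_10000002;",
    "C17316\t100\t101\tC17316\tGLEAN\tmRNA\t30\t257\t0.555898\t+\t.\tID=CGI_10000003;",
    # Hypothetical line: a second gene on scaffold C17212.
    "C17212\t500\t501\tC17212\tGLEAN\tmRNA\t450\t600\t1\t+\t.\tID=hypothetical_gene;",
]

scaffolds = unique_values(lines, 1)
genes = unique_values(lines, 12)
print(len(scaffolds), "scaffolds;", len(genes), "genes")
```

The shell equivalent for counting genes would select the twelfth field, for example `cut -f12 ... | sort | uniq | wc -l`; the counts reported in this notebook were produced with `-f1`.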


#### Methylated loci

In [88]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {geneList} \
| wc -l
!echo "methylated loci overlaps with genes"

 1265896
methylated loci overlaps with genes


In [89]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {geneList} \
> 2020-02-11-MethLoci-Genes.txt

In [90]:
!head 2020-02-11-MethLoci-Genes.txt

C19100	225	226	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	264	265	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	279	280	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	302	303	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	364	365	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	372	373	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	396	397	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	476	477	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	516	517	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;
C19100	561	562	C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;


In [101]:
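#Column 1 of the -wb output is the scaffold ID, so this counts unique scaffolds (gene IDs are in column 12)
!cut -f1 2020-02-11-MethLoci-Genes.txt | sort | uniq -c | wc -l
!echo "unique scaffolds represented in overlaps"

    2233
unique scaffolds represented in overlaps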


#### Sparsely methylated loci

In [92]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {geneList} \
| wc -l
!echo "sparsely methylated loci overlaps with genes"

  968212
sparsely methylated loci overlaps with genes


In [93]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {geneList} \
> 2020-02-11-SparseMethLoci-Genes.txt

In [94]:
!head 2020-02-11-SparseMethLoci-Genes.txt

C17316	89	90	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17998	290	291	C17998	GLEAN	mRNA	196	387	1	-	.	ID=CGI_10000005;
C17998	382	383	C17998	GLEAN	mRNA	196	387	1	-	.	ID=CGI_10000005;
C18346	277	278	C18346	GLEAN	mRNA	174	551	1	+	.	ID=CGI_10000009;
C18346	296	297	C18346	GLEAN	mRNA	174	551	1	+	.	ID=CGI_10000009;
C18346	300	301	C18346	GLEAN	mRNA	174	551	1	+	.	ID=CGI_10000009;
C18346	321	322	C18346	GLEAN	mRNA	174	551	1	+	.	ID=CGI_10000009;
C18346	338	339	C18346	GLEAN	mRNA	174	551	1	+	.	ID=CGI_10000009;
C18964	537	538	C18964	GLEAN	mRNA	203	658	0.999572	-	.	ID=CGI_10000011;
C19356	440	441	C19356	GLEAN	mRNA	355	597	1	+	.	ID=CGI_10000014;


In [102]:
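#Column 1 of the -wb output is the scaffold ID, so this counts unique scaffolds (gene IDs are in column 12)
!cut -f1 2020-02-11-SparseMethLoci-Genes.txt | sort | uniq -c | wc -l
!echo "unique scaffolds represented in overlaps"

    3062
unique scaffolds represented in overlaps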


#### Unmethylated loci

In [96]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {geneList} \
| wc -l
!echo "unmethylated loci overlaps with genes"

 3403165
unmethylated loci overlaps with genes


In [97]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {geneList} \
> 2020-02-11-UnMethLoci-Genes.txt

In [98]:
!head 2020-02-11-UnMethLoci-Genes.txt

C17212	165	166	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	169	170	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	234	235	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	249	250	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	291	292	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17212	321	322	C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17316	100	101	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17316	101	102	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17316	207	208	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17316	208	209	C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;


In [103]:
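#Column 1 of the -wb output is the scaffold ID, so this counts unique scaffolds (gene IDs are in column 12)
!cut -f1 2020-02-11-UnMethLoci-Genes.txt | sort | uniq -c | wc -l
!echo "unique scaffolds represented in overlaps"

    3072
unique scaffolds represented in overlaps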


### 4f. Transposable elements

#### All 10x CpGs

In [107]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {transposableElementsAll} \
| wc -l
!echo "all 10x CpG loci overlaps with transposable elements"

  612466
all 10x CpG loci overlaps with transposable elements


In [105]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {transposableElementsAll} \
> 2020-02-11-All10xCpGs-TE-All.txt

In [106]:
!head 2020-02-11-All10xCpGs-TE-All.txt

C14746	119	120	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14746	120	121	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14746	156	157	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14746	157	158	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14906	113	114	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	113	114	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
C14906	114	115	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	114	115	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
C14906	154	155	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	154	155	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
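The same locus can appear on several lines here because overlapping transposable element annotations each produce their own `-wb` line (for example, C14906 position 113-114 matches two `LINE_Penelope` records). A hedged sketch of counting loci per element family without double counting is below; `count_loci_per_family` is a hypothetical helper, and the lines are copied from the output above.

```python
from collections import Counter


def count_loci_per_family(lines):
    """Count distinct loci per TE family, using column 6 as the family label."""
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # (scaffold, start, end, family) identifies one locus within one family.
        seen.add((fields[0], fields[1], fields[2], fields[5]))
    return Counter(key[3] for key in seen)


lines = [
    "C14746\t119\t120\tC14746\tWUBlastX\tLINE_RTE-BovB\t1\t267\t182\t+\t.\t.",
    "C14906\t113\t114\tC14906\tWUBlastX\tLINE_Penelope\t4\t258\t29\t+\t.\t.",
    "C14906\t113\t114\tC14906\tWUBlastX\tLINE_Penelope\t46\t258\t56\t+\t.\t.",
]

family_counts = count_loci_per_family(lines)
for family, n_loci in sorted(family_counts.items()):
    print(family, n_loci)
```

Three input lines collapse to one locus per family, which is the kind of difference to expect between `-wb` line counts and `-u` counts for transposable elements.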


#### Methylated loci

In [108]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {transposableElementsAll} \
| wc -l
!echo "methylated loci overlaps with transposable elements"

   41024
methylated loci overlaps with transposable elements


In [109]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {transposableElementsAll} \
> 2020-02-11-MethLoci-TE-All.txt

In [110]:
!head 2020-02-11-MethLoci-TE-All.txt

C17730	113	114	C17730	WUBlastX	LTR_DIRS	3	515	275	-	.	.
C17730	114	115	C17730	WUBlastX	LTR_DIRS	3	515	275	-	.	.
C21560	91	92	C21560	WUBlastX	LINE_RTE-X	3	206	78	-	.	.
C22430	243	244	C22430	WUBlastX	DNA_TcMar-Tc2	230	493	60	+	.	.
C22430	246	247	C22430	WUBlastX	DNA_TcMar-Tc2	230	493	60	+	.	.
C22430	308	309	C22430	WUBlastX	DNA_TcMar-Tc2	230	493	60	+	.	.
C22430	317	318	C22430	WUBlastX	DNA_TcMar-Tc2	230	493	60	+	.	.
C22430	361	362	C22430	WUBlastX	DNA_TcMar-Tc2	230	493	60	+	.	.
C23428	1011	1012	C23428	WUBlastX	LTR_Ngaro	899	1960	318	-	.	.
C23428	1011	1012	C23428	WUBlastX	LTR_Ngaro	947	1723	319	-	.	.


#### Sparsely methylated loci

In [111]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {transposableElementsAll} \
| wc -l
!echo "sparsely methylated loci overlaps with transposable elements"

  161419
sparsely methylated loci overlaps with transposable elements


In [112]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {transposableElementsAll} \
> 2020-02-11-SparseMethLoci-TE-All.txt

In [113]:
!head 2020-02-11-SparseMethLoci-TE-All.txt

C14746	119	120	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14746	120	121	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14746	156	157	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14746	157	158	C14746	WUBlastX	LINE_RTE-BovB	1	267	182	+	.	.
C14944	213	214	C14944	WUBlastX	LINE_CR1-L2	18	275	160	-	.	.
C17636	381	382	C17636	WUBlastX	LTR_Gypsy	226	507	116	+	.	.
C17730	244	245	C17730	WUBlastX	LTR_DIRS	3	515	275	-	.	.
C17730	245	246	C17730	WUBlastX	LTR_DIRS	3	515	275	-	.	.
C18664	259	260	C18664	WUBlastX	DNA_Helitron	2	559	211	+	.	.
C18664	260	261	C18664	WUBlastX	DNA_Helitron	2	559	211	+	.	.


#### Unmethylated loci

In [114]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {transposableElementsAll} \
| wc -l
!echo "unmethylated loci overlaps with transposable elements"

  410023
unmethylated loci overlaps with transposable elements


In [115]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {transposableElementsAll} \
> 2020-02-11-UnMethLoci-TE-All.txt

In [116]:
!head 2020-02-11-UnMethLoci-TE-All.txt

C14906	113	114	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	113	114	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
C14906	114	115	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	114	115	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
C14906	154	155	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	154	155	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
C14906	155	156	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	155	156	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
C14906	175	176	C14906	WUBlastX	LINE_Penelope	4	258	29	+	.	.
C14906	175	176	C14906	WUBlastX	LINE_Penelope	46	258	56	+	.	.
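The `-u` counts reported so far can be cross-checked against each other. The sketch below uses only numbers printed in sections 4c through 4f and assumes nothing beyond them; the dictionary layout is mine. It confirms that the three methylation categories sum to the all-10x count for each feature, and that exon and intron counts sum to the gene counts.

```python
# `intersectBed -u | wc -l` values reported in sections 4c to 4f.
overlap_counts = {
    "exons": {"all": 1803226, "methylated": 559020, "sparse": 199748, "unmethylated": 1044458},
    "introns": {"all": 3834047, "methylated": 706876, "sparse": 768464, "unmethylated": 2358707},
    "genes": {"all": 5637273, "methylated": 1265896, "sparse": 968212, "unmethylated": 3403165},
    "transposable elements": {"all": 612466, "methylated": 41024, "sparse": 161419, "unmethylated": 410023},
}

for feature, counts in overlap_counts.items():
    # The three categories partition the 10x loci, so they should sum to "all".
    category_sum = counts["methylated"] + counts["sparse"] + counts["unmethylated"]
    assert category_sum == counts["all"], feature
    pct_methylated = 100 * counts["methylated"] / counts["all"]
    print(f"{feature}: {pct_methylated:.1f}% of overlapping 10x loci are methylated")

# Exon and intron overlaps together account for every gene overlap.
for key in ("all", "methylated", "sparse", "unmethylated"):
    exon_plus_intron = overlap_counts["exons"][key] + overlap_counts["introns"][key]
    assert exon_plus_intron == overlap_counts["genes"][key], key
```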


### 4g. Putative promoters

#### All 10x CpGs

In [18]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {all10xCpGs} \
-b {putativePromoters} \
| wc -l
!echo "all 10x CpG loci overlaps with putative promoters"

  176156
all 10x CpG loci overlaps with putative promoters


In [19]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {all10xCpGs} \
-b {putativePromoters} \
> 2020-02-11-All10xCpGs-Putative-Promoters.txt

In [20]:
!head 2020-02-11-All10xCpGs-Putative-Promoters.txt

NC_035780.1	27969	27970	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	27979	27980	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	28082	28083	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID

#### Methylated loci

In [21]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {methylatedLoci} \
-b {putativePromoters} \
| wc -l
!echo "methylated loci overlaps with putative promoters"

  106111
methylated loci overlaps with putative promoters


In [22]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {methylatedLoci} \
-b {putativePromoters} \
> 2020-02-11-MethLoci-Putative-Promoters.txt

In [23]:
!head 2020-02-11-MethLoci-Putative-Promoters.txt

NC_035780.1	27969	27970	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	27979	27980	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	28082	28083	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID

#### Sparsely methylated loci

In [24]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {sparselyMethylatedLoci} \
-b {putativePromoters} \
| wc -l
!echo "sparsely methylated loci overlaps with putative promoters"

   22870
sparsely methylated loci overlaps with putative promoters


In [25]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {sparselyMethylatedLoci} \
-b {putativePromoters} \
> 2020-02-11-SparseMethLoci-Putative-Promoters.txt

In [26]:
!head 2020-02-11-SparseMethLoci-Putative-Promoters.txt

NC_035780.1	95674	95675	NC_035780.1	Gnomon	mRNA	95255	96254	.	-	.	ID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
NC_035780.1	99251	99252	NC_035780.1	Gnomon	mRNA	98840	99839	.	+	.	ID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1
NC_035780.1	232223	232224	NC_035780.1	Gnomon	mRNA	231965	232964	.	-	.	ID

#### Unmethylated loci

In [27]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {unmethylatedLoci} \
-b {putativePromoters} \
| wc -l
!echo "unmethylated loci overlaps with putative promoters"

   47175
unmethylated loci overlaps with putative promoters


In [28]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {unmethylatedLoci} \
-b {putativePromoters} \
> 2020-02-11-UnMethLoci-Putative-Promoters.txt

In [29]:
!head 2020-02-11-UnMethLoci-Putative-Promoters.txt

NC_035780.1	28859	28860	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	28924	28925	NC_035780.1	Gnomon	mRNA	27961	28960	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	46515	46516	NC_035780.1	Gnomon	mRNA	46507	47506	.	-	.	ID=rna3;Parent=gene2;Dbxref=GeneID
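
The three `intersectBed` flags used in this section report different things per locus: `-u` writes each entry in `-a` once if it overlaps anything in `-b`, `-wb` writes one line per overlap followed by the matching `-b` entry, and `-v` writes only the `-a` entries with no overlap. The toy sketch below mimics those reporting modes in plain Python to show why a `-wb` file can have more lines than the matching `-u` count when features in `-b` overlap each other. The coordinates and function names are made up for illustration, and this is not a replacement for bedtools.

```python
# Toy illustration of intersectBed reporting modes; NOT a replacement for bedtools.
# Intervals are (chrom, start, end) tuples in 0-based, half-open BED coordinates.

def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def report_u(a_intervals, b_intervals):
    """Like -u: each A entry once if it overlaps anything in B."""
    return [a for a in a_intervals if any(overlaps(a, b) for b in b_intervals)]

def report_v(a_intervals, b_intervals):
    """Like -v: only A entries with no overlap in B."""
    return [a for a in a_intervals if not any(overlaps(a, b) for b in b_intervals)]

def report_wb(a_intervals, b_intervals):
    """Like -wb: one line per A/B overlap (overlapping part of A, then the B entry)."""
    out = []
    for a in a_intervals:
        for b in b_intervals:
            if overlaps(a, b):
                shared = (a[0], max(a[1], b[1]), min(a[2], b[2]))
                out.append((shared, b))
    return out

# Hypothetical loci and promoters (made-up coordinates, two promoters overlap)
loci = [("chr1", 100, 101), ("chr1", 500, 501), ("chr1", 900, 901)]
promoters = [("chr1", 50, 150), ("chr1", 90, 190), ("chr1", 450, 550)]

print(len(report_u(loci, promoters)))   # loci counted once each
print(len(report_wb(loci, promoters)))  # one line per locus/promoter pair
print(len(report_v(loci, promoters)))   # loci outside every promoter
```

If the counts from `-u` and the line counts of the `-wb` files ever disagree, overlapping features in the `-b` file are the first thing I would check.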

### 4h. No overlaps

#### All 10x CpGs

In [30]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {all10xCpGs} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "all 10x CpG loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

  603597
all 10x CpG loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [31]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {all10xCpGs} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-All5xCpGs-NoOverlaps.txt

In [32]:
!head 2020-02-11-All5xCpGs-NoOverlaps.txt

NC_007175.2	48	49
NC_007175.2	49	50
NC_007175.2	50	51
NC_007175.2	51	52
NC_007175.2	87	88
NC_007175.2	88	89
NC_007175.2	146	147
NC_007175.2	147	148
NC_007175.2	173	174
NC_007175.2	192	193


#### Methylated loci

In [33]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {methylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

  372047
methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [34]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {methylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-MethLoci-NoOverlaps.txt

In [35]:
!head 2020-02-11-MethLoci-NoOverlaps.txt

NC_035780.1	9637	9638
NC_035780.1	9657	9658
NC_035780.1	10089	10090
NC_035780.1	10331	10332
NC_035780.1	11692	11693
NC_035780.1	11706	11707
NC_035780.1	11711	11712
NC_035780.1	12686	12687
NC_035780.1	12758	12759
NC_035780.1	13486	13487


#### Sparsely methylated loci

In [36]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {sparselyMethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "sparsely methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

   84582
sparsely methylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [37]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {sparselyMethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-SparseMethLoci-NoOverlaps.txt

In [38]:
!head 2020-02-11-SparseMethLoci-NoOverlaps.txt

NC_007175.2	1506	1507
NC_007175.2	4841	4842
NC_007175.2	13069	13070
NC_035780.1	421	422
NC_035780.1	1101	1102
NC_035780.1	1540	1541
NC_035780.1	3468	3469
NC_035780.1	9789	9790
NC_035780.1	9832	9833
NC_035780.1	9854	9855


#### Unmethylated loci

In [39]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {unmethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
| wc -l
!echo "unmethylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters"

  146968
unmethylated loci do not overlap with exons, introns, transposable elements (all), or putative promoters


In [40]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {unmethylatedLoci} \
-b {exonList} {intronList} {transposableElementsAll} {putativePromoters} \
> 2020-02-11-UnMethLoci-NoOverlaps.txt

In [41]:
!head 2020-02-11-UnMethLoci-NoOverlaps.txt

NC_007175.2	48	49
NC_007175.2	49	50
NC_007175.2	50	51
NC_007175.2	51	52
NC_007175.2	87	88
NC_007175.2	88	89
NC_007175.2	146	147
NC_007175.2	147	148
NC_007175.2	173	174
NC_007175.2	192	193
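
As a sanity check on the `wc -l` outputs in 4g and 4h, the methylated, sparsely methylated, and unmethylated counts for loci with no overlaps should add up to the count for all 10x CpGs with no overlaps. The sketch below uses only the counts printed above; the dictionary and function names are mine.

```python
# Counts copied from the `wc -l` outputs in sections 4g and 4h.
promoter_counts = {
    "methylated": 106111,
    "sparsely methylated": 22870,
    "unmethylated": 47175,
}
no_overlap_counts = {
    "methylated": 372047,
    "sparsely methylated": 84582,
    "unmethylated": 146968,
}
all_10x_no_overlap = 603597  # all 10x CpGs with no feature overlap

def proportions(counts):
    """Return each category's share of the summed counts."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# The three methylation categories should partition the full set of loci
assert sum(no_overlap_counts.values()) == all_10x_no_overlap

for label, counts in [("putative promoters", promoter_counts),
                      ("no overlaps", no_overlap_counts)]:
    for category, fraction in proportions(counts).items():
        print(f"{label}\t{category}\t{fraction:.1%}")
```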


## 5. Identify methylation islands

To identify methylation islands using the method from Jeong et al. (2018), I need to define:

- starting size of the methylation window: 500 bp
- minimum fraction of methylated CpGs required within the window to be accepted: 0.02
- step size to extend the accepted window as long as the mCpG fraction is met: 50 bp
- mCpG file: input with mCpG chromosome and bp position
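
With a 500 bp window and a 0.02 fraction, a window needs at least 500 × 0.02 = 10 mCpGs to be accepted. The sketch below is my simplified reading of the sliding-window idea for a single chromosome. It is not the Perl script from Jeong et al. (2018): the function name `find_islands`, the greedy extension rule, and trimming island ends to the first and last mCpG are all assumptions made for illustration.

```python
from bisect import bisect_left

def find_islands(positions, window=500, min_fraction=0.02, step=50):
    """Simplified sliding-window sketch (assumed logic, not the published script).

    positions: mCpG coordinates on ONE chromosome.
    Returns a list of (first_mCpG, last_mCpG, n_mCpG) tuples.
    """
    positions = sorted(positions)
    islands = []
    i = 0
    while i < len(positions):
        start = positions[i]
        end = start + window
        # positions[i:j] are the mCpGs inside [start, end)
        j = bisect_left(positions, end, i)
        if (j - i) / window < min_fraction:
            i += 1  # window too sparse, try the next mCpG as a start
            continue
        # Extend by `step` bp while the mCpG fraction stays at or above threshold
        while True:
            k = bisect_left(positions, end + step, i)
            if (k - i) / (end + step - start) < min_fraction:
                break
            end, j = end + step, k
        # Assumption: report the island trimmed to its first and last mCpG
        islands.append((start, positions[j - 1], j - i))
        i = j  # continue after the island
    return islands

# Made-up example: ten mCpGs within 100 bp, plus one isolated mCpG
dense = list(range(1000, 1100, 10))
print(find_islands(dense + [5000]))
```

Lowering `min_fraction` or `window` makes it easier for sparse regions to qualify, which is why those parameters need to reflect *C. virginica* methylation rather than the insect defaults.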

### 5a. Create mCpG input file

In [None]:
#Drop the third (end) column from the mCpG file, since methylation island analysis only needs chromosome and position
!awk '{print $1"\t"$2}' 2020-02-11-All-10x-CpG-Loci-Methylated.bed \
> 2020-02-11-All-10x-CpG-Loci-Methylated-Reduced.bed

In [5]:
#Confirm file only has chromosome and start bp for mCpG
!head 2020-02-11-All-10x-CpG-Loci-Methylated-Reduced.bed

NC_035780.1	9253
NC_035780.1	9637
NC_035780.1	9657
NC_035780.1	10089
NC_035780.1	10331
NC_035780.1	11692
NC_035780.1	11706
NC_035780.1	11711
NC_035780.1	12686
NC_035780.1	12758


### 5b. Create methylation islands

In [22]:
#Identify methylation islands using 0.02 mCpG fraction
! ./methyl_island_sliding_window.pl 500 0.02 50 2020-02-11-All-10x-CpG-Loci-Methylated-Reduced.bed \
> 2020-02-11-Methylation-Islands-500_0.02_50.tab

In [23]:
#Columns: chr, start, end, number of mCpG
#Preview the islands and count the number of methylation islands
!head 2020-02-11-Methylation-Islands-500_0.02_50.tab
!wc -l 2020-02-11-Methylation-Islands-500_0.02_50.tab

NC_035780.1	23585	23723	13
NC_035780.1	36000	36358	11
NC_035780.1	100558	101923	30
NC_035780.1	102593	103702	37
NC_035780.1	115832	116304	11
NC_035780.1	211199	211544	11
NC_035780.1	239676	240134	13
NC_035780.1	245717	248838	63
NC_035780.1	250197	351003	2024
NC_035780.1	352791	353232	10
   63483 2020-02-11-Methylation-Islands-500_0.02_50.tab


In [21]:
#Find the maximum number of mCpG in an island
#Find the minimum number of mCpG in an island
!awk 'NR==1{max = $4 + 0; next} {if ($4 > max) max = $4;} END {print max}' \
2020-02-11-Methylation-Islands-500_0.02_50.tab
!awk 'NR==1{min = $4 + 0; next} {if ($4 < min) min = $4;} END {print min}' \
2020-02-11-Methylation-Islands-500_0.02_50.tab

24777
10


In [25]:
#Keep only methylation islands (MI) that are at least 500 bp long
!awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4}}' 2020-02-11-Methylation-Islands-500_0.02_50.tab \
> 2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab
! wc -l 2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab

   37063 2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab


In [26]:
#Find the maximum number of mCpG in a length-filtered island
#Find the minimum number of mCpG in a length-filtered island
!awk 'NR==1{max = $4 + 0; next} {if ($4 > max) max = $4;} END {print max}' \
2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab
!awk 'NR==1{min = $4 + 0; next} {if ($4 < min) min = $4;} END {print min}' \
2020-02-11-Methylation-Islands-500_0.02_50-filtered.tab

24777
11
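
The length filter and the min/max checks can also be done together in Python. The sketch below uses only the ten islands shown in the `head` output above, so its numbers describe that preview and not the full file. The function names are mine, and island length is computed as end minus start to match the `$3-$2` in the `awk` filter.

```python
# First ten islands from the `head` output above: (chrom, start, end, mCpG count)
islands = [
    ("NC_035780.1", 23585, 23723, 13),
    ("NC_035780.1", 36000, 36358, 11),
    ("NC_035780.1", 100558, 101923, 30),
    ("NC_035780.1", 102593, 103702, 37),
    ("NC_035780.1", 115832, 116304, 11),
    ("NC_035780.1", 211199, 211544, 11),
    ("NC_035780.1", 239676, 240134, 13),
    ("NC_035780.1", 245717, 248838, 63),
    ("NC_035780.1", 250197, 351003, 2024),
    ("NC_035780.1", 352791, 353232, 10),
]

def filter_by_length(rows, min_length=500):
    """Keep islands whose length (end - start) is at least min_length bp."""
    return [row for row in rows if row[2] - row[1] >= min_length]

def mcpg_range(rows):
    """Return (min, max) mCpG count across islands."""
    counts = [n_mcpg for _, _, _, n_mcpg in rows]
    return min(counts), max(counts)

long_islands = filter_by_length(islands)
print(len(islands), mcpg_range(islands))
print(len(long_islands), mcpg_range(long_islands))
```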


### 5c. Create BED files for IGV

In [29]:
#Identify files that need BED files
!find *.tab

2020-02-06-Methylation-Islands-200_0.02_50.tab
2020-02-06-Methylation-Islands-200_0.03_50.tab
2020-02-06-Methylation-Islands-200_0.04_50.tab
2020-02-06-Methylation-Islands-200_0.05_50.tab
2020-02-06-Methylation-Islands-200_0.10_50.tab
2020-02-06-Methylation-Islands-200_0.15_50.tab
2020-02-06-Methylation-Islands-200_0.20_50.tab
2020-02-06-Methylation-Islands-200_0.25_50.tab
2020-02-06-Methylation-Islands-200_0.27_50.tab
2020-02-06-Methylation-Islands-200_0.30_50.tab
2020-02-06-Methylation-Islands-300_0.02_50.tab
2020-02-06-Methylation-Islands-300_0.03_50.tab
2020-02-06-Methylation-Islands-300_0.04_50.tab
2020-02-06-Methylation-Islands-300_0.05_50.tab
2020-02-06-Methylation-Islands-300_0.10_50.tab
2020-02-06-Methylation-Islands-300_0.15_50.tab
2020-02-06-Methylation-Islands-300_0.20_50.tab
2020-02-06-Methylation-Islands-300_0.25_50.tab
2020-02-06-Methylation-Islands-500_0.02_25-filtered.tab
2020-02-06-Methylation-Islands-500_0.02_25.tab
2020-02-06-Methylation-Islands-500_0.02_50-filtered

In [30]:
%%bash
for f in *.tab
do
    #Keep chr, start, and end columns; quote filenames to guard against spaces
    awk '{print $1"\t"$2"\t"$3}' "${f}" > "${f}.bed"
done

In [33]:
#Check the file to ensure loop worked
!head 2020-02-11-Methylation-Islands-200_0.02_50-filtered.tab.bed

NC_035780.1	19901	20081
NC_035780.1	21693	21915
NC_035780.1	23585	23723
NC_035780.1	27826	28082
NC_035780.1	36000	36358
NC_035780.1	37557	37672
NC_035780.1	68011	68137
NC_035780.1	87531	87595
NC_035780.1	99242	99377
NC_035780.1	100558	101923
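
The loop above appends `.bed` to the full filename, so outputs end in `.tab.bed`. If cleaner names are preferred, a Python alternative could swap the extension instead. This is a sketch under my own assumptions: `tab_to_bed` and `convert_all` are hypothetical helper names, and fields are split on tabs, whereas `awk` splits on any whitespace by default.

```python
from pathlib import Path

def tab_to_bed(tab_path):
    """Write chr, start, end from a methylation island .tab file to a .bed file."""
    tab_path = Path(tab_path)
    bed_path = tab_path.with_suffix(".bed")  # replaces ".tab" instead of appending
    with tab_path.open() as src, bed_path.open("w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            dst.write("\t".join(fields[:3]) + "\n")
    return bed_path

def convert_all(directory="."):
    """Convert every .tab file in a directory and return the new BED paths."""
    return [tab_to_bed(tab) for tab in sorted(Path(directory).glob("*.tab"))]
```

Calling `convert_all()` from the analysis directory would cover the same files as the `for f in *.tab` loop.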
