# Genomic Location of DML

In this notebook, I will identify the genomic locations of [pH-associated DML identified with `methylKit`](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/code/06-methylKit.R).

1. Create BEDfiles for DML
2. Identify overlaps between SNPs and DML
3. Characterize genomic locations for DML
4. Combine line counts

## 0. Set working directory

In [1]:
pwd

'/Users/yaaminivenkataraman/Documents/project-gigas-oa-meth/code'

In [2]:
cd ../output/

/Users/yaaminivenkataraman/Documents/project-gigas-oa-meth/output


In [3]:
#mkdir 10_DML-characterization

In [19]:
cd 10_DML-characterization/

/Users/yaaminivenkataraman/Documents/project-gigas-oa-meth/output/10_DML-characterization


In [16]:
bedtoolsDirectory = "/opt/homebrew/bin/"

## 1. Create BEDfiles for DML

My `methylKit` DML lists are `.csv` files. To identify genomic locations with `bedtools intersect`, I need BEDfiles.

In [22]:
#Look at csv file to determine what modifications need to be made
#Column 2: chr, Column 3: start, Column 4: end, Column 8: meth.diff
!head ../06-methylKit/DML/DML-pH-50-Cov5-All.csv

"","chr","start","end","strand","pvalue","qvalue","meth.diff"
"375","NC_047559.1",7867,7869,"*",1.62548959439652e-16,4.40724882587981e-13,65.1062155782848
"4686","NC_047559.1",741585,741587,"*",4.16159580479533e-06,0.000712875666228856,58.141592920354
"9527","NC_047559.1",1191754,1191756,"*",9.10615096150674e-12,8.55819210647154e-09,53.4323271665044
"20530","NC_047559.1",2792607,2792609,"*",2.77002835981994e-06,0.000508253755861752,51.8593644354293
"21934","NC_047559.1",3065822,3065824,"*",4.43002899193152e-05,0.00509735167303334,60.0770492972819
"27751","NC_047559.1",3611505,3611507,"*",4.05341454457055e-09,1.85108015554992e-06,53.448275862069
"32064","NC_047559.1",3913926,3913928,"*",1.08462926383698e-45,2.58061516171489e-40,-74.8831918092065
"58355","NC_047559.1",6327898,6327900,"*",1.03394422276363e-06,0.000220234759568014,65
"65969","NC_047559.1",6936424,6936426,"*",1.30058690916754e-10,9.0216751507499e-08,50.7046304285304


In [7]:
%%bash

#Replace commas with tabs
#Remove quotation marks (can also be done in R)
#Print chr, start, end, meth.diff
#Remove header
#Save as BEDfile (output is named <input>.csv.bed)

for f in ../06-methylKit/DML/DML*csv
do
    tr "," "\t" < "${f}" \
    | tr -d '"' \
    | awk -F"\t" '{print $2"\t"$3"\t"$4"\t"$8}' \
    | tail -n+2 \
    > "${f}.bed"
done
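The same conversion can be sketched in Python, selecting columns by header name rather than by position. This is a minimal illustration, not the method used in this notebook: the function name `methylkit_csv_to_bed` is hypothetical, and the two example rows are copied from the `head` output above.

```python
import csv
import io


def methylkit_csv_to_bed(csv_text):
    """Convert methylKit DML CSV text to BED lines (chr, start, end, meth.diff)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bed_lines = []
    for row in reader:
        # The csv module strips the quotation marks; columns are chosen by name
        bed_lines.append("\t".join([row["chr"], row["start"], row["end"], row["meth.diff"]]))
    return bed_lines


# Header plus two rows copied from DML-pH-50-Cov5-All.csv
example = '''"","chr","start","end","strand","pvalue","qvalue","meth.diff"
"375","NC_047559.1",7867,7869,"*",1.62548959439652e-16,4.40724882587981e-13,65.1062155782848
"58355","NC_047559.1",6327898,6327900,"*",1.03394422276363e-06,0.000220234759568014,65
'''

for line in methylkit_csv_to_bed(example):
    print(line)
```

Selecting by header name means the output stays correct even if the column order of the `.csv` changes.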

In [8]:
%%bash

#Move BEDfiles to current working directory
mv ../06-methylKit/DML/*bed .

In [9]:
!head *bed

==> DML-pH-100-Cov5-Ind-unique-CT-SNPs.bed <==
NC_047559.1	738014	738016	-100
NC_047559.1	1011405	1011407	100
NC_047559.1	3874606	3874608	-100
NC_047559.1	4907314	4907316	-100
NC_047559.1	5430962	5430964	100
NC_047559.1	6601946	6601948	-100
NC_047559.1	6621879	6621881	100
NC_047559.1	7138119	7138121	-100
NC_047559.1	9537680	9537682	100
NC_047559.1	9548686	9548688	100

==> DML-pH-100-Cov5-Ind.csv.bed <==
NC_047559.1	738014	738016	-100
NC_047559.1	1006145	1006147	100
NC_047559.1	1011405	1011407	100
NC_047559.1	1715466	1715468	100
NC_047559.1	2193954	2193956	-100
NC_047559.1	3595157	3595159	-100
NC_047559.1	3613450	3613452	-100
NC_047559.1	3734205	3734207	-100
NC_047559.1	3874606	3874608	-100
NC_047559.1	4907314	4907316	-100

==> DML-pH-25-Cov5-All.csv.bed <==
NC_047559.1	7867	7869	65.1062155782848
NC_047559.1	559248	559250	-45.2380952380952
NC_047559.1	561284	561286	-28.5084643894621
NC_047559.1	589057	589059	29.4117647058824
NC_047559.1	606381	606383	25.0607

I imported the BEDfiles into [this IGV session](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/output/10_DML-characterization/dml.xml) to visualize them.

## 2. Overlaps between DML and unique C->T SNPs

I will count how many DML overlap with SNPs, then remove those overlapping DML before proceeding with analyses.

### 2a. Create BEDfile

In [5]:
#Convert the SNP table to a BEDfile, using the SNP position as both start and end
!awk '{print $1"\t"$2"\t"$2}' /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.tab \
> /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.bed

In [6]:
!head /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.bed
!wc -l /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.bed

NC_001276.1	14669	14669
NC_047559.1	1000025	1000025
NC_047559.1	10001065	10001065
NC_047559.1	10001128	10001128
NC_047559.1	10001236	10001236
NC_047559.1	10003470	10003470
NC_047559.1	10003475	10003475
NC_047559.1	10004318	10004318
NC_047559.1	100045	100045
NC_047559.1	10004558	10004558
  300278 /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.bed
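The BEDfile above stores each SNP with identical start and end coordinates. The BED format itself uses 0-based, half-open intervals, so a single base at 1-based position `p` is conventionally written as `(p - 1, p)`. The sketch below shows both encodings side by side. The function name `snp_to_bed` and its `half_open` flag are hypothetical, and this notebook does not confirm whether the positions in `unique-CT-SNPs.tab` are 1-based, so that is an assumption.

```python
def snp_to_bed(chrom, position, half_open=False):
    """Return a (chrom, start, end) tuple for a single SNP.

    half_open=False mirrors the awk command above (start == end == position).
    half_open=True assumes `position` is 1-based and returns the conventional
    0-based, half-open BED interval covering that one base.
    """
    if half_open:
        return (chrom, position - 1, position)
    return (chrom, position, position)


# Position copied from the head of unique-CT-SNPs.bed
print(snp_to_bed("NC_047559.1", 1000025))
print(snp_to_bed("NC_047559.1", 1000025, half_open=True))
```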


### 2b. Identify overlaps

In [21]:
#Identify overlaps between SNPs and potential DML
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All.csv.bed \
-b /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.bed \
> DML-pH-50-Cov5-All-unique-CT-SNPs.bed
!head DML-pH-50-Cov5-All-unique-CT-SNPs.bed
!wc -l DML-pH-50-Cov5-All-unique-CT-SNPs.bed

NC_047559.1	741585	741587	58.141592920354
NC_047559.1	3611505	3611507	53.448275862069
NC_047559.1	3913926	3913928	-74.8831918092065
NC_047559.1	7271820	7271822	-50.4714016341923
NC_047559.1	11690479	11690481	-65.2272727272727
NC_047559.1	14253544	14253546	50.8838383838384
NC_047559.1	19116966	19116968	-62.8409090909091
NC_047559.1	19816819	19816821	-54.2763157894737
NC_047559.1	20135187	20135189	52.9661016949153
NC_047559.1	21100425	21100427	-64.0404929577465
     315 DML-pH-50-Cov5-All-unique-CT-SNPs.bed


In [25]:
#Remove DML that overlap SNPs and save the remaining DML as a new file
!{bedtoolsDirectory}subtractBed \
-a DML-pH-50-Cov5-All.csv.bed \
-b /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/unique-CT-SNPs.bed \
> DML-pH-50-Cov5-All-NO-SNPs.bed
!head DML-pH-50-Cov5-All-NO-SNPs.bed
!wc -l DML-pH-50-Cov5-All-NO-SNPs.bed

NC_047559.1	7867	7869	65.1062155782848
NC_047559.1	1191754	1191756	53.4323271665044
NC_047559.1	2792607	2792609	51.8593644354293
NC_047559.1	3065822	3065824	60.0770492972819
NC_047559.1	6327898	6327900	65
NC_047559.1	6936424	6936426	50.7046304285304
NC_047559.1	7287685	7287687	50.5203619909502
NC_047559.1	8701415	8701417	53.5714285714286
NC_047559.1	9400941	9400943	-50.7792207792208
NC_047559.1	9400974	9400976	-77.0087976539589
    1284 DML-pH-50-Cov5-All-NO-SNPs.bed
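The two steps in this section split the DML list into DML that overlap a SNP and DML that do not. The sketch below illustrates that partition in plain Python. It is not a reimplementation of `bedtools`: the function name `split_by_snp_overlap` is hypothetical, the overlap rule (`start <= position <= end`) is an assumption, and the SNP position in the example is made up for illustration.

```python
from collections import defaultdict


def split_by_snp_overlap(dml, snp_positions):
    """Partition DML into (overlapping, clean) lists by SNP overlap.

    dml: list of (chrom, start, end, meth_diff) tuples
    snp_positions: iterable of (chrom, position) tuples
    Assumption: a SNP overlaps a DML when start <= position <= end.
    """
    snps_by_chrom = defaultdict(set)
    for chrom, position in snp_positions:
        snps_by_chrom[chrom].add(position)

    overlapping, clean = [], []
    for record in dml:
        chrom, start, end, _ = record
        chrom_snps = snps_by_chrom.get(chrom, set())
        has_snp = any(pos in chrom_snps for pos in range(start, end + 1))
        (overlapping if has_snp else clean).append(record)
    return overlapping, clean


# DML coordinates are taken from the output above; the SNP position is hypothetical
example_dml = [
    ("NC_047559.1", 7867, 7869, 65.1062155782848),
    ("NC_047559.1", 741585, 741587, 58.141592920354),
]
example_snps = [("NC_047559.1", 741586)]

with_snps, without_snps = split_by_snp_overlap(example_dml, example_snps)
print(len(with_snps), len(without_snps))
```

Because every DML lands in exactly one of the two lists, the two line counts should sum to the number of lines in the input BEDfile when whole DML are removed. That sum is a quick sanity check on the `wc -l` values.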


In [26]:
#Count the number of hypomethylated DML (negative meth.diff)
!grep "-" DML-pH-50-Cov5-All-NO-SNPs.bed | wc -l
#Count the number of hypermethylated DML (positive meth.diff)
!grep -v "-" DML-pH-50-Cov5-All-NO-SNPs.bed | wc -l

     654
     630
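The `grep` approach works here because the only hyphen on each line is the minus sign in the meth.diff column. A more explicit alternative is to parse column 4 and test its sign, as in the hedged sketch below. The function name `count_by_direction` is hypothetical, and the example lines are copied from the `NO-SNPs` output above.

```python
def count_by_direction(bed_lines):
    """Count hypo- and hypermethylated DML from the sign of meth.diff (column 4)."""
    counts = {"hypomethylated": 0, "hypermethylated": 0}
    for line in bed_lines:
        meth_diff = float(line.rstrip("\n").split("\t")[3])
        if meth_diff < 0:
            counts["hypomethylated"] += 1
        elif meth_diff > 0:
            counts["hypermethylated"] += 1
    return counts


# Four lines copied from DML-pH-50-Cov5-All-NO-SNPs.bed
example_lines = [
    "NC_047559.1\t7867\t7869\t65.1062155782848",
    "NC_047559.1\t6327898\t6327900\t65",
    "NC_047559.1\t9400941\t9400943\t-50.7792207792208",
    "NC_047559.1\t9400974\t9400976\t-77.0087976539589",
]

print(count_by_direction(example_lines))
```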


## 3. Characterize genomic locations of DML

I will look at overlaps between genome features and the DML that remain after removing SNP overlaps (`DML-pH-50-Cov5-All-NO-SNPs.bed`).

### 3a. Gene

In [27]:
#Find overlaps between DML and feature
#Look at output
#Count number of overlaps

!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_gene.gff \
> DML-pH-50-Cov5-All-Gene.bed
!head DML-pH-50-Cov5-All-Gene.bed
!wc -l DML-pH-50-Cov5-All-Gene.bed

NC_047559.1	1191754	1191756	53.4323271665044
NC_047559.1	2792607	2792609	51.8593644354293
NC_047559.1	3065822	3065824	60.0770492972819
NC_047559.1	6327898	6327900	65
NC_047559.1	6936424	6936426	50.7046304285304
NC_047559.1	8701415	8701417	53.5714285714286
NC_047559.1	10185079	10185081	-61.6666666666667
NC_047559.1	10868218	10868220	55.2532833020638
NC_047559.1	12290987	12290989	54.7191661841343
NC_047559.1	12382145	12382147	-62.5
    1181 DML-pH-50-Cov5-All-Gene.bed


In [28]:
#Find overlaps between DML and genes
#Include original entry from gene GFF for each overlap, which will be used in downstream enrichment analyses (wb)
#Look at output. Do not count overlaps because there are likely redundant entries

!{bedtoolsDirectory}intersectBed \
-wb \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_gene.gff \
> DML-pH-50-Cov5-All-Gene-wb.bed
!head DML-pH-50-Cov5-All-Gene-wb.bed

NC_047559.1	1191754	1191756	53.4323271665044	NC_047559.1	Gnomon	gene	1190707	1193919	.	-	.	ID=gene-LOC105318174;Dbxref=GeneID:105318174;Name=LOC105318174;gbkey=Gene;gene=LOC105318174;gene_biotype=protein_coding
NC_047559.1	2792607	2792609	51.8593644354293	NC_047559.1	Gnomon	gene	2781135	2798818	.	+	.	ID=gene-LOC117683699;Dbxref=GeneID:117683699;Name=LOC117683699;gbkey=Gene;gene=LOC117683699;gene_biotype=protein_coding
NC_047559.1	2792607	2792609	51.8593644354293	NC_047559.1	Gnomon	gene	2623361	2849124	.	-	.	ID=gene-LOC117683566;Dbxref=GeneID:117683566;Name=LOC117683566;gbkey=Gene;gene=LOC117683566;gene_biotype=protein_coding
NC_047559.1	3065822	3065824	60.0770492972819	NC_047559.1	Gnomon	gene	3060171	3074704	.	+	.	ID=gene-LOC105337362;Dbxref=GeneID:105337362;Name=LOC105337362;gbkey=Gene;gene=LOC105337362;gene_biotype=protein_coding
NC_047559.1	6327898	6327900	65	NC_047559.1	Gnomon	gene	6323929	6334496	.	+	.	ID=gene-LOC105342022;Dbxref=GeneID:105342022;Name=LOC105342022;gbkey=Gene;g

In [29]:
#Isolate the GFF attribute column (column 13)
#Translate ; and = to tabs
#Isolate the field with gene IDs (the Name value)
#Sort and identify unique gene IDs
#Count the number of unique gene IDs that contain DML

!cut -f13 DML-pH-50-Cov5-All-Gene-wb.bed \
| tr ";" "\t" \
| tr "=" "\t" \
| cut -f6 \
| sort | uniq \
| wc -l

     859
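The `cut`/`tr` pipeline relies on `Name` being the third key in the attribute column. The sketch below looks up the `Name` key directly instead, which does not depend on attribute order. It is an illustration only: both function names are hypothetical, and the example records are rebuilt from the `-wb` output above.

```python
def gene_name_from_attributes(attributes):
    """Return the Name value from a GFF3 attribute string, or None if absent."""
    for pair in attributes.split(";"):
        key, _, value = pair.partition("=")
        if key == "Name":
            return value
    return None


def unique_gene_names(wb_lines, attribute_column=13):
    """Collect unique gene names from intersectBed -wb output lines."""
    names = set()
    for line in wb_lines:
        fields = line.rstrip("\n").split("\t")
        name = gene_name_from_attributes(fields[attribute_column - 1])
        if name is not None:
            names.add(name)
    return names


def wb_line(start, end, meth_diff, gene_start, gene_end, strand, loc):
    """Build one -wb style line (4 BED columns followed by 9 GFF columns)."""
    attributes = f"ID=gene-{loc};Dbxref=GeneID:{loc[3:]};Name={loc};gbkey=Gene"
    return "\t".join(["NC_047559.1", start, end, meth_diff, "NC_047559.1", "Gnomon",
                      "gene", gene_start, gene_end, ".", strand, ".", attributes])


# Values copied from DML-pH-50-Cov5-All-Gene-wb.bed; one DML overlaps two genes
example_wb = [
    wb_line("1191754", "1191756", "53.4323271665044", "1190707", "1193919", "-", "LOC105318174"),
    wb_line("2792607", "2792609", "51.8593644354293", "2781135", "2798818", "+", "LOC117683699"),
    wb_line("2792607", "2792609", "51.8593644354293", "2623361", "2849124", "-", "LOC117683566"),
]

print(len(example_wb), len(unique_gene_names(example_wb)))
```

The example also shows why the `-wb` file can have more lines than the `-u` file: a DML that falls inside two overlapping genes is reported once per gene.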


In [30]:
#Isolate gene ID information and save

!cut -f13 DML-pH-50-Cov5-All-Gene-wb.bed \
| tr ";" "\t" \
| tr "=" "\t" \
| cut -f6 \
> geneID-DML-overlap.tab
!head geneID-DML-overlap.tab

LOC105318174
LOC117683699
LOC117683566
LOC105337362
LOC105342022
LOC105317492
LOC105325155
LOC105333210
LOC105341494
LOC105341392


### 3b. Exon UTR

In [31]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_exonUTR.gff \
> DML-pH-50-Cov5-All-exonUTR.bed
!head DML-pH-50-Cov5-All-exonUTR.bed
!wc -l DML-pH-50-Cov5-All-exonUTR.bed

NC_047559.1	27249136	27249138	53.8240368027602
NC_047559.1	36792292	36792294	-50.7981220657277
NC_047559.1	39699232	39699234	-63.9125683060109
NC_047559.1	47929211	47929213	-51.4336917562724
NC_047559.1	51072178	51072180	74.3784111582777
NC_047559.1	51410313	51410315	51.9424871453057
NC_047559.1	54339651	54339653	71.640826873385
NC_047560.1	52650818	52650820	-63.55642530985
NC_047560.1	53356731	53356733	-54.9748354626403
NC_047561.1	5273876	5273878	57.1055381400209
      77 DML-pH-50-Cov5-All-exonUTR.bed


### 3c. CDS

In [32]:
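#Note: the input here is the unfiltered DML list (DML-pH-50-Cov5-All.csv.bed), so DML that overlap C->T SNPs are included
!{bedtoolsDirectory}intersectBed \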
-u \
-a DML-pH-50-Cov5-All.csv.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_CDS.gff \
> DML-pH-50-Cov5-All-CDS.bed
!head DML-pH-50-Cov5-All-CDS.bed
!wc -l DML-pH-50-Cov5-All-CDS.bed

NC_047559.1	741585	741587	58.141592920354
NC_047559.1	3611505	3611507	53.448275862069
NC_047559.1	6327898	6327900	65
NC_047559.1	10868218	10868220	55.2532833020638
NC_047559.1	11690479	11690481	-65.2272727272727
NC_047559.1	12290987	12290989	54.7191661841343
NC_047559.1	12516267	12516269	55.6078431372549
NC_047559.1	13502465	13502467	-67.0745272525028
NC_047559.1	13628415	13628417	-61.0915805546678
NC_047559.1	14249126	14249128	-52.8169014084507
     442 DML-pH-50-Cov5-All-CDS.bed


### 3d. Intron

In [33]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_intron.bed \
> DML-pH-50-Cov5-All-intron.bed
!head DML-pH-50-Cov5-All-intron.bed
!wc -l DML-pH-50-Cov5-All-intron.bed

NC_047559.1	1191754	1191756	53.4323271665044
NC_047559.1	2792607	2792609	51.8593644354293
NC_047559.1	3065822	3065824	60.0770492972819
NC_047559.1	6936424	6936426	50.7046304285304
NC_047559.1	8701415	8701417	53.5714285714286
NC_047559.1	10185079	10185081	-61.6666666666667
NC_047559.1	12382145	12382147	-62.5
NC_047559.1	12720430	12720432	58.499115826702
NC_047559.1	13012936	13012938	-56.4298724954463
NC_047559.1	13029203	13029205	-53.2212885154062
     783 DML-pH-50-Cov5-All-intron.bed


### 3e. Upstream flanks

In [34]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_upstream.gff \
> DML-pH-50-Cov5-All-upstream.bed
!head DML-pH-50-Cov5-All-upstream.bed
!wc -l DML-pH-50-Cov5-All-upstream.bed

NC_047559.1	20449763	20449765	-54.5018007202881
NC_047562.1	31533368	31533370	74.6376811594203
NC_047562.1	49672248	49672250	57.0588235294118
NC_047564.1	53826886	53826888	50.8620689655172
NC_047565.1	10566559	10566561	-57.6034259857789
NC_047568.1	1530485	1530487	-53.0120481927711
NC_047568.1	35034719	35034721	51.6505281690141
NC_047568.1	50346688	50346690	-52.2681954137587
NW_022994852.1	205599	205601	54.1818181818182
NW_022994852.1	205641	205643	57.4074074074074
      10 DML-pH-50-Cov5-All-upstream.bed


### 3f. Downstream flanks

In [35]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_downstream.gff \
> DML-pH-50-Cov5-All-downstream.bed
!head DML-pH-50-Cov5-All-downstream.bed
!wc -l DML-pH-50-Cov5-All-downstream.bed

NC_047559.1	7287685	7287687	50.5203619909502
NC_047559.1	9400941	9400943	-50.7792207792208
NC_047559.1	9400974	9400976	-77.0087976539589
NC_047559.1	17446551	17446553	-52.3654916512059
NC_047559.1	20449763	20449765	-54.5018007202881
NC_047561.1	687407	687409	52.6541764246682
NC_047561.1	3223414	3223416	-54.9426138467234
NC_047561.1	12531033	12531035	-52.9411764705882
NC_047561.1	17751068	17751070	-53.7037037037037
NC_047561.1	25420097	25420099	-75.750300120048
      61 DML-pH-50-Cov5-All-downstream.bed


### 3g. Intergenic regions

In [36]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_intergenic.bed \
> DML-pH-50-Cov5-All-intergenic.bed
!head DML-pH-50-Cov5-All-intergenic.bed
!wc -l DML-pH-50-Cov5-All-intergenic.bed

NC_047559.1	7867	7869	65.1062155782848
NC_047560.1	7329115	7329117	-53.3604680109534
NC_047560.1	37256697	37256699	-55.3846153846154
NC_047560.1	38292492	38292494	-60.2941176470588
NC_047560.1	68488679	68488681	-64.006106870229
NC_047561.1	688237	688239	68.2735543239569
NC_047561.1	30080184	30080186	-50.388198757764
NC_047562.1	17297242	17297244	50.6059613494923
NC_047562.1	34646047	34646049	-59.1729323308271
NC_047562.1	48518001	48518003	-71.9533275713051
      38 DML-pH-50-Cov5-All-intergenic.bed


### 3h. lncRNA

In [37]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_lncRNA.gff \
> DML-pH-50-Cov5-All-lncRNA.bed
!head DML-pH-50-Cov5-All-lncRNA.bed
!wc -l DML-pH-50-Cov5-All-lncRNA.bed

NC_047560.1	51414503	51414505	-62.1936274509804
NC_047561.1	57903862	57903864	-64.4886363636364
NC_047563.1	4249619	4249621	-63.0978499341817
NC_047563.1	22612083	22612085	50.9153318077803
NC_047565.1	20887795	20887797	51.1754668688975
NC_047566.1	30976378	30976380	54.1435185185185
NC_047567.1	31425444	31425446	63.5174418604651
NC_047568.1	48922872	48922874	53.1211317418214
NC_047568.1	48925153	48925155	58.7694900969237
       9 DML-pH-50-Cov5-All-lncRNA.bed


### 3i. Transposable elements

In [38]:
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-50-Cov5-All-NO-SNPs.bed \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_rm.te.bed \
> DML-pH-50-Cov5-All-TE.bed
!head DML-pH-50-Cov5-All-TE.bed
!wc -l DML-pH-50-Cov5-All-TE.bed

NC_047559.1	2792607	2792609	51.8593644354293
NC_047559.1	13012936	13012938	-56.4298724954463
NC_047559.1	13419019	13419021	-56.3791874554526
NC_047559.1	17446551	17446553	-52.3654916512059
NC_047559.1	19992123	19992125	-53.7414965986395
NC_047559.1	21068177	21068179	55.8333333333333
NC_047559.1	22093659	22093661	-64.4886363636364
NC_047559.1	22103921	22103923	59.5784543325527
NC_047559.1	22863832	22863834	-50.5079825834543
NC_047559.1	22866650	22866652	-63.9322702613842
     434 DML-pH-50-Cov5-All-TE.bed


## 4. Combine line counts

Collecting the overlap counts in a single file will make them easier to use in downstream analyses.

In [52]:
!find DML-pH-50-Cov5-All-*bed

DML-pH-50-Cov5-All-CDS.bed
DML-pH-50-Cov5-All-Gene-wb.bed
DML-pH-50-Cov5-All-Gene.bed
DML-pH-50-Cov5-All-NO-SNPs.bed
DML-pH-50-Cov5-All-TE.bed
DML-pH-50-Cov5-All-downstream.bed
DML-pH-50-Cov5-All-exonUTR.bed
DML-pH-50-Cov5-All-intergenic.bed
DML-pH-50-Cov5-All-intron.bed
DML-pH-50-Cov5-All-lncRNA.bed
DML-pH-50-Cov5-All-unique-CT-SNPs.bed
DML-pH-50-Cov5-All-upstream.bed


In [53]:
#Get line count for all DML overlap files
#Remove the 13th line (total line count across all files)
#Remove the 4th line (NO-SNPs.bed, the filtered DML list rather than an overlap count)
#Print in a tab-delimited format
#Save output

!wc -l DML-pH-50-Cov5-All-*bed \
| sed '13,$ d' \
| sed '4d' \
| awk '{print $1"\t"$2}' \
> DML-pH-50-Cov5-All-Overlap-counts.txt

In [56]:
!head -n12 DML-pH-50-Cov5-All-Overlap-counts.txt

442	DML-pH-50-Cov5-All-CDS.bed
1218	DML-pH-50-Cov5-All-Gene-wb.bed
1181	DML-pH-50-Cov5-All-Gene.bed
434	DML-pH-50-Cov5-All-TE.bed
61	DML-pH-50-Cov5-All-downstream.bed
77	DML-pH-50-Cov5-All-exonUTR.bed
38	DML-pH-50-Cov5-All-intergenic.bed
783	DML-pH-50-Cov5-All-intron.bed
9	DML-pH-50-Cov5-All-lncRNA.bed
315	DML-pH-50-Cov5-All-unique-CT-SNPs.bed
10	DML-pH-50-Cov5-All-upstream.bed
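The `sed` commands above drop rows by line position, which depends on how many files match the glob and on their sort order. As a hedged alternative, the sketch below builds the same kind of table by file name, skipping files by feature label rather than by row number. The function name `overlap_counts` and its parameters are hypothetical, and the demonstration writes placeholder files to a temporary directory instead of reading real DML.

```python
import tempfile
from pathlib import Path


def overlap_counts(directory, prefix="DML-pH-50-Cov5-All-", exclude=("NO-SNPs",)):
    """Return {feature: line count} for each '<prefix><feature>.bed' file in directory."""
    counts = {}
    for path in sorted(Path(directory).glob(prefix + "*.bed")):
        feature = path.name[len(prefix):-len(".bed")]
        if feature in exclude:
            continue  # skip files that are not overlap results
        with path.open() as handle:
            counts[feature] = sum(1 for _ in handle)
    return counts


# Demonstration on throwaway files; the contents are placeholders, not real DML
with tempfile.TemporaryDirectory() as tmp:
    for feature, n_lines in [("CDS", 3), ("NO-SNPs", 5), ("intron", 2)]:
        Path(tmp, f"DML-pH-50-Cov5-All-{feature}.bed").write_text("chr\t0\t2\t50\n" * n_lines)
    demo_counts = overlap_counts(tmp)

for feature, count in demo_counts.items():
    print(f"{count}\t{feature}")
```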
