# DML Analysis

In this notebook, I will examine the location of differentially methylated loci (DML) in the *C. gigas* genome. The DML were identified using `methylKit` in [this R script](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/analyses/2019-09-12-MethylKit/2019-09-12-MethylKit.Rmd).

Methods:

1. Prepare for Analyses
2. Locate Files and Set Variable Paths
3. Identify Overlaps between Genomic Feature Tracks

## 0. Prepare for Analyses

### 0a. Set Working Directory

In [1]:
pwd

'/Users/yaamini/Documents/project-gigas-oa-meth/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-gigas-oa-meth/analyses


In [3]:
!mkdir 2019-09-15-DML-Analysis

In [4]:
ls -F

[34m2019-08-30-Bismark-Parameter-Testing[m[m/ [34m2019-09-15-DML-Analysis[m[m/
[34m2019-09-12-MethylKit[m[m/                 README.md
[34m2019-09-13-IGV-Verification[m[m/


In [5]:
cd 2019-09-15-DML-Analysis/

/Users/yaamini/Documents/project-gigas-oa-meth/analyses/2019-09-15-DML-Analysis


### 0b. Download Genome Feature Files

I will be using the following tracks from [this `eagle` directory](https://eagle.fish.washington.edu/trilobite/index.php?dir=Crassostrea_gigas_v9_tracks%2F):

1. Exon: Coding regions
2. Intron: Regions that are removed
3. Genes: This includes exons and introns, as well as constituent mRNA.
4. Promoters: Regions upstream of genes that could be important for regulation
5. Tranpsosable elements (_C. gigas_): Transposable elements located using information from _C. gigas_ only (see [Sam's notes](http://onsnetwork.org/kubu4/2018/08/28/transposable-element-mapping-crassostrea-virginica-genome-cvirginica_v300-using-repeatmasker-4-07/) for more information)
4. CG motifs: Regions with CGs where methylation can occur

In [6]:
!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_exon.gff \
> Cgigas_v9_exon.gff

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.7M  100 11.7M    0     0  6888k      0  0:00:01  0:00:01 --:--:-- 6964k


In [7]:
!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_intron.gff \
> Cgigas_v9_intron.gff

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.0M  100 12.0M    0     0  7933k      0  0:00:01  0:00:01 --:--:-- 7947k


In [8]:
!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_gene.gff \
> Cgigas_v9_gene.gff

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1777k  100 1777k    0     0  4896k      0 --:--:-- --:--:-- --:--:-- 5288k


In [9]:
!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_1k5p_gene_promoter.gff \
> Cgigas_v9_1k5p_gene_promoter.gff

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1848k  100 1848k    0     0  5306k      0 --:--:-- --:--:-- --:--:-- 5373k


In [20]:
!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_TE.gff \
> Cgigas_v9_TE.gff

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6325k  100 6325k    0     0  7365k      0 --:--:-- --:--:-- --:--:-- 7695k


In [10]:
!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_CG.gff \
> Cgigas_v9_CG.gff

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  932M  100  932M    0     0  9789k      0  0:01:37  0:01:37 --:--:-- 9541k


In [11]:
!ls Cgigas*

Cgigas_v9_1k5p_gene_promoter.gff Cgigas_v9_gene.gff
Cgigas_v9_CG.gff                 Cgigas_v9_intron.gff
Cgigas_v9_exon.gff


## 1. Locate Relevant Files and Set Variable Path Names

### 1a. Set Variable Path Names

Setting the variable path names allows me to reuse this script with different input files or different paths to programs without manually changing the file names each time.

In [12]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [13]:
DMLlist = "../2019-09-12-MethylKit/2019-09-12-DML-Destrand-10x-Locations-100diff-NoExtras.bed"

In [14]:
exonList = "Cgigas_v9_exon.gff"

In [15]:
intronList = "Cgigas_v9_intron.gff"

In [17]:
geneList = "Cgigas_v9_gene.gff"

In [18]:
promoterList = "Cgigas_v9_1k5p_gene_promoter.gff"

In [21]:
transposableElements = "Cgigas_v9_TE.gff"

In [22]:
CGMotifList = "Cgigas_v9_CG.gff"

### 1b. Confirm Variable Path Works and Characterize Files

The BEDfiles with DML and DMR can be viewed below. Columns are are the chromosome, start position, end position, strand, and fold difference with direction. The files only have DML and DMR that were at least 50% different between the two treatments (control and elevated pCO<sub>2</sub>).

In [23]:
#Previewing the files
!head {DMLlist}

C22384	1328	1330	-100
C22628	1621	1623	100
C28982	4929	4931	100
C29914	4052	4054	-100
C29914	4052	4054	-100
C29976	649	651	-100
C30322	3482	3484	-100
C30322	3599	3601	-100
C32984	5070	5072	-100
C33708	8307	8309	100


In [24]:
#Counting the number of lines to count DML
!wc -l {DMLlist}

     625 ../2019-09-12-MethylKit/2019-09-12-DML-Destrand-10x-Locations-100diff-NoExtras.bed


In [25]:
!head {exonList}

C16582	GLEAN	CDS	35	385	.	-	0	Parent=CGI_10000001;
C17212	GLEAN	CDS	31	363	.	+	0	Parent=CGI_10000002;
C17316	GLEAN	CDS	30	257	.	+	0	Parent=CGI_10000003;
C17476	GLEAN	CDS	104	257	.	-	0	Parent=CGI_10000004;
C17476	GLEAN	CDS	34	74	.	-	2	Parent=CGI_10000004;
C17998	GLEAN	CDS	196	387	.	-	0	Parent=CGI_10000005;
C18346	GLEAN	CDS	174	551	.	+	0	Parent=CGI_10000009;
C18428	GLEAN	CDS	286	546	.	-	0	Parent=CGI_10000010;
C18964	GLEAN	CDS	203	658	.	-	0	Parent=CGI_10000011;
C18980	GLEAN	CDS	30	674	.	+	0	Parent=CGI_10000012;


In [26]:
!wc -l {exonList}

  196691 Cgigas_v9_exon.gff


In [27]:
!head {intronList}

C17476	subtractBed	intrn	75	103	.	-	.	Parent=CGI_10000004;
C19392	subtractBed	intrn	184	451	.	+	.	Parent=CGI_10000015;
C20262	subtractBed	intrn	539	641	.	-	.	Parent=CGI_10000025;
C20262	subtractBed	intrn	650	871	.	-	.	Parent=CGI_10000025;
C20334	subtractBed	intrn	524	867	.	-	.	Parent=CGI_10000028;
C20412	subtractBed	intrn	215	409	.	-	.	Parent=CGI_10000029;
C20412	subtractBed	intrn	464	705	.	-	.	Parent=CGI_10000029;
C20462	subtractBed	intrn	50	271	.	+	.	Parent=CGI_10000030;
C20462	subtractBed	intrn	360	481	.	+	.	Parent=CGI_10000030;
C20462	subtractBed	intrn	577	822	.	+	.	Parent=CGI_10000030;


In [28]:
!wc -l {intronList}

  176049 Cgigas_v9_intron.gff


In [29]:
!head {geneList}

C16582	GLEAN	mRNA	35	385	0.555898	-	.	ID=CGI_10000001;
C17212	GLEAN	mRNA	31	363	0.999572	+	.	ID=CGI_10000002;
C17316	GLEAN	mRNA	30	257	0.555898	+	.	ID=CGI_10000003;
C17476	GLEAN	mRNA	34	257	0.998947	-	.	ID=CGI_10000004;
C17998	GLEAN	mRNA	196	387	1	-	.	ID=CGI_10000005;
C18346	GLEAN	mRNA	174	551	1	+	.	ID=CGI_10000009;
C18428	GLEAN	mRNA	286	546	0.555898	-	.	ID=CGI_10000010;
C18964	GLEAN	mRNA	203	658	0.999572	-	.	ID=CGI_10000011;
C18980	GLEAN	mRNA	30	674	0.555898	+	.	ID=CGI_10000012;
C19100	GLEAN	mRNA	160	681	0.999955	-	.	ID=CGI_10000013;


In [30]:
!wc -l {geneList}

   28027 Cgigas_v9_gene.gff


In [31]:
!head {promoterList}

C16582	flankbed	promoter	386	395	.	-	.	ID=CGI_10000001;
C17212	flankbed	promoter	1	30	.	+	.	ID=CGI_10000002;
C17316	flankbed	promoter	1	29	.	+	.	ID=CGI_10000003;
C17476	flankbed	promoter	258	491	.	-	.	ID=CGI_10000004;
C17998	flankbed	promoter	388	559	.	-	.	ID=CGI_10000005;
C18346	flankbed	promoter	1	173	.	+	.	ID=CGI_10000009;
C18428	flankbed	promoter	547	611	.	-	.	ID=CGI_10000010;
C18964	flankbed	promoter	659	714	.	-	.	ID=CGI_10000011;
C18980	flankbed	promoter	1	29	.	+	.	ID=CGI_10000012;
C19100	flankbed	promoter	682	743	.	-	.	ID=CGI_10000013;


In [32]:
!wc -l {promoterList}

   28023 Cgigas_v9_1k5p_gene_promoter.gff


In [33]:
!head {transposableElements}

C21242	TRF	Tandem_Repeat	38	100	72	+	.	.
C21306	TRF	Tandem_Repeat	35	143	112	+	.	.
C21306	TRF	Tandem_Repeat	574	947	208	+	.	.
C21306	TRF	Tandem_Repeat	574	901	313	+	.	.
C21372	TRF	Tandem_Repeat	643	671	58	+	.	.
C22542	TRF	Tandem_Repeat	1727	1774	96	+	.	.
C22728	TRF	Tandem_Repeat	426	491	105	+	.	.
C23428	TRF	Tandem_Repeat	130	415	202	+	.	.
C23796	TRF	Tandem_Repeat	547	608	97	+	.	.
C24440	TRF	Tandem_Repeat	1059	1089	62	+	.	.


In [34]:
!wc -l {transposableElements}

  119786 Cgigas_v9_TE.gff


In [35]:
!head {CGMotifList}

##gff-version 3
##sequence-region scaffold360 1 280
#!Date 2013-04-23
#!Type DNA
#!Source-version EMBOSS 6.5.7.0
scaffold360	fuzznuc	nucleotide_motif	60	61	2	+	.	ID=scaffold360.1;note=*pat pattern:CG
scaffold360	fuzznuc	nucleotide_motif	96	97	2	+	.	ID=scaffold360.2;note=*pat pattern:CG
scaffold360	fuzznuc	nucleotide_motif	120	121	2	+	.	ID=scaffold360.3;note=*pat pattern:CG
scaffold360	fuzznuc	nucleotide_motif	187	188	2	+	.	ID=scaffold360.4;note=*pat pattern:CG
##gff-version 3


In [36]:
!wc -l {CGMotifList}

 10035701 Cgigas_v9_CG.gff


## 2. Identify DML Overlaps with Genomic Feature Tracks

To identify the location of DML in the *C. gigas* genome, I will use `intersect` from `bedtools`. [The BEDtools suite](http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html) allows me to easily find overlapping regions of different BEDfiles.

In [36]:
! {bedtoolsDirectory}intersectBed -h


Tool:    bedtools intersect (aka intersectBed)
Version: v2.26.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

	Note: -b may be followed with multiple databases and/or 
	wildcard (*) character(s). 
Options: 
	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f 

### 2a. Exons

In [26]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {DMLlist} \
-b {exonList} \
| wc -l
!echo "DML overlaps with exons"

     368
DML overlaps with exons


In [27]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {DMLlist} \
-b {exonList} \
> 2019-09-15-DML-Exon.txt

In [28]:
!head 2019-09-15-DML-Exon.txt

NC_035780.1	571138	571140	58	NC_035780.1	570942	571194
NC_035780.1	2538924	2538926	-50	NC_035780.1	2538624	2538955
NC_035780.1	2586508	2586510	-53	NC_035780.1	2586438	2586557
NC_035780.1	2589720	2589722	57	NC_035780.1	2589716	2589955
NC_035780.1	4286286	4286288	67	NC_035780.1	4286174	4286407
NC_035780.1	4286802	4286804	-62	NC_035780.1	4286783	4286927
NC_035780.1	4289628	4289630	-52	NC_035780.1	4288592	4290756
NC_035780.1	8693287	8693289	-52	NC_035780.1	8692509	8693320
NC_035780.1	9110274	9110276	-63	NC_035780.1	9109982	9111843
NC_035780.1	12631453	12631455	60	NC_035780.1	12630576	12631487


### 2b. Introns

In [36]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {DMLlist} \
-b {intronList} \
| wc -l
!echo "DML overlaps with introns"

     192
DML overlaps with introns


In [37]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {DMLlist} \
-b {intronList} \
> 2019-09-15-DML-Intron.txt

In [38]:
!head 2019-09-15-DML-Intron.txt

NC_035780.1	401630	401632	53	NC_035780.1	401604	401800
NC_035780.1	1882691	1882693	64	NC_035780.1	1882355	1882971
NC_035780.1	1885022	1885024	61	NC_035780.1	1884754	1886042
NC_035780.1	1933499	1933501	51	NC_035780.1	1932876	1933573
NC_035780.1	2541726	2541728	-54	NC_035780.1	2538955	2541768
NC_035780.1	2584492	2584494	56	NC_035780.1	2584153	2584504
NC_035780.1	4288213	4288215	-58	NC_035780.1	4288128	4288230
NC_035780.1	8833124	8833126	60	NC_035780.1	8832171	8833699
NC_035780.1	17488958	17488960	-57	NC_035780.1	17488942	17489178
NC_035780.1	22177828	22177830	-51	NC_035780.1	22154686	22178240


### 2c. Genes

In [45]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {DMLlist} \
-b {geneList} \
| wc -l
!echo "DML overlaps with genes"

     560
DML overlaps with genes


In [55]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {DMLlist} \
-b {geneList} \
> 2019-09-15-DML-Genes.txt

In [56]:
!head 2019-09-15-DML-Genes.txt

NC_035780.1	401630	401632	53	NC_035780.1	394983	409280
NC_035780.1	571138	571140	58	NC_035780.1	544088	573497
NC_035780.1	1882691	1882693	64	NC_035780.1	1882143	1890106
NC_035780.1	1885022	1885024	61	NC_035780.1	1882143	1890106
NC_035780.1	1933499	1933501	51	NC_035780.1	1928718	1940217
NC_035780.1	2538924	2538926	-50	NC_035780.1	2524425	2553408
NC_035780.1	2541726	2541728	-54	NC_035780.1	2524425	2553408
NC_035780.1	2584492	2584494	56	NC_035780.1	2554181	2599559
NC_035780.1	2586508	2586510	-53	NC_035780.1	2554181	2599559
NC_035780.1	2589720	2589722	57	NC_035780.1	2554181	2599559


I know how many overlaps there are, but I also want to know how many unique genes have DMLs in them. For this, I will use the following code:

`cut -f7 2019-09-15-DML-Genes.txt | sort | uniq -c`

`cut` is the command that isolates the column information. Each gene has a unique end position, so I'll look at unique entries in the seventh column (`-f7`). The column is piped into `sort`, then that output is counted for unique lines by `uniq`. Finally, I'll pipe this into `wc -l` to count the number of unique genes.

In [57]:
! cut -f7 2019-09-15-DML-Genes.txt | sort | uniq -c | wc -l
!echo "unique genes overlapping with DML"

     481
unique genes overlapping with DML


### 2c. Promoters

In [68]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {DMLlist} \
-b {promoterList} \
| wc -l
!echo "DML overlaps with promoters"

      57
DML overlaps with transposable elements (all)


In [69]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {DMLlist} \
-b {promoterList} \
> 2019-09-15-DML-Promoters.txt

In [70]:
!head 2019-09-15-DML-Promoters.txt

NC_035780.1	8833124	8833126	60	NC_035780.1	RepeatMasker	similarity	8833042	8833288	18.2	-	.	Target "Motif:CVA" 1 272
NC_035780.1	22177828	22177830	-51	NC_035780.1	RepeatMasker	similarity	22177766	22177877	22.3	-	.	Target "Motif:DNA9-6_CGi" 1 115
NC_035780.1	57337100	57337102	-54	NC_035780.1	RepeatMasker	similarity	57337042	57337128	18.6	-	.	Target "Motif:DNA2-2_CGi" 413 498
NC_035780.1	58135767	58135769	74	NC_035780.1	RepeatMasker	similarity	58135699	58135837	22.4	+	.	Target "Motif:BivaMD-SINE1_CrVi" 169 314
NC_035781.1	22439769	22439771	53	NC_035781.1	RepeatMasker	similarity	22439740	22439796	28.1	+	.	Target "Motif:Mariner-6_AMi" 698 754
NC_035781.1	29178318	29178320	-55	NC_035781.1	RepeatMasker	similarity	29177336	29178341	16.0	-	.	Target "Motif:CVA" 2 863
NC_035781.1	54151548	54151550	54	NC_035781.1	RepeatMasker	similarity	54150482	54151750	14.3	+	.	Target "Motif:CVA" 1 1018
NC_035781.1	59742649	59742651	-65	NC_035781.1	RepeatMasker	similarity	59742603	59742651	 4.2	+	.	Targe

### 2d. Transposable Elements

In [68]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {DMLlist} \
-b {transposableElements} \
| wc -l
!echo "DML overlaps with transposable elements"

      57
DML overlaps with transposable elements (all)


In [69]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {DMLlist} \
-b {transposableElements} \
> 2019-09-15-DML-TE.txt

In [70]:
!head 2019-09-15-DML-TE.txt

NC_035780.1	8833124	8833126	60	NC_035780.1	RepeatMasker	similarity	8833042	8833288	18.2	-	.	Target "Motif:CVA" 1 272
NC_035780.1	22177828	22177830	-51	NC_035780.1	RepeatMasker	similarity	22177766	22177877	22.3	-	.	Target "Motif:DNA9-6_CGi" 1 115
NC_035780.1	57337100	57337102	-54	NC_035780.1	RepeatMasker	similarity	57337042	57337128	18.6	-	.	Target "Motif:DNA2-2_CGi" 413 498
NC_035780.1	58135767	58135769	74	NC_035780.1	RepeatMasker	similarity	58135699	58135837	22.4	+	.	Target "Motif:BivaMD-SINE1_CrVi" 169 314
NC_035781.1	22439769	22439771	53	NC_035781.1	RepeatMasker	similarity	22439740	22439796	28.1	+	.	Target "Motif:Mariner-6_AMi" 698 754
NC_035781.1	29178318	29178320	-55	NC_035781.1	RepeatMasker	similarity	29177336	29178341	16.0	-	.	Target "Motif:CVA" 2 863
NC_035781.1	54151548	54151550	54	NC_035781.1	RepeatMasker	similarity	54150482	54151750	14.3	+	.	Target "Motif:CVA" 1 1018
NC_035781.1	59742649	59742651	-65	NC_035781.1	RepeatMasker	similarity	59742603	59742651	 4.2	+	.	Targe

### 2e. No overlaps

I also want to count the number of DML that do not overlap with any features (i.e. unannotated intergenic regions). To do this, I'll use the `-v` argument in `bedtools`, which reports "those entries in A that have no overlap in B." I can specify multiple files with `-b`. I'll use exons, introns, transposable elements, and putative promoter regions.

In [147]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {DMLlist} \
-b {exonList} {intronList} {transposableElements} {promoterList} \
| wc -l
!echo "DML do not overlap with exons, introns, transposable elements, or promoters"

      15
DML do not overlap with exons, introns, transposable elements (all), or putative promoters


In [148]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {DMLlist} \
-b {exonList} {intronList} {transposableElements} {promoterList} \
> 2019-09-15-No-Overlap-DML.txt

In [149]:
!head 2019-09-15-No-Overlap-DML.txt

NC_035781.1	20620123	20620125	57
NC_035781.1	30062222	30062224	60
NC_035781.1	39583208	39583210	-50
NC_035781.1	50711254	50711256	-71
NC_035782.1	58675230	58675232	52
NC_035782.1	65377028	65377030	51
NC_035784.1	2011997	2011999	-60
NC_035784.1	45667412	45667414	56
NC_035784.1	53515949	53515951	50
NC_035784.1	81666532	81666534	-65


## 3. Identify Overlaps between Other Genome Feature Tracks

### 3a. CG motifs

To fully understand my results, I also need to know where CG motifs are located with respect to the other features.

#### Exons

In [85]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {exonList} \
-b {CGMotifList} \
| wc -l
!echo "Exon overlaps with CG motifs"

   50331
Exon overlaps with transposable elements (all)


In [18]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {exonList} \
-b {CGMotifList} \
> 2019-09-15-Exon-CGmotifs.txt

In [19]:
!head 2019-09-15-Exon-CGmotifs.txt

NC_035780.1	108305	110077	NC_035780.1	RepeatMasker	similarity	109968	109996	 0.0	+	.	Target "Motif:(CCT)n" 1 29	29
NC_035780.1	164820	164941	NC_035780.1	RepeatMasker	similarity	164886	164914	 7.3	+	.	Target "Motif:(GAG)n" 1 29	29
NC_035780.1	165620	166793	NC_035780.1	RepeatMasker	similarity	166075	166280	32.8	+	.	Target "Motif:Harbinger1_DR" 1472 1676	206
NC_035780.1	165620	166793	NC_035780.1	RepeatMasker	similarity	166501	166566	30.3	+	.	Target "Motif:Harbinger-6_DR" 1152 1217	66
NC_035780.1	165620	166793	NC_035780.1	RepeatMasker	similarity	166598	166642	17.8	+	.	Target "Motif:hATw-1_HM" 2778 2822	45
NC_035780.1	219451	220204	NC_035780.1	RepeatMasker	similarity	220122	220199	24.7	-	.	Target "Motif:Gypsy-75_CQ-I" 1012 1091	78
NC_035780.1	227734	228033	NC_035780.1	RepeatMasker	similarity	227768	227819	25.0	+	.	Target "Motif:A-rich" 1 54	52
NC_035780.1	227734	228033	NC_035780.1	RepeatMasker	similarity	227768	227819	25.0	+	.	Target "Motif:A-rich" 1 54	52
NC_035780.1	227734	228033	

#### Introns

In [86]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {intronList} \
-b {CGMotifList} \
| wc -l
!echo "Intron overlaps with CG motifs"

  115151
Intron overlaps with transposable elements (all)


In [20]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {intronList} \
-b {CGMotifList} \
> 2019-09-15-Intron-CGmotifs.txt

In [21]:
!head 2019-09-15-Intron-CGmotifs.txt

NC_035780.1	32565	32958	NC_035780.1	RepeatMasker	similarity	32720	32819	18.2	+	.	Target "Motif:Crypton-9N1_CGi" 239 337	100
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	48463	48520	 8.8	+	.	Target "Motif:BivaMD-SINE1_CrVi" 280 337	58
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	48666	49000	10.9	-	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337	335
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	50251	50279	 0.0	+	.	Target "Motif:(GGTTAG)n" 1 29	29
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	50606	50760	21.3	+	.	Target "Motif:Harbinger-2N1_CGi" 1 166	155
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	50977	51034	 0.0	+	.	Target "Motif:(TA)n" 1 58	58
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	51456	51498	 0.0	+	.	Target "Motif:(AG)n" 1 43	43
NC_035780.1	46506	64122	NC_035780.1	RepeatMasker	similarity	51721	51922	21.8	+	.	Target "Motif:Harbinger-2N1_CGi" 2568 2776	202
NC_035780.1	46506	64122	NC_035780

#### Genes

In [92]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {geneList} \
-b {CGMotifList} \
| wc -l
!echo "gene overlaps with CG motifs"

   33739
gene overlaps with transposable elements (all)


In [16]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {geneList} \
-b {CGMotifList} \
> 2019-09-15-Genes-CGmotifs.txt

In [17]:
!head 2019-09-15-Genes-CGmotifs.txt

NC_035780.1	28961	33324	NC_035780.1	RepeatMasker	similarity	32720	32819	18.2	+	.	Target "Motif:Crypton-9N1_CGi" 239 337	100
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	48463	48520	 8.8	+	.	Target "Motif:BivaMD-SINE1_CrVi" 280 337	58
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	48666	49000	10.9	-	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337	335
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	50251	50279	 0.0	+	.	Target "Motif:(GGTTAG)n" 1 29	29
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	50606	50760	21.3	+	.	Target "Motif:Harbinger-2N1_CGi" 1 166	155
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	50977	51034	 0.0	+	.	Target "Motif:(TA)n" 1 58	58
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	51456	51498	 0.0	+	.	Target "Motif:(AG)n" 1 43	43
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	51721	51922	21.8	+	.	Target "Motif:Harbinger-2N1_CGi" 2568 2776	202
NC_035780.1	43111	66897	NC_035780

#### Promoters

In [92]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {geneList} \
-b {promoterList} \
| wc -l
!echo "promoter overlaps with CG motifs"

   33739
gene overlaps with transposable elements (all)


In [16]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {promoterList} \
-b {CGMotifList} \
> 2019-09-15-Promoter-CGmotifs.txt

In [17]:
!head 2019-09-15-Promoter-CGmotifs.txt

NC_035780.1	28961	33324	NC_035780.1	RepeatMasker	similarity	32720	32819	18.2	+	.	Target "Motif:Crypton-9N1_CGi" 239 337	100
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	48463	48520	 8.8	+	.	Target "Motif:BivaMD-SINE1_CrVi" 280 337	58
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	48666	49000	10.9	-	.	Target "Motif:BivaMD-SINE1_CrVi" 1 337	335
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	50251	50279	 0.0	+	.	Target "Motif:(GGTTAG)n" 1 29	29
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	50606	50760	21.3	+	.	Target "Motif:Harbinger-2N1_CGi" 1 166	155
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	50977	51034	 0.0	+	.	Target "Motif:(TA)n" 1 58	58
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	51456	51498	 0.0	+	.	Target "Motif:(AG)n" 1 43	43
NC_035780.1	43111	66897	NC_035780.1	RepeatMasker	similarity	51721	51922	21.8	+	.	Target "Motif:Harbinger-2N1_CGi" 2568 2776	202
NC_035780.1	43111	66897	NC_035780

#### Transposable Elements

In [48]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {transposableElements} \
-b {CGMotifList} \
| wc -l
!echo "Transposable element overlap with CG motifs"

 2828372
CG motif overlaps with transposable elements (all)


In [22]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {transposableElements} \
-b {CGMotifList} \
> 2019-09-15-TE-CGmotifs.txt

In [23]:
!head 2019-09-15-TE-CGmotifs.txt

NC_035780.1	5078	5080	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	1
NC_035780.1	5159	5161	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	2
NC_035780.1	5162	5164	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	2
NC_035780.1	5174	5176	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	2
NC_035780.1	5191	5193	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	2
NC_035780.1	5220	5222	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	2
NC_035780.1	5317	5319	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631	2
NC_035780.1	5357	5359	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Mot

#### No overlaps

In [157]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {CGMotifList} \
-b {exonList} {intronList} {transposableElements} {promoterList} \
| wc -l
!echo "CG motifs do not overlap with exons, introns, transposable elements, or putative promoters"

 4528757
CG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters


In [158]:
! {bedtoolsDirectory}intersectBed \
-v \
-a {CGMotifList} \
-b {exonList} {intronList} {transposableElementsAll} {promoterList} \
> 2019-09-15-No-Overlap-CGmotifs.txt

In [159]:
!head 2019-09-15-No-Overlap-CGmotifs.txt

NC_035780.1	28	30	CG_motif
NC_035780.1	54	56	CG_motif
NC_035780.1	75	77	CG_motif
NC_035780.1	93	95	CG_motif
NC_035780.1	103	105	CG_motif
NC_035780.1	116	118	CG_motif
NC_035780.1	134	136	CG_motif
NC_035780.1	159	161	CG_motif
NC_035780.1	209	211	CG_motif
NC_035780.1	224	226	CG_motif


### 3b. Transposable Elements

It's also good to know where transposable elements overlap with the other feature tracks.

#### Exons

In [24]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {exonList} \
-b {transposableElements} \
| wc -l
!echo "Exon overlaps with transposable elements"

   41511
Exon overlaps with transposable elements (Cg)


In [27]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {exonList} \
-b {transposableElements} \
> 2019-09-15-Exon-TE.txt

In [28]:
!head 2019-09-15-Exon-TE.txt

NC_035780.1	109967	109996	NC_035780.1	RepeatMasker	similarity	109968	109996	 0.0	+	.	Target "Motif:(CCT)n" 1 29
NC_035780.1	164885	164914	NC_035780.1	RepeatMasker	similarity	164886	164914	 7.3	+	.	Target "Motif:(GAG)n" 1 29
NC_035780.1	227767	227819	NC_035780.1	RepeatMasker	similarity	227768	227819	25.0	+	.	Target "Motif:A-rich" 1 54
NC_035780.1	227767	227819	NC_035780.1	RepeatMasker	similarity	227768	227819	25.0	+	.	Target "Motif:A-rich" 1 54
NC_035780.1	227767	227819	NC_035780.1	RepeatMasker	similarity	227768	227819	25.0	+	.	Target "Motif:A-rich" 1 54
NC_035780.1	233475	233478	NC_035780.1	RepeatMasker	similarity	233445	233478	10.1	+	.	Target "Motif:(CCTTT)n" 1 35
NC_035780.1	232863	233028	NC_035780.1	RepeatMasker	similarity	232798	233028	29.7	-	.	Target "Motif:ISL2EU-N8_CGi" 15 237
NC_035780.1	269562	269603	NC_035780.1	RepeatMasker	similarity	269563	269603	17.1	+	.	Target "Motif:(ATG)n" 1 42
NC_035780.1	258539	258574	NC_035780.1	RepeatMasker	similarity	258540	258574	16.3	+	.	

#### Introns

In [100]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {intronList} \
-b {transposableElements} \
| wc -l
!echo "Intron overlaps with transposable elements"

  107542
Intron overlaps with transposable elements (Cg)


In [101]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {intronList} \
-b {transposableElements} \
> 2019-09-15-Intron-TE.txt

In [102]:
!head 2019-09-15-Intron-TE.txt

NC_035780.1	32719	32819	NC_035780.1	RepeatMasker	similarity	32720	32819	18.2	+	.	Target "Motif:Crypton-9N1_CGi" 239 337
NC_035780.1	46753	46805	NC_035780.1	RepeatMasker	similarity	46754	46805	 6.8	+	.	Target "Motif:DNA-22_CGi" 631 722
NC_035780.1	50250	50279	NC_035780.1	RepeatMasker	similarity	50251	50279	 0.0	+	.	Target "Motif:(GGTTAG)n" 1 29
NC_035780.1	50605	50760	NC_035780.1	RepeatMasker	similarity	50606	50760	21.3	+	.	Target "Motif:Harbinger-2N1_CGi" 1 166
NC_035780.1	50976	51034	NC_035780.1	RepeatMasker	similarity	50977	51034	 0.0	+	.	Target "Motif:(TA)n" 1 58
NC_035780.1	51455	51498	NC_035780.1	RepeatMasker	similarity	51456	51498	 0.0	+	.	Target "Motif:(AG)n" 1 43
NC_035780.1	51720	51922	NC_035780.1	RepeatMasker	similarity	51721	51922	21.8	+	.	Target "Motif:Harbinger-2N1_CGi" 2568 2776
NC_035780.1	86839	86942	NC_035780.1	RepeatMasker	similarity	86840	86942	27.4	-	.	Target "Motif:Helitron-N14_CGi" 83 189
NC_035780.1	87408	87513	NC_035780.1	RepeatMasker	similarity	87409	87

#### Genes

In [97]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneList} \
-b {transposableElements} \
| wc -l
!echo "gene overlaps with transposable elements"

   32705
gene overlaps with transposable elements (Cg)


In [98]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {geneList} \
-b {transposableElements} \
> 2019-09-15-Gene-TE.txt

In [99]:
!head 2019-09-15-Gene-TE.txt

NC_035780.1	32719	32819	NC_035780.1	RepeatMasker	similarity	32720	32819	18.2	+	.	Target "Motif:Crypton-9N1_CGi" 239 337
NC_035780.1	46753	46805	NC_035780.1	RepeatMasker	similarity	46754	46805	 6.8	+	.	Target "Motif:DNA-22_CGi" 631 722
NC_035780.1	50250	50279	NC_035780.1	RepeatMasker	similarity	50251	50279	 0.0	+	.	Target "Motif:(GGTTAG)n" 1 29
NC_035780.1	50605	50760	NC_035780.1	RepeatMasker	similarity	50606	50760	21.3	+	.	Target "Motif:Harbinger-2N1_CGi" 1 166
NC_035780.1	50976	51034	NC_035780.1	RepeatMasker	similarity	50977	51034	 0.0	+	.	Target "Motif:(TA)n" 1 58
NC_035780.1	51455	51498	NC_035780.1	RepeatMasker	similarity	51456	51498	 0.0	+	.	Target "Motif:(AG)n" 1 43
NC_035780.1	51720	51922	NC_035780.1	RepeatMasker	similarity	51721	51922	21.8	+	.	Target "Motif:Harbinger-2N1_CGi" 2568 2776
NC_035780.1	86839	86942	NC_035780.1	RepeatMasker	similarity	86840	86942	27.4	-	.	Target "Motif:Helitron-N14_CGi" 83 189
NC_035780.1	87408	87513	NC_035780.1	RepeatMasker	similarity	87409	87

#### Promoters

In [84]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {promoterList} \
-b {transposableElements} \
| wc -l
!echo "promoter overlaps with transposable elements"

 2142774
CG motif overlaps with transposable elements (Cg)


In [52]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {promoterList} \
-b {transposableElements} \
> 2019-09-15-Promoter-TE.txt

In [53]:
!head 2019-09-15-Promoter-TE.txt

NC_035780.1	5079	5080	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5159	5161	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5162	5164	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5174	5176	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5191	5193	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5220	5222	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5317	5319	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	5357	5359	CG_motif	NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CG