# DML and DMR Analysis

In this notebook, I will examine the location of differentially methylated loci (DMLs) and regions (DMRs) in the *C. virginica* genome. The DMLs were identified using methylKit in [this R script](https://github.com/RobertsLab/project-virginica-oa/blob/master/analyses/2018-05-29-MethylKit-Full-Samples/2018-05-29-MethylKit-Analysis-Full-Samples.R). DMLs were then written out as a [bedfile](https://github.com/RobertsLab/project-virginica-oa/blob/master/analyses/2018-05-29-MethylKit-Full-Samples/2018-05-30-DML-Locations.bed). Using this file, I will follow an analysis adapted from [Steven's notebook](https://github.com/sr320/nb-2018/blob/master/C_virginica/21-Bedtools.ipynb).

1. Identify Overlaps with DMLs and Genomic Feature Tracks
2. Calculate Overlap Proportions
3. Gene Flanking
4. Enrichment Analysis

## 0. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-virginica-oa/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-virginica-oa/analyses


In [3]:
pwd

'/Users/yaamini/Documents/project-virginica-oa/analyses'

In [4]:
!mkdir 2018-10-22-DML-Analysis

In [6]:
ls -F

2018-01-23-MBDSeq-Labwork/
2018-04-26-Gonad-Methylation-FastQC/
2018-04-27-Bismark/
2018-05-01-MethylKit/
2018-05-29-MethylKit-Full-Samples/
2018-06-11-DML-Analysis/
2018-06-14-Gene-Enrichment-Analysis/
2018-10-04-Bismark-Full-Samples-Revised-Parameters/
2018-10-11-MethylKit-Parameter-Testing/
2018-10-22-DML-Analysis/
README.md


In [7]:
cd 2018-10-22-DML-Analysis/

/Users/yaamini/Documents/project-virginica-oa/analyses/2018-10-22-DML-Analysis


## 1. Identify Overlaps with DMLs and Genomic Feature Tracks

To identify the location of DMLs in the *C. virginica* genome, I will use `intersect` from `bedtools`. [The BEDtools suite](http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html) allows me to easily find overlapping regions of different bed files.

### 1a. Locate BEDfiles for Analysis

The BEDfile with DMLs can be viewed below. Columns are the chromosome, start position, end position, strand, and methylation difference between treatments (in percent, with the sign giving the direction). This file only has DMLs that were at least 50% different between the two treatments (control and elevated pCO2).

In [8]:
!head ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed

NC_035780.1	346071	346073	-	50
NC_035780.1	990995	990997	-	-51
NC_035780.1	1882691	1882693	-	52
NC_035780.1	1885022	1885024	-	61
NC_035780.1	1933499	1933501	-	53
NC_035780.1	1945182	1945184	+	55
NC_035780.1	1958998	1959000	-	53
NC_035780.1	1983256	1983258	-	-69
NC_035780.1	2538924	2538926	-	-50
NC_035780.1	2541652	2541654	-	-55
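
As a quick sanity check on the column layout described above, here is a minimal Python sketch that parses the ten rows shown by `head` and tallies the sign of the last column. The function name `parse_dml_line` and the labels `positive`/`negative` are my own choices for illustration; which treatment a positive value favors depends on how the comparison was set up in methylKit, so I am not assuming a direction here.

```python
from collections import Counter

# First ten rows of the DML BED file, copied from the `head` output above.
sample = (
    "NC_035780.1\t346071\t346073\t-\t50\n"
    "NC_035780.1\t990995\t990997\t-\t-51\n"
    "NC_035780.1\t1882691\t1882693\t-\t52\n"
    "NC_035780.1\t1885022\t1885024\t-\t61\n"
    "NC_035780.1\t1933499\t1933501\t-\t53\n"
    "NC_035780.1\t1945182\t1945184\t+\t55\n"
    "NC_035780.1\t1958998\t1959000\t-\t53\n"
    "NC_035780.1\t1983256\t1983258\t-\t-69\n"
    "NC_035780.1\t2538924\t2538926\t-\t-50\n"
    "NC_035780.1\t2541652\t2541654\t-\t-55\n"
)


def parse_dml_line(line):
    """Split one tab-delimited DML row into named, typed fields."""
    chrom, start, end, strand, diff = line.rstrip("\n").split("\t")
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "diff": float(diff),  # float in case the full file holds non-integer values
    }


records = [parse_dml_line(line) for line in sample.splitlines()]

# Tally DMLs by the sign of the methylation difference.
direction = Counter("positive" if r["diff"] > 0 else "negative" for r in records)
print(direction)
```

The same function could be applied line by line to the full file to check that every row has five columns and an absolute difference of at least 50.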


In [9]:
!wc -l ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed

    1398 ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed


I will be using the following Genome Feature Tracks:

1. Exon: Regions retained in the mature transcript, including coding sequence
2. Intron: Regions that are removed during splicing
3. mRNA: Transcripts that code for proteins. The mRNA track includes both introns and exons.
4. CG motifs: Regions with CGs where methylation can occur

The links to these feature tracks can be found on the [Roberts Lab Genomic Resources wiki page](https://github.com/RobertsLab/resources/wiki/Genomic-Resources).

In [10]:
!head ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed

NC_035780.1	13578	13603
NC_035780.1	14237	14290
NC_035780.1	14557	14594
NC_035780.1	28961	29073
NC_035780.1	30524	31557
NC_035780.1	31736	31887
NC_035780.1	31977	32565
NC_035780.1	32959	33324
NC_035780.1	66869	66897
NC_035780.1	64123	64334


In [11]:
!wc -l ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed

  731279 ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed


In [12]:
!head ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed

NC_035780.1	28961	28961
NC_035780.1	29074	30524
NC_035780.1	31558	31736
NC_035780.1	31888	31977
NC_035780.1	32566	32959
NC_035780.1	43110	43112
NC_035780.1	44359	45913
NC_035780.1	46507	64123
NC_035780.1	64335	66869
NC_035780.1	85606	85606


In [13]:
!wc -l ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed

  319262 ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed


In [14]:
!head ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3

NC_035780.1	Gnomon	mRNA	28961	33324	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	mRNA	43111	66897	.	-	.	ID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1	Gnomon	mRNA	43111	46506	.	-	.	ID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporti

In [15]:
!wc -l ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3

   60201 ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3


In [16]:
!head ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed

NC_035780.1	28	30	CG_motif
NC_035780.1	54	56	CG_motif
NC_035780.1	75	77	CG_motif
NC_035780.1	93	95	CG_motif
NC_035780.1	103	105	CG_motif
NC_035780.1	116	118	CG_motif
NC_035780.1	134	136	CG_motif
NC_035780.1	159	161	CG_motif
NC_035780.1	209	211	CG_motif
NC_035780.1	224	226	CG_motif


In [17]:
!wc -l ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed

 14458703 ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed


### 1b. Use `intersect`

In [30]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed -h


Tool:    bedtools intersect (aka intersectBed)
Version: v2.26.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

	Note: -b may be followed with multiple databases and/or 
	wildcard (*) character(s). 
Options: 
	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f 

#### Exons

In [Steven's notebook](https://github.com/che625/olson-ms-nb/blob/master/.ipynb_checkpoints/BiGo_dev-checkpoint.ipynb), I noticed he said that there are some exon regions that do not code for any mRNA! The exon and mRNA regions need to be merged. I'm not sure how to do that, so I'll do what I can and then post an issue.

In [18]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-u \
-a ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed \
| wc -l
!echo "overlaps with exons"

     786
overlaps with exons


In [19]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed \
> 2018-10-22-DML-Exon.txt

In [20]:
!head 2018-10-22-DML-Exon.txt

NC_035780.1	346071	346073	-	50	NC_035780.1	345983	346125
NC_035780.1	990995	990997	-	-51	NC_035780.1	990854	991062
NC_035780.1	1958998	1959000	-	53	NC_035780.1	1958375	1959139
NC_035780.1	1958998	1959000	-	53	NC_035780.1	1958375	1959139
NC_035780.1	1958998	1959000	-	53	NC_035780.1	1958375	1959139
NC_035780.1	1958998	1959000	-	53	NC_035780.1	1958375	1959139
NC_035780.1	1958998	1959000	-	53	NC_035780.1	1958375	1959139
NC_035780.1	1958998	1959000	-	53	NC_035780.1	1958375	1959139
NC_035780.1	1983256	1983258	-	-69	NC_035780.1	1983248	1983390
NC_035780.1	2538924	2538926	-	-50	NC_035780.1	2538624	2538955
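
The DML at 1958998 appears six times in the `-wb` output above, even though `-u` counts it once. The toy sketch below mimics that difference in plain Python. It is not how bedtools is implemented, and the functions `intersect_u` and `intersect_wb` are hypothetical stand-ins: the first keeps each A feature once if it overlaps anything in B, and the second emits one row per overlapping A/B pair. The six identical exon rows are copied from the output above; I am assuming they come from the exon track listing the same interval more than once.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base.

    Coordinates are treated as BED-style half-open intervals.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect_u(a_feats, b_feats):
    """Like `-u`: report each A feature once if it overlaps any B feature."""
    return [a for a in a_feats if any(overlaps(a, b) for b in b_feats)]


def intersect_wb(a_feats, b_feats):
    """Like `-wb` (simplified): report one row per overlapping A/B pair."""
    return [(a, b) for a in a_feats for b in b_feats if overlaps(a, b)]


# Two DMLs from the BED file: one inside an exon, one that falls in an intron.
dmls = [
    ("NC_035780.1", 1958998, 1959000),
    ("NC_035780.1", 1882691, 1882693),
]

# The exon interval that appears six times in the -wb output, plus one other exon.
exons = [("NC_035780.1", 1958375, 1959139)] * 6 + [
    ("NC_035780.1", 1983248, 1983390)
]

print(len(intersect_u(dmls, exons)))   # number of DMLs with any exon overlap
print(len(intersect_wb(dmls, exons)))  # number of DML/exon overlap rows
```

This is why I use the `-u` counts, not the line counts of the `-wb` files, when counting DMLs per feature.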


#### Introns

In [21]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-u \
-a ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed \
| wc -l
!echo "overlaps with introns"

     498
overlaps with introns


In [24]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed \
> 2018-10-22-DML-Intron.txt

In [26]:
!head 2018-10-22-DML-Intron.txt

NC_035780.1	1882691	1882693	-	52	NC_035780.1	1882356	1882972
NC_035780.1	1885022	1885024	-	61	NC_035780.1	1884755	1886043
NC_035780.1	1933499	1933501	-	53	NC_035780.1	1932877	1933574
NC_035780.1	1945182	1945184	+	55	NC_035780.1	1945169	1946107
NC_035780.1	2541652	2541654	-	-55	NC_035780.1	2538956	2541769
NC_035780.1	2541726	2541728	+	-50	NC_035780.1	2538956	2541769
NC_035780.1	2541726	2541728	-	-58	NC_035780.1	2538956	2541769
NC_035780.1	2584492	2584494	+	56	NC_035780.1	2584154	2584505
NC_035780.1	2729868	2729870	-	-50	NC_035780.1	2716215	2733757
NC_035780.1	4288213	4288215	+	-56	NC_035780.1	4288129	4288231


#### mRNA

In [27]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-u \
-a ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3 \
| wc -l
!echo "overlaps with mRNA"

    1263
overlaps with mRNA


In [28]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ../2018-10-11-MethylKit-Parameter-Testing/2018-10-22-DML-Locations.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3 \
> 2018-10-22-DML-mRNA.txt

In [29]:
!head 2018-10-22-DML-mRNA.txt

NC_035780.1	346071	346073	-	50	NC_035780.1	Gnomon	mRNA	341638	349379	.	-	.	ID=rna30;Parent=gene22;Dbxref=GeneID:111113503,Genbank:XM_022451800.1;Name=XM_022451800.1;gbkey=mRNA;gene=LOC111113503;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=F-box only protein 47-like;transcript_id=XM_022451800.1
NC_035780.1	990995	990997	-	-51	NC_035780.1	Gnomon	mRNA	984471	995318	.	-	.	ID=rna117;Parent=gene66;Dbxref=GeneID:111137104,Genbank:XM_022488366.1;Name=XM_022488366.1;gbkey=mRNA;gene=LOC111137104;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 3 Proteins%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments;product=SWI/SNF complex subunit SMARCC2-like;transcript_id=XM_022488366.1
NC_035780.1	1882691	1882693	-	52	NC_035780.1	Gnomon	mRNA	1882143	1890106	.	-	.	ID=rna155;Parent=gene95;Dbx

I know how many overlaps there are, but I also want to know how many unique genes have DMLs in them. For this, I will use the following code:

`cut -f14 2018-10-22-DML-mRNA.txt | sort | uniq -c`

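`cut -f14` isolates column 14, which is the GFF attribute column of the overlapping mRNA record (five DML columns followed by nine GFF columns). The column is piped into `sort` so that identical lines end up next to each other, and `uniq -c` then collapses them into unique lines, each prefixed with the number of times it occurred. I will save the output from this command as a new file.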

In [30]:
! cut -f14 2018-10-22-DML-mRNA.txt | sort | uniq -c > 2018-10-22-Unique-Genes-in-DML-mRNA-Overlap.txt

In [31]:
!head 2018-10-22-Unique-Genes-in-DML-mRNA-Overlap.txt

   1 ID=rna10015;Parent=gene5875;Dbxref=GeneID:111118923,Genbank:XM_022458638.1;Name=XM_022458638.1;gbkey=mRNA;gene=LOC111118923;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=copper transporter 2-like;transcript_id=XM_022458638.1
   1 ID=rna10016;Parent=gene5876;Dbxref=GeneID:111118921,Genbank:XM_022458637.1;Name=XM_022458637.1;gbkey=mRNA;gene=LOC111118921;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=uncharacterized LOC111118921;transcript_id=XM_022458637.1
   1 ID=rna10055;Parent=gene5904;Dbxref=GeneID:111121117,Genbank:XM_022462246.1;Name=XM_022462246.1;gbkey=mRNA;gene=LOC111121117;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%

In [32]:
!wc -l 2018-10-22-Unique-Genes-in-DML-mRNA-Overlap.txt

    2683 2018-10-22-Unique-Genes-in-DML-mRNA-Overlap.txt


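The DMLs overlap with 2683 unique mRNA records. Each record is a transcript, and one gene can have several transcript variants (for example, `rna2` and `rna3` above both belong to `gene2`), so the number of unique genes may be lower than 2683.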
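
Each line in the file above is one mRNA attribute string, so transcript variants of the same gene are counted separately. One way to collapse them, sketched here under the assumption that every attribute string carries a `gene=` field as in the Gnomon records shown earlier, is to pull out that field and count distinct values. The helper `gene_ids` and the shortened attribute strings are for illustration only.

```python
import re


def gene_ids(attribute_lines):
    """Collect the distinct values of the `gene=` field from GFF attribute strings."""
    genes = set()
    for line in attribute_lines:
        # Anchor on ';' (or line start) so only the `gene=` key itself matches.
        match = re.search(r"(?:^|;)gene=([^;]+)", line)
        if match:
            genes.add(match.group(1))
    return genes


# Shortened attribute strings based on the first three mRNA records in the GFF.
attributes = [
    "ID=rna1;Parent=gene1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein",
    "ID=rna2;Parent=gene2;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like",
    "ID=rna3;Parent=gene2;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like",
]

unique_transcripts = set(attributes)
unique_genes = gene_ids(attributes)
print(len(unique_transcripts), "transcripts from", len(unique_genes), "genes")
```

Because the regular expression looks for `;gene=`, it should also work on the `uniq -c` output, where each line starts with a count.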

It's also useful to understand where the CG regions are in relation to exons, introns, and mRNA!

In [34]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-u \
-a ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed \
| wc -l
!echo "overlaps with exons"

  636270
overlaps with exons


Proportion exon overlap with CG motifs:

In [16]:
636270/14458703

0.04400602184027157

In [35]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_exon.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed \
> 2018-10-22-Exon-CGmotif.txt

In [36]:
!head 2018-10-22-Exon-CGmotif.txt

NC_035780.1	13597	13599	NC_035780.1	13597	13599	CG_motif
NC_035780.1	28992	28994	NC_035780.1	28992	28994	CG_motif
NC_035780.1	29001	29003	NC_035780.1	29001	29003	CG_motif
NC_035780.1	29028	29030	NC_035780.1	29028	29030	CG_motif
NC_035780.1	30539	30541	NC_035780.1	30539	30541	CG_motif
NC_035780.1	30574	30576	NC_035780.1	30574	30576	CG_motif
NC_035780.1	30602	30604	NC_035780.1	30602	30604	CG_motif
NC_035780.1	30676	30678	NC_035780.1	30676	30678	CG_motif
NC_035780.1	30695	30697	NC_035780.1	30695	30697	CG_motif
NC_035780.1	30723	30725	NC_035780.1	30723	30725	CG_motif


In [39]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-u \
-a ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed \
| wc -l
!echo "overlaps with introns"

  245500
overlaps with introns


Proportion intron overlap with CG motifs:

In [17]:
245500/14458703

0.016979392964915317

In [40]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ../2018-06-11-DML-Analysis/C_virginica-3.0_intron.bed \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed \
> 2018-10-22-Intron-CGmotif.txt

In [41]:
!head 2018-10-22-Intron-CGmotif.txt

NC_035780.1	29180	29182	NC_035780.1	29180	29182	CG_motif
NC_035780.1	29203	29205	NC_035780.1	29203	29205	CG_motif
NC_035780.1	29221	29223	NC_035780.1	29221	29223	CG_motif
NC_035780.1	29295	29297	NC_035780.1	29295	29297	CG_motif
NC_035780.1	29323	29325	NC_035780.1	29323	29325	CG_motif
NC_035780.1	29326	29328	NC_035780.1	29326	29328	CG_motif
NC_035780.1	29412	29414	NC_035780.1	29412	29414	CG_motif
NC_035780.1	29452	29454	NC_035780.1	29452	29454	CG_motif
NC_035780.1	29672	29674	NC_035780.1	29672	29674	CG_motif
NC_035780.1	29758	29760	NC_035780.1	29758	29760	CG_motif


In [42]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-u \
-a ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3 \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed \
| wc -l
!echo "overlaps with mRNA"

   60195
overlaps with mRNA


Proportion mRNA overlap with CG motifs:

In [18]:
60195/14458703

0.004163236495002352

In [43]:
! /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ../2018-06-11-DML-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3 \
-b ../2018-06-11-DML-Analysis/C_virginica-3.0_CG-motif.bed \
> 2018-10-22-mRNA-CGmotif.txt

In [44]:
!head 2018-10-22-mRNA-CGmotif.txt

NC_035780.1	Gnomon	mRNA	28993	28994	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1	NC_035780.1	28992	28994	CG_motif
NC_035780.1	Gnomon	mRNA	29002	29003	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1	NC_035780.1	29001	29003	CG_motif
NC_035780.1	Gnomon	mRNA	29029	29030	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:11112

## 2. Calculate Overlap Proportions

It's important to understand how many overlaps are present between various feature tracks and CG motifs. CG motifs are where we expect methylation to happen. If there are more overlaps present between a certain feature and the CG motifs, we would expect to see most of our DMLs in that region. I also want to understand overlap proportions with DMLs.

Here are the questions I will answer:

1. Out of the total number of CG motifs, how many overlapped with a feature track?
2. What proportion of total overlaps does a certain feature track represent?
3. Out of the total number of DMLs, how many overlapped with a feature track?

### 2a. CG motif Overlaps with Feature Tracks

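I already calculated the numbers associated with the first question in Section 1b. Those `intersect` runs used the feature track as `-a` with `-u`, so each count is the number of features that contain at least one CG motif, and each proportion divides that count by the total number of CG motifs (14458703). Here are those numbers again: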

- Proportion exon overlap with CG motifs: 4.40% (0.04400602184027157)
- Proportion intron overlap with CG motifs: 1.70% (0.016979392964915317)
- Proportion mRNA overlap with CG motifs: 0.42% (0.004163236495002352)
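
Since the `-u` counts above are counts of features (exons, introns, mRNAs) and not of CG motifs, a complementary view is the fraction of each feature track that contains at least one CG motif. The sketch below only reuses counts already reported in this notebook (`wc -l` for each track and the `intersectBed -u` results). It assumes each line in a track file is one feature, and it is an alternative denominator, not a replacement for the numbers above.

```python
# Line counts of each feature track, from the `wc -l` outputs in Section 1a.
track_totals = {"exon": 731279, "intron": 319262, "mRNA": 60201}

# Features overlapping at least one CG motif, from the `intersectBed -u` runs.
with_cg = {"exon": 636270, "intron": 245500, "mRNA": 60195}

# Fraction of each track's features that contain at least one CG motif.
fraction_with_cg = {
    track: with_cg[track] / track_totals[track] for track in track_totals
}

for track, fraction in fraction_with_cg.items():
    print(f"{track}: {fraction:.2%} of features contain at least one CG motif")
```

To count CG motifs inside each feature instead, the CG motif file would need to be passed as `-a` and the feature track as `-b`.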

### 2b. Proportion Total Overlaps by Feature Track

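First, I need to calculate the total number of overlaps we had. Keep in mind that the mRNA track includes both exons and introns, so the three categories are not mutually exclusive: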

(exon overlap with CG motifs) + (intron overlap with CG motifs) + (mRNA overlap with CG motifs)

In [13]:
636270 + 245500 + 60195

941965

Now, I calculate the proportions:

Exons:

In [15]:
636270/941965

0.6754709569888477

Introns:

In [16]:
245500/941965

0.2606253947864305

mRNA:

In [17]:
60195/941965

0.06390364822472172

- Proportion exon overlap with total overlaps: 67.55% (0.6754709569888477)
- Proportion intron overlap with total overlaps: 26.06% (0.2606253947864305)
- Proportion mRNA overlap with total overlaps: 6.39% (0.06390364822472172)

### 2c. DML Overlaps with Feature Tracks

Exons:

In [47]:
786/1398

0.5622317596566524

Introns:

In [46]:
498/1398

0.3562231759656652

mRNA:

In [45]:
1263/1398

0.9034334763948498

- Proportion exon overlap with DMLs: 56.22% (0.5622317596566524)
- Proportion intron overlap with DMLs: 35.62% (0.3562231759656652)
- Proportion mRNA overlap with DMLs: 90.34% (0.9034334763948498)
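
The three divisions above can be reproduced in one place with a short sketch. It uses only the counts reported in this notebook; the variable names are my own. Because `-u` reports each DML at most once per feature track, the difference between the total and the mRNA count also gives the number of DMLs that fall outside every annotated mRNA.

```python
# Total DMLs, from `wc -l` on the DML BED file.
total_dmls = 1398

# DMLs overlapping each feature track, from the `intersectBed -u` runs.
dml_overlaps = {"exon": 786, "intron": 498, "mRNA": 1263}

# Percentage of all DMLs that overlap each track, rounded to two decimals.
dml_percent = {
    track: round(100 * count / total_dmls, 2)
    for track, count in dml_overlaps.items()
}

# DMLs that do not overlap any annotated mRNA.
not_in_mrna = total_dmls - dml_overlaps["mRNA"]

for track, percent in dml_percent.items():
    print(f"{track}: {percent}% of DMLs")
print(f"DMLs outside annotated mRNA: {not_in_mrna}")
```

The percentages do not sum to 100 because the tracks overlap: the mRNA track spans both exons and introns.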

# STILL DEVELOPING THE CONTENT BELOW NO GUARANTEES

## 3. Gene Flanking

### 3a. `closest`

When I talked to Mac at PCSGA 2018, she suggested using BEDtools [`closest`](https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) instead of `flank`. By default, `closest` finds the nearest genomic feature, which may be an overlapping one, so I need an extra option to restrict it to non-overlapping features. I can modify the code as follows:

1. Path to `closestBed`
2. -io: Ignore features in b that overlap with a
3. -a: Path to mRNA gff
4. -b: Specify either DML or CG motif file
5. ">" filename: Redirect output to a .txt file

In [6]:
! /Users/Shared/bioinformatics/bedtools2/bin/closestBed \
-io \
-a C_virginica-3.0_Gnomon_mRNA.gff3 \
-b ../2018-05-29-MethylKit-Full-Samples/2018-05-30-DML-Locations.bed \
> 2018-09-26-mRNA-Closest-NoOverlap-DMLs.txt

Error: Sorted input specified, but the file C_virginica-3.0_Gnomon_mRNA.gff3 has the following out of order record
NC_035780.1	Gnomon	mRNA	2413594	2416601	.	-	.	ID=rna199;Parent=gene122;Dbxref=GeneID:111129373,Genbank:XM_022475729.1;Name=XM_022475729.1;gbkey=mRNA;gene=LOC111129373;model_evidence=Supporting evidence includes similarity to: 2 Proteins;product=mucin-2-like;transcript_id=XM_022475729.1


In [7]:
! /Users/Shared/bioinformatics/bedtools2/bin/closestBed \
-io \
-a C_virginica-3.0_Gnomon_mRNA.gff3 \
-b C_virginica-3.0_CG-motif.bed \
> 2018-09-26-mRNA-Closest-NoOverlap-CGmotifs.txt

Error: Sorted input specified, but the file C_virginica-3.0_Gnomon_mRNA.gff3 has the following out of order record
NC_035780.1	Gnomon	mRNA	2413594	2416601	.	-	.	ID=rna199;Parent=gene122;Dbxref=GeneID:111129373,Genbank:XM_022475729.1;Name=XM_022475729.1;gbkey=mRNA;gene=LOC111129373;model_evidence=Supporting evidence includes similarity to: 2 Proteins;product=mucin-2-like;transcript_id=XM_022475729.1


The file was created, but the mRNA file itself is unsorted. I need to see if this impacted the output.
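
The error message says `closestBed` expected sorted input, so one thing to try is sorting the GFF by chromosome and then by start position (column 4 in a GFF) before rerunning. On the command line this would be something like `sort -k1,1 -k4,4n`. The Python sketch below does the same thing on a toy example; the function `sort_gff_lines` is hypothetical, the records are shortened versions of lines from the mRNA GFF, and I have not yet checked whether sorting changes the `closest` output.

```python
def sort_gff_lines(lines):
    """Sort GFF records by chromosome, then by numeric start position.

    Blank lines and comment/header lines (starting with '#') are dropped.
    """
    records = [
        line for line in lines if line.strip() and not line.startswith("#")
    ]

    def sort_key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[3]))  # column 1 = chromosome, column 4 = start

    return sorted(records, key=sort_key)


# Toy records in the wrong order, shortened from the mRNA GFF.
unsorted_gff = [
    "NC_035780.1\tGnomon\tmRNA\t2413594\t2416601\t.\t-\t.\tID=rna199",
    "NC_035780.1\tGnomon\tmRNA\t43111\t66897\t.\t-\t.\tID=rna2",
    "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1",
]

sorted_gff = sort_gff_lines(unsorted_gff)
for line in sorted_gff:
    print(line)
```

The DML and CG motif BED files would need the same treatment, using column 2 as the start position.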

## 4. Gene Enrichment Analysis

See this [R Markdown File](https://github.com/RobertsLab/project-virginica-oa/blob/master/analyses/2018-06-11-DML-Analysis/2018-06-14-Gene-Enrichment-Analysis.Rmd) for Gene Enrichment Analysis information.