# Learning Goals 
* What is a promoter? 
* What is an enhancer?
* Using 'genome arithmetic' to identify expressed genes from active promoter and strong enhancer sites
    * bedtools intersect 
    * bedtools closest 
 
* Comparing activity of promoters and enhancers across cell types 
    * bedtools subtract command 
* Visualizing promoter, enhancer, and gene coordinate data in the browser.



## TODO: add example from RNA-seq expression notebook (use same cell type) 


Nearly all cells in an organism contain the organism's complete DNA, yet there are many different cell types (e.g. neurons, cardiomyocytes).
 
Specific subsets of genes are turned on in different types of cells to determine cell type and function. 
 
There are two key players responsible for gene regulation:
 
**Transcription factors**
* Proteins
* Trans-acting elements: diffuse through the cell and can bind DNA regions far from the gene that encodes them
 
**Motif sequences** 
* DNA sequences
* Cis-acting elements: act at fixed position along the DNA molecule 

We obtain a list of active promoters and strong enhancers for the H1-hESC cell line (embryonic stem cells) from ENCODE [here](http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeBroadHmm). Both files are in the [BED format](https://genome.ucsc.edu/FAQ/FAQformat#format1); each line has nine columns (chromosome, start, end, name, score, strand, thickStart, thickEnd, itemRgb). Let's examine the contents of the files: 

In [6]:
# The active promoter file: 
!head -n10 data/wgEncodeBroadHmmH1hescHMM.active_promoters.bed

chr1	28537	30137	1_Active_Promoter	0	.	28537	30137	255,0,0
chr1	713337	714937	1_Active_Promoter	0	.	713337	714937	255,0,0
chr1	762337	762937	1_Active_Promoter	0	.	762337	762937	255,0,0
chr1	1092937	1093137	1_Active_Promoter	0	.	1092937	1093137	255,0,0
chr1	1166937	1167337	1_Active_Promoter	0	.	1166937	1167337	255,0,0
chr1	1208737	1209337	1_Active_Promoter	0	.	1208737	1209337	255,0,0
chr1	1244137	1244737	1_Active_Promoter	0	.	1244137	1244737	255,0,0
chr1	1259737	1260137	1_Active_Promoter	0	.	1259737	1260137	255,0,0
chr1	1284937	1285337	1_Active_Promoter	0	.	1284937	1285337	255,0,0
chr1	1310137	1311337	1_Active_Promoter	0	.	1310137	1311337	255,0,0


In [7]:
# The strong enhancer file: 
!head -n10 data/wgEncodeBroadHmmH1hescHMM.strong_enhancers.bed

chr1	36337	36537	5_Strong_Enhancer	0	.	36337	36537	250,202,0
chr1	780737	781137	5_Strong_Enhancer	0	.	780737	781137	250,202,0
chr1	948337	949337	4_Strong_Enhancer	0	.	948337	949337	250,202,0
chr1	958937	960337	5_Strong_Enhancer	0	.	958937	960337	250,202,0
chr1	960337	960737	4_Strong_Enhancer	0	.	960337	960737	250,202,0
chr1	960737	960937	5_Strong_Enhancer	0	.	960737	960937	250,202,0
chr1	1093137	1093737	4_Strong_Enhancer	0	.	1093137	1093737	250,202,0
chr1	1093737	1094137	5_Strong_Enhancer	0	.	1093737	1094137	250,202,0
chr1	1112337	1112737	5_Strong_Enhancer	0	.	1112337	1112737	250,202,0
chr1	1240337	1241137	5_Strong_Enhancer	0	.	1240337	1241137	250,202,0


We also have a list of gene coordinates for the hg19 human reference genome. The column meanings are as follows: 
* column 1: Chromosome name 
* column 2: Start of transcription 
* column 3: End of transcription 
* column 4: Gene name 
* column 5: Place holder (you can ignore this) 
* column 6: Strand information

In [11]:
!head -n10 data/hg19.gene_coords.bed

chr19	58858171	58864865	A1BG	0	-
chr19	58863335	58866549	A1BG-AS1	0	+
chr10	52559168	52645435	A1CF	0	-
chr12	9220303	9268825	A2M	0	-
chr12	9217772	9220651	A2M-AS1	0	+
chr12	8975067	9029377	A2ML1	0	+
chr12	9381128	9386803	A2MP1	0	-
chr1	33772366	33786699	A3GALT2	0	-
chr22	43088117	43116916	A4GALT	0	-
chr3	137842559	137851229	A4GNT	0	-
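
As a minimal sketch of the column layout described above, one line of the gene coordinate file can be read into named fields in Python. The `Gene` type and the `parse_bed6` helper are illustrative names, not part of bedtools or of this notebook's data pipeline.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str   # column 1: chromosome name
    start: int   # column 2: start of transcription
    end: int     # column 3: end of transcription
    name: str    # column 4: gene name
    score: str   # column 5: placeholder, ignored
    strand: str  # column 6: strand information


def parse_bed6(line: str) -> Gene:
    """Split one tab-separated line into the six columns listed above."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return Gene(chrom, int(start), int(end), name, score, strand)


# First row of data/hg19.gene_coords.bed as printed above
gene = parse_bed6("chr19\t58858171\t58864865\tA1BG\t0\t-")
print(gene.name, gene.end - gene.start, gene.strand)  # A1BG 6694 -
```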


We use the **wc** command to determine the total number of genes in the reference genome: 

In [19]:
!cat data/hg19.gene_coords.bed| wc -l 

27778


We would like to see which genes are expressed in the H1-hESC cell type. In class 3, we learned about the [bedtools intersect](http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) command. We can now use this command to intersect the file of active promoters with the list of gene coordinates to determine which genes are being expressed. 

In [8]:
!bedtools intersect -u -wa -a data/hg19.gene_coords.bed  -b data/wgEncodeBroadHmmH1hescHMM.active_promoters.bed  > expressed_genes_H1.active_promoters.bed

Let's examine the resulting file to see which genes intersect active promoters and are therefore turned on in the H1 cell line:

In [9]:
!head -n10 expressed_genes_H1.active_promoters.bed

chr1	14361	29370	WASH7P	0	-
chr1	700244	714068	LOC100288069	0	-
chr1	761585	762902	LINC00115	0	-
chr1	1152287	1167447	SDF4	0	-
chr1	1189291	1209234	UBE2J2	0	-
chr1	1243959	1247057	PUSL1	0	+
chr1	1246964	1260067	INTS11	0	-
chr1	1309109	1310580	AURKAIP1	0	-
chr1	1334909	1337426	LOC148413	0	+
chr1	1337275	1342693	MRPL20	0	-
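
To make the logic of `bedtools intersect -u` concrete, here is a small pure-Python sketch of the same idea: a gene is reported once if it overlaps at least one promoter on the same chromosome. The helper names `overlaps` and `intersect_u` are made up for illustration, and the brute-force loop is only meant for a handful of rows; bedtools uses far more efficient algorithms.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect_u(a_rows, b_rows):
    """Report each row of A once if it overlaps any row of B on the same chromosome."""
    return [
        a for a in a_rows
        if any(a[0] == b[0] and overlaps(a[1], a[2], b[1], b[2]) for b in b_rows)
    ]


# Rows copied from the files shown above: (chrom, start, end, name)
genes = [
    ("chr1", 14361, 29370, "WASH7P"),
    ("chr1", 33772366, 33786699, "A3GALT2"),
]
# First two H1 active promoters: (chrom, start, end)
promoters = [
    ("chr1", 28537, 30137),
    ("chr1", 713337, 714937),
]

expressed = intersect_u(genes, promoters)
print([g[3] for g in expressed])  # ['WASH7P']
```

WASH7P is kept because its last bases (up to 29370) reach into the promoter that starts at 28537; A3GALT2 overlaps neither promoter and is dropped.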


We can use the **wc** command to see how many genes are expressed in total: 

In [10]:
!cat expressed_genes_H1.active_promoters.bed |  wc -l

10081


Looks like there are 10,081 expressed genes in the cell line, which is about 36% of the 27,778 reference genes. 
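
The proportion can be checked directly from the two line counts reported above (the variable names are just for illustration):

```python
total_genes = 27778     # line count of data/hg19.gene_coords.bed
promoter_genes = 10081  # line count of expressed_genes_H1.active_promoters.bed

fraction = promoter_genes / total_genes
print(f"{fraction:.1%}")  # 36.3%
```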

Now, let's try the same intersection for the strong enhancers: 

In [11]:
!bedtools intersect -u -wa -a data/hg19.gene_coords.bed -b data/wgEncodeBroadHmmH1hescHMM.strong_enhancers.bed > expressed_genes_H1.strong_enhancers.bed 

Let's see how many genes show up as intersecting a strong enhancer:

In [12]:
!cat expressed_genes_H1.strong_enhancers.bed | wc -l 

4158


Note that we observe a much smaller number of genes: only 4,158, as opposed to 10,081 when we examined intersection with active promoters. What could account for this difference? There are two possible explanations. 

Not every expressed gene will be associated with a strong enhancer. Some may be associated with a weak enhancer, or not have an associated enhancer at all. 

Additionally, many enhancers are distal-acting: they can lie anywhere from several hundred to many thousands of bases away from the target gene. After a transcription factor has bound to the enhancer region, the DNA must form a loop to bring the transcription factor into contact with the target gene: 
![Enhancers can lie far away from their target genes](enhancer_position.png)

So we don't expect most of the enhancers to intersect the target gene. However, we expect the enhancer to be fairly close to the target gene. Generally (but not always!), the closest gene to a strong enhancer is that enhancer's target gene. We can then identify expressed genes from our list of strong enhancers by using the [**bedtools closest**](http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) command. 

*closest* searches for overlapping features in A and B. If no feature in B overlaps the current feature in A, closest reports the nearest feature in B (that is, the one with the least genomic distance from the start or end of A). For example, one might want to find the gene closest to a significant GWAS polymorphism. Note that closest reports an overlapping feature as the closest; it does not restrict itself to the closest non-overlapping feature. The following “cheatsheet” summarizes the functionality available through the various options provided by the closest tool.
![bedtools closest cheat sheet](closest_cheat_sheet.png)


We would like to know how far the enhancer is from the target gene, so we add the -d flag to report this distance. We would also like all genes to be reported in the case of ties, so we use the *-t all* flag. 

In [13]:
!bedtools closest -d -t all -a data/wgEncodeBroadHmmH1hescHMM.strong_enhancers.bed -b data/hg19.gene_coords.bed > expressed_genes.closest.bed 

Let's examine the output:

In [14]:
!head -n10 expressed_genes.closest.bed 

chr1	36337	36537	5_Strong_Enhancer	0	.	36337	36537	250,202,0	chr1	34610	36081	FAM138F	0	-	257
chr1	36337	36537	5_Strong_Enhancer	0	.	36337	36537	250,202,0	chr1	34610	36081	FAM138A	0	-	257
chr1	780737	781137	5_Strong_Enhancer	0	.	780737	781137	250,202,0	chr1	762970	778984	LINC01128	0	+	1754
chr1	948337	949337	4_Strong_Enhancer	0	.	948337	949337	250,202,0	chr1	948846	949919	ISG15	0	+	0
chr1	958937	960337	5_Strong_Enhancer	0	.	958937	960337	250,202,0	chr1	955502	991499	AGRN	0	+	0
chr1	960337	960737	4_Strong_Enhancer	0	.	960337	960737	250,202,0	chr1	955502	991499	AGRN	0	+	0
chr1	960737	960937	5_Strong_Enhancer	0	.	960737	960937	250,202,0	chr1	955502	991499	AGRN	0	+	0
chr1	1093137	1093737	4_Strong_Enhancer	0	.	1093137	1093737	250,202,0	chr1	1102483	1102578	MIR200B	0	+	8747
chr1	1093737	1094137	5_Strong_Enhancer	0	.	1093737	1094137	250,202,0	chr1	1102483	1102578	MIR200B	0	+	8347
chr1	1112337	1112737	5_Strong_Enhancer	0	.	1112337	1112737	250,202,0	chr1	1109285	1133313	TTLL10	0	+	0
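
The behaviour seen in this output can be sketched in a few lines of Python. This is a toy illustration, not how bedtools is implemented: `distance` and `closest` are hypothetical helpers, and the distance rule (0 for overlapping features, otherwise the gap plus one) is chosen so that it reproduces the distances printed above.

```python
def distance(a_start, a_end, b_start, b_end):
    """0 if the half-open intervals overlap, otherwise the gap in bases plus one."""
    if a_start < b_end and b_start < a_end:
        return 0
    if b_end <= a_start:              # B lies upstream of A
        return a_start - b_end + 1
    return b_start - a_end + 1        # B lies downstream of A


def closest(enhancer, genes):
    """Return (distance, names) of all genes tied for nearest on the same chromosome."""
    chrom, start, end = enhancer
    scored = [(distance(start, end, g[1], g[2]), g) for g in genes if g[0] == chrom]
    best = min(d for d, _ in scored)  # raises ValueError if no gene is on this chromosome
    return best, [g[3] for d, g in scored if d == best]


# Gene rows taken from the output above: (chrom, start, end, name)
genes = [
    ("chr1", 34610, 36081, "FAM138F"),
    ("chr1", 34610, 36081, "FAM138A"),
    ("chr1", 762970, 778984, "LINC01128"),
    ("chr1", 948846, 949919, "ISG15"),
]

print(closest(("chr1", 36337, 36537), genes))    # (257, ['FAM138F', 'FAM138A'])
print(closest(("chr1", 780737, 781137), genes))  # (1754, ['LINC01128'])
print(closest(("chr1", 948337, 949337), genes))  # (0, ['ISG15'])
```

The first enhancer shows why ties matter: FAM138F and FAM138A share the same coordinates, so both are reported at distance 257.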


Perform a sort operation on the gene name column (column 13) to count the number of unique genes identified by the *bedtools closest* command: 

In [15]:
!cut -f13 expressed_genes.closest.bed| sort | uniq | wc -l 

6975
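
For readers more comfortable with Python than with shell pipelines, the same "unique values of column 13" idea can be sketched with a set. The three rows below are copied from the output above; on the real file one would read the lines from `expressed_genes.closest.bed` instead.

```python
rows = [
    "chr1\t948337\t949337\t4_Strong_Enhancer\t0\t.\t948337\t949337\t250,202,0\tchr1\t948846\t949919\tISG15\t0\t+\t0",
    "chr1\t958937\t960337\t5_Strong_Enhancer\t0\t.\t958937\t960337\t250,202,0\tchr1\t955502\t991499\tAGRN\t0\t+\t0",
    "chr1\t960337\t960737\t4_Strong_Enhancer\t0\t.\t960337\t960737\t250,202,0\tchr1\t955502\t991499\tAGRN\t0\t+\t0",
]

# cut -f13 counts from 1, so the gene name sits at index 12
unique_genes = {row.split("\t")[12] for row in rows}
print(len(unique_genes), sorted(unique_genes))  # 2 ['AGRN', 'ISG15']
```

AGRN is the closest gene for two different enhancers but is counted only once, which is exactly what `sort | uniq` achieves.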


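Caveat: the closest gene is not always the gene targeted by the enhancer! A well-known example is the ZRS enhancer of the SHH gene, which lies inside an intron of the LMBR1 gene, roughly 1 Mb away from SHH. 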

Now we see 6,975 genes, as opposed to 4,158 when we used the bedtools intersect command. 

How would the gene expression profile change if we examined a different cell type? We have downloaded data for the HepG2 cell line (derived from the liver). We repeat our analysis from above: 

In [1]:
# Here is the file of active promoter regions in the HepG2 cell line: 
!head -n10 data/wgEncodeBroadHmmHepg2HMM.active_promoters.bed

chr1	28337	29937	1_Active_Promoter	0	.	28337	29937	255,0,0
chr1	135737	136137	1_Active_Promoter	0	.	135737	136137	255,0,0
chr1	137537	138937	1_Active_Promoter	0	.	137537	138937	255,0,0
chr1	325537	326537	1_Active_Promoter	0	.	325537	326537	255,0,0
chr1	327137	328137	1_Active_Promoter	0	.	327137	328137	255,0,0
chr1	661537	662737	1_Active_Promoter	0	.	661537	662737	255,0,0
chr1	662937	664737	1_Active_Promoter	0	.	662937	664737	255,0,0
chr1	713337	715737	1_Active_Promoter	0	.	713337	715737	255,0,0
chr1	761737	763537	1_Active_Promoter	0	.	761737	763537	255,0,0
chr1	893737	894737	1_Active_Promoter	0	.	893737	894737	255,0,0


In [2]:
# YOUR CODE HERE: 
# Intersect the active promoters file with the gene coordinates file to get the list of expressed genes 
# in the HepG2 cell line 

In [3]:
# Here is the file of strong enhancers in the HepG2 cell line: 
!head -n10 data/wgEncodeBroadHmmHepg2HMM.strong_enhancers.bed

chr1	11537	11937	4_Strong_Enhancer	0	.	11537	11937	250,202,0
chr1	18137	19137	5_Strong_Enhancer	0	.	18137	19137	250,202,0
chr1	19137	21537	4_Strong_Enhancer	0	.	19137	21537	250,202,0
chr1	27537	27737	5_Strong_Enhancer	0	.	27537	27737	250,202,0
chr1	27737	28337	4_Strong_Enhancer	0	.	27737	28337	250,202,0
chr1	136537	137537	5_Strong_Enhancer	0	.	136537	137537	250,202,0
chr1	463537	464337	5_Strong_Enhancer	0	.	463537	464337	250,202,0
chr1	696537	697337	5_Strong_Enhancer	0	.	696537	697337	250,202,0
chr1	939937	941137	4_Strong_Enhancer	0	.	939937	941137	250,202,0
chr1	942137	942537	4_Strong_Enhancer	0	.	942137	942537	250,202,0


In [4]:
#YOUR CODE HERE: 
# Use the bedtools closest command to map strong enhancers to active genes  

We are now interested in the genes that are expressed in the H1 cell line but not in the HepG2 cell line, and vice versa. We can use the [bedtools subtract](http://bedtools.readthedocs.io/en/latest/content/tools/subtract.html) command to identify entries that are present in one bed file but not in another. By default, subtract removes only the portion of each feature in A that overlaps a feature in B; the `-A` option removes the entire feature if there is any overlap. 
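
The idea behind the comparison can be sketched in pure Python. The gene rows below are made up purely for illustration (they are not results for either cell line), and `subtract_whole` is a hypothetical helper that drops an entire row of A as soon as it overlaps anything in B, which corresponds to whole-feature removal rather than trimming.

```python
def subtract_whole(a_rows, b_rows):
    """Keep the rows of A that overlap no row of B on the same chromosome."""
    return [
        a for a in a_rows
        if not any(a[0] == b[0] and a[1] < b[2] and b[1] < a[2] for b in b_rows)
    ]


# Hypothetical gene lists: (chrom, start, end, name)
h1_genes = [
    ("chr1", 100, 500, "GENE_A"),
    ("chr1", 1000, 2000, "GENE_B"),
    ("chr2", 100, 500, "GENE_C"),
]
hepg2_genes = [
    ("chr1", 1000, 2000, "GENE_B"),
    ("chr2", 900, 1500, "GENE_D"),
]

h1_only = subtract_whole(h1_genes, hepg2_genes)
hepg2_only = subtract_whole(hepg2_genes, h1_genes)
print([g[3] for g in h1_only])     # ['GENE_A', 'GENE_C']
print([g[3] for g in hepg2_only])  # ['GENE_D']
```

Note that the operation is not symmetric: swapping A and B answers a different question, which is why the two directions are handled in separate cells.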

In [5]:
# Subtracting the promoter-intersected genes of the H1 cell line from 
# the promoter-intersected genes of the HepG2 cell line. 

In [None]:
# Subtracting the promoter-intersected genes of the HepG2 cell line from 
# the promoter-intersected genes of the H1 cell line. 