# Big Data for Biologists: Decoding Genomic Function- Class 12

##  Learning Objectives
***Students should be able to***
<ol> 
<li><a href=#Regulator>Explain what is a regulatory element</a></li>
<li> <a href=#PromotersEnhancers>Explain what promoters and enhancers are</a></li>
<li> <a href=#creMapping>How can we identify all regulatory elements in the human genome?</a></li>
<li> <a href=#chromatinState>What is the chromatin state of a regulatory element?</a></li>
<li> <a href=#roadmap>Interpret genome-wide chromatin state maps in 100s of cell types and tissues from the ENCODE and Roadmap Epigenomics Projects</a></li>
<li> <a href=#Arithmetic>Use the bedtools closest and bedtools intersect commands to identify expressed genes from active promoter and strong enhancer sites.</a></li>
<li> <a href=#Subtract>Use the bedtools subtract command to compare the activity of promoters and enhancers across cell types. </a></li>
<li> <a href=#IntersectBed> Use the intersectBed command to extract sequences corresponding to promoter or enhancer regions for a gene from ChIP-seq data </a></li> 
</ol>


# What are regulatory elements? <a name='Regulator'/>

Complexes of transcription factors bind specific genomic elements that contain their DNA sequence binding motifs. These elements are referred to as <b>regulatory elements</b> or cis-regulatory elements (cREs). The primary role of cREs is to modulate transcription of nearby genes by either promoting or repressing recruitment of the RNA polymerase complex at the start of the gene. RNA polymerase is the protein complex that transcribes genes into messenger RNA.

There are millions of regulatory elements scattered all across the genome.

Each cell type / cell state in the human body has a specific gene expression program i.e. which genes are expressed and at what levels. Hence, each cell type also has a specific collection of cREs that regulate the gene expression program of the cell and define its identity and function.

Note that two factors determine whether a cRE is functioning, i.e. actively regulating a gene, in a specific cell type or cell state.
1. The collection of DNA sequence motifs corresponding to transcription factor binding sites encoded in the DNA sequence of the cRE
2. The availability of the transcription factor proteins to bind to the motifs. i.e. the genes encoding the transcription factors whose motifs are encoded in the cRE should be expressed and localized to the nucleus of the cell.

## What are promoters and enhancers ? <a name='PromotersEnhancers' />

A **promoter** is a cRE that is located just upstream of a gene's transcription start site, on the same strand as the gene. Transcription factors (TFs) bind to sequence motifs embedded in the sequence of the promoter. They recruit RNA polymerase to initiate transcription from the transcription start site. Promoters are typically 100 - 1000 bases long. 

In contrast, an **enhancer** is a cRE that is located at some distance from the gene it regulates, either upstream or downstream. Some enhancers can be over 1 Mb away from the gene they regulate. Hence, enhancers are also often referred to as distal cREs. Transcription factors bind to sequence motifs embedded in the DNA sequences of enhancers. These enhancers are brought into 3D physical proximity to the promoters of the genes they regulate, which is believed to result from DNA looping (see figure below). The TFs bound at the enhancer further assist in recruiting RNA polymerase to the promoter of the target gene, thereby enhancing transcription, i.e. production of mRNA.

In the human genome, on average each gene has about 10 enhancers. Enhancers are critical for cell-type specific regulation of genes.


![Promoter vs Enhancer](../Images/11_PromoterEnhancer.jpg)
source: https://www.nature.com/scitable/topicpage/gene-expression-14121669

![Multiple Enhancers](../Images/11_MultipleEnhancers.jpg)
source: https://www.nature.com/scitable/topicpage/gene-expression-14121669

## How can we identify all regulatory elements in the human genome? <a name='creMapping' />

As noted above, cREs are regions in the genome that are bound by different combinations of transcription factors. So we can think of two strategies for identifying cREs genome wide.

### 1. Computational motif scanning: 
One efficient computational strategy to identify cREs genome-wide would be to scan the human genome using motifs (PSSMs) of every possible transcription factor. This is in fact a commonly used screening approach. But there are a few drawbacks to this approach.

1. Deciding on optimal thresholds for motif matching scores to accurately discriminate between true and false TF binding sites is a difficult and unsolved problem.
2. Motif scanning would provide a static collection of all possible cREs in the human genome but would not reveal how these cREs are being dynamically utilized in different cell types and cell states.
3. The presence of a motif is not a sufficient condition for a region of the genome to act as a cRE. There are many other 'epigenomic' properties of DNA that can override the sequence, e.g. a cRE containing an activator TF's motif can be repressed by adding specific chemical modifications to it.
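
The threshold problem in point 1 can be made concrete with a toy scan. The sketch below is only an illustration, not a production motif scanner: the 4-position count matrix, the pseudocount, the uniform background, the example sequence, and the helper names `build_pssm` and `scan` are all hypothetical and chosen for this example.

```python
import math

# Hypothetical 4-bp motif: counts of each base at each position (made-up numbers).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}

def build_pssm(counts, pseudocount=1.0, background=0.25):
    """Convert a count matrix into log2-odds scores against a uniform background."""
    length = len(next(iter(counts.values())))
    pssm = []
    for pos in range(length):
        total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        pssm.append({
            b: math.log2(((counts[b][pos] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pssm

def scan(sequence, pssm, threshold):
    """Return (start, kmer, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        kmer = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(kmer))
        if score >= threshold:
            hits.append((start, kmer, score))
    return hits

pssm = build_pssm(COUNTS)
sequence = "TTACGTGGACGAACGT"  # made-up sequence
strict_hits = scan(sequence, pssm, threshold=5.0)
lenient_hits = scan(sequence, pssm, threshold=0.0)
print(len(strict_hits), len(lenient_hits))
```

Lowering the threshold adds windows that only partially resemble the motif. The scores alone cannot tell us which of those extra windows are bound in a real cell.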

### 2. ChIP-seqing every TF in every cell state:
We could instead perform ChIP-seq experiments for every transcription factor in every possible cellular state. This is not feasible, both because of material, cost and time constraints and because we do not have high-specificity antibodies for every transcription factor.

## What is the chromatin state of a regulatory element? <a name='chromatinState'/>

In [1]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSyvg4SNRzpyOk1FxDOPXIxy0Nwj47X0OCQpaY010G_dpupBS805LLVltYk-WuUwRbki9P-2one50ok/embed?start=false&loop=false&delayms=60000" frameborder="0" width="1000" height="800" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')


DNA is wrapped around structural protein complexes known as <b>nucleosomes</b> like a string wrapped around beads. Nucleosomes provide structural stability to DNA. DNA wrapped around nucleosomes is collectively referred to as the <b>chromatin</b> fibre. 

Nucleosomes are made up of a specific class of proteins known as <b>histones</b>. Specific amino acid residues in the tails of these histone proteins can be modified with a wide variety of chemical modifications such as acetylation and methylation. These modifications are referred to as <b>histone or chromatin modifications</b>. Some histone modifications are activating and others are repressive. Combinations of histone modifications are used by the cell to mark different types of functional elements and their activity states. For example, enhancers and promoters have different combinations of histone modifications, while active gene bodies and repressed domains each carry yet other modifications. These distinct combinations of histone modifications that define specific types of elements are called the <b>histone code</b> or <b>chromatin states</b>.

Cytosines in DNA can also be methylated. These modifications are referred to as <b>DNA methylation</b>. Note that DNA methylation is a direct modification to DNA, whereas histone modifications are modifications to the histone proteins in the nucleosomes that the DNA is wrapped around.

Each cell type in a specific cell state has a distinct and characteristic map of chromatin state and DNA methylation across the genome, which is referred to as the <b>epigenome</b> of the cell type. The epigenome can be modified in response to changes in environment or cell state. Epigenomic modifications are the primary conduit through which the information encoded in the DNA (the genetic code of genes and the motif code of regulatory elements) interacts with and responds to the environment.

A region of DNA in the genome can behave differently in different cell states based on its epigenomic/chromatin state even though the DNA sequence is identical in the different cell states. This is the primary basis by which the same genomic sequence can give rise to the diversity of cell types and tissues in the human body.

## Interpreting genome-wide chromatin state maps in 100s of cell types and tissues from the ENCODE and Roadmap Epigenomics Projects <a name="roadmap"/>

Analogous to the <a href="https://www.gtexportal.org/home/">GTEx Project</a>, whose goal was to map gene expression variation across celltypes/tissues and individuals, the NIH funded two large projects <a href="https://www.encodeproject.org/">The Encyclopedia of DNA Elements (ENCODE)</a> and <a href="http://www.roadmapepigenomics.org/">The Roadmap Epigenomics Project</a> to map chromatin states across the entire genome in 100s of diverse cell types and tissues.

Just as <b>ChIP-seq experiments</b> can be used to map genome-wide binding locations of transcription factors, the same technique can also be used to map genome-wide locations of different chromatin modifications. While there are 100s of different histone modifications, luckily, a few (~5-10) have been found to be sufficient to map most of the important classes of functional elements in the genome such as genes, promoters and enhancers. Hence, mapping histone modifications is a powerful and economical approach to map regulatory elements genome-wide and also obtain the cell type and tissue-specific activity states.

The ENCODE and Roadmap Epigenomics projects performed ChIP-seq experiments for 5-10s of histone modifications across 100s of diverse cell types and tissues. Computational approaches were used to identify the predominant combinations of histone modifications i.e. elucidate the key types of <b>chromatin states</b>. 
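
To make the idea of "combinations of marks define a state" concrete, here is a deliberately simplified sketch. It is a hand-written lookup table, not the statistical model the projects used. The rule order, the state labels, the function name `call_state`, and the example regions are assumptions made for illustration, based on commonly cited associations between histone marks and element types.

```python
# Simplified, textbook-style associations between histone marks and element types.
# The rule order and the state labels are illustrative assumptions for this sketch.
RULES = [
    ({"H3K4me3", "H3K27ac"}, "active promoter"),
    ({"H3K4me1", "H3K27ac"}, "strong enhancer"),
    ({"H3K4me1"}, "weak/poised enhancer"),
    ({"H3K36me3"}, "transcribed gene body"),
    ({"H3K27me3"}, "Polycomb-repressed"),
]

def call_state(marks_present):
    """Return the first state whose required marks are all present in the region."""
    marks = set(marks_present)
    for required, state in RULES:
        if required <= marks:
            return state
    return "no signal / quiescent"

# Hypothetical regions with the marks detected by ChIP-seq in one cell type.
regions = {
    "region_1": ["H3K4me3", "H3K27ac"],
    "region_2": ["H3K4me1", "H3K27ac"],
    "region_3": ["H3K27me3"],
    "region_4": [],
}
states = {name: call_state(marks) for name, marks in regions.items()}
for name, state in states.items():
    print(name, state)
```

Running the same lookup on marks measured in a different cell type could assign a different state to the same region, which is the sense in which chromatin state maps are cell-type specific.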

The projects were able to identify over <b>2 million putative enhancers</b> across the diverse cell types and tissues thereby providing the first reference epigenomic and regulatory map of the human genome.

<b>Clustering</b> analysis of the activity patterns of the 2 million enhancers revealed modular clusters of enhancers with highly tissue-specific activation patterns. (See slides above)

<b>Gene-Ontology enrichment analysis</b> of these enhancer clusters i.e. using genes in proximity to the enhancers in each cluster revealed that the enhancers regulate genes with highly tissue-specific functions. These enhancers hence define the regulatory circuitry that defines cell identity. (See slides above)

<b>Motif enrichment analysis</b> of the DNA sequences of enhancers in each cluster also reveal distinct motifs of tissue-specific transcription factors embedded in the enhancers. (See slides above)

We will visualize histone ChIP-seq data for several histone modifications in the IMR-90 fibroblast cell-line in the WashU browser http://epigenomegateway.wustl.edu/browser/?genome=hg19&session=srmdBPr0KW&statusId=1954345662

We will visualize compact chromatin state maps across all 127 cell types and tissues using the WashU epigenome browser http://epigenomegateway.wustl.edu/browser/?genome=hg19&session=jLXlxz7ZnG&statusId=2008294509

## Using the bedtools closest and bedtools intersect commands to identify expressed genes from active promoter and strong enhancer sites. <a name='Arithmetic' />

We obtain a list of promoters and enhancers for the H1-hESC cell line (embryonic stem cells) from ENCODE [here](http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeBroadHmm). Both files are in the [Bed6 format](https://genome.ucsc.edu/FAQ/FAQformat#format1). By comparing the location of active promoters and strong enhancers in the genome to the location of genes, we can get a good sense of which genes are active. 

Let's examine the contents of the active promoter and strong enhancer files: 

In [None]:
# The active promoter file: 
!head -n10 data/wgEncodeBroadHmmH1hescHMM.active_promoters.bed

In [None]:
#The strong enhancer file: 
!head -n10 data/wgEncodeBroadHmmH1hescHMM.strong_enhancers.bed

We also have a list of gene coordinates for the hg19 human reference genome. The column meanings are as follows: 
* column 1: Chromosome name 
* column 2: Start of transcription 
* column 3: End of transcription 
* column 4: Chromatin state 
* column 5: Place holder (you can ignore this) 
* column 6: Strand information

In [None]:
!head -n10 data/hg19.gene_coords.bed

We use the **wc** command to determine the total number of genes in the reference genome: 

In [None]:
!wc -l data/hg19.gene_coords.bed

We would like to see which genes are expressed in the H1-hESC cell type. In the pre-class assignment you looked at the documentation for the [bedtools intersect](http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) command. We can now use this command to intersect the file of active promoters with the list of gene coordinates to determine which genes are being expressed. 
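
As a rough picture of what the `-u` and `-wa` options in the next command ask for (report each original entry of file A once if it overlaps anything in file B), here is a minimal pure-Python sketch. The coordinates are made up, the helper names `overlaps` and `intersect_unique` are hypothetical, and the brute-force loop is only for illustration; it says nothing about how bedtools is implemented.

```python
def overlaps(a, b):
    """True if two BED-style half-open intervals (chrom, start, end, ...) share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def intersect_unique(features_a, features_b):
    """Report each feature in A once if it overlaps any feature in B."""
    return [a for a in features_a if any(overlaps(a, b) for b in features_b)]

# Made-up gene and promoter coordinates.
genes = [
    ("chr1", 1000, 5000, "geneA"),
    ("chr1", 8000, 9000, "geneB"),
    ("chr2", 1000, 5000, "geneC"),
]
promoters = [
    ("chr1", 900, 1200),   # overlaps geneA
    ("chr1", 950, 1100),   # also overlaps geneA; geneA is still reported once
    ("chr2", 6000, 6200),  # overlaps nothing
]
expressed = intersect_unique(genes, promoters)
print([g[3] for g in expressed])
```

In this toy data only `geneA` is reported, and it appears once even though two promoter regions overlap it.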

In [None]:
!bedtools intersect -u -wa -a data/hg19.gene_coords.bed  -b data/wgEncodeBroadHmmH1hescHMM.active_promoters.bed  > expressed_genes_H1.active_promoters.bed

Let's examine the resulting file to see which genes intersect active promoters and are therefore turned on in the H1 cell line:

In [None]:
!head -n10 expressed_genes_H1.active_promoters.bed

We can use the **wc** command to see how many genes are expressed in total: 

In [None]:
! wc -l expressed_genes_H1.active_promoters.bed 

Looks like there are 10,081 expressed genes in the cell line, which is slightly less than half of all reference genes. 

Now, let's try the same intersection for the strong enhancers: 

In [None]:
!bedtools intersect -u -wa -a data/hg19.gene_coords.bed -b data/wgEncodeBroadHmmH1hescHMM.strong_enhancers.bed > expressed_genes_H1.strong_enhancers.bed 

Let's see how many genes show up as intersecting a strong  enhancer:

In [None]:
!wc -l expressed_genes_H1.strong_enhancers.bed

Note that we observe a much smaller number of genes -- only 4158 as opposed to 10081 when we examined intersection with active promoters. What could account for this difference? There are two possible explanations. 

Not every expressed gene will be associated with a strong enhancer. Some may be associated with a weak enhancer, or not  have an associated enhancer at all. 

Additionally, many enhancers are distal-acting -- they can be located anywhere from several hundred bases to more than 1 Mb away from the target gene. After transcription factors have bound to the enhancer region, the DNA is thought to form a loop that brings the enhancer and its bound factors into contact with the promoter of the target gene: 
![Enhancers are several hundred bases away from target genes](enhancer_position.png)

So we don't expect most of the enhancers to intersect the target gene. However, we expect the enhancer to be fairly close to the target gene. Generally (but not always!), the closest gene to a strong enhancer is that enhancer's target gene. We can then identify expressed genes from our list of strong enhancers by using the [**bedtools closest**](http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) command. 

*closest* searches for overlapping features in A and B. In the event that no feature in B overlaps the current feature in A, closest will report the nearest (that is, least genomic distance from the start or end of A) feature in B. Note that closest will report an overlapping feature as the closest; that is, it does not restrict to the closest non-overlapping feature. The following iconic “cheatsheet” summarizes the functionality available through the various options provided by the closest tool.
![bedtools closest cheat sheet](../Images/11_bedtools_closest_cheatsheet.png)


We would like to know how far the enhancer is from the target gene, so we add the -d flag to report this distance. We would also like all genes to be reported in the case of ties, so we use the *-t all* flag. 
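
The behaviour we are asking for (nearest gene, distance reported, all ties kept) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the coordinates are invented, `gap` and `closest` are hypothetical helper names, and `gap` counts the bases separating two half-open intervals, which may differ slightly from the distance convention bedtools itself reports.

```python
def gap(a, b):
    """Bases separating two half-open intervals on one chromosome; 0 if they overlap or touch."""
    return max(0, a[1] - b[2], b[1] - a[2])

def closest(feature, candidates):
    """Return (distance, list of all tied nearest candidates) on the same chromosome."""
    same_chrom = [c for c in candidates if c[0] == feature[0]]
    if not same_chrom:
        return None, []
    best = min(gap(feature, c) for c in same_chrom)
    return best, [c for c in same_chrom if gap(feature, c) == best]

# Made-up enhancer and gene coordinates.
enhancer = ("chr1", 5000, 5200)
genes = [
    ("chr1", 1000, 4000, "geneA"),    # 1000 bases to the left
    ("chr1", 6200, 9000, "geneB"),    # 1000 bases to the right (a tie)
    ("chr1", 20000, 30000, "geneC"),  # much farther away
    ("chr2", 5000, 5200, "geneD"),    # different chromosome, never considered
]
distance, nearest = closest(enhancer, genes)
print(distance, [g[3] for g in nearest])
```

Because both `geneA` and `geneB` are equally close, both are returned, which mirrors the intent of reporting all ties.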

In [None]:
!bedtools closest -d -t all -a data/wgEncodeBroadHmmH1hescHMM.strong_enhancers.bed -b data/hg19.gene_coords.bed > expressed_genes.closest.bed 

Let's examine the output:

In [None]:
!head -n10 expressed_genes.closest.bed 

Perform a sort operation on the gene name column (column 13) to count the number of unique genes identified by the *bedtools closest* command: 

In [None]:
#cut column 13 from the bed file (this contains the gene names)
!cut -f13 expressed_genes.closest.bed > tmp1

#sort the gene names (each shell command in a notebook cell needs a leading "!")
!sort tmp1 > tmp2 
#get the unique gene names from the sorted data. 
!uniq tmp2 > tmp3 

#count the number of lines in the file. 
!wc -l tmp3 

Now we see 6975 genes, considerably more than the number we found above when we applied the bedtools intersect command to the strong enhancers. 
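
The same count can also be obtained directly in Python, without temporary files. The sketch below is a minimal illustration: `unique_values` is a hypothetical helper, and the three rows are made-up stand-ins for lines of a tab-delimited file.

```python
def unique_values(lines, column):
    """Collect the distinct values found in a 1-based column of tab-delimited lines."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column:
            values.add(fields[column - 1])
    return values

# Made-up rows standing in for lines read from a bed file.
rows = ["chr1\t100\tgeneA\n", "chr1\t300\tgeneA\n", "chr2\t500\tgeneB\n"]
genes = unique_values(rows, column=3)
print(len(genes))
```

For the real data, an open file handle for `expressed_genes.closest.bed` could be passed in place of `rows`, with `column=13`.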

## Use the bedtools subtract command to compare the activity of promoters and enhancers across cell types. <a name='Subtract' />

How would the gene expression profile change if we examined a different cell type? We have downloaded data for the HepG2 cell line (from the liver). We repeat our analysis from above: 

In [None]:
# Here is the file of active promoter regions in the Hepg2 cell line: 
!head -n10 data/wgEncodeBroadHmmHepg2HMM.active_promoters.bed

In [None]:
#YOUR CODE HERE: 
#Intersect the active promoters file with the genome coordinates file to get the list of expressed genes 
# in the Hepg2 cell line 

In [None]:
#Here is the file of strong enhancers in the Hepg2 cell line: 
!head -n10 data/wgEncodeBroadHmmHepg2HMM.strong_enhancers.bed

In [None]:
#YOUR CODE HERE: 
# Use the bedtools closest command to map strong enhancers to active genes  

We are now interested in the different genes that are expressed in the H1 cell line as compared to the Hepg2 cell line. We can use the [bedtools subtract](http://bedtools.readthedocs.io/en/latest/content/tools/subtract.html) command to identify entries that are present in one bed file but not present in another bed file. 

The syntax for this command is: 

bedtools subtract -a **fileA** -b **fileB**

Any portion of a region in fileA that is overlapped by a region in fileB will be removed from fileA: 

![bedtools subtract cheat sheet](../Images/12_subtract.png)
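
The subtraction illustrated above can be sketched for a single interval as follows. This is a simplified illustration with invented coordinates and a hypothetical function name `subtract`; it only mimics the default behaviour of trimming away the overlapped portions.

```python
def subtract(a, b_features):
    """Remove from interval a = (chrom, start, end) every portion covered by b_features."""
    chrom, start, end = a
    pieces = []
    cursor = start  # left edge of the part of a not yet processed
    for b_chrom, b_start, b_end in sorted(b_features):
        # Skip features on other chromosomes or entirely outside the remaining part of a.
        if b_chrom != chrom or b_end <= cursor or b_start >= end:
            continue
        if b_start > cursor:
            pieces.append((chrom, cursor, b_start))  # keep the uncovered stretch
        cursor = max(cursor, b_end)
    if cursor < end:
        pieces.append((chrom, cursor, end))
    return pieces

# Made-up coordinates.
region = ("chr1", 100, 200)
to_remove = [("chr1", 120, 150), ("chr1", 180, 250), ("chr2", 100, 200)]
remaining = subtract(region, to_remove)
print(remaining)
```

Note that the default trims features rather than dropping them. When the goal is to list whole genes that are present in one cell line and absent from the other, the bedtools documentation describes a `-A` option for `bedtools subtract` that removes an entire feature of fileA if it has any overlap with fileB.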




In [None]:
#Subtracting the promoter-intersected genes of the H1 cell line from 
#the promoter-intersected genes of the Hepg2 cell line. 

In [None]:
#Subtracting the promoter-intersected genes of the Hepg2 cell line from 
#the promoter-intersected genes of the H1 cell line.

## Use the intersectBed command to extract sequences corresponding to promoter or enhancer regions for a gene from ChIP-seq data <a name='IntersectBed' />

We have fetched a list of enhancer-like and promoter-like regions for the H1 embryonic stem cell line in humans (hg19)

These are stored in the following files: 

**ENCSR000DRY_predictions_promoter.bed** for the promoter file.

**ENCSR000ANP_predictions_enhancer.bed** for the enhancer file. 

#### What file format is the data in? What is contained in each of the first five columns? 
Your answer: 
    
#### Bonus question -- can you explain the meaning of the remaining columns in the file (column 6 and up)? 
Your answer:

Prior studies have shown that the gene [*LIN28A*](http://www.genecards.org/cgi-bin/carddisp.pl?gene=LIN28A) is associated with cell differentiation, and you hypothesize that this gene is likely to be expressed in the H1 cell line, as these are embryonic cells that are undergoing differentiation.  Use the LIN28A Gene Cards link above to find the chromosome and position of the LIN28A gene.  Fill in the chromosome and the starting and ending coordinates of LIN28A below:

In [None]:
chrom_lin28a="" #replace "" with the chromosome of the gene 
startpos_lin28a=None #replace None with the gene start position 
endpos_lin28a=None #replace None with the gene end position


We will use the bedtools intersect command to find the H1  promoter and enhancer peaks that are within 50kb of the LIN28A gene. 
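
The padding step can be sketched on its own before applying it to the real gene. This is a minimal illustration under stated assumptions: `padded_window` and `bed_line` are hypothetical helper names, the coordinates are placeholders rather than the real LIN28A position, and clamping the start at 0 is an added safeguard for genes near the beginning of a chromosome.

```python
def padded_window(chrom, start, end, pad=50_000):
    """Extend an interval by `pad` bases on each side, clamping the start at 0."""
    return chrom, max(0, start - pad), end + pad

def bed_line(chrom, start, end, name):
    """Format one tab-delimited, four-column BED line."""
    return "\t".join([chrom, str(start), str(end), name]) + "\n"

# Placeholder coordinates, NOT the real LIN28A position.
window = padded_window("chrN", 120_000, 123_000)
near_start = padded_window("chrN", 10_000, 12_000)  # start would go negative without the clamp
print(bed_line(*window, "example_gene"), end="")
```

Returning new values instead of modifying the inputs means the padding is not applied twice if the code is run again.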

In [None]:
# You need to store LIN28A's coordinates in a bed file in order to run the intersect command 
## Execute this code block to generate a bed file containing the LIN28A coordinates, including regions 50kb 
## upstream & downstream of the gene 

#add 50kb to the start and end coordinates (the start is kept from dropping below 0)
startpos_lin28a = max(0, startpos_lin28a - 50000)
endpos_lin28a += 50000

#convert coordinates to string values in preparation for writing to output file
output_line=[str(i) for i in [chrom_lin28a,startpos_lin28a,endpos_lin28a,'lin28a'] ]

#write the coordinates to the output file; the "with" block closes the file automatically
with open("LIN28A.coords.bed",'w') as outf:
    outf.write('\t'.join(output_line)+'\n')

In [None]:
## Now you can intersect the promoter files with the LIN28A coordinates 

## promoter peaks 
!bedtools intersect -wa -a *promoter.bed -b LIN28A.coords.bed 

## you can use the intersectBed shortcut, it will do the same thing 
!intersectBed -wa -a *promoter.bed -b LIN28A.coords.bed

##TODO: write a command to intersect the enhancer peak files with LIN28A coordinates 

