## HUMBIO51 Assignment Week 7

##  Learning Objectives
***Students should be able to***

### Identify the most enriched motifs in active enhancer regions.  

<ol> 
<li> <a href=#Sort>Use bedtools commands to sort ChIP-seq peak files by genome coordinates.</a></li>
<li> <a href=#Intersect>Use bedtools commands to intersect regions from multiple bed files.</a></li>
<li> <a href=#Dict>Use python dictionaries to generate frequencies of transcription factors in active enhancers.</a></li>
<li> <a href=#Plot>Generate bar graphs to visualize the most over-represented transcription factors in GM12878.</a></li>
</ol> 

## Question 1: Using k-means analysis to cluster genomic regions by chromatin state 

We have sampled 10,000 regions from the human reference genome at random. Each region is 200 base pairs long. For each region, we have data from ChIP-seq experiments in the K562 cell line (a leukemia cell line) measuring the strength of 5 histone marks:
* H3K4me3, 
* H3K4me1, 
* H3K36me3, 
* H3K9me3,
* H3K27me3. 

The data is stored in a file in your Week_7 folder called **region_x_chrom_mark.tsv**

This file contains one 200 bp genomic region per row and one column for each of the 5 histone marks. Each entry is "0", "0.5", or "1": 0 indicates that the histone mark is not present, 0.5 indicates a weak presence, and 1 indicates a strong presence.

In [18]:
#Change your working directory to Week_7
import os
os.chdir('/opt/humbio51/Weekly Assignments/Week_7')

### Question 1a. 
Perform principal component analysis on the data matrix in **region_x_chrom_mark.tsv** using 5 principal components since we have 5 features. Create a scatter plot to visualize PC1 along the x-axis and PC2 along the y-axis.

Hint: refer to the PCA code from class 8.

Hint: We are performing PCA on the genomic regions, which are located in the rows of the matrix. Therefore, there is no need to transpose the matrix before running PCA.
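As a minimal sketch of the call pattern (not the class 8 code, which is not reproduced here), the example below runs scikit-learn's PCA on a small synthetic matrix with regions in rows and marks in columns. The names `toy`, `mark_names`, and `pcs` are illustrative assumptions, and the values are random rather than taken from the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the real matrix: 100 regions (rows) x 5 marks (columns).
rng = np.random.default_rng(0)
mark_names = ["H3K4me3", "H3K4me1", "H3K36me3", "H3K9me3", "H3K27me3"]
toy = pd.DataFrame(rng.choice([0.0, 0.5, 1.0], size=(100, 5)), columns=mark_names)

# Rows are the observations, so the matrix is passed to PCA without transposing.
pca = PCA(n_components=5)
pcs = pd.DataFrame(pca.fit_transform(toy),
                   columns=["PC1", "PC2", "PC3", "PC4", "PC5"])

# Fraction of the total variance captured by each component.
explained = pca.explained_variance_ratio_
```

The `PC1` and `PC2` columns of `pcs` are the values that would go on the x- and y-axes of the scatter plot.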

In [19]:
import pandas as pd
from sklearn.decomposition import PCA as sklearnPCA
from plotnine import * 

## YOUR CODE HERE ##


### Question 1b: 
Looking at the PCA plot, how many clusters would you expect to find in a K-means analysis of this data?

**Your answer here:**  

### Question 1c: 
Use the k-means clustering algorithm implementation from class 9 to cluster the 10,000 genomic regions. Use values of k = 2, 4, 12, 15, 20.
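The contents of `kmeans_helpers` are not shown in this notebook, so the sketch below uses scikit-learn directly. It clusters a synthetic data set (three well-separated blobs, an assumption made purely for illustration) at several values of k and records the silhouette score for each. The names `toy`, `scores`, and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs in 5 dimensions, 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0, 0],
                    [1, 1, 0, 0, 0],
                    [0, 0, 1, 1, 1]], dtype=float)
toy = np.vstack([c + rng.normal(scale=0.05, size=(50, 5)) for c in centers])

# Fit k-means for each candidate k and store the mean silhouette score.
scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    scores[k] = silhouette_score(toy, labels)

# The k with the highest silhouette score.
best_k = max(scores, key=scores.get)
```

On the assignment data, the same loop would run over the k values listed in the question.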

In [21]:
import sys
sys.path.append('/opt/humbio51/helpers')
from kmeans_helpers import * 
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,silhouette_samples

## YOUR CODE HERE ## 

### Question 1d.

Which value of K gives the best clustering as measured by the silhouette score? Is this answer surprising given what you observe in the PCA plot? Propose an explanation for any discrepancy.

**ANSWER:**



### Question 1d. 
Store the results of the K-means clustering analysis for the optimal K in a list called **clusters**. 
Use your list **clusters** to plot a bar chart indicating how many genomic regions are assigned to each cluster.
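One way to get the per-cluster counts that the bar chart needs is sketched below with `collections.Counter`. The list `toy_clusters` is a hypothetical stand-in for the real cluster assignments.

```python
from collections import Counter

# Hypothetical cluster assignments for 8 regions.
toy_clusters = [0, 2, 1, 0, 0, 2, 1, 0]

# Counter maps each cluster id to the number of regions assigned to it.
counts = Counter(toy_clusters)

for cluster_id in sorted(counts):
    print(cluster_id, counts[cluster_id])
```

The cluster ids and their counts can then serve as the x and y values of the bar chart.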

In [34]:
## Generate a variable called clusters to store the output of K-means clustering with the optimal K from the previous question.

## YOUR CODE HERE ## 

## Generate a bar chart to indicate how many genomic regions are assigned to each cluster. 

### Question 1e. 
Use the function below to plot a heatmap of the clusters.
The x-axis shows the mean of the H3K4me3, H3K4me1, H3K36me3, H3K9me3, and H3K27me3 values for each cluster. The y-axis lists the clusters that you generated. 

In [36]:
import numpy as np
import pandas as pd
import seaborn as sns
def plot_cluster_heatmap(data,clusters,k):
    '''
    k -- the value of k that gave the best clustering of the data. 
    data -- original data frame with genomic regions along the y-axis and histone marks along the x-axis 
    clusters -- list (or numpy array) of cluster assignments from K-means clustering. 
    '''
    #convert to a numpy array so the element-wise comparison below also works for a plain list
    clusters=np.asarray(clusters)
    cluster_summary=None
    for c in range(k): 
        #extract the genomic regions assigned to the current cluster c. 
        cur_subset=data.iloc[np.where(clusters==c)[0]]
        #calculate the mean value for each histone mark (column) in cur_subset 
        mean_histone_vals=cur_subset.mean(axis=0)
        if c==0: 
            cluster_summary=mean_histone_vals 
        else: 
            cluster_summary=pd.concat([cluster_summary,mean_histone_vals],axis=1,ignore_index=True)
    #three-color palette: white, grey, black
    flatui = ["#FFFFFF", "#CCCCCC","#000000"]
    sns.heatmap(cluster_summary.transpose(),annot=False,cmap=sns.color_palette(flatui))

    
## YOUR CODE HERE ##

The heatmap shows that each cluster corresponds to a distinct pattern of histone marks. The ChromHMM project has mapped these combinations of histone marks to the 15 distinct chromatin states included in the annotation file **15_state_chromHMM_annotations.tsv**:

http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/annotationEnrichment_RoadmapEp_coreMarks_15State.png.

Focus on the "Emissions" chart in the top left portion of the table: 


![ChromHMM 15 state Model](../Images/Weekly_HW_6_ChromHMM.png)

### Question 1f.
Which cluster maps to which chromatin state? Create a dictionary called "annotation_dict" that maps each cluster id to a ChromHMM annotation state.

Note: you might have more than 1 cluster mapping to a given chromatin state, and not all chromatin states may be used. This is because we are sampling 10,000 random regions from the genome -- it's possible we don't have good representation for some of the chromatin states in this rather small sample.
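The sketch below shows how such a dictionary can be applied once it is filled in, including the case where two clusters share a state. The cluster ids and state labels are hypothetical placeholders, not the expected answer.

```python
# Hypothetical mapping: clusters 0 and 2 both map to the same state.
toy_annotation_dict = {0: "Enhancer", 1: "Quiescent", 2: "Enhancer"}

# Hypothetical cluster assignments for 5 regions.
toy_clusters = [0, 2, 1, 1, 0]

# Look up the state label for each region's cluster id.
toy_states = [toy_annotation_dict[c] for c in toy_clusters]
print(toy_states)
```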


In [1]:
annotation_dict=dict() 
#For example: 
annotation_dict[0]="Enhancer"
#Fill in the other cluster - state mappings. 

## YOUR CODE HERE ## 

## Question 2

The ENCODE project has aggregated data from multiple transcription factor ChIP-seq experiments performed on the GM12878 lymphoblastoid cell line [(over 150 such experiments)!](https://www.encodeproject.org/search/?type=Experiment&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_type=immortalized+cell+line&organ_slims=blood&assembly=hg19&assay_title=ChIP-seq&biosample_term_name=GM12878&assay_title=ChIP-seq). We have downloaded and merged the peak calls from these experiments in the file **Week_7/GM12878_allTFBS.bed**. This is a standard bed file with columns for 

* Chromosome 
* Start of peak 
* End of peak 
* Name of TF (transcription factor) that binds to this site in the genome. 

Additionally, we have downloaded the ENCODE [active enhancers for the GM12878 cell line.](https://www.encodeproject.org/annotations/ENCSR580CJW/) These were obtained by analyzing H3K27ac ChIP-seq peaks throughout the genome, which are indicative of active enhancer regions. The annotations are stored in the bed file **Week_7/GM12878_enhancer_regions.bed**. 


Your mission is to identify the transcription factor motifs that are most enriched within the GM12878 active enhancer regions. 

### Question 2a


In [40]:
## Examine the format of both files by viewing the first ten lines

## YOUR CODE HERE ##

### Question 2b 
Use bedtools to sort both the Week_7/GM12878_enhancer_regions.bed file and the Week_7/GM12878_allTFBS.bed file. Store the sorted files as "Week_7/GM12878.enh.sorted.bed" and "Week_7/GM12878.TFBS.sorted.bed". <a name='sort'></a>
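On the command line, sorting typically takes the form `bedtools sort -i input.bed > output.bed`, which by default orders records by chromosome and then by start position. The pure-Python sketch below only illustrates that ordering on a hypothetical toy list of records; it is not a substitute for running bedtools on the assignment files.

```python
# Hypothetical BED records: (chromosome, start, end, name).
toy_bed = [
    ("chr2", 500, 700, "b"),
    ("chr1", 900, 950, "c"),
    ("chr1", 100, 300, "a"),
]

# Order by chromosome name first, then by start coordinate.
# Chromosome names are compared as strings here, so "chr10" sorts before "chr2".
sorted_bed = sorted(toy_bed, key=lambda rec: (rec[0], rec[1]))

for rec in sorted_bed:
    print("\t".join(str(field) for field in rec))
```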

In [41]:
## YOUR CODE HERE ##

### Question 2c. 
Use bedtools to identify all TF (transcription factor) binding sites that overlap an enhancer region in GM12878. Store these TF binding sites in a file called **Week_7/TFBS.enh.intersection.bed**.  Examine the contents of file Week_7/TFBS.enh.intersection.bed by viewing the first ten lines of the file.
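The command for this step has the general form `bedtools intersect -a A.bed -b B.bed`, with options such as `-wa` or `-u` controlling what is reported. To make the underlying overlap test concrete, the sketch below applies it to hypothetical toy records in pure Python, keeping each TF binding site that overlaps at least one enhancer. It assumes half-open BED intervals and is only an illustration of the logic.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

# Hypothetical TF binding sites: (chromosome, start, end, TF name).
toy_tfbs = [
    ("chr1", 100, 200, "Pol2"),
    ("chr1", 500, 600, "Cmyc"),
    ("chr2", 150, 250, "Mxi1"),
]
# Hypothetical enhancer regions: (chromosome, start, end).
toy_enh = [("chr1", 150, 400), ("chr2", 300, 400)]

# Keep a binding site if it overlaps any enhancer on the same chromosome.
hits = [
    site for site in toy_tfbs
    if any(site[0] == enh[0] and overlaps(site[1], site[2], enh[1], enh[2])
           for enh in toy_enh)
]
print(hits)
```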

In [42]:
## YOUR CODE HERE ## 

### Question 2d. 

Complete the function and fill in the code below to generate a dictionary called **TF_count_dict** with the following fields:

* Keys are the TFs that bind to enhancer regions within GM12878 (e.g. Pol2, Cmyc, Mxi1, ...) 
* Values are the number of times that each TF appears in a peak, that is, the number of lines in the file "TFBS.enh.intersection.bed" that contain this TF. 

It may be helpful to review the syntax for Python dictionaries from lecture 3. 
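As a reminder of the dictionary counting pattern, the sketch below tallies the fourth tab-separated field of a few hypothetical lines. The list `toy_lines` stands in for lines read from a file, and `toy_tally` is an illustrative name.

```python
# Hypothetical tab-delimited lines, as they would look after reading a BED file.
toy_lines = [
    "chr1\t100\t200\tPol2\n",
    "chr1\t500\t600\tCmyc\n",
    "chr2\t150\t250\tPol2\n",
]

toy_tally = {}
for line in toy_lines:
    # strip() removes the trailing newline so it does not end up in the key.
    name = line.strip().split("\t")[3]
    if name in toy_tally:
        toy_tally[name] += 1   # seen before: increment the count
    else:
        toy_tally[name] = 1    # first occurrence: start the count at 1

print(toy_tally)
```

Without the `strip()` call, a name in the last column would keep its newline character and be counted under a different key.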

In [None]:
## YOUR CODE HERE 
def tally_TF_counts(TF_file_name): 
    #read in the TF_file_name and store the file lines in a list called 'data'
    data=# Your code here 
    tally_dict=dict() 
    #iterate through each line in data, split the string contained in each line by the '\t' delimiter 
    for line in data: 
        tokens=line.split('\t')
        #check whether the 4th value in the list tokens (i.e. the TF that binds this site in the genome) 
        # already exists in tally_dict. You may find the "if in" syntax to be useful. 
        #Your code here 
        
        #if it doesn't exist, create an entry for it in tally_dict, with key = TF and value =1 
        #Your code here 
        
        #if it does exist, increment the count for the TF in the dictionary
        #Your code here 
    return tally_dict 


## Execute the function tally_TF_counts on the file that contains the intersection of peaks and known TF binding sites.
## Store the dictionary it returns as TF_count_dict

### Question 2e.  
Sort the TFs in your dictionary by count, from the TF that appears the largest number of times (at the top) to the TF that appears the smallest number of times (at the bottom).
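A minimal sketch of one sorting approach with pandas is shown below on a hypothetical dictionary of counts. With `orient="index"`, the dictionary keys become the row index and the counts land in a single column labeled `0`.

```python
import pandas as pd

# Hypothetical counts; the real values come from the tallied intersection file.
toy_counts = {"Cmyc": 3, "Pol2": 10, "Mxi1": 7}

# Keys become the index, values become column 0.
toy_df = pd.DataFrame.from_dict(toy_counts, orient="index")

# Sort by the count column, largest first.
toy_sorted = toy_df.sort_values(by=0, ascending=False)
print(toy_sorted.head())
```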

In [None]:
## hint: Use the function pandas.DataFrame.from_dict to convert your TF_count_dict into a pandas data frame, for easier sorting 
## use help(pd.DataFrame.from_dict) to learn about the inputs and outputs of this function 

## YOUR CODE HERE ##



In [None]:
## now, execute the function, passing "TF_count_dict" as the input and setting orient="index" 
## Store the data frame in variable "df"
df=## Your code here 


In [None]:
## Sort the dataframe by column 0 (i.e. this column contains the counts for each TF)
## You may find that the pandas.DataFrame.sort_values function is useful.
## Make sure you sort in *descending* order so that the most common TFs can be found with the head command. 
sorted_df=## Your code here 


### Question 2f. 
What are the five most common transcription factors that bind in active enhancers in GM12878?

**ANSWER:**

### Question 2g: 
Generate a bar graph representing the frequency of the 30 most common TFs. The x-axis should contain the name of the TF, and the y-axis should contain the number of times it occurs within an active enhancer in GM12878. 

In [69]:
## hint: You can use the .index attribute to obtain the TF names from the sorted_df data frame.

## hint: by default, the geom="bar" argument to qplot generates a histogram. To make it generate a bar graph with 
## pre-specified x and y values, add the argument stat="identity" to the qplot function. 

## hint: you might need to flip the x and y coordinates to make the bar graph axis labels legible

## YOUR CODE HERE ##
