# Big Data for Biologists: Decoding Genomic Function- Class 10

## What is GO term enrichment analysis?

##  Learning Objectives
***Students should be able to***

<ol>
<li> <a href=#catwc> Use the Unix command wc (word count) to count the lines in a file.</a> </li>
 <li> <a href=#GOtermIntro> Describe how the Gene Ontology is organized and what a "GO term" means. </a> </li>
 <li> <a href=#GOtermenrichment> Explain what GO term enrichment is </a></li>
 <li> <a href=#GeneIDtoName> Convert GeneIDs to gene names using the unix grep command </a></li>
 <li> <a href=#GOrilla> Use the GOrilla website to identify GOterms enriched in a set of genes </a></li>
 </ol>
 
## What are transcription factors and motifs?

##  Learning Objectives
***Students should be able to***
<ol> 
    <li>  <a href=#TFMotif>Explain what a transcription factor binding motif is</a> </li>
 <li> <a  href=#PWM>Construct a position weight matrix (PWM) for a transcription factor by analyzing known transcription factor binding sequences </a></li>
 <li> <a href=#PSSM>Make a position-specific scoring matrix (PSSM) from a PWM to use for transcription factor motif scanning</a></li>
    <li><a href=#Scan>Scan a DNA sequence for matches to a motif</a></li>
 <li> <a href=#Biopython>Become familiar with Biopython functions and modules: Align, Motif</a> </li>
 <li> <a href=#HOMER>Become familiar with common motif-scanning tools such as MEME. </a> </li>

</ol>

### Load data and import helper functions

In [None]:
#Imports helper functions for loading RNA_Seq data and kmeans algorithm 

%matplotlib inline
%load_ext autoreload
%autoreload 2
import matplotlib.pyplot as plt
import sys
sys.path.append('../helpers')
from kmeans_helpers import * 
from RNAseq_helpers import * 

## Use the Unix command wc (word count) to count the lines in a file.<a name='catwc' />

We can also use the wc (word count) command to quickly count the lines in a file.
The wc -l command prints the number of newline characters in the file, which equals the number of genes when the file lists one gene per line. Other flags print the number of words (-w) or characters (-m).
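
For comparison, the same count can be sketched in Python. This is a minimal illustration: the file name `demo_cluster.txt` and its three gene labels are made up for the demo and are not part of the course data.

```python
import os
import tempfile

def count_lines(path):
    """Count newline characters in a file, mirroring what `wc -l` reports."""
    with open(path, "rb") as handle:
        return handle.read().count(b"\n")

# Hypothetical three-gene cluster file, written to a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_cluster.txt")
    with open(demo_path, "w") as handle:
        handle.write("GENE_A\nGENE_B\nGENE_C\n")
    n_lines = count_lines(demo_path)

print(n_lines)
```

Like `wc -l`, this counts newline characters, so a final line without a trailing newline would not be counted.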

In [None]:
#Count the number of genes in each cluster 
#Cluster 0 
!wc -l 0.txt

In [None]:
#Count the number of genes in clusters 1, 2 and 3
!wc -l 1.txt
!wc -l 2.txt 
!wc -l 3.txt

## How is the gene ontology organized and what is a GO term? <a name='GOtermIntro' />

An ontology represents knowledge about some subject domain. An ontology consists of two parts: 
* Well-defined terms 
* Relationships between the terms. 

[The Gene Ontology](http://www.geneontology.org/) provides a way to annotate known information about genes. The Gene Ontology seeks to answer three questions about each gene: 


* Which functions does the gene product exert? ( **Molecular Function**) 

* With which biological process is the gene product associated? ( **Biological Process** ) 

* Where in the cell is the gene product located or active (for example, a cell part or protein complex)? (**Cellular Component**)


<img src="../Images/9-GOexplanation.png" style="width: 70%; height: 70%" align="center"//>

(figure credit: Rachel Huntley, "Introduction to the Gene Ontology and GO annotation resources", http://slideplayer.com/slide/7009132/)

Gene Ontology terms are organized in a hierarchy of 7 levels. The structure of GO can be described in terms of a graph, where each GO term is a node, and the relationships between the terms are edges between the nodes. GO is loosely hierarchical, with 'child' terms being more specialized than their 'parent' terms, but unlike a strict hierarchy, a term may have more than one parent term. 

For example: 
<img src="../Images/9-GOexample.png" style="width: 90%; height: 90%" align="center"//>
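
The graph structure described above can be sketched with a plain dictionary that maps each term to its parents. The four terms below form a simplified toy graph chosen for illustration; consult the Gene Ontology itself for the actual terms and relationships.

```python
# Toy ontology: each term maps to its list of parent terms.
# The graph is a simplified illustration, not an extract of the real GO.
parents = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
}

def ancestors(term, parents):
    """Return every term reachable by following parent edges upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

print(sorted(ancestors("cellular metabolic process", parents)))
```

Because a term can have more than one parent, the traversal keeps a set of visited terms so that shared ancestors are reported only once.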


## What is GO term enrichment?  <a name='GOtermenrichment' />

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRqyb_exm8Yzfe8_PfhCLGl5FwFLNerBoYJD7JVIsfnbNbEhu2_F8efs8UJCY9jTyB9SOTaw6a7eJWn/embed?start=false&loop=false&delayms=60000" frameborder="0" width="800" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

[GOrilla](http://cbl-gorilla.cs.technion.ac.il/) is a tool for identifying and visualizing enriched GO terms in ranked lists of genes.
It can be run in one of two modes:

*    Searching for enriched GO terms that appear densely at the top of a ranked list of genes or
*    Searching for enriched GO terms in a target list of genes compared to a background list of genes. We will use this mode to identify GO terms that are enriched in the set of 1543 differentially expressed genes identified in our four tissues of interest, compared with the background of all genes in the hg19 reference genome. 

First, we will run all 1543 differentially expressed genes through GOrilla to determine whether any GO terms are significantly enriched compared with the background of all genes in the hg19 reference genome. We have written the differential genes to an output file called 'differential_gene_ids.txt'. The file 'hg19.txt' contains a list of all gene IDs and gene names in the hg19 reference genome.  
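
To make "enriched compared with a background" concrete, here is a minimal sketch of a hypergeometric test, a common way to score enrichment of a target list against a background list. Whether it matches GOrilla's exact procedure is an assumption, and every number below is hypothetical rather than a result from this notebook.

```python
from math import comb

def hypergeom_enrichment_pvalue(N, B, n, b):
    """P(X >= b) when n genes are drawn from N, of which B carry the GO term."""
    total = comb(N, n)
    upper = min(n, B)
    tail = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return tail / total

# Hypothetical numbers: 20000 background genes, 400 annotated with the term,
# 1543 target genes, 60 of which carry the term.
p_value = hypergeom_enrichment_pvalue(N=20000, B=400, n=1543, b=60)
expected_by_chance = 1543 * 400 / 20000  # about 31 genes
print(expected_by_chance, p_value)
```

A small p-value says that seeing 60 or more annotated genes in the target list would be unlikely if the target genes were drawn at random from the background.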

In [None]:
!head differential_gene_ids.txt

## Convert GeneIDs to Gene Names <a name='GeneIDtoName' />

GOrilla expects gene names rather than gene IDs as input, so we must convert the ENSEMBL IDs to official gene symbols. The code below will perform this conversion. 

In [None]:
#The file "differential_gene_id_to_gene_name.txt" maps gene IDs to gene names. Examine the contents of this file: 
!head differential_gene_id_to_gene_name.txt

In [None]:
#We iterate through the gene id's in our list and find the corresponding gene names. 
!grep -f differential_gene_ids.txt differential_gene_id_to_gene_name.txt > tmp.txt
#let's examine the output of the grep command: 
!cat tmp.txt 


In [None]:
#select the second column from the grep output
!cut -f2 tmp.txt > differential_gene_names.txt

#Examine the first 10 lines in the resulting file.
!head -n10 differential_gene_names.txt
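
The same ID-to-name lookup can be sketched in Python with a dictionary. The mapping rows and IDs below are placeholders standing in for the contents of differential_gene_id_to_gene_name.txt, which is assumed to be tab-separated with the ID in the first column and the name in the second.

```python
# Placeholder two-column (tab-separated) mapping; the rows are illustrative only.
mapping_text = (
    "ENSG_EXAMPLE_1\tGENE_A\n"
    "ENSG_EXAMPLE_2\tGENE_B\n"
    "ENSG_EXAMPLE_3\tGENE_C\n"
)
wanted_ids = ["ENSG_EXAMPLE_3", "ENSG_EXAMPLE_1"]

# Build an ID -> name dictionary from the mapping rows
id_to_name = {}
for line in mapping_text.splitlines():
    gene_id, gene_name = line.split("\t")
    id_to_name[gene_id] = gene_name

# Look up each requested ID exactly; IDs missing from the mapping are skipped
gene_names = [id_to_name[g] for g in wanted_ids if g in id_to_name]
print(gene_names)
```

Unlike `grep -f`, the dictionary lookup matches whole IDs only and returns names in the order of the query list rather than the order of the mapping file.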

In [None]:
# cut the second column from the file hg19.txt to get the names (rather than ENSEMBL id's) of all genes in hg19. 
#! cut -f2 hg19.txt >  hg19.names.txt 
!head -n10 hg19.names.txt

## Use the GOrilla website to identify GOterms enriched in a set of genes <a name='GOrilla' />

We are now ready to use GOrilla to check for enriched GO terms. 
First, navigate to the GOrilla portal: <a href="http://cbl-gorilla.cs.technion.ac.il/">http://cbl-gorilla.cs.technion.ac.il/</a>

Follow these steps in the GOrilla portal: 
* select "Homo sapiens" for "Choose organism" 
* Select "Two unranked lists of genes" from "Choose running mode" 
* Upload the file "differential_gene_names.txt" for the Target set. 
* Upload the file "hg19.names.txt" for the Background set. 
* Select "All" under "Choose an ontology". 
* Click on "Search Enriched GO terms"  
* Examine the output by clicking on "Process", "Function", and "Cellular Component" tabs.  



The top 5 hits for Process should be: 
![Process](../Images/10_process_go_allgenes.update.png)
The top 5 hits for Function should be: 
![Function](../Images/10_function_go_allgenes.update.png)
The top 5 hits for Cellular Component should be: 
![Cellular Component](../Images/10_component_go_allgenes.png)


GOrilla also generates the graph of inter-related GO terms, color-coding them by significance.

Re-run the GOrilla analysis with each cluster of genes: upload 0.txt, 1.txt, 2.txt, and 3.txt to GOrilla and compare the significant Process, Function, and Cellular Component GO terms that are returned.

## Cluster 0 
From the heatmap, Cluster 0 appears to contain genes that are downregulated in Blood and upregulated in Embryonic cells. 

### Process: 
![0_process](../Images/10_0.process.png)
### Function: 
![0_function](../Images/10_0.function.png)
### Cellular Component:
![0_component](../Images/10_0.component.png)

## Cluster 1 
From the heatmap, Cluster 1 contains genes that are moderately upregulated in Blood and the Immune system. 

### Process: 
![1.process](../Images/10_1.process.png)
### Function: 
![1.function](../Images/10_1.function.png)
### Cellular Component:
![1.component](../Images/10_1.component.png)

## Cluster 2 
From the heatmap, Cluster 2 contains genes that are strongly upregulated in Blood and the Immune system. 


### Process: 
![2.process](../Images/10_2.process.png)
### Function: 
![2.function](../Images/10_2.function.png) 
### Cellular Component:
![2.component](../Images/10_2.component.png) 

## Cluster 3

From the heatmap, cluster 3 contains genes that are upregulated in the Respiratory system and Embryonic samples. 

### Process: 
![3.process](../Images/10_3.process.png) 
### Function: 
![3.function](../Images/10_3.function.png) 
### Cellular Component:
![3.component](../Images/10_3.component.png)


Web-based tools such as GOrilla are convenient for small numbers of queries. However, more complex analyses often involve dozens of clusters rather than the 4 we have here. 

### ***This concludes our analysis of RNA-seq data. We will now transition to discussing the mechanisms that regulate gene expression.***

We will begin by examining transcription factors, proteins that control the rate of gene expression.  


## What is a transcription factor?<a name='TFmotif' />

In molecular biology, <b>a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence</b>. The function of TFs is to regulate genes, turning them on and off, so that they are expressed in the right cell, at the right time, and in the right amount throughout the life of the cell and the organism. Groups of TFs function in a coordinated fashion to direct cell division, cell growth, and cell death throughout life; cell migration and organization (body plan) during embryonic development; and intermittently in response to signals from outside the cell, such as a hormone. There are up to 2600 TFs in the human genome.

TFs work alone or with other proteins in a complex, by promoting (as an <b>activator</b>), or blocking (as a <b>repressor</b>) the <b>recruitment of RNA polymerase</b> (the enzyme that performs the transcription of genetic information from DNA to RNA) to specific genes.

A defining feature of TFs is that they contain at least one <b>DNA-binding domain (DBD), which has affinity to specific short DNA subsequences (~4-20 bp) collectively represented as a binding motif (pattern)</b>. TFs are grouped into classes based on their DBDs. Other proteins such as coactivators, chromatin remodelers, histone acetyltransferases, histone deacetylases, kinases, and methylases are also essential to gene regulation, but lack DNA-binding domains, and therefore are not TFs.

TFs are of interest in medicine because TF mutations can cause specific diseases, and medications can be potentially targeted toward them.

(Adapted from <a href="https://en.wikipedia.org/wiki/Transcription_factor">https://en.wikipedia.org/wiki/Transcription_factor</a>)

## What is a transcription factor binding motif?<a name='TFMotif' />

![Transcription factor binds to DNA motif](images/tf_motif.png)

Each transcription factor (TF) has strong chemical binding affinity to some short DNA sequences and weaker affinity to others. Hence, <b>the motif of a TF is not a single deterministic sequence</b>. Instead, you should think of the binding motif of a TF as a collection of all possible subsequences (possibly of variable length), each of which has a different affinity score, i.e. a distribution of affinity scores. A very long list of subsequences with associated affinity scores is clearly a clunky representation of a motif. <b>Can we develop a compact representation of a binding motif that faithfully summarizes the variable affinity of the TF to different subsequences?</b>

Let us assume we were able to conduct a sequencing experiment in some cell type, that provided us a sample of subsequences in the genome to which a TF binds.

For example, we have found that the GATA transcription factor binds to the following 4 sequences: 

In [None]:
#tf_binding_sequences.fa is a file with known sequences that bind to the GATA transcription factor
with open("tf_binding_sequences.fa",'r') as f:
    sequences=f.read().strip().split('\n')
sequences 

In [None]:
#We import some printing and helper functions for this notebook. Don't worry about this code block.
import pprint
pp = pprint.PrettyPrinter(indent=4)
import sys
sys.path.append('../helpers')
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from viz_sequence import * 

## Construct a transcription factor binding motif representation called a position weight matrix (PWM)<a name='PWM' />  

We are given a collection of binding instances (subsequences) in the genome to which the TF binds and we want to derive a compact representation of the TFs binding motif.

The main principles behind deriving a motif representation are as follows:
1. The subsequences that a TF likes to bind to (high affinity) are typically quite similar with some mismatches, insertions and deletions (remember alignments?)
2. The frequency of a subsequence in the collection of binding instances is proportional to the affinity of the TF to that subsequence, i.e. if a TF has high affinity to a specific subsequence, that subsequence will be found more often in the collection of binding instances.

Based on these two principles, deriving a compact representation of a motif involves:
1. Using multiple sequence alignment to align all the binding instance subsequences (which could be of variable length) allowing for mismatches, insertions and deletions
2. Counting frequencies of aligned subsequences, which can alternatively be recorded as counting frequencies of each type of nucleotide (A,C,G,T) at each position across all the aligned subsequences.

## From Multiple Sequence Alignment to PWM

We use the [MUSCLE algorithm](http://www.ebi.ac.uk/Tools/msa/muscle/) to perform multiple sequence alignment and generate a consensus sequence. Refer to Tutorial 5 to refresh your knowledge of multiple sequence alignment. 

In [None]:
from Bio.Align.Applications import MuscleCommandline
muscle_cline = MuscleCommandline(input="tf_binding_sequences.fa",fastaout='tf_binding_sequences_out.afa')
muscle_cline()
!cat tf_binding_sequences_out.afa

#### We examine the aligned portions of the sequences using the AlignIO module:

In [None]:
#The Biopython module AlignIO is an interface for reading and writing sequence alignments
#We use it here to view a formatted alignment

import io 
from Bio import AlignIO

align = AlignIO.read("tf_binding_sequences_out.afa","fasta")
print(align)

#### We remove poorly aligned positions

In [None]:
sub_alignment=align[:,3:8]
print(sub_alignment)

#### Now, we will record the frequencies of each nucleotide at each position in the aligned set of subsequences

In [None]:
from Bio.motifs import Motif,Instances


One problem is that the raw counts (frequencies) of each nucleotide at each position would depend on the total number of subsequences sampled.

To avoid this dependence, we can instead <b>record proportions or probabilities (ranging from 0 to 1)</b> of observing each nucleotide at each position. i.e. we need to divide (<b>normalize</b>) the raw frequencies (counts) of each nucleotide at each position by the total number of subsequences. This representation of the binding motif of a TF is known as its <b>position weight matrix (PWM)</b>.

We notice that the first column contains base C twice, and base G twice. The second column always contains base G, and the third, fourth, and fifth columns always contain the bases A,T, and A, respectively. We can quantify this information with a position weight matrix (PWM). 
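
The counting and normalization steps can be sketched in plain Python. The four sequences below are an assumption: they were chosen to reproduce the column pattern just described (C or G first, then G, A, T, A) and may differ from the actual aligned sequences in the file.

```python
# Assumed aligned sequences, chosen to match the column pattern described above
aligned = ["CGATA", "CGATA", "GGATA", "GGATA"]
bases = "ACGT"

# counts[base][i] = number of sequences with `base` at position i
length = len(aligned[0])
counts = {base: [0] * length for base in bases}
for seq in aligned:
    for i, base in enumerate(seq):
        counts[base][i] += 1

# Normalize each column by the number of sequences to get probabilities (the PWM)
pwm = {base: [c / len(aligned) for c in counts[base]] for base in bases}

for base in bases:
    print(base, pwm[base])
```

Dividing by the number of sequences is what makes the matrix independent of how many binding instances were sampled.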

In [None]:
#We iterate through each sequence a in our sub-alignment and extract the seq attribute
#This is the character string that encodes the DNA sequence from the alignment. 
aligned_sequences=[a.seq for a in sub_alignment]
print(aligned_sequences)

In [None]:
#We store the sequences from the alignment in an Instances object.
#"Instances" is a bit vague: here it refers to the individual occurrences
#(instances) of a motif, i.e. the aligned binding sequences.
instances=Instances(instances=aligned_sequences)
m=Motif(instances=instances)
print(m)

In [None]:
#Print the consensus sequence for the motif 
m.consensus

In [None]:
#print the position weight matrix for the motif 
m.pwm

In [None]:
#convert the Pwm dictionary to an array and plot the PWM sequence logo 
import numpy as np 
pwm=np.asarray([m.pwm['A'],m.pwm['C'],m.pwm['G'],m.pwm['T']])
print(pwm)
plot_weights(pwm)

#### Question:

Why do the values of the rows in every column in the PWM sum to 1?

## Accounting for genomic background frequencies of nucleotides i.e. converting a PWM into a position specific scoring matrix (PSSM)<a name='PSSM' />

A PWM records the positional nucleotide probabilities based on an observed sample of binding instances (subsequences) from the genome. 

#### But what would these probabilities look like if we just randomly sampled subsequences of the same length from the rest of the genome?

Imagine a genome that is very rich in Gs and Cs. Then if I randomly sampled subsequences from the genome, I would get high probabilities of Gs and Cs just by chance.

Hence, we need to <b>normalize</b> our PWM by these <b>random background probabilities</b> of nucleotides from the genome.

For simplicity, we assume that the bases A,C,G,T all occur with equal frequency (p=0.25). Note: this assumption rarely holds in reality! What you would actually do is count the total number of As, Cs, Gs and Ts in a particular genome and then divide by the size of the genome to obtain the background probabilities of each nucleotide.
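
As a rough sketch of that counting procedure, the function below tallies the four bases in a sequence and divides by the total. The short GC-rich string is a made-up stand-in for a genome, and skipping ambiguous characters such as N is an assumption of this sketch.

```python
from collections import Counter

def background_frequencies(sequence):
    """Fraction of A, C, G and T among the unambiguous bases of a sequence."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

# Hypothetical GC-rich toy sequence standing in for a genome
toy_genome = "GCGCGGCCATGCGCTAGCGC"
background = background_frequencies(toy_genome)
print(background)
```

In this toy sequence G and C each make up 40% of the bases, so a motif rich in G and C would be less surprising here than under the uniform 0.25 assumption.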

In [None]:
m.background

The normalized PWM, which we will call a <b>position-specific scoring matrix (PSSM)</b>, can be represented as a log-odds matrix, i.e. at each position 'i' and for each nucleotide, we record the log2 of the ratio of the probability in the PWM to the probability in the background.

We calculate the PSSM in accordance with the following formula: ![PSSM formula](images/pssm_formula.png)

#### Unlike the PWM which has all values between 0 and 1 (probabilities), the PSSM can have positive or negative values in each position.

Positive values indicate a positive log odds i.e. greater chance of observing that nucleotide relative to background.

Negative values indicate a negative log odds i.e. lower chance of observing that nucleotide relative to background.
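
The log-odds calculation for a single position can be sketched as follows. The column values and the uniform background of 0.25 are assumptions for illustration; a probability of 0 is mapped to negative infinity because the logarithm of 0 is undefined.

```python
import math

def log_odds(probability, background=0.25):
    """log2 of the PWM probability divided by the background probability."""
    if probability == 0:
        return float("-inf")  # the base was never observed at this position
    return math.log2(probability / background)

# One PWM column in A, C, G, T order (values assumed for illustration)
column = [0.0, 0.5, 0.5, 0.0]
column_scores = [log_odds(p) for p in column]
print(column_scores)
```

A base seen exactly as often as the background predicts would score 0, which is the boundary between the positive and negative cases described above.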

In [None]:
m.pssm

We want to avoid the -inf values in our matrix, as mathematical operations break down for infinite values. To avoid this problem, we can add "pseudocounts" -- small numbers we add for each position in the PWM to avoid taking the log of 0. We add the pseudocounts to our "counts" matrix and recompute the PWM & PSSM. 
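
The effect of a pseudocount can be sketched on a single position. The sketch assumes that the pseudocount is added to every base count before renormalizing and that the background is uniform; the exact normalization inside Biopython may differ in detail.

```python
import math

def pssm_column(counts, pseudocount=0.0, background=0.25):
    """Log-odds scores for one position, given base counts in A, C, G, T order."""
    total = sum(counts) + pseudocount * len(counts)
    scores = []
    for count in counts:
        probability = (count + pseudocount) / total
        if probability > 0:
            scores.append(math.log2(probability / background))
        else:
            scores.append(float("-inf"))
    return scores

# Assumed counts for the first motif position: C twice, G twice
first_column = [0, 2, 2, 0]
without_pseudocount = pssm_column(first_column)
with_pseudocount = pssm_column(first_column, pseudocount=0.001)
print(without_pseudocount)
print(with_pseudocount)
```

With the pseudocount, unobserved bases receive a large negative but finite score, while the scores of observed bases barely change.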

In [None]:
#get the original pseudocounts, they should be 0 
print("Original pseudocounts:")
print(m.pseudocounts)

#Update the pseudocounts to 0.001 
m.pseudocounts['A']=0.001
m.pseudocounts['C']=0.001
m.pseudocounts['G']=0.001
m.pseudocounts['T']=0.001

print("Updated pseudocounts:")
print(m.pseudocounts)


In [None]:
#Re-calculate the PSSM; we should no longer observe values of -inf 
pssm=m.pssm 
for base in ['A','C','G','T']: 
    pssm[base]=[round(val) for val in pssm[base]]
pssm

In [None]:
#plot the PSSM 
pssm=np.asarray([pssm['A'],pssm['C'],pssm['G'],pssm['T']])
print(pssm)
plot_weights(pssm)

## The inverse problem: Scanning a DNA sequence with a PSSM to identify good matches i.e. likely binding instances<a name='Scan' />

The PSSM encodes the log odds of the probability of observing a specific nucleotide at each position across a collection of subsequences that a TF would bind to, relative to genomic background. 

Now, let's invert the problem.

#### We are given a PSSM of a TF (of length 'k') and we are given a long sequence (say an entire chromosome). We want to find all the positions in the long sequence that are strong matches to the PSSM i.e. positions to which the TF is likely to bind.

Formally, we can rephrase this statement as a concrete set of steps (an algorithm):

We are given a PSSM of a TF (of length 'k') and a long sequence (say an entire chromosome) of length 'N'. 
1. We scan this DNA sequence, one position at a time (i = 1 to N-k+1, so that every window fits inside the sequence)
2. At each position 'i', we obtain a subsequence of length 'k', starting at 'i' and ending at 'i+k-1'
3. We score this subsequence in terms of how good of a match it is to the PSSM
4. We decide some threshold on this score to decide if it is a good match to the PSSM or not
5. Positions that pass the threshold are good matches (likely binding sites) of the TF.
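
The five steps above can be sketched as a plain Python loop. The rounded scores of 1, 2 and -10 are an assumption (a uniform background with a pseudocount of 0.001), the threshold of 0 is arbitrary, and positions are 0-based here, unlike the 1-based description in the steps.

```python
# Rounded PSSM for the 5-bp motif (assumed values: uniform background,
# pseudocount of 0.001); each list holds one score per motif position.
pssm = {
    "A": [-10, -10, 2, -10, 2],
    "C": [1, -10, -10, -10, -10],
    "G": [1, 2, -10, -10, -10],
    "T": [-10, -10, -10, 2, -10],
}

def scan(sequence, pssm, threshold):
    """Return (start, score) for every window whose score reaches the threshold."""
    k = len(pssm["A"])
    hits = []
    for start in range(len(sequence) - k + 1):  # steps 1 and 2: slide a window
        window = sequence[start:start + k]
        score = sum(pssm[base][i] for i, base in enumerate(window))  # step 3
        if score >= threshold:  # steps 4 and 5: keep windows above the threshold
            hits.append((start, score))
    return hits

hits = scan("GCATTACCGATAA", pssm, threshold=0)
print(hits)
```

The loop visits N-k+1 windows and does k lookups in each, so the cost grows with the product of sequence length and motif length.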

Steps (1) and (2) are obvious. 

### Computing the match score of a subsequence 

How do we perform Step (3), i.e. how do we obtain a match score for a sequence of length 'k' against a PSSM of length 'k'?

We simply sum the log odds scores of each of the nucleotides observed in the sequence at each position.

Let's say our sequence is CGATA and we want to score it against the PSSM of length 5 that we obtained above. 

Position 1 in the sequence is a C. Looking up the log odds of C (row 2) in position 1 (col 1) of the PSSM, we get a score of 1.

Position 2 in the sequence is a G. Looking up the log odds of G (row 3) in position 2 (col 2) of the PSSM, we get a score of 2.

Position 3 in the sequence is an A. Looking up the log odds of A (row 1) in position 3 (col 3) of the PSSM, we get a score of 2.

Position 4 in the sequence is a T. Looking up the log odds of T (row 4) in position 4 (col 4) of the PSSM, we get a score of 2.

Position 5 in the sequence is an A. Looking up the log odds of A (row 1) in position 5 (col 5) of the PSSM, we get a score of 2.

Match score = 1+2+2+2+2 = 9

This is a high positive score, which means that CGATA is much more likely to be observed under the motif model than under the genomic background.

So CGATA is a good match to the PSSM. If you visually cross-check with the PWM or PSSM logo, you will notice that CGATA is in fact a good match.

#### Question: 

Calculate the score for CGGGG. Is it a good match to the PSSM?

### One-hot Encoding

Going back to the five steps above, how can we write a program to scan a longer sequence with the PSSM to obtain a row of match scores for each position in the sequence?

To do this efficiently, we will use a slightly different representation for the DNA sequence called a 'one-hot-encoding'.

<b>'One-hot-encoding'</b> is a process that transforms a sequence of nucleotides of length 'N' into a matrix of 1's and 0's with 4 rows (representing A,C,G and T) and 'N' columns (representing positions).

In the graphic below, a colored square represents a '1', indicating the presence of the specified nucleotide at that position along the sequence. A blank square represents a '0', indicating the absence of the specified base at the position. 

![one-hot-encoding](images/one_hot.png)

In [None]:
#we write a function to perform one-hot encoding. 
def one_hot_encode(sequence): 
    encoding_dict=dict() 
    encoding_dict['A']=[1,0,0,0]
    encoding_dict['C']=[0,1,0,0]
    encoding_dict['G']=[0,0,1,0]
    encoding_dict['T']=[0,0,0,1]
    
    encoded_sequence=[] 
    for base in sequence: 
        encoded_sequence.append(encoding_dict[base])
    return np.array(encoded_sequence).transpose()

#We now one-hot encode a DNA sequence 
sequence="GCATTACCGATAA"
encoded_sequence=one_hot_encode(sequence)
encoded_sequence

### The inner product
We now scan the PSSM (length 5) across the entire sequence by aligning the PSSM with every subsequence of length 5, starting at each position in turn, and computing the match score as we did above.

Note that the match score for a subsequence is simply the sum of position-wise scores from the PSSM corresponding to each nucleotide in the subsequence.

An efficient way to obtain this match score of a subsequence aligned to a PSSM is an <b>inner product</b> of the PSSM matrix with the one-hot encoded matrix of the aligned subsequence. 

An inner product is simply a sum of the element-by-element multiplication of the values in the two matrices i.e. the PSSM matrix (4 rows, 5 columns) and the one-hot encoded subsequence (4 rows, 5 columns).

An inner product is often represented mathematically as a < , > operator. E.g. if 'W' is the PSSM matrix and 'X' is the one-hot encoded subsequence, their inner-product is written as < W , X >

![Convolution product](images/convolution3.png)
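
A minimal NumPy sketch of the inner product for a single aligned subsequence is shown below. The matrix `W` holds rounded PSSM scores that are assumed here (uniform background, pseudocount of 0.001), and `one_hot` is a compact helper written for this sketch.

```python
import numpy as np

# Rounded PSSM (rows A, C, G, T; columns are motif positions). The values are
# assumed from a uniform background and a pseudocount of 0.001.
W = np.array([
    [-10, -10,   2, -10,   2],   # A
    [  1, -10, -10, -10, -10],   # C
    [  1,   2, -10, -10, -10],   # G
    [-10, -10, -10,   2, -10],   # T
])

def one_hot(sequence):
    """4 x len(sequence) matrix with a single 1 per column (rows A, C, G, T)."""
    X = np.zeros((4, len(sequence)), dtype=int)
    for i, base in enumerate(sequence):
        X["ACGT".index(base), i] = 1
    return X

X = one_hot("CGATA")
inner_product = int(np.sum(W * X))  # <W, X>: multiply element-wise, then sum
print(inner_product)
```

Element-wise multiplication followed by a sum is all that `< W , X >` means for two matrices of the same shape.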

#### Question:
Can you explain how the inner product of the PSSM and the one-hot encoded sequence is equivalent to the sum of position-wise scores from the PSSM corresponding to each nucleotide in the subsequence?

### Convolution operation

There is also a built in operation known as the <b>'convolution'</b> that can perform a scanning inner-product of one short matrix (PSSM) against a longer matrix (one-hot encoded longer sequence). The convolution directly outputs the match scores at each position in the sequence.

We can then simply restrict to positions with positive match scores (log-odds) as the positions that will be most likely bound by the TF.


![Convolution formula](images/optimal_convolution_product.png)


We can perform the motif scan in Python by following the steps below:

In [None]:
from scipy.signal import fftconvolve
def scan_sequence(sequence,pssm): 
    convolution_product=fftconvolve(sequence[::-1,::-1],pssm,mode="same")[::-1,::-1][1]
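    #get the starting position (0-based) of the motif along the sequence.
    #In "same" mode the output is centered, so the best-scoring index is offset
    #from the window start by (k-1)//2, where k is the motif length.
    starting_pos=np.argmax(convolution_product)-(pssm.shape[1]-1)//2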
    return convolution_product,starting_pos

convolution_product,starting_pos=scan_sequence(encoded_sequence,pssm)
print("Convolution product:"+str(convolution_product))
print("Motif starting position along the sequence:"+str(starting_pos))

## Motif scanning with Biopython <a name='Biopython' />

Luckily, Biopython has built-in functionality for motif scanning that performs the convolution operations above with a lot less code. To achieve the same result as above, we can simply run the following commands in Biopython: 

In [None]:
from Bio.Seq import Seq
seq_object=Seq(sequence, m.alphabet)
for position, score in m.pssm.search(seq_object, threshold=-100):
    print("Position %d: score = %5.3f" % (position, score))

The negative positions refer to instances of the motif found on the reverse strand of the test sequence, and follow the Python convention on negative indices. You can also calculate the scores at all positions along the sequence: 

In [None]:
m.pssm.calculate(seq_object)

In general, this is the fastest way to calculate PSSM scores. The scores returned by pssm.calculate are for the forward strand only. To obtain the scores on the reverse strand, you can take the reverse complement of the PSSM: 

In [None]:
rpssm=m.pssm.reverse_complement()
rpssm.calculate(seq_object)

If you want to use a less arbitrary way of selecting thresholds, you can explore the distribution of PSSM scores. Since the space for a score distribution grows exponentially with motif length, we are using an approximation with a given precision to keep computation cost manageable: 

In [None]:
distribution = m.pssm.distribution(background=m.background, precision=10**4)

The distribution object can be used to determine a number of different thresholds. We can specify the requested false-positive rate (probability of “finding” a motif instance in background generated sequence): 

In [None]:
threshold = distribution.threshold_fpr(0.01)
threshold

or the false-negative rate (probability of “not finding” an instance generated from the motif): 

In [None]:
threshold = distribution.threshold_fnr(0.1)
threshold

or a threshold (approximately) satisfying some relation between the false-positive rate and the false-negative rate (fnr/fpr≃ t): 

In [None]:
threshold = distribution.threshold_balanced(1000)
threshold
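
To see what a false-positive-rate threshold means, the sketch below enumerates every possible 5-mer, scores it, and picks the lowest threshold that lets through at most the requested fraction of background sequences. It assumes a uniform background and rounded PSSM scores, and it is a brute-force illustration rather than the approximation that Biopython uses.

```python
from collections import Counter
from itertools import product

# Rounded PSSM for the 5-bp motif (assumed values: uniform background,
# pseudocount of 0.001).
pssm = {
    "A": [-10, -10, 2, -10, 2],
    "C": [1, -10, -10, -10, -10],
    "G": [1, 2, -10, -10, -10],
    "T": [-10, -10, -10, 2, -10],
}

def fpr_threshold(pssm, fpr):
    """Lowest score threshold whose background hit rate is at most `fpr`.

    Assumes a uniform background, so every k-mer is equally likely.
    Returns None when even the best score alone exceeds the requested rate.
    """
    k = len(pssm["A"])
    tally = Counter(
        sum(pssm[base][i] for i, base in enumerate(kmer))
        for kmer in product("ACGT", repeat=k)
    )
    total = sum(tally.values())  # 4**k background k-mers
    threshold, passing = None, 0
    for score in sorted(tally, reverse=True):
        passing += tally[score]
        if passing / total > fpr:
            break
        threshold = score
    return threshold

print(fpr_threshold(pssm, 0.01))
```

Enumeration is feasible only for short motifs, since the number of k-mers grows as 4 to the power k, which is why an approximate score distribution is used in practice.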

## Become familiar with common motif-scanning tools such as MEME <a name='HOMER' />

[The MEME suite](http://meme-suite.org/) allows you to scan regions of the genome for enriched motifs. There are a number of tools included in the MEME suite: 

<img src="../Images/10_MEME.png" style="width: 50%; height: 50%" align="center"//>

We will begin by using the "MEME" tool to discover motifs that are enriched in our list of peaks. 

Navigate to the MEME portal: http://meme-suite.org/tools/meme


In [None]:
#Examine the contents of peaks.txt
#these are 5 peak regions along chromosome 3 in the hg19 genome. 
#We have extracted these peaks from an ENCODE dataset. 
! cat peaks.txt

In [None]:
#extract the underlying fasta sequence for these peaks. 
#refer to tutorial 3 
!fastaFromBed  -fi /opt/data/hg19.genome.fa -bed peaks.txt > peaks.fa

In [None]:
!cat peaks.fa

Upload the **peaks.fa** file to MEME to search for enriched motifs. Since we have a small peak list, it might be faster to just paste in the sequences directly. 

When the job completes, you will see a screen that looks like the following: 

<img src="../Images/10_MEME_output.png" style="width: 50%; height: 50%" align="center"//>
Let's examine the result files that were generated. 

Click on **MEME HTML output**

<img src="../Images/10_MEME_zoomed.png" style="width: 50%; height: 50%" align="center"//>


Click on the **Submit/Download** link to compare the motifs found by MEME against a database of known motifs. The link will take you to the [TOMTOM tool](http://meme-suite.org//tools/tomtom) 
<img src="../Images/10_TOMTOM.png" style="width: 50%; height: 50%" align="center"//>
