## Big Data for Biologists: Decoding Genomic Function - Class 14

##  Learning Objectives
***Students should be able to***
<ol>
<li> <a href=#GWAS>Describe the experimental setup for a typical GWAS study. </a></li>
<li> <a href=#GWAS>Explain why linkage disequilibrium (LD) analysis may be necessary for identifying causal SNPs from GWAS studies </a></li>
<li> <a href=#GWAS>Interpret a plot or table with p-values (or -log<sub>10</sub>p-values) for SNPs to identify which SNPs may be associated with a disease </a></li>
<li> <a href=#GWAS>Discuss how GWAS studies can be used to identify candidate causal variants for a disease </a></li> 
<li> <a href=#GWAS>Recognize the limitations of identifying causal variants from GWAS studies </a></li> 
<li> <a href=#Workflow>Participate in a collaborative programming project to gain insights into the workflow for computational projects.</a></li>
<li> <a href=#Roles>Experience different roles in a computational project including code implementer and documentation provider.</a></li>
<li> <a href=#LD>Find variants in linkage disequilibrium (LD) with a target variant using tabix and PLINK.</a></li>
<li> <a href=#Dataanalysis> Apply data analysis methods from the course to new problems </a></li> 
</ol>

## Introduction to GWAS  <a name='GWAS'>

In [1]:
from IPython.display import HTML
HTML('<iframe src="https://drive.google.com/file/d/1zGkSZom9fB63QaHDPHkWSD6aoVn2nk3O/preview" width="1000" height="480"></iframe>')

WashU Browser link for FTO gene: http://epigenomegateway.wustl.edu/browser/?genome=hg19&session=EnEB3ADZOF&statusId=276059485

## Introduction to Course Project  <a name ='Workflow'> <a name ='Roles'> 


The course project will have three main steps.  

I.  You will be given the SNP identifiers (rs ids) for two variants in the human genome. For each of these variants you will: 

* write the code using the notebook guidelines below. 
* analyze the output of the code 
* create an initial draft of a report including:  
   <ol>
    <li> the likely causal variants </li>
    <li> an explanation of why they are the likely causal variants </li>
    <li> what you have learned about how the variant acts (i.e., through what type of mutation in a coding region or what type of element in a non-coding region) </li>
   </ol>

II. You will proofread and check your teammate's work. 
* Switch variants with your teammate and run through their code.
* The outputs of your files should be identical.
* Add comments to annotate any code that may need clarification. 

III. You will write up the report (see rubric for guidelines). 

Deliverables: 
* Jupyter notebook with code 
* Writeup document with summary

### Grading Rubric: 

The rubric will be posted on Canvas with the assignment.

### Project files: 
All project files can be found in the folder **/opt/data/project**

* /opt/data/project/1kg_phase1_all*   -- binary variant files
* /opt/data/project/gene_coords_hg19.bed -- bed file of gene coordinates 
* /opt/data/project/gencode.hg19.annotation.gtf -- gene annotation file 
* /opt/data/project/motifs.bed -- coordinates of all transcription factor-binding motifs in the genome. 
* /opt/data/project/active_promoters_across_cell_type.bed 
* /opt/data/project/active_enhancers_across_cell_type.bed 


## STEP 1: Given a target variant that has been linked to a disease, what are the candidate causal variants in LD with the target variant?   

In [1]:
#Find all single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with your target variants.
#YOUR CODE HERE

## Linkage Disequilibrium Example with Tabix and PLINK <a name ='LD'>

[An article in the New England Journal of Medicine](http://www.nejm.org/doi/full/10.1056/NEJMoa1502214?rss=searchAndBrowse&#t=article) presented a GWAS in 52 participants who were homozygous for the risk allele for the tag variant rs1558902. This variant occurs in an intron of the FTO gene, which has previously been linked to obesity. However, since a strong GWAS association of a variant with a phenotype is insufficient to determine causation, the authors checked whether other variants were in strong linkage disequilibrium with rs1558902 and were thus also potentially causal variants in obesity. 

We will examine below how the PLINK tool can be used to perform such a linkage disequilibrium analysis. 


We have downloaded variant files for the 1000 Genomes Project in the PLINK binary format: 
    
* **/opt/data/project/1kg_phase1_all.bed** -- binary encoding of subject genotypes (do not be fooled by the file extension, this is NOT the 4-column bed file format we have been using). 

* **/opt/data/project/1kg_phase1_all.bim** -- list of all variants in the subject population 
* **/opt/data/project/1kg_phase1_all.fam** -- list of all subject IDs in the 1000 Genomes Project

In [10]:
#This command computes the pairwise correlation (r) between our tag SNP rs1558902 and nearby variants
!plink  --bfile /opt/data/project/1kg_phase1_all --r  --ld-snp rs1558902 --threads 10 --out r.for.rs1558902


PLINK v1.90b4.10 64-bit (3 Nov 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to r.for.rs1558902.log.
Options in effect:
  --bfile /opt/data/project/1kg_phase1_all
  --ld-snp rs1558902
  --out r.for.rs1558902
  --r
  --threads 10

7484 MB RAM detected; reserving 3742 MB for main workspace.
39728178 variants loaded from .bim file.
1092 people (525 males, 567 females) loaded from .fam.
Using up to 10 threads (change this with --threads).
Before main variant filters, 1083 founders and 9 nonfounders present.
Calculating allele frequencies... done.
commands treat these as missing.
Total genotyping rate is 0.999956.
39728178 variants and 1092 people pass filters and QC.
Note: No phenotypes present.
--r to r.for.rs1558902.ld ... 

The correlations between our tag SNP and nearby SNPs were saved to the file **r.for.rs1558902.ld**. Values of R close to 1 or -1 indicate strong linkage disequilibrium. Let's examine the contents of this file: 

In [11]:
!cat r.for.rs1558902.ld

 CHR_A         BP_A                 SNP_A  CHR_B         BP_B                 SNP_B            R 
    16     53803574             rs1558902     16     53803128            rs77955027   -0.0655586 
    16     53803574             rs1558902     16     53803156             rs8055197    -0.413274 
    16     53803574             rs1558902     16     53803187             rs1558901     0.757579 
    16     53803574             rs1558902     16     53803223            rs62048402            1 
    16     53803574             rs1558902     16     53803270           rs186754298   -0.0226154 
    16     53803574             rs1558902     16     53803332           rs189959143     0.091852 
    16     53803574             rs1558902     16     53803349           rs182131169   -0.0226154 
    16     53803574             rs1558902     16     53803415           rs139578493    0.0322946 
    16     53803574             rs1558902     16     53803452           rs187115215    0.0682658 
    16    

PLINK can also report the squared correlation r^2, the usual measure of linkage disequilibrium. The command is the same as the one we ran above, with the `--r` flag replaced by `--r2`.

In [12]:
!plink  --bfile /opt/data/project/1kg_phase1_all --r2  --ld-snp rs1558902 --threads 10 --out r2.for.rs1558902


PLINK v1.90b4.10 64-bit (3 Nov 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to r2.for.rs1558902.log.
Options in effect:
  --bfile /opt/data/project/1kg_phase1_all
  --ld-snp rs1558902
  --out r2.for.rs1558902
  --r2
  --threads 10

7484 MB RAM detected; reserving 3742 MB for main workspace.
39728178 variants loaded from .bim file.
1092 people (525 males, 567 females) loaded from .fam.
Using up to 10 threads (change this with --threads).
Before main variant filters, 1083 founders and 9 nonfounders present.
Calculating allele frequencies... done.
commands treat these as missing.
Total genotyping rate is 0.999956.
39728178 variants and 1092 people pass filters and QC.
Note: No phenotypes present.
--r2 to r2.for.rs1558902.ld

In [14]:
!cat r2.for.rs1558902.ld

 CHR_A         BP_A                 SNP_A  CHR_B         BP_B                 SNP_B           R2 
    16     53803574             rs1558902     16     53803187             rs1558901     0.573926 
    16     53803574             rs1558902     16     53803223            rs62048402            1 
    16     53803574             rs1558902     16     53803574             rs1558902            1 
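The two reports are consistent with each other: each `R2` value is the square of the corresponding `R` value. The sketch below checks this on rows copied from the `--r` report above. It assumes, based on the PLINK 1.9 documentation for `--ld-window-r2` (worth verifying for your version), that the `--r2` table omits pairs with r^2 below 0.2 by default. The helper name `parse_ld` is ours and is not part of PLINK.

```python
# Rows copied from the --r report shown earlier (header plus a few pairs).
R_REPORT = """\
 CHR_A         BP_A                 SNP_A  CHR_B         BP_B                 SNP_B            R
    16     53803574             rs1558902     16     53803128            rs77955027   -0.0655586
    16     53803574             rs1558902     16     53803156             rs8055197    -0.413274
    16     53803574             rs1558902     16     53803187             rs1558901     0.757579
    16     53803574             rs1558902     16     53803223            rs62048402            1
"""


def parse_ld(text):
    """Parse a whitespace-delimited PLINK .ld table into a list of dicts."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


R2_THRESHOLD = 0.2  # ASSUMED default reporting cutoff for --r2

kept = []
for row in parse_ld(R_REPORT):
    r2 = float(row["R"]) ** 2  # r^2 is simply the square of r
    if r2 >= R2_THRESHOLD:
        kept.append((row["SNP_B"], int(row["BP_B"]), round(r2, 6)))

for snp, bp, r2 in kept:
    print(snp, bp, r2)
```

The two neighbors that survive the cutoff, rs1558901 and rs62048402, are the same two neighbors listed in the `--r2` report, while rs8055197 (r = -0.413274) falls just below it.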


The New England Journal of Medicine article reports that variant rs1421085 is in linkage disequilibrium with rs1558902. The authors found that rs1421085 disrupts an ARID5B repressor motif and is thus the most likely causal variant. 
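rs1421085 does not appear in the reports shown above. One possible explanation, assuming the PLINK 1.9 defaults described in its documentation, is that `--ld-snp` reports only cover a limited window of nearby variants, controlled by `--ld-window` and `--ld-window-kb`. The sketch below only assembles a command with a wider window and does not run it. The window sizes, the 0.8 cutoff, and the output prefix are illustrative choices, not values taken from the article or the course.

```python
import shlex


def build_ld_command(tag_snp, bfile="/opt/data/project/1kg_phase1_all",
                     window_kb=500, window_variants=99999, min_r2=0.8):
    """Assemble a PLINK 1.9 --r2 command with explicit window settings."""
    return [
        "plink",
        "--bfile", bfile,
        "--r2",
        "--ld-snp", tag_snp,
        "--ld-window-kb", str(window_kb),     # distance window, in kilobases
        "--ld-window", str(window_variants),  # variant-count window
        "--ld-window-r2", str(min_r2),        # minimum r^2 to report
        "--out", f"r2.wide.{tag_snp}",        # hypothetical output prefix
    ]


cmd = build_ld_command("rs1558902")
print(shlex.join(cmd))
```

In the notebook, the printed line can be run by prefixing it with `!`, or the list can be passed to `subprocess.run(cmd, check=True)`.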

## STEP 2:  Are any of the candidate causal variants in exons?  <a name ='Dataanalysis'>

Your task is to determine whether any of the LD variants are in protein coding regions. That is, do they overlap a known exon? 

We have provided an hg19 gene annotation file here: 

* /opt/data/project/gencode.hg19.annotation.gtf 

You should use the "grep" command to extract EXON regions from this file. 

In [None]:
#YOUR CODE HERE
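One caveat when grepping a GTF file: the word "exon" also occurs in the attribute column of non-exon records (for example, `exon_number` on CDS lines), so a plain `grep exon` can return more than the exon features. Here is a minimal Python sketch of a stricter filter, run on two made-up GTF records. It keeps only records whose third column is exactly `exon` and converts them to BED coordinates. The function name and the toy records are assumptions for illustration.

```python
# Two made-up GTF records (tab-separated): one exon and one CDS.
TOY_GTF = (
    'chr1\tTOY\texon\t1000\t1200\t.\t+\t.\tgene_name "GENE_A"; exon_number 1;\n'
    'chr1\tTOY\tCDS\t1050\t1200\t.\t+\t0\tgene_name "GENE_A"; exon_number 1;\n'
)


def gtf_exons_to_bed(gtf_text):
    """Return (chrom, start0, end, strand) tuples for exon features only."""
    bed = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GTF header comments
        fields = line.split("\t")
        chrom, feature = fields[0], fields[2]
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        if feature == "exon":
            # GTF is 1-based inclusive; BED is 0-based, half-open.
            bed.append((chrom, start - 1, end, strand))
    return bed


exons = gtf_exons_to_bed(TOY_GTF)
print(exons)
```

On the command line, the same idea can be expressed by matching the third column exactly, for instance with `awk '$3 == "exon"'`.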

Now, use one of the bedtools commands we have discussed to overlap the exon file with the coordinates of your LD variants. 

In [13]:
# YOUR CODE HERE 
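Conceptually, the overlap step asks whether each variant position falls inside any exon interval. The sketch below shows that logic in plain Python on toy coordinates. It is not a replacement for bedtools, and the exon intervals are invented. It assumes the variant positions come from the `BP_B` column of the `.ld` report (1-based), so each one is first converted to the BED interval `[BP-1, BP)`. Note also that the `.ld` report writes the chromosome as `16`, whereas BED and GTF files often use `chr16`; check which form your files use.

```python
def snp_to_bed(chrom, bp):
    """Convert a 1-based SNP position to a 0-based, half-open BED interval."""
    return (chrom, bp - 1, bp)


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


# Variant positions taken from the .ld report above (CHR_B is 16).
snps = {
    "rs1558901": snp_to_bed("chr16", 53803187),
    "rs62048402": snp_to_bed("chr16", 53803223),
}

# Toy exon intervals, invented for illustration only.
toy_exons = [("chr16", 53803200, 53803300), ("chr16", 53900000, 53900100)]

hits = {name: [ex for ex in toy_exons if overlaps(iv, ex)]
        for name, iv in snps.items()}
print(hits)
```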

## STEP 3:  Have any of the variants in protein coding regions been linked to a disease? If so, which one?  What is known about how the variant could affect transcription or translation?

It might help to visualize the variant in the [Global Biobank Engine](https://biobankengine.stanford.edu/) 

Analyze what you have observed in the Global Biobank Engine.

## STEP 4: If the variant is in a non-coding region, is it in a promoter region? If so, what is the relevant cell type? 

You may find the file **/opt/data/project/active_promoters_across_cell_type.bed** useful in performing the tasks below. 

In [None]:
#Determine if any of the candidate variants are in promoter regions  
#YOUR CODE HERE

In [None]:
#List the cell types where the candidate variants are in active promoters
#YOUR CODE HERE
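A sketch of the cell-type lookup follows. It assumes (this is not confirmed by the file listing above) that each line of the promoter file holds chromosome, start, end, and a cell-type label in the fourth column. The records below are invented, so inspect the real file with `head` before adapting this.

```python
# Invented records in the ASSUMED layout: chrom, start, end, cell type.
TOY_PROMOTERS = (
    "chr16\t53803100\t53803300\tcell_type_A\n"
    "chr16\t53803150\t53803250\tcell_type_B\n"
    "chr16\t53900000\t53900500\tcell_type_A\n"
)


def cell_types_for_position(bed_text, chrom, bp):
    """Return sorted cell types whose interval contains a 1-based position."""
    found = set()
    for line in bed_text.splitlines():
        if not line:
            continue
        c, start, end, cell_type = line.split("\t")[:4]
        # BED intervals are 0-based, half-open: start < bp <= end.
        if c == chrom and int(start) < bp <= int(end):
            found.add(cell_type)
    return sorted(found)


inside = cell_types_for_position(TOY_PROMOTERS, "chr16", 53803223)
outside = cell_types_for_position(TOY_PROMOTERS, "chr16", 53803574)
print(inside)
print(outside)
```

If the enhancer file in Step 5 shares this layout, the same function applies to it unchanged.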

## STEP 5: If the variant is in a non-coding region, is it in an enhancer? If so, what is the relevant cell type? 

You may find the file **/opt/data/project/active_enhancers_across_cell_type.bed** useful in performing the tasks below. 

In [None]:
#Determine if any of the candidate variants are in enhancer regions  
#YOUR CODE HERE

In [None]:
#List the cell types where the candidate variants are in active enhancers
#YOUR CODE HERE

## STEP 6: Which transcription factor motifs overlap the active promoters and/or active enhancers identified in Steps 4 and 5? 

You may find the file **/opt/data/project/motifs.bed** useful for performing the task below. 

In [None]:
#Determine which transcription factor motifs overlap with the active promoters or enhancers 
#identified in Steps 4 and 5.  

#YOUR CODE HERE


## STEP 7: Look up the transcription factors identified in Step 6 in Gene Cards or another database. What is known about each transcription factor?  

### Introduction to Gene Cards 

[Gene Cards](http://www.genecards.org/) is a database of information about human genes. It provides information about gene function, tissue-specific expression, as well as journal articles where a given gene is mentioned. 
Look up the following genes in Gene Cards. What is the function of each gene? 

 * IRX5 
 * FTO 

## STEP 8: Identify candidate target genes (genes that are in the vicinity of the variant).


You may find the file **/opt/data/project/gene_coords_hg19.bed** useful. 

In [15]:
## Your code here.
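One way to define "vicinity" is by the distance from the variant to each gene body. The sketch below computes that distance for toy gene intervals and keeps the genes within a chosen window. The gene names, coordinates, and the 500 kb window are illustrative assumptions, and the real column layout of `gene_coords_hg19.bed` should be checked before adapting it.

```python
def distance_to_gene(bp, gene_start, gene_end):
    """Distance in bases from a 1-based position to a BED gene interval.

    Returns 0 when the position lies inside the gene.
    """
    first, last = gene_start + 1, gene_end  # 1-based inclusive gene bounds
    if bp < first:
        return first - bp
    if bp > last:
        return bp - last
    return 0


# Invented genes on the variant's chromosome: (name, start, end).
toy_genes = [
    ("GENE_A", 53700000, 53900000),
    ("GENE_B", 54300000, 54320000),
    ("GENE_C", 56000000, 56010000),
]

snp_bp = 53803574  # BP_A of rs1558902 in the report above
window = 500_000   # illustrative cutoff, not a course requirement

# Keep genes within the window, sorted from nearest to farthest.
nearby = sorted(
    (distance_to_gene(snp_bp, start, end), name)
    for name, start, end in toy_genes
    if distance_to_gene(snp_bp, start, end) <= window
)
print(nearby)
```

Sorting by distance gives a first ranking of candidate target genes, though the nearest gene is not necessarily the one the variant regulates.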

Look up the function of these genes in [Gene Cards](http://www.genecards.org/)

Visualize your variant in the [WashU Browser](http://epigenomegateway.wustl.edu/browser/) to determine which genes lie nearby. How near is each SNP to a candidate gene? 

To export screenshots from the WashU Browser, go to **Tracks** in the menu bar and select **Screenshot**
![WashU Screenshot](../Images/15_BrowserScreenshot.png)

Select **show track name** and click on **Take screenshot**.
![Browser Screenshot 2](../Images/15_BrowserScreenshot2.png)

## STEP 9: Using all of the information together, select your top 5 most likely causal variants. 