## Big Data for Biologists: Decoding Genomic Function

##  Learning Objectives
***Students should be able to***
<ol>
<li> <a href=#Workflow>Participate in a collaborative programming project to gain insights into the workflow for computational projects.</a></li>
<li> <a href=#Roles>Experience different roles in a computational project including code implementer and documentation provider.</a></li>
<li> <a href=#Dataanalysis> Apply data analysis methods from the course to new problems </a></li> 
</ol>


## Introduction to Course Project 

The course project will have three main steps.  

I.  You will be given the SNP identifiers (rs ids) for two variants in the human genome. For each of these variants you will: 

* write the code using the notebook guidelines below. 
* analyze the output of the code 
* create an initial draft of a report including:  
   <ol>
    <li> the likely causal variants </li>
    <li> an explanation of why they are the likely causal variants</li>
    <li> what you have learned about how the variant acts (i.e., through what type of mutation in a coding region or what type of element in a non-coding region)</li>
   </ol>

II. You will proofread and check your teammate's work. 
* Switch variants and run through code.
* The outputs of your files should be identical.
* Add comments to annotate any code that may need clarification. 

III. Writing up the report (see rubric for guidelines). 

Deliverables: 
* Jupyter notebook with code 
* Writeup document with summary

### Grading Rubric: 

The grading rubric will be posted on Canvas with the assignment.

### Project files: 
All project files can be found in the folder **/data/project**

* /data/project/1kg_phase1_all*   -- binary variant files
* /data/project/gene_coords_hg19.bed.gz -- bed file of gene coordinates 
* /data/project/gencode.hg19.annotation.gtf -- gene annotation file 
* /data/project/motifs.bed.gz -- coordinates of all transcription factor-binding motifs in the genome. 
* /data/project/active_promoters_across_cell_type.bed.gz
* /data/project/active_enhancers_across_cell_type.bed.gz

### General Suggestions: 

You will be working with some large files for your course project. Please use the !head command to examine these files rather than the !cat command to avoid printing very large amounts of text to your notebook. 

Some of the files we have provided are compressed in the gzip format (these end with the .gz suffix). To examine these files, use the combination of the zcat and head commands, as below: 

```
!zcat /data/project/gene_coords_hg19.bed.gz | head 
    
```
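
The same preview can be done from Python when you want the lines in a variable rather than printed. This is a minimal sketch using only the standard library; the helper name `gz_head` is made up for this example.

```python
import gzip
from itertools import islice


def gz_head(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # Mode "rt" decompresses on the fly and yields str lines instead of bytes.
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, n)]


# On the course server this would preview the gene coordinate file:
# for line in gz_head("/data/project/gene_coords_hg19.bed.gz"):
#     print(line)
```

Because only the first `n` lines are read, this stays fast even for very large files.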

To make your code easier to follow, you may find it helpful to add additional comments in the code boxes. You can decide which comments will be helpful to include. 


## STEP 1:  Is either of the candidate causal variants in a coding region?  <a name ='Dataanalysis'></a>

Your first task is to determine whether any of the candidate variants are in protein coding regions. That is, do they overlap a known protein coding region? 

We have provided an hg19 gene annotation file here: 

* **/data/project/gencode.hg19.annotation.gtf**

The annotations for CDS regions in this file include the text "CDS". Use the "grep" command to extract the CDS regions from this file, with a flag that limits the output to lines containing "CDS" as a whole word. Otherwise, lines with CDS embedded in other fields may also appear (see !grep --help for a list of flags). 

In [3]:
#BEGIN SOLUTION
!grep -w "CDS" /data/project/gencode.hg19.annotation.gtf | cut -f1,4,5 > cds.bed
#END SOLUTION
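
One detail worth keeping in mind when turning GTF rows into BED rows: GTF coordinates are 1-based and end-inclusive, while BED coordinates are 0-based and end-exclusive. Copying columns 1, 4 and 5 unchanged therefore yields intervals that, read as BED, start one base late. The sketch below is an illustrative alternative, not the course's reference solution; it filters on the feature column and shifts the start by one. The function name `gtf_to_bed` and the sample lines are made up for this example.

```python
def gtf_to_bed(lines, feature="CDS"):
    """Yield (chrom, start, end) BED tuples for GTF lines of one feature type."""
    for line in lines:
        if line.startswith("#"):  # skip GTF header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) holds the feature type, e.g. gene, exon, CDS.
        if len(fields) < 5 or fields[2] != feature:
            continue
        # GTF: 1-based, end-inclusive. BED: 0-based, end-exclusive.
        # Only the start shifts; the end value stays the same.
        yield fields[0], int(fields[3]) - 1, int(fields[4])


# Made-up GTF lines for illustration only (not real annotation).
sample_gtf = [
    '##description: made-up example',
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\tCDS\t150\t200\t.\t+\t0\tgene_id "G1";',
]
print(list(gtf_to_bed(sample_gtf)))  # → [('chr1', 149, 200)]
```

Matching on the feature column directly also avoids picking up lines where "CDS" appears only in the attribute text.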

INSTRUCTOR NOTE: 
INSTRUCTIONS FOR GETTING GENOMIC COORDINATES FROM ENSEMBL
https://www.internationalgenome.org/faq/can-i-find-genomic-position-list-dbsnp-rs-numbers-0/

In [9]:
#IMPORTING ALL VARIANTS TO CHECK THEM 
import pandas as pd
variant_genomic_coordinates=pd.read_table(filepath_or_buffer="CourseProjectVariants.tsv", header=0, sep=r"\s+")  # sep=r"\s+" replaces the deprecated delim_whitespace=True
variant_genomic_coordinates.head(100)

Unnamed: 0,Variant_name,Variant_source,Chromosome_name,Chromosome_start(bp),Chromosome_end(bp)
0,rs1046089,dbSNP,6,31602967,31602967
1,rs11221332,dbSNP,11,128380974,128380974
2,rs11865038,dbSNP,16,31095171,31095171
3,rs121908745,dbSNP,7,117199644,117199646
4,rs121908767,dbSNP,7,117250651,117250656
5,rs12791769,dbSNP,11,34834268,34834268
6,rs12793173,dbSNP,11,34834204,34834204
7,rs1296028,dbSNP,8,11698747,11698747
8,rs13151961,dbSNP,4,123115502,123115502
9,rs1372659,dbSNP,8,135566779,135566779


In [10]:
#SUBTRACTING ONE TO GET zero-based numbering 
variant_genomic_coordinates['Chromosome_start(bp)']=variant_genomic_coordinates['Chromosome_start(bp)']-1
variant_genomic_coordinates.head()

Unnamed: 0,Variant_name,Variant_source,Chromosome_name,Chromosome_start(bp),Chromosome_end(bp)
0,rs1046089,dbSNP,6,31602966,31602967
1,rs11221332,dbSNP,11,128380973,128380974
2,rs11865038,dbSNP,16,31095170,31095171
3,rs121908745,dbSNP,7,117199643,117199646
4,rs121908767,dbSNP,7,117250650,117250656


In [12]:
#Making Course Project variants bed file 
variant_genomic_coordinates['Chromosome_name']='chr'+variant_genomic_coordinates['Chromosome_name'].apply(str)
SNPs=variant_genomic_coordinates.loc[:, ['Chromosome_name','Chromosome_start(bp)','Chromosome_end(bp)','Variant_name']]
SNPs.to_csv("courseproject_snps.bed",sep='\t',index=False,header=False) 

In [14]:
!head courseproject_snps.bed

chr6	31602966	31602967	rs1046089
chr11	128380973	128380974	rs11221332
chr16	31095170	31095171	rs11865038
chr7	117199643	117199646	rs121908745
chr7	117250650	117250656	rs121908767
chr11	34834267	34834268	rs12791769
chr11	34834203	34834204	rs12793173
chr8	11698746	11698747	rs1296028
chr4	123115501	123115502	rs13151961
chr8	135566778	135566779	rs1372659


In [1]:
#BEGIN SOLUTION

!bedtools intersect -wa -wb -a cds.bed -b courseproject_snps.bed > coding_variant_overlap.bed
!cat coding_variant_overlap.bed

#END SOLUTION

chr1	2526715	2526798	chr1	2526745	2526746	rs3748816
chr1	2526715	2526798	chr1	2526745	2526746	rs3748816
chr1	2526715	2526798	chr1	2526745	2526746	rs3748816
chr1	25291005	25291062	chr1	25291009	25291010	rs6672420
chr1	25291005	25291062	chr1	25291009	25291010	rs6672420
chr2	68607855	68608036	chr2	68607946	68607947	rs3816281
chr3	52833769	52833937	chr3	52833804	52833805	rs3617
chr3	52833769	52833937	chr3	52833804	52833805	rs3617
chr3	52833769	52833937	chr3	52833804	52833805	rs3617
chr4	951612	952281	chr4	951946	951947	rs34311866
chr4	951612	952281	chr4	951946	951947	rs34311866
chr4	951612	952281	chr4	951946	951947	rs34311866
chr4	951612	952281	chr4	951946	951947	rs34311866
chr5	433968	434946	chr5	434721	434722	rs34453673
chr5	433968	434946	chr5	434721	434722	rs34453673
chr5	433968	434946	chr5	434721	434722	rs34453673
chr5	433968	434946	chr5	434721	434722	rs34453673
chr6	26091138	26091332	chr6	26091178	26091179	rs1799945
chr6	26091069	26091332	chr6	26091178	26091179	rs1799945
chr6	26091069

In [15]:
import pandas as pd
coding_variants=pd.read_table(filepath_or_buffer="coding_variant_overlap.bed",header=None,sep=r"\s+")  # sep=r"\s+" replaces the deprecated delim_whitespace=True
rsids=coding_variants[6].unique()
print(rsids)

['rs3748816' 'rs6672420' 'rs3816281' 'rs3617' 'rs34311866' 'rs34453673'
 'rs1799945' 'rs1046089' 'rs121908745' 'rs121908767' 'rs214936'
 'rs62054815' 'rs167479' 'rs429358']


Now, you will use one of the bedtools commands we have discussed to overlap the CDS file with the coordinates of your assigned variants. 

In [None]:
#BEGIN SOLUTION  
#END SOLUTION 
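
To make the overlap test itself concrete, here is a small pure-Python sketch of what an interval intersection checks. It is not a replacement for bedtools (it compares every variant with every region, which is slow on whole-genome files), and the function names are made up for this example. The two SNP rows and the CDS row are taken from the outputs shown earlier in this notebook.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based, half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def variants_in_regions(variants, regions):
    """Return sorted names of variants overlapping any region on the same chromosome."""
    hits = set()
    for v_chrom, v_start, v_end, name in variants:
        for r_chrom, r_start, r_end in regions:
            if v_chrom == r_chrom and overlaps(v_start, v_end, r_start, r_end):
                hits.add(name)
                break  # one overlapping region is enough for this variant
    return sorted(hits)


# One CDS interval and two SNPs copied from the outputs above.
cds_regions = [("chr1", 2526715, 2526798)]
snps = [
    ("chr1", 2526745, 2526746, "rs3748816"),
    ("chr11", 128380973, 128380974, "rs11221332"),
]
print(variants_in_regions(snps, cds_regions))  # → ['rs3748816']
```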

You should find that one of your variants is in a coding region. For this variant, you do NOT need to investigate its linked variants, because the variant likely directly affects the sequence of the protein that the gene encodes. However, if you find that your variant is in a non-coding portion of the genome, its linked SNPs must then be examined to determine the variant's most likely mechanism of action. You can skip step 2 and proceed to step 3 for the variant that is in a coding region. 

## STEP 2: Given a target non-coding variant that has been linked to a disease, what are the candidate causal variants in LD with the target variant?   

To maximize the chance of discovering the causal SNP, please investigate multiple variants in LD with your non-coding SNP and discuss them in your writeup. If you find that your non-coding SNP is in high LD with more than five SNPs, please investigate and discuss the five SNPs with the highest r^2 LD score. 

In [None]:
#Find all single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with your target variants.

#BEGIN SOLUTION 
#END SOLUTION 
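
For intuition about what the r^2 score measures and how the "five highest" rule can be applied, here is a minimal sketch. It estimates r^2 as the squared Pearson correlation of per-individual allele counts (0, 1 or 2); LD tools may use other estimators, so their values can differ slightly. The function names, the `min_r2=0.8` cutoff and the allele counts are all assumptions made up for this example.

```python
def ld_r2(counts_a, counts_b):
    """Squared Pearson correlation between two equal-length allele-count vectors."""
    n = len(counts_a)
    mean_a = sum(counts_a) / n
    mean_b = sum(counts_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(counts_a, counts_b))
    var_a = sum((a - mean_a) ** 2 for a in counts_a)
    var_b = sum((b - mean_b) ** 2 for b in counts_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a SNP with no variation has undefined correlation
    return (cov * cov) / (var_a * var_b)


def top_ld_snps(r2_by_snp, min_r2=0.8, n=5):
    """Return up to n (snp, r2) pairs with r2 >= min_r2, highest r2 first."""
    kept = [(snp, r2) for snp, r2 in r2_by_snp.items() if r2 >= min_r2]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up allele counts for six individuals (illustration only).
target = [0, 1, 2, 1, 0, 2]
neighbours = {
    "snp_x": [0, 1, 2, 1, 0, 2],  # identical pattern to the target
    "snp_y": [0, 2, 1, 1, 2, 0],  # unrelated pattern
}
r2_scores = {name: ld_r2(target, counts) for name, counts in neighbours.items()}
print(top_ld_snps(r2_scores))
```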

Generate a Manhattan plot with SNP position along the x-axis and r^2 of all LD SNPs along the y-axis. 

In [None]:
#BEGIN SOLUTION 
#END SOLUTION
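
One possible way to draw the position-versus-r^2 plot is sketched below with matplotlib. It assumes you already have a list of `(position, r2)` pairs for the SNPs in LD with your target; the function names, the output file name `ld_plot.png` and the styling are choices made for this example, not course requirements.

```python
def ld_plot_points(ld_rows):
    """Sort (position, r2) pairs by position and return two parallel lists."""
    ordered = sorted(ld_rows)
    positions = [pos for pos, _ in ordered]
    r2_values = [r2 for _, r2 in ordered]
    return positions, r2_values


def plot_ld(positions, r2_values, target_label, out_path="ld_plot.png"):
    """Scatter r^2 against genomic position and save the figure to out_path."""
    # Imported here so the data-preparation helper works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 3))
    ax.scatter(positions, r2_values, s=12)
    ax.set_xlabel("SNP position (bp)")
    ax.set_ylabel(r"$r^2$ with " + target_label)
    ax.set_ylim(0, 1.05)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path


# Made-up positions and r^2 values, for illustration only.
example_rows = [(300, 0.5), (100, 0.9), (200, 1.0)]
xs, ys = ld_plot_points(example_rows)
# plot_ld(xs, ys, "target SNP")  # uncomment to write ld_plot.png
```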

## STEP 3: Look up known variant-phenotype associations in the GWAS Catalog  

Have any of the variants in protein coding regions been linked to a disease? If so, which one?  What is known about how the variant could affect transcription or translation?

The [GWAS Catalog](https://www.ebi.ac.uk/gwas/) is a curated database of GWAS studies. Once you have identified a GWAS hit, it's worth checking the catalog to see if others have discovered it as well. 

Look up rs1558902 and rs1421085 in the catalog. What phenotypes have these variants been associated with? Click on the "Associations" link for each variant. 

**ANSWER HERE:**



## STEP 4:  Are any of the variants in high LD with the non-coding SNP located within an exon? 

Repeat the step 1 analysis, but this time search for EXONS and examine the SNPs in high LD with your non-coding SNP. 

In [16]:
## BEGIN SOLUTION 
!grep -w "exon" /data/project/gencode.hg19.annotation.gtf | cut -f1,4,5 > exons.bed  # grep is case-sensitive; GTF feature types are lowercase ("exon")
## END SOLUTION 

In [17]:
##INSTRUCTOR checking overlap noncoding regions with exons 
!bedtools intersect -wa -wb -a exons.bed -b courseproject_snps.bed > non-coding-variant-exons.bed
!cat  non-coding-variant-exons.bed
##INSTRUCTOR
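
Because grep matches are case-sensitive, it can help to check exactly how the annotation file spells its feature types before filtering on one. The following sketch tallies the values in the third GTF column; the function name and the sample lines are made up for this example.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.split("\t")
        if len(fields) > 2:
            counts[fields[2]] += 1
    return counts


# Made-up GTF lines for illustration only.
sample_lines = [
    "##format: gtf",
    "chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tattrs",
    "chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tattrs",
    "chr1\tHAVANA\texon\t300\t500\t.\t+\t.\tattrs",
    "chr1\tHAVANA\tCDS\t150\t200\t.\t+\t0\tattrs",
]
print(count_feature_types(sample_lines))

# On the course server the same function can read the real file:
# with open("/data/project/gencode.hg19.annotation.gtf") as handle:
#     print(count_feature_types(handle))
```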

## STEP 5:  Have any of the variants in protein coding regions been linked to a disease? If so, which one?  What is known about how the variant could affect transcription or translation?

It might help to visualize the variant in the [Global Biobank Engine](https://biobankengine.stanford.edu/) 

Analyze what you have observed in the Global Biobank Engine.

## STEP 6: If the variant is in a non-coding region, is it in a promoter region? If so, what is the relevant cell type? 

You may find the file **/data/project/active_promoters_across_cell_type.bed.gz** useful in performing the tasks below. 

In [23]:
#Checking promoter file
!zcat /data/project/active_promoters_across_cell_type.bed.gz | head 

chr1	10452	10563	chr1:10452-10563	0	.	  Placenta
chr1	118576	118668	chr1:118576-118668	0	.	  K562 Leukemia Cells
chr1	138501	138581	chr1:138501-138581	0	.	  K562 Leukemia Cells
chr1	139011	139077	chr1:139011-139077	0	.	  K562 Leukemia Cells
chr1	139281	139335	chr1:139281-139335	0	.	  K562 Leukemia Cells
chr1	244145	244214	chr1:244145-244214	0	.	  Primary monocytes from peripheral blood
chr1	523755	523811	chr1:523755-523811	0	.	  Placenta
chr1	540605	540676	chr1:540605-540676	0	.	  Stomach Smooth Muscle
chr1	540605	540676	chr1:540605-540676	0	.	  Osteoblast Primary Cells
chr1	540605	540676	chr1:540605-540676	0	.	  ES-I3 Cells

gzip: stdout: Broken pipe


In [22]:
!ls /data/project/

1kg_phase1_all.bed			      gencode.hg19.annotation.gtf
1kg_phase1_all.bim			      gene_coords_hg19.bed.gz
1kg_phase1_all.fam			      gene_coords_hg19.bed.gz.tbi
active_enhancers_across_cell_type.bed.gz      motifs.bed.gz
active_enhancers_across_cell_type.bed.gz.tbi  motifs.bed.gz.tbi
active_promoters_across_cell_type.bed.gz      README
active_promoters_across_cell_type.bed.gz.tbi


In [24]:
#Determine if any of the candidate variants are in promoter regions  
#BEGIN SOLUTION 
!bedtools intersect -wa -wb -a /data/project/active_promoters_across_cell_type.bed.gz -b courseproject_snps.bed > promotor_overlap.bed
#END SOLUTION 

[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 1
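
The parse errors printed above mean the overlap file may be empty or incomplete, so it is worth inspecting before relying on it. The cause is not established here. The `.tbi` index files in the directory listing suggest the inputs are bgzip-compressed, which Python's `gzip` module can read as ordinary gzip, so one workaround is to do the overlap test directly in Python. This is a sketch under that assumption; `load_snps` and `intersect_gz_bed` are names made up for this example.

```python
import gzip
from collections import defaultdict


def load_snps(bed_lines):
    """Group SNP intervals by chromosome: {chrom: [(start, end, name), ...]}."""
    snps = defaultdict(list)
    for line in bed_lines:
        if not line.strip():
            continue
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        snps[chrom].append((int(start), int(end), name))
    return snps


def intersect_gz_bed(gz_path, snps_by_chrom):
    """Yield (region_fields, snp_name) for each region that overlaps a SNP."""
    with gzip.open(gz_path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            # Half-open interval overlap test against SNPs on this chromosome.
            for s_start, s_end, name in snps_by_chrom.get(chrom, []):
                if start < s_end and s_start < end:
                    yield fields, name


# Possible use on the course server (paths from the project file list):
# with open("courseproject_snps.bed") as handle:
#     snps = load_snps(handle)
# path = "/data/project/active_promoters_across_cell_type.bed.gz"
# for region, rsid in intersect_gz_bed(path, snps):
#     print(rsid, region[6].strip())
```

The file is streamed line by line, so memory use stays small; only the SNP list is held in memory.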


In [None]:
#List the cell types where the candidate variants are in active promoters
#BEGIN SOLUTION 
#END SOLUTION 
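
Once you have overlap rows in `-wa -wb` layout, the cell types can be collected per variant. The sketch below assumes the seven promoter columns shown in the preview above are followed by the four SNP columns, so the cell type sits at index 6 and the rsID at index 10. The function name and the placeholder SNP `rs_example` are made up for this example.

```python
from collections import defaultdict


def cell_types_by_variant(overlap_lines, cell_type_col=6, variant_col=10):
    """Map each variant name to the sorted cell types whose elements it overlaps."""
    found = defaultdict(set)
    for line in overlap_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(cell_type_col, variant_col):
            continue  # skip blank or malformed lines
        # Cell type names in the file carry leading spaces, so strip them.
        found[fields[variant_col]].add(fields[cell_type_col].strip())
    return {rsid: sorted(cells) for rsid, cells in found.items()}


# Promoter rows copied from the preview above, joined to a placeholder SNP.
snp = "chr1\t540650\t540651\trs_example"
example_overlap = [
    "chr1\t540605\t540676\tchr1:540605-540676\t0\t.\t  Stomach Smooth Muscle\t" + snp,
    "chr1\t540605\t540676\tchr1:540605-540676\t0\t.\t  Osteoblast Primary Cells\t" + snp,
    "chr1\t540605\t540676\tchr1:540605-540676\t0\t.\t  ES-I3 Cells\t" + snp,
]
print(cell_types_by_variant(example_overlap))
```

The column indices are keyword arguments so the same helper can be reused if another file has a different layout.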

## STEP 7: If the variant is in a non-coding region, is it in an enhancer? If so, what is the relevant cell type? 

You may find the file **/data/project/active_enhancers_across_cell_type.bed.gz** useful in performing the tasks below. 

In [27]:
#Determine if any of the candidate variants are in enhancer regions  
#BEGIN SOLUTION 
!bedtools intersect -wa -wb -a /data/project/active_enhancers_across_cell_type.bed.gz -b courseproject_snps.bed > enhancer_overlap.bed
#END SOLUTION 

[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 1


In [None]:
#List the cell types where the candidate variants are in active enhancers
#BEGIN SOLUTION 
#END SOLUTION 

## STEP 8: Which transcription factor motifs overlap with the SNP?

You may find the file **/data/project/motifs.bed.gz** useful for performing the task below. 

Note: If you don't find any transcription factor motifs that overlap with the specific SNP, you can expand the search to include motifs that overlap with the active promoters/enhancers from steps 6 and 7. 

In [None]:
#Determine which transcription factor motifs overlap with the SNPs.
#BEGIN SOLUTION 
#END SOLUTION 
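
The fallback described in the note can be expressed as a two-stage search: look at the SNP itself first, and only if nothing overlaps, widen to the promoter or enhancer that contains it. This is a sketch of that logic only. It assumes motif records of the form `(chrom, start, end, name)`, which may not match the actual column layout of the motifs file, and the motif names and coordinates are invented.

```python
def motifs_overlapping(motifs, chrom, start, end):
    """Sorted names of motifs overlapping the half-open interval chrom:start-end."""
    return sorted({
        name
        for m_chrom, m_start, m_end, name in motifs
        if m_chrom == chrom and m_start < end and start < m_end
    })


def motifs_for_snp(motifs, snp, element=None):
    """Search the SNP first; if nothing overlaps, fall back to the element."""
    chrom, start, end = snp
    hits = motifs_overlapping(motifs, chrom, start, end)
    if hits or element is None:
        return hits, "snp"
    e_chrom, e_start, e_end = element
    return motifs_overlapping(motifs, e_chrom, e_start, e_end), "element"


# Invented motif intervals and names, for illustration only.
example_motifs = [
    ("chr1", 100, 110, "TF_A"),
    ("chr1", 300, 312, "TF_B"),
]
print(motifs_for_snp(example_motifs, ("chr1", 105, 106)))
# → (['TF_A'], 'snp')
print(motifs_for_snp(example_motifs, ("chr1", 200, 201), element=("chr1", 150, 400)))
# → (['TF_B'], 'element')
```

Returning the search level alongside the hits makes it easy to state in the writeup whether a motif overlaps the SNP directly or only the surrounding element.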

## STEP 9: Look up the Transcription Factors identified in Step 8 in Gene Cards or another browser. What is known about the transcription factor?  

### Introduction to Gene Cards 

[Gene Cards](http://www.genecards.org/) is a database of information about human genes. It provides information about gene function, tissue-specific expression, as well as journal articles where a given gene is mentioned. 
Look up the following genes in Gene Cards. What is the function of each gene? 

 * IRX5 
 * FTO 

## STEP 10: Identify candidate target genes (genes that are in the vicinity of the variant).


You may find the file **/data/project/gene_coords_hg19.bed.gz** useful. 

In [None]:
## BEGIN SOLUTION 
## END SOLUTION 
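
"In the vicinity" needs a concrete distance cutoff. The sketch below lists genes within a window around a SNP together with their distance in base pairs. The 500 kb default window is an arbitrary choice for illustration, the gene records are assumed to be `(chrom, start, end, name)` in BED-style coordinates (which may not match the actual columns of the gene coordinates file), and the gene names are invented.

```python
def distance_to_gene(snp_pos, gene_start, gene_end):
    """Distance in bp from a 0-based SNP position to a half-open gene interval.

    Returns 0 if the SNP lies inside the gene; directly adjacent bases give 1.
    """
    if gene_start <= snp_pos < gene_end:
        return 0
    if snp_pos < gene_start:
        return gene_start - snp_pos
    return snp_pos - (gene_end - 1)


def genes_near_snp(genes, chrom, snp_pos, window=500_000):
    """Return (distance, gene_name) pairs within the window, nearest first."""
    nearby = []
    for g_chrom, g_start, g_end, name in genes:
        if g_chrom != chrom:
            continue
        dist = distance_to_gene(snp_pos, g_start, g_end)
        if dist <= window:
            nearby.append((dist, name))
    return sorted(nearby)


# Invented gene intervals, for illustration only.
example_genes = [
    ("chr16", 1000, 5000, "GENE_A"),
    ("chr16", 900000, 950000, "GENE_B"),
    ("chr2", 1000, 2000, "GENE_C"),
]
print(genes_near_snp(example_genes, "chr16", 3000))
# → [(0, 'GENE_A')]
print(genes_near_snp(example_genes, "chr16", 3000, window=1_000_000))
# → [(0, 'GENE_A'), (897000, 'GENE_B')]
```

Nearest is not necessarily the true target gene, so the distances are best treated as one line of evidence alongside the browser view.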

Look up the function of these genes in [Gene Cards](http://www.genecards.org/)

Visualize your variant in the [WashU Browser](http://epigenomegateway.wustl.edu/browser/) to determine which genes lie nearby. How near is each SNP to a candidate gene? 

To export screenshots from the WashU Browser, go to **Tracks** in the menu bar and select **Screenshot**
![WashU Screenshot](../Images/15_BrowserScreenshot.png)

Select **show track name** and click on **Take screenshot**.
![Browser Screenshot 2](../Images/15_BrowserScreenshot2.png)

## STEP 11: Using all of the information together, select your top 5 most likely causal variants. 