## Big Data for Biologists: Replication of GWAS studies in different populations? - Class 16
##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#LD>Find variants in linkage disequilibrium (LD) with a target variant using tabix and PLINK.</a></li>
 <li> <a href=#GeneCards>Use GeneCards to find out information about a gene.</a></li>
 <li><a href=#PCA23andme>Use PCA to predict ancestry from a 23andMe genetic dataset.</a></li>
 <li><a href=#projectFiles>Use reference datasets of genome and epigenome information to investigate the function of coding and non-coding variants.</a></li>
 </ol>


## Linkage Disequilibrium Example with Tabix and PLINK <a name ='LD'></a>

[An article in the New England Journal of Medicine](http://www.nejm.org/doi/full/10.1056/NEJMoa1502214?rss=searchAndBrowse&#t=article) presented a GWAS in 52 participants who were homozygous for the risk allele for the tag variant rs1558902. This variant occurs in an intron of the FTO gene, which has previously been linked to obesity. However, since a strong GWAS association of a variant with a phenotype is insufficient to determine causation, the authors checked whether other variants were in strong linkage disequilibrium with rs1558902 and were thus also potential causal variants in obesity.

We will examine below how the PLINK tool can be used to perform such a linkage disequilibrium analysis. 
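Before running PLINK, it may help to see what the statistic measures. The sketch below computes the classic LD coefficient D, the correlation r, and r^2 for two biallelic variants from phased haplotype counts. The counts and the function name `ld_stats` are hypothetical and chosen only for illustration; PLINK works from genotype data, so its estimates may differ in detail from this textbook formula.

```python
import math

def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r, r2) for two biallelic variants from phased haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at the first variant
    p_B = (n_AB + n_aB) / total    # frequency of allele B at the second variant
    D = p_AB - p_A * p_B           # deviation from independence
    denom = math.sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
    if denom == 0:
        raise ValueError("r is undefined when a variant is monomorphic")
    r = D / denom
    return D, r, r * r

# Hypothetical counts: alleles A and B travel together more often than chance predicts.
D, r, r2 = ld_stats(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
print(f"D = {D:.3f}, r = {r:.3f}, r^2 = {r2:.3f}")
```

When the four haplotypes are equally common, D and r are both zero, which corresponds to no linkage disequilibrium.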


We have downloaded variant files for the 1000 Genomes Project in the PLINK binary format: 
    
* **/opt/data/project/1kg_phase1_all.bed** -- binary encoding of subject genotypes (do not be fooled by the file extension, this is NOT the 4-column bed file format we have been using). 

* **/opt/data/project/1kg_phase1_all.bim** -- list of all variants in the subject population 
* **/opt/data/project/1kg_phase1_all.fam** -- list of all subject IDs in the 1000 Genomes Project

In [1]:
#This syntax computes the correlation (r) between our tag SNP rs1558902 and nearby variants, which lets us identify variants in linkage disequilibrium with it
!plink  --bfile /opt/data/project/1kg_phase1_all --r  --ld-snp rs1558902 --threads 10 --out r.for.rs1558902


PLINK v1.90b4.10 64-bit (3 Nov 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to r.for.rs1558902.log.
Options in effect:
  --bfile /opt/data/project/1kg_phase1_all
  --ld-snp rs1558902
  --out r.for.rs1558902
  --r
  --threads 10

7484 MB RAM detected; reserving 3742 MB for main workspace.
39728178 variants loaded from .bim file.
1092 people (525 males, 567 females) loaded from .fam.
Using up to 10 threads (change this with --threads).
Before main variant filters, 1083 founders and 9 nonfounders present.
Calculating allele frequencies... done.
commands treat these as missing.
Total genotyping rate is 0.999956.
39728178 variants and 1092 people pass filters and QC.
Note: No phenotypes present.
--r to r.for.rs1558902.ld ... 

The pairwise r values between our tag SNP and nearby variants were saved to the file **r.for.rs1558902.ld**. Values close to 1 or -1 indicate strong linkage disequilibrium. Let's examine the contents of this file: 

In [2]:
!cat r.for.rs1558902.ld

 CHR_A         BP_A                 SNP_A  CHR_B         BP_B                 SNP_B            R 
    16     53803574             rs1558902     16     53803128            rs77955027   -0.0655586 
    16     53803574             rs1558902     16     53803156             rs8055197    -0.413274 
    16     53803574             rs1558902     16     53803187             rs1558901     0.757579 
    16     53803574             rs1558902     16     53803223            rs62048402            1 
    16     53803574             rs1558902     16     53803270           rs186754298   -0.0226154 
    16     53803574             rs1558902     16     53803332           rs189959143     0.091852 
    16     53803574             rs1558902     16     53803349           rs182131169   -0.0226154 
    16     53803574             rs1558902     16     53803415           rs139578493    0.0322946 
    16     53803574             rs1558902     16     53803452           rs187115215    0.0682658 
    16    
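The table above is whitespace-delimited, so it is straightforward to load into Python and rank variants by the strength of their correlation. The following is a minimal sketch using only the standard library; the helper name `parse_ld` is hypothetical, and the embedded text is a copy of the first few rows shown above (in practice you would read `r.for.rs1558902.ld` from disk).

```python
# First rows of the PLINK --r output shown above, pasted in for illustration.
LD_TEXT = """
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R
16 53803574 rs1558902 16 53803128 rs77955027 -0.0655586
16 53803574 rs1558902 16 53803156 rs8055197 -0.413274
16 53803574 rs1558902 16 53803187 rs1558901 0.757579
16 53803574 rs1558902 16 53803223 rs62048402 1
16 53803574 rs1558902 16 53803270 rs186754298 -0.0226154
"""

def parse_ld(text):
    """Parse PLINK .ld text into a list of dicts, skipping incomplete rows."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    records = []
    for fields in rows:
        if len(fields) != len(header):
            continue  # e.g. a truncated final line
        rec = dict(zip(header, fields))
        rec["BP_B"] = int(rec["BP_B"])
        rec["R"] = float(rec["R"])
        records.append(rec)
    return records

records = parse_ld(LD_TEXT)

# Rank by |r|: the sign only reflects which alleles are paired together.
ranked = sorted(records, key=lambda rec: abs(rec["R"]), reverse=True)
for rec in ranked[:3]:
    print(rec["SNP_B"], rec["BP_B"], rec["R"])
```

Sorting on the absolute value matters because a strongly negative r indicates linkage disequilibrium just as a strongly positive one does.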

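PLINK can also report the squared correlation, r^2, for linkage disequilibrium. The command is the same as the one we ran above, but with the `--r` flag replaced by `--r2`. By default, `--r2` only reports variant pairs with an r^2 of at least 0.2, so the output is much shorter.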

In [3]:
!plink  --bfile /opt/data/project/1kg_phase1_all --r2  --ld-snp rs1558902 --threads 10 --out r2.for.rs1558902


PLINK v1.90b4.10 64-bit (3 Nov 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to r2.for.rs1558902.log.
Options in effect:
  --bfile /opt/data/project/1kg_phase1_all
  --ld-snp rs1558902
  --out r2.for.rs1558902
  --r2
  --threads 10

7484 MB RAM detected; reserving 3742 MB for main workspace.
39728178 variants loaded from .bim file.
1092 people (525 males, 567 females) loaded from .fam.
Using up to 10 threads (change this with --threads).
Before main variant filters, 1083 founders and 9 nonfounders present.
Calculating allele frequencies... done.
commands treat these as missing.
Total genotyping rate is 0.999956.
39728178 variants and 1092 people pass filters and QC.
Note: No phenotypes present.
--r2 to r2.for.rs1558902.ld

In [5]:
!cat r2.for.rs1558902.ld

 CHR_A         BP_A                 SNP_A  CHR_B         BP_B                 SNP_B           R2 
    16     53803574             rs1558902     16     53803187             rs1558901     0.573926 
    16     53803574             rs1558902     16     53803223            rs62048402            1 
    16     53803574             rs1558902     16     53803574             rs1558902            1 
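The two outputs are consistent with each other: each R2 value is simply the square of the corresponding R value, and variants whose r^2 falls below the reporting threshold are dropped. The short check below illustrates this with r values copied from the `--r` output above. The threshold of 0.2 is PLINK 1.9's default for `--ld-window-r2`; the variable names are illustrative.

```python
# r values copied from the --r output shown earlier in this section.
r_values = {
    "rs77955027": -0.0655586,
    "rs8055197": -0.413274,
    "rs1558901": 0.757579,
    "rs62048402": 1.0,
}

R2_THRESHOLD = 0.2  # PLINK 1.9 default for --ld-window-r2

# Squaring removes the sign, so r^2 measures only the strength of the correlation.
r2_values = {snp: r * r for snp, r in r_values.items()}

reported = [snp for snp, r2 in r2_values.items() if r2 >= R2_THRESHOLD]
for snp in reported:
    print(snp, round(r2_values[snp], 6))
```

Note that rs8055197 has a moderate negative r, but its r^2 is below 0.2, which is why it does not appear in the `--r2` output.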


The New England Journal article reports that variant rs1421085 is in strong linkage disequilibrium with rs1558902. The authors found that rs1421085 disrupts an ARID5B repressor motif, and is thus the most likely causal variant.
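Note that rs1421085 does not appear in the r^2 output shown above. One likely reason is PLINK 1.9's default reporting window: pairs separated by too many intervening variants (`--ld-window`, default 10) or by more than 1000 kb (`--ld-window-kb`) are not reported, and the 1000 Genomes data are very dense. The sketch below assembles a command with a wider window. The function name, the output prefix `r2.wide.rs1558902`, and the chosen window values are assumptions for illustration, not settings used by the authors.

```python
def build_ld_command(bfile, snp, out_prefix,
                     window_kb=1000, window_variants=99999, min_r2=0.2):
    """Assemble a PLINK 1.9 --r2 command with an explicit reporting window."""
    return [
        "plink",
        "--bfile", bfile,
        "--r2",
        "--ld-snp", snp,
        "--ld-window-kb", str(window_kb),     # maximum distance between the pair, in kb
        "--ld-window", str(window_variants),  # maximum number of variants spanned by the pair
        "--ld-window-r2", str(min_r2),        # minimum r^2 to report
        "--out", out_prefix,
    ]

cmd = build_ld_command(
    bfile="/opt/data/project/1kg_phase1_all",
    snp="rs1558902",
    out_prefix="r2.wide.rs1558902",  # hypothetical output prefix
)

# Print the command so it can be pasted into a notebook cell after "!".
print(" ".join(cmd))
```

Widening the window increases the size of the output, so raising `min_r2` is a reasonable way to keep the report focused on strongly linked variants.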


## Use GeneCards to find out information about a gene <a name ='GeneCards'></a>

[GeneCards](http://www.genecards.org/) is a database of information about human genes. It provides information about gene function and tissue-specific expression, as well as journal articles in which a given gene is mentioned. 

Look up the following genes in GeneCards. What is the function of each gene? 

 * IRX5 
 * FTO 

## Use PCA to predict ancestry from a 23andMe genetic dataset <a name ='PCA23andme'></a>

In [1]:
from IPython.display import HTML
HTML('<iframe src="https://cambridgespark.com/content/tutorials/genetic-ancestry-analysis-python/index.html" width="1000" height="480"></iframe>')
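To give a feel for what PCA does with genotype data, here is a minimal pure-Python sketch. It is not the method used in the linked tutorial: the genotype matrix, the group labels, and the function names are all hypothetical, and the first principal axis is found by simple power iteration rather than a library routine.

```python
import math

# Hypothetical genotypes: rows are individuals, columns are SNPs,
# values are the number of copies of the alternate allele (0, 1, or 2).
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]
labels = ["A", "A", "A", "B", "B", "B"]  # hypothetical ancestry groups

def center_columns(matrix):
    """Subtract each SNP's mean so that PCA captures variation around the mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def first_principal_axis(centered, iterations=200):
    """Power iteration on X^T X to find the direction of greatest variance."""
    n_snps = len(centered[0])
    v = [1.0] + [0.0] * (n_snps - 1)
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(row[j] * s for row, s in zip(centered, scores))
                 for j in range(n_snps)]
        norm = math.sqrt(sum(w * w for w in new_v))
        v = [w / norm for w in new_v]
    return v

centered = center_columns(genotypes)
axis = first_principal_axis(centered)

# Project each individual onto the first principal component.
pc1 = [sum(x * w for x, w in zip(row, axis)) for row in centered]
for label, score in zip(labels, pc1):
    print(label, round(score, 3))
```

In this toy example the two groups fall on opposite sides of zero along PC1. Real ancestry inference works the same way in principle, but with many more individuals and SNPs, and with a reference panel of known ancestry.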


## Overview of Course Project Files <a name='projectFiles'></a>

All project files can be found in the folder **/opt/data/project**

* /opt/data/project/1kg_phase1_all*   -- binary variant files
* /opt/data/project/gene_coords_hg19.bed -- bed file of gene coordinates 
* /opt/data/project/gencode.hg19.annotation.gtf -- gene annotation file 
* /opt/data/project/motifs.bed -- coordinates of all transcription factor-binding motifs in the genome. 
* /opt/data/project/active_promoters_across_cell_type.bed 
* /opt/data/project/active_enhancers_across_cell_type.bed 


### Binary variant files

* /opt/data/project/1kg_phase1_all*   -- binary variant files

These files store the genotypes of all subjects in phase 1 of the 1000 Genomes Project in a compressed binary format. You can use these with the PLINK tool to identify variants in linkage disequilibrium with your variant of interest. 

In [30]:
## Use the plink --r command to identify all the variants in linkage disequilibrium with target variant rs150021059. 

## How many such variants are there? 


In [None]:
#!cat r.for.rs150021059.ld

### Gene coordinate file 
* /opt/data/project/gene_coords_hg19.bed -- bed file of gene coordinates 

Use this file to find the closest gene to a variant of interest. 

In [None]:
!head -n20 /opt/data/project/gene_coords_hg19.bed

In [None]:
## What are the coordinates of gene 'FTO'? 
##Hint: use the grep command. 
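The idea of finding the closest gene to a variant can be sketched in a few lines of Python. The example below assumes a four-column layout (chromosome, start, end, gene name) and uses made-up genes and coordinates; check the real column layout with `head` before adapting it. On the command line, `bedtools closest` performs the same task.

```python
# Hypothetical BED records: (chrom, start, end, name), 0-based start, exclusive end.
GENES = [
    ("chrN", 100, 500, "GENE_A"),
    ("chrN", 900, 1500, "GENE_B"),
    ("chrM", 100, 500, "GENE_C"),
]

def distance_to_gene(pos0, start, end):
    """Distance in bp from a 0-based position to a half-open interval (0 if inside)."""
    if start <= pos0 < end:
        return 0
    if pos0 < start:
        return start - pos0
    return pos0 - (end - 1)

def closest_gene(chrom, pos_1based, genes):
    """Return (distance, gene_name) for the nearest gene on the same chromosome."""
    pos0 = pos_1based - 1  # variant positions are 1-based, BED starts are 0-based
    candidates = [
        (distance_to_gene(pos0, start, end), name)
        for c, start, end, name in genes
        if c == chrom
    ]
    return min(candidates) if candidates else None

best = closest_gene("chrN", 800, GENES)
print(best)
```

A distance of 0 means the variant lies inside the gene, as rs1558902 does within FTO.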

### Gene annotation file 
* /opt/data/project/gencode.hg19.annotation.gtf -- gene annotation file 

Use this file to identify exons and transcription start sites of genes. 

In [None]:
!head -n20  /opt/data/project/gencode.hg19.annotation.gtf 

In [None]:
## How many exons are there in the FTO gene? 
## Hint: grep is useful here too! 
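Two details are worth keeping in mind when counting exons in a GTF file. A plain substring search also matches other genes whose names contain the query, and the same exon is listed once for every transcript that includes it. The sketch below illustrates both points on a few made-up GTF lines; the gene names, IDs, and coordinates are hypothetical.

```python
import re

# Hypothetical GTF lines (tab-separated, 9 columns; column 3 is the feature type).
GTF_LINES = [
    'chrN\tTOY\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";',
    'chrN\tTOY\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENE_A";',
    'chrN\tTOY\texon\t400\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENE_A";',
    'chrN\tTOY\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "GENE_A";',
    'chrN\tTOY\texon\t700\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "GENE_A";',
    'chrN\tTOY\texon\t300\t350\t.\t-\t.\tgene_id "G2"; transcript_id "T3"; gene_name "GENE_A-AS1";',
]

def exons_for_gene(lines, gene_name):
    """Return (start, end) for every exon record whose gene_name matches exactly."""
    exons = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match and match.group(1) == gene_name:
            exons.append((int(fields[3]), int(fields[4])))
    return exons

exon_records = exons_for_gene(GTF_LINES, "GENE_A")
unique_exons = sorted(set(exon_records))

print("exon records:", len(exon_records))
print("unique exon intervals:", len(unique_exons))
```

Whether you report exon records or unique exon intervals depends on the question being asked, so it helps to state which one you counted.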

### Motif coordinates file

* /opt/data/project/motifs.bed 

Use this file to find the motif that is present at a particular region in the genome.

In [None]:
!head -n20 /opt/data/project/motifs.bed


In [28]:
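## What motif is present at coordinates chr1 53495 53504? 
## printf is used here because it always converts \t to a tab, whereas echo does so only in some shells.
!printf "chr1\t53495\t53504\n" > region_for_motif.bed 

## Hint: use the bedtools intersect command to find the motif. 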

### Active promoters and enhancers across cell type 
* /opt/data/project/active_promoters_across_cell_type.bed 
* /opt/data/project/active_enhancers_across_cell_type.bed 

Use these files to determine whether your variant of interest is in an active enhancer or promoter region. 

In [None]:
!head -n20 /opt/data/project/active_enhancers_across_cell_type.bed 

In [None]:
!head -n20 /opt/data/project/active_promoters_across_cell_type.bed 

From our linkage disequilibrium analysis above, we know that variant rs1558902 is at position: 

    chr16 53803574

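We generate a BED file for this SNP. BED intervals use a 0-based start and an exclusive end, so the 1-based position 53803574 corresponds to the interval 53803573-53803574: 


In [24]:
!printf "chr16\t53803573\t53803574\n" > rs1558902.bed

In [25]:
!cat rs1558902.bed

chr16	53803573	53803574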


We can now use bedtools intersect to check whether the variant falls into an active promoter or enhancer in any cell type. 

In [32]:
## Use the bedtools intersect command to check whether the variant above falls into an active promoter or enhancer region. 
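To make clear what an intersect operation tests, the sketch below checks a query interval against a small list of regions in pure Python. It is an illustration of the overlap rule for BED intervals (0-based start, exclusive end), not a replacement for bedtools; the chromosome names, coordinates, and region labels are hypothetical.

```python
# Hypothetical regulatory regions: (chrom, start, end, label).
REGIONS = [
    ("chrN", 100, 200, "enhancer_1"),
    ("chrN", 500, 650, "enhancer_2"),
    ("chrM", 100, 200, "enhancer_3"),
]

def overlaps(a_start, a_end, b_start, b_end):
    """Two half-open intervals overlap if each one starts before the other ends."""
    return a_start < b_end and b_start < a_end

def intersect(query, regions):
    """Return the labels of all regions on the same chromosome that overlap the query."""
    q_chrom, q_start, q_end = query
    return [
        label
        for chrom, start, end, label in regions
        if chrom == q_chrom and overlaps(q_start, q_end, start, end)
    ]

hits = intersect(("chrN", 549, 550), REGIONS)
print(hits)

# The end coordinate is exclusive, so a query starting exactly at a region's end does not overlap.
edge = intersect(("chrN", 200, 201), REGIONS)
print(edge)
```

An empty result means the variant does not fall in any listed region, which is the same as `bedtools intersect` printing nothing.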