## Big Data for Biologists: Decoding Genomic Function - Class 14


##  Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#geneticVariation>Identify different types of genetic variation that can occur across individuals of a species.</a></li>
 <li><a href=#geneticVariation>Describe the goals of the 1000 Genomes Project.</a></li>
 <li><a href=#vcf>Work with data stored in the Variant Call Format (VCF).</a></li>
 <li><a href=#tabix>Use the tabix tool to query a VCF file.</a></li>
 </ol>



## What is Genetic Variation across individuals of a species? <a name='geneticVariation'></a>

In [1]:
from IPython.display import HTML

HTML('<iframe src="https://drive.google.com/file/d/0B_ssVVyXv8ZSSkJ1SktQTnk2MUU/preview" width="1000" height="480"></iframe>')

In tutorial 4, we learned how to use the [Burrows-Wheeler aligner](http://bio-bwa.sourceforge.net/) to map FASTQ reads to a reference genome. The resulting alignment can serve as a starting point for identifying genetic variants in the genomic sequence data. We have followed the workflow below to identify variants in a yeast dataset: 
![pipeline](../Images/pipeline.png)

## Working with Variant Call Format (VCF) files <a name='vcf'></a>

A whole-genome sequencing experiment was performed on yeast cells. The sequencing was paired-end, producing the FASTQ files **y1.fastq** and **y2.fastq**. These reads were aligned to the yeast reference genome, stored in the file **yeast.fasta**, and variants were called following the pipeline shown above. The resulting variant file, in VCF format, is **yeast_vars.vcf.gz**.

In [11]:
!zcat yeast_vars.vcf.gz | head -n 50

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.6+htslib-1.6
##samtoolsCommand=samtools mpileup -g -o output.bcf -f yeast.fasta output.sorted.bam
##reference=file://yeast.fasta
##contig=<ID=I,length=230218>
##contig=<ID=II,length=813184>
##contig=<ID=III,length=316620>
##contig=<ID=IV,length=1531933>
##contig=<ID=IX,length=439888>
##contig=<ID=Mito,length=85779>
##contig=<ID=V,length=576874>
##contig=<ID=VI,length=270161>
##contig=<ID=VII,length=1090940>
##contig=<ID=VIII,length=562643>
##contig=<ID=X,length=745751>
##contig=<ID=XI,length=666816>
##contig=<ID=XII,length=1078177>
##contig=<ID=XIII,length=924431>
##contig=<ID=XIV,length=784333>
##contig=<ID=XV,length=1091291>
##contig=<ID=XVI,length=948066>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximu

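The lines beginning with `##` are header metadata. Each data line that follows has eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), then a FORMAT column and one column per sample. The columns can be interpreted as described [here](https://faculty.washington.edu/browning/beagle/intro-to-vcf.html).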
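As a rough illustration of that column layout, the sketch below splits one tab-separated data line into the eight fixed VCF columns plus the FORMAT and sample columns. The helper name `parse_vcf_line` is hypothetical, and the example line is rebuilt from a record that appears later in this tutorial. For real analyses a dedicated VCF parser is a safer choice.

```python
# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])  # 1-based position on the contig
    # A "." in the QUAL column means the value is missing.
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    # Column 9 is FORMAT; every column after it holds one sample.
    record["FORMAT"] = fields[8] if len(fields) > 8 else None
    record["SAMPLES"] = fields[9:]
    return record


# Example line rebuilt from a record shown in the tabix section.
example_line = "\t".join([
    "II", "111730", ".", "C", "T", "12.4325", ".",
    "DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=60",
    "GT:PL", "0/1:40,3,0",
])
record = parse_vcf_line(example_line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["QUAL"])
```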

We use the **tabix_index** function from pysam to generate an index of the VCF file for rapid querying.

In [12]:
import pysam
# force=True overwrites an existing index; preset="vcf" selects the VCF column layout.
pysam.tabix_index("yeast_vars.vcf.gz", force=True, preset="vcf")

'yeast_vars.vcf.gz'

Additionally, you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:



In [13]:
!bcftools stats -F yeast.fasta -s - yeast_vars.vcf.gz > yeast_vars.vcf.stats


Print the statistics:

In [14]:
!cat yeast_vars.vcf.stats

# This file was produced by bcftools stats (1.3.1+htslib-1.3.2) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  -F yeast.fasta -s - output.vcf.gz
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	output.vcf.gz
# SN, Summary numbers:
# SN	[2]id	[3]key	[4]value
SN	0	number of samples:	1
SN	0	number of records:	39
SN	0	number of no-ALTs:	0
SN	0	number of SNPs:	31
SN	0	number of MNPs:	0
SN	0	number of indels:	8
SN	0	number of others:	0
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
# TSTV, transitions/transversions:
# TSTV	[2]id	[3]ts	[4]tv	[5]ts/tv	[6]ts (1st ALT)	[7]tv (1st ALT)	[8]ts/tv (1st ALT)
TSTV	0	16	15	1.07	16	15	1.07
# ICS, Indel context summary:
# ICS	[2]id	[3]repeat-consistent	[4]repeat-inconsistent	[5]not applicable	[6]c/(c+i) ratio
ICS	0	0	0	8	0.0000
# ICL, Indel context by length:
# ICL	[2]id	[3]length of repeat element	[4]repeat-consistent deletions)	[5]repeat-inconsist

As the first line of the statistics file notes, it can be plotted with plot-vcfstats, which generates a number of summary plots. Of most interest to us are the tallies of base substitutions and of insertions/deletions (indels) observed in the data.

Substitutions:
![substitutions tally](../Images/substitutions.0.png)
Indels: 
![indels tally](../Images/indels.0.png)
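The TSTV line in the statistics above reports 16 transitions and 15 transversions, giving a ts/tv ratio of 1.07. A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); every other single-base substitution is a transversion. The sketch below shows one way to classify substitutions and compute the ratio. The function names are hypothetical, and this is not how bcftools implements it.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref>alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(substitutions):
    """Compute ts/tv from an iterable of (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in substitutions:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# C>T is a transition; C>A is a transversion.
print(is_transition("C", "T"), is_transition("C", "A"))

# The counts reported in the TSTV line above: 16 transitions, 15 transversions.
print(round(16 / 15, 2))
```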

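Not all variants are high quality. We want to distinguish variants with high quality scores (here, QUAL > 10) from the rest. We can do this by passing filter arguments to **bcftools**. Because the command below uses the soft-filter option `-s LOWQUAL`, variants that fail the `QUAL>10` test are not removed; they are labeled LOWQUAL in the FILTER column so that they can be excluded downstream.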

In [15]:
!bcftools filter -O z -o yeast_vars.filtered.vcf.gz -s LOWQUAL -i'%QUAL>10' yeast_vars.vcf.gz
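To make the effect of a quality-based soft filter concrete, here is a simplified pure-Python analogue that operates on uncompressed VCF lines. It is only a sketch: the helper `soft_filter` is hypothetical, it leaves the FILTER column of passing records untouched, and treating a missing QUAL (`.`) as failing is an assumption rather than documented bcftools behaviour.

```python
def soft_filter(line, min_qual=10.0, tag="LOWQUAL"):
    """Label a VCF data line with `tag` when QUAL is not above min_qual.

    Hypothetical helper; header lines are returned unchanged.
    """
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    qual = fields[5]
    # Assumption: a missing QUAL (".") counts as failing the filter.
    passed = qual != "." and float(qual) > min_qual
    if not passed:
        fields[6] = tag  # FILTER is the 7th column
    return "\t".join(fields)


high = "II\t111730\t.\tC\tT\t12.4325\t.\tDP=1\tGT:PL\t0/1:40,3,0"
low = "II\t200000\t.\tG\tA\t4.5\t.\tDP=1\tGT:PL\t0/1:30,3,0"  # made-up record
for line in (high, low):
    print(soft_filter(line).split("\t")[6])
```

The second record is invented purely to exercise the failing branch; it does not come from the yeast dataset.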


## tabix <a name='tabix'></a>

The tabix tool uses the index of a VCF file to select the variants that fall within a region of interest without scanning the whole file. Here we use it through the pytabix Python module. For example:

In [16]:
# Open the indexed VCF file with pytabix
import tabix
tb=tabix.open("yeast_vars.vcf.gz")

In [17]:
# A query returns an iterator over the results.
records = tb.query("II",1,325188)
for record in records: 
    print(record)

['II', '111730', '.', 'C', 'T', '12.4325', '.', 'DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=60', 'GT:PL', '0/1:40,3,0']
['II', '325186', '.', 'C', 'A', '12.4325', '.', 'DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=60', 'GT:PL', '0/1:40,3,0']
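Each record comes back as a list of strings, one per VCF column, so the INFO and sample columns still need to be unpacked. The sketch below shows one way to do that for the first record printed above. The helper names `parse_info` and `parse_sample` are hypothetical, and all values are kept as strings for simplicity.

```python
def parse_info(info):
    """Turn 'DP=1;MQ=60;INDEL' into {'DP': '1', 'MQ': '60', 'INDEL': True}."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag entries such as INDEL have no '=' and are stored as True.
        parsed[key] = value if sep else True
    return parsed


def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys (e.g. 'GT:PL') with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


# First record returned by the query above.
record = ['II', '111730', '.', 'C', 'T', '12.4325', '.',
          'DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=60',
          'GT:PL', '0/1:40,3,0']

info = parse_info(record[7])
sample = parse_sample(record[8], record[9])
print(info["DP"], info["MQ"], sample["GT"])
```

Here `GT` of `0/1` indicates a heterozygous call, and `DP=1` shows that the call rests on a single read, which is worth keeping in mind when judging its reliability.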


A file must first be indexed (here with `pysam.tabix_index`, as shown earlier) before it can be queried with pytabix.
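A quick way to check whether a file has already been indexed is to look for the index file next to it. The sketch below assumes the default tabix naming, where the index of `yeast_vars.vcf.gz` is written to `yeast_vars.vcf.gz.tbi`; the helper name `has_tabix_index` is hypothetical, and indexes stored under a different name or format are not detected.

```python
import os


def has_tabix_index(vcf_path):
    """Return True if a .tbi index sits next to the compressed VCF file."""
    return os.path.exists(vcf_path + ".tbi")


print(has_tabix_index("yeast_vars.vcf.gz"))
```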