## Big Data for Biologists: Using Python to analyze genomes - Class 12


##  Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#SeqAlignIntro> Understand the standard pipeline for calling variants in sequence data </a></li>
 <li><a href=#Import>Use the BWA aligner to align whole genome sequence reads to a reference genome.</a></li>
 <li><a href=#Package>Manipulate the output from BWA with samtools, specifically using the sort and index commands </a></li>
 <li><a href=#Align2>use samtools and bcftools to call and filter variants from BWA alignment files  </a></li>
 <li><a href=#AlignMuscle>Understand how to use data in the variant call format (VCF) file format.</a></li>
 <li><a href=#FASTQ>Use the tabix tool to query a VCF file.  </a></li>


In tutorial 4, we learned how to use the [Burrows-Wheeler aligner](http://bio-bwa.sourceforge.net/) to map FASTQ reads to a reference genome. Now we will examine how the resulting alignment can serve as a starting point for identifying varyings in the genomic sequence data. We will follow the standard samtools workflow to identify variants in a yeast dataset: 
![pipeline](pipeline.png)

## Downloading the data with the curl command

We use the linux **curl** command  to download the whole genome sequence in fastq format as well as the reference genome for yeast (Saccharomyces_cerevisiae). 
Use the **gzip -d** command to unzip the compressed data file. 
Use the ** head -10000 ** command to read in the first 100,0000 lines from the file. Save the files as y1.fastq and y2.fastq -- this is a paired end whole genome sequence file, hence we generate two files.

In [3]:
#Download the paired-end whole genome sequence for a yeast sample. 
!curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507778/SRR507778_1.fastq.gz | gzip -d | head -100000 > y1.fastq
!curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507778/SRR507778_2.fastq.gz | gzip -d | head -100000 > y2.fastq
    
#Download the yeast reference genome in fasta format. 
!curl ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz | gzip -d > yeast.fasta

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  537M    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
gzip: stdout: Broken pipe
curl: (23) Failed writing body (960 != 1448)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  559M    0  394k    0     0   133k      0  1:11:44  0:00:02  1:11:42  133k
gzip: stdout: Broken pipe
curl: (23) Failed writing body (2328 != 15928)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3782k  100 3782k    0     0  1082k      0  0:00:03  0:00:03 --:--:-- 1082k


##### Recall the structure of a FASTQ file. We are reading in 100,000 lines from the file. How many individual reads does that correspond to? 
##### Your answer here: 

## Align the whole genome sequence reads to the yeast reference genome with BWA 

First, we must index the reference genome with the "BwaIndexCommandline" function. This function will generate the indices that the BWA aligner will need for rapid alignment of the WGS (whole genome sequence) reads. 

In [45]:
from Bio.Sequencing.Applications import BwaIndexCommandline
reference_genome = "yeast.fasta"
index_cmd = BwaIndexCommandline(infile=reference_genome, algorithm="bwtsw")
output=index_cmd()


In [5]:
#The following files were generated by the above indexing command: 
!ls yeast.fasta.*

yeast.fasta.amb  yeast.fasta.bwt  yeast.fasta.pac
yeast.fasta.ann  yeast.fasta.fai  yeast.fasta.sa


Now that the BWA index has been generated, we can run the BWA command to produce a paired end alignment of the WGS sample to the reference sequence. 

In [43]:
#First, we generate the .sai files for each WGS sample. 
from Bio.Sequencing.Applications import BwaAlignCommandline
reference_genome = "yeast.fasta"
read_file = "y1.fastq"
output_sai_file = "y1.sai"
align_cmd = BwaAlignCommandline(reference=reference_genome,read_file=read_file)
print(align_cmd)
output=align_cmd(stdout=output_sai_file)

bwa aln yeast.fasta y1.fastq


In [44]:
#Repeat for the paired reads in y2.fastq 
read_file="y2.fastq"
output_sai_file="y2.sai"
align_cmd = BwaAlignCommandline(reference=reference_genome,read_file=read_file)
print(align_cmd)
output=align_cmd(stdout=output_sai_file)


bwa aln yeast.fasta y2.fastq


To call variants, we would like to convert our alignments from the binary .sai format to the .sam format that is compatible with samtools. 

In [40]:
#Import the BWA alignment tool from Biopython 
from Bio.Sequencing.Applications import BwaSampeCommandline
reference_genome = "yeast.fasta"
read_file1 = "y1.fastq"
read_file2 = "y2.fastq"
sai_file1 = "y1.sai"
sai_file2 = "y2.sai"
output_sam_file = "output.sam"
sampe_cmd = BwaSampeCommandline(reference=reference_genome,
                                 sai_file1=sai_file1, sai_file2=sai_file2,
                                 read_file1=read_file1, read_file2=read_file2)
print(sampe_cmd)
output=sampe_cmd(stdout=output_sam_file)

bwa sampe yeast.fasta y1.sai y2.sai y1.fastq y2.fastq


The samtools file format is human readable, so we can examine the **output.sam** file that was generated by the command above. 

In [9]:
!head -n20 output.sam

@SQ	SN:I	LN:230218
@SQ	SN:II	LN:813184
@SQ	SN:III	LN:316620
@SQ	SN:IV	LN:1531933
@SQ	SN:IX	LN:439888
@SQ	SN:Mito	LN:85779
@SQ	SN:V	LN:576874
@SQ	SN:VI	LN:270161
@SQ	SN:VII	LN:1090940
@SQ	SN:VIII	LN:562643
@SQ	SN:X	LN:745751
@SQ	SN:XI	LN:666816
@SQ	SN:XII	LN:1078177
@SQ	SN:XIII	LN:924431
@SQ	SN:XIV	LN:784333
@SQ	SN:XV	LN:1091291
@SQ	SN:XVI	LN:948066
@PG	ID:bwa	PN:bwa	VN:0.7.12-r1039	CL:bwa sampe yeast.fasta y1.sai y2.sai y1.fastq y2.fastq
SRR507778.1	97	Mito	20158	37	36M	=	16941	-3181	ANTATAATATTATCCCCACGAGGGCCACACATGTGT	?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7	XT:A:U	NM:i:1	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:1	XO:i:0	XG:i:0	MD:Z:1A34
SRR507778.1	145	Mito	16941	37	36M	=	20158	3181	TTCATAGTACCCAAATTTAATTTAAATAAAGTGAGA	GFFCFDDG<DIIGHIIEEE>EBAGG@GBDDDBG??B	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:36


We can interpret the fields above as follows:
![SAM alignment format](sam_alignment_format.png)

Next, we sort the output.sam file into chromosome-position order, and convert to the binary .bam format for more efficient parsing. This can be accomplished with the **pysam** module. 

In [10]:
import pysam 
#sort the file "output.sam" and write the output to a binary file "output.sorted.bam" 
pysam.sort("-o","output.sorted.bam","output.sam")

''

Finally, we index the resulting **output.sorted.bam** file. 

In [11]:
pysam.index("output.sorted.bam")

''

As a sanity check, we can report some statistics about the alignment. 

In [12]:
#this command tells us how many reads in the fastq files did not align to the reference genome 
pysam.view("output.sorted.bam","-c","-f","4").strip()

'3656'

In [13]:
#this command tells us how many reads in the fastq files aligned to the reference genome 
pysam.view("output.sorted.bam","-c","-F","4").strip()

'46344'

This suggests that 92.6% of the reads in the FASTQ files align to the reference genome -- the data appears to be good quality!. 

Now that we have aligned the samples to the reference genome, sorted the alignment file, converted it to binary format, and indexed it, we are ready to call variants with samtools. 


## Variant calling with Samtools  

To convert your BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. 

In [14]:
#make sure to include the catch_stdout flag to avoid printing a long output message that will slow down the Jupyter Noteobok. 
pysam.mpileup("output.sorted.bam","-g","-o","output.bcf","-f","yeast.fasta",catch_stdout=False)

We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call. We pass the following flags to the bcftools call command: 
* -v (--variants-only) output variant sites only (as opposed to all sites in the genome) 
* -m (--multiallelic-caller) alternative model for mullti-allelic and rare variant calling
* -O (--optimize) iteratively estimate the fraction of aberrant cells, down to the given fraction. Lowering this value from the default 1.0 to say, 0.3, can help discover more events but also increases noise
* z zip the output 

In [15]:
from pysam import bcftools 
bcftools.call("-vmO", "z" ,"-o","output.vcf.gz","output.bcf",catch_stdout=False)


We can examine the variant call format (VCF) file that was generated: 

In [16]:
!zcat output.vcf.gz | head -n 50

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.5+htslib-1.5
##samtoolsCommand=samtools mpileup -g -o output.bcf -f yeast.fasta output.sorted.bam
##reference=file://yeast.fasta
##contig=<ID=I,length=230218>
##contig=<ID=II,length=813184>
##contig=<ID=III,length=316620>
##contig=<ID=IV,length=1531933>
##contig=<ID=IX,length=439888>
##contig=<ID=Mito,length=85779>
##contig=<ID=V,length=576874>
##contig=<ID=VI,length=270161>
##contig=<ID=VII,length=1090940>
##contig=<ID=VIII,length=562643>
##contig=<ID=X,length=745751>
##contig=<ID=XI,length=666816>
##contig=<ID=XII,length=1078177>
##contig=<ID=XIII,length=924431>
##contig=<ID=XIV,length=784333>
##contig=<ID=XV,length=1091291>
##contig=<ID=XVI,length=948066>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximu

The columns in the vcf file can be interpreted as described [here](http://samtools.github.io/hts-specs/VCFv4.3.pdf)

We use the **tabix_index** command to generate an index of the vcf file for rapid querying. 

In [17]:
pysam.tabix_index("output.vcf.gz", '-f',preset="vcf")

'output.vcf.gz'

Additionally you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:



In [18]:
stats=bcftools.stats("-F","yeast.fasta","-s","-","output.vcf.gz")
outf=open('output.vcf.stats','w')
outf.write(stats)

print the statistics: 

In [19]:
!cat output.vcf.stats

Plot the statistics: 

In [20]:
!plot-vcfstats -p . output.vcf.stats


Parsing bcftools stats output: output.vcf.stats
Sanity check failed: was this file generated by bcftools stats? at /usr/local/bin/plot-vcfstats line 76.
	main::error("Sanity check failed: was this file generated by bcftools stats?") called at /usr/local/bin/plot-vcfstats line 574
	main::parse_vcfstats1(HASH(0x1633620), 0) called at /usr/local/bin/plot-vcfstats line 283
	main::parse_vcfstats(HASH(0x1633620)) called at /usr/local/bin/plot-vcfstats line 43


A number of summary plots are generated. Of most interest to us is the tally of base substitutions and insertions/deletions (indels) observed in the data. 

Substitutions:
![substitutions tally](substitutions.0.png)
Indels: 
![indels tally](indels.0.png)

Not all variants are high quality. We want to apply filters to the vcf file to keep only variants with high quality scores (i.e. QUAL > 10). We can do this by passing filter arguments to **bcftools**. 

In [26]:
import pysam 
from pysam import bcftools
filtered=bcftools.filter("-O","z","-o","output.filtered.vcf.gz","-s", "LOWQUAL", "-i'%QUAL>10'","output.vcf.gz")
outf=open("output.filtered.vcf.gz",'w')
outf.write(filtered)

## tabix 

The tabix tool can be used to index into a vcf file and select variants that fall within a region of interest. For example: 

In [35]:
#load the filtered vcf file into tabix 
import tabix
tb=tabix.open("output.vcf.gz")

In [38]:
# A query returns an iterator over the results.
records = tb.query("II",1,325188)
for record in records: 
    print(record)

['II', '111730', '.', 'C', 'T', '10.7923', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60', 'GT:PL', '1/1:40,3,0']
['II', '325186', '.', 'C', 'A', '10.7923', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60', 'GT:PL', '1/1:40,3,0']


A file must first be indexed with pytabix before it can be queried. 