
 <li><a href=#pipeline>Understand the standard pipeline for calling variants in sequence data.</a></li>
 <li><a href=#bwa>Use the BWA aligner to align whole genome sequence reads to a reference genome.</a></li>
 <li><a href=#samtools>Use samtools and bcftools to call and filter variants from BWA alignment files.</a></li>

In tutorial 4, we learned how to use the [Burrows-Wheeler aligner](http://bio-bwa.sourceforge.net/) to map FASTQ reads to a reference genome. Now we will examine how the resulting alignment can serve as a starting point for identifying variants in genomic sequence data. We will follow the standard samtools workflow to identify variants in a yeast dataset:
![pipeline](pipeline.png)

We begin the variant-calling pipeline with three files: 

* The yeast reference genome, stored in FASTA format in the file **yeast.fasta**
* The first read of each pair from the yeast whole genome sequencing run, stored in FASTQ format in the file **y1.fastq**
* The second read of each pair from the yeast whole genome sequencing run, stored in FASTQ format in the file **y2.fastq**
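
Paired-end alignment with BWA expects the two FASTQ files to list the same reads in the same order. The following is a minimal sketch of how that could be checked before aligning. The helper names `read_names` and `pairs_match` are hypothetical, and the code assumes plain (uncompressed) FASTQ files with exactly four lines per record.

```python
from itertools import zip_longest


def read_names(fastq_path):
    """Yield the read name from each record of a 4-line-per-record FASTQ file."""
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only the first line of each record holds the name
            fields = line[1:].split()  # drop the leading '@' and any description
            if not fields:
                continue  # tolerate a trailing blank line
            name = fields[0]
            # Older Illumina headers mark the mate with a /1 or /2 suffix
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def pairs_match(fastq1, fastq2):
    """Return True if both files contain the same read names in the same order."""
    names1 = read_names(fastq1)
    names2 = read_names(fastq2)
    # zip_longest pads the shorter file with None, so unequal lengths fail
    return all(a == b for a, b in zip_longest(names1, names2))
```

For the files used here, the check would be `pairs_match("y1.fastq", "y2.fastq")`.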

## Align the whole genome sequence reads to the yeast reference genome with BWA <a name='bwa'></a>

First, we must index the reference genome with Biopython's "BwaIndexCommandline" wrapper, which builds and runs the bwa index command. This step generates the index files that the BWA aligner needs for rapid alignment of the WGS (whole genome sequence) reads.

In [None]:
from Bio.Sequencing.Applications import BwaIndexCommandline
reference_genome = "yeast.fasta"
index_cmd = BwaIndexCommandline(infile=reference_genome, algorithm="bwtsw")
output=index_cmd()


In [None]:
#The following files were generated by the above indexing command: 
!ls yeast.fasta.*
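
As a complement to the `ls` listing, here is a small sketch that checks for the index files programmatically. `bwa index` normally writes five files next to the reference, with the suffixes `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa`. The helper name `missing_index_files` is hypothetical.

```python
from pathlib import Path

# Suffixes that `bwa index` appends to the reference file name
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(reference, directory="."):
    """Return the names of expected BWA index files that do not exist yet."""
    directory = Path(directory)
    return [
        reference + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not (directory / (reference + suffix)).exists()
    ]
```

An empty list from `missing_index_files("yeast.fasta")` would indicate that indexing completed and the alignment step can proceed.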

Now that the BWA index has been generated, we can run the BWA command to produce a paired end alignment of the WGS sample to the reference sequence. 

In [None]:
#First, we generate the .sai file for the first read file, y1.fastq.
from Bio.Sequencing.Applications import BwaAlignCommandline
reference_genome = "yeast.fasta"
read_file = "y1.fastq"
output_sai_file = "y1.sai"
align_cmd = BwaAlignCommandline(reference=reference_genome,read_file=read_file)
print(align_cmd)
output=align_cmd(stdout=output_sai_file)

In [None]:
#Repeat for the paired reads in y2.fastq 
read_file="y2.fastq"
output_sai_file="y2.sai"
align_cmd = BwaAlignCommandline(reference=reference_genome,read_file=read_file)
print(align_cmd)
output=align_cmd(stdout=output_sai_file)


To call variants, we must convert our alignments from the binary .sai format to the SAM format that samtools reads. The bwa sampe command combines the two .sai files with the original FASTQ reads to produce a paired-end alignment.

In [None]:
#Import the Biopython wrapper for bwa sampe, which pairs the two .sai alignments and writes SAM output
from Bio.Sequencing.Applications import BwaSampeCommandline
reference_genome = "yeast.fasta"
read_file1 = "y1.fastq"
read_file2 = "y2.fastq"
sai_file1 = "y1.sai"
sai_file2 = "y2.sai"
output_sam_file = "output.sam"
sampe_cmd = BwaSampeCommandline(reference=reference_genome,
                                 sai_file1=sai_file1, sai_file2=sai_file2,
                                 read_file1=read_file1, read_file2=read_file2)
print(sampe_cmd)
output=sampe_cmd(stdout=output_sam_file)
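
For reference, the wrappers used above print and run ordinary `bwa` command lines. The sketch below builds the equivalent argument lists directly, which may be useful if the `Bio.Sequencing.Applications` wrappers are unavailable in your Biopython version (recent releases deprecate the command-line wrappers). The helper name `bwa_commands` is hypothetical, and nothing is executed here; the commands are only printed.

```python
import shlex


def bwa_commands(reference, reads1, reads2, sai1, sai2):
    """Return the bwa command lines for indexing, aligning, and pairing reads."""
    return [
        # Build the index; "bwtsw" matches the algorithm chosen earlier
        ["bwa", "index", "-a", "bwtsw", reference],
        # bwa aln writes the .sai data to standard output
        ["bwa", "aln", reference, reads1],
        ["bwa", "aln", reference, reads2],
        # bwa sampe writes SAM text to standard output
        ["bwa", "sampe", reference, sai1, sai2, reads1, reads2],
    ]


commands = bwa_commands("yeast.fasta", "y1.fastq", "y2.fastq", "y1.sai", "y2.sai")
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, stdout=output_handle, check=True)`, with `output_handle` opened on the matching .sai or .sam file, since `bwa aln` and `bwa sampe` write their results to standard output.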

The SAM file format is human-readable text, so we can examine the **output.sam** file that was generated by the command above.

In [None]:
!head -n20 output.sam

We can interpret the fields above as follows:
![SAM alignment format](sam_alignment_format.png)
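
To make the field layout concrete, here is a minimal sketch that splits one alignment line into the eleven mandatory columns defined by the SAM specification. The helper `parse_sam_line` and the example record are hypothetical and are not taken from **output.sam**. Header lines, which start with `@`, would need to be skipped before calling it.

```python
# The eleven mandatory SAM columns, in order
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    # Anything after column 11 is an optional TAG:TYPE:VALUE field
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


# Hypothetical record for illustration only
example = "read1\t99\tchrI\t1000\t37\t4M\t=\t1200\t204\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"])
```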

Next, we sort the output.sam file into chromosome-position order, and convert to the binary .bam format for more efficient parsing. This can be accomplished with the **pysam** module. 

In [None]:
import pysam 
#sort the file "output.sam" and write the output to a binary file "output.sorted.bam" 
pysam.sort("-o","output.sorted.bam","output.sam")

Finally, we index the resulting **output.sorted.bam** file. 

In [None]:
pysam.index("output.sorted.bam")

As a sanity check, we can report some statistics about the alignment. 

In [None]:
#this command tells us how many reads in the fastq files did not align to the reference genome 
pysam.view("output.sorted.bam","-c","-f","4").strip()

In [None]:
#this command tells us how many reads in the fastq files aligned to the reference genome 
pysam.view("output.sorted.bam","-c","-F","4").strip()

This suggests that 92.6% of the reads in the FASTQ files align to the reference genome, so the data appear to be of good quality.
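
The two counts rely on bit 4 of the SAM FLAG field, which marks a read as unmapped: `-f 4` keeps reads with that bit set and `-F 4` keeps reads without it. The sketch below shows the bit test and how a percentage is derived from the two counts. The helper names are hypothetical, and the counts in the example are placeholders, not the values from this dataset.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(flag):
    """Return True if the unmapped bit is set in a SAM FLAG value."""
    return bool(flag & UNMAPPED_FLAG)


def percent_mapped(mapped, unmapped):
    """Return the percentage of reads that aligned to the reference."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads were counted")
    return 100.0 * mapped / total


# Placeholder counts for illustration; substitute the outputs of the
# two pysam.view calls above, converted with int().
print(f"{percent_mapped(926, 74):.1f}%")
```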

Now that we have aligned the samples to the reference genome, sorted the alignment file, converted it to binary format, and indexed it, we are ready to call variants with samtools. 


## Variant calling with samtools <a name='samtools'></a>

To summarize the aligned reads in the BAM file at each genomic position, we first use mpileup to produce a BCF file that contains all of the locations in the genome.

In [None]:
#make sure to pass catch_stdout=False to avoid printing a long output message that will slow down the Jupyter Notebook.
pysam.mpileup("output.sorted.bam","-g","-o","output.bcf","-f","yeast.fasta",catch_stdout=False)

We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call. We pass the following flags to the bcftools call command: 
* -v (--variants-only) output variant sites only (as opposed to all sites in the genome) 
* -m (--multiallelic-caller) alternative model for multi-allelic and rare variant calling
* -O z (--output-type z) write the output as a compressed (bgzipped) VCF file
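
The flags above can be assembled into a single `bcftools call` command line. This is a minimal sketch that only builds and prints the command. The helper name `bcftools_call_command` and the output file name **variants.vcf.gz** are assumptions for illustration; **output.bcf** is the file produced by the mpileup step.

```python
import shlex


def bcftools_call_command(input_bcf, output_vcf):
    """Build a bcftools call command that writes compressed, variant-only VCF."""
    return [
        "bcftools", "call",
        "-v",            # --variants-only: report variant sites only
        "-m",            # --multiallelic-caller: multiallelic calling model
        "-O", "z",       # --output-type z: compressed VCF output
        "-o", output_vcf,
        input_bcf,
    ]


# "variants.vcf.gz" is a hypothetical output name
call_command = bcftools_call_command("output.bcf", "variants.vcf.gz")
print(shlex.join(call_command))
```

The resulting list could be run with `subprocess.run(call_command, check=True)` on a system where bcftools is installed.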