# Mapping by sequencing pipeline
## Parental strain analysis - Hawaiian tutorial
* [Amhed Missael Vargas Velazquez](https://www.researchgate.net/profile/Amhed-Vargas-Velazquez)
* Post-doctoral fellow, [SGB lab](https://syngenbio.kaust.edu.sa/), [KAUST](https://www.kaust.edu.sa/en)

## Description
This Jupyter notebook guides you through the computation used to identify genomic variants in the highly polymorphic *C. elegans* strain CB4856, which was isolated in Hawaii (hence its alias, *Hawaiian*). This step of the [mapping by sequencing](https://www.genetics.org/content/204/2/451) pipeline is needed only if the mapping strain, *a.k.a.* the parental strain, has not been analyzed before, *i.e.*, you do not have a .vcf file containing its background genomic variants.

## Software requirements
The core of this notebook runs in Python. However, most of the analyses are performed by other programs (invoked through `!` system calls), which have to be installed or, for portability, be present in a folder (check the `WGS-Software_configuration` notebook for further details). Note that the programs in **bold** are the only ones we distribute; the rest must be installed on the computer running this notebook.

* [samtools](http://www.htslib.org/)
* [bwa](https://github.com/lh3/bwa)
* [**GATK**](https://gatk.broadinstitute.org/hc/en-us) [3.7+](https://console.cloud.google.com/storage/browser/gatk-software/package-archive/gatk/); note that the current version of GATK (4.0+) uses different nomenclature and different tools, which makes it incompatible with this pipeline. We describe in detail why we stick to this version in the notebook `WGS-Software_configuration`.
* [**picard**](https://broadinstitute.github.io/picard/)
* [**snpEff**](https://pcingola.github.io/SnpEff/)
* [fastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* [SRAToolkit](https://github.com/ncbi/sra-tools)
* wget
* zcat
* [R](https://cran.r-project.org/)
* perl

### Python libraries
* os
* IPython.display
* numpy
* pandas
* matplotlib

### R libraries
* ggplot2

## Getting started
Run the cells below to create a working directory, load the necessary python libraries, and to verify you have the required software.

### Load python libraries
Run the cell below to load essential libraries for the pipeline to work:

In [None]:
## Load libraries
#os to move within directories
import os
#IPython.display for markdown
from IPython.display import display, Markdown
#matplotlib, pandas, and numpy for plotting and data analysis respectively
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Produce a folder that can be accessed by the pipeline
Before starting any analysis, make sure to select a folder where the analysis will be performed. *Unless* stated otherwise, the analysis will be performed in the **same folder as this notebook**:

In [None]:
##Set working directory
#Same location as script
path = os.getcwd()
#or somewhere else, e.g.:
#path = '/home/jupyter-user/Workstation/user/parental'

##Move to path
os.chdir(path)

##Show current directory to user
display(Markdown('<div class=\"alert alert-block alert-info\">Directory for analysis:<br><b>' + os.getcwd() + '</b></div>'))

### Make sure to have the "stand alone" software
This pipeline requires programs that run in a Java environment, and so they come as "stand-alone" software. Make sure to have a directory containing the following ones:

- GATK 3.7+ (GenomeAnalysisTK.jar)
- Picard (picard.jar)
- SnpEff (folder with both snpEff.jar and SnpSift.jar)

For your convenience, we distribute a folder already containing these programs. Just make sure to set the path to them properly, e.g.:

In [None]:
##Path to software folder
softpath = '/home/Wormstation/Software'

##GATK v3.8.1.0
GATKpath = (softpath + '/gatk-3.8.1.0')

##Picard v2.23.6 
PiKpath = (softpath + '/picard2.23.6')

##SnpEff v5.0
Snpath = (softpath + '/SnpEff-5.0/snpEff')

##Alternative paths
#GATKpath = ''
#PiKpath = ''
#Snpath = ''

##Check if jar files are there
##Notify user if GATK .jar is present or not
if os.path.isfile(GATKpath + '/GenomeAnalysisTK.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b GATK</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>GATK not found in: ' + GATKpath +'</div>'))

##Notify user if Picard .jar is present or not
if os.path.isfile(PiKpath + '/picard.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Picard</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Picard not found in: ' + PiKpath +'</div>'))

##Notify user if SnpEff .jar is present or not
if os.path.isfile(Snpath + '/snpEff.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b SnpEff</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>SnpEff not found in: ' + Snpath +'</div>'))

In the case that a new version of a program is needed or the software configuration has changed, please check the jupyter notebook `WGS-Software_configuration`. This notebook indicates where and how to set up the programs needed for the pipeline.
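
The command-line tools listed under *Software requirements* (samtools, bwa, fastqc, and so on) are not covered by the jar checks above. A minimal sketch of how their presence on the `PATH` could be verified with the standard library; the helper name `find_missing_tools` and the exact list of executable names are illustrative assumptions, not part of the original pipeline:

```python
import shutil

def find_missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Executable names assumed from the software list; adjust to your installation
required = ['samtools', 'bwa', 'fastqc', 'fastq-dump', 'wget', 'zcat', 'Rscript', 'perl', 'java']
missing = find_missing_tools(required)
if missing:
    print('Missing from PATH: ' + ', '.join(missing))
else:
    print('All command-line tools found')
```

An empty result means every listed executable was found; it does not check versions.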

### Make sure to have a reference genome
To run this pipeline, a *C. elegans* reference genome is needed (preferably in FASTA format). The cells below allow you to prepare your own reference file or to download the *C. elegans* WBcel235 (ce11) assembly from Ensembl.

First, download or specify the location of your reference genome:

In [None]:
##Specify the location of the reference genome
#e.g.
#RefFile=('/home/Wormstation/data/Caenorhabditis_elegans.fa')

###OR

##Download ce11 into a new folder called data
!mkdir -p {path}/data

os.chdir(path+'/data')

!wget -q ftp://ftp.ensembl.org/pub/release-99/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz

!zcat Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz > Caenorhabditis_elegans.WBcel235.99.softmasked.fa

RefFile=(path+'/data/'+'Caenorhabditis_elegans.WBcel235.99.softmasked.fa')

##Notify user if file present or not
if os.path.isfile(RefFile):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>Reference Genome :</b>\n'+RefFile+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Reference file: '+RefFile+' does not exist</div>'))
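
Before indexing, it can be worth confirming that the FASTA file contains the sequences you expect. The sketch below lists sequence names and lengths using only the standard library. The function name `fasta_lengths` is hypothetical, and the toy input is made up for illustration:

```python
import io

def fasta_lengths(handle):
    """Map each sequence name (first word of the header) to its length in bases."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            # Keep only the identifier, dropping the free-text description
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

# Toy example; with the real genome use: with open(RefFile) as fh: fasta_lengths(fh)
toy = io.StringIO('>I dna:chromosome\nACGT\nAC\n>MtDNA dna:chromosome\nGG\n')
print(fasta_lengths(toy))
```

The names reported here are the ones that will appear in the BAM and VCF files, so they need to match the chromosome names used by the SnpEff database selected later.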

Then it needs to be indexed by samtools:

In [None]:
!samtools faidx {RefFile}

##Notify user if this step was successful
if os.path.isfile(RefFile+'.fai'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not produced for '+RefFile+'</div>'))

And bwa:

In [None]:
!bwa index {RefFile} > /dev/null 2>&1

##Notify user if this step was successful
if os.path.isfile(RefFile+'.bwt'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not produced for '+RefFile+'</div>'))

And by picard tools:

In [None]:
##Strip only the final extension so that the .dict keeps the same base name as the fasta
RefFilebase = os.path.splitext(RefFile)[0]
!java -jar {softpath}/picard2.23.6/picard.jar CreateSequenceDictionary R={RefFile} O={RefFilebase +'.dict'} > /dev/null 2>&1

##Notify user if this step was successful
if os.path.isfile(RefFilebase +'.dict'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Sequence dictionary not produced for '+RefFile+'</div>'))

Now your reference genome is ready for the analysis in this pipeline.

In [None]:
ReferenceGenome = RefFile
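
Each indexing step writes its output next to the FASTA file. As a sketch of the naming conventions involved (the helper `reference_companions` is hypothetical), samtools and bwa append their extensions to the full file name, whereas the sequence dictionary is expected under the FASTA base name with the final extension replaced by `.dict`:

```python
import os

def reference_companions(ref_fasta):
    """Return the index files expected to sit next to `ref_fasta`."""
    base = os.path.splitext(ref_fasta)[0]   # drops only the final extension
    return {
        'samtools': ref_fasta + '.fai',
        'bwa': [ref_fasta + ext for ext in ('.amb', '.ann', '.bwt', '.pac', '.sa')],
        'picard': base + '.dict',
    }

ref = '/data/Caenorhabditis_elegans.WBcel235.99.softmasked.fa'
for tool, files in reference_companions(ref).items():
    print(tool, files)
```

Note that `os.path.splitext` keeps the internal dots of the file name, which matters for a name such as the one above.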

## Brief introduction
Whole Genome Sequencing (WGS) has revolutionized how we identify genes that act in the development of a phenotype. For a short historical review check this [timeline of *C. elegans* research](https://www.hobertlab.org/wp-content/uploads/2013/03/Ankeny_2001.pdf) and for a better understanding read the Wormbook chapter on [NGS methods to map mutations in *C. elegans*](https://www.genetics.org/content/204/2/451).

This Jupyter notebook guides you through the computation used to identify genomic variants in the highly polymorphic *C. elegans* strain CB4856, which was isolated in Hawaii (hence its alias, Hawaiian). Once we obtain the genomic variants present in the Hawaiian strain, we will be able to use the mapping by sequencing technique (check the `WGS-Mutant_strain-Hawaiian_tutorial` notebook for the next step).

### Hawaiian strain

CB4856 was first described in [Koch *et al.*](https://genome.cshlp.org/content/10/11/1690.full.pdf) as one of the most evolutionarily divergent strains of *C. elegans* (possibly due to its isolation from the mainland). Indeed, it has been estimated to differ from the *C. elegans* reference strain N2 by, on average, 1 SNP every 840 bp ([Maydan *et al.*](https://genome.cshlp.org/content/17/3/337.long)). This notebook contains the commands used to identify the genomic variants between these two strains. First, we need to obtain the sequencing data of the Hawaiian strain. For this pipeline, we will use data stored in the NCBI SRA coming from [Erik Andersen's lab](https://andersenlab.org/).

We will use `wget` to download the CB4856 SRA file and the `SRAToolkit` to convert it back to fastq files:

In [None]:
## SRR2003569: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2003569
!wget -q https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2003569/SRR2003569

##Dump files into fastq files
!fastq-dump --split-files SRR2003569

Alternatively, you can copy the sequencing reads from our internal folder:

In [None]:
#!cp /home/Wormstation/data/SRR2003569_1.fastq .
#!cp /home/Wormstation/data/SRR2003569_2.fastq .

In [None]:
##Verify if files were properly obtained
#SRR2003569_1.fastq
if os.path.isfile('SRR2003569_1.fastq'):
    display(Markdown('<b>SRR2003569_1.fastq \b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>The file SRR2003569_1.fastq is not present in current path: '+os.getcwd()+'</div>'))

#SRR2003569_2.fastq
if os.path.isfile('SRR2003569_2.fastq'):
    display(Markdown('<b>SRR2003569_2.fastq \b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>The file SRR2003569_2.fastq is not present in current path: '+os.getcwd()+'</div>'))
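
Beyond checking that the files exist, a paired-end run should contain the same number of reads in both files. A minimal sketch, assuming uncompressed FASTQ files with four lines per record, as written by `fastq-dump --split-files` (the function names `count_fastq_reads` and `check_pairs` are illustrative):

```python
def count_fastq_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file with four lines per record."""
    with open(fastq_path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(fastq_path + ' looks truncated: line count is not a multiple of 4')
    return n_lines // 4

def check_pairs(fastq1, fastq2):
    """Return True when both mate files contain the same number of reads."""
    return count_fastq_reads(fastq1) == count_fastq_reads(fastq2)
```

For the files used here this would be called as `check_pairs('SRR2003569_1.fastq', 'SRR2003569_2.fastq')`; expect it to take a while on full-size files.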

## Data analysis workflow
The pipeline consists of:
* Quality assessment of sequencing reads via FastQC
* Mapping of reads with bwa
* Filtering and processing of alignment file with samtools
* Realignment with GATK
* Variant calling with GATK HC and UG
* Mutational analysis with SnpEff


But before that, let's create a directory where the analysis will be executed:

In [None]:
##Move to the main directory
os.chdir(path)
##Create directory to place data
!mkdir -p {path}/analysis
##Move to that directory
os.chdir( path + '/analysis' )

And let's define the parameters for the pipeline:

In [None]:
##Inputs
#Name
SampleName = 'CB4856'
#File 1
FastqF1 = (path + '/data/SRR2003569_1.fastq') 
#File 2
FastqF2 = (path + '/data/SRR2003569_2.fastq')
#Reference Indexed
#ReferenceGenome = (path + '/data/Caenorhabditis_elegans.WBcel235.99.softmasked.fa')
#Reference library for SnpEff
SnpEffGen = 'WBcel235.99'

##Parameters
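#Minimum mapping quality for bam filtering
minMaqQforBam = 1
#Minimum mapping quality for variant calling (passed to -mmq)
minMaqQforVcf = 10
#Minimum base quality for variant calling (passed to -mbq)
minBaseQforVcf = 10
#Minimum phred-scaled confidence to call a variant (passed to -stand_call_conf)
minVarCall = 10
#Number of threads
Ncpu=4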
#Ram for java
ramG=20
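
The quality thresholds above are on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). As a quick illustration of what the chosen values mean (the function name `phred_to_error_prob` is ours):

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score into an error probability."""
    return 10 ** (-q / 10)

for q in (1, 10, 20, 30):
    print('Q' + str(q) + ': error probability ' + format(phred_to_error_prob(q), '.3f'))
```

A threshold of 10 therefore tolerates roughly a 1 in 10 chance of error. With `minMaqQforBam = 1`, the samtools filter used later only discards alignments whose mapping quality is 0.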

### Quality assessment of Fastq files
A simple way to verify the quality of your sequencing runs is via the FastQC program. The following cells will make a directory to run the analysis and output its results.

In [None]:
##Create directory to place data
!mkdir -p FastQC
##Move to that directory
os.chdir(path + '/analysis/FastQC')

In [None]:
##Perform fastQC analysis
!fastqc -t {Ncpu} {FastqF1} {FastqF2} -o . > /dev/null 2>&1

##Notify user if this step was successful
tempname1 = FastqF1.split("/")[-1]
tempname1 = tempname1.split(".fastq")[0]
if os.path.isfile(tempname1 + '_fastqc.html'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">'))
    !unzip -o -qq \*.zip
    display(Markdown('<b>Metrics:</b>'))
    tempname1 = FastqF1.split("/")[-1]
    tempname1 = tempname1.split(".fastq")[0]
    print(tempname1)
    !cat {tempname1}_fastqc/summary.txt
    tempname2 = FastqF2.split("/")[-1]
    tempname2 = tempname2.split(".fastq")[0]
    print(tempname2)
    !cat {tempname2}_fastqc/summary.txt
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>FASTQC report not produced in current directory: '+os.getcwd()+'</div>'))
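
Each `summary.txt` printed above is a tab-separated table with one line per FastQC module: the status (`PASS`, `WARN`, or `FAIL`), the module name, and the file name. A small sketch for pulling out only the modules that need attention; the function name `flagged_modules` is hypothetical and the example lines are made up, not real results:

```python
def flagged_modules(summary_text):
    """Return (status, module) pairs for every module that did not PASS."""
    flagged = []
    for line in summary_text.splitlines():
        fields = line.split('\t')
        if len(fields) >= 2 and fields[0] != 'PASS':
            flagged.append((fields[0], fields[1]))
    return flagged

# Made-up example lines in the summary.txt layout
example = ('PASS\tBasic Statistics\tSRR2003569_1.fastq\n'
           'WARN\tPer base sequence content\tSRR2003569_1.fastq\n'
           'FAIL\tPer sequence GC content\tSRR2003569_1.fastq\n')
print(flagged_modules(example))
```

On real output, the text would come from reading `{tempname1}_fastqc/summary.txt`.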


#### HTML reports
Run the cell below to link the FastQC results to this notebook.

In [None]:
display(Markdown('Full reports (open a new tab to see them): \n' + '* [' + tempname1 + '](./analysis/FastQC/' + tempname1 +'_fastqc.html)\n' + '* [' + tempname2 + '](./analysis/FastQC/' + tempname2 +'_fastqc.html)\n'))

### Mapping reads to reference genome using bwa

After assessing the quality of the Fastq reads, we will map them to the reference genome using bwa. Running the cells below will create a directory where the analysis will be performed.

In [None]:
##Move to analysis
os.chdir(path + '/analysis/')
##Create directory to place data
!mkdir -p Alignments
##Move to that directory
os.chdir(path + '/analysis/Alignments')

In [None]:
!bwa mem -t {Ncpu} -M {ReferenceGenome} {FastqF1} {FastqF2} -o {SampleName}.sam > /dev/null 2>&1

##Notify user if previous step was successful
if os.path.isfile(SampleName+'.sam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Sam file not present in current path: '+os.getcwd()+'</div>'))

Now let's convert it to a sorted bam file:

In [None]:
!samtools view -@ {Ncpu} -S -bh {SampleName}.sam | samtools sort -@ {Ncpu} - > {SampleName}.bam

if os.path.isfile(SampleName+'.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Bam file not present in current path: '+os.getcwd()+'</div>'))

Now let's index it:

In [None]:
!samtools index {SampleName}.bam

if os.path.isfile(SampleName+'.bam.bai'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not present in current path: '+os.getcwd()+'</div>'))

Filter for mapping quality

In [None]:
!samtools view -@ {Ncpu} -q {minMaqQforBam} -bh {SampleName}.bam > {SampleName}.SF.bam

if os.path.isfile(SampleName+'.SF.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Filtered file not present in current path: '+os.getcwd()+'</div>'))

Then add or replace read groups via picard:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/picard2.23.6/picard.jar AddOrReplaceReadGroups I={SampleName}.SF.bam O={SampleName}.RG.bam RGID={SampleName} RGLB=LB RGPL=illumina RGPU=PU RGSM={SampleName} > /dev/null 2>&1

if os.path.isfile(SampleName+'.RG.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Read groups file not present in current path: '+os.getcwd()+'</div>'))

Then we mark duplicates

In [None]:
!java -Xmx{ramG}g -jar {softpath}/picard2.23.6/picard.jar MarkDuplicates I={SampleName}.RG.bam O={SampleName}.Dup.bam M={SampleName}.dedupMetrics REMOVE_DUPLICATES=true > /dev/null 2>&1

if os.path.isfile(SampleName+'.Dup.bam'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Alignment file without duplicates not present in current path: '+os.getcwd()+'</div>'))
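
Duplicate status, like pairing and strand information, is stored in the bitwise FLAG column of each alignment (duplicates carry the bit 0x400). Because `REMOVE_DUPLICATES=true` is set above, duplicate reads are dropped from the output rather than kept with the flag set. A short sketch for decoding a FLAG value, following the bit definitions of the SAM specification (the function name and the labels are ours):

```python
# (bit, label) pairs as defined in the SAM specification
SAM_FLAG_BITS = [
    (0x1, 'paired'), (0x2, 'proper_pair'), (0x4, 'unmapped'),
    (0x8, 'mate_unmapped'), (0x10, 'reverse'), (0x20, 'mate_reverse'),
    (0x40, 'read1'), (0x80, 'read2'), (0x100, 'secondary'),
    (0x200, 'qc_fail'), (0x400, 'duplicate'), (0x800, 'supplementary'),
]

def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]

print(decode_flag(99))    # a typical first mate of a properly paired read
print(decode_flag(1024))  # a read marked only as duplicate
```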

And finally, index the resulting bam file

In [None]:
!samtools index {SampleName}.Dup.bam

if os.path.isfile(SampleName+'.Dup.bam.bai'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not present in current path: '+os.getcwd()+'</div>'))

### Local realignment of reads around indels via GATK
Let's now realign the reads to improve variant calling around indels. We start by identifying the location of possible indels via GATK:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt {Ncpu} -R {ReferenceGenome} -I {SampleName}.Dup.bam -o {SampleName}.Indel.intervals > /dev/null 2>&1

if os.path.isfile(SampleName+'.Indel.intervals'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>File with indel intervals not present in current path: '+os.getcwd()+'</div>'))

And then we perform the realignment around the identified regions:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T IndelRealigner -R {ReferenceGenome} -I {SampleName}.Dup.bam -targetIntervals {SampleName}.Indel.intervals -o {SampleName}.realigned.bam > /dev/null 2>&1

if os.path.isfile(SampleName+'.realigned.bam'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Realigned bam not present in current path: '+os.getcwd()+'</div>'))

Check that the realigned file is present:

In [None]:
if os.path.isfile(SampleName+'.realigned.bam'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Mapping step complete</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Realigned bam not present in current path: '+os.getcwd()+'<br>Either the path is wrong or cells above have to be run again</div>'))

Now it is time for variant calling.

Though feel free to check the mapping file using tablet/igv. You can download the bam and the index file from:

In [None]:
display(Markdown('Download bam files (right-click and \"save as\"): \n' + '* [' + SampleName + '.realigned.bam](./analysis/Alignments/' + SampleName +'.realigned.bam)\n' + '* [' + SampleName + '.realigned.bam.bai](./analysis/Alignments/' + SampleName +'.realigned.bam.bai)\n'))

### Variant calling
Now that our reads are properly aligned, let's do variant calling. To start, let's make a new directory where the analysis will be run and the results reported.

In [None]:
##Move to analysis
os.chdir(path + '/analysis/')
##Create directory to place data
!mkdir -p VariantCalling
##Move to that directory
os.chdir(path + '/analysis/VariantCalling')

This pipeline uses two different GATK variant callers: Unified Genotyper (UG) and Haplotype Caller (HC). Both should produce good calls, though some pipelines prefer HC given its confidence metrics.

#### Unified genotyper
The code below produces a vcf via GATK's unified genotyper tool

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R {ReferenceGenome} -nt {Ncpu} -l INFO -glm BOTH -I {path}/analysis/Alignments/{SampleName}.realigned.bam -o {SampleName}.UG.vcf -mbq {minBaseQforVcf} -stand_call_conf {minVarCall} > /dev/null 2>&1

if os.path.isfile(SampleName+'.UG.vcf'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>UG vcf not present in current path: '+os.getcwd()+'</div>'))

#### Haplotype caller
Before calling variants via GATK's Haplotype Caller, we need to produce confidence values for each site and store them in a g.vcf file:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF -R {ReferenceGenome} -l INFO -I {path}/analysis/Alignments/{SampleName}.realigned.bam -o {SampleName}.g.vcf -mbq {minBaseQforVcf} -mmq {minMaqQforVcf} -stand_call_conf {minVarCall} > /dev/null 2>&1

if os.path.isfile(SampleName+'.g.vcf'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>g.vcf file not present in current path: '+os.getcwd()+'</div>'))

After that, we can proceed to call the vcf from the g.vcf

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R {ReferenceGenome} -nt {Ncpu} -l INFO -V {SampleName}.g.vcf -dt none -o {SampleName}.HC.vcf  > /dev/null 2>&1

if os.path.isfile(SampleName+'.HC.vcf'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>HC vcf not present in current path: '+os.getcwd()+'</div>'))

In [None]:
## Check if HC vcf is done
if os.path.isfile(SampleName+'.HC.vcf'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Haplotype Caller complete</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>HC vcf not present in current path: '+os.getcwd()+'<br>Either the path is wrong or cells above have to be run again</div>'))

## Check if UG vcf is done
if os.path.isfile(SampleName+'.UG.vcf'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Unified Genotyper complete</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>UG vcf not present in current path: '+os.getcwd()+'<br>Either the path is wrong or cells above have to be run again</div>'))
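
A quick way to compare the two callers is to count how many SNPs and indels each one reports. A minimal sketch working on the lines of a VCF file; the function name `count_variant_types` is illustrative, the toy records are made up, and the classification is deliberately simple (a record counts as a SNP only when the reference and every alternative allele are a single nucleotide):

```python
from collections import Counter

BASES = {'A', 'C', 'G', 'T'}

def count_variant_types(vcf_lines):
    """Count records as 'snp' or 'other' (indels and mixed sites)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip('\n').split('\t')
        ref, alts = fields[3].upper(), fields[4].upper().split(',')
        is_snp = ref in BASES and all(alt in BASES for alt in alts)
        counts['snp' if is_snp else 'other'] += 1
    return counts

# Made-up records with the first five VCF columns only
toy = ['##fileformat=VCFv4.2', 'I\t100\t.\tA\tG', 'I\t200\t.\tAT\tA', 'II\t50\t.\tC\tT,G']
print(count_variant_types(toy))
```

On the real files this could be called as `count_variant_types(open(SampleName + '.HC.vcf'))` and again for the UG file.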


Now, for each vcf, let's plot the distribution of the analyzed variants with Python. First, the vcf produced with Haplotype Caller:

In [None]:
df = pd.read_csv((SampleName+'.HC.vcf'),delimiter='\t',comment='#', names=["chr","pos","id","ref","alt","qual","filter","info","format",SampleName])

strarray= df[SampleName]

RefD=[]
AltD=[]
Freq=[]
for i in range(len(strarray)):
    deps=(strarray[i].split(':'))[1].split(',')
    RefD.append(int(deps[0]))
    AltD.append(int(deps[1]))
    tmpfre=int(deps[0]) + int(deps[1])
    if tmpfre > 0 :
        Freq.append(int(deps[1])/(int(deps[0]) + int(deps[1])))
    else :
        Freq.append(0)

df['RefD']=RefD
df['AltD']=AltD
df['Freq']=Freq

##Plot as separated interactive plot
%matplotlib notebook
#%matplotlib inline
Window = 50000
AFreq = .8
display(Markdown('<b>'+SampleName+' Haplotype Caller; Number of variants above an allele frequency >'+str(AFreq)+' in '+str(Window)+'bp non-overlapping windows</b>'))

for chrname in df["chr"].unique() :
    chr1=df.loc[df["chr"]==chrname]
    chr1 = chr1.loc[chr1["Freq"]>AFreq]
    #Count variants per window, keeping the window index so that empty windows do not shift the x axis
    chr1H = chr1.groupby(chr1["pos"] // Window)["chr"].count()
    plt.figure()
    plt.stem(np.array(chr1H.index)*Window,np.array(chr1H))
    plt.xlabel("Genomic Location, Window Size = " + str(Window))
    plt.ylabel("No. of variants with AF > " + str(AFreq))
    plt.title(chrname)
plt.show() 

#Plot all together
#%matplotlib notebook
#%matplotlib inline
#Window = 50000
#AFreq = .8
#con = 0
#figure, axis = plt.subplots((len(df["chr"].unique()) + 3)//4, 4)
#for chrname in df["chr"].unique() :
#    chr1=df.loc[df["chr"]==chrname]
#    chr1 = chr1.loc[chr1["Freq"]>AFreq]
#    chr1H = np.array((chr1.groupby(chr1["pos"] // Window).count())["chr"])
#    axis[con//4, con%4].stem(np.array(range(len(chr1H)))*Window,chr1H)
#    axis[con//4, con%4].set_title(chrname)
#    con = con + 1
#plt.show() 
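
The cell above takes the allelic depths from the second position of the sample column and assumes exactly two alleles with integer depths. As a more defensive sketch (not the method used above), the `AD` field can be looked up by name in the FORMAT column, treating missing values as zero and summing the depths of all alternative alleles. The function name `allele_depths` and the choice to pool alternative alleles are our assumptions:

```python
def allele_depths(format_field, sample_field):
    """Return (ref_depth, alt_depth, alt_fraction) for one VCF sample column."""
    keys = format_field.split(':')
    values = sample_field.split(':')
    if 'AD' not in keys or keys.index('AD') >= len(values):
        return 0, 0, 0.0
    # Missing depths are written as '.', which is counted as zero here
    depths = [int(d) if d.isdigit() else 0 for d in values[keys.index('AD')].split(',')]
    ref_depth, alt_depth = depths[0], sum(depths[1:])
    total = ref_depth + alt_depth
    return ref_depth, alt_depth, (alt_depth / total if total > 0 else 0.0)

print(allele_depths('GT:AD:DP:GQ:PL', '1/1:0,25:25:75:1000,75,0'))
print(allele_depths('GT:AD:DP', './.:.:.'))
```

With pandas this could be applied row-wise, e.g. `df.apply(lambda r: allele_depths(r['format'], r[SampleName]), axis=1)`.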

And then the vcf produced with Unified Genotyper:

In [None]:
df = pd.read_csv((SampleName+'.UG.vcf'),delimiter='\t',comment='#', names=["chr","pos","id","ref","alt","qual","filter","info","format",SampleName])

strarray= df[SampleName]

RefD=[]
AltD=[]
Freq=[]
for i in range(len(strarray)):
    deps=(strarray[i].split(':'))[1].split(',')
    RefD.append(int(deps[0]))
    AltD.append(int(deps[1]))
    tmpfre=int(deps[0]) + int(deps[1])
    if tmpfre > 0 :
        Freq.append(int(deps[1])/(int(deps[0]) + int(deps[1])))
    else :
        Freq.append(0)

df['RefD']=RefD
df['AltD']=AltD
df['Freq']=Freq

##Plot as separated interactive plot
%matplotlib notebook
#%matplotlib inline
Window = 50000
AFreq = .8
display(Markdown('<b>'+SampleName+' Unified Genotyper; Number of variants above an allele frequency >'+str(AFreq)+' in '+str(Window)+'bp non-overlapping windows</b>'))

for chrname in df["chr"].unique() :
    chr1=df.loc[df["chr"]==chrname]
    chr1 = chr1.loc[chr1["Freq"]>AFreq]
    #Count variants per window, keeping the window index so that empty windows do not shift the x axis
    chr1H = chr1.groupby(chr1["pos"] // Window)["chr"].count()
    plt.figure()
    plt.stem(np.array(chr1H.index)*Window,np.array(chr1H))
    plt.xlabel("Genomic Location, Window Size = " + str(Window))
    plt.ylabel("No. of variants with AF > " + str(AFreq))
    plt.title(chrname)
plt.show() 

#Plot all together
#%matplotlib notebook
#%matplotlib inline
#Window = 50000
#AFreq = .8
#con = 0
#figure, axis = plt.subplots((len(df["chr"].unique()) + 3)//4, 4)
#for chrname in df["chr"].unique() :
#    chr1=df.loc[df["chr"]==chrname]
#    chr1 = chr1.loc[chr1["Freq"]>AFreq]
#    chr1H = np.array((chr1.groupby(chr1["pos"] // Window).count())["chr"])
#    axis[con//4, con%4].stem(np.array(range(len(chr1H)))*Window,chr1H)
#    axis[con//4, con%4].set_title(chrname)
#    con = con + 1
#plt.show() 
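
The plots above summarise variants in non-overlapping windows, where the window index of a variant is its position integer-divided by the window size. A stdlib-only sketch of that binning which also reports windows without any variant, so that every count stays attached to its true genomic coordinate (the function name `window_counts` is hypothetical):

```python
from collections import Counter

def window_counts(positions, window):
    """Return (window_start, count) pairs for consecutive non-overlapping windows."""
    counts = Counter(pos // window for pos in positions)
    if not counts:
        return []
    # Walk every window up to the last occupied one, filling gaps with zero
    return [(idx * window, counts.get(idx, 0)) for idx in range(max(counts) + 1)]

print(window_counts([10, 20, 120000], 50000))
# → [(0, 2), (50000, 0), (100000, 1)]
```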

Feel free to download the raw VCFs produced by GATK algorithms by running the cell below. However, they are more useful when annotated (next step).

In [None]:
display(Markdown('Download vcf files (right-click and \"save as\"): \n' + '* [' + SampleName + '.HC.vcf](./analysis/VariantCalling/' + SampleName +'.HC.vcf)\n' + '* [' + SampleName + '.UG.vcf](./analysis/VariantCalling/' + SampleName +'.UG.vcf)\n'))

### Variant annotation with SnpEff
The last step is to identify whether any observed mutation has an effect on a coding sequence. For that, we will use the SnpEff suite. Let's start by creating a new folder and moving into it:

In [None]:
##Move to analysis
os.chdir(path + '/analysis/')
##Create directory to place data
!mkdir -p VariantPrediction
##Move to that directory
os.chdir(path + '/analysis/VariantPrediction')

Then, let's run SnpEff on both vcfs produced by GATK algorithms.

In [None]:
###Produce Annotations
!java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/snpEff.jar {SnpEffGen} -stats {SampleName}.HC.html {path}/analysis/VariantCalling/{SampleName}.HC.vcf > {SampleName}.HC.ann.vcf
!java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/snpEff.jar {SnpEffGen} -stats {SampleName}.UG.html {path}/analysis/VariantCalling/{SampleName}.UG.vcf > {SampleName}.UG.ann.vcf 

if os.path.isfile(SampleName+'.HC.ann.vcf'):
    if os.path.isfile(SampleName+'.UG.ann.vcf'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Annotated UG vcf not present in current path: '+os.getcwd()+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Annotated HC vcf not present in current path: '+os.getcwd()+'</div>'))

In [None]:
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - -s "," -e "." CHROM POS REF ALT QUAL GEN[{SampleName}].GT GEN[{SampleName}].DP GEN[*].AD "ANN[*].GENEID" "ANN[*].ALLELE" "ANN[*].EFFECT" "ANN[*].IMPACT" "ANN[*].GENE" "ANN[*].FEATURE" "ANN[*].FEATUREID" "ANN[*].BIOTYPE" "ANN[*].RANK" "ANN[*].HGVS_C" "ANN[*].HGVS_P" "ANN[*].CDNA_POS" "ANN[*].CDNA_LEN" "ANN[*].CDS_POS" "ANN[*].CDS_LEN" "ANN[*].AA_POS" "ANN[*].AA_LEN" "ANN[*].DISTANCE" "LOF[*].GENE" "LOF[*].GENEID" "NMD[*].GENE" "NMD[*].GENEID" > {SampleName}.HC.ann.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - -s "," -e "." CHROM POS REF ALT QUAL GEN[{SampleName}].GT GEN[{SampleName}].DP GEN[*].AD "ANN[*].GENEID" "ANN[*].ALLELE" "ANN[*].EFFECT" "ANN[*].IMPACT" "ANN[*].GENE" "ANN[*].FEATURE" "ANN[*].FEATUREID" "ANN[*].BIOTYPE" "ANN[*].RANK" "ANN[*].HGVS_C" "ANN[*].HGVS_P" "ANN[*].CDNA_POS" "ANN[*].CDNA_LEN" "ANN[*].CDS_POS" "ANN[*].CDS_LEN" "ANN[*].AA_POS" "ANN[*].AA_LEN" "ANN[*].DISTANCE" "LOF[*].GENE" "LOF[*].GENEID" "NMD[*].GENE" "NMD[*].GENEID" > {SampleName}.UG.ann.txt

if os.path.isfile(SampleName+'.HC.ann.txt'):
    if os.path.isfile(SampleName+'.UG.ann.txt'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Annotated UG vcf not present in current path: '+os.getcwd()+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Annotated HC vcf not present in current path: '+os.getcwd()+'</div>'))

Finally, let's sort the annotations by their impact (see the HTML reports for more information):

In [None]:
## Filter mutations by importance
!echo "#High impact mutations" > {SampleName}.HC.ann.sort.txt
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'HIGH'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.HC.ann.sort.txt
!echo "#Moderate impact mutations" >> {SampleName}.HC.ann.sort.txt
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'MODERATE'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.HC.ann.sort.txt
!echo "#Low impact mutations" >> {SampleName}.HC.ann.sort.txt
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'LOW'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.HC.ann.sort.txt

!echo "#High impact mutations" > {SampleName}.UG.ann.sort.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'HIGH'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.UG.ann.sort.txt
!echo "#Moderate impact mutations" >> {SampleName}.UG.ann.sort.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'MODERATE'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.UG.ann.sort.txt
!echo "#Low impact mutations" >> {SampleName}.UG.ann.sort.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'LOW'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.UG.ann.sort.txt

!echo "Unified Genotyper"
!head {SampleName}.UG.ann.sort.txt
!echo "Haplotype Caller"
!head {SampleName}.HC.ann.sort.txt

## Produce links to HTML reports
display(Markdown('Full reports (open a new tab to see them): \n' + '* [' + SampleName + ' Unified Genotyper results](./analysis/VariantPrediction/' + SampleName +'.UG.html)\n' + '* [' + SampleName + ' Haplotype Caller results](./analysis/VariantPrediction/' + SampleName +'.HC.html)\n'))
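
SnpEff writes its predictions into the `ANN` entry of the INFO column: annotations are separated by commas and the sub-fields of each annotation by `|`, with the allele, the effect, the impact, and the gene name in the first four positions. A minimal sketch that tallies the impact categories in one INFO string; the function name `impact_counts` is ours, and the example record is made up and truncated to the first few sub-fields:

```python
from collections import Counter

def impact_counts(info):
    """Count SnpEff impact categories found in the ANN entry of an INFO string."""
    counts = Counter()
    for entry in info.split(';'):
        if entry.startswith('ANN='):
            for annotation in entry[len('ANN='):].split(','):
                fields = annotation.split('|')
                if len(fields) > 2:
                    counts[fields[2]] += 1   # third sub-field is the impact
    return counts

# Made-up INFO string for illustration only
info = ('AC=2;AF=1.00;'
        'ANN=T|missense_variant|MODERATE|geneA|idA|transcript,'
        'T|upstream_gene_variant|MODIFIER|geneB|idB|transcript')
print(impact_counts(info))
```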

### Results and final notes
With all the analysis performed, let's just make a final directory with links to the most important outputs per step: 

In [None]:
#Analysis completed, going back to main directory
os.chdir(path)
##Create directory to place copy of results and check if files are there
!mkdir -p {path}/results
##Move to that directory
os.chdir( path + '/results' )

Make links to the main results and compress the folder:

In [None]:
##Make symbolic links
!ln -sf {path}/analysis/FastQC/*.zip .
!ln -sf {path}/analysis/Alignments/*.realigned.bam .
!ln -sf {path}/analysis/VariantCalling/*.vcf .
!ln -sf {path}/analysis/VariantPrediction/*.txt .

## Compress results
os.chdir( path )
!zip -r {SampleName}_results.zip results

## Make link to download
## Produce links to HTML reports
display(Markdown('Download compressed folder: \n' + '* [' + SampleName + '](./' + SampleName +'_results.zip)\n' ))
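
Creating a symbolic link fails when the link name already exists unless this is handled explicitly, which matters if the results cell is run more than once. A sketch of a re-runnable alternative using the standard library (the function name `link_into` is hypothetical):

```python
import glob
import os

def link_into(pattern, dest_dir):
    """Symlink every file matching `pattern` into `dest_dir`, replacing old links."""
    os.makedirs(dest_dir, exist_ok=True)
    linked = []
    for source in sorted(glob.glob(pattern)):
        target = os.path.join(dest_dir, os.path.basename(source))
        if os.path.islink(target):
            os.remove(target)      # replace a link left by a previous run
        elif os.path.exists(target):
            continue               # never overwrite a real file
        os.symlink(os.path.abspath(source), target)
        linked.append(target)
    return linked
```

In this notebook it could be used as `link_into(path + '/analysis/VariantCalling/*.vcf', path + '/results')`.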

### Further analysis
Congratulations on completing this first tutorial. Now that we have the genomic variants seen in the Hawaiian strain, we're ready to perform mapping by sequencing. Save the location of your preferred VCF and go to the next notebook of the Hawaiian tutorial, `WGS-Mutant_strain-Hawaiian_tutorial`.