# Reference indexing, mapping, coverage, and variant calling
* [Amhed Missael Vargas Velazquez](https://www.researchgate.net/profile/Amhed-Vargas-Velazquez)
* Post-doctoral fellow, [SGB lab](https://syngenbio.kaust.edu.sa/), [KAUST](https://www.kaust.edu.sa/en)

## Description
This Jupyter notebook contains the commands needed to identify genomic variants in any polymorphic *C. elegans* strain. Descriptions are kept short for readability.

## Getting started
Run the cells below to create a working directory, load the necessary Python libraries, and verify that you have the required software.

### Load python libraries
Run the cell below to load essential libraries for the pipeline to work:

In [None]:
## Load libraries
#os to move within directories
import os
#IPython.display for markdown
from IPython.display import display, Markdown
#matplotlib, pandas, and numpy for plotting and data analysis respectively
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Produce a folder that can be accessed by the pipeline
Before starting any analysis, make sure to select a folder where the analysis will be performed (*unless* stated otherwise, the analysis will be performed on the **same folder as this notebook**):

In [None]:
##Set working directory
#Same location as script
path = os.getcwd()
#or somewhere else, e.g.:
#path = '/home/jupyter-user/Workstation/user/parental'

##Move to path
os.chdir(path)

##Show current directory to user
display(Markdown('<div class=\"alert alert-block alert-info\">Directory for analysis:<br><b>' + os.getcwd() + '</b></div>'))

### Make sure to have the "stand alone" software
Most of the programs used in this pipeline have "stand-alone" versions that can be run on any computer, but you first need to have them available. In particular, make sure you have a directory containing the following:

- GATK (GenomeAnalysisTK.jar)
- Picard (picard.jar)
- SnpEff (folder with both snpEff.jar and SnpSift.jar, and another folder with its database; more below)

For your convenience, there is a folder that already contains these programs. Just make sure to set the paths to them properly, e.g.:

In [None]:
##Path to software folder
softpath = '/home/WGS_pipeline/Software'

##GATK v3.8.1.0
GATKpath = (softpath + '/gatk-3.8.1.0')

##Picard v2.23.6 
PiKpath = (softpath + '/picard2.23.6')

##SnpEff v5.0
Snpath = (softpath + '/SnpEff-5.0/snpEff')

##Alternative paths
#GATKpath = ''
#PiKpath = ''
#Snpath = ''

##Check if jar files are there
##Notify user if GATK .jar is present or not
if os.path.isfile(GATKpath + '/GenomeAnalysisTK.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b GATK</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>GATK not found in: ' + GATKpath +'</div>'))

##Notify user if Picard .jar is present or not
if os.path.isfile(PiKpath + '/picard.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Picard</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Picard not found in: ' + PiKpath +'</div>'))

##Notify user if SnpEff .jar is present or not
if os.path.isfile(Snpath + '/snpEff.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b SnpEff</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>SnpEff not found in: ' + Snpath +'</div>'))

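The check above only covers the `.jar` files. The later cells also call several command-line programs directly (`samtools`, `bwa`, `fastqc`, `java`, `perl`, `unzip`). A minimal sketch of an additional check, assuming these programs are expected on the `PATH`; the helper name `find_missing_tools` is illustrative and not part of the pipeline:

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Command-line programs called directly in later cells of this notebook
required = ["java", "samtools", "bwa", "fastqc", "perl", "unzip"]
missing = find_missing_tools(required)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All command-line tools found")
```

If a program is reported as missing, install it or add its folder to the `PATH` before running the cells that use it.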
### Make sure to have a reference genome
To run this pipeline, a *C. elegans* reference genome is needed (preferably in FASTA format). The cells below allow you to prepare your own reference file or to download the *C. elegans* ce11/WS235 (WBcel235) version from Ensembl.

First, download or specify the location of your reference genome:

In [None]:
##Specify the location of the reference genome
#e.g.
RefFile=('/home/jupyter-amhed/Workstation/Amhed/Caenorhabditis_elegans.fa')

###OR

##To download ce11 uncomment (remove the # sign) from the lines below

#!mkdir -p {path}/data

#os.chdir(path+'/data')

#!wget -q ftp://ftp.ensembl.org/pub/release-99/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz

#!zcat Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz > Caenorhabditis_elegans.WBcel235.99.softmasked.fa

#RefFile=(path+'/data/'+'Caenorhabditis_elegans.WBcel235.99.softmasked.fa')

##Notify user if file present or not
if os.path.isfile(RefFile):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>Reference Genome :</b>\n'+RefFile+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Reference file: '+RefFile+' does not exist</div>'))

Then it needs to be indexed by samtools:

In [None]:
##Check first if file is indexed already
if os.path.isfile(RefFile+'.fai'):
    display(Markdown('<b>\b</b>'))
else:
    !samtools faidx {RefFile}
    ##Notify user if this step was successful
    if os.path.isfile(RefFile+'.fai'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not produced for '+RefFile+'</div>'))

And by bwa:

In [None]:
##Check first if file is indexed already
if os.path.isfile(RefFile+'.bwt'):
    display(Markdown('<b>\b</b>'))
else:
    !bwa index {RefFile} > /dev/null 2>&1
    ##Notify user if this step was successful
    if os.path.isfile(RefFile+'.bwt'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not produced for '+RefFile+'</div>'))

And by Picard tools:

In [None]:
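##Drop only the last extension; joining the split pieces with '' would also delete every other dot in the path
RefFilebase = os.path.splitext(RefFile)[0]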

##Check first if file is indexed already
if os.path.isfile(RefFilebase +'.dict'):
    display(Markdown('<b>\b</b>'))
else:
    !java -jar {softpath}/picard2.23.6/picard.jar CreateSequenceDictionary R={RefFile} O={RefFilebase +'.dict'} > /dev/null 2>&1
    ##Notify user if this step was successful
    if os.path.isfile(RefFilebase +'.dict'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Sequence dictionary not produced for '+RefFile+'</div>'))

Now your reference genome is ready for the analysis in this pipeline (**as well as many others**).

In [None]:
ReferenceGenome = RefFile

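The three indexing steps each leave a companion file next to the reference. A small sketch that lists the files the cells above test for, assuming the usual convention that the sequence dictionary replaces only the last extension of the FASTA name with `.dict`; the helper names are illustrative:

```python
import os


def expected_index_files(ref_fasta):
    """Map each indexing step to the file the notebook checks for."""
    base = os.path.splitext(ref_fasta)[0]  # strip only the final extension
    return {
        "samtools faidx": ref_fasta + ".fai",
        "bwa index": ref_fasta + ".bwt",
        "picard CreateSequenceDictionary": base + ".dict",
    }


def missing_index_steps(ref_fasta):
    """Return the steps whose output file is not on disk."""
    files = expected_index_files(ref_fasta)
    return [step for step, name in files.items() if not os.path.isfile(name)]


# Hypothetical path, used only to show the resulting names
for step, name in expected_index_files("/data/ce.WBcel235.fa").items():
    print(step, "->", name)
```

Calling `missing_index_steps(RefFile)` before the mapping step gives a quick summary of which index still needs to be built.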
## Data analysis workflow
The pipeline consists of:
* Quality assessment of sequencing reads via FastQC
* Mapping of reads with bwa
* Filtering and processing of the alignment file with samtools
* Coverage analysis with samtools
* Realignment around indels with GATK
* Variant calling with GATK HaplotypeCaller (HC) and UnifiedGenotyper (UG)
* Mutational analysis with SnpEff

Let's now define the parameters for the pipeline:

In [None]:
### Input parameters
##Name of your sample
SampleName = 'CFJ125'
##Name of the directory that will contain the results of the analysis
DirectoryName = 'CFJ125_analysis'
##First Fastq file (.fq, .fastq, .fastq.gz extensions are allowed)
FastqF1 = ('/home/jupyter-newuser/Workstation/CFJ125/CFJ125_R1.fastq') 
##Second Fastq file (.fq, .fastq, .fastq.gz extensions are allowed)
FastqF2 = ('/home/jupyter-newuser/Workstation/CFJ125/CFJ125_R2.fastq')

### Default parameters : Modify to tailor your analysis
## Minimum mapping quality for bam filtering
minMaqQforBam = 1
## Minimum mapping quality for variant calling
minMaqQforVcf = 10
## Minimum base quality for variant calling
minBaseQforVcf = 10
## Minimum phred-scaled confidence for variant calling (passed to GATK as -stand_call_conf)
minVarCall = 10
## Number of threads (make sure to have enough available resources)
Ncpu = 4
## RAM (in GB) allocated to the Java virtual machine (make sure to have enough available resources)
ramG = 20
## SnpEff database
SnpEffGen = 'WBcel235.99'

### Process inputs, and go to directory to start analysis
#Make sure to start from path
os.chdir(path)
##Create directory to place data
!mkdir -p {path}/{DirectoryName}
##Move to that directory
os.chdir( path + '/' + DirectoryName )

### Verification process
##Reference genome
if os.path.isfile(ReferenceGenome):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>Reference Genome :</b>\n'+ReferenceGenome+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Reference file: '+ReferenceGenome+' does not exist</div>'))

## First fastq file
if os.path.isfile(FastqF1):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>FastqF1 :</b>\n'+FastqF1+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>First fastq file: '+FastqF1+' not found</div>'))

## Second fastq file
if os.path.isfile(FastqF2):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>FastqF2 :</b>\n'+FastqF2+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Second fastq file: '+FastqF2+' not found</div>'))

## Show current directory (where everything will happen) to user
display(Markdown('<div class=\"alert alert-block alert-info\">Running analysis in:<br><b>' + os.getcwd() + '</b></div>'))

### Quality assessment of Fastq files
A simple way to verify the quality of your sequencing runs is via the FastQC program. The following cells will read the fastq files, run the program `fastqc` and output its results:

In [None]:
##Perform FastQC analysis
!fastqc -t {Ncpu} {FastqF1} {FastqF2} -o . > /dev/null 2>&1

##Base names expected for the FastQC outputs: file name up to the first dot
##(assumes the fastq file names contain no other dots)
tempname1 = os.path.basename(FastqF1).split(".")[0]
tempname2 = os.path.basename(FastqF2).split(".")[0]

##Notify user if this step was successful
if os.path.isfile(tempname1 + '_fastqc.html') and os.path.isfile(tempname2 + '_fastqc.html'):
    !unzip -o -qq \*.zip
    display(Markdown('<b>Metrics:</b>'))
    print(tempname1)
    !cat {tempname1}_fastqc/summary.txt
    print(tempname2)
    !cat {tempname2}_fastqc/summary.txt
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>FASTQC report not produced in current directory: '+os.getcwd()+'</div>'))

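Each `summary.txt` printed above is a tab-separated table with one line per FastQC module: status (`PASS`, `WARN`, or `FAIL`), module name, and file name. As a rough sketch, the statuses can be tallied so that only the flagged modules need a closer look; the function name and the example text are made up for illustration:

```python
from collections import Counter


def parse_fastqc_summary(text):
    """Return (status counts, list of (status, module) that did not pass)."""
    counts = Counter()
    flagged = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        status, module = fields[0], fields[1]
        counts[status] += 1
        if status != "PASS":
            flagged.append((status, module))
    return counts, flagged


# Illustrative content, not taken from a real run
example = (
    "PASS\tBasic Statistics\texample.fastq\n"
    "WARN\tPer base sequence content\texample.fastq\n"
    "FAIL\tSequence Duplication Levels\texample.fastq\n"
)
counts, flagged = parse_fastqc_summary(example)
print(dict(counts))
print(flagged)
```

In the notebook, the text could be read with `open(tempname1 + '_fastqc/summary.txt').read()` once the zip files have been extracted.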
#### HTML reports
Run the cell below to link the `fastqc` results to this notebook.

In [None]:
display(Markdown('Full reports (open a new tab to see them): \n' + '* [' + tempname1 + '](./' + DirectoryName + '/' + tempname1 +'_fastqc.html)\n' + '* [' + tempname2 + '](./' + DirectoryName + '/' + tempname2 +'_fastqc.html)\n'))

### Mapping reads to reference genome using bwa

After assessing the quality of the Fastq reads, we will map them to the reference genome using `bwa`. The cells below run bwa with the fastq files as input and produce an alignment file with extension `.sam`.

In [None]:
##Move to main directory 
os.chdir( path + '/' + DirectoryName )

In [None]:
!bwa mem -t {Ncpu} -M {ReferenceGenome} {FastqF1} {FastqF2} -o {SampleName}.sam > /dev/null 2>&1

##Notify user if previous step was successful
if os.path.isfile(SampleName+'.sam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Sam file not present in current path: '+os.getcwd()+'</div>'))

Next, we convert the `.sam` file into its binary format (`.bam`) and sort it at the same time.

In [None]:
!samtools view -@ {Ncpu} -S -bh {SampleName}.sam | samtools sort -@ {Ncpu} - > {SampleName}.bam

if os.path.isfile(SampleName+'.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Bam file not present in current path: '+os.getcwd()+'</div>'))

To ease the processing of a sorted `.bam` file, it's better to index it (i.e., produce a file with extension `.bai`).

In [None]:
!samtools index {SampleName}.bam

if os.path.isfile(SampleName+'.bam.bai'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not present in current path: '+os.getcwd()+'</div>'))

Now that we have sorted and indexed our raw alignment, let's filter by mapping quality (file with extension `.SF.bam`, where the S stands for sort and F for filter).

In [None]:
!samtools view -@ {Ncpu} -q {minMaqQforBam} -bh {SampleName}.bam > {SampleName}.SF.bam

if os.path.isfile(SampleName+'.SF.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Filtered file not present in current path: '+os.getcwd()+'</div>'))

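For reference, mapping quality is phred-scaled: per the SAM specification, MAPQ is -10 log10 of the probability that the mapping position is wrong. The sketch below (the function name is illustrative, not part of the pipeline) converts thresholds such as `minMaqQforBam` into that probability. With the default of 1, the filter should mainly remove reads with MAPQ 0, which bwa typically assigns to reads that align equally well to several locations.

```python
def mapq_to_error_probability(mapq):
    """Probability that a read is mapped to the wrong position, given its MAPQ."""
    return 10 ** (-mapq / 10)


# Thresholds chosen for illustration
for q in (0, 1, 10, 20, 30):
    print("MAPQ", q, "-> P(wrong position) =", round(mapq_to_error_probability(q), 4))
```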
Finally, to reduce the bias introduced by PCR amplification during library preparation, let's remove duplicated reads from the alignment, i.e., reads (or read pairs) that map to exactly the same position and orientation. Downstream tools such as GATK expect read-group information, so we first add a read group with `picard` tools and then mark and remove the duplicates. **Please note:** Removing duplicates lowers the effective sequencing depth across the genome; the effect should be roughly uniform, except in highly repetitive or low-complexity regions.

In [None]:
!java -Xmx{ramG}g -jar {softpath}/picard2.23.6/picard.jar AddOrReplaceReadGroups I={SampleName}.SF.bam O={SampleName}.RG.bam RGID={SampleName} RGLB=LB RGPL=illumina RGPU=PU RGSM={SampleName} > /dev/null 2>&1

if os.path.isfile(SampleName+'.RG.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Read groups file not present in current path: '+os.getcwd()+'</div>'))

After adding read groups, we mark the duplicated reads and filter them out, producing a new `.bam` file with extension `.Dup.bam`:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/picard2.23.6/picard.jar MarkDuplicates I={SampleName}.RG.bam O={SampleName}.Dup.bam M={SampleName}.dedupMetrics REMOVE_DUPLICATES=true > /dev/null 2>&1

if os.path.isfile(SampleName+'.Dup.bam'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Alignment file without duplicates not present in current path: '+os.getcwd()+'</div>'))

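The `M=` option above writes a metrics file (`{SampleName}.dedupMetrics`) that reports how many reads were flagged as duplicates. A hedged sketch of reading it, assuming the usual Picard layout: comment lines starting with `#`, a `## METRICS CLASS` line, a tab-separated header, then one row per library, ending at a blank line. The function name and the abbreviated example are illustrative, and a real file has more columns:

```python
def parse_picard_metrics(text):
    """Return the metrics rows of a Picard metrics file as a list of dicts."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            if i + 1 >= len(lines):
                return []
            header = lines[i + 1].split("\t")
            rows = []
            for row_line in lines[i + 2:]:
                if not row_line.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, row_line.split("\t"))))
            return rows
    return []


# Abbreviated, made-up example
example = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# MarkDuplicates (command line omitted)\n"
    "\n"
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n"
    "LB\t1000\t120\t0.12\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Double\n"
)
for row in parse_picard_metrics(example):
    print(row["LIBRARY"], row["PERCENT_DUPLICATION"])
```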
Finally, let's index it to ease its further processing:

In [None]:
!samtools index {SampleName}.Dup.bam

if os.path.isfile(SampleName+'.Dup.bam.bai'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not present in current path: '+os.getcwd()+'</div>'))

Feel free to download the files produced before the realignment step by following the links:

In [None]:
display(Markdown('Download bam files (right-click and \"save as\"): \n' + '* [' + SampleName + '.Dup.bam](./' + DirectoryName + '/' + SampleName +'.Dup.bam)\n' + '* [' + SampleName + '.Dup.bam.bai](./' + DirectoryName + '/' + SampleName +'.Dup.bam.bai)\n'))

### Coverage analysis with samtools
Before realignment and variant calling, let's have a quick look at the coverage of our alignment without duplicated reads (if you want the coverage of a specific region or contig, modify the code as indicated in the cell):

In [None]:
##For all the genome
!samtools coverage {SampleName}.Dup.bam

##For specific regions, e.g.:
#!samtools coverage -r MtDNA {SampleName}.Dup.bam
#!samtools coverage -r I:12995803-13012859 {SampleName}.Dup.bam

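`samtools coverage` prints one tab-separated row per contig with the columns `#rname`, `startpos`, `endpos`, `numreads`, `covbases`, `coverage`, `meandepth`, `meanbaseq`, and `meanmapq`. A minimal sketch that combines the per-contig rows into a single length-weighted mean depth, assuming the table has been saved to a text file (for example by redirecting the command output); the function name and the example numbers are made up:

```python
import csv
import io


def genome_mean_depth(coverage_text):
    """Length-weighted mean depth over all contigs in a samtools coverage table."""
    reader = csv.DictReader(io.StringIO(coverage_text), delimiter="\t")
    total_length = 0
    weighted_depth = 0.0
    for row in reader:
        length = int(row["endpos"]) - int(row["startpos"]) + 1
        total_length += length
        weighted_depth += float(row["meandepth"]) * length
    return weighted_depth / total_length if total_length else 0.0


# Made-up table with two contigs
example = (
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "I\t1\t1000\t300\t990\t99.0\t30.0\t36.1\t58.2\n"
    "MtDNA\t1\t100\t250\t100\t100.0\t250.0\t36.4\t59.0\n"
)
print(genome_mean_depth(example))
```

A plain average of the `meandepth` column would overweight short contigs such as the mitochondrial genome, which is why the sketch weights by contig length.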
### Local realignment of reads around indels via GATK
Let's now realign the reads to improve calling around indels. We start by identifying the location of possible indels with GATK:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt {Ncpu} -R {ReferenceGenome} -I {SampleName}.Dup.bam -o {SampleName}.Indel.intervals > /dev/null 2>&1

if os.path.isfile(SampleName+'.Indel.intervals'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>File with indel intervals not present in current path: '+os.getcwd()+'</div>'))

And then we perform the realignment around the identified regions:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T IndelRealigner -R {ReferenceGenome} -I {SampleName}.Dup.bam -targetIntervals {SampleName}.Indel.intervals -o {SampleName}.realigned.bam > /dev/null 2>&1

if os.path.isfile(SampleName+'.realigned.bam'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Realigned bam not present in current path: '+os.getcwd()+'</div>'))

Check that the realigned file is present:

In [None]:
if os.path.isfile(SampleName+'.realigned.bam'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Mapping step complete</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Realigned bam not present in current path: '+os.getcwd()+'<br>Either the path is wrong or cells above have to be run again</div>'))

Now it is time for variant calling, though feel free to check the new alignment file (extension `.realigned.bam`) using [tablet](https://ics.hutton.ac.uk/tablet/) or [igv](https://software.broadinstitute.org/software/igv/). Running the cell below produces links that you can use to save the files:

In [None]:
##GATK writes the index of its output bam as <name>.bai (not <name>.bam.bai)
display(Markdown('Download bam files (right-click and \"save as\"): \n' + '* [' + SampleName + '.realigned.bam](./' + DirectoryName + '/' + SampleName +'.realigned.bam)\n' + '* [' + SampleName + '.realigned.bai](./' + DirectoryName + '/' + SampleName +'.realigned.bai)\n'))

### Variant calling
Now that our reads are properly aligned, let's do variant calling. To start, let's make sure we are in the directory where the analysis has been executed:

In [None]:
##Move to main directory 
os.chdir( path + '/' + DirectoryName )

Several programs can call variants; this pipeline uses two well-known GATK variant callers, UnifiedGenotyper (UG) and HaplotypeCaller (HC). Both should produce good calls, though some pipelines prefer HC given its confidence metrics.

#### Unified genotyper
The code below produces a `.vcf` (Variant Call Format) file via GATK's UnifiedGenotyper tool:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R {ReferenceGenome} -nt {Ncpu} -l INFO -glm BOTH -I {path}/{DirectoryName}/{SampleName}.realigned.bam -o {SampleName}.UG.vcf -mbq {minBaseQforVcf} -stand_call_conf {minVarCall} > /dev/null 2>&1

if os.path.isfile(SampleName+'.UG.vcf'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>UG vcf not present in current path: '+os.getcwd()+'</div>'))

#### Haplotype caller
Before genotyping with GATK HaplotypeCaller, we first need to produce confidence values for each site and store them in a `g.vcf` file:

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF -R {ReferenceGenome} -l INFO -I {path}/{DirectoryName}/{SampleName}.realigned.bam -o {SampleName}.g.vcf -mbq {minBaseQforVcf} -mmq {minMaqQforVcf} -stand_call_conf {minVarCall} > /dev/null 2>&1

if os.path.isfile(SampleName+'.g.vcf'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>g.vcf file not present in current path: '+os.getcwd()+'</div>'))

After that, we can proceed to make a new `.vcf` from the `g.vcf` file

In [None]:
!java -Xmx{ramG}g -jar {softpath}/gatk-3.8.1.0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R {ReferenceGenome} -nt {Ncpu} -l INFO -V {SampleName}.g.vcf -dt none -o {SampleName}.HC.vcf  > /dev/null 2>&1
   
## Check if HC vcf is done
if os.path.isfile(SampleName+'.HC.vcf'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Haplotype Caller complete</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>HC vcf not present in current path: '+os.getcwd()+'<br>Either the path is wrong or cells above have to be run again</div>'))

## Check if UG vcf is done
if os.path.isfile(SampleName+'.UG.vcf'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Unified Genotyper complete</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>UG vcf not present in current path: '+os.getcwd()+'<br>Either the path is wrong or cells above have to be run again</div>'))


Now, for each vcf, let's plot the distribution of the called variants with Python. First, the vcf produced with HaplotypeCaller:

In [None]:
## Input parameters
#Size of non-overlapping window
Window = 50000
#Minimum allele frequency to plot
AFreq = .8

df = pd.read_csv((SampleName+'.HC.vcf'),delimiter='\t',comment='#', names=["chr","pos","id","ref","alt","qual","filter","info","format",SampleName])

strarray= df[SampleName]

RefD=[]
AltD=[]
Freq=[]
for i in range(len(strarray)):
    deps=(strarray[i].split(':'))[1].split(',')
    RefD.append(int(deps[0]))
    AltD.append(int(deps[1]))
    tmpfre=int(deps[0]) + int(deps[1])
    if tmpfre > 0 :
        Freq.append(int(deps[1])/(int(deps[0]) + int(deps[1])))
    else :
        Freq.append(0)

df['RefD']=RefD
df['AltD']=AltD
df['Freq']=Freq

##Static plots rendered inside the notebook
%matplotlib inline
##Or interactive plots
#%matplotlib notebook

display(Markdown('<b>'+SampleName+' Haplotype Caller; Number of variants above an allele frequency >'+str(AFreq)+' in '+str(Window)+'bp non-overlapping windows</b>'))

for chrname in df["chr"].unique() :
    chr1 = df.loc[(df["chr"]==chrname) & (df["Freq"]>AFreq)]
    ##Skip chromosomes without variants above the threshold
    if chr1.empty:
        continue
    ##Count variants per window; the index keeps the real window number,
    ##so windows without variants do not shift the x positions
    chr1H = chr1.groupby(chr1["pos"] // Window).size()
    plt.figure()
    plt.stem(np.array(chr1H.index) * Window, chr1H.values)
    plt.xlabel("Genomic Location, Window Size = " + str(Window))
    plt.ylabel("No. of variants with AF > " + str(AFreq))
    plt.title(chrname)
plt.show()

#Plot all together
#%matplotlib notebook
#%matplotlib inline
#Window = 50000
#AFreq = .8
#con = 0
#figure, axis = plt.subplots((len(df["chr"].unique()) + 3)//4, 4)
#for chrname in df["chr"].unique() :
#    chr1=df.loc[df["chr"]==chrname]
#    chr1 = chr1.loc[chr1["Freq"]>AFreq]
#    chr1H = np.array((chr1.groupby(chr1["pos"] // Window).count())["chr"])
#    axis[con//4, con%4].stem(np.array(range(len(chr1H)))*Window,chr1H)
#    axis[con//4, con%4].set_title(chrname)
#    con = con + 1
#plt.show()

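The loop above takes the second colon-separated entry of the sample column as the allelic depths (`AD`). That holds when the FORMAT column starts with `GT:AD`, but the order is not guaranteed for every record. A more defensive sketch looks `AD` up by name in the FORMAT column and returns `None` when depths are missing. Note that it differs from the cell above: it sums all alternate alleles at multi-allelic sites and uses `None` instead of 0 for uncovered sites. The function name is illustrative:

```python
def alt_allele_fraction(format_field, sample_field):
    """Fraction of reads supporting any alternate allele, or None if unavailable."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):
        return None  # trailing fields may be dropped in a VCF record
    try:
        depths = [int(d) for d in values[idx].split(",")]
    except ValueError:
        return None  # e.g. '.' for a missing value
    total = sum(depths)
    if len(depths) < 2 or total == 0:
        return None
    return sum(depths[1:]) / total


# Made-up genotype columns
print(alt_allele_fraction("GT:AD:DP:GQ:PL", "1/1:0,25:25:75:900,75,0"))
print(alt_allele_fraction("GT:AD:DP", "0/1:10,10:20"))
print(alt_allele_fraction("GT:AD:DP", "./.:.:."))
```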
Then repeat the same plot for the UnifiedGenotyper calls:

In [None]:
## Input parameters
#Size of non-overlapping window
Window = 50000
#Minimum allele frequency to plot
AFreq = .8

df = pd.read_csv((SampleName+'.UG.vcf'),delimiter='\t',comment='#', names=["chr","pos","id","ref","alt","qual","filter","info","format",SampleName])

strarray= df[SampleName]

RefD=[]
AltD=[]
Freq=[]
for i in range(len(strarray)):
    deps=(strarray[i].split(':'))[1].split(',')
    RefD.append(int(deps[0]))
    AltD.append(int(deps[1]))
    tmpfre=int(deps[0]) + int(deps[1])
    if tmpfre > 0 :
        Freq.append(int(deps[1])/(int(deps[0]) + int(deps[1])))
    else :
        Freq.append(0)

df['RefD']=RefD
df['AltD']=AltD
df['Freq']=Freq

##Static plots rendered inside the notebook
%matplotlib inline
##Or interactive plots
#%matplotlib notebook

display(Markdown('<b>'+SampleName+' Unified Genotyper; Number of variants above an allele frequency >'+str(AFreq)+' in '+str(Window)+'bp non-overlapping windows</b>'))

for chrname in df["chr"].unique() :
    chr1 = df.loc[(df["chr"]==chrname) & (df["Freq"]>AFreq)]
    ##Skip chromosomes without variants above the threshold
    if chr1.empty:
        continue
    ##Count variants per window; the index keeps the real window number,
    ##so windows without variants do not shift the x positions
    chr1H = chr1.groupby(chr1["pos"] // Window).size()
    plt.figure()
    plt.stem(np.array(chr1H.index) * Window, chr1H.values)
    plt.xlabel("Genomic Location, Window Size = " + str(Window))
    plt.ylabel("No. of variants with AF > " + str(AFreq))
    plt.title(chrname)
plt.show()

#Plot all together
#%matplotlib notebook
#%matplotlib inline
#Window = 50000
#AFreq = .8
#con = 0
#figure, axis = plt.subplots((len(df["chr"].unique()) + 3)//4, 4)
#for chrname in df["chr"].unique() :
#    chr1=df.loc[df["chr"]==chrname]
#    chr1 = chr1.loc[chr1["Freq"]>AFreq]
#    chr1H = np.array((chr1.groupby(chr1["pos"] // Window).count())["chr"])
#    axis[con//4, con%4].stem(np.array(range(len(chr1H)))*Window,chr1H)
#    axis[con//4, con%4].set_title(chrname)
#    con = con + 1
#plt.show()

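Grouping positions by `pos // Window` only returns the windows that contain at least one variant. When a table of counts with explicit zeros is more convenient (for example, to compare the two callers window by window), the binning can be sketched in plain Python as below. The function name and the optional `chrom_length` argument are illustrative assumptions:

```python
from collections import Counter


def count_per_window(positions, window, chrom_length=None):
    """Return [(window_start, count), ...] including windows with zero variants."""
    counts = Counter(pos // window for pos in positions)
    if not counts and chrom_length is None:
        return []
    last = max(counts) if counts else 0
    if chrom_length is not None:
        # extend to the end of the chromosome when its length is known
        last = max(last, chrom_length // window)
    return [(w * window, counts.get(w, 0)) for w in range(last + 1)]


# Made-up positions on one chromosome
print(count_per_window([10, 20, 120000], 50000))
```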
Finally, let's proceed to annotate each variant

### Variant annotation with SnpEff
The last step is to identify whether any observed mutation affects a coding sequence. For that, we use the SnpEff suite. Let's start by moving to the analysis directory:

In [None]:
##Move to main directory 
os.chdir( path + '/' + DirectoryName )

Then, let's run SnpEff on both vcfs produced by GATK algorithms.

In [None]:
###Produce Annotations
!java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/snpEff.jar {SnpEffGen} -stats {SampleName}.HC.html {path}/{DirectoryName}/{SampleName}.HC.vcf > {SampleName}.HC.ann.vcf
!java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/snpEff.jar {SnpEffGen} -stats {SampleName}.UG.html {path}/{DirectoryName}/{SampleName}.UG.vcf > {SampleName}.UG.ann.vcf 

if os.path.isfile(SampleName+'.HC.ann.vcf'):
    if os.path.isfile(SampleName+'.UG.ann.vcf'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Annotated UG vcf not present in current path: '+os.getcwd()+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Annotated HC vcf not present in current path: '+os.getcwd()+'</div>'))

Extract the relevant annotation fields into tables (extension `.ann.txt`):

In [None]:
##Field names containing [ ] or * are quoted so that the shell does not treat them as file patterns
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - -s "," -e "." CHROM POS REF ALT QUAL "GEN[{SampleName}].GT" "GEN[{SampleName}].DP" "GEN[*].AD" "ANN[*].GENEID" "ANN[*].ALLELE" "ANN[*].EFFECT" "ANN[*].IMPACT" "ANN[*].GENE" "ANN[*].FEATURE" "ANN[*].FEATUREID" "ANN[*].BIOTYPE" "ANN[*].RANK" "ANN[*].HGVS_C" "ANN[*].HGVS_P" "ANN[*].CDNA_POS" "ANN[*].CDNA_LEN" "ANN[*].CDS_POS" "ANN[*].CDS_LEN" "ANN[*].AA_POS" "ANN[*].AA_LEN" "ANN[*].DISTANCE" "LOF[*].GENE" "LOF[*].GENEID" "NMD[*].GENE" "NMD[*].GENEID" > {SampleName}.HC.ann.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - -s "," -e "." CHROM POS REF ALT QUAL "GEN[{SampleName}].GT" "GEN[{SampleName}].DP" "GEN[*].AD" "ANN[*].GENEID" "ANN[*].ALLELE" "ANN[*].EFFECT" "ANN[*].IMPACT" "ANN[*].GENE" "ANN[*].FEATURE" "ANN[*].FEATUREID" "ANN[*].BIOTYPE" "ANN[*].RANK" "ANN[*].HGVS_C" "ANN[*].HGVS_P" "ANN[*].CDNA_POS" "ANN[*].CDNA_LEN" "ANN[*].CDS_POS" "ANN[*].CDS_LEN" "ANN[*].AA_POS" "ANN[*].AA_LEN" "ANN[*].DISTANCE" "LOF[*].GENE" "LOF[*].GENEID" "NMD[*].GENE" "NMD[*].GENEID" > {SampleName}.UG.ann.txt

if os.path.isfile(SampleName+'.HC.ann.txt'):
    if os.path.isfile(SampleName+'.UG.ann.txt'):
        display(Markdown('<b>\b</b>'))
    else:
        display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>UG annotation table (.ann.txt) not present in current path: '+os.getcwd()+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>HC annotation table (.ann.txt) not present in current path: '+os.getcwd()+'</div>'))

And sort them by their predicted impact on the wild-type gene (files with extension `.ann.sort.txt`):

In [None]:
## Collect HIGH, MODERATE, and LOW impact variants, in that order, for each caller
!echo "#High impact mutations" > {SampleName}.HC.ann.sort.txt
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'HIGH'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.HC.ann.sort.txt
!echo "#Moderate impact mutations" >> {SampleName}.HC.ann.sort.txt
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'MODERATE'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.HC.ann.sort.txt
!echo "#Low impact mutations" >> {SampleName}.HC.ann.sort.txt
!cat {SampleName}.HC.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'LOW'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.HC.ann.sort.txt

!echo "#High impact mutations" > {SampleName}.UG.ann.sort.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'HIGH'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.UG.ann.sort.txt
!echo "#Moderate impact mutations" >> {SampleName}.UG.ann.sort.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'MODERATE'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.UG.ann.sort.txt
!echo "#Low impact mutations" >> {SampleName}.UG.ann.sort.txt
!cat {SampleName}.UG.ann.vcf | perl {softpath}/SnpEff-5.0/snpEff/scripts/vcfEffOnePerLine.pl | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar filter "ANN[0].IMPACT has 'LOW'" | java -Xmx{ramG}g -jar {softpath}/SnpEff-5.0/snpEff/SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].FEATUREID" "ANN[*].EFFECT" "LOF[*].GENE" >> {SampleName}.UG.ann.sort.txt

!echo "First 10 entries in Unified Genotyper variants"
!head {SampleName}.UG.ann.sort.txt
!echo "First 10 entries in Haplotype Caller"
!head {SampleName}.HC.ann.sort.txt

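SnpEff stores its predictions in the `ANN` entry of the INFO column: annotations are separated by commas, and the fields inside each annotation by `|`, starting with allele, effect, impact, and gene name. As an illustration of the impact ordering used above (HIGH, then MODERATE, then LOW, with MODIFIER last), the sketch below picks the most severe annotation from one INFO string. The function name and the example record, including its gene names, are invented:

```python
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def most_severe_annotation(info_field):
    """Return the annotation with the highest impact in a VCF INFO string, or None."""
    ann_value = None
    for entry in info_field.split(";"):
        if entry.startswith("ANN="):
            ann_value = entry[len("ANN="):]
            break
    if ann_value is None:
        return None
    best = None
    for annotation in ann_value.split(","):
        fields = annotation.split("|")
        if len(fields) < 4:
            continue  # not enough fields to read the impact and gene
        record = {"allele": fields[0], "effect": fields[1],
                  "impact": fields[2], "gene": fields[3]}
        rank = IMPACT_RANK.get(record["impact"], len(IMPACT_RANK))
        if best is None or rank < best[0]:
            best = (rank, record)
    return best[1] if best else None


# Invented INFO column with two annotations for the same variant
info = ("AC=2;AF=1.00;"
        "ANN=T|synonymous_variant|LOW|gene-a|ID1|transcript|T1.1|protein_coding|2/5|c.30C>T|p.Ala10Ala|30/900|30/900|10/299||,"
        "T|stop_gained|HIGH|gene-b|ID2|transcript|T2.1|protein_coding|1/3|c.5G>A|p.Trp2*|5/600|5/600|2/199||")
print(most_severe_annotation(info))
```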
Finally, let's produce links to the SnpEff HTML reports:

In [None]:
display(Markdown('Full reports (open a new tab to see them): \n' + '* [' + SampleName + ' Unified Genotyper results](./' + DirectoryName + '/' + SampleName +'.UG.html)\n' + '* [' + SampleName + ' Haplotype Caller results](./' + DirectoryName + '/' + SampleName +'.HC.html)\n'))

**Done**