# Reference indexing, mapping, and coverage
* [Amhed Missael Vargas Velazquez](https://www.researchgate.net/profile/Amhed-Vargas-Velazquez)
* Post-doctoral fellow, [SGB lab](https://syngenbio.kaust.edu.sa/), [KAUST](https://www.kaust.edu.sa/en)

## Description
This Jupyter notebook contains commands to identify genomic variants in any polymorphic *C. elegans* strain. The description has been shortened for ease of reading.

## Getting started
Run the cells below to create a working directory, load the necessary Python libraries, and verify that you have the required software.

### Load Python libraries
Run the cell below to load the libraries the pipeline needs:

In [None]:
## Load libraries
#os to move within directories
import os
#IPython.display for markdown
from IPython.display import display, Markdown

### Create a folder that can be accessed by the pipeline
Before starting any analysis, select the folder where it will be performed (*unless* stated otherwise, the analysis runs in the **same folder as this notebook**):

In [None]:
##Set working directory
#Same location as script
path = os.getcwd()
#or somewhere else, e.g.:
#path = '/home/jupyter-user/Workstation/user/parental'

##Move to path
os.chdir(path)

##Show current directory to user
display(Markdown('<div class=\"alert alert-block alert-info\">Directory for analysis:<br><b>' + os.getcwd() + '</b></div>'))

### Make sure to have the "stand alone" software
Most of the programs used in this pipeline have "stand-alone" versions that can run on any computer. First, make sure these programs are available. In particular, you need a directory containing the following:

- GATK (GenomeAnalysisTK.jar)
- Picard (picard.jar)
- SnpEff (folder with both snpEff.jar and SnpSift.jar, and another folder with its database; more below)

For your convenience, there is a folder already containing these programs. Just make sure to set the path to them correctly, e.g.:

In [None]:
##Path to software folder
softpath = '/home/WGS_pipeline/Software'

##GATK v3.8.1.0
GATKpath = (softpath + '/gatk-3.8.1.0')

##Picard v2.23.6 
PiKpath = (softpath + '/picard2.23.6')

##SnpEff v5.0
Snpath = (softpath + '/SnpEff-5.0/snpEff')

##Alternative paths
#GATKpath = ''
#PiKpath = ''
#Snpath = ''

##Check if jar files are there
##Notify user if GATK .jar is present or not
if os.path.isfile(GATKpath + '/GenomeAnalysisTK.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b GATK</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>GATK not found in: ' + GATKpath +'</div>'))

##Notify user if Picard .jar is present or not
if os.path.isfile(PiKpath + '/picard.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b Picard</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Picard not found in: ' + PiKpath +'</div>'))

##Notify user if SnpEff .jar is present or not
if os.path.isfile(Snpath + '/snpEff.jar'):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>\b SnpEff</b></div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>SnpEff not found in: ' + Snpath +'</div>'))
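
The checks above only cover the `.jar` files, while later cells also call command-line programs through `!` (java, samtools, bwa, fastqc, unzip). As a minimal sketch, and assuming those programs are expected on the `PATH`, a similar check could look like the following. The helper name `find_missing_tools` is hypothetical and not part of the original pipeline.

```python
import shutil

def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Command-line programs invoked with `!` in the cells of this notebook
required_tools = ['java', 'samtools', 'bwa', 'fastqc', 'unzip']
missing = find_missing_tools(required_tools)
if missing:
    print('Not found on PATH: ' + ', '.join(missing))
else:
    print('All command-line tools found')
```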

### Make sure to have a reference genome
To run this pipeline, a *C. elegans* reference genome is needed (preferably in FASTA format). The cells below allow you to prepare your own reference file or to download the *C. elegans* ce11 (WBcel235) assembly from Ensembl.

First, download or specify the location of your reference genome:

In [None]:
##Specify the location of the reference genome
#e.g.
RefFile=('/home/jupyter-newuser/Workstation/Amhed/Caenorhabditis_elegans.fa')

###OR

##To download ce11, uncomment the lines below (remove the # sign)

#!mkdir -p {path}/data

#os.chdir(path+'/data')

#!wget -q ftp://ftp.ensembl.org/pub/release-99/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz

#!zcat Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz > Caenorhabditis_elegans.WBcel235.99.softmasked.fa

#RefFile=(path+'/data/'+'Caenorhabditis_elegans.WBcel235.99.softmasked.fa')

##Notify user if file present or not
if os.path.isfile(RefFile):
    display(Markdown('<div class=\"alert alert-block alert-success\"><b>Reference Genome :</b>\n'+RefFile+'</div>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Reference file: '+RefFile+' does not exist</div>'))

Then it needs to be indexed with samtools:

In [None]:
!samtools faidx {RefFile}

##Notify user if this step was successful
if os.path.isfile(RefFile+'.fai'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not produced for '+RefFile+'</div>'))
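
The `.fai` index is a tab-separated text file whose first two columns are the sequence name and its length. As a sketch, it can be read back to confirm that the expected chromosomes are present. The function name `parse_fai` is hypothetical and not part of the original pipeline.

```python
def parse_fai(lines):
    """Return {sequence name: length} from the lines of a samtools .fai index."""
    lengths = {}
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lengths[fields[0]] = int(fields[1])
    return lengths
```

Because the function accepts any iterable of lines, an open file handle works directly, e.g. `with open(RefFile + '.fai') as handle: print(parse_fai(handle))`.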

And with bwa:

In [None]:
!bwa index {RefFile} > /dev/null 2>&1

##Notify user if this step was successful
if os.path.isfile(RefFile+'.bwt'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not produced for '+RefFile+'</div>'))
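
The cell above only tests for the `.bwt` file, but `bwa index` writes five files next to the reference (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). A slightly stricter check could be sketched as below. The helper name `missing_bwa_index_files` is hypothetical.

```python
import os

# Suffixes of the files written by `bwa index`
BWA_INDEX_SUFFIXES = ('.amb', '.ann', '.bwt', '.pac', '.sa')

def missing_bwa_index_files(ref_path):
    """Return the expected bwa index files that do not exist for `ref_path`."""
    expected = [ref_path + suffix for suffix in BWA_INDEX_SUFFIXES]
    return [name for name in expected if not os.path.isfile(name)]
```

An empty list from `missing_bwa_index_files(RefFile)` means that all five index files are in place.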

And with Picard tools:

In [None]:
##Base name of the reference without its last extension (e.g. '.fa')
RefFilebase = os.path.splitext(RefFile)[0]
!java -jar {PiKpath}/picard.jar CreateSequenceDictionary R={RefFile} O={RefFilebase}.dict > /dev/null 2>&1

##Notify user if this step was successful
if os.path.isfile(RefFilebase +'.dict'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Sequence dictionary not produced for '+RefFile+'</div>'))

Now your reference genome is ready for the analysis in this pipeline (**as well as many others**).

In [None]:
ReferenceGenome = RefFile

## Data analysis workflow
The pipeline consists of:
* Quality assessment of sequencing reads with FastQC
* Mapping of reads with bwa
* Filtering and processing of the alignment file with samtools and Picard
* Coverage analysis with samtools

But before that, let's create a directory where the analysis will be executed:

In [None]:
##Move to working directory
os.chdir(path)
##Create directory to place data
!mkdir -p {path}/analysis
##Move to that directory
os.chdir( path + '/analysis' )

Let's now define the parameters for the pipeline:

In [None]:
##Inputs
SampleName = 'CFJ125'
FastqF1 = ('/home/jupyter-newuser/Workstation/CFJ125/CFJ125_R1.fastq') 
FastqF2 = ('/home/jupyter-newuser/Workstation/CFJ125/CFJ125_R2.fastq')
SnpEffGen = 'WBcel235.99'

#Minimum mapping quality for bam filtering
minMaqQforBam = 1
#Minimum quality for vcf filtering
minMaqQforVcf = 10
#Minimum base quality for variant calling
minBaseQforVcf = 10
#
minVarCall = 10
diffStep = 10
minDel = 2
maxSample= 2

#Number of threads
Ncpu=4
#RAM for java (in GB)
ramG=20

### Quality assessment of Fastq files
A simple way to verify the quality of your sequencing runs is via the FastQC program. The following cells will make a directory to run the analysis and output its results.

In [None]:
##Create directory to place data
!mkdir -p FastQC
##Move to that directory
os.chdir(path + '/analysis/FastQC')

In [None]:
##Perform fastQC analysis
!fastqc -t {Ncpu} {FastqF1} {FastqF2} -o . > /dev/null 2>&1

##Notify user if this step was successful
#Report base name: file name up to the first dot (assumes no other dots in the file name)
tempname1 = FastqF1.split("/")[-1].split(".")[0]
tempname2 = FastqF2.split("/")[-1].split(".")[0]
if os.path.isfile(tempname1 + '_fastqc.html'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">'))
    !unzip -o -qq \*.zip
    display(Markdown('<b>Metrics:</b>'))
    print(tempname1)
    !cat {tempname1}_fastqc/summary.txt
    print(tempname2)
    !cat {tempname2}_fastqc/summary.txt
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>FASTQC report not produced in current directory: '+os.getcwd()+'</div>'))
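
Each line of a FastQC `summary.txt` holds a status (`PASS`, `WARN` or `FAIL`), the module name, and the input file name, separated by tabs. Rather than reading the whole summary by eye, the modules that need attention can be picked out with a short sketch like this one. Both function names are hypothetical and not part of the original pipeline.

```python
def parse_fastqc_summary(lines):
    """Return a list of (status, module) tuples from FastQC summary.txt lines."""
    results = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            results.append((fields[0], fields[1]))
    return results

def flagged_modules(results):
    """Return the module names whose status is WARN or FAIL."""
    return [module for status, module in results if status in ('WARN', 'FAIL')]
```

For example, `with open(tempname1 + '_fastqc/summary.txt') as handle: print(flagged_modules(parse_fastqc_summary(handle)))` lists the flagged modules for the first read file.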

#### HTML reports
Run the cell below to link the FastQC results to this notebook.

In [None]:
display(Markdown('Full reports (open a new tab to see them): \n' + '* [' + tempname1 + '](./analysis/FastQC/' + tempname1 +'_fastqc.html)\n' + '* [' + tempname2 + '](./analysis/FastQC/' + tempname2 +'_fastqc.html)\n'))

### Mapping reads to reference genome using bwa

After assessing the quality of the Fastq reads, we will map them to the reference genome using bwa. Running the cells below will create a directory where the analysis will be performed.

In [None]:
##Move to analysis
os.chdir(path + '/analysis/')
##Create directory to place data
!mkdir -p Alignments
##Move to that directory
os.chdir(path + '/analysis/Alignments')

In [None]:
!bwa mem -t {Ncpu} -M -o {SampleName}.sam {ReferenceGenome} {FastqF1} {FastqF2} > /dev/null 2>&1

##Notify user if previous step was successful
if os.path.isfile(SampleName+'.sam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Sam file not present in current path: '+os.getcwd()+'</div>'))
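
Because the bwa cell sends all messages to `/dev/null`, a failed run only shows up as a missing file. One possible alternative, sketched below, builds the same command as an argument list and keeps bwa's stderr in a log file. The helper names `build_bwa_mem_command` and `run_with_log`, and any log file name, are hypothetical.

```python
import subprocess

def build_bwa_mem_command(reference, fastq1, fastq2, out_sam, threads=4):
    """Build the argument list for a paired-end `bwa mem` run."""
    return ['bwa', 'mem',
            '-t', str(threads),   # number of threads
            '-M',                 # mark shorter split hits as secondary
            '-o', out_sam,        # write SAM output to this file
            reference, fastq1, fastq2]

def run_with_log(command, log_path):
    """Run `command`, writing its stderr to `log_path`; raise if it fails."""
    with open(log_path, 'w') as log:
        subprocess.run(command, stderr=log, check=True)
```

With the variables of this notebook, a call could look like `run_with_log(build_bwa_mem_command(ReferenceGenome, FastqF1, FastqF2, SampleName + '.sam', Ncpu), SampleName + '.bwa.log')`.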

Now let's convert it to a sorted bam:

In [None]:
!samtools view -@ {Ncpu} -S -bh {SampleName}.sam | samtools sort -@ {Ncpu} - > {SampleName}.bam

if os.path.isfile(SampleName+'.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Bam file not present in current path: '+os.getcwd()+'</div>'))

Now let's index it:

In [None]:
!samtools index {SampleName}.bam

if os.path.isfile(SampleName+'.bam.bai'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not present in current path: '+os.getcwd()+'</div>'))

Filter by mapping quality:

In [None]:
!samtools view -@ {Ncpu} -q {minMaqQforBam} -bh {SampleName}.bam > {SampleName}.SF.bam

if os.path.isfile(SampleName+'.SF.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Filtered file not present in current path: '+os.getcwd()+'</div>'))
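
Mapping quality (MAPQ) is Phred-scaled: it equals -10·log10 of the probability that the mapping position is wrong. `samtools view -q` skips alignments whose MAPQ is below the threshold, so a threshold of 1 (the value of `minMaqQforBam` above) only drops alignments with MAPQ 0, which bwa typically assigns to reads that map equally well to more than one location. The small sketch below shows the conversion; the function name is hypothetical.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality into an error probability."""
    return 10 ** (-mapq / 10)

# Print the error probability for a few common thresholds
for mapq in (0, 1, 10, 20, 30):
    print(mapq, round(mapq_to_error_probability(mapq), 4))
```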

Then add or replace read groups with Picard:

In [None]:
!java -Xmx{ramG}g -jar {PiKpath}/picard.jar AddOrReplaceReadGroups I={SampleName}.SF.bam O={SampleName}.RG.bam RGID={SampleName} RGLB=LB RGPL=illumina RGPU=PU RGSM={SampleName} > /dev/null 2>&1

if os.path.isfile(SampleName+'.RG.bam'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Read groups file not present in current path: '+os.getcwd()+'</div>'))

Then we mark and remove duplicates:

In [None]:
!java -Xmx{ramG}g -jar {PiKpath}/picard.jar MarkDuplicates I={SampleName}.RG.bam O={SampleName}.Dup.bam M={SampleName}.dedupMetrics REMOVE_DUPLICATES=true > /dev/null 2>&1

if os.path.isfile(SampleName+'.Dup.bam'):
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Alignment file without duplicates not present in current path: '+os.getcwd()+'</div>'))
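
MarkDuplicates also writes its metrics to the file given with `M=` (here `{SampleName}.dedupMetrics`). The sketch below reads the first metrics row from such a file. It assumes the usual Picard metrics layout: a line starting with `## METRICS CLASS`, followed by a tab-separated header line and a line of values. The function name `parse_picard_metrics` is hypothetical, and the available column names depend on the Picard version.

```python
def parse_picard_metrics(lines):
    """Return the first metrics row of a Picard metrics file as a dict of strings."""
    rows = [line.rstrip('\n') for line in lines]
    for i, line in enumerate(rows):
        if line.startswith('## METRICS CLASS'):
            if i + 2 >= len(rows):
                break  # header or value line is missing
            header = rows[i + 1].split('\t')
            values = rows[i + 2].split('\t')
            return dict(zip(header, values))
    return {}
```

For example, `with open(SampleName + '.dedupMetrics') as handle: print(parse_picard_metrics(handle))` shows the duplication metrics for the sample.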

Index the bam:

In [None]:
!samtools index {SampleName}.Dup.bam

if os.path.isfile(SampleName+'.Dup.bam.bai'):
    #display(Markdown('<div class=\"alert alert-block alert-success\">\b</div>'))
    display(Markdown('<b>\b</b>'))
else:
    display(Markdown('<div class=\"alert alert-block alert-danger\"><b>Error:</b><br>Index file not present in current path: '+os.getcwd()+'</div>'))

Feel free to download the files before realignment by following the links:

In [None]:
display(Markdown('Download bam files (right-click and \"save as\"): \n' + '* [' + SampleName + '.Dup.bam](./analysis/Alignments/' + SampleName +'.Dup.bam)\n' + '* [' + SampleName + '.Dup.bam.bai](./analysis/Alignments/' + SampleName +'.Dup.bam.bai)\n'))

### Coverage analysis with samtools
Finally, let's see basic stats of the reads mapped in the last bam file:

In [None]:
!samtools coverage {SampleName}.Dup.bam
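
`samtools coverage` prints one tab-separated row per reference sequence, with the columns `#rname`, `startpos`, `endpos`, `numreads`, `covbases`, `coverage`, `meandepth`, `meanbaseq` and `meanmapq`. As a sketch, the table can be parsed to get a single length-weighted mean depth for the whole genome. Both function names are hypothetical and not part of the original pipeline.

```python
def parse_samtools_coverage(lines):
    """Parse the tabular output of `samtools coverage` into a list of dicts."""
    header = None
    records = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        fields = line.split('\t')
        if header is None:
            # First non-empty line is the header; drop the leading '#'
            header = [name.lstrip('#') for name in fields]
            continue
        records.append(dict(zip(header, fields)))
    return records

def genome_mean_depth(records):
    """Return the length-weighted mean depth across all sequences."""
    total_length = 0
    weighted_depth = 0.0
    for rec in records:
        length = int(rec['endpos']) - int(rec['startpos']) + 1
        total_length += length
        weighted_depth += float(rec['meandepth']) * length
    return weighted_depth / total_length if total_length else 0.0
```

In a notebook cell, the output can be captured as a list of lines with `lines = !samtools coverage {SampleName}.Dup.bam` and then passed to `genome_mean_depth(parse_samtools_coverage(lines))`.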

**Done**