# <span style="color:#006E7F">  <center> __DAY 2 : How to analyze vcf results ?__ </center> </span>

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[I - Get some basic stats from vcf files](#statvcf)

[II - FILTERING VCF](#vcffiltering)


***



## <span style="color:#006E7F">__I. Get some basic stats from vcf files__ <a class="anchor" id="statvcf"></a></span> 

In this exercise, we are going to work with the whole vcf generated from the 20 clones.

#### First, download the archive that contains the vcf into your `work` directory and decompress it
Link: https://itrop.ird.fr/sv-training/VCF_CLONES.tar.gz

In [None]:
%%bash
cd /home/jovyan/work/
...

### <span style="color: #4CACBC;">First go into the directory that contains the vcf file to analyze</span> 

* then list the content of this directory


### <span style="color: #4CACBC;">Count the number of variants with `bcftools stats`<a class="anchor" id="bcftools"></a></span> 

* run `bcftools stats` on the vcf file and save the result into the file `output.vcf.stat`
* check that the file has been correctly created and display its first 35 lines
* How many samples were used for this SNP analysis ?
* How many SNPs were detected ?

In [None]:
%%bash
bcftools ...

#### Display the first lines of the stat file

In [None]:
%%bash
head -n 35 output.vcf.stat
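
The two questions above can also be answered programmatically. `bcftools stats` writes its summary numbers on lines that start with `SN`, followed by an id, a key and a value, separated by tabs. The sketch below assumes that layout and collects those lines into a dictionary. The helper name `parse_sn_lines` is hypothetical, and the embedded excerpt is invented for illustration; it is not the real output for the 20 clones.

```python
def parse_sn_lines(text):
    """Collect the 'SN' (summary numbers) lines of a bcftools stats report."""
    summary = {}
    for line in text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comment lines and the other report sections
        fields = line.split("\t")
        # fields look like: ["SN", "0", "number of samples:", "20"]
        key = fields[2].rstrip(":")
        summary[key] = int(fields[3])
    return summary


# Invented excerpt, for illustration only
example = (
    "# SN\t[2]id\t[3]key\t[4]value\n"
    "SN\t0\tnumber of samples:\t20\n"
    "SN\t0\tnumber of SNPs:\t1500\n"
)
stats = parse_sn_lines(example)
print(stats["number of samples"], stats["number of SNPs"])
```

With the real report, the same function could be called on the content of `output.vcf.stat`, for example `parse_sn_lines(open("output.vcf.stat").read())`.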

### <span style="color: #4CACBC;">Generating statistics from a VCF to determine how to set filters on it<a class="anchor" id="vcffilters"></a></span> `vcftools`

We will generate more statistics from the VCF using vcftools (LINK MANUAL), a very useful and fast program for handling vcf files. 
These statistics are easy to compute and give an idea of which filters to apply and how to set their thresholds. 

The main criteria we will consider are:
* Depth: SNPs are usually filtered on both a minimum and a maximum depth. A minimum depth cutoff removes false positive calls and keeps only the higher-quality calls. A maximum cutoff removes regions with extremely high read depth, such as repetitive regions.
* Genotype quality: do not trust any genotype with a Phred score below 20, which corresponds to less than 99% accuracy.
* Minor allele frequency (MAF): low-frequency alleles can cause big problems with SNP calls and can also inflate statistical estimates downstream. Ideally, look at the distribution of your allele frequencies first, but 0.05 to 0.10 is a reasonable cutoff. Keep in mind, however, that some analyses, particularly demographic inference, can be biased by MAF thresholds.
* Missing data: how much missing data are you willing to tolerate? It depends on the study, but typically any site with more than 25% missing data should be dropped.
* Biallelic sites, heterozygosity...
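
The link between a Phred score and accuracy mentioned above follows from the standard definition: the error probability is 10^(-Q/10). The short sketch below only illustrates that arithmetic; the function name is hypothetical.

```python
def phred_to_error(q):
    """Probability that a call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30):
    accuracy = 1 - phred_to_error(q)
    print(f"Q{q}: accuracy = {accuracy:.1%}")
```

A score of 20 corresponds to an error probability of 0.01, which is why genotypes below 20 are considered less than 99% accurate.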

In this training, we will only display the quality and depth distributions, but you should do the same for every value you filter on.

#### <span style="color: #4CACBC;">Mean depth per individual and per site<a class="anchor" id="depthvcf"></a></span> `--depth, --site-mean-depth`

* run vcftools with the correct options
* check that the files have been created and display the first lines

In [None]:
%%bash
vcftools --vcf output.vcf --depth --out depthi
vcftools --vcf output.vcf --site-mean-depth --out depths

#### Display the first lines of the files just created by vcftools

In [None]:
%%bash


#### <span style="color: #4CACBC;">Extracting quality per site<a class="anchor" id="depthvcf"></a></span>  `--site-quality`

* run vcftools with the correct options
* check that the files have been created and display the first lines

In [None]:
%%bash
vcftools ...

#### Display the first lines of the file just created by vcftools

In [None]:
%%bash


#### Display the first lines of the two files just created by vcftools

In [None]:
%%bash


### <span style="color: #4CACBC;">Generating density plots of QUAL and DEPTH<a class="anchor" id="vcfplot"></a></span> 

#### <span style="color: #4CACBC;">Plotting quality per site<a class="anchor" id="qualplot"></a></span> 

In [None]:
import os
import pandas as pd

qual_file="qual.lqual"

# import the file with pandas 
PUT CODE LINE HERE

# print the dataframe 
PUT CODE LINE HERE

# print some stats about the QUAL column
PUT CODE LINE HERE
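
One possible way to fill in the cell above is sketched here. It assumes that the file written by `vcftools --site-quality` is tab-separated with the columns `CHROM`, `POS` and `QUAL`. The small in-memory table is invented so that the snippet runs without the real `qual.lqual`.

```python
import io

import pandas as pd

# Invented stand-in for qual.lqual (tab-separated: CHROM, POS, QUAL)
fake_lqual = io.StringIO(
    "CHROM\tPOS\tQUAL\n"
    "Reference\t120\t45.2\n"
    "Reference\t340\t812.0\n"
    "Reference\t955\t3020.5\n"
)

# With the real file: df_qual = pd.read_csv("qual.lqual", sep="\t")
df_qual = pd.read_csv(fake_lqual, sep="\t")

print(df_qual)                     # the whole dataframe
print(df_qual["QUAL"].describe())  # count, mean, std, min, quartiles, max
```

The same pattern should apply to `depths.ldepth.mean`, using the `MEAN_DEPTH` column instead of `QUAL`.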

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(x="QUAL", data=df_qual)

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns


plt.figure(figsize = (15,8))
ax=sns.kdeplot(x="QUAL", data=df_qual[df_qual.QUAL<10000])
ax.set_title("PUT A TITLE")
ax.set_xlabel("PUT A X-AXIS LABEL")
ax.set_ylabel("PUT A Y-AXIS LABEL")

#### <span style="color: #4CACBC;">Plotting Mean depth per site<a class="anchor" id="depthplot"></a></span> 

In [None]:
depth_file="depths.ldepth.mean"
# import the file with pandas 
PUT CODE LINE HERE

# print the dataframe 
PUT CODE LINE HERE

# print some stats about all the columns
PUT CODE LINE HERE

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Density plot of the column MEAN_DEPTH


In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Density plot of the column MEAN_DEPTH WITH VALUE < 50
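
Besides the density plots, quantiles of the mean depth can give a first idea of where to place the minimum and maximum depth cutoffs. The sketch below uses only the Python standard library. The depth values are invented, and taking the 5th and 95th percentiles is an arbitrary illustrative choice, not a recommendation from this training.

```python
import statistics

# Invented per-site mean depths (the MEAN_DEPTH column of depths.ldepth.mean)
mean_depths = [3.0, 8.5, 12.0, 14.5, 15.0, 16.0, 17.5, 18.0, 19.0, 20.5,
               21.0, 22.0, 23.5, 25.0, 26.0, 28.0, 30.0, 35.0, 60.0, 250.0]

# With n=20, statistics.quantiles returns 19 cut points: 5%, 10%, ..., 95%
cuts = statistics.quantiles(mean_depths, n=20)
low, high = cuts[0], cuts[-1]

print(f"median depth: {statistics.median(mean_depths)}")
print(f"5th percentile: {low:.2f}, 95th percentile: {high:.2f}")
```

Sites far above the upper percentile are candidates for repetitive regions, while sites below the lower one are poorly covered.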

## <span style="color:#006E7F">__II. Filtering VCF__ <a class="anchor" id="vcffiltering"></a></span> 


#### Which filters ?

Here, we will apply the following filters to the vcf:

* QUAL > 300
* DP > 30 and DP < 400
* Fewer than 3 SNPs within a 10 bp window

The threshold of each filter depends on the SNP analysis (number of samples, sequencing depth).
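
To make the third filter more concrete: a SNP cluster here means at least 3 SNPs falling within a 10 bp window. The sketch below flags such positions in a sorted list of SNP positions from a single chromosome. It is a simplified illustration with a hypothetical function name, not GATK's actual implementation, and the exact window boundary convention used by GATK may differ.

```python
def flag_clustered(positions, cluster_size=3, window=10):
    """Return the positions that belong to a group of `cluster_size` SNPs
    spanning at most `window` bp (positions sorted, one chromosome)."""
    flagged = set()
    for i in range(len(positions) - cluster_size + 1):
        group = positions[i:i + cluster_size]
        # span covered by the group, counting both end positions
        if group[-1] - group[0] + 1 <= window:
            flagged.update(group)
    return sorted(flagged)


# Invented positions: the first three SNPs lie within 9 bp of each other
snp_positions = [100, 104, 108, 150, 300, 305, 420]
print(flag_clustered(snp_positions))
```

Only the first three positions are flagged, because no other group of three consecutive SNPs fits inside a 10 bp window.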

In a second step, depending on the downstream analysis (e.g. population genomics), we usually apply other filters such as:
* removing missing data
* keeping only biallelic sites
* heterozygosity
    
#### Select only the SNPs (and remove the INDELs) - `gatk SelectVariants`
* Filter vcf to keep only SNPs
* Check that the new vcf has been created
* Get the number of polymorphisms in the new vcf file 

In [None]:
%%bash
gatk SelectVariants --java-options "-Xmx8G -Xms8G" -R /home/jovyan/work/SV_DATA/REF/reference.fasta -V output.vcf -select-type SNP -O output.onlySNP.vcf

#### Get the SNP count of this new vcf file

In [None]:
%%bash
ls -lrth
bcftools ...

#### Compress the vcf and generate the index of the compressed vcf - `tabix -p vcf vcf_file`

In [None]:
%%bash
bgzip output.onlySNP.vcf
ls -lrth

In [None]:
%%bash
tabix -f -p vcf output.onlySNP.vcf.gz
ls -lrth

#### Apply filters on QUAL, DEPTH and SNP clusters - `gatk VariantFiltration`

Be careful: one filter is missing from the following command.

In [None]:
%%bash
gatk VariantFiltration --java-options "-Xmx12G -Xms12G" -R /home/jovyan/work/SV_DATA/REF/reference.fasta  -V output.onlySNP.vcf.gz --filter-expression "QUAL<200" --filter-name "LOW_QUAL" --filter-expression "DP<30" --filter-name "LOW_DP"     --cluster-size 3 --cluster-window-size 10  -O output.filteredSNP.vcf

#### List the content of the directory to check that the new vcf has been correctly created

In [26]:
...

total 1.4G
-rw-r----- 1 jovyan users 149K Jun 17 09:46 output.vcf.idx
-rw-r----- 1 jovyan users 756M Jun 17 09:49 rawSNP.vcf
-rw-r----- 1 jovyan users   19 Jun 17 09:49 genome.txt
-rw-r----- 1 jovyan users 234K Jun 17 09:49 rawSNP.vcf.idx
-rw-r----- 1 jovyan users 475M Jun 17 09:51 output.vcf
-rw-r--r-- 1 jovyan users  40K Jun 20 20:35 output.vcf.stat
-rw-r--r-- 1 jovyan users  473 Jun 20 20:39 depthi.idepth
-rw-r--r-- 1 jovyan users  17M Jun 20 20:39 depths.ldepth.mean
-rw-r--r-- 1 jovyan users  14M Jun 20 20:39 qual.lqual
-rw-r--r-- 1 jovyan users  17M Jun 20 20:40 AF.frq
-rw-r--r-- 1 jovyan users  15M Jun 20 20:40 AF2.frq
-rw-r--r-- 1 jovyan users 147K Jun 20 20:46 output.onlySNP.vcf.idx
-rw-r--r-- 1 jovyan users  50M Jun 20 20:46 output.onlySNP.vcf.gz
-rw-r--r-- 1 jovyan users  830 Jun 20 21:23 output.onlySNP.vcf.gz.tbi
-rw-r--r-- 1 jovyan users  57M Jun 20 21:27 output.filteredSNP.vcf


#### Compress and index the new vcf file

In [None]:
%%bash


#### How many SNPs are kept after applying these filters ?

In [None]:
%%bash


#### _Commands to apply other filters_

##### Remove sites with more than 5% missing data (12 samples out of 228) - `vcftools`

`vcftools --vcf input.vcf --max-missing-count 12 --remove-filtered-all --recode --recode-INFO-all  --out output.onlySNP.12na`
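
The value 12 in the command above follows from the 5% threshold: 5% of 228 samples is 11.4, rounded up to 12. A small sketch of that arithmetic, with a hypothetical helper name:

```python
import math


def max_missing_count(n_samples, max_missing_fraction):
    """Number of missing genotypes corresponding to a fraction of the
    samples, rounded up to a whole sample."""
    return math.ceil(n_samples * max_missing_fraction)


print(max_missing_count(228, 0.05))
```

The same helper can be reused to adapt the threshold to another number of samples.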

##### Keep only SNPs that are homozygous in more than 222 samples - `SnpSift`

`cat input.vcf | java -jar SnpSift.jar filter "(countHom() > 222)" > output.onlySNP.12na.recode.222homoz.vcf`

##### keep only biallelic sites
`vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --remove-filtered-all --recode --recode-INFO-all --out output.onlySNP.filteredPASS.na.225.Homoz.biallelic`

## <span style="color:#006E7F">V - Compute the SNP density along the chromosomes<a class="anchor" id="density"></a></span>  

In [None]:
%%bash
echo -e "Reference\t1000000\n" > genome.txt
bedtools genomecov -bga -split -i output.vcf -g genome.txt > density.csv

In [None]:
%%bash
head density.csv
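
Another way to look at SNP density is to count SNPs per fixed-size window directly from their positions. The sketch below does this with the Python standard library. The positions and the 100 kb window size are invented for illustration, and this is not what `bedtools genomecov` computes internally.

```python
from collections import Counter


def snp_density(positions, window_size):
    """Count SNPs per window; keys are 0-based window start coordinates."""
    counts = Counter((pos - 1) // window_size * window_size for pos in positions)
    return dict(sorted(counts.items()))


# Invented 1-based SNP positions on one chromosome
positions = [1_500, 42_000, 99_999, 100_001, 250_000, 260_500, 299_000]
for start, n in snp_density(positions, 100_000).items():
    print(f"{start}\t{start + 100_000}\t{n}")
```

Windows without any SNP are simply absent from the result, so they would need to be added back with a count of 0 before plotting a density along the whole chromosome.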