#### <span style="color:#grey"> __South Green Training 2022 - Structural Variant Detection Using Short and Long Reads__ </span>

# <span style="color:#006E7F">  <center> __DAY 2: How to analyze mapping results?__ </center> </span>

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[I - Get some basic mapping stats with samtools flagstat](#mappingstats)

* [Run samtools flagstat](#flagstat)
* [Samtools flagstat output](#flagstatoutput)
* [Merge individual flagstat files into a single file with python code](#multiflagstat) 
* [Plot mapping ratio per sample](#ratioplot)
* [EXERCISE: DO THE SAME WITH MINIMAP2 RESULTS](#minimap)

[II - Get some basic stats from vcf files](#statvcf) 
* [Count the number of variants with `bcftools stats`](#bcftools)
* [Generating statistics from a VCF to determine how to set filters on it](#vcffilters)
* [Generating density plot QUAL & DEPTH](#vcfplot) 

[III - FILTERING VCF](#vcffiltering)


***



## <span style="color:#006E7F">__I - Get some basic mapping stats with samtools flagstat__ <a class="anchor" id="mappingstats"></a></span>  

### <span style="color: #4CACBC;"> First go into the directory that contains all the bam files</span>  


In [None]:
%cd /home/jovyan/work/MAPPING-ILL/BAM
%ls

### <span style="color: #4CACBC;">Run samtools flagstat on each bam file (generated by bwa-mem2) - `for loop`<a class="anchor" id="flagstat"></a></span> 
Save the flagstat output to a file, e.g. Clone2.bam -> Clone2.bam.flagstat

In [None]:
%%bash

# Quote the variable so that file names containing spaces are handled correctly
for file in *.bam
do
    echo "$file"
    samtools flagstat "$file" > "$file.flagstat"
done

In [None]:
ls *flagstat 

### <span style="color: #4CACBC;">Let's look at the content of one file <a class="anchor" id="flagstatoutput"></a></span> 

In [None]:
cat Clone10*stat

### <span style="color: #4CACBC;">Merge individual flagstat files into a single file with python code <a class="anchor" id="multiflagstat"></a></span> 

In [None]:
# IMPORT PYTHON PACKAGE USED BY THE CODE
import os
import pandas as pd

# VARIABLE INITIALIZATION

## NAME OF THE DIRECTORY THAT CONTAINS FLAGSTAT FILES
flagstat_dir = "/home/jovyan/work/MAPPING-ILL/BAM" #PUT THE DIRECTORY NAME THAT CONTAINS FLAGSTAT FILES 

## NAME OF THE FILE THAT WILL CONTAIN ALL THE FLAGSTAT RESULTS
stat_file = f"{flagstat_dir}/all_stat.csv"

# PRINT THE CONTENT OF 2 PREVIOUS VARIABLES INITIALIZED
print("DIRECTORY : ",flagstat_dir)
print("FINAL STAT FILE : ",stat_file)


In [None]:
%pwd

In [None]:
# OPEN THE FINAL FILE IN WHICH WE WRITE SOME STATS EXTRACTED FROM EACH INDIVIDUAL FILE GENERATED BY SAMTOOLS FLAGSTAT
with open(stat_file, 'w') as stat: 
    # WRITE A HEADER LINE IN OUR STAT FILE (THE THIRD VALUE COMES FROM THE "singletons" LINE)
    header_line = "sample,mapped,paired,singletons"
    stat.write(header_line)
    
    # READ EACH FILE OF THE FLAGSTAT DIRECTORY (SORTED SO THAT SAMPLES COME OUT IN A STABLE ORDER)
    for file in sorted(os.listdir(flagstat_dir)):
        # If flagstat is in the name of the file
        if "flagstat" in file:
            # Extract the sample name and save it into a new variable new_line
            new_line = f"\n{file.split('.')[0]},"
            # OPEN AND READ THE FLAGSTAT FILE (FULL PATH, SO THE CELL WORKS FROM ANY WORKING DIRECTORY)
            with open(os.path.join(flagstat_dir, file), "r") as flagstat:
                # read the file line by line
                for line in flagstat:
                    # remove the newline character at the end of the line
                    line = line.rstrip()
                    # Skip the "primary mapped (" line that some samtools versions print,
                    # so that each row keeps exactly three values
                    if 'primary mapped' in line:
                        continue
                    # Keep only the lines containing mapped, paired or singleton
                    if 'mapped (' in line or 'paired (' in line or 'singleton' in line:
                        # get the percentage value and save it into the variable called perc
                        perc = f"{line.split('(')[1].split('%')[0]}"
                        new_line += f"{perc},"
                # WRITE THE LINE ONCE THE FLAGSTAT FILE HAS BEEN COMPLETELY READ
                stat.write(new_line.strip(","))
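The parsing step can also be tried on a single report held in memory, without touching the file system. The sketch below is only an illustration: the function name `parse_flagstat_percentages` and the sample report are made up for the example, and the exact set of lines printed by `samtools flagstat` depends on the samtools version.

```python
import re

# Matches lines such as "980 + 0 mapped (98.00% : N/A)".
# Anchoring on the two read counts keeps "primary mapped" lines out.
FLAGSTAT_LINE = re.compile(
    r"^\d+ \+ \d+ (mapped|properly paired|singletons) \(([\d.]+)%"
)

def parse_flagstat_percentages(text):
    """Return {label: percentage} for the three lines of interest."""
    percentages = {}
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            percentages[match.group(1)] = float(match.group(2))
    return percentages

# Hypothetical report, shortened for the example
example_report = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 duplicates
980 + 0 mapped (98.00% : N/A)
1000 + 0 paired in sequencing
960 + 0 properly paired (96.00% : N/A)
5 + 0 singletons (0.50% : N/A)
"""

flagstat_example = parse_flagstat_percentages(example_report)
print(flagstat_example)
```

Returning a dictionary keyed by label, rather than a comma-separated string, makes it harder to put a value under the wrong column name.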

### <span style="color: #4CACBC;">Display the content of the final stat file  <a class="anchor" id="statfile"></a></span> 

In [None]:
%cat $stat_file

### <span style="color: #4CACBC;">Plot mapping ratio per sample <a class="anchor" id="ratioplot"></a></span> 

#### Load the csv file into a pandas DataFrame


In [None]:
df_bam_stat = pd.read_csv(stat_file, index_col=False, sep=",")
df_bam_stat

#### Basic stats

In [None]:
# Display only the values of the "mapped" column
print(df_bam_stat['mapped'])

In [None]:
# Display the min, max and mean of this column
minimum = df_bam_stat["mapped"].min()
maximum = df_bam_stat["mapped"].max()
mean_flag = df_bam_stat["mapped"].mean()

print("\n######## BASIC STATS\n MAPPED")       
print(f"\t%min : {minimum}\t %max : {maximum}\t %mean : {mean_flag}")


In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="sample",y="paired", data=df_bam_stat)

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

ax=sns.scatterplot(x="sample",y="value", hue='variable', data=pd.melt(df_bam_stat, 'sample'))
ax.set_title("scatterplot from mapping using clones ")
ax.set_xlabel("Clones")
ax.set_ylabel("Mapping percentage")

### <span style="color: #4CACBC;"> EXERCISE: DO THE SAME THING WITH MINIMAP2 RESULTS <a class="anchor" id="minimap"></a></span> 

## <span style="color:#006E7F">__II. Get some basic stats from vcf files__ <a class="anchor" id="statvcf"></a></span> 

### <span style="color: #4CACBC;">First go into the directory that contains vcf file  </span> 

In this exercise, we are going to work with a real VCF from the IRIGIN project on rice. 

In [None]:
%cd /home/jovyan/work/VCF_REAL/
%ls -lrt

### <span style="color: #4CACBC;">Count the number of variants with `bcftools stats`<a class="anchor" id="bcftools"></a></span> 

In [None]:
%%bash
bcftools stats final.genotype.vcf >  final.genotype.stats

In [None]:
%%bash
head -n 35 final.genotype.stats
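The counts of interest sit in the rows tagged `SN` (summary numbers) of the report. As a minimal sketch, assuming the usual tab-separated layout of those rows (tag, set id, key, value), they can be collected into a dictionary. The function name `parse_summary_numbers` is hypothetical and the counts in the sample text are invented for the example.

```python
def parse_summary_numbers(stats_text):
    """Collect the 'SN' rows of a bcftools stats report into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        # Data rows start with the tag "SN"; comment rows start with "#"
        if not line.startswith("SN\t"):
            continue
        _tag, _set_id, key, value = line.split("\t")[:4]
        summary[key.rstrip(":")] = int(value)
    return summary

# Invented excerpt standing in for final.genotype.stats
example_stats = "\n".join([
    "# SN\t[2]id\t[3]key\t[4]value",
    "SN\t0\tnumber of samples:\t2",
    "SN\t0\tnumber of records:\t150",
    "SN\t0\tnumber of SNPs:\t120",
    "SN\t0\tnumber of indels:\t30",
])

sn_example = parse_summary_numbers(example_stats)
print(sn_example["number of SNPs"])
```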

### <span style="color: #4CACBC;">Generating statistics from a VCF to determine how to set filters on it<a class="anchor" id="vcffilters"></a></span> 

We will generate more statistics from a VCF using vcftools (LINK MANUAL), a fast and very useful program for handling VCF files. 
These statistics help us decide which filters to apply and give an idea of where to set the filtering thresholds. 

The main information we will consider is:
* Depth: we usually filter SNPs on both a minimum and a maximum depth. A minimum depth cut-off removes false positive calls and keeps the higher quality calls. A maximum cut-off removes regions with extremely high read depth, such as repetitive regions.
* Genotype quality: with this filter, we do not trust any genotype with a Phred score below 20, which corresponds to less than 99% accuracy.
* Minor allele frequency (MAF): low-frequency alleles can cause big problems with SNP calls and can also inflate statistical estimates downstream. Ideally you want an idea of the distribution of your allele frequencies, but 0.05 to 0.10 is a reasonable cut-off. Keep in mind, however, that some analyses, particularly demographic inference, can be biased by MAF thresholds.
* Missing data: how much missing data are you willing to tolerate? It depends on the study, but typically any site with more than 25% missing data should be dropped.
* Biallelic sites, heterozygosity...

In this training, we will only display the quality and depth distributions, but you should do the same for every value you filter on.
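The link between a Phred score of 20 and 99% accuracy follows from the definition of the Phred scale, Q = -10 · log10(P), where P is the probability that the call is wrong. A short sketch of the conversion (the function names are chosen for this example only):

```python
import math

def phred_to_error_probability(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Each step of 10 on the Phred scale divides the error probability by 10, which is why a threshold of 20 is much stricter than a threshold of 10.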

#### <span style="color: #4CACBC;">Mean depth per individual and per site<a class="anchor" id="depthvcf"></a></span> 



In [None]:
%%bash
vcftools --gzvcf  final.genotype.vcf --depth --out depthi
vcftools --gzvcf  final.genotype.vcf --site-mean-depth --out depths

In [None]:
%%bash
ls -lrt
head *depth*

#### <span style="color: #4CACBC;">Extracting quality per site<a class="anchor" id="depthvcf"></a></span> 


In [None]:
%%bash
vcftools --gzvcf final.genotype.vcf  --site-quality --out qual

In [None]:
%%bash
ls -lrt
head *qual


__Calculate allele frequency__

* `--freq2` outputs the frequencies without information about the alleles.
* `--freq` also returns the identity of the alleles.
* `--max-alleles 2` excludes sites that have more than two alleles.

In [None]:
%%bash
vcftools --gzvcf final.genotype.vcf --freq --out AF --max-alleles 2
vcftools --gzvcf final.genotype.vcf --freq2 --out AF2 --max-alleles 2
ls -lrt
head *.frq

### <span style="color: #4CACBC;">Generating density plot QUAL & DEPTH<a class="anchor" id="vcfplot"></a></span> 

#### <span style="color: #4CACBC;">Plotting quality per site<a class="anchor" id="qualplot"></a></span> 

In [None]:
qual_file="qual.lqual"
df_qual = pd.read_csv(qual_file, index_col=False, sep="\t")
print(df_qual)
df_qual.describe()

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(x="QUAL", data=df_qual)

#### <span style="color: #4CACBC;">Plotting Mean depth per site<a class="anchor" id="depthplot"></a></span> 

In [None]:
depth_file="depths.ldepth.mean"
df_depth = pd.read_csv(depth_file, index_col=False, sep="\t")
print(df_depth)
df_depth.describe()

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(x="MEAN_DEPTH", data=df_depth)
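A density plot gives a visual impression of the distribution; quantiles put numbers on it. The sketch below uses a small made-up table in place of `depths.ldepth.mean`, and the 5% and 95% quantiles are an arbitrary choice for illustration, not a recommended rule for setting depth cut-offs.

```python
import pandas as pd

# Made-up mean depths standing in for the MEAN_DEPTH column
example_depth = pd.DataFrame(
    {"MEAN_DEPTH": [2, 8, 10, 11, 12, 12, 13, 14, 15, 90]}
)

# Candidate lower and upper cut-offs taken from the distribution itself
low_cut, high_cut = example_depth["MEAN_DEPTH"].quantile([0.05, 0.95])

# Sites whose mean depth falls between the two cut-offs (bounds included)
kept = example_depth[example_depth["MEAN_DEPTH"].between(low_cut, high_cut)]

print(f"candidate cut-offs: {low_cut:.2f} to {high_cut:.2f}")
print(f"sites kept: {len(kept)} of {len(example_depth)}")
```

The same approach applies to the `QUAL` column of `df_qual`.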

## <span style="color:#006E7F">__III - FILTERING VCF__ <a class="anchor" id="vcffiltering"></a></span> 


In [None]:
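#### _GATK VARIANT FILTRATION_

Sites are kept when they meet all of these criteria:

* DP > 10
* QUAL > 200
* Fewer than 3 SNPs within a 10 bp window
* DP < 20000

The filtering pipeline runs these steps in order:

* GATK VariantFiltration
* GATK SelectVariants
* VCFTOOLS NA (missing data)
* SnpSift HOMOZ (homozygosity)
* VCFTOOLS BIALLELIC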

In [None]:
######## GATK VARIANT FILTRATION

## Cmd : module load bioinfo/gatk/4.1.4.1; 
##       gatk VariantFiltration --java-options "-Xmx45G -Xms45G" -R OglaRS2.ADWL02-allCtgsIRIGIN_TOG5681.dedup8095-NR.fasta     -V ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.vcf.gz --filter-expression "QUAL<200" --filter-name "LOW_QUAL"    --filter-expression "DP<10" --filter-name "LOW_DP"     --cluster-size 3 --cluster-window-size 10     --filter-expression "DP>20000" --filter-name "HIGH-DP"   -O ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.vcf
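To see what the three `--filter-expression` thresholds do, the same logic can be mimicked on a small table. This is only a sketch with pandas on invented values: it is not how GATK works internally, it assumes QUAL and DP are available as plain columns, and it leaves out the cluster filter (`--cluster-size`, `--cluster-window-size`).

```python
import pandas as pd

# Invented sites; the real values come from the VCF
sites = pd.DataFrame({
    "POS": [100, 250, 400, 900],
    "QUAL": [350.0, 150.0, 800.0, 500.0],
    "DP": [25, 40, 5, 30000],
})

# One boolean mask per filter, named like the filters in the command above
flags = {
    "LOW_QUAL": sites["QUAL"] < 200,
    "LOW_DP": sites["DP"] < 10,
    "HIGH-DP": sites["DP"] > 20000,
}

# A site gets the names of every filter it fails, or PASS if it fails none
sites["FILTER"] = [
    ";".join(name for name, mask in flags.items() if mask[i]) or "PASS"
    for i in sites.index
]

print(sites)
```

Failing sites are labelled rather than removed here; in the pipeline, they are only dropped later by the `--remove-filtered-all` option of vcftools.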


In [None]:
#### _GATK SELECT VARIANT_

Select only SNPs

In [None]:
######## GATK SELECT VARIANTS

## Cmd : module load bioinfo/gatk/4.1.4.1; 
##       gatk SelectVariants --java-options "-Xmx45G -Xms45G" -R OglaRS2.ADWL02-allCtgsIRIGIN_TOG5681.dedup8095-NR.fasta     -V ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.vcf -select-type SNP   -O ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.onlySNP.vcf


In [None]:
#### _vcftools_

* NA: keep sites with missing data in at most 12 of the 228 samples (about 5%)

In [None]:
######## VCFTOOLS NA

## Cmd : module load bioinfo/vcftools/0.1.16; vcftools --vcf ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.onlySNP.vcf --max-missing-count 12 --remove-filtered-all --recode --recode-INFO-all  --out ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.onlySNP.12na



In [None]:
#### _SNPsift_

* homoz: keep sites where more than 222 of the 228 samples are homozygous (about 97%)

In [None]:
######## SNPSIFT HOMOZYGOUS FILTERS

## Cmd : cat ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.onlySNP.12na.recode.vcf | java -jar /usr/local/snpEff-4.3/SnpSift.jar filter " (countHom( )> 222) " > ALLVCFs/tmp/ALL.Chr06.F4.GenotypeGVCFS.MQ0.gatkVF.filteredIndelSNP.onlySNP.12na.recode.222homoz.vcf

In [None]:
#### _vcftools - include only bi-allelic sites_

In [None]:
#vcftools --vcf /scratch/tranchant/ALL.Chr2.selectVariant.FILTERED.recode.NA.recode.225.Homoz.vcf --min-alleles 2 --max-alleles 2 --out //scratch/tranchant/ALL.Chr2.selectVariant.FILTERED.recode.NA.recode.225.Homoz.minmaxAllele2 --recode --recode-INFO-all
vcftools --vcf /scratch/tranchant/ALLVCFs/tmp/ALL.Chr02.F4.GenotypeGVCFS.MQ0.gatkSV.onlySNP.filteredPASS.na.225.Homoz.vcf --min-alleles 2 --max-alleles 2 --remove-filtered-all --recode --recode-INFO-all --out /scratch/tranchant/ALLVCFs/tmp/ALL.Chr02.F4.GenotypeGVCFS.MQ0.gatkSV.onlySNP.filteredPASS.na.225.Homoz.biallelic

In [None]:
af_file="AF2.frq"
df_csv_stat = pd.read_csv(af_file, index_col=False, header=0, names=["CHROM","POS","N_ALLELES","N_CHR","ALL1","ALL2"], sep="\t")
df_csv_stat

#### Basic stats

In [None]:
# Summary statistics for the two allele frequency columns (ALL1 and ALL2)
print(df_csv_stat[["ALL1", "ALL2"]].describe())

#### Calculate MAF


In [None]:
df_csv_stat["MAF"]=df_csv_stat[['ALL1', 'ALL2']].min(axis=1)


In [None]:
print(df_csv_stat[df_csv_stat.ALL1>0.70])

In [None]:
# Plot with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(x="MAF", data=df_csv_stat)

In [None]:
cd $vcf_dir

In [None]:
%%bash
wget --no-check-certificate https://itrop.ird.fr/sv-training/out.vcf.gz

In [None]:
ls

In [None]:
%%bash
zgrep -vc "^#" out.vcf.gz | head
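The same count can be obtained in Python with the standard `gzip` module, which is convenient when the number is needed later in the notebook. This is a minimal sketch: `count_vcf_records` is a name chosen for the example, and a tiny gzipped VCF is written to a temporary directory so that the code runs without `out.vcf.gz`.

```python
import gzip
import os
import tempfile

def count_vcf_records(path):
    """Count the lines of a gzipped VCF that do not start with '#'."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if not line.startswith("#"))

# Tiny invented VCF: two header lines and two variant records
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.vcf.gz")
    with gzip.open(example_path, "wt") as handle:
        handle.write("\n".join(example_lines) + "\n")
    n_records = count_vcf_records(example_path)

print(n_records)
```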

In [None]:
%%bash
REF="/home/jovyan/work/DATA/Clone10/referenceCorrect.fasta"
cd /home/jovyan/work/MAPPING-ILL/VCF/
#tail -n 50 /home/jovyan/work/MAPPING-ILL/VCF/Clone1.g.vcf
gatk CombineGVCFs -R $REF --variant Clone1.g.vcf --variant Clone2.g.vcf  -O final.vcf

In [None]:
%%bash

REF="/home/jovyan/work/DATA/Clone10/referenceCorrect.fasta"
cd /home/jovyan/work/MAPPING-ILL/VCF/
ls -lrt
head -n 1000 final.vcf | tail

gatk --java-options "-Xmx4g" GenotypeGVCFs -R $REF -V final.vcf -O final.genotype.vcf

In [None]:
%%bash

REF="/home/jovyan/work/DATA/Clone10/referenceCorrect.fasta"
cd /home/jovyan/work/MAPPING-ILL/VCF/
ls -lrt
head -n 1000 final.genotype.vcf | tail


In [None]:
%%bash

REF="/home/jovyan/work/DATA/Clone10/referenceCorrect.fasta"
VCF="/home/jovyan/work/MAPPING-ILL/VCF/final.genotype.vcf"
cd /home/jovyan/work/MAPPING-ILL/VCF/
# Count the variant records (lines that are not header lines)
grep -vc "^#" "$VCF"

bcftools stats "$VCF" | head -n 50

In [None]:
# --freq2 outputs the frequencies without information about the alleles, --freq would return their identity. We need to add --max-alleles 2 to exclude sites that have more than two alleles.

vcftools --gzvcf $SUBSET_VCF --freq2 --out $OUT --max-alleles 2

In [None]:
!zgrep -vc "^#" out.vcf.gz