# <span style="color:green">Formation South Green 2022</span> - Structural Variants Detection by using short and long reads 

# __DAY 3 : Structural variant calling__

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)


***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>



[I - Structural variation with `Breakdancer` (SR)](#break)

   * [ Reference indexation](#refindex)


[II - Structural variation with Sniffles (LR)](#sniffles) 
  

   * [1. Prepare LR data](#step1)
   * [2. SV detection for all CLONES](#step2)
     * [Count the number of variations](#countv)
     * [Convert sam into bam `samtools view`](#samtoolsview)
     * [Merge all the vcf files across all samples](#merge)
   * [3. Force call all the SVs across all the samples](#step3)

</span>

***


# <span style="color:#006E7F">__I - Structural variation with `Breakdancer` (SR)__ <a class="anchor" id="break"></span>  



To run BreakDancer, first use `bam2cfg.pl` to prepare the required config file.

### <span style="color: #4CACBC;">First of all, create a list of bam files </span>  

In [None]:
mkdir -p ~/work/BREAKDANCER/;
cd ~/work/BREAKDANCER/;
realpath /home/jovyan/work/MAPPING-ILL/dirClone*/*SORTED.bam > bam_files.txt

ls

In [None]:
cat bam_files.txt
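Before building the config, it can be worth checking that every path in `bam_files.txt` points to an existing, non-empty file, since each of them is read in the next step. A minimal sketch: `check_file_list` is a hypothetical helper name, and the demo runs on throwaway files in a temporary directory rather than on the training BAMs.

```shell
# Hypothetical helper: report list entries that are missing or empty.
# Prints one "MISSING <path>" line per bad entry; returns non-zero if any.
check_file_list() {
    local list="$1" status=0 path
    while IFS= read -r path; do
        [ -z "$path" ] && continue      # skip blank lines
        if [ ! -s "$path" ]; then       # -s: file exists and size > 0
            echo "MISSING $path"
            status=1
        fi
    done < "$list"
    return "$status"
}

# Demo on throwaway files (not the real BAMs): a.bam exists, b.bam does not
demo_dir=$(mktemp -d)
printf 'x' > "$demo_dir/a.bam"
printf '%s\n' "$demo_dir/a.bam" "$demo_dir/b.bam" > "$demo_dir/bam_files.txt"
report=$(check_file_list "$demo_dir/bam_files.txt" || true)
echo "$report"
```

In the notebook, the corresponding call would be `check_file_list bam_files.txt`.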

In [None]:
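# bam2cfg.pl takes the BAM files as arguments and writes the config to standard output
bam2cfg.pl $(cat bam_files.txt) > config_file.cfg
#breakdancer_options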

### <span style="color: #4CACBC;">Then run BreakDancer on the config </span>   

In [None]:
breakdancer-max config_file.cfg
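`breakdancer-max` prints a tab-separated table to standard output. In its documented layout, header lines start with `#` and the SV type (for example DEL, INS, INV, ITX, CTX) is in column 7. Assuming that layout, and assuming the output was redirected to a file, the calls can be tallied per type as sketched here. `count_bd_types` is a hypothetical helper name and the records are invented for illustration, not real results.

```shell
# Hypothetical helper: count BreakDancer calls per SV type (column 7).
count_bd_types() {
    # Skip header lines, count column-7 values, print "TYPE<TAB>COUNT" sorted by type
    awk -F'\t' '!/^#/ && NF >= 7 { n[$7]++ }
                END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Toy output file with invented records
bd_dir=$(mktemp -d)
{
    printf '#Chr1\tPos1\tOrientation1\tChr2\tPos2\tOrientation2\tType\tSize\tScore\tnum_Reads\n'
    printf 'chr1\t1000\t10+0-\tchr1\t1500\t0+10-\tDEL\t480\t99\t10\n'
    printf 'chr1\t9000\t8+0-\tchr1\t9900\t0+8-\tDEL\t870\t90\t8\n'
    printf 'chr2\t300\t5+5-\tchr2\t4000\t5+5-\tINV\t3700\t60\t5\n'
} > "$bd_dir/toy.ctx"

bd_counts=$(count_bd_types "$bd_dir/toy.ctx")
echo "$bd_counts"
```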

# <span style="color:#006E7F">__II - Structural variation with `Sniffles` (LR)__ <a class="anchor" id="data"></span>  

Sniffles is a structural variation caller for third-generation sequencing data (PacBio or Oxford Nanopore).

It detects all types of SVs (10 bp and longer) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

# <span style="color: #4CACBC;"> 1. Prepare LR data<a class="anchor" id="step1"></span>  

In [None]:
# create SNIFFLES folder
mkdir -p /home/jovyan/work/SNIFFLES/
cd /home/jovyan/work/SNIFFLES/

# declare your Clone
CLONE="Clone10"

# declare reference path
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"

In [None]:
# check the ONT reads of the clones
ONT=/home/jovyan/work/SV_DATA/LONG_READS
ls $ONT


# <span style="color: #4CACBC;"> 2. SV detection for all CLONES<a class="anchor" id="step2"></span>  

In [None]:
cd /home/jovyan/work/MAPPING-ONT/BAM

In [None]:
ls

In [None]:
#wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/sv-training/BAM_ONT.tar.gz
#tar zxvf BAM_ONT.tar.gz
#BAM_ONT="/home/jovyan/work/BAM_ONT"
#rm BAM_ONT.tar.gz
#ls $BAM_ONT

## <span style="color: #4CACBC;"> Run `sniffles` </span>  

In [None]:
cd /home/jovyan/work/MAPPING-ONT/BAM/

# `time` reports how long the whole loop takes
time for i in {2,4,6,8,10,12,14,16,18,20}
    do
        echo "============ Clone${i} ==============";
        samtools index Clone${i}_ONT_SORTED.bam
        sniffles -i Clone${i}_ONT_SORTED.bam -v /home/jovyan/work/SNIFFLES/Clone${i}_SV.vcf --allow-overwrite
    done


## <span style="color: #4CACBC;"> Count the number of variations<a class="anchor" id="countv"></span>  

In [None]:
cd  /home/jovyan/work/SNIFFLES/

for i in {2,4,6,8,10,12,14,16,18,20}
    do
        echo "Clone${i}";
        # count the records only: header lines start with "#"
        grep -v "^#" Clone${i}_SV.vcf | wc -l
    done
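The total per clone hides which kinds of events were called. The event class is stored in the `SVTYPE` key of the VCF INFO column, so a per-type breakdown takes one more pass. A minimal sketch on a toy VCF: the records are invented, and `count_svtypes` is a hypothetical helper name.

```shell
# Hypothetical helper: count VCF records per SVTYPE value.
count_svtypes() {
    grep -v '^#' "$1" \
        | cut -f8 \
        | tr ';' '\n' \
        | grep '^SVTYPE=' \
        | sort | uniq -c \
        | awk '{ sub(/^SVTYPE=/, "", $2); print $2 "\t" $1 }'
}

# Toy VCF with invented records: two deletions and one insertion
vcf_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\tsv1\tN\t<DEL>\t60\tPASS\tPRECISE;SVTYPE=DEL;SVLEN=-50\n'
    printf 'chr1\t900\tsv2\tN\t<INS>\t60\tPASS\tSVTYPE=INS;SVLEN=120\n'
    printf 'chr2\t400\tsv3\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;SVLEN=-300\n'
} > "$vcf_dir/toy_SV.vcf"

svtype_counts=$(count_svtypes "$vcf_dir/toy_SV.vcf")
echo "$svtype_counts"
```

Inside the loop above, the call would be `count_svtypes Clone${i}_SV.vcf`.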

## <span style="color: #4CACBC;"> Merge all the vcf files across all samples<a class="anchor" id="merge"></span>  
 


Check the Sniffles website https://github.com/fritzsedlazeck/Sniffles/ and its wiki for more details.

In [None]:
cd  /home/jovyan/work/SNIFFLES/

# List the paths of all the VCF files in one text file
ls *SV.vcf > vcf_raw_calls.txt

# We call SURVIVOR to merge these into one vcf file
conda activate survivor
SURVIVOR merge vcf_raw_calls.txt 1000 1 1 -1 -1 -1 merged_SURVIVOR_1kDist.vcf
conda deactivate

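This generates one VCF file for all the samples, but one piece of information is still missing: when an SV is identified in one sample but not in another, we do not know whether it is really absent from the second sample or was simply not called there.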
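The positional arguments of `SURVIVOR merge` are easy to misread. According to the tool's usage message they are, in order: the list of VCF files, the maximum distance between breakpoints, the minimum number of supporting callers, whether to take the SV type into account, whether to take the strand into account, a slot reported as disabled, the minimum SV size, and the output file name. The sketch below names each value before assembling the command. The variable names are illustrative labels, not SURVIVOR options, and the command is only printed, not executed.

```shell
vcf_list="vcf_raw_calls.txt"           # file listing one VCF path per line
max_dist=1000                          # max distance (bp) between breakpoints to merge
min_support=1                          # min number of input VCFs supporting a call
same_type=1                            # 1 = only merge SVs of the same type
same_strand=-1                         # 1 = also require the same strand; anything else = no
disabled_slot=-1                       # reported as "Disabled" in the usage message
min_size=-1                            # minimum SV size; assumed here that -1 means no size filter
out_vcf="merged_SURVIVOR_1kDist.vcf"   # merged output

survivor_cmd="SURVIVOR merge $vcf_list $max_dist $min_support $same_type $same_strand $disabled_slot $min_size $out_vcf"
echo "$survivor_cmd"
```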

# <span style="color: #4CACBC;"> 3. Force call all the SVs across all the samples<a class="anchor" id="step3"></span>  

Next, we run Sniffles again on every sample, this time genotyping each SV of the merged VCF in that sample (force calling):

In [None]:
cd /home/jovyan/work/MAPPING-ONT/BAM/

for i in {2,4,6,8,10,12,14,16,18,20}
    do
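        echo -e "\n========== Genotyping Clone${i} ======== \n";
        # --genotype-vcf is the Sniffles2 option for force calling (it was named --Ivcf in Sniffles1)
        sniffles -i Clone${i}_ONT_SORTED.bam -v /home/jovyan/work/SNIFFLES/Clone${i}_SV.gt.vcf --genotype-vcf /home/jovyan/work/SNIFFLES/merged_SURVIVOR_1kDist.vcf --allow-overwrite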
    done


The previous command may trigger a segmentation fault.
If so, you can download the precomputed results with the following command:

In [None]:
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/SNIFFLES.tar.gz
tar -xzf SNIFFLES.tar.gz

In [None]:
# List the paths of all the new (genotyped) VCF files in one text file
cd  /home/jovyan/work/SNIFFLES/
ls *SV.gt.vcf > vcf_gt_calls.txt

# Relaunch SURVIVOR to merge the VCF files again and obtain a fully genotyped multi-sample VCF
conda activate survivor
SURVIVOR merge vcf_gt_calls.txt 1000 -1 1 -1 -1 -1 merged_gt_SURVIVOR_1kDist.vcf
conda deactivate

# The -1 for the minimum number of supporting callers is necessary to keep all calls, even those that are 0/0 in all samples.

### <span style="color: #4CACBC;">  Have a look at the VCF file</span>  

In [None]:
head -n 100 merged_gt_SURVIVOR_1kDist.vcf | tail -n 5

Now you can, for example, plot the proportion of SVs found in each sample from this VCF.
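As a starting point for such a plot, the genotype columns of the merged VCF can be reduced to a small table giving, for each sample, the number of SVs that carry at least one alternate allele. A minimal sketch, assuming the genotype is the first colon-separated field of each sample column (as the VCF specification requires when `GT` is present). `count_alt_per_sample` is a hypothetical helper name and the toy records are invented.

```shell
# Hypothetical helper: print "SAMPLE<TAB>SVs_with_alt_allele<TAB>total_SVs" per sample.
count_alt_per_sample() {
    awk -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; ncol = NF; next }
        /^#/      { next }
        {
            total++
            for (i = 10; i <= NF; i++) {
                split($i, f, ":")           # f[1] is the GT field
                if (f[1] ~ /1/) alt[i]++    # 0/1, 1/1, ... carry the SV; 0/0 and ./. do not
            }
        }
        END {
            for (i = 10; i <= ncol; i++)
                printf "%s\t%d\t%d\n", name[i], alt[i], total
        }' "$1"
}

# Toy multi-sample VCF with invented records
gt_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tClone2\tClone4\n'
    printf 'chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT:DR\t0/1:5\t0/0:9\n'
    printf 'chr1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS\tGT:DR\t1/1:0\t0/1:4\n'
    printf 'chr2\t400\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT:DR\t0/0:8\t./.:0\n'
} > "$gt_dir/toy_merged.vcf"

alt_table=$(count_alt_per_sample "$gt_dir/toy_merged.vcf")
echo "$alt_table"
```

Dividing the second column by the third gives the proportion per sample, and the table can be loaded as is into R or Python for plotting.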