# South Green Training 2021

## Introduction to MinION data analysis

### PART 4

Created by J.Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-CIRAD)

Adapted from C. Tranchant and F. Sabot (Training Transmitting Science 2021)

September 2021

# 1. Structural variation with Sniffles

Sniffles is a structural variation (SV) caller for third-generation sequencing data (PacBio or Oxford Nanopore).

It detects all types of SVs (10bp+) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

## Prepare data

In [None]:
# download the fastq.gz files of all clones
cd ~/SG-ONT-2021/DATA
# download the compressed archive containing every clone
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/all_clones_short.tar.gz

In [None]:
#decompress it
cd ~/SG-ONT-2021/DATA
tar zxvf all_clones_short.tar.gz

In [None]:
# create SNIFFLES folder
mkdir -p ~/SG-ONT-2021/RESULTS/SNIFFLES/
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/

# declare your Clone
CLONE="Clone20"

# symbolic link to the reference
ln -s /home/jovyan/SG-ONT-2021/DATA/${CLONE}/reference.fasta .
REF="reference.fasta"

In [None]:
ls ~/SG-ONT-2021/DATA/all_clones_short/

# 1. Mapping and SV detection for all CLONES

In [None]:
# bash function
run_sniffles () {
  CLONE=$1 # first parameter of the function: the clone name
  REF="reference.fasta"
  ONT="/home/jovyan/SG-ONT-2021/DATA/all_clones_short/${CLONE}.fastq.gz"
  ## Map the ONT reads of the clone against the reference with minimap2 (-a writes SAM)
  ## The read group is double-quoted so that ${CLONE} is expanded
  time minimap2 -t 4 -ax map-ont --MD -R "@RG\tID:${CLONE}\tSM:${CLONE}" ${REF} ${ONT} > ${CLONE}.sam
  ## Sort the alignments and write them as BAM
  time samtools sort -@4 -o ${CLONE}_SORTED.bam ${CLONE}.sam
  ## Call SVs for this sample
  time sniffles -t 4 -s 2 -q 10 -l 10 -r 500 -m ${CLONE}_SORTED.bam -v ${CLONE}_SV.vcf
}

# -s/--min_support	Minimum number of reads that support an SV for it to be reported. Default: 10
# -l/--min_length	Minimum length of an SV to be reported. Default: 30bp
# -q/--minmapping_qual	Minimum mapping quality of an alignment to be taken into account. Default: 20
# -r/--min_seq_size	Discard a read if none of its segments is larger than this. Default: 2kb
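The read-group string given to `minimap2 -R` only carries the clone name if the shell expands `${CLONE}`, and that depends on the quoting. The sketch below compares the two quoting styles without running minimap2; the value `Clone2` and the names `RG_SINGLE` and `RG_DOUBLE` are chosen only for this illustration.

```shell
# Hypothetical clone name, used only for this illustration
CLONE="Clone2"

# Single quotes: the shell leaves ${CLONE} untouched
RG_SINGLE='@RG\tID:${CLONE}\tSM:${CLONE}'
# Double quotes: ${CLONE} is expanded, while \t stays a literal
# backslash-t that minimap2 turns into a tab itself
RG_DOUBLE="@RG\tID:${CLONE}\tSM:${CLONE}"

printf '%s\n' "$RG_SINGLE"
printf '%s\n' "$RG_DOUBLE"
```

Only the double-quoted form produces a read group whose ID and SM fields contain the clone name.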

### Obtain calls for each sample

In [None]:
for i in {2,6,10,15,18}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
        # -e makes echo interpret the \n escapes
        echo -e "\n\n============ Clone${i}==============\n";
        run_sniffles Clone$i
    done

### Count the number of variations

In [None]:
for i in {2,6,10,15,18}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES
        echo "Clone${i}";
        # count the records, i.e. the lines that do not start with "#"
        grep -vc "^#" Clone${i}_SV.vcf
    done
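Beyond the total, it can be useful to break the count down by SV type. The sketch below does this with `awk` on a toy VCF written to a temporary file, so it runs without any Sniffles output; the records are made up for illustration, and it assumes each record carries an `SVTYPE=` entry in its INFO column.

```shell
# Toy VCF with hypothetical records, written to a temporary file
TOY_VCF=$(mktemp)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t0\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-50\n'
  printf 'chr1\t900\t1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=120\n'
  printf 'chr2\t400\t2\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-300\n'
} > "$TOY_VCF"

# Skip header lines, read SVTYPE from the INFO column (8th), and tally per type
SV_COUNTS=$(awk -F'\t' '!/^#/ {
    n = split($8, info, ";")
    for (k = 1; k <= n; k++)
        if (info[k] ~ /^SVTYPE=/) { sub(/^SVTYPE=/, "", info[k]); count[info[k]]++ }
} END { for (t in count) print t, count[t] }' "$TOY_VCF" | sort)

echo "$SV_COUNTS"
rm -f "$TOY_VCF"
```

To apply it to real calls, replace `"$TOY_VCF"` with a file such as `Clone2_SV.vcf`.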

# 2. Merge all the vcf files across all samples 


Check the Sniffles website https://github.com/fritzsedlazeck/Sniffles/ and its wiki for more details.

In [None]:
# Move to the results folder first so that the glob finds the VCF files
cd ~/SG-ONT-2021/RESULTS/SNIFFLES
# List the paths of all per-sample VCF files in one text file
ls *SV.vcf > vcf_raw_calls.txt

# We call SURVIVOR to merge these into one VCF file
conda activate survivor
time SURVIVOR merge vcf_raw_calls.txt 1000 1 1 -1 -1 -1 merged_SURVIVOR_1kDist.vcf
conda deactivate

This generates one VCF file for all the samples, but we still do not know whether an SV identified in one sample but not in another is really absent from that other sample.
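The positional arguments of `SURVIVOR merge` are easier to read once they are named. The sketch below is a dry run that only builds and prints the merge command used in this section; the variable names are made up for illustration, and the comments paraphrase the SURVIVOR usage message as we understand it, so check `SURVIVOR merge` without arguments for the authoritative wording.

```shell
# Named copies of the positional arguments, in the order SURVIVOR expects them
VCF_LIST="vcf_raw_calls.txt"   # text file listing one VCF path per line
MAX_DIST=1000                  # maximum distance (bp) between breakpoints to merge
MIN_CALLERS=1                  # minimum number of input VCFs supporting an SV
USE_TYPE=1                     # 1 = only merge SVs of the same type
USE_STRAND=-1                  # 1 = also take the strand into account, else ignore it
ESTIMATE_DIST=-1               # reported as disabled by the tool
MIN_SIZE=-1                    # minimum SV size to keep, value taken from this tutorial
OUT_VCF="merged_SURVIVOR_1kDist.vcf"

# Dry run: assemble the command as text and print it instead of executing it
MERGE_CMD="SURVIVOR merge $VCF_LIST $MAX_DIST $MIN_CALLERS $USE_TYPE $USE_STRAND $ESTIMATE_DIST $MIN_SIZE $OUT_VCF"
echo "$MERGE_CMD"
```

Changing `MAX_DIST` or `MIN_CALLERS` here and rerunning is a quick way to check what would be passed before launching the real merge.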

# 3. Force call all the SVs across all the samples

Next, we run Sniffles again on every sample, this time passing the merged VCF with `--Ivcf` so that every SV is genotyped in every sample:

In [None]:
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
for i in {2,6,10,15,18}
    do
        echo -e "\n========== Force calling Clone$i ======== \n";
        sniffles -t 8 -s 2 -q 10 -l 10 -r 500 -m Clone${i}_SORTED.bam -v Clone${i}_SV.gt.vcf --Ivcf merged_SURVIVOR_1kDist.vcf
    done


In [None]:
# List the paths of all the new VCF files in one text file
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
ls *SV.gt.vcf > vcf_gt_calls.txt

# Relaunch SURVIVOR to merge the VCF files again and obtain a fully genotyped multisample VCF
conda activate survivor
SURVIVOR merge vcf_gt_calls.txt 1000 -1 1 -1 -1 -1 merged_gt_SURVIVOR_1kDist.vcf
conda deactivate

# The -1 for the minimum number of supporting callers (second value after the file list)
# is necessary to obtain all calls, even if they might be 0/0 in all samples.

# Have a look at the VCF file

In [None]:
head -n 100 merged_gt_SURVIVOR_1kDist.vcf | tail -n 5

Now you can, for example, plot the proportion of SVs carried by each sample from this VCF.
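As a starting point for such a plot, the per-sample proportions can be tabulated with `awk`. The sketch below runs on a toy two-sample VCF with made-up records; it assumes that sample columns start at column 10, that GT is the first sub-field of each sample column, and that any genotype containing a `1` counts as carrying the SV.

```shell
# Toy multisample VCF with hypothetical genotypes
TOY_VCF=$(mktemp)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tClone2\tClone6\n'
  printf 'chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t0/1\t0/0\n'
  printf 'chr1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS\tGT\t1/1\t0/1\n'
  printf 'chr2\t400\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t0/0\t./.\n'
  printf 'chr2\t800\tsv4\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP\tGT\t0/1\t0/0\n'
} > "$TOY_VCF"

# For each sample: number of SVs carried and proportion of all SVs
SV_PROPORTIONS=$(awk -F'\t' '
/^#CHROM/ { for (c = 10; c <= NF; c++) name[c] = $c; ncol = NF; next }
/^#/      { next }
{
    total++
    for (c = 10; c <= ncol; c++) {
        split($c, fields, ":")      # GT is the first sub-field
        if (fields[1] ~ /1/) carried[c]++
    }
}
END {
    for (c = 10; c <= ncol; c++)
        printf "%s\t%d\t%.2f\n", name[c], carried[c], carried[c] / total
}' "$TOY_VCF")

echo "$SV_PROPORTIONS"
rm -f "$TOY_VCF"
```

The output has one tab-separated line per sample (name, SV count, proportion) and can be loaded into any plotting tool; replace `"$TOY_VCF"` with `merged_gt_SURVIVOR_1kDist.vcf` to use the real calls.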