# Formation South Green 2021  

##  Initiation à l’analyse de données Minion

### PART 4

Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE)

Adapted from C. Tranchant et F. Sabot (Training transmiting science 2021)

Septembre 2021

# 1. Structural variation with Sniffles

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore).

It detects all types of SVs (10bp+) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

## Prepare data

In [1]:
# download  all clones fastq.gz
cd ~/SG-ONT-2021/DATA
# download your compressed CloneX 
wget --no-check-certificat -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/all_clones_short.tar.gz

--2021-09-27 16:10:54--  https://itrop.ird.fr/ont-training/all_clones_short.tar.gz
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 842363643 (803M) [application/x-gzip]
Saving to: ‘all_clones_short.tar.gz’


2021-09-27 16:11:11 (46.9 MB/s) - ‘all_clones_short.tar.gz’ saved [842363643/842363643]

FINISHED --2021-09-27 16:11:11--
Total wall clock time: 17s
Downloaded: 1 files, 803M in 17s (46.9 MB/s)


In [2]:
#decompress it
cd ~/SG-ONT-2021/DATA
tar zxvf all_clones_short.tar.gz

all_clones_short/
all_clones_short/Clone2.fastq.gz
all_clones_short/Clone6.fastq.gz
all_clones_short/Clone10.fastq.gz
all_clones_short/Clone15.fastq.gz
all_clones_short/Clone18.fastq.gz


In [21]:
# create SNIFFLES folder
mkdir -p ~/SG-ONT-2021/RESULTS/SNIFFLES/
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/

# declare your Clone
CLONE="Clone10"

# symbolic links of reference 
ln -s /home/jovyan/SG-ONT-2021/DATA/${CLONE}/reference.fasta .
REF="reference.fasta"

In [22]:
ls ~/SG-ONT-2021/DATA/all_clones_short/

Clone10.fastq.gz  Clone18.fastq.gz  Clone6.fastq.gz
Clone15.fastq.gz  Clone2.fastq.gz


# 1. Mapping and SV detection for all CLONES

In [23]:
# fonction bash 
run_sniffles () {
  CLONE=$1 # this is the first parametter of this fonction
  REF="reference.fasta"
  ONT="/home/jovyan/SG-ONT-2021/DATA/all_clones_short/${CLONE}.fastq.gz"
  ## Mapping using minimap2 : Mapping ONT reads (clone) vs a reference using minimap2 
  time minimap2 -t 4 -ax map-ont --MD  -R '@RG\tID:${CLONE}\tSM:${CLONE}' ${REF} ${ONT} > ${CLONE}.bam
  ## Sort BAM
  time samtools sort -@4 -o ${CLONE}_SORTED.bam ${CLONE}.bam
  # Obtain calls for a samples
  time sniffles -t 4 -s 2 -q 10 -l 10 -r 500 -m ${CLONE}_SORTED.bam -v ${CLONE}_SV.vcf
}

# -s/--min_support	Minimum number of reads that support a SV to be reported. Default: 10
# -l/--min_length	Minimum length of SV to be reported. Default: 30bp
# -q/--minmapping_qual	Minimum mapping quality of alignment to be taken into account. Default: 20
# -r/--min_seq_size	Discard read if non of its segment is larger then this. Default: 2kb

### Obtain calls for each samples

In [24]:
for i in {2,6,10,15,18}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
        echo "\n\n============ Clone${i}==============\n";
        run_sniffles Clone$i
    done
time

[M::mm_idx_gen::0.106*0.79] collected minimizers
[M::mm_idx_gen::0.138*1.47] sorted minimizers
[M::main::0.138*1.46] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.147*1.44] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.153*1.42] distinct minimizers: 165344 (91.75% are singletons); average occurrences: 1.156; average spacing: 5.336
[M::worker_pipeline::17.883*3.11] mapped 10241 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 4 -ax map-ont --MD -R @RG\tID:${CLONE}\tSM:${CLONE} reference.fasta /home/jovyan/SG-ONT-2021/DATA/all_clones_short/Clone2.fastq.gz
[M::main] Real time: 17.895 sec; CPU: 55.705 sec; Peak RSS: 0.418 GB

real	0m17.924s
user	0m54.977s
sys	0m0.752s
[bam_sort_core] merging from 0 files and 4 in-memory blocks...

real	0m2.975s
user	0m9.459s
sys	0m0.839s
Estimating parameter...
	Max dist between aln events: 5
	Max diff in window: 50
	Min score ratio: 2
	Avg DEL ratio: 0.0299483
	Avg I

### Count the number of variations

In [25]:
for i in {2,6,10,15,18}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES
        echo "Clone${i}";
        grep -v "#" Clone${i}_SV.vcf | wc -l
    done

Clone2
39
Clone6
43
Clone10
65
Clone15
301
Clone18
225


# 2. Merge all the vcf files across all samples 


Check the sniffles website https://github.com/fritzsedlazeck/Sniffles/ an its wiki for more details

In [26]:
# Put all file paths over all vcf files together
ls *SV.vcf > vcf_raw_calls.txt

# We call SURVIVOR to merge these into one vcf file
conda activate survivor
cd ~/SG-ONT-2021/RESULTS/SNIFFLES
time SURVIVOR merge vcf_raw_calls.txt 1000 1 1 -1 -1 -1 merged_SURVIVOR_1kDist.vcf
conda deactivate

(survivor) (survivor) merging entries: 65
merging entries: 301
merging entries: 225
merging entries: 39
merging entries: 43

real	0m0.117s
user	0m0.116s
sys	0m0.001s
(survivor) (base) 

: 1

This will generate one VCF file for all the samples, but we lack the information 

If a SV identified in one sample but not in the other is really absent.

# 3. Force call all the SVs across all the samples

Next we run Sniffles again across all the samples similar to this :

In [27]:
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
for i in {2,6,10,15,18}
    do
        echo -e "\n========== REMapping Clone$i======== \n";
        sniffles -t 8 -s 2 -q 10 -l 10 -r 500 -m Clone${i}_SORTED.bam -v Clone${i}_SV.gt.vcf --Ivcf merged_SURVIVOR_1kDist.vcf
    done


(base) 

Automatically enabling genotype mode
Force calling SVs
Estimating parameter...
	Max dist between aln events: 5
	Max diff in window: 50
	Min score ratio: 2
	Avg DEL ratio: 0.0299483
	Avg INS ratio: 0.0254998
Construct Tree...
		594 SVs found in input.
	Invalid types found skipping 0 entries.
Start parsing: Chr Reference
Segmentation fault


Automatically enabling genotype mode
Force calling SVs
Estimating parameter...
	Max dist between aln events: 4
	Max diff in window: 50
	Min score ratio: 2
	Avg DEL ratio: 0.0301874
	Avg INS ratio: 0.0481301
Construct Tree...
		594 SVs found in input.
	Invalid types found skipping 0 entries.
Start parsing: Chr Reference
Segmentation fault


Automatically enabling genotype mode
Force calling SVs
Estimating parameter...
	Max dist between aln events: 4
	Max diff in window: 50
	Min score ratio: 2
	Avg DEL ratio: 0.0531733
	Avg INS ratio: 0.0472265
Construct Tree...
		594 SVs found in input.
	Invalid types found skipping 0 entries.
Start parsing: 

: 1

The previous command might have triggered a segmentation fault
You can download the results with the following command:

In [28]:
wget --no-check-certificat -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/SNIFFLES.tar.gz
tar -xzf SNIFFLES.tar.gz

--2021-09-28 09:43:07--  https://itrop.ird.fr/ont-training/SNIFFLES.tar.gz
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2014301697 (1.9G) [application/x-gzip]
Saving to: ‘SNIFFLES.tar.gz’


2021-09-28 09:43:43 (52.7 MB/s) - ‘SNIFFLES.tar.gz’ saved [2014301697/2014301697]

FINISHED --2021-09-28 09:43:43--
Total wall clock time: 37s
Downloaded: 1 files, 1.9G in 36s (52.7 MB/s)
(base) (base) 

: 1

In [None]:
#Put all file paths over all new vcf files together
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/SNIFFLES
ls *SV.gt.vcf > vcf_gt_calls.txt

# relauch survivor to merge vcf again and finally obtain now a fully genotyped multisample vcf
conda activate survivor
SURVIVOR merge vcf_gt_calls.txt 1000 1 1 -1 -1 -1 merged_gt_SURVIVOR_1kDist.vcf
conda deactivate

# The -1 for the minimum SV caller is necessary to obtain all calls even if they might be 0/0 in all samples.

# Have a look on the VCF file

In [None]:
head -n 100 merged_gt_SURVIVOR_1kDist.vcf | tail -n 5

Now, you can plotting proportion of SV in all sample for VCF for example