# Formation South Green 2021  

##  Initiation à l’analyse de données Minion

### PART 4

Created by J.Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-CIRAD)

Adapted from C. Tranchant et F. Sabot (Training trasmiting science 2021)

Septembre 2021

# 1. Structural variation with Sniffles

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore).

It detects all types of SVs (10bp+) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

## Prepare data

In [2]:
# download  all clones fastq.gz
cd ~/SG-ONT-2021/DATA
# download your compressed CloneX 
wget --no-check-certificat -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/all_clones_short.tar.gz

bash: cd: ~SG-ONT-2021/DATA: No such file or directory
--2021-09-23 11:24:37--  https://itrop.ird.fr/ont-training/all_clones_short.tar.gz
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 842363643 (803M) [application/x-gzip]
Saving to: ‘all_clones_short.tar.gz’


2021-09-23 11:24:53 (53.0 MB/s) - ‘all_clones_short.tar.gz’ saved [842363643/842363643]

FINISHED --2021-09-23 11:24:53--
Total wall clock time: 15s
Downloaded: 1 files, 803M in 15s (53.0 MB/s)


In [4]:
#decompress it
cd ~/SG-ONT-2021/DATA
tar zxvf all_clones_short.tar.gz

all_clones_short/
all_clones_short/Clone2.fastq.gz
all_clones_short/Clone6.fastq.gz
all_clones_short/Clone10.fastq.gz
all_clones_short/Clone15.fastq.gz
all_clones_short/Clone18.fastq.gz


In [7]:
# create SNIFFLES folder
mkdir -p ~/SG-ONT-2021/RESULTS/SNIFFLES/
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/

# declare your Clone
CLONE="Clone20"

# symbolic links of reference 
ln -s /home/jovyan/SG-ONT-2021/DATA/${CLONE}/reference.fasta .
REF="reference.fasta"

# 1. Mapping and SV detection for all CLONES

In [10]:
run_sniffles () {
  CLONE=$1 # this is the first parametter of this fonction
  REF="reference.fasta"
  ONT="/home/jovyan/SG-ONT-2021/DATA/all_clones_short/${CLONE}.fastq.gz"
  ## Mapping using minimap2 : Mapping ONT reads (clone) vs a reference using minimap2 
  time minimap2 -t 4 -ax map-ont --MD  -R '@RG\tID:${CLONE}\tSM:${CLONE}' ${REF} ${ONT} > ${CLONE}.bam
  ## Sort BAM
  time samtools sort -@4 -o ${CLONE}_SORTED.bam ${CLONE}.bam
  # Obtain calls for a samples
  time sniffles -t 4 -s 2 -q 10 -l 10 -r 500 -m ${CLONE}_SORTED.bam -v ${CLONE}_SV.vcf
}

# -s/--min_support	Minimum number of reads that support a SV to be reported. Default: 10
# -l/--min_length	Minimum length of SV to be reported. Default: 30bp
# -q/--minmapping_qual	Minimum mapping quality of alignment to be taken into account. Default: 20
# -r/--min_seq_size	Discard read if non of its segment is larger then this. Default: 2kb

### Obtain calls for a samples

In [11]:
for i in {1..20}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
        echo " ============ Clone${i}============== ";
        run_sniffles Clone$i
    done
time

[M::mm_idx_gen::0.080*1.05] collected minimizers
[M::mm_idx_gen::0.119*1.73] sorted minimizers
[M::main::0.119*1.73] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.128*1.68] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.133*1.65] distinct minimizers: 165344 (91.75% are singletons); average occurrences: 1.156; average spacing: 5.336
ERROR: failed to open file '/home/jovyan/SG-ONT-2021/DATA/all_clones_short/Clone1.fastq.gz'
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 4 -ax map-ont --MD -R @RG\tID:${CLONE}\tSM:${CLONE} reference.fasta /home/jovyan/SG-ONT-2021/DATA/all_clones_short/Clone1.fastq.gz
[M::main] Real time: 0.141 sec; CPU: 0.227 sec; Peak RSS: 0.014 GB

real	0m0.145s
user	0m0.150s
sys	0m0.079s

real	0m0.469s
user	0m0.021s
sys	0m0.021s
Estimating parameter...
Too few reads detected in Clone1_SORTED.bam

real	0m0.090s
user	0m0.000s
sys	0m0.010s
[M::mm_idx_gen::0.081*1.05] collected minimizers
[M::m

### Count the number of variations

In [46]:
for i in {2,4,6,8,10,12,14,16,18,20}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES
        echo "Clone${i}";
        grep -v "#" Clone${i}_SV.vcf | wc -l
    done

Clone2
39
Clone4
25
Clone6
44
Clone8
123
Clone10
66
Clone12
121
Clone14
189
Clone16
533
Clone18
221
Clone20
215
(base) 

: 1

# 2. Merge all the vcf files across all samples 


Check the sniffles website https://github.com/fritzsedlazeck/Sniffles/ an its wiki for more details

In [58]:
# Put all file paths over all vcf files together
ls *SV.vcf > vcf_raw_calls.txt

# We call SURVIVOR to merge these into one vcf file
conda activate survivor
cd ~/SG-ONT-2021/RESULTS/SNIFFLES
time SURVIVOR merge vcf_raw_calls.txt 1000 1 1 -1 -1 -1 merged_SURVIVOR_1kDist.vcf
conda deactivate

(base) (base) (base) (base) (survivor) (survivor) merging entries: 66
merging entries: 82
merging entries: 121
merging entries: 147
merging entries: 189
merging entries: 302
merging entries: 533
merging entries: 202
merging entries: 221
merging entries: 223
merging entries: 34
merging entries: 215
merging entries: 39
merging entries: 34
merging entries: 25
merging entries: 48
merging entries: 44
merging entries: 90
merging entries: 123
merging entries: 73

real	0m0.457s
user	0m0.376s
sys	0m0.049s
(survivor) (base) 

: 1

This will generate one VCF file for all the samples, but we lack the information 

If a SV identified in one sample but not in the other is really absent.

# 3. Force call all the SVs across all the samples

Next we ran Sniffles again across all the samples similar to this :

In [69]:
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
for i in {1..20}
    do
        echo -e "\n========== REMapping Clone$i======== \n";
        sniffles -t 8 -s 2 -q 10 -l 10 -r 500 -m Clone${i}_SORTED.bam -v Clone${i}_SV.gt.vcf --Ivcf merged_SURVIVOR_1kDist.vcf
    done


(base) 

Automatically enabling genotype mode
Force calling SVs
Estimating parameter...
	Max dist between aln events: 6
	Max diff in window: 50
	Min score ratio: 2
	Avg DEL ratio: 0.0299276
	Avg INS ratio: 0.0253061
Construct Tree...
		2226 SVs found in input.
	Invalid types found skipping 0 entries.
Start parsing: Chr Reference
Segmentation fault


Automatically enabling genotype mode
Force calling SVs
Estimating parameter...

(base) 

In [70]:
#Put all file paths over all new vcf files together
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
ls *SV.gt.vcf > vcf_gt_calls.txt

# relauch survivor to merge vcf again and finally obtain now a fully genotyped multisample vcf
conda activate survivor
SURVIVOR merge vcf_gt_calls.txt 1000 1 1 -1 -1 -1 merged_gt_SURVIVOR_1kDist.vcf
conda deactivate

# The -1 for the minimum SV caller is necessary to obtain all calls even if they might be 0/0 in all samples.

(base) (base) (base) (base) (base) (survivor) merging entries: 0
merging entries: 0
merging entries: 0
merging entries: 0
merging entries: 0
merging entries: 0
(survivor) (base) (base) (base) 

: 1

# Have a look on the VCF file

In [71]:
head -n 100 merged_gt_SURVIVOR_1kDist.vcf | tail -n 5

##FORMAT=<ID=ID,Number=1,Type=String,Description="Variant ID from input.">
##FORMAT=<ID=RAL,Number=1,Type=String,Description="Reference allele sequence reported from input.">
##FORMAT=<ID=AAL,Number=1,Type=String,Description="Alternative allele sequence reported from input.">
##FORMAT=<ID=CO,Number=1,Type=String,Description="Coordinates">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Clone1_SV.gt.vcf	Clone2_SV.gt.vcf	Clone3_SV.gt.vcf	Clone4_SV.gt.vcf	Clone5_SV.gt.vcf	Clone6_SV.gt.vcf
(base) 

: 1

Now, you can plotting proportion of SV in all sample for VCF for example