# South Green Training 2021

## Introduction to MinION data analysis

### PART 4

Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-CIRAD)

Adapted from C. Tranchant and F. Sabot (Training Transmitting Science 2021)

September 2021

# 1. Structural variation with Sniffles

Sniffles is a structural variant (SV) caller for third-generation sequencing data (PacBio or Oxford Nanopore).

It detects all types of SVs (10 bp and longer) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

## Prepare data

In [12]:
# create SNIFFLES folder
mkdir -p ~/SG-ONT-2021/RESULTS/SNIFFLES/
cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/

# declare your Clone
CLONE="Clone20"
ONT="/home/jovyan/SG-ONT-2021/DATA/Clone20/ONT/Clone20.fastq.gz"

# symbolic link to the reference
ln -s /home/jovyan/SG-ONT-2021/DATA/${CLONE}/reference.fasta .
REF="reference.fasta"

## Mapping ONT reads (clone) vs a reference using minimap2

In [13]:
# activate env
## Mapping using minimap2 (the -a option writes SAM, so the file is named .sam)
minimap2 -t 4 -ax map-ont --MD ${REF} ${ONT} > ${REF/.fasta/_vs_ONT.sam}
## Sort and convert to BAM
samtools sort -@4 -o ${REF/.fasta/_vs_ONT_SORTED.bam} ${REF/.fasta/_vs_ONT.sam}

[M::mm_idx_gen::0.081*1.05] collected minimizers
[M::mm_idx_gen::0.126*1.98] sorted minimizers
[M::main::0.126*1.97] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.134*1.91] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.140*1.88] distinct minimizers: 165344 (91.75% are singletons); average occurrences: 1.156; average spacing: 5.336
[M::worker_pipeline::23.254*3.40] mapped 10804 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 4 -ax map-ont --MD reference.fasta /home/jovyan/SG-ONT-2021/DATA/Clone20/ONT/Clone20.fastq.gz
[M::main] Real time: 23.269 sec; CPU: 79.118 sec; Peak RSS: 0.554 GB
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
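The file names in this cell are built with bash pattern substitution, `${REF/.fasta/...}`, which replaces the first occurrence of `.fasta` in `$REF`. A minimal sketch of how such a name is derived (bash only; the `SORTED_BAM` and `PREFIX` variables are introduced here for illustration and are not part of the notebook):

```shell
# ${var/pattern/replacement} replaces the first match of pattern in $var
REF="reference.fasta"
SORTED_BAM="${REF/.fasta/_vs_ONT_SORTED.bam}"
echo "${SORTED_BAM}"   # reference_vs_ONT_SORTED.bam

# Equivalent approach: strip the suffix with ${var%suffix}, then append
PREFIX="${REF%.fasta}"
echo "${PREFIX}_vs_ONT_SORTED.bam"
```

The suffix-stripping form only touches the end of the name, which may be safer if `.fasta` could also appear earlier in a path.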


## Launch Sniffles to detect SVs

In [16]:
time sniffles -t 4 -s 2 -q 10 -l 10 -r 500 -m ${REF/.fasta/_vs_ONT_SORTED.bam} -v ${CLONE}_SV.vcf

Estimating parameter...
	Max dist between aln events: 4
	Max diff in window: 50
	Min score ratio: 2
	Avg DEL ratio: 0.0483461
	Avg INS ratio: 0.0608967
Start parsing... Reference
		# Processed reads: 10000
Finalizing  ..

real	0m22.931s
user	1m0.146s
sys	0m0.521s
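The options passed to Sniffles above can be spelled out one per line. The sketch below only builds and prints the command, it does not run Sniffles. The option descriptions reflect my reading of the Sniffles 1.0.x usage message and should be checked against the version you have installed; the `BAM` and `SNIFFLES_CMD` names are introduced here for illustration.

```shell
CLONE="Clone20"
BAM="reference_vs_ONT_SORTED.bam"

# One option per line (descriptions are assumptions, see the note above)
SNIFFLES_CMD=(
  sniffles
  -t 4                    # number of threads
  -s 2                    # minimum number of reads supporting an SV
  -q 10                   # minimum mapping quality of a read
  -l 10                   # minimum SV length (bp) to report
  -r 500                  # discard reads with no aligned segment of at least this length
  -m "$BAM"               # sorted BAM produced by minimap2 + samtools sort
  -v "${CLONE}_SV.vcf"    # output VCF
)

# Print the command so it can be inspected before running it
echo "${SNIFFLES_CMD[*]}"
```

With `-s 2` and `-l 10` the call is deliberately permissive, which suits a small training dataset but would likely give many false positives on real data.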


## Have a look at the VCF

In [17]:
head -n 50 ${CLONE}_SV.vcf | tail -n 5

Reference	32483	11	N	<DEL>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=Reference;END=94816;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=7.000000;Kurtosis_quant_stop=6.889869;SVTYPE=DEL;SUPTYPE=SR;SVLEN=-62333;STRANDS=+-;RE=20	GT:DR:DV	./.:.:20
Reference	34096	12	TGGAACACGATT	N	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=Reference;END=34108;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=10.014243;Kurtosis_quant_stop=11.935312;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-12;STRANDS=+-;RE=85	GT:DR:DV	./.:.:85
Reference	34097	13	N	<DUP>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=Reference;END=294169;STD_quant_start=0.000000;STD_quant_stop=2.664583;Kurtosis_quant_start=2.000000;Kurtosis_quant_stop=0.042681;SVTYPE=DUP;SUPTYPE=SR;SVLEN=260072;STRANDS=-+;RE=15	GT:DR:DV	./.:.:15
Reference	34941	14	N	<DEL>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=Reference;END=294178;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-0.304000;Kurtosis_q
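The INFO column is dense, so it can help to pull out only a few keys. The sketch below extracts `SVTYPE`, `END`, `SVLEN` and `RE` (number of supporting reads) with `awk`. It runs on a toy VCF abridged from the records shown above (the INFO field is shortened for readability); to use it on the real file, point `VCF` at `${CLONE}_SV.vcf` instead.

```shell
# Toy VCF abridged from the Sniffles output above (hypothetical shortened INFO)
VCF="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'Reference\t32483\t11\tN\t<DEL>\t.\tPASS\tPRECISE;END=94816;SVTYPE=DEL;SVLEN=-62333;RE=20\n'
  printf 'Reference\t34096\t12\tTGGAACACGATT\tN\t.\tPASS\tPRECISE;END=34108;SVTYPE=DEL;SVLEN=-12;RE=85\n'
  printf 'Reference\t34097\t13\tN\t<DUP>\t.\tPASS\tPRECISE;END=294169;SVTYPE=DUP;SVLEN=260072;RE=15\n'
} > "$VCF"

# Print POS, SVTYPE, END, SVLEN, RE for every non-header record
summary="$(awk -F'\t' '
  !/^#/ {
    type = end = len = re = "NA"
    n = split($8, kv, ";")                      # INFO is field 8, keys separated by ";"
    for (i = 1; i <= n; i++) {
      if (kv[i] ~ /^SVTYPE=/) type = substr(kv[i], 8)
      if (kv[i] ~ /^END=/)    end  = substr(kv[i], 5)
      if (kv[i] ~ /^SVLEN=/)  len  = substr(kv[i], 7)
      if (kv[i] ~ /^RE=/)     re   = substr(kv[i], 4)
    }
    print $2, type, end, len, re
  }' OFS='\t' "$VCF")"

echo "$summary"
rm -f "$VCF"
```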

## Count the number of variations

In [18]:
grep -v "#" ${CLONE}_SV.vcf | wc -l

214
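The total can also be broken down by SV type. The sketch below runs on a small toy VCF (hypothetical records, not the real results) and anchors the header filter to the start of the line (`^#`), so that a `#` elsewhere in a record would not exclude it.

```shell
# Toy VCF with three records (hypothetical data for illustration)
VCF="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'Reference\t32483\t11\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-62333\n'
  printf 'Reference\t34096\t12\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-12\n'
  printf 'Reference\t34097\t13\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;SVLEN=260072\n'
} > "$VCF"

# Total number of records (lines that do not start with "#")
total="$(grep -vc '^#' "$VCF")"

# Number of records per SV type
by_type="$(grep -v '^#' "$VCF" | grep -o 'SVTYPE=[A-Z/]*' | sort | uniq -c | awk '{print $2, $1}')"

echo "total: $total"
echo "$by_type"
rm -f "$VCF"
```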


# Mapping and SV detection for all CLONES

In [None]:
run_sniffles () {
  CLONE=$1
  REF="reference.fasta"
  ONT="/home/jovyan/SG-ONT-2021/DATA/${CLONE}/ONT/${CLONE}.fastq.gz"
  ## Run from the output directory (the calling loop does the cd)

  ## Mapping using minimap2 (SAM output); double quotes so that ${CLONE} expands in the read group
  minimap2 -t 4 -ax map-ont --MD -R "@RG\tID:${CLONE}\tSM:${CLONE}" ${REF} ${ONT} > ${CLONE}.sam
  ## Sort and convert to BAM, one file per clone so that results are not overwritten
  samtools sort -@4 -o ${CLONE}.SORTED.bam ${CLONE}.sam
  ## SV calling
  sniffles -t 4 -s 2 -q 10 -l 10 -r 500 -m ${CLONE}.SORTED.bam -v ${CLONE}_SV.vcf
}


In [None]:
for i in {1..20}
    do
        cd  ~/SG-ONT-2021/RESULTS/SNIFFLES/
        echo Clone$i;
        run_sniffles Clone$i
    done
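Once the loop has finished, a quick per-clone summary helps spot clones where the run failed or produced no calls. The sketch below counts records in every `Clone*_SV.vcf` of a directory. It is demonstrated on a temporary directory with two toy VCFs (hypothetical data); on the real results, `OUT` would be `~/SG-ONT-2021/RESULTS/SNIFFLES`.

```shell
# Toy output directory with two per-clone VCFs (hypothetical data)
OUT="$(mktemp -d)"
for c in Clone1 Clone2; do
  printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$OUT/${c}_SV.vcf"
done
printf 'Reference\t32483\t11\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\n' >> "$OUT/Clone1_SV.vcf"
printf 'Reference\t34097\t13\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP\n' >> "$OUT/Clone1_SV.vcf"
printf 'Reference\t32483\t7\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\n'  >> "$OUT/Clone2_SV.vcf"

# One line per clone: clone name, number of SV records
counts="$(
  for vcf in "$OUT"/Clone*_SV.vcf; do
    clone="$(basename "$vcf" _SV.vcf)"
    n="$(grep -vc '^#' "$vcf")"
    printf '%s\t%s\n' "$clone" "$n"
  done
)"

echo "$counts"
rm -rf "$OUT"
```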

# Plotting SV vs SNP: one sample for SV, all samples for VCF

In [None]:
# TIP use the rawVCF file
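One possible way to prepare the data for such a plot is to count variants per fixed-size window in each VCF and plot the two tables side by side. The sketch below is only an illustration under stated assumptions: the `window_counts` helper, the 10 kb window size and the toy SV and SNP files are all hypothetical, and the real SNP VCF from the earlier parts of the training would replace the toy one.

```shell
# window_counts FILE SIZE: number of non-header records per window of SIZE bp
# Output: window start (1-based) <TAB> count, sorted by position
window_counts () {
  awk -F'\t' -v size="$2" '
    !/^#/ { c[int(($2 - 1) / size)]++ }
    END   { for (w in c) printf "%d\t%d\n", w * size + 1, c[w] }
  ' "$1" | sort -n
}

# Toy inputs (hypothetical positions)
SV_VCF="$(mktemp)"; SNP_VCF="$(mktemp)"
printf '#CHROM\tPOS\nReference\t32483\nReference\t34096\nReference\t134097\n' > "$SV_VCF"
printf '#CHROM\tPOS\nReference\t1200\nReference\t1500\nReference\t35000\n'    > "$SNP_VCF"

sv_windows="$(window_counts "$SV_VCF" 10000)"
snp_windows="$(window_counts "$SNP_VCF" 10000)"

echo "SV per window:";  echo "$sv_windows"
echo "SNP per window:"; echo "$snp_windows"
rm -f "$SV_VCF" "$SNP_VCF"
```

The two tables share the same window coordinates, so they can be loaded into any plotting tool and overlaid.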

# Merging and genotyping all SVs across all samples

In [1]:
# TIP check the Sniffles website https://github.com/fritzsedlazeck/Sniffles/ and its wiki
cd ~/SG-ONT-2021/RESULTS/SNIFFLES/
ls *SV.vcf > vcf_raw_calls.txt
survivor merge vcf_raw_calls.txt 1000 1 1 -1 -1 -1 merged_SURVIVOR_1kDist.vcf
for i in {1..20}
    do
        echo Clone$i;
        mkdir -p dirClone$i;
        cd dirClone$i;
        echo -e "\nREMapping Clone$i\n";
        # the sorted BAM and the merged VCF are in the parent directory
        sniffles -t 4 -s 2 -q 10 -l 10 -r 500 -m ../Clone$i.SORTED.bam -v Clone${i}SV.gt.vcf --Ivcf ../merged_SURVIVOR_1kDist.vcf
        cd ..
    done

ls dirClone*/*SV.gt.vcf > vcf_gt_calls.txt
survivor merge vcf_gt_calls.txt 1000 1 1 -1 -1 -1 merged_gt_SURVIVOR_1kDist.vcf
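The six numbers given to the merge step are positional parameters, which makes the command hard to read. The sketch below names them and prints the resulting command without running it. The descriptions follow my reading of the SURVIVOR usage message and should be checked against your installed version; the variable names are introduced here for illustration. To my knowledge the upstream binary is usually called `SURVIVOR`; the lowercase name follows this notebook.

```shell
# Positional parameters of the merge step (descriptions are assumptions, see above)
MAX_DIST=1000     # maximum distance (bp) between breakpoints for two calls to be merged
MIN_SUPPORT=1     # minimum number of input VCFs that must contain the SV
USE_TYPE=1        # 1 = only merge calls of the same SV type
USE_STRAND=-1     # 1 = also require the same strands; any other value = no
EST_DIST=-1       # 1 = estimate the merge distance from the SV size; any other value = no
MIN_SIZE=-1       # minimum SV size to keep; -1 = no size filter

SURVIVOR_CMD=(
  survivor merge vcf_raw_calls.txt
  "$MAX_DIST" "$MIN_SUPPORT" "$USE_TYPE" "$USE_STRAND" "$EST_DIST" "$MIN_SIZE"
  merged_SURVIVOR_1kDist.vcf
)

# Print the command so it can be inspected before running it
echo "${SURVIVOR_CMD[*]}"
```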



# Have a look at the VCF file

In [None]:
head -n 100 merged_gt_SURVIVOR_1kDist.vcf | tail -n 5
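In a merged VCF it is often useful to know how many samples share each SV. The sketch below assumes that the merged file carries a `SUPP=<n>` key in the INFO column giving the number of supporting samples (as SURVIVOR output does, to my knowledge) and tabulates how many SVs are found in 1, 2, ... samples. It runs on a toy file with hypothetical records; on the real data, `MERGED` would be `merged_gt_SURVIVOR_1kDist.vcf`.

```shell
# Toy merged VCF for two samples (hypothetical records)
MERGED="$(mktemp)"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'Reference\t32483\t11\tN\t<DEL>\t.\tPASS\tSUPP=2;SUPP_VEC=11;SVTYPE=DEL\n'
  printf 'Reference\t34096\t12\tN\t<DEL>\t.\tPASS\tSUPP=1;SUPP_VEC=10;SVTYPE=DEL\n'
  printf 'Reference\t34097\t13\tN\t<DUP>\t.\tPASS\tSUPP=2;SUPP_VEC=11;SVTYPE=DUP\n'
} > "$MERGED"

# Histogram: number of supporting samples <TAB> number of SVs
supp_hist="$(awk -F'\t' '
  !/^#/ && match($8, /(^|;)SUPP=[0-9]+/) {
    s = substr($8, RSTART, RLENGTH)   # e.g. "SUPP=2" or ";SUPP=2"
    sub(/.*=/, "", s)                 # keep only the number
    h[s]++
  }
  END { for (s in h) print s "\t" h[s] }' "$MERGED" | sort -n)"

echo "$supp_hist"
rm -f "$MERGED"
```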

# Plot SV vs SNP across all samples

In [None]:
# TIP use again the rawVCF file
# You can also use the final filtered one to see overlaps