
# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

# <span style="color:#006E7F">__TP4 - VARIANTS DETECTION__ <a class="anchor" id="data"></span>  
    
# <span style="color: #4CACBC;"> Structural variation with Sniffles</span>  

Sniffles is a structural variant (SV) caller for third-generation sequencing data (PacBio or Oxford Nanopore). It detects all types of SVs (10bp+) using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

https://github.com/fritzsedlazeck/Sniffles

In the following exercises, we will:
* map the long reads of the `4222`, `B8` and `G11` samples against the reference genome `GCA_002220235.1_ASM222023v1_genomic`
* call SVs and create a .snf file for each sample
* merge the calls from the .snf files into a single .vcf

------

# <span style="color: #4CACBC;">1. Mapping and SV detection for all algae samples</span>  


#### __Mapping Long Reads against the reference genome__

* Go into the `RESULTS` directory and create the directory `SNIFFLES`
* Perform the mapping with `minimap2` https://github.com/lh3/minimap2
* Sort the bam file with `samtools sort` and index the bam file created

```
minimap2 -ax map-ont -t 8 --MD -R '@RG\tID:SAMPLE_ID\tSM:SAMPLE_ID'  REF_FILE FASTQ_FILE > SAM_FILE
```
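
The `-R` option adds a read-group line to the SAM header so that each alignment carries its sample name. The sketch below shows one way to build that string in bash; `rg_header` is a hypothetical helper written for this example, not part of minimap2. Shell variables are not expanded inside single quotes, which is why the string is assembled with `printf` here.

```shell
# Hypothetical helper: print a read-group line for one sample.
# The "\t" sequences are kept literal; minimap2 converts them to tabs.
rg_header() {
    local sample="$1"
    printf '@RG\\tID:%s\\tSM:%s' "$sample" "$sample"
}

# Show the string that would be passed to -R
rg_header "B8_RB11"; echo
# prints: @RG\tID:B8_RB11\tSM:B8_RB11
```

It could then be used as `minimap2 -ax map-ont -R "$(rg_header "$NAMESAMPLE")" ...`.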

#### __Preparing data before mapping__

List the `DATA` directory and check the long reads from the 4222, B8 and G11 samples.

In [None]:
ls -l ~/work/DATA/

Don't forget: we already downloaded the genomic reference GCA_002220235.1_ASM222023v1 in notebook 2. Check that the annotation file (gtf or gff) is in the `REF` directory.

In [None]:
# create SNIFFLES folder
mkdir -p ~/work/RESULTS/SNIFFLES/
cd  ~/work/RESULTS/SNIFFLES/

# symbolic link to the reference
ln -s /home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_genomic.fna .


In [None]:
ls ~/work/RESULTS/SNIFFLES/
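
Before launching the mapping, it can help to confirm that every expected FASTQ file is present and non-empty. The following is a minimal sketch; `check_fastq` is a hypothetical helper, and it assumes the files are named `SAMPLE.fastq.gz` in the `DATA` directory, as in the rest of this notebook.

```shell
# Hypothetical helper: report which expected FASTQ files are missing or empty.
# Usage: check_fastq DATA_DIR SAMPLE [SAMPLE ...]
check_fastq() {
    local dir="$1"; shift
    local s missing=0
    for s in "$@"; do
        if [ ! -s "${dir}/${s}.fastq.gz" ]; then
            echo "missing or empty: ${dir}/${s}.fastq.gz" >&2
            missing=$((missing + 1))
        fi
    done
    echo "${missing}"    # number of missing files, on stdout
}

# Sample names are the ones used in this notebook
check_fastq ~/work/DATA 4222_RB2 B8_RB11 G11_RB6_2022
```

A result of `0` means all three read files were found.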


### Obtain calls for each sample

Call SV candidates and create an associated .snf file for each sample:

`sniffles --input sample1.bam --snf sample1.snf`


In [None]:
for i in 4222_RB2 B8_RB11 G11_RB6_2022
    do
      cd  ~/work/RESULTS/SNIFFLES/
      echo "============ sample : $i ==============";
      NAMESAMPLE="${i}"
      REF="GCA_002220235.1_ASM222023v1_genomic.fna"
      ONT="/home/jovyan/work/DATA/${NAMESAMPLE}.fastq.gz" 
      ## Mapping ONT reads against the reference with minimap2 (-a writes SAM output)
      ## Double quotes let the shell expand the sample name inside the read group
      minimap2 -t 8 -ax map-ont --MD  -R "@RG\tID:${NAMESAMPLE}\tSM:${NAMESAMPLE}" ${REF} ${ONT} > ${NAMESAMPLE}.sam
      ## Sort the alignments and write them as BAM
      samtools sort -@8 -o ${NAMESAMPLE}_SORTED.bam ${NAMESAMPLE}.sam
      ## Index the sorted BAM
      samtools index -@8 ${NAMESAMPLE}_SORTED.bam
      ## Call SVs for this sample and write the .snf file
      sniffles -t 8 -i ${NAMESAMPLE}_SORTED.bam --snf ${NAMESAMPLE}.snf --allow-overwrite   > ${NAMESAMPLE}_SV.log
    done

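# The filtering options below are those of Sniffles 1. Sniffles 2 (needed for --snf) renames some of them
# (e.g. --minsupport, --minsvlen, --mapq); run `sniffles --help` to check names and defaults for your version.
# -s/--min_support	Minimum number of reads that support an SV to be reported. Default: 10
# -l/--min_length	Minimum length of SV to be reported. Default: 30bp
# -q/--minmapping_qual	Minimum mapping quality of alignment to be taken into account. Default: 20
# -r/--min_seq_size	Discard read if none of its segments is larger than this. Default: 2kb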

### Count the number of variations

How many SVs were found for each sample?

Check the log files!
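
Besides reading the logs, the records of a per-sample VCF can be counted directly. The sketch below assumes `sniffles` was also given `--vcf ${NAMESAMPLE}.vcf` for each sample, which the loop above does not do, so the `.vcf` file names are hypothetical. `count_vcf_records` is a helper defined only for this example.

```shell
# Hypothetical helper: count variant records (non-header lines) in a VCF.
count_vcf_records() {
    grep -vc '^#' "$1"
}

# Assumed file names: they exist only if sniffles was also run with --vcf
for s in 4222_RB2 B8_RB11 G11_RB6_2022; do
    if [ -f "${s}.vcf" ]; then
        echo "${s}: $(count_vcf_records "${s}.vcf")"
    fi
done
```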

# <span style="color: #4CACBC;"> 2. Merge the calls across all samples</span>  

Combine the per-sample .snf files into a single multi-sample .vcf:

`sniffles --input sample1.snf sample2.snf ... sampleN.snf --vcf multisample.vcf`

In [None]:
sniffles --input 4222_RB2.snf B8_RB11.snf G11_RB6_2022.snf --vcf multisample.vcf --allow-overwrite

# Have a look at the VCF file

In [None]:
head -n 100 multisample.vcf | tail -n 5

# Count the number of SVs with `bcftools stats`


In [None]:
bcftools stats multisample.vcf | head -n30
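
To complement the `bcftools stats` summary, records can be tallied by SV type. This sketch assumes each record carries an `SVTYPE=` key in its INFO column (column 8), as Sniffles VCFs do; `count_svtypes` is a hypothetical helper.

```shell
# Hypothetical helper: tally records per SVTYPE value found in INFO (column 8).
count_svtypes() {
    grep -v '^#' "$1" \
        | cut -f8 \
        | tr ';' '\n' \
        | grep '^SVTYPE=' \
        | sort \
        | uniq -c
}

if [ -f multisample.vcf ]; then
    count_svtypes multisample.vcf
fi
```

Each output line gives a count followed by the type, for example deletions as `SVTYPE=DEL`.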

# Intersecting SVs with the reference annotation - `bedtools intersect` or `intersectBed`

https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

* How many SVs detected by Sniffles fall inside genes from the annotation?
* Extract all SVs inside genes

In [None]:
mkdir -p ~/work/RESULTS/SNIFFLES/
cd  ~/work/RESULTS/SNIFFLES/

## SV inside genes

In [None]:
grep '\sgene\s' /home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_genomic.gff > /home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_onlygenes.gff
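
In GNU grep, `\s` matches any whitespace, so the pattern above can also match the word "gene" inside a free-text attribute. A stricter alternative, sketched here, tests the feature-type column (column 3) directly; `extract_genes` is a hypothetical helper.

```shell
# Hypothetical helper: keep only GFF records whose feature type (column 3)
# is exactly "gene". Comment lines have no third column and are dropped.
extract_genes() {
    awk -F'\t' '$3 == "gene"' "$1"
}

gff=/home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_genomic.gff
if [ -f "$gff" ]; then
    extract_genes "$gff" | head -n 3
fi
```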

In [None]:
bedtools intersect  -a /home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_onlygenes.gff -b multisample.vcf  -c > intersect_ref_vs_multisample.bed

In [None]:
head intersect_ref_vs_multisample.bed

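### count genes overlapped by at least one SV

In [None]:
# with -c every gene is reported, even with 0 overlaps; keep genes whose count (last column) is > 0
awk -F'\t' '$NF > 0' intersect_ref_vs_multisample.bed | wc -l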

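### count SV-gene overlaps within each "contig"

In [None]:
# sum the per-gene overlap counts (last column) for each contig (column 1)
awk -F'\t' '{n[$1] += $NF} END {for (c in n) print n[c], c}' intersect_ref_vs_multisample.bed | sort -k2,2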

## Extract all information about SV and gene intersections

In [None]:
bedtools intersect -a /home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_onlygenes.gff -b multisample.vcf -wo >  intersect_ref_vs_multisample.full.bed

In [None]:
tail -n5 intersect_ref_vs_multisample.full.bed
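
With `-wo`, an SV that overlaps several genes appears on several lines, so counting lines overestimates the number of SVs. The sketch below counts distinct SV identifiers instead. It assumes a 9-column GFF was given as `-a`, which puts the VCF ID in column 12 (9 + 3), and that SV IDs are unique; `count_unique_svs` is a hypothetical helper.

```shell
# Hypothetical helper: count distinct SV IDs in `bedtools intersect -wo` output.
# Assumes columns 1-9 are the GFF record and column 12 is the VCF ID field.
count_unique_svs() {
    awk -F'\t' '!seen[$12]++ { n++ } END { print n + 0 }' "$1"
}

if [ -f intersect_ref_vs_multisample.full.bed ]; then
    count_unique_svs intersect_ref_vs_multisample.full.bed
fi
```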