# <span style="color:green">Formation South Green 2022</span> - Structural Variants Detection by using short and long reads 

# __DAY 1: How to map reads against a reference genome?__

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

## __1. Preparing the working environment__ 

### First, create a dedicated working folder


In [115]:
# create a directory for the data
mkdir -p /home/jovyan/work/DATA
cd /home/jovyan/work/DATA
ls

## Download sequencing data (SR & LR) for Simulated clones

Before starting, please download the data created specifically for this practical session. The data are available from the I-Trop server.

Each participant will analyse one clone, and the results will be entered in this shared file.

To generate Clone data, a 1Mb contig was extracted from chromosome 1 of rice.

20 levels of variation were generated and long reads were simulated for each.

We introduced different types of variation (SNPs, indels, indels + translocations) as well as some contamination.

In [116]:
# download your compressed CloneX 
for i in {1..10}
    do
        echo -e "\nDownloading Clone$i data\n"
        CLONE="Clone$i"
        wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/${CLONE}.tar.gz
        #decompress file
        tar zxvf ${CLONE}.tar.gz
        rm ${CLONE}.tar.gz
    done


Downloading Clone1 data

--2022-06-16 08:38:45--  https://itrop.ird.fr/ont-training/Clone1.tar.gz
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 179981963 (172M) [application/x-gzip]
Saving to: ‘Clone1.tar.gz’


2022-06-16 08:38:48 (48.8 MB/s) - ‘Clone1.tar.gz’ saved [179981963/179981963]

FINISHED --2022-06-16 08:38:48--
Total wall clock time: 3.6s
Downloaded: 1 files, 172M in 3.5s (48.8 MB/s)
Clone1/
Clone1/ILL/
Clone1/ILL/Clone1_R2.fastq.gz
Clone1/ILL/Clone1_R1.fastq.gz
Clone1/reference.fasta
Clone1/ONT/
Clone1/ONT/Clone1.fastq.gz
Clone1/ONT/Clone1_DeepSimu_sequencing_summary.txt

Downloading Clone2 data

--2022-06-16 08:38:50--  https://itrop.ird.fr/ont-training/Clone2.tar.gz
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 20

In [None]:
# check data 
ls -l Clone*
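
Beyond listing the files, a quick sanity check is to count the reads in each FASTQ archive. The sketch below assumes standard four-line FASTQ records; `count_fastq_reads` is a helper name introduced here for illustration and is not part of the training material.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file.
# Assumes each record spans exactly four lines (no wrapped sequences).
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Example usage on the downloaded data (paths follow the archive layout shown above):
# for fq in Clone1/ILL/Clone1_R1.fastq.gz Clone1/ILL/Clone1_R2.fastq.gz Clone1/ONT/Clone1.fastq.gz; do
#     echo -e "$fq\t$(count_fastq_reads "$fq")"
# done
```

For paired-end data, the R1 and R2 files would normally be expected to report the same number of reads.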

-----------------------
# 2. MAPPING PRACTICE

Read congruency is an important measure in determining assembly accuracy.

Clusters of read pairs or single long reads that align incorrectly are strong indicators of mis-assembly.

Read mapping is usually the first step before SNP or variant calling.

### Make a folder for your results

In [117]:
mkdir -p ~/work/MAPPING-ILL
cd ~/work/MAPPING-ILL

### Declare important variables

In [128]:
i=10
REF_DIR="/home/jovyan/work/DATA/Clone${i}/"
REF="/home/jovyan/work/DATA/Clone${i}/reference.fasta"
ONT="/home/jovyan/work/DATA/Clone${i}/ONT/Clone${i}.fastq.gz"
ILL_R1="/home/jovyan/work/DATA/Clone${i}/ILL/Clone${i}_R1.fastq.gz"
ILL_R2="/home/jovyan/work/DATA/Clone${i}/ILL/Clone${i}_R2.fastq.gz"

In [129]:
echo "Clone${i} $REF" 
echo $ILL_R1 $ILL_R2

Clone10 /home/jovyan/work/DATA/Clone10/reference.fasta
/home/jovyan/work/DATA/Clone10/ILL/Clone10_R1.fastq.gz /home/jovyan/work/DATA/Clone10/ILL/Clone10_R2.fastq.gz
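
Echoing the variables shows the paths but does not confirm that the files exist. As a minimal sketch, the hypothetical helper `check_inputs` below (not part of the original notebook) reports any path that is missing or empty before a long mapping job is started.

```shell
# Hypothetical helper: report any missing or empty input file.
# Returns 0 only when every path given as argument exists and is non-empty.
check_inputs() {
    local rc=0 f
    for f in "$@"; do
        if [[ ! -s "$f" ]]; then
            echo "Missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example usage with the variables declared above:
# check_inputs "$REF" "$ONT" "$ILL_R1" "$ILL_R2" && echo "All inputs found for Clone${i}"
```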


## 2.1 Mapping short reads vs a reference with `bwa-mem2`

In this exercise, we map short reads against a reference. To assess how well the reads align back to the reference, we use bwa-mem2 for the mapping and samtools for the basic alignment statistics.

We will use the reference.fasta assembly together with the Illumina reads from your chosen clone.

Mapping with bwa-mem2 takes two steps:
- **Reference indexing**: `bwa-mem2 index reference`
- **Mapping itself**: `bwa-mem2 mem -R READGROUP [options] reference fastq1 fastq2 > out.sam`
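
The READGROUP value passed to `-R` is a complete `@RG` header line in which tabs are written as the two characters `\t`; the mapper converts them to real tabs when it writes the SAM header. A minimal sketch of building that string for a given sample follows; `build_read_group` is a hypothetical helper name, and using the same value for `ID` and `SM` simply mirrors what this notebook does.

```shell
# Hypothetical helper: build the read-group string passed to `-R`.
# The backslash + t pairs are kept literal on purpose; the mapper turns
# them into tabs when it writes the @RG header line.
build_read_group() {
    local sample="$1"
    printf '@RG\\tID:%s\\tSM:%s' "$sample" "$sample"
}

build_read_group "Clone10"; echo   # prints @RG\tID:Clone10\tSM:Clone10
```

The result can then be passed as `-R "$(build_read_group "Clone${i}")"`.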

## Reference indexing

Before mapping, we need to index the reference file. Check the bwa-mem2 index command line.

In [130]:
cd $REF_DIR

In [131]:
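# Rewrite the reference FASTA with seqtk (standardises its formatting) and use the new file from here on
seqtk seq $REF > referenceCorrect.fasta
REF="/home/jovyan/work/DATA/Clone${i}/referenceCorrect.fasta"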

In [132]:
echo -e "\nIndexing reference $REF\n"
bwa-mem2 index $REF


Indexing reference /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta

[bwa_index] Pack FASTA... 0.01 sec
init ticks = 95687071
ref seq len = 2040002
binary seq ticks = 45918765
build index ticks = 545680122
ref_seq_len = 2040002
count = 0, 576483, 1020001, 1463519, 2040002
BWT[1932441] = 4
CP_SHIFT = 5, CP_MASK = 31
sizeof CP_OCC = 64
max_occ_ind = 63750
ref_seq_len = 2040002
count = 0, 576483, 1020001, 1463519, 2040002
BWT[1932441] = 4
CP_SHIFT = 6, CP_MASK = 63
sizeof CP_OCC = 64
max_occ_ind = 31875


In [91]:
pwd

/home/jovyan/work/MAPPING-ILL/dirClone10


In [123]:
ls

ILL             reference.amb          reference.bwt.8bit.32   reference.pac
ONT             reference.ann          referenceCorrect.fasta
reference.0123  reference.bwt.2bit.64  reference.fasta
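
Indexing only needs to be done once per reference. As a sketch, the hypothetical helper below checks whether the index files are already present so that the indexing step can be skipped. The suffix list is an assumption taken from the listing above and may differ between bwa-mem2 versions; the files are assumed to be named after the full FASTA path.

```shell
# Hypothetical helper: return 0 if the expected index files exist next to
# the FASTA given as argument. The suffix list is an assumption based on
# the directory listing above and may vary between bwa-mem2 versions.
has_bwa_mem2_index() {
    local ref="$1" suffix
    for suffix in 0123 amb ann bwt.2bit.64 pac; do
        [[ -s "${ref}.${suffix}" ]] || return 1
    done
    return 0
}

# Example usage: index only when needed.
# has_bwa_mem2_index "$REF" || bwa-mem2 index "$REF"
```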


## Now let's map, using reads from a single clone only

In [133]:
cd ~/work/MAPPING-ILL
echo -e "\nCreation directory for Clone$i\n"
echo Clone$i
mkdir -p dirClone$i
cd dirClone$i


Creation directory for Clone10

Clone10


In [159]:
echo -e "\nMapping Clone$i\n"
bwa-mem2 mem -M -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" $REF $ILL_R1 $ILL_R2 > Clone$i.sam


Mapping Clone10

-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta
Entering FMI_search
reference seq len = 2040003
count
0,	1
1,	576484
2,	1020002
3,	1463520
4,	2040003

Reading other elements of the index from files /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta
prefix: /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta.0123
Reference genome size: 2040002 bp
Done readng reference genome !!

[0000] 1: Calling process()

Threads used (compute): 4
Info: projected #read in a task: 264910
------------------------------------------
Memory pre-allocation for chaining: 557.3706 MB
Memory pre-allocation for BSW: 958.4681 MB
Memory pre-allocation for BWT: 309.2567 MB
------------------------------------------
No. of pipeline thr

In [160]:
ls 

Clone10.bam  Clone10.mappedpaired.bam  Clone10.sam  Clone10.SORTED.bam


In [161]:
head Clone10.sam

@SQ	SN:Reference	LN:1020001
@RG	ID:Clone10	SM:Clone10
@PG	ID:bwa	PN:bwa	VN:2.0pre2	CL:bwa-mem2 mem -M -t 4 -R @RG\tID:Clone10\tSM:Clone10 /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta /home/jovyan/work/DATA/Clone10/ILL/Clone10_R1.fastq.gz /home/jovyan/work/DATA/Clone10/ILL/Clone10_R2.fastq.gz
Reference-Clone10295760	77	*	0	0	*	*	0	0	GTATAAGTACCCGGTCGAATCAAAGGTAACGTTAAATAGGTACTCCGCCAGGGCAGATTTCAACAGCCAAACTGCCCCCCAGGGGTATCTTACAGGCAATGGCTTAGAAGCGTTCCTAAGTGGACGACTCTCTGGAAACTCGCCAATGAG	CC=G=GGGGGGG=IIIIGGIIGICGIIIICIGIIIIIGGCIIGIGIG=CIGIIIICIGICGGGGGCCCGGCGCGCGGGGG8CGG8CGGGGGCCCGGGGGGIGCGGGG=GCGGCCGCGG55GGCGCG8GGGGCGG=CCGCGG5GGCGCGC=	AS:i:0	XS:i:0	RG:Z:Clone10
Reference-Clone10295760	141	*	0	0	*	*	0	0	GCACCCAAGGTGATCAACCCGGCGCTGCATGAGTATGCAACATGTTCGGCAGATGCCGTCAGTTTGGCATGCGTAATTCAATGTCGCAAGGAGGATATCCCGCTGGGATTACATTCGCGTATAGTTTATGGGCCTTCATTCGTTTTTACG	CC=GGGGGGGGGGIGIIIIIICIGICI5IGCGIGGGICIGIIIICCIIG=IICGC=G==GGCGGGIGI=CCGGCGGCIGG=CGCG5CG=CGCGGG5GCGGGCCCII=GGGGGCG==GGGGGGGCCGCGGCCCGGG
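
The second column of each alignment record is the FLAG, a sum of bit values defined in the SAM specification. The two records shown above carry FLAG 77 and 141. The sketch below decodes such a value into readable names; `decode_sam_flag` and the short labels it prints are naming choices made here for illustration.

```shell
# Hypothetical helper: list the SAM FLAG bits set in a decimal flag value.
# Bit values follow the SAM specification; the labels are informal.
decode_sam_flag() {
    local flag="$1" i
    local bits=(1 2 4 8 16 32 64 128 256 512 1024 2048)
    local names=(paired proper_pair unmapped mate_unmapped reverse
                 mate_reverse read1 read2 secondary qc_fail duplicate
                 supplementary)
    local out=()
    for i in "${!bits[@]}"; do
        (( flag & bits[i] )) && out+=("${names[i]}")
    done
    echo "${out[*]}"
}

decode_sam_flag 77    # paired unmapped mate_unmapped read1
decode_sam_flag 141   # paired unmapped mate_unmapped read2
```

Both records therefore belong to a pair in which neither read could be placed on the reference, which is consistent with the `*` in the reference name column.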

In [162]:
# Convert SAM to BAM; `&&` deletes the SAM only after a successful conversion
samtools view -@4 -bh -S -o Clone$i.bam Clone$i.sam && rm Clone$i.sam
samtools flagstat Clone$i.bam 

296107 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
221305 + 0 mapped (74.74%:-nan%)
296107 + 0 paired in sequencing
148037 + 0 read1
148070 + 0 read2
218251 + 0 properly paired (73.71%:-nan%)
219689 + 0 with itself and mate mapped
1616 + 0 singletons (0.55%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
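
The percentages in this report are each count divided by the total number of reads. As a sketch, the hypothetical helper below recomputes the mapped percentage from the raw counts, which is handy when the same figure is needed for many samples. It assumes the line layout shown above, where the count is the first field and the word `mapped` is the fourth.

```shell
# Hypothetical helper: read `samtools flagstat` text on stdin and print the
# percentage of mapped reads, recomputed from the raw counts.
flagstat_pct_mapped() {
    awk '/in total/     { total = $1 }
         $4 == "mapped" { mapped = $1 }
         END { if (total > 0) printf "%.2f\n", 100 * mapped / total }'
}

# Example with two lines copied from the report above:
printf '%s\n' \
    '296107 + 0 in total (QC-passed reads + QC-failed reads)' \
    '221305 + 0 mapped (74.74%:-nan%)' | flagstat_pct_mapped   # 74.74
```

In the notebook this could be used as `samtools flagstat Clone$i.bam | flagstat_pct_mapped`.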


In [163]:
samtools view -bh -@4 -f 0x02 -o Clone$i.mappedpaired.bam Clone$i.bam 
samtools flagstat Clone$i.mappedpaired.bam

218251 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
218251 + 0 mapped (100.00%:-nan%)
218251 + 0 paired in sequencing
109118 + 0 read1
109133 + 0 read2
218251 + 0 properly paired (100.00%:-nan%)
218251 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


In [166]:
# Sort the BAM; `&&` deletes the unsorted file only after a successful sort
samtools sort -@4 Clone$i.mappedpaired.bam Clone$i.SORTED && rm Clone$i.mappedpaired.bam
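
When a cleanup command is chained to the step that reads the file being removed, the operator between them matters. The sketch below uses placeholder commands (`cp` stands in for a conversion or sorting step) and a temporary directory, both chosen here purely for illustration.

```shell
# Sketch with placeholder commands: `cp` stands in for a conversion step.
workdir=$(mktemp -d)
echo "data" > "$workdir/input.txt"

# `&&` runs the right-hand command only after the left one has finished
# successfully, so the input is still present while it is being read.
cp "$workdir/input.txt" "$workdir/output.txt" && rm "$workdir/input.txt"

ls "$workdir"   # output.txt

# With a single `&`, the left command is sent to the background and `rm`
# starts immediately, which can delete the input before it has been read.
```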


In [171]:
ls -lrth

total 56M
-rw-r--r-- 1 jovyan users 39M Jun 16 08:50 Clone10.bam
-rw-r--r-- 1 jovyan users 18M Jun 16 09:04 Clone10.SORTED.bam


##  Indexing bam

In [174]:
samtools index Clone$i.SORTED.bam
ls -lrt

total 57296
-rw-r--r-- 1 jovyan users 40301001 Jun 16 08:50 Clone10.bam
-rw-r--r-- 1 jovyan users 18359452 Jun 16 09:04 Clone10.SORTED.bam
-rw-r--r-- 1 jovyan users     2856 Jun 16 09:07 Clone10.SORTED.bam.bai


## Let's map data from several clones using a loop, with one folder per sample

In [183]:
for i in {2..3}
    do
        cd ~/work/MAPPING-ILL
        echo -e "\nCreation directory for Clone$i\n"
        echo Clone$i
        mkdir -p dirClone$i
        cd dirClone$i

        echo -e "\nDeclare Clone$i data";
        # REF is not redeclared: it still points to the Clone10 reference indexed earlier
        ONT="/home/jovyan/work/DATA/Clone${i}/ONT/Clone${i}.fastq.gz"
        ILL_R1="/home/jovyan/work/DATA/Clone${i}/ILL/Clone${i}_R1.fastq.gz"
        ILL_R2="/home/jovyan/work/DATA/Clone${i}/ILL/Clone${i}_R2.fastq.gz"
        
        echo -e "\nMapping Clone$i\n"
        bwa-mem2 mem -M -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" $REF $ILL_R1 $ILL_R2 > Clone$i.sam
        samtools view -@4 -bh -S -o Clone$i.bam Clone$i.sam 
        samtools view -bh -@4 -f 0x02 -o Clone$i.mappedpaired.bam Clone$i.bam
        samtools sort -@4 Clone$i.mappedpaired.bam Clone$i.SORTED 
        samtools index Clone$i.SORTED.bam 
    done


Creation directory for Clone2

Clone2

Declare Clone2 data

Mapping Clone2

-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta
Entering FMI_search
reference seq len = 2040003
count
0,	1
1,	576484
2,	1020002
3,	1463520
4,	2040003

Reading other elements of the index from files /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta
prefix: /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /home/jovyan/work/DATA/Clone10/referenceCorrect.fasta.0123
Reference genome size: 2040002 bp
Done readng reference genome !!

[0000] 1: Calling process()

Threads used (compute): 4
Info: projected #read in a task: 264910
------------------------------------------
Memory pre-allocation for chaining: 557.3706 MB
Memory pre-allocation for BSW: 958.4681 MB
Memory pre-allocation for BWT: 309.2567 MB
---
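
The loop above repeats the same five commands for each clone. One way to make that plan explicit is to generate the command lines first and review them before running anything. The sketch below does this with a hypothetical function, `mapping_commands`; the file names in the example call are placeholders, and paths containing spaces are not handled.

```shell
# Hypothetical helper: print, one per line, the commands the loop above runs
# for a single clone. Piping the output to `bash -e` would execute them and
# stop at the first failure.
# The `samtools sort` form mirrors the notebook (input BAM, then output
# prefix); recent samtools releases expect `samtools sort -o out.bam in.bam`.
mapping_commands() {
    local clone="$1" ref="$2" r1="$3" r2="$4"
    cat <<EOF
bwa-mem2 mem -M -t 4 -R "@RG\\tID:${clone}\\tSM:${clone}" ${ref} ${r1} ${r2} > ${clone}.sam
samtools view -@4 -bh -S -o ${clone}.bam ${clone}.sam
samtools view -bh -@4 -f 0x02 -o ${clone}.mappedpaired.bam ${clone}.bam
samtools sort -@4 ${clone}.mappedpaired.bam ${clone}.SORTED
samtools index ${clone}.SORTED.bam
EOF
}

# Placeholder file names, for illustration only:
mapping_commands Clone2 ref.fasta Clone2_R1.fastq.gz Clone2_R2.fastq.gz
```

Printing the commands first also leaves a record of exactly what was run for each sample.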

In [None]:
What is the percentage of aligned ONT and Illumina reads in your clone assembly?

In [None]:

# Mark duplicates, index the result, then call variants in GVCF mode
echo -e "\nCalling Clone$i"
gatk MarkDuplicates -I Clone$i.SORTED.bam -M duplicates.$i.metrics -O Clone$i.SORTED.MD.bam
samtools index Clone$i.SORTED.MD.bam
gatk --java-options "-Xmx4g" HaplotypeCaller --native-pair-hmm-threads 4 -I Clone$i.SORTED.MD.bam -O Clone$i.g.vcf -R ../referenceCorrect.fasta -ERC GVCF
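
GATK tools that take `-R` expect two companion files next to the reference: a FASTA index (`.fai`) and a sequence dictionary (`.dict`). As a sketch, the hypothetical helper below checks for both before variant calling is launched; it assumes the dictionary is named by replacing the FASTA extension with `.dict`.

```shell
# Hypothetical helper: check that the FASTA index (.fai) and the sequence
# dictionary (.dict) expected by GATK sit next to the reference.
has_gatk_reference_files() {
    local ref="$1"
    local dict="${ref%.*}.dict"   # e.g. reference.fasta -> reference.dict
    [[ -s "${ref}.fai" && -s "$dict" ]]
}

# Example usage, creating the files when they are absent:
# has_gatk_reference_files "$REF" || {
#     samtools faidx "$REF"
#     gatk CreateSequenceDictionary -R "$REF"
# }
```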


## 2.2 Mapping Long reads vs a Reference

In [None]:
echo -e "\nMapping Clone$i minimap2 \n"
minimap2 -ax map-ont -t 4 ${REF} ${ONT} > Clone${i}_ONT.sam 
# -F 0x904 drops unmapped, secondary and supplementary alignments before sorting
samtools view -@4 -bS -F 0x904 Clone${i}_ONT.sam | samtools sort -@4 - Clone${i}_ONT_SORTED
samtools index Clone${i}_ONT_SORTED.bam
rm Clone${i}_ONT.sam


Mapping Clone3 minimap2 

[M::mm_idx_gen::0.068*1.06] collected minimizers
[M::mm_idx_gen::0.094*1.77] sorted minimizers
[M::main::0.094*1.77] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.100*1.73] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.106*1.69] distinct minimizers: 165344 (91.75% are singletons); average occurrences: 1.156; average spacing: 5.336
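
The `-f` and `-F` options of `samtools view` select reads by FLAG bits: `-f` keeps reads with all the given bits set, `-F` discards reads with any of them set. The sketch below reproduces that logic with bash arithmetic so that individual flag values can be tested by hand; `keeps_read` is a hypothetical helper written for this illustration.

```shell
# Hypothetical helper mimicking the `-f` / `-F` logic of `samtools view`:
# a read is kept when all `required` bits are set and no `excluded` bit is.
keeps_read() {
    local flag="$1" required="$2" excluded="$3"
    (( (flag & required) == required && (flag & excluded) == 0 ))
}

# 0x904 = 0x4 (unmapped) + 0x100 (secondary) + 0x800 (supplementary)
keeps_read 16   0 0x904 && echo "flag 16: kept"       # mapped, reverse strand
keeps_read 4    0 0x904 || echo "flag 4: dropped"     # unmapped
keeps_read 2064 0 0x904 || echo "flag 2064: dropped"  # supplementary, reverse strand
```

The same function covers the short-read filter used earlier: `keeps_read 99 0x2 0` succeeds because flag 99 includes the properly-paired bit.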


## Calling all samples into one raw VCF

In [None]:
cd /ifb/data/mydatalocal/MappingAndSNP
# Loop to build the list of --variant options
OPTION=""
for i in {1..20}
do
    OPTION="${OPTION} --variant dirClone${i}/Clone${i}.g.vcf"
done
echo $OPTION
gatk CombineGVCFs -R referenceCorrect.fasta $OPTION -O rawSNP.vcf
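
An alternative sketch for building the argument list uses a bash array and only adds gVCF files that actually exist, which avoids a failure when some clones have not been processed yet. `collect_variant_args` and the `VARIANT_ARGS` array are hypothetical names; the folder layout (`dirCloneN/CloneN.g.vcf`) is the one used in the loop above.

```shell
# Hypothetical helper: fill the global array VARIANT_ARGS with one
# `--variant <file>` pair per existing gVCF under the given base directory.
collect_variant_args() {
    local base="$1" first="$2" last="$3" n gvcf
    VARIANT_ARGS=()
    for (( n = first; n <= last; n++ )); do
        gvcf="${base}/dirClone${n}/Clone${n}.g.vcf"
        if [[ -s "$gvcf" ]]; then
            VARIANT_ARGS+=(--variant "$gvcf")
        else
            echo "Skipping missing gVCF: $gvcf" >&2
        fi
    done
}

# Example usage:
# collect_variant_args . 1 20
# gatk CombineGVCFs -R referenceCorrect.fasta "${VARIANT_ARGS[@]}" -O rawSNP.vcf
```

Expanding the array as `"${VARIANT_ARGS[@]}"` keeps each path as a single argument even if it contains spaces.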

## Have a look at it

In [None]:
head -n 1000 rawSNP.vcf | tail

## Compute the Genotypes *:D I know I am a bad person...*

In [None]:
gatk --java-options "-Xmx4g" GenotypeGVCFs -R referenceCorrect.fasta -V rawSNP.vcf -O output.vcf

## Compute the SNP density along the chromosomes

In [None]:
echo -e "Reference\t1000000\n" > /ifb/data/mydatalocal/MappingAndSNP/genome.txt
bedtools genomecov -bga -split -i /ifb/data/mydatalocal/MappingAndSNP/output.vcf -g /ifb/data/mydatalocal/MappingAndSNP/genome.txt > /ifb/data/mydatalocal/MappingAndSNP/density.csv

In [None]:
head /ifb/data/mydatalocal/MappingAndSNP/density.csv
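
As a complementary view, variant records can also be counted per fixed-size window directly from the VCF. This is a different technique from the `bedtools genomecov` call above, not a reimplementation of it. The sketch below uses a hypothetical helper, `vcf_window_counts`, and assumes an uncompressed, position-sorted VCF on standard input; the default window size of 10000 bp is an arbitrary choice.

```shell
# Hypothetical helper: count VCF records per fixed-size window.
# Reads an uncompressed VCF on stdin and prints, tab-separated:
# chromosome, window start (0-based), window end, number of records.
# Only windows containing at least one record are reported.
vcf_window_counts() {
    local window="${1:-10000}"
    awk -v w="$window" 'BEGIN { FS = OFS = "\t" }
        /^#/ { next }                       # skip header lines
        { bin = int(($2 - 1) / w); key = $1 FS bin
          if (!(key in n)) order[++k] = key # remember first-seen order
          n[key]++ }
        END { for (i = 1; i <= k; i++) {
                  split(order[i], p, FS)
                  print p[1], p[2] * w, (p[2] + 1) * w, n[order[i]] } }'
}

# Example usage:
# vcf_window_counts 10000 < output.vcf > density_windows.tsv
```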