# <span style="color:green">Formation South Green 2022</span> - Structural Variants Detection by using short and long reads 

# __DAY 1 : How to map reads against a reference genome ?__ 

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

## __1. Preparing the working environment__ 

### First create a dedicated folder to work 


In [7]:
# go to work directory and download data
cd /home/jovyan/work/
ls

MAPPING-ILL  SV_DATA               training_SV_teaching_old
MAPPING-ONT  training_SV_teaching


## Download sequencing data (SR & LR) for Simulated clones

Before starting, please download the data created for this practical training. The data are available from the I-Trop server.

Each participant will analyse one clone; results will be recorded in this shared file.

To generate Clone data, a 1Mb contig was extracted from chromosome 1 of rice.

20 levels of variation were generated and long reads were simulated for each.

We have introduced different variations (SNP, indel, indel+translocations) and also some contaminations.

In [None]:
# download available compressed DATA 
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/sv-training/SV_DATA.tar.gz
# decompress data
tar zxvf SV_DATA.tar.gz
rm SV_DATA.tar.gz

In [None]:
# check data 
ls -l SV_DATA/

-----------------------
# 2. MAPPING PRACTICE

Read congruency is an important measure in determining assembly accuracy.

Clusters of read pairs or single long reads that align incorrectly are strong indicators of mis-assembly.

Read mapping is usually the first step before SNP or variant calling.

### 2.1 Make a folder for your results

In [8]:
mkdir -p ~/work/MAPPING-ILL
cd ~/work/MAPPING-ILL

### 2.2 Declare important variables

In [9]:
i=10
REF_DIR="/home/jovyan/work/SV_DATA/REF/"
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
ONT="/home/jovyan/work/SV_DATA/LONG_READS/Clone${i}.fastq.gz"
ILL_R1="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R1.fastq.gz"
ILL_R2="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R2.fastq.gz"
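
Since every later cell depends on these paths, it can help to confirm that the files exist before mapping. The sketch below is only an illustration: `check_inputs` is a hypothetical helper, not part of the original notebook.

```shell
# Hypothetical helper (not part of the original notebook): check that
# every path given as argument exists and is a non-empty file.
check_inputs() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call with the notebook's variables:
# check_inputs "$REF" "$ILL_R1" "$ILL_R2" "$ONT" && echo "All inputs found"
```

The function returns a non-zero status as soon as one file is missing, so it can be chained with `&&` in front of a mapping command.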

In [10]:
echo "Clone${i} $REF" 
echo $ILL_R1 $ILL_R2

Clone10 /home/jovyan/work/SV_DATA/REF/reference.fasta
/home/jovyan/work/SV_DATA/SHORT_READS/Clone10_R1.fastq.gz /home/jovyan/work/SV_DATA/SHORT_READS/Clone10_R2.fastq.gz


## 2.1 Mapping short reads vs a reference with `bwa-mem2 mem`

In this practice, we map short reads against a reference. To see how well the reads align back to the reference, we use bwa-mem2 for the mapping and samtools to compute basic alignment statistics.

In this exercise, we use the reference.fasta assembly as well as the Illumina reads from your favorite clone.

Mapping with bwa-mem2 takes 2 steps:
- **Reference indexing**: `bwa-mem2 index reference`
- **Mapping itself**: `bwa-mem2 mem -R READGROUP [options] reference fastq1 fastq2 > out.sam`
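
The `READGROUP` argument is a SAM `@RG` header line written with literal `\t` separators. As a minimal sketch, the hypothetical helper `make_read_group` below builds such a string for one sample; the `ID`, `SM` and `PL` values are illustrative choices and are not taken from the notebook.

```shell
# Hypothetical helper: build a read-group string for one sample.
# The ID/SM/PL values are illustrative, not taken from the notebook.
make_read_group() {
    local sample="$1"
    # printf '%s' keeps the backslash-t sequences literal;
    # the mapper is expected to turn them into tabs in the SAM header
    printf '%s' "@RG\\tID:${sample}\\tSM:${sample}\\tPL:ILLUMINA"
}

make_read_group "Clone10"; echo

# Possible use in the mapping command:
# bwa-mem2 mem -R "$(make_read_group "Clone$i")" -t 4 "$REF" "$ILL_R1" "$ILL_R2" > "Clone$i.sam"
```

Tagging each BAM with its sample name is mostly useful later, when several clones are processed by the same variant caller.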

## Reference indexing 

Before mapping, we need to index the reference file. Check the `bwa-mem2 index` command line.

In [None]:
cd $REF_DIR

In [11]:
echo -e "\nIndexing reference $REF\n"
bwa-mem2 index $REF


Indexing reference /home/jovyan/work/SV_DATA/REF/reference.fasta

[bwa_index] Pack FASTA... 0.01 sec
init ticks = 195299651
ref seq len = 2040002
binary seq ticks = 67372992
build index ticks = 647368842
ref_seq_len = 2040002
count = 0, 576483, 1020001, 1463519, 2040002
BWT[1932441] = 4
CP_SHIFT = 5, CP_MASK = 31
sizeof CP_OCC = 64
max_occ_ind = 63750
ref_seq_len = 2040002
count = 0, 576483, 1020001, 1463519, 2040002
BWT[1932441] = 4
CP_SHIFT = 6, CP_MASK = 63
sizeof CP_OCC = 64
max_occ_ind = 31875


In [12]:
pwd

/home/jovyan/work/MAPPING-ILL


In [13]:
ls

dirClone1   dirClone13  dirClone17  dirClone20  dirClone6
dirClone10  dirClone14  dirClone18  dirClone3   dirClone7
dirClone11  dirClone15  dirClone19  dirClone4   dirClone8
dirClone12  dirClone16  dirClone2   dirClone5   dirClone9


##  Let's map now, using the reads from only one clone

In [14]:
cd ~/work/MAPPING-ILL
echo -e "\nCreation directory for Clone$i\n"
echo Clone$i
mkdir -p dirClone$i
cd dirClone$i


Creation directory for Clone10

Clone10


In [None]:
echo -e "\nMapping Clone$i\n"
bwa-mem2 mem -M -t 4 $REF $ILL_R1 $ILL_R2 > Clone$i.sam

In [None]:
pwd 

In [None]:
ls

In [None]:
head Clone$i.sam

##  Calculate stats from mapping

In [None]:
samtools view -@4 -bh -S -o  Clone$i.bam Clone$i.sam 
rm Clone$i.sam
samtools flagstat Clone$i.bam >Clone$i.flagstats
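
The flagstat report is plain text, so a single value can be pulled out with `awk`. The sketch below assumes the usual layout of the `mapped` line, such as `190 + 0 mapped (95.00% : N/A)`; `mapped_percent` is a hypothetical helper name.

```shell
# Hypothetical helper: print the overall mapped percentage from a flagstat report.
mapped_percent() {
    # The line of interest looks like: "190 + 0 mapped (95.00% : N/A)".
    # Testing the 4th field skips the "primary mapped" line of recent samtools versions.
    awk '$4 == "mapped" { gsub(/[(%]/, "", $5); print $5; exit }' "$1"
}

# Example call:
# mapped_percent "Clone$i.flagstats"
```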

##  Calculate stats only for correctly mapped (properly paired) reads

In [None]:
samtools view -bh -@4 -f 0x02 -o Clone$i.mappedpaired.bam Clone$i.bam 
samtools flagstat Clone$i.mappedpaired.bam > Clone$i.mappedpaired.flagstats
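
The `-f 0x02` option keeps the alignments whose FLAG field has the "properly paired" bit set. As an illustration of how this bit test works, here is a small sketch; `flag_has` is a hypothetical helper, and the two FLAG values are common examples rather than values taken from this dataset.

```shell
# Hypothetical helper: succeed when a given bit is set in a SAM FLAG value.
flag_has() {
    local flag="$1" bit="$2"
    [ $(( flag & bit )) -ne 0 ]
}

# 99 = 1 (paired) + 2 (proper pair) + 32 (mate reverse) + 64 (first in pair)
if flag_has 99 0x2; then echo "99: properly paired"; fi
# 73 = 1 (paired) + 8 (mate unmapped) + 64 (first in pair)
if ! flag_has 73 0x2; then echo "73: not properly paired"; fi
```

A read with FLAG 99 would pass `-f 0x02`, while a read with FLAG 73 would be discarded.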

##  Sorting final bam

In [None]:
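# samtools >= 1.3 takes the output file with -o (the old "out.prefix" form is no longer accepted)
samtools sort -@4 -o Clone$i.SORTED.bam Clone$i.mappedpaired.bam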
rm Clone$i.mappedpaired.bam

##  Indexing bam

In [None]:
samtools index Clone$i.SORTED.bam

In [None]:
ls -lrt

## Let's map with data from all clones using a loop for mapping, with a single folder per sample

In [15]:
for i in {1..20}
    do
        cd ~/work/MAPPING-ILL
        echo -e "\nCreation directory for Clone$i\n"
        echo Clone$i
        mkdir -p dirClone$i
        cd dirClone$i
        
        echo -e "\nDeclare variables$i\n"
        REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
        ILL_R1="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R1.fastq.gz"
        ILL_R2="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R2.fastq.gz"

        echo -e "\nMapping Clone$i\n"
        bwa-mem2 mem -M -t 4 $REF $ILL_R1 $ILL_R2 > Clone$i.sam
        
        echo -e "\nConvert SAM to BAM for Clone$i\n"
        samtools view -@4 -bh -S -o  Clone$i.bam Clone$i.sam 
        rm Clone$i.sam
        echo -e "\nFlagstats from all reads $i\n"
        samtools flagstat Clone$i.bam >Clone$i.flagstats
        
        echo -e "\nExtract only correctly mapped and calculate flagstats $i\n"
        samtools view -bh -@4 -f 0x02 -o Clone$i.mappedpaired.bam Clone$i.bam 
        samtools flagstat Clone$i.mappedpaired.bam > Clone$i.mappedpaired.flagstats
        
        echo -e "\nSort and index mappedpaired bam file $i\n"
        # samtools >= 1.3 takes the output file with -o
        samtools sort -@4 -o Clone$i.SORTED.bam Clone$i.mappedpaired.bam
        samtools index Clone$i.SORTED.bam
        rm Clone$i.mappedpaired.bam
    done


Creation directory for Clone8

Clone8

Declare variables8


Mapping Clone8

-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: /home/jovyan/work/SV_DATA/REF/reference.fasta
Entering FMI_search
reference seq len = 2040003
count
0,	1
1,	576484
2,	1020002
3,	1463520
4,	2040003

Reading other elements of the index from files /home/jovyan/work/SV_DATA/REF/reference.fasta
prefix: /home/jovyan/work/SV_DATA/REF/reference.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /home/jovyan/work/SV_DATA/REF/reference.fasta.0123
Reference genome size: 2040002 bp
Done readng reference genome !!

[0000] 1: Calling process()

Threads used (compute): 4
Info: projected #read in a task: 264910
------------------------------------------
Memory pre-allocation for chaining: 557.3706 MB
Memory pre-allocation for BSW: 958.4681 MB
Memory pre-allocation for BWT: 309.2567 MB
-----------------------------------
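
Once the loop has finished, the per-clone reports can be gathered into one table to compare clones. The sketch below assumes the `dirCloneN/CloneN.flagstats` layout produced by the loop and the usual `properly paired` line of flagstat; `summarise_flagstats` is a hypothetical helper name.

```shell
# Hypothetical helper: print "sample<TAB>properly_paired_percent" for every
# dirCloneN/CloneN.flagstats file found under a mapping directory.
summarise_flagstats() {
    local dir="$1" f sample pct
    # The [0-9] before ".flagstats" skips the *.mappedpaired.flagstats files
    for f in "$dir"/dirClone*/Clone*[0-9].flagstats; do
        [ -e "$f" ] || continue
        sample=$(basename "$f" .flagstats)
        # Line of interest: "180 + 0 properly paired (90.00% : N/A)"
        pct=$(awk '$4 == "properly" && $5 == "paired" { gsub(/[(%]/, "", $6); print $6; exit }' "$f")
        printf '%s\t%s\n' "$sample" "$pct"
    done
}

# Example call:
# summarise_flagstats ~/work/MAPPING-ILL
```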

## 2.2 Mapping Long reads vs a Reference

The process for long reads (LR) is similar to the one used for short reads (SR). In this case, the mapper is minimap2.

In [None]:
# Declare variables
i=10
REF_DIR="/home/jovyan/work/SV_DATA/REF/"
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
ONT="/home/jovyan/work/SV_DATA/LONG_READS/Clone${i}.fastq.gz"

##  Let's map now, using the reads from only one clone

In [None]:
mkdir -p ~/work/MAPPING-ONT
cd ~/work/MAPPING-ONT
echo -e "\nCreation directory for Clone$i\n"
echo Clone$i
mkdir -p dirClone$i
cd dirClone$i

In [None]:
echo -e "\nMapping Clone$i minimap2 \n"
minimap2 -ax map-ont -t 4 ${REF} ${ONT} > Clone${i}_ONT.sam 

In [None]:
head -n3 Clone${i}_ONT.sam

In [None]:
# Convert SAM to BAM, dropping unmapped (0x4), secondary (0x100) and supplementary (0x800) alignments
echo -e "\nConvert SAM to BAM and filter it \n"
samtools view -@4 -bh -S -F 0x904 -o Clone${i}_ONT.bam Clone${i}_ONT.sam
rm Clone${i}_ONT.sam
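
The `0x904` value passed to `-F` is the combination of three FLAG bits defined in the SAM specification. The short sketch below rebuilds the mask from its parts; the variable names are chosen here for illustration only.

```shell
# SAM FLAG bits excluded by the filter (values from the SAM specification)
UNMAPPED=0x4
SECONDARY=0x100
SUPPLEMENTARY=0x800

# Combine the three bits into one mask and print it in hexadecimal
MASK=$(( UNMAPPED | SECONDARY | SUPPLEMENTARY ))
printf '0x%X\n' "$MASK"
```

Any alignment carrying at least one of these three bits is removed by `-F`, so only primary alignments of mapped reads remain.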

In [None]:
echo -e "\nSort and index bam \n"
# sort and index bam
# samtools >= 1.3 takes the output file with -o
samtools sort -@4 -o Clone${i}_ONT_SORTED.bam Clone${i}_ONT.bam
samtools index Clone${i}_ONT_SORTED.bam

In [None]:
# Calculate stats from mapping
echo -e "\nCalculate stats from mapping\n"
samtools flagstat Clone${i}_ONT_SORTED.bam >Clone${i}_ONT.flagstats

## Let's map with data from all clones using a loop for mapping, with a single folder per sample and ONT reads

In [None]:
for i in {1..20}
    do
        mkdir -p ~/work/MAPPING-ONT
        cd ~/work/MAPPING-ONT
        echo -e "\nCreation directory for Clone$i\n"
        echo Clone$i
        mkdir -p dirClone$i
        cd dirClone$i
        
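        # Recompute the read path for the current clone (the reference is the same for all clones)
        REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
        ONT="/home/jovyan/work/SV_DATA/LONG_READS/Clone${i}.fastq.gz"

        echo -e "\nMapping Clone$i minimap2 \n"
        minimap2 -ax map-ont -t 4 ${REF} ${ONT} > Clone${i}_ONT.sam 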
        
        # Convert samtobam 
        echo -e "\nConvert samtobam and filter it \n"
        samtools view -@4 -bh -S -F 0x904 -o Clone${i}_ONT.bam Clone${i}_ONT.sam
        rm Clone${i}_ONT.sam

        echo -e "\nSort and index bam \n"
        # sort and index bam
        # samtools >= 1.3 takes the output file with -o
        samtools sort -@4 -o Clone${i}_ONT_SORTED.bam Clone${i}_ONT.bam
        samtools index Clone${i}_ONT_SORTED.bam

        # Calculate stats from mapping
        echo -e "\nCalculate stats from mapping\n"
        samtools flagstat Clone${i}_ONT_SORTED.bam >Clone${i}_ONT.flagstats
    done
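
A variable such as `ONT` that embeds `${i}` is expanded at the moment it is assigned, so it has to be recomputed for each clone inside a loop. One way to make this explicit is a small function, sketched below; `ont_path` and `DATA_DIR` are hypothetical names, with `DATA_DIR` defaulting to the data folder used in this notebook.

```shell
# Hypothetical helper: print the ONT FASTQ path for clone number $1.
# DATA_DIR is an assumed variable; it defaults to the notebook's data folder.
ont_path() {
    local data_dir="${DATA_DIR:-/home/jovyan/work/SV_DATA}"
    printf '%s/LONG_READS/Clone%s.fastq.gz\n' "$data_dir" "$1"
}

for i in 1 2 3; do
    ONT=$(ont_path "$i")   # recomputed at every iteration
    echo "Clone$i -> $ONT"
done
```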

In [None]:
ls