# <span style="color:green">Formation South Green 2022</span> - Structural Variants Detection by using short and long reads 

# __DAY 1 : How to map reads against a reference genome ?__ 

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)


***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[I - Preparing data](#data)

* [Download sequencing data (SR & LR) for Simulated clones](#download)

[II - Mapping Practice](#mapping) 
  
[2.1. Mapping short reads vs a reference with  `bwa mem`](#bwamem)

   * [ Reference indexation](#refindex)
   * [Run the mapping with `bwa mem`](#bwamem2-cmd)
   * [Calculate stats from mapping `samtools flagstat`](#flagstats)
   * [Convert sam into bam `samtools view`](#samtoolsview)
   * [Generate a bam file that contains only the reads correctly paired mapped `samtools view`](#corrmap)
   * [Indexing bam file](#indexbam) 
   * [EXERCISE : MAP ALL SR WITH BWAMEM2](#mapallmem)

[2.2. Mapping long reads vs a reference with  `minimap2`](#minimap2)

   * [EXERCISE : MAP ALL LR WITH MINIMAP2](#mapallminimap)

[III - Centralize final mapping data into a single bam directory](#reorder)

***


# <span style="color:#006E7F">__I - Preparing data__ <a class="anchor" id="data"></span>  

### <span style="color: #4CACBC;"> First create a dedicated folder to work</span>  


In [None]:
# go to work directory and download data
cd /home/jovyan/work/
ls

### <span style="color: #4CACBC;"> Download sequencing data (SR & LR) for Simulated clones <a class="anchor" id="download"> </span>  

Before starting, please download the data created specifically for this practical training. The data are available on the I-Trop server.

Each participant will analyse one clone, and the results will be entered in this shared file.

To generate Clone data, a 1Mb contig was extracted from chromosome 1 of rice.

20 levels of variation were generated and long reads were simulated for each.

We have introduced different variations (SNP, indel, indel+translocations) and also some contaminations.

In [None]:
# download available compressed DATA 
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/sv-training/SV_DATA.tar.gz
# decompress data
tar zxvf SV_DATA.tar.gz
rm SV_DATA.tar.gz

### <span style="color: #4CACBC;"> List the content of the directory work and check that the directory SV_DATA has been created</span>  


### <span style="color: #4CACBC;"> List the content of the directory SV_DATA</span>  
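Both listing steps above could be done as follows. This is a minimal sketch that assumes the archive was extracted into `/home/jovyan/work/`; the `WORK_DIR` variable name is only illustrative, and the three sub-directory names are the ones that appear in the data paths used in this notebook.

```shell
# Illustrative variable name; the path comes from the first cell of this notebook
WORK_DIR="/home/jovyan/work"

# 1) Content of the work directory: SV_DATA should appear in the list
ls -lh "$WORK_DIR" 2>/dev/null || echo "$WORK_DIR not found"

# 2) Content of SV_DATA, one sub-directory at a time
for sub in REF SHORT_READS LONG_READS; do
    echo "== $WORK_DIR/SV_DATA/$sub"
    ls -lh "$WORK_DIR/SV_DATA/$sub" 2>/dev/null || echo "   (not found)"
done
```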

# <span style="color:#006E7F">__II -  MAPPING PRACTICE__ <a class="anchor" id="mapping"></span>  

Read congruency is an important measure in determining assembly accuracy.

Clusters of read pairs or single long reads that align incorrectly are strong indicators of mis-assembly.

Read mapping is usually the first step before SNP or variant calling.

### <span style="color: #4CACBC;"> Make a folder for your results</span>  

In [None]:
mkdir -p ~/work/MAPPING-ILL
cd ~/work/MAPPING-ILL

### <span style="color: #4CACBC;"> Declare important variables</span>  

We are going to set up bash variables with the paths to our data. We set a bash variable like this: `var="value"`
and call it like this: `echo $var`
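
As a small illustration of assignment and expansion (the variable names `clone_id` and `reads` are made up for this example):

```shell
# Assign without spaces around "="
clone_id=10
# Braces mark where the variable name ends:
# $clone_id_R1 would be read as a different variable called clone_id_R1
reads="Clone${clone_id}_R1.fastq.gz"
echo "$reads"

# The value is expanded when the assignment runs, so changing clone_id
# afterwards does not update reads
clone_id=3
echo "$reads"
```

Both `echo` calls print the same file name, which is why a variable has to be defined before any other variable that uses it.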


In [None]:
#CLONE NUMBER THAT YOU ARE GOING TO ANALYZE 
# (set it first: ${i} is expanded when the paths below are assigned)
i=10 

# REFERENCE 
REF_DIR="/home/jovyan/work/SV_DATA/REF/"
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"

# ONT DATA
ONT="/home/jovyan/work/SV_DATA/LONG_READS/Clone${i}.fastq.gz"

# ILLUMINA DATA
ILL_R1="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R1.fastq.gz"
ILL_R2="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R2.fastq.gz"

##### Print the variable i, REF, ILL_R1 & ILL_R2

In [None]:
echo "Clone${i} $REF" 
echo $ILL_R1 $ILL_R2

-------------
# <span style="color: #4CACBC;"> 2.1. Mapping short reads vs a reference with  `bwa mem` <a class="anchor" id="bwamem"></span>  

In this practice, we are going to map short reads against a reference. To find out how well the reads align back to the reference, we use bwa-mem2 for the mapping and samtools to compute basic alignment statistics.

In this exercise, we will use the reference.fasta assembly as well as the Illumina reads from your favorite clone.

Mapping with bwa takes two steps (bwa-mem2 uses the same sub-commands and arguments): 
- **Reference indexing**: `bwa index reference`
- **Mapping itself**: `bwa mem  -R READGROUP [options] reference fastq1 fastq2 > out.sam`

## <span style="color: #4CACBC;"> Reference indexation  <a class="anchor" id="refindex"></span>  

Before mapping, we need to index the reference file! Check the `bwa-mem2 index` command line.

In [None]:
cd $REF_DIR

In [None]:
echo -e "\nIndexing reference $REF\n"
bwa-mem2 index $REF

### <span style="color: #4CACBC;">Check that the indexes have been created </span>  
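One way to run this check is sketched below. It relies on the fact that the index files are written next to the reference and share its name as a prefix (for example with the extensions `.amb`, `.ann` and `.pac`); the counter `n_index` is an illustrative addition.

```shell
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"

# List every file named reference.fasta.<something> and count them
n_index=0
for f in "$REF".*; do
    [ -e "$f" ] || continue      # the pattern matched nothing
    ls -lh "$f"
    n_index=$((n_index + 1))
done
echo "$n_index index file(s) found for $REF"
```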

## <span style="color: #4CACBC;"> => Let's map now, but with reads from only one clone </span>  

* Go into the directory MAPPING-ILL
* Create a subdirectory to save the files generated by the mapping step. 
  E.g.: if you are going to analyze `Clone1`, create the subdirectory `dirClone1`. 

In [None]:
cd ~/work/MAPPING-ILL
echo -e "\n>>>>>>>>>> Creation directory for Clone$i\n"
mkdir -p dirClone$i
cd dirClone$i

## <span style="color: #4CACBC;"> Run the mapping with `bwa mem` <a class="anchor" id="bwamem2-cmd"></span>  
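A possible command for this step is sketched below. It assumes that `bwa-mem2` is in the `PATH`, that the reference has already been indexed and that 4 threads are available. The read group values are illustrative choices, and the output name `Clone${i}.sam` is the one removed after the bam conversion in this notebook. The `if` guard only keeps the cell from failing when the tool or the reads are absent.

```shell
i=10
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
ILL_R1="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R1.fastq.gz"
ILL_R2="/home/jovyan/work/SV_DATA/SHORT_READS/Clone${i}_R2.fastq.gz"

# Read group line: bwa-mem2 converts the literal \t into tabs (values are assumptions)
RG="@RG\tID:Clone${i}\tSM:Clone${i}\tPL:ILLUMINA"
SAM="Clone${i}.sam"

if command -v bwa-mem2 >/dev/null 2>&1 && [ -s "$ILL_R1" ] && [ -s "$ILL_R2" ]; then
    # -t: number of threads, -R: read group added to the header and to each read
    bwa-mem2 mem -t 4 -R "$RG" "$REF" "$ILL_R1" "$ILL_R2" > "$SAM"
else
    echo "bwa-mem2 or the input reads were not found, nothing mapped" >&2
fi
```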

### <span style="color: #4CACBC;">Check that the `.sam` file has been created by `bwa mem` </span>  


### <span style="color: #4CACBC;">Display the first and last lines of the sam file just created </span>  
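The check and the display could look like this, assuming the sam file is named `Clone${i}.sam` and sits in the current directory (the number of lines shown is arbitrary):

```shell
i=10
SAM="Clone${i}.sam"

if [ -s "$SAM" ]; then
    ls -lh "$SAM"
    head -n 5 "$SAM"    # top of the file: header lines, which start with "@"
    tail -n 5 "$SAM"    # end of the file: alignment records, one per line
else
    echo "$SAM not found or empty"
fi
```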

## <span style="color: #4CACBC;"> Convert sam into bam `samtools view` <a class="anchor" id="samtoolsview"></span>  
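A minimal sketch of the conversion, assuming `samtools` is in the `PATH`; the output name `Clone${i}.bam` and the number of threads are assumptions:

```shell
i=10
SAM="Clone${i}.sam"
BAM="Clone${i}.bam"    # assumed output name

if command -v samtools >/dev/null 2>&1 && [ -s "$SAM" ]; then
    # -b: write bam, -h: keep the header, -@ 4: additional threads, -o: output file
    samtools view -@ 4 -b -h -o "$BAM" "$SAM"
else
    echo "samtools or $SAM not found, nothing converted" >&2
fi
```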


#### Check that the bam file has been created 

* Have a look at the filesize of the sam and bam files.
* Remove the sam file 

In [None]:
ls -lh
rm Clone$i.sam

## <span style="color: #4CACBC;"> Calculate stats from mapping `samtools flagstat`<a class="anchor" id="flagstats"></span>   

### <span style="color: #4CACBC;"> Display the content of the flagstat file</span>  
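Computing and displaying the statistics could be done as below. The input name `Clone${i}.bam` and the output name `Clone${i}.flagstat` are assumptions made for this sketch.

```shell
i=10
BAM="Clone${i}.bam"          # assumed name of the bam produced by the conversion
STATS="Clone${i}.flagstat"   # assumed output name

if command -v samtools >/dev/null 2>&1 && [ -s "$BAM" ]; then
    # flagstat counts reads per category (mapped, properly paired, ...)
    samtools flagstat "$BAM" > "$STATS"
    cat "$STATS"
else
    echo "samtools or $BAM not found, no statistics computed" >&2
fi
```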


## <span style="color: #4CACBC;"> Generate a bam file that contains only the reads correctly paired mapped `samtools view`<a class="anchor" id="corrmap"></span>   

SAM flag values can be decoded with the Picard "explain flags" page: https://broadinstitute.github.io/picard/explain-flags.html
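
A possible filter is sketched below. It uses the SAM flag bit `0x2`, which the mapper sets on reads it considers properly paired. The input name `Clone${i}.bam` is an assumption; the output name is the one referred to in the sorting step.

```shell
i=10
BAM="Clone${i}.bam"                       # assumed name of the unfiltered bam
PAIRED_BAM="Clone${i}.mappedpaired.bam"

if command -v samtools >/dev/null 2>&1 && [ -s "$BAM" ]; then
    # -f 0x2: keep only the records that have the "properly paired" bit set
    samtools view -@ 4 -b -h -f 0x2 -o "$PAIRED_BAM" "$BAM"
    # -c: count the records instead of printing them
    samtools view -c "$PAIRED_BAM"
else
    echo "samtools or $BAM not found, nothing filtered" >&2
fi
```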

## <span style="color: #4CACBC;"> Sorting final bam </span>  

* Generate the sorted bam file
* Check that the new bam file has been created
* Remove the bam file previously created (Clone$i.mappedpaired.bam)
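
These three steps could be chained as follows, a sketch assuming `samtools` is in the `PATH` (the number of threads is arbitrary):

```shell
i=10
IN_BAM="Clone${i}.mappedpaired.bam"
SORTED_BAM="Clone${i}.SORTED.bam"

if command -v samtools >/dev/null 2>&1 && [ -s "$IN_BAM" ]; then
    # Sort by coordinate (the default); remove the unsorted file only if sorting succeeded
    samtools sort -@ 4 -o "$SORTED_BAM" "$IN_BAM" && rm "$IN_BAM"
    ls -lh "$SORTED_BAM"
else
    echo "samtools or $IN_BAM not found, nothing sorted" >&2
fi
```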

## <span style="color: #4CACBC;"> Indexing bam file<a class="anchor" id="indexbam"></span>   

In [None]:
samtools index Clone$i.SORTED.bam

In [None]:
ls -lrt

## <span style="color: #4CACBC;"> => Let's map with data from all clones using a loop for mapping, with a single folder per sample<a class="anchor" id="mapallmem"></span>   
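
One possible loop is sketched below. It pipes the mapper output directly into `samtools` so that no intermediate sam file is written, which is a design choice of this sketch and differs from the step-by-step cells above. The variable names `READS_DIR`, `OUT_DIR` and `n_done`, the read group values and the thread counts are assumptions. Clones whose reads are missing are skipped.

```shell
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
READS_DIR="/home/jovyan/work/SV_DATA/SHORT_READS"
OUT_DIR="$HOME/work/MAPPING-ILL"

n_done=0
for i in {1..20}; do
    R1="$READS_DIR/Clone${i}_R1.fastq.gz"
    R2="$READS_DIR/Clone${i}_R2.fastq.gz"
    if [ ! -s "$R1" ] || [ ! -s "$R2" ]; then
        echo "Clone$i: reads not found, skipped"
        continue
    fi
    mkdir -p "$OUT_DIR/dirClone$i"
    OUT="$OUT_DIR/dirClone$i/Clone$i.SORTED.bam"
    # map -> keep properly paired reads (-f 0x2) -> sort by coordinate
    bwa-mem2 mem -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" "$REF" "$R1" "$R2" \
        | samtools view -b -h -f 0x2 - \
        | samtools sort -@ 4 -o "$OUT" -
    samtools index "$OUT"
    n_done=$((n_done + 1))
done
echo "$n_done clone(s) mapped"
```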

# <span style="color: #4CACBC;"> 2.2 Mapping Long reads vs a Reference `minimap2` <a class="anchor" id="minimap2"></span> 


The process for long reads (LR) is similar to the one used for short reads (SR). In this case, the mapper is minimap2.

In [None]:
# Declare variables
i=10
REF_DIR="/home/jovyan/work/SV_DATA/REF/"
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
ONT="/home/jovyan/work/SV_DATA/LONG_READS/Clone${i}.fastq.gz"

## <span style="color: #4CACBC;"> Mapping with `minimap2`</span> 

## <span style="color: #4CACBC;"> => Let's map now, but with reads from only one clone</span>  

In [None]:
mkdir -p ~/work/MAPPING-ONT
cd ~/work/MAPPING-ONT
echo -e "\nCreation directory for Clone$i\n"
echo Clone$i
mkdir -p dirClone$i
cd dirClone$i

In [None]:
echo -e "\nMapping Clone$i minimap2 \n"
#PUT YOUR MINIMAP COMMAND HERE
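
A possible command is sketched below, assuming `minimap2` is in the `PATH` and 4 threads are available. The output name `Clone${i}_ONT.sam` is the one expected by the conversion step of this section.

```shell
i=10
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
ONT="/home/jovyan/work/SV_DATA/LONG_READS/Clone${i}.fastq.gz"
SAM="Clone${i}_ONT.sam"

if command -v minimap2 >/dev/null 2>&1 && [ -s "$ONT" ]; then
    # -a: output in sam format, -x map-ont: preset for Oxford Nanopore reads, -t: threads
    minimap2 -a -x map-ont -t 4 "$REF" "$ONT" > "$SAM"
else
    echo "minimap2 or $ONT not found, nothing mapped" >&2
fi
```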

### <span style="color: #4CACBC;"> Display the first lines of the sam file</span>  

## <span style="color: #4CACBC;"> Convert samtobam</span>  

In [None]:
echo -e "\nConvert samtobam and filter it \n"
samtools view -@4 -bh -S -F 0x904 -o Clone${i}_ONT.bam Clone${i}_ONT.sam
rm Clone${i}_ONT.sam
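
The value given to `-F` in the command above is a bit mask: `0x904` combines the SAM flags `0x4` (unmapped), `0x100` (secondary alignment) and `0x800` (supplementary alignment), and `-F` removes every record that has at least one of these bits set. The small sketch below rebuilds the mask and tests a few flag values; the helper `is_filtered` is hypothetical and written only for this illustration.

```shell
# Rebuild the mask from its three components
MASK=$(( 0x4 | 0x100 | 0x800 ))
printf 'mask = 0x%X\n' "$MASK"

# Hypothetical helper: succeeds when a flag shares at least one bit with the mask
is_filtered() { [ $(( $1 & MASK )) -ne 0 ]; }

# 0: forward primary, 16: reverse primary, 4: unmapped,
# 256: secondary, 2048: supplementary, 2064: supplementary on the reverse strand
for flag in 0 16 4 256 2048 2064; do
    if is_filtered "$flag"; then
        echo "flag $flag: removed by -F 0x904"
    else
        echo "flag $flag: kept"
    fi
done
```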

## <span style="color: #4CACBC;"> Sort and index bam</span>  

In [None]:
echo -e "\nSort and index bam \n"
samtools sort -@8 -o Clone${i}_ONT_SORTED.bam Clone${i}_ONT.bam
samtools index Clone${i}_ONT_SORTED.bam

## <span style="color: #4CACBC;"> Calculate stats from mapping</span>  

In [None]:
echo -e "\nCalculate stats from mapping\n"
#PUT YOUR SAMTOOLS FLAGSTAT COMMAND HERE

### <span style="color: #4CACBC;"> Display the content of the flagstat file</span>  
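For the long reads, the statistics could be computed and displayed like this, on the sorted bam produced above (the output name `Clone${i}_ONT.flagstat` is an assumption):

```shell
i=10
BAM="Clone${i}_ONT_SORTED.bam"
STATS="Clone${i}_ONT.flagstat"   # assumed output name

if command -v samtools >/dev/null 2>&1 && [ -s "$BAM" ]; then
    samtools flagstat "$BAM" > "$STATS"
    cat "$STATS"
else
    echo "samtools or $BAM not found, no statistics computed" >&2
fi
```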

## <span style="color: #4CACBC;"> => Let's map with data from all clones using a loop for mapping, with a single folder per sample and ONT reads<a class="anchor" id="mapallminimap"></span> 
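
One possible loop over the 20 clones is sketched below. It applies the same filter (`-F 0x904`) and the same file names as the single-clone cells, but pipes the steps together instead of writing intermediate files, which is a choice made for this sketch. The variable names `READS_DIR`, `OUT_DIR` and `n_done` and the thread counts are assumptions. Clones whose reads are missing are skipped.

```shell
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
READS_DIR="/home/jovyan/work/SV_DATA/LONG_READS"
OUT_DIR="$HOME/work/MAPPING-ONT"

n_done=0
for i in {1..20}; do
    ONT="$READS_DIR/Clone${i}.fastq.gz"
    if [ ! -s "$ONT" ]; then
        echo "Clone$i: reads not found, skipped"
        continue
    fi
    mkdir -p "$OUT_DIR/dirClone$i"
    OUT="$OUT_DIR/dirClone$i/Clone${i}_ONT_SORTED.bam"
    # map -> drop unmapped, secondary and supplementary records -> sort by coordinate
    minimap2 -a -x map-ont -t 4 "$REF" "$ONT" \
        | samtools view -b -h -F 0x904 - \
        | samtools sort -@ 4 -o "$OUT" -
    samtools index "$OUT"
    samtools flagstat "$OUT" > "$OUT_DIR/dirClone$i/Clone${i}_ONT.flagstat"
    n_done=$((n_done + 1))
done
echo "$n_done clone(s) mapped"
```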

# <span style="color:#006E7F">__III -  Centralize final mapping data into a single bam directory__ <a class="anchor" id="reorder"></span>   

## <span style="color: #4CACBC;"> Reorder BAM files into a folder only for Illumina</span>  


In [None]:
mkdir -p ~/work/MAPPING-ILL/BAM
cd ~/work/MAPPING-ILL/

for i in {1..20}
    do
         ln -s ~/work/MAPPING-ILL/dirClone$i/Clone$i.SORTED.bam BAM/
    done

In [None]:
ls -l /home/jovyan/work/MAPPING-ILL/BAM

## <span style="color: #4CACBC;"> Reorder BAM files into a folder only for ONT</span>  
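
By analogy with the Illumina cell, the ONT files could be gathered as sketched below. It assumes the sorted bam files are named `Clone${i}_ONT_SORTED.bam` inside `~/work/MAPPING-ONT/dirClone${i}`. Linking the `.bai` index next to each bam and skipping missing files are additions of this sketch, and `ONT_DIR` and `n_linked` are illustrative names.

```shell
ONT_DIR="$HOME/work/MAPPING-ONT"
mkdir -p "$ONT_DIR/BAM"

n_linked=0
for i in {1..20}; do
    bam="$ONT_DIR/dirClone$i/Clone${i}_ONT_SORTED.bam"
    if [ -e "$bam" ]; then
        # -s: symbolic link, -f: replace an existing link so the cell can be run again
        ln -sf "$bam" "$bam.bai" "$ONT_DIR/BAM/"
        n_linked=$((n_linked + 1))
    else
        echo "Clone$i: $bam not found, skipped"
    fi
done
echo "$n_linked bam file(s) linked"
ls -l "$ONT_DIR/BAM"
```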