# __How to map reads against a reference genome?__

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[PRACTICE III - Mapping](#mapping) 
  
   * [Reference indexing](#refindex)
   * [Run the mapping with `bwa-mem2 mem`](#bwamem2-cmd)
   * [Calculate stats from mapping `samtools flagstat`](#flagstats)
   * [Convert sam into bam `samtools view`](#samtoolsview)
   * [Generate a bam file that contains only the properly paired mapped reads `samtools view`](#corrmap)
   * [Indexing bam file](#indexbam) 
   
 
[PRACTICE IV - Mapping ON ALL SAMPLES](#loop)

***


To analyze sequencing data, we usually run many bioinformatics tools that generate a lot of files, so it is very important to manage and organize your data.

First, we are going to download the data used in this training.

### <span style="color: #4CACBC;"> Download sequencing data and the reference genome <a class="anchor" id="download"> - `wget` </span>  

The data are available at the following URL: https://itrop.ird.fr/sv-training/SV_DATA.tar.gz


In [None]:
# download the compressed data archive
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/sv-training/SV_DATA.tar.gz
# decompress the archive, then remove it
tar zxvf SV_DATA.tar.gz
rm SV_DATA.tar.gz

### <span style="color: #4CACBC;"> List the content of your home directory and check that the directory SV_DATA has been created</span>  - `ls` 


In [None]:
ls /home/jovyan

### <span style="color: #4CACBC;"> List the content of the directory SV_DATA</span>  - `ls`

In [None]:
ls /home/jovyan/SV_DATA

### <span style="color: #4CACBC;"> List the content of the directory REF</span>  - `ls`

In [None]:
ls /home/jovyan/SV_DATA/REF

What is the format of the file in this directory? What do you think this file contains?
<br>
How many sequences does this file contain? `grep`
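
In a FASTA file, every sequence starts with a header line beginning with `>`, so counting those lines gives the number of sequences. The sketch below shows the idea on a tiny made-up FASTA file written to a temporary directory; the name `toy.fasta` and its content are illustrative assumptions, not part of the training data.

```shell
# Build a toy FASTA file (illustrative content, not the training data)
tmpdir=$(mktemp -d)
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\nACGT\n' > "$tmpdir/toy.fasta"

# Count the header lines: one line starting with '>' per sequence
nb_seq=$(grep -c '^>' "$tmpdir/toy.fasta")
echo "Number of sequences: $nb_seq"

# Clean up the temporary directory
rm -r "$tmpdir"
```

The same `grep -c '^>'` call can be pointed at the file found in `/home/jovyan/SV_DATA/REF`.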

### <span style="color: #4CACBC;"> Go into the directory SV_DATA/SHORT_READS and list the content of this directory - `cd` `ls`</span>  
How many files does it contain? What is their format?


### <span style="color: #4CACBC;"> List the first 10 lines of one file</span>  - `head` `zcat` 
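
Compressed fastq files can be read without unpacking them on disk: `zcat` writes the decompressed content to standard output and `head` keeps only the first lines. A minimal sketch on a made-up compressed file; the name `toy.fastq.gz` and the three reads are illustrative assumptions.

```shell
# Build a toy gzip-compressed FASTQ file (illustrative content only)
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nCCGA\n+\nIIII\n' \
    | gzip > "$tmpdir/toy.fastq.gz"

# Decompress to standard output and keep only the first 10 lines
first_lines=$(zcat "$tmpdir/toy.fastq.gz" | head -n 10)
printf '%s\n' "$first_lines"

# Each read takes 4 lines, so the number of reads is the line count divided by 4
nb_lines=$(zcat "$tmpdir/toy.fastq.gz" | wc -l)
nb_reads=$(( nb_lines / 4 ))
echo "Number of reads: $nb_reads"

# Clean up the temporary directory
rm -r "$tmpdir"
```

On the training data, the same pipeline applies to any file of `SV_DATA/SHORT_READS`.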

### <span style="color: #4CACBC;"> Go into your home directory and create the directory 1-FASTQC</span>  

### <span style="color: #4CACBC;"> Run `fastqc` on all raw fastq</span>  

### <span style="color: #4CACBC;"> Run `MultiQC`</span>  

* go into the directory 1-FASTQC
* run MultiQC in this directory
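
The quality-control steps above could be chained as in the sketch below. It assumes that the reads are in `$HOME/SV_DATA/SHORT_READS` and end with `.fastq.gz` (both are assumptions to adapt to your own setup), and it only runs when `fastqc`, `multiqc` and the input files are found.

```shell
# Assumed locations and file extension: adapt them to your own setup
READS_DIR="${READS_DIR:-$HOME/SV_DATA/SHORT_READS}"
QC_DIR="${QC_DIR:-$HOME/1-FASTQC}"

if command -v fastqc >/dev/null 2>&1 && command -v multiqc >/dev/null 2>&1 \
   && ls "$READS_DIR"/*.fastq.gz >/dev/null 2>&1; then
    # fastqc expects the output directory to exist already
    mkdir -p "$QC_DIR"
    # One FastQC report per fastq file, written into QC_DIR
    fastqc -o "$QC_DIR" "$READS_DIR"/*.fastq.gz
    # Aggregate all the FastQC reports found in QC_DIR into a single report
    cd "$QC_DIR" && multiqc .
else
    echo "fastqc, multiqc or the fastq files were not found: nothing was run"
fi
```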

# <span style="color:#006E7F">__PRACTICE III -  MAPPING__ <a class="anchor" id="mapping"></span>  

In this practice, we are going to map short reads against a reference. We will use reference.fasta as reference genome and ILLUMINA READS from your favorite CLONE.

Two steps are required:
- **Reference indexing**: `bwa-mem2 index reference`
- **Mapping itself**: `bwa-mem2 mem -R READGROUP [options] reference fastq1 fastq2 > out.sam`
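
The `-R` option expects a complete read group header line in which the fields are separated by a literal `\t` (the bwa documentation states that it is converted to a tab in the output SAM). A minimal sketch of how such a string could be built for one sample; the sample name `Clone1` and the `PL:ILLUMINA` platform tag are illustrative assumptions.

```shell
sample="Clone1"   # hypothetical sample name

# Inside double quotes, bash keeps '\t' as two literal characters (backslash + t)
read_group="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA"
printf '%s\n' "$read_group"

# The string would then be passed to the mapper, for example:
#   bwa-mem2 mem -R "$read_group" reference fastq1 fastq2 > out.sam
```

Setting `ID` and `SM` per sample lets downstream tools tell the samples apart once several bam files are analyzed together.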

## <span style="color: #4CACBC;"> Reference indexing  <a class="anchor" id="refindex"></span>  

Before mapping, we need to index the reference file. Check the `bwa-mem2 index` command line.
* Go into the directory REF

* Index the reference file `reference.fasta`

In [None]:
bwa-mem2 index reference.fasta

### <span style="color: #4CACBC;">Check that the indexes have been created </span>  

In [None]:
ls -lrt /home/jovyan/SV_DATA/REF

### <span style="color: #4CACBC;"> Create the `2-MAPPING` directory that will contain files generated by bwa-mem2 to perform mapping</span>  

## <span style="color: #4CACBC;"> Let's map now, but with reads from only one clone </span>  

* Go into the directory 2-MAPPING

## <span style="color: #4CACBC;"> Run the mapping with `bwa-mem2 mem` <a class="anchor" id="bwamem2-cmd"></span>  

### <span style="color: #4CACBC;">Check that the `.sam` file has been created by `bwa-mem2 mem` </span>  


### <span style="color: #4CACBC;">Display the beginning and the end of the sam file just created </span>  

## <span style="color: #4CACBC;"> Convert sam into bam `samtools view` <a class="anchor" id="samtoolsview"></span>  


#### Check that the bam file has been created 

* Have a look at the file sizes of the sam and bam files.
* Remove the sam file 
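
The conversion and the checks above could look like the following sketch. The file name `Clone1.sam` is a hypothetical placeholder, and the commands only run when `samtools` and the sam file are found.

```shell
SAM="${SAM:-Clone1.sam}"   # hypothetical file name
BAM="${SAM%.sam}.bam"      # same prefix with the .bam extension

if command -v samtools >/dev/null 2>&1 && [ -s "$SAM" ]; then
    # -b: write BAM (binary, compressed) instead of SAM; -o: output file
    samtools view -b -o "$BAM" "$SAM"
    # Compare both file sizes in human-readable units
    ls -lh "$SAM" "$BAM"
    # Remove the sam file only if the bam file exists and is not empty
    if [ -s "$BAM" ]; then
        rm "$SAM"
    fi
else
    echo "samtools or $SAM not found: nothing was converted"
fi
```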

## <span style="color: #4CACBC;"> Calculate stats from mapping `samtools flagstat`<a class="anchor" id="flagstats"></span>   

### <span style="color: #4CACBC;"> Display the content of the flagstat file</span>  
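
`samtools flagstat` writes its summary to standard output, so redirecting it to a file keeps the statistics next to the bam file. A sketch assuming a bam file named `Clone1.bam` (hypothetical name); it only runs when `samtools` and the bam file are found.

```shell
BAM="${BAM:-Clone1.bam}"          # hypothetical file name
FLAGSTAT="${BAM%.bam}.flagstat"   # text file that will hold the statistics

if command -v samtools >/dev/null 2>&1 && [ -s "$BAM" ]; then
    # Count the alignments per FLAG category and save the summary
    samtools flagstat "$BAM" > "$FLAGSTAT"
    # Display the content of the flagstat file
    cat "$FLAGSTAT"
else
    echo "samtools or $BAM not found: no statistics were computed"
fi
```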


## <span style="color: #4CACBC;"> Generate a bam file that contains only the properly paired mapped reads `samtools view`<a class="anchor" id="corrmap"></span>   

SAM flags can be decoded with this tool: https://broadinstitute.github.io/picard/explain-flags.html
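
Each alignment carries a FLAG value that is a sum of bits, and the bit `0x2` means "read mapped in proper pair". The sketch below decodes a FLAG value in plain bash, using the bit names of the SAM specification, and then shows how `samtools view -f 0x2` could keep only the properly paired reads. The helper `explain_flag` and the input name `Clone1.bam` are hypothetical; the filtering step only runs when `samtools` and the bam file are found.

```shell
# Print the meaning of each bit set in a SAM FLAG value
explain_flag() {
    local flag=$1
    local bits=(1 2 4 8 16 32 64 128 256 512 1024 2048)
    local names=("read paired" "read mapped in proper pair" "read unmapped"
                 "mate unmapped" "read reverse strand" "mate reverse strand"
                 "first in pair" "second in pair" "not primary alignment"
                 "read fails quality checks" "PCR or optical duplicate"
                 "supplementary alignment")
    local i
    for i in "${!bits[@]}"; do
        # Bitwise AND is non-zero when the bit is set in the flag
        if (( flag & bits[i] )); then
            printf '%5d  %s\n' "${bits[i]}" "${names[i]}"
        fi
    done
}

# 99 = 1 + 2 + 32 + 64
explain_flag 99

BAM="${BAM:-Clone1.bam}"                # hypothetical file name
PAIRED="${BAM%.bam}.mappedpaired.bam"

if command -v samtools >/dev/null 2>&1 && [ -s "$BAM" ]; then
    # -f 0x2 keeps only the alignments whose FLAG has the proper-pair bit set
    samtools view -b -f 0x2 -o "$PAIRED" "$BAM"
else
    echo "samtools or $BAM not found: nothing was filtered"
fi
```

`explain_flag 99` lists four bits: read paired, read mapped in proper pair, mate reverse strand and first in pair.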

## <span style="color: #4CACBC;"> Sorting final bam </span>  

* Generate the sorted bam file
* Check that the new bam file has been created
* Remove the bam file previously created (Clone$i.mappedpaired.bam)
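
By default `samtools sort` orders the alignments by coordinate, which is the order required to index a bam file. A sketch of the three steps above, assuming the filtered file is named `Clone1.mappedpaired.bam` (hypothetical sample); it only runs when `samtools` and that file are found.

```shell
PAIRED="${PAIRED:-Clone1.mappedpaired.bam}"   # hypothetical file name
SORTED="${PAIRED%.bam}.sorted.bam"

if command -v samtools >/dev/null 2>&1 && [ -s "$PAIRED" ]; then
    # Sort by coordinate (default order) and write the result to SORTED
    samtools sort -o "$SORTED" "$PAIRED"
    # Check that the sorted file has been created
    ls -lh "$SORTED"
    # Remove the unsorted file only if the sorted one is not empty
    if [ -s "$SORTED" ]; then
        rm "$PAIRED"
    fi
else
    echo "samtools or $PAIRED not found: nothing was sorted"
fi
```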

## <span style="color: #4CACBC;"> Indexing bam file<a class="anchor" id="indexbam"></span>   
    
* Create the index of the bam file just created

In [None]:
samtools index ...

* List the content of the directory `2-MAPPING` and check that the index file has been created

In [None]:
ls -lrt

## <span style="color: #4CACBC;"> PRACTICE IV - Mapping ON ALL SAMPLES<a class="anchor" id="loop"></span>   
    
Let's map the data from all clones using a loop.

In [None]:
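cd ~/2-MAPPING

for i in {1..10}
do
    # add here the commands to run for Clone$i (mapping, conversion, filtering, sorting, indexing)
    echo "Processing Clone${i}"
done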
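
One possible way to fill such a loop is sketched below. The fastq naming scheme (`Clone1_R1.fastq.gz`, `Clone1_R2.fastq.gz`), the paths and the `DRY_RUN` switch are assumptions made for illustration; check the real file names in `SHORT_READS` before running it. With `DRY_RUN=1` (the default here) the function only prints the mapping commands. Output files are written to the current directory.

```shell
# Assumed layout and file names (hypothetical, adapt them to your data)
REF="${REF:-$HOME/SV_DATA/REF/reference.fasta}"
READS_DIR="${READS_DIR:-$HOME/SV_DATA/SHORT_READS}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

map_all_clones() {
    local i sample r1 r2 rg
    for i in {1..10}; do
        sample="Clone${i}"
        r1="$READS_DIR/${sample}_R1.fastq.gz"
        r2="$READS_DIR/${sample}_R2.fastq.gz"
        rg="@RG\tID:${sample}\tSM:${sample}"
        if [ "$DRY_RUN" = "1" ]; then
            printf '%s\n' "bwa-mem2 mem -R '$rg' $REF $r1 $r2 > $sample.sam"
        else
            # Map, keep properly paired reads, sort, index, then drop the sam file
            bwa-mem2 mem -R "$rg" "$REF" "$r1" "$r2" > "$sample.sam"
            samtools view -b -f 0x2 "$sample.sam" \
                | samtools sort -o "$sample.mappedpaired.sorted.bam" -
            samtools index "$sample.mappedpaired.sorted.bam"
            rm "$sample.sam"
        fi
    done
}

map_all_clones
```

Printing the commands first makes it easy to check the file names for every clone before launching the real run with `DRY_RUN=0`.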
