# <span style="color:green">Abidjan Training 2023</span> - Introduction to MinION data analysis for viral metagenome analysis

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD), A. Comte (PHIM-IRD) and E. Tibiri (WAVE-INERA)
September 2023

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP2 - DATA CLEANING](#tp2) 

[1. Remove the host reads](#host)

   * [1.1 Downloading the pineapple reference genome](#pineapple)
   * [1.2 Mapping the reads on the reference genome](#mappingpineapple)
   * [1.3 Separate reads from pineapple and reads not from pineapple](#filterpineapple) 

[2. Remove other organisms reads](#clearother)
   * [2.1 Remove fungi ](#fungi)
   * [2.2 Remove bacteria](#bactoche)

</span>

***

# <span style="color:#006E7F">__TP2 - DATA CLEANING__ <a class="anchor" id="tp2"></span>  


To save time and resources, it is better to clean your data before launching heavy analyses.

In our case, we are only interested in the viral reads in the dataset. To lighten the dataset, we can therefore:
- remove the host reads (pineapple reads)
- remove the fungal reads
- remove the bacterial reads
- remove the human reads
- ...

To do so, we map the reads to a reference genome or to a given collection of genomes, then remove from the dataset the reads that map to these genomes.

## <span style="color: #4CACBC;"> 1. Remove the host reads<a class="anchor" id="host"> </span>

### <span style="color: #4CACBD;"> 1.1 Downloading the pineapple reference genome  <a class="anchor" id="pineapple"></span>

To clean the dataset, you first need to download the reference genome of the host (pineapple).

These data come from [NCBI](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_001540865.1/).

In [None]:
cd ~/work/SG-ONT-2023/DATA

# download reference genome of the pineapple
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2023/GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta
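
Once the download has finished, a quick sanity check can confirm that the FASTA file looks complete. The sketch below is only an illustration: it builds a tiny file with the hypothetical name `toy_reference.fasta`, then counts its sequences and bases with `grep` and `awk`. The same two commands could be pointed at the downloaded genome instead.

```shell
# Build a tiny FASTA file for illustration (toy_reference.fasta is a hypothetical name)
cat > toy_reference.fasta <<'EOF'
>chr1 toy sequence
ACGTACGTAC
GGGTTT
>chr2 toy sequence
TTTTAAAA
EOF

# Number of sequences = number of header lines starting with ">"
n_seq=$(grep -c '^>' toy_reference.fasta)

# Total number of bases = summed length of all non-header lines
n_bases=$(awk '!/^>/ {n += length($0)} END {print n}' toy_reference.fasta)

echo "sequences: $n_seq"
echo "bases: $n_bases"
```

A sequence count of zero, or a base count far below the expected genome size, would suggest an interrupted download.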

### <span style="color: #4CACBD;"> 1.2 Mapping the reads on the reference genome <a class="anchor" id="mappingpineapple"></span>

Minimap2 is a fast sequence mapping and alignment program that can find overlaps between long noisy reads, or map long reads or their assemblies to a reference genome, optionally with detailed alignment (i.e. CIGAR).

In [None]:
minimap2 --help

In [None]:
# create output directory
mkdir -p ~/work/SG-ONT-2023/CLEANING
cd ~/work/SG-ONT-2023/CLEANING

# map the fastq files against "Ananas_comosus_cultivar_F153.fasta" and create a "reads_vs_ananas.sam" file
minimap2 -ax map-ont ......

Check the number of mapped reads with samtools:

In [None]:
samtools flagstat reads_vs_ananas.sam

What is the percentage of reads mapping on the pineapple genome in this dataset?
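
`samtools flagstat` counts alignment records, so secondary and supplementary alignments are included in its totals. As a rough cross-check at the read level, the sketch below counts primary records only and tests the `0x4` (unmapped) bit of the FLAG column, which is described in the FLAG table of section 1.3. The file `toy_mapping.sam` and its contents are invented for illustration, and the approach assumes an uncompressed SAM file.

```shell
# Build a tiny SAM file; spaces are converted to the TABs that SAM requires
# (toy_mapping.sam is a hypothetical name, the records are made up)
cat <<'EOF' | tr ' ' '\t' > toy_mapping.sam
@HD VN:1.6 SO:unsorted
@SQ SN:chr1 LN:1000
read1 0 chr1 10 60 8M * 0 0 ACGTACGT IIIIIIII
read2 4 * 0 0 * * 0 0 GGGGCCCC IIIIIIII
read3 16 chr1 200 60 8M * 0 0 TTTTAAAA IIIIIIII
read3 2064 chr1 500 30 4H4M * 0 0 AAAA IIII
read4 4 * 0 0 * * 0 0 CCCCGGGG IIIIIIII
EOF

# Print: mapped primary reads, total primary reads, percentage mapped
result=$(awk -F'\t' '
  /^@/ { next }                                          # skip header lines
  { flag = $2 }
  int(flag / 256) % 2 || int(flag / 2048) % 2 { next }   # skip secondary and supplementary records
  { total++; if (int(flag / 4) % 2 == 0) mapped++ }      # bit 0x4 unset means the read is mapped
  END { printf "%d %d %.1f\n", mapped, total, 100 * mapped / total }
' toy_mapping.sam)

echo "mapped total percent: $result"
```

The integer division and modulo stand in for a bitwise AND, which keeps the one-liner portable across `awk` implementations.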

### <span style="color: #4CACBD;"> SAM format </span>

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments.
Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information. The eleven fields are always present and in the order shown below:

| Col | Field | Type | Regexp/Range | Brief description |
| --- | --- | --- | --- | --- |
| 1 | QNAME | String | `[!-?A-~]{1,254}` | Query template NAME |
| 2 | FLAG | Int | [0, 2<sup>16</sup> − 1] | bitwise FLAG |
| 3 | RNAME | String | `\*\|[:rname:^*=][:rname:]*` | Reference sequence NAME |
| 4 | POS | Int | [0, 2<sup>31</sup> − 1] | 1-based leftmost mapping POSition |
| 5 | MAPQ | Int | [0, 2<sup>8</sup> − 1] | MAPping Quality |
| 6 | CIGAR | String | `\*\|([0-9]+[MIDNSHPX=])+` | CIGAR string |
| 7 | RNEXT | String | `\*\|=\|[:rname:^*=][:rname:]*` | Reference name of the mate/next read |
| 8 | PNEXT | Int | [0, 2<sup>31</sup> − 1] | Position of the mate/next read |
| 9 | TLEN | Int | [−2<sup>31</sup> + 1, 2<sup>31</sup> − 1] | observed Template LENgth |
| 10 | SEQ | String | `\*\|[A-Za-z=.]+` | segment SEQuence |
| 11 | QUAL | String | `[!-~]+` | ASCII of Phred-scaled base QUALity+33 |

More information is available in the SAM specification: https://samtools.github.io/hts-specs/SAMv1.pdf
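
To make the layout concrete, the sketch below writes a two-record SAM file and pulls out the first six mandatory fields with `awk`. The file name `toy_fields.sam` and the records are invented for illustration. Header lines can be recognised by their leading `@` because the QNAME pattern above does not allow that character.

```shell
# Build a tiny SAM file; spaces are converted to the TABs that SAM requires
# (toy_fields.sam is a hypothetical name, the records are made up)
cat <<'EOF' | tr ' ' '\t' > toy_fields.sam
@HD VN:1.6 SO:unsorted
@SQ SN:chr1 LN:1000
read1 0 chr1 10 60 8M * 0 0 ACGTACGT IIIIIIII
read2 4 * 0 0 * * 0 0 GGGGCCCC IIIIIIII
EOF

# Header lines start with "@", alignment lines do not
n_header=$(grep -c '^@' toy_fields.sam)
n_align=$(grep -vc '^@' toy_fields.sam)

# Name the first six mandatory fields of each alignment line and report the field count
fields=$(awk -F'\t' '!/^@/ {
  printf "QNAME=%s FLAG=%s RNAME=%s POS=%s MAPQ=%s CIGAR=%s (fields: %d)\n", $1, $2, $3, $4, $5, $6, NF
}' toy_fields.sam)

echo "header lines: $n_header, alignment lines: $n_align"
echo "$fields"
```

Real alignment lines often carry optional tags after column 11, so the field count reported for a mapper's output is usually larger than 11.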

### Best practice is to convert SAM to BAM using `samtools` to save disk space.

`samtools view -b aln.sam > aln.bam`

### <span style="color: #4CACBD;"> 1.3 Separate reads from pineapple and reads not from pineapple <a class="anchor" id="filterpineapple"></span>

A SAM file can be filtered with samtools using a combination of **bitwise FLAGs**:

| Bit | Description |
| --- | --- |
| 1 0x1 | template having multiple segments in sequencing |
| 2 0x2 | each segment properly aligned according to the aligner |
| 4 0x4 | segment unmapped |
| 8 0x8 | next segment in the template unmapped |
| 16 0x10 | SEQ being reverse complemented |
| 32 0x20 | SEQ of the next segment in the template being reverse complemented |
| 64 0x40 | the first segment in the template |
| 128 0x80 | the last segment in the template |
| 256 0x100 | secondary alignment |
| 512 0x200 | not passing filters, such as platform/vendor quality controls |
| 1024 0x400 | PCR or optical duplicate |
| 2048 0x800 | supplementary alignment |

To separate mapped reads from unmapped reads, we use samtools and the flag 4:

In [None]:
# go to the cleaning directory

cd ~/work/SG-ONT-2023/CLEANING

In [None]:
# extract mapped reads (pineapple reads):
# note: -b writes BAM output, so the files below are BAM despite their .sam extension

samtools view -@ 4 -bh WHAT_FLAG? reads_vs_ananas.sam > reads_vs_ananas_mapped.sam

# unmapped reads (all reads except pineapple):

samtools view -@ 4 -bh WHAT_FLAG? reads_vs_ananas.sam > reads_vs_ananas_unmapped.sam

In [None]:
# convert unmapped.sam file to fastq  using `samtools fastq`

samtools fastq reads_vs_ananas_unmapped.sam > reads_vs_ananas_unmapped.fastq

# convert mapped.sam file to fastq  using `samtools fastq`

samtools fastq reads_vs_ananas_mapped.sam > reads_vs_ananas_mapped.fastq

#### How many reads remain in the dataset once the pineapple reads have been filtered out?

In [None]:
awk '{s++}END{print s/4}' reads_vs_ananas_unmapped.fastq

#### How many reads were filtered out (i.e. how many pineapple reads)?

In [None]:
awk '{s++}END{print s/4}' reads_vs_ananas_mapped.fastq
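
The `s/4` count above relies on every FASTQ record spanning exactly four lines, which is the layout `samtools fastq` writes. Under the same assumption, the sketch below extends the idea to the total number of bases and the mean read length, as a cross-check for dedicated tools such as `seqtk`. The file `toy_reads.fastq` and its reads are invented for illustration.

```shell
# Build a tiny FASTQ file (toy_reads.fastq is a hypothetical name, the reads are made up)
cat > toy_reads.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGT
+
IIII
@read3
ACGTACGTACGT
+
IIIIIIIIIIII
EOF

# The sequence is the 2nd line of each 4-line record:
# count the records, sum the sequence lengths, and derive the mean read length
stats=$(awk 'NR % 4 == 2 { reads++; bases += length($0) }
             END { printf "%d %d %.1f\n", reads, bases, bases / reads }' toy_reads.fastq)

echo "reads bases mean_length: $stats"
```

Run on the mapped and unmapped FASTQ files in turn, the same command would give the figures needed to compare pineapple and non-pineapple reads.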

#### How many nucleic bases were mapped (using `seqtk`)?

#### What is the mean size of pineapple reads?

#### What is the proportion of pineapple reads relative to the unmapped reads?

#### What is the mean size of the reads that are not from the pineapple genome?

## <span style="color: #4CACBC;"> 2. Remove other organisms reads <a class="anchor" id="clearother"></span>

Here are some examples of reads from other organisms that you can remove, and how to remove them.
This step requires a lot of resources because of the size of the genomic databases.

<span style="color:red">**WARNING : Please do not run any of the steps below.**</span> 


### <span style="color: #4CACBC;"> 2.1 Remove fungi <a class="anchor" id="fungi"></span>

#### Download the fungal genomic database <span style="color:red"> (Don't run it! ) </span>

#### Map the reads on the fungal genomic database and remove the fungal reads

In [None]:
minimap2 -ax map-ont --split-prefix=tmp fungi.genomic.fna.gz reads_vs_ananas_unmapped.fastq  > reads_vs_fungi.sam

In [None]:
# extract mapped reads (fungal reads):
# note: -b writes BAM output, so the files below are BAM despite their .sam extension

samtools view -@ 4 -bh -F 4 reads_vs_fungi.sam > reads_vs_fungi_mapped.sam

# unmapped reads (all reads except fungal reads):

samtools view -@ 4 -bh -f 4 reads_vs_fungi.sam > reads_vs_fungi_unmapped.sam

In [None]:
# sam to fastq

samtools fastq reads_vs_fungi_unmapped.sam > reads_vs_fungi_unmapped.fastq

### <span style="color: #4CACBC;"> 2.2 Remove bacteria <a class="anchor" id="bactoche"></span>

#### A similar protocol can be applied with a bacterial genomic database <span style="color:red"> (Don't run it! ) </span> 

This is the genomic bank we used for bacteria