# <span style="color:green">Training in Burkina Faso 2022</span> - Introduction to MinION data analysis for viral metagenome analysis

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD) and A. Comte (PHIM-IRD) 

September 2022

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP2 - DATA CLEANING](#tp2) 

[1. Remove the host reads](#host)

   * [1.1 Downloading the pineapple reference genome](#pineapple)
   * [1.2 Mapping the reads on the reference genome](#mappingpineapple)
   * [1.3 Separate reads from pineapple and reads not from pineapple](#filterpineapple) 

[2. Remove reads from other organisms](#clearother)
   * [2.1 Remove fungi](#fungi)
   * [2.2 Remove bacteria](#bactoche)


***

# <span style="color:#006E7F">__TP2 - DATA CLEANING__ <a class="anchor" id="tp2"></span>  


To save time and resources, it is better to clean your data before launching heavy analyses.

In our case, we are interested only in the viral reads in the dataset. To lighten the dataset, we can therefore:
- remove the host reads (pineapple reads)
- remove the fungi reads
- remove the bacteria reads
- remove the human reads
- ...

To do so, we map the reads to a reference genome or to a given bank of genomes, then remove from the dataset the reads that mapped to these genomes.

## <span style="color: #4CACBC;"> 1. Remove the host reads<a class="anchor" id="host"> </span>

### <span style="color: #4CACBD;"> 1.1 Downloading the pineapple reference genome <a class="anchor" id="pineapple"></span>

To clean the dataset, you need to download the reference genome of the host (pineapple).

These data come from [NCBI](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_001540865.1/).

In [1]:
cd ~/work/SG-ONT-2022/DATA

# download reference genome of the pineapple
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2022/GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta

--2022-09-05 21:11:05--  https://itrop.ird.fr/ont-training-2022/GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 316005678 (301M)
Saving to: ‘GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta’


2022-09-05 21:11:10 (56.3 MB/s) - ‘GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta’ saved [316005678/316005678]

FINISHED --2022-09-05 21:11:10--
Total wall clock time: 5.6s
Downloaded: 1 files, 301M in 5.4s (56.3 MB/s)
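
Once the download has finished, a quick sanity check is to count the sequences and bases in the FASTA file. The sketch below runs on a tiny stand-in file (`toy.fasta`, a hypothetical name) so it can be tried without the 301M genome; on the real reference, the sequence count should agree with the number of target sequences that minimap2 reports when it builds its index.

```shell
# Build a tiny stand-in FASTA (hypothetical data); replace it with the downloaded reference
tmpdir=$(mktemp -d)
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$tmpdir/toy.fasta"

# Number of sequences = number of header lines starting with ">"
n_seq=$(grep -c '^>' "$tmpdir/toy.fasta")

# Total number of bases = summed length of all non-header lines
n_bases=$(awk '!/^>/ {n += length($0)} END {print n}' "$tmpdir/toy.fasta")

echo "sequences: $n_seq, bases: $n_bases"
# → sequences: 2, bases: 16

rm -rf "$tmpdir"
```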


### <span style="color: #4CACBD;"> 1.2 Mapping the reads on the reference genome <a class="anchor" id="mappingpineapple"></span>

Minimap2 is a fast sequence mapping and alignment program. It can find overlaps between long noisy reads, or map long reads or their assemblies to a reference genome, optionally with detailed alignment.

In [2]:
minimap2 --help

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
  Indexing:
    -H           use homopolymer-compressed k-mer (preferrable for PacBio)
    -k INT       k-mer size (no larger than 28) [15]
    -w INT       minimizer window size [10]
    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE []
  Mapping:
    -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]
    -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]
    -G NUM       max intron length (effective with -xsplice; changing -r) [200k]
    -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]
    -r NUM       bandwidth used in chaining and DP-based alignment [500]
    -n INT       minimal number of minimizers on a chain [3]
    -m INT       minimal chaining score (matching bases minus log gap penalty) [40]
    -X           skip self and dual mappings (for the all-vs-all mode)
    

In [5]:
# create output directory
mkdir -p ~/work/SG-ONT-2022/CLEANING
cd ~/work/SG-ONT-2022/CLEANING

# Mapping
minimap2 -ax map-ont  ../DATA/GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta ../DATA/data.fastq > reads_vs_ananas.sam

[M::mm_idx_gen::9.156*1.79] collected minimizers
[M::mm_idx_gen::11.975*2.07] sorted minimizers
[M::main::11.975*2.07] loaded/built the index for 26 target sequence(s)
[M::mm_mapopt_update::12.811*2.00] mid_occ = 222
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 26
[M::mm_idx_stat::13.361*1.96] distinct minimizers: 28562326 (70.71% are singletons); average occurrences: 2.064; average spacing: 5.360
[M::worker_pipeline::251.579*2.95] mapped 793608 sequences
[M::worker_pipeline::309.293*2.95] mapped 112482 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax map-ont ../DATA/GCA_001540865.1_Ananas_comosus_cultivar_F153.fasta ../DATA/data.fastq
[M::main] Real time: 309.364 sec; CPU: 913.049 sec; Peak RSS: 3.231 GB


Check the number of mapped reads with samtools:

In [6]:
samtools flagstat reads_vs_ananas.sam

1286838 + 0 in total (QC-passed reads + QC-failed reads)
86917 + 0 secondary
293831 + 0 supplementary
0 + 0 duplicates
880748 + 0 mapped (68.44% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


What is the percentage of reads mapping on the pineapple genome in this dataset?
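
One point to keep in mind when answering: the 68.44% printed by flagstat is computed over alignment records, and a single long read can produce several records (one primary plus secondary and supplementary ones). A minimal sketch of the arithmetic per read, using only the counts from the output above (the variable names are ours):

```shell
# Counts copied from the flagstat output above
total=1286838          # all alignment records
secondary=86917
supplementary=293831
mapped=880748          # mapped records, including secondary and supplementary

# Each read has exactly one primary record, so subtract the extra records
primary_total=$((total - secondary - supplementary))
primary_mapped=$((mapped - secondary - supplementary))
unmapped=$((primary_total - primary_mapped))

# Percentage of reads (not records) that map to the pineapple genome
pct=$(awk -v m="$primary_mapped" -v t="$primary_total" 'BEGIN {printf "%.2f", 100 * m / t}')

echo "reads: $primary_total, mapped: $primary_mapped, unmapped: $unmapped, mapped %: $pct"
# → reads: 906090, mapped: 500000, unmapped: 406090, mapped %: 55.18
```

The total of 906090 reads matches the two batches reported in the minimap2 log (793608 + 112482).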

### <span style="color: #4CACBD;"> SAM format </span>

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments.
Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information. The eleven fields are always present and in the order shown below:

| Col | Field | Type | Regexp/Range | Brief description |
| --- | --- | --- | --- | --- |
| 1 | QNAME | String | `[!-?A-~]{1,254}` | Query template NAME |
| 2 | FLAG | Int | [0, 2<sup>16</sup> − 1] | bitwise FLAG |
| 3 | RNAME | String | `\*\|[:rname:^*=][:rname:]*` | Reference sequence NAME |
| 4 | POS | Int | [0, 2<sup>31</sup> − 1] | 1-based leftmost mapping POSition |
| 5 | MAPQ | Int | [0, 2<sup>8</sup> − 1] | MAPping Quality |
| 6 | CIGAR | String | `\*\|([0-9]+[MIDNSHPX=])+` | CIGAR string |
| 7 | RNEXT | String | `\*\|=\|[:rname:^*=][:rname:]*` | Reference name of the mate/next read |
| 8 | PNEXT | Int | [0, 2<sup>31</sup> − 1] | Position of the mate/next read |
| 9 | TLEN | Int | [−2<sup>31</sup> + 1, 2<sup>31</sup> − 1] | observed Template LENgth |
| 10 | SEQ | String | `\*\|[A-Za-z=.]+` | segment SEQuence |
| 11 | QUAL | String | `[!-~]+` | ASCII of Phred-scaled base QUALity+33 |

More information: https://samtools.github.io/hts-specs/SAMv1.pdf
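
Because the fields are TAB-separated and always in the same order, an alignment line can be inspected with standard text tools. Here is a small sketch on a hand-written record; the read name, contig name and values are hypothetical and only serve to illustrate the column positions.

```shell
# A hand-written SAM record (hypothetical read and contig names), fields separated by TABs
sam_line=$(printf 'read1\t16\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII')

# Number of TAB-separated fields (11 mandatory fields are expected)
n_fields=$(printf '%s\n' "$sam_line" | awk -F '\t' '{print NF}')

# Print a few of the mandatory fields by column number
summary=$(printf '%s\n' "$sam_line" |
  awk -F '\t' '{print "QNAME=" $1, "FLAG=" $2, "RNAME=" $3, "POS=" $4, "MAPQ=" $5, "CIGAR=" $6}')

echo "$n_fields fields"
echo "$summary"
# → 11 fields
# → QNAME=read1 FLAG=16 RNAME=chr1 POS=101 MAPQ=60 CIGAR=8M
```

On a real file, header lines (starting with `@`) would need to be skipped first, for example with an awk pattern such as `!/^@/`.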

### Best practice is to convert SAM to BAM using `samtools` to save disk space.

`samtools view -b aln.sam > aln.bam`

### <span style="color: #4CACBD;"> 1.3 Separate reads from pineapple and reads not from pineapple <a class="anchor" id="filterpineapple"></span>

A SAM file can be filtered with samtools using combinations of **bitwise FLAGs**:

| Decimal | Hex | Description |
| --- | --- | --- |
| 1 | 0x1 | template having multiple segments in sequencing |
| 2 | 0x2 | each segment properly aligned according to the aligner |
| 4 | 0x4 | segment unmapped |
| 8 | 0x8 | next segment in the template unmapped |
| 16 | 0x10 | SEQ being reverse complemented |
| 32 | 0x20 | SEQ of the next segment in the template being reverse complemented |
| 64 | 0x40 | the first segment in the template |
| 128 | 0x80 | the last segment in the template |
| 256 | 0x100 | secondary alignment |
| 512 | 0x200 | not passing filters, such as platform/vendor quality controls |
| 1024 | 0x400 | PCR or optical duplicate |
| 2048 | 0x800 | supplementary alignment |

To separate mapped reads and unmapped reads, we use samtools and the flag 4:

In [7]:
# go to the cleaning directory

cd ~/work/SG-ONT-2022/CLEANING

In [8]:
# extract mapped reads (pineapple reads); -b writes BAM, hence the .bam extension:

samtools view -@ 4 -bh -F 4 reads_vs_ananas.sam > reads_vs_ananas_mapped.bam

# unmapped reads (all reads except pineapple):

samtools view -@ 4 -bh -f 4 reads_vs_ananas.sam > reads_vs_ananas_unmapped.bam

In [9]:
# BAM to FASTQ

samtools fastq reads_vs_ananas_unmapped.bam > reads_vs_ananas_unmapped.fastq
samtools fastq reads_vs_ananas_mapped.bam > reads_vs_ananas_mapped.fastq

[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 406090 reads
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 500000 reads
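
The `-f 4` and `-F 4` options used above rely on a bitwise test: `-f 4` keeps a record when bit 4 is set in its FLAG, and `-F 4` keeps it when that bit is not set. The sketch below decodes a FLAG value in plain shell arithmetic to show the principle; `decode_flag` and the short bit names are our own illustration, not part of samtools (which provides a `samtools flags` subcommand for the same purpose).

```shell
# Print the names of the bits set in a FLAG value (names abbreviated from the FLAG table)
decode_flag() {
  flag=$1
  out=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped 16:reverse \
              32:mate_reverse 64:first 128:last 256:secondary 512:qc_fail \
              1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}
    name=${pair#*:}
    # A bit is set when the bitwise AND of FLAG and the bit value is non-zero
    if [ $((flag & bit)) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
  done
  echo "${out:-none}"
}

decode_flag 4      # → unmapped               (kept by -f 4, dropped by -F 4)
decode_flag 2064   # → reverse,supplementary  (2048 + 16, kept by -F 4)
decode_flag 0      # → none                   (mapped on the forward strand)
```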


#### How many reads remain in the dataset once the pineapple reads are filtered out?

In [10]:
awk '{s++}END{print s/4}' reads_vs_ananas_unmapped.fastq

406090


#### How many reads were filtered out (how many pineapple reads)?

In [11]:
awk '{s++}END{print s/4}' reads_vs_ananas_mapped.fastq

500000


#### How many bases were mapped?

In [12]:
seqtk seq -A reads_vs_ananas_mapped.fastq | grep -v ">" | wc -m

420673470
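
Note that `wc -m` counts every character, including the newline that ends each sequence line. Assuming seqtk writes each sequence on a single line (we believe this is its default), the figure above includes one extra character per read, so the base count would be 420673470 - 500000 = 420173470. As an alternative sketch, awk can count reads and bases in one pass without counting newlines; it is shown here on a toy two-read FASTQ (hypothetical data) and assumes the usual 4-line FASTQ records.

```shell
# Toy FASTQ with two reads (hypothetical data) so the command can be tried anywhere
tmpdir=$(mktemp -d)
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nGGCC\n+\nIIII\n' > "$tmpdir/toy.fastq"

# In a 4-line FASTQ record the sequence is line 2, so NR % 4 == 2 selects it;
# length($0) does not include the trailing newline
stats=$(awk 'NR % 4 == 2 {reads++; bases += length($0)}
             END {printf "%d reads, %d bases, mean length %.1f", reads, bases, bases / reads}' \
        "$tmpdir/toy.fastq")

echo "$stats"
# → 2 reads, 10 bases, mean length 5.0

rm -rf "$tmpdir"
```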


#### What is the proportion of pineapple reads compared with the unmapped reads?

## <span style="color: #4CACBC;"> 2. Remove reads from other organisms <a class="anchor" id="clearother"></span>

Here are some examples of reads from other organisms that you can remove, and how to remove them.
This step demands a lot of resources because of the size of the genomic banks.

<span style="color:red">**WARNING: Please do not run any of the steps below.**</span> 


### <span style="color: #4CACBC;"> 2.1 Remove fungi <a class="anchor" id="fungi"></span>

#### Download fungi genomic bank <span style="color:red"> (Don't run it!) </span>

#### Mapping of reads on the fungi genomic bank + remove fungi reads
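
The original commands for this step are not reproduced here. As a hedged sketch only, the approach from section 1 (map with minimap2, then keep the records with FLAG bit 4 set) could be wrapped in a small function. The function name `remove_reads_mapping_to` and the file names in the usage comment are hypothetical, and the block only defines the function without running any mapping.

```shell
# remove_reads_mapping_to REF READS OUT
# Map READS on REF and keep only the reads that did not map (FLAG bit 4 set).
# Requires minimap2 and samtools; all file names are supplied by the caller.
remove_reads_mapping_to() {
  ref=$1
  reads=$2
  out=$3
  minimap2 -ax map-ont "$ref" "$reads" | samtools fastq -f 4 - > "$out"
}

# Hypothetical usage (not to be run during the training, as stated above):
# remove_reads_mapping_to fungi_bank.fasta reads_vs_ananas_unmapped.fastq reads_no_fungi.fastq
```

This sketch assumes the bank fits in a single minimap2 index batch (the `-I` option in the help above, 4G by default). For larger banks minimap2 splits the index, and it is worth checking how unmapped reads are reported in that case before relying on this filter.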

### <span style="color: #4CACBC;"> 2.2 Remove bacteria <a class="anchor" id="bactoche"></span>

#### Download bacterial genomic bank <span style="color:red"> (Don't run it!) </span>

#### Mapping of reads on the bacterial genomic bank + remove bacterial reads