# <span style="color:green">Training in Burkina Faso 2022</span> - Introduction to MinION data analysis for viral metagenome analysis

# __DAY 1: How to map reads against a reference genome?__

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD) and A. Comte (PHIM-IRD) 

September 2022

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP1 - BASECALLING and QC](#data) 

[2. Basecalling](#guppy)

   * [2.1 Basecalling with `guppy`](#guppy)
    

[3. Quality Control on Long Reads](#qc)
   * [3.1 Quality Control of FASTQ with `NanoPlot`](#nanoplot)
   * [3.2 Compare reads QC statistics with `NanoComp`](#nanocomp)


[4. BONUS](#bonus)


***




### Our objectives in this TP are:
- explore the diversity of the pineapple metavirome.
- reconstruct the complete genome sequence of a novel member of the genus Vitivirus in the family Betaflexiviridae (subfamily Trivirinae) infecting pineapple.

# <span style="color:#006E7F">__TP1 - BASECALLING and QC__</span> <a class="anchor" id="data"></a>
    
## <span style="color: #4CACBC;"> 1. Creating the working folder and downloading the data</span>  

### <span style="color: #4CACBD;">  Data</span>
    

Before starting, please download the dataset prepared for this practical training. The data are available from the I-Trop server.

The dataset comes from total RNA extracted from pineapple leaf samples collected in Reunion Island. Nanopore sequencing was performed using a MinION portable device and the cDNA-PCR Barcoding kit.
The original dataset contained more than 4M reads. We subsampled it for this training.

In [None]:
mkdir -p ~/work/SG-ONT-2022/DATA
cd ~/work/SG-ONT-2022/DATA

# download the sample data (already basecalled)
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2022/data.fastq
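
After a download, it can be worth checking that the file is a well-formed FASTQ before going further. The sketch below is not part of the original course material: it defines a hypothetical helper, `check_fastq`, that verifies the four-line record structure, and demonstrates it on a small toy file rather than on `data.fastq`.

```shell
# Hypothetical helper: succeeds if the file looks like a 4-line-per-record FASTQ
check_fastq() {
    awk '
        NR % 4 == 1 && !/^@/  { bad = 1 }   # header line must start with "@"
        NR % 4 == 3 && !/^\+/ { bad = 1 }   # separator line must start with "+"
        END { exit (bad || NR == 0 || NR % 4 != 0) ? 1 : 0 }
    ' "$1"
}

# Toy file with two reads (illustrative data only)
toy=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n' > "$toy"

if check_fastq "$toy"; then status="valid"; else status="invalid"; fi
echo "toy FASTQ is $status"
```

On the course data, the same helper would be called as `check_fastq ~/work/SG-ONT-2022/DATA/data.fastq`.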

## <span style="color: #4CACBC;"> 2. Basecalling</span>  <a class="anchor" id="guppy"></a>

When DNA molecules are sequenced, the electrical signals are stored in FAST5 files.

These signals must be converted into standard FASTQ files before downstream analysis.

Several pre-trained models are available to perform this FAST5-to-FASTQ conversion.

### <span style="color: #4CACBC;"> 2.1 Basecalling with Guppy</span>


Guppy is a data processing toolkit that contains the 'Oxford Nanopore Technologies' basecalling algorithms, and several bioinformatic post-processing features.

Basecalling is launched with the `guppy_basecaller` command.

Guppy takes raw FAST5 read files and converts the electrical signal into FASTQ files.

We recommend basecalling your dataset on a GPU to obtain results quickly.

Documentation on how to run Guppy on the I-Trop GPU can be found at https://bioinfo.ird.fr/index.php/tutorials-fr/gpu/


In [None]:
#To see all the documentation of guppy:
  guppy_basecaller --help

To run Guppy, you need to choose your configuration file according to the flowcell and the kit you used for sequencing.

In [None]:
#List supported flowcells and kits:
  guppy_basecaller --print_workflows

There are 3 types of config files:
- sup (super accuracy): highest accuracy, very slow
- hac (high accuracy): intermediate accuracy, moderate resources needed
- fast: lowest accuracy, very fast
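
The three model types map onto configuration file names. As a sketch, assuming the same chemistry prefix as in the basecalling command of this section (`dna_r9.4.1_450bps`), the corresponding file names can be built as follows. Check them against the output of `guppy_basecaller --print_workflows` for your own flowcell and kit.

```shell
# Assumed chemistry prefix; adapt it to your flowcell and kit
prefix="dna_r9.4.1_450bps"

configs=""
for model in fast hac sup; do
    cfg="${prefix}_${model}.cfg"
    echo "$model -> $cfg"
    configs="$configs $cfg"
done
```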

Guppy can also demultiplex reads and trim adapters or barcodes.

#### Basecalling command line <span style="color:red"> (Don't run it!) </span>

In [None]:
guppy_basecaller -c dna_r9.4.1_450bps_sup.cfg -i fast5/ -r -s output \
    --num_callers 4 --gpu_runners_per_device 8 --min_qscore 6 --device cuda:2 \
    --trim_adapters --detect_adapter --detect_mid_strand_adapter \
    --pt_scaling --do_read_splitting

## <span style="color: #4CACBC;"> 3. Quality Control on Long Reads</span>  <a class="anchor" id="qc"></a>

First, quantify the data. Move to the data folder:

In [None]:
cd ~/work/SG-ONT-2022/DATA
pwd

Count how many reads are in the FASTQ file (each read occupies four lines):

In [None]:
awk '{s++}END{print s/4}' data.fastq
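
Counting lines and dividing by four is safer than counting the lines that start with `@`, because `@` is also a valid quality character and can open a quality line. The toy example below (illustrative data, not taken from the course dataset) shows the difference.

```shell
# Toy FASTQ: the quality line of read1 starts with "@"
toy=$(mktemp)
printf '@read1\nACGT\n+\n@III\n@read2\nGGCC\n+\nIIII\n' > "$toy"

n_awk=$(awk 'END { print NR / 4 }' "$toy")   # line count divided by 4
n_grep=$(grep -c '^@' "$toy")                # lines starting with "@"

echo "awk count:  $n_awk"    # 2 reads (correct)
echo "grep count: $n_grep"   # 3, one quality line is counted as a read
```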

Count how many bases were sequenced:

In [None]:
# tr removes the newlines so that wc -m counts only bases
seqtk seq -A data.fastq | grep -v "^>" | tr -d '\n' | wc -m
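
If `seqtk` is not available, the total can also be obtained with `awk` alone by summing the length of every sequence line (the second line of each record). This is an alternative sketch shown on toy data. It assumes a standard four-line FASTQ and also reports the mean read length.

```shell
# Toy file with two reads of 4 and 6 bases (illustrative data only)
toy=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCAA\n+\nIIIIII\n' > "$toy"

# Sum the length of line 2 of each 4-line record
stats=$(awk 'NR % 4 == 2 { bases += length($0); reads++ }
             END { printf "%d %d %.1f", reads, bases, bases / reads }' "$toy")
echo "reads, bases, mean length: $stats"
```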

What is the sequencing depth?
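
Sequencing depth is the total number of sequenced bases divided by the length of the target genome. A minimal sketch with placeholder values follows. Both numbers are hypothetical: replace them with your own base count and the expected genome length.

```shell
total_bases=15000000   # hypothetical: total number of bases sequenced
genome_size=7500       # hypothetical: target genome length in bases

depth=$(awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.0f", b / g }')
echo "Estimated depth: ${depth}x"
```

In a metavirome, only a fraction of the reads belongs to any given virus, so a value computed from the whole dataset is an upper bound for a single genome.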


### <span style="color: #4CACBD;"> 3.1 Quality Control of FASTQ with NanoPlot </span> <a class="anchor" id="nanoplot"></a>

Check read quality with NanoPlot. Its options are listed with `--help`.

In [None]:
NanoPlot --help

Launch NanoPlot. It accepts either sequencing summary files or FASTQ files as input.

In [None]:
# create a folder to save results
mkdir -p ~/work/SG-ONT-2022/QC
cd ~/work/SG-ONT-2022/QC

In [None]:
# run NanoPlot
NanoPlot -t 1 --fastq ../DATA/data.fastq --outdir NANOPLOT

Check the statistics in the generated NanoStats file.

In [None]:
cat NANOPLOT/NanoStats.txt

* What do you think about the data?

* Estimate the coverage.

* What about the read quality (Q-score)?
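
A Phred quality score Q corresponds to a per-base error probability P = 10^(-Q/10). The sketch below converts a few scores, including the `--min_qscore 6` threshold used in the basecalling command above, to help interpret the Q-score values.

```shell
# Convert Phred scores to error probabilities: P = 10^(-Q/10)
table=$(for q in 6 10 20; do
    awk -v q="$q" 'BEGIN { printf "Q%d -> error probability %.4f\n", q, 10 ^ (-q / 10) }'
done)
echo "$table"
```

A Q-score of 10 therefore means about one error every 10 bases, and a Q-score of 20 about one error every 100 bases.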

Open NanoPlot-report.html.

To open it in Jupyter, you need to click on "Trust HTML".

* What do you conclude about this dataset?

### <span style="color: #4CACBD;">3.2 Compare reads QC statistics with NanoComp (Summaries dataset) </span> <a class="anchor" id="nanocomp"></a>

Compare long reads sequencing datasets using **NanoComp**.

NanoComp compiles quality information into an HTML report.

You can launch NanoComp using summaries or fastq files.

In [None]:
NanoComp --help

In [None]:
NanoComp --fastq ../DATA/data.fastq --outdir NANOCOMP

### Other complementary tools:

https://github.com/wdecoster/NanoPlot#companion-scripts

* NanoComp: comparing multiple runs

* NanoStat: statistic summary report of reads or alignments

* NanoFilt: filtering and trimming of reads

* NanoLyse: removing contaminant reads (e.g. lambda control DNA) from fastq

* Filtlong: filtering long reads by quality https://github.com/rrwick/Filtlong
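
To illustrate what read filtering does, the sketch below keeps only the reads above a minimum length using plain `awk`. It is a stand-in for the idea, not the NanoFilt or Filtlong implementation. It runs on toy data, and the 5-base threshold is arbitrary.

```shell
# Toy file with one short read (3 bases) and one long read (8 bases)
toy=$(mktemp)
printf '@short\nACG\n+\nIII\n@long\nACGTACGT\n+\nIIIIIIII\n' > "$toy"

min_len=5   # arbitrary threshold for the toy data
filtered=$(mktemp)

# Buffer each 4-line record and print it only if the sequence is long enough
awk -v min="$min_len" '
    { rec[NR % 4] = $0 }
    NR % 4 == 0 && length(rec[2]) >= min {
        print rec[1]; print rec[2]; print rec[3]; print rec[0]
    }' "$toy" > "$filtered"

kept=$(awk 'END { print NR / 4 }' "$filtered")
echo "reads kept: $kept"
cat "$filtered"
```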


## <span style="color: #4CACBD;">4. BONUS: relaunch QC tools on the original dataset </span> <a class="anchor" id="bonus"></a>

As described in section 1, the dataset used so far is a subsample of the original dataset.

In [None]:
# download sequencing summary of the original data

mkdir -p ~/work/SG-ONT-2022/DATA/real_summaries/
cd ~/work/SG-ONT-2022/DATA/real_summaries/
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2022/sequencing_summary.txt

Relaunch NanoComp and NanoPlot on this sequencing summary:

In [None]:
NanoComp --summary sequencing_summary.txt --outdir ~/work/SG-ONT-2022/QC/NANOCOMP_real
NanoPlot --summary sequencing_summary.txt --outdir ~/work/SG-ONT-2022/QC/NANOPLOT_real

Compare these outputs with those obtained from the subsampled data.
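
A quick numerical comparison is also possible directly from the sequencing summary, which is a tab-separated table with one line per read. The sketch below assumes the Guppy column name `sequence_length_template` and runs on a toy table whose content is illustrative only.

```shell
# Toy sequencing summary with three reads (illustrative data only)
toy=$(mktemp)
printf 'read_id\tsequence_length_template\tmean_qscore_template\n' > "$toy"
printf 'r1\t1200\t10.5\nr2\t800\t9.1\nr3\t1000\t12.0\n' >> "$toy"

# Locate the length column by name in the header, then count reads and sum bases
summary=$(awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "sequence_length_template") col = i; next }
    { reads++; bases += $col }
    END { printf "%d %d", reads, bases }' "$toy")
echo "reads, bases: $summary"
```

On the real file, replace `"$toy"` with `sequencing_summary.txt` and compare the totals with those computed on `data.fastq`.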