# <span style="color:green">Training in Burkina Faso 2022</span> - Introduction to MinION data analysis for viral metagenomes

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD) and A. Comte (PHIM-IRD) 

September 2022

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP1 - BASECALLING and QC](#data) 

[2. Basecalling](#guppy)

   * [2.1 Basecalling with `guppy`](#guppy)
    

[3. Quality Control on Long Reads](#qc)
   * [3.1 Quality Control of FASTQ with `NanoPlot`](#nanoplot)


***

### Our objectives in this practical (TP) are:
- explore the diversity of the pineapple metavirome.
- reconstruct the complete genome sequence of a novel member of the genus Vitivirus in the family Betaflexiviridae (subfamily Trivirinae) infecting pineapple.

# <span style="color:#006E7F">__TP1 - BASECALLING and QC__ <a class="anchor" id="data"></a></span>  
    
## <span style="color: #4CACBC;"> 1. Creating the working folder and downloading the data</span>  

### <span style="color: #4CACBD;">  Data</span>
    

Before starting, please download the dataset prepared for this practical training. The data are available from the I-Trop server.

The data come from total RNA extracted from pineapple leaf samples collected on Reunion Island. Nanopore sequencing was performed with a MinION portable device and the cDNA-PCR Barcoding kit.

The original dataset contained more than 4 million reads. We subsampled it for this training.

The `fast5` directory contains raw electrical signal files for the basecalling step.

`data.fastq` is a fastq file that has already been basecalled.

In [8]:
mkdir -p ~/work/SG-ONT-2022/DATA
cd ~/work/SG-ONT-2022/DATA

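# download the fast5 sample data used for the basecalling step
# (--no-check-certificate skips TLS verification; -m mirrors the folder and already implies -r)
wget --no-check-certificate -m -nH --cut-dirs=1 --no-parent --reject="index.html*" https://itrop.ird.fr/ont-training-2022/fast5/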

--2022-09-05 15:39:22--  https://itrop.ird.fr/ont-training-2022/fast5/
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2586 (2.5K) [text/html]
Saving to: ‘fast5/index.html.tmp’


Last-modified header missing -- time-stamps turned off.
2022-09-05 15:39:23 (38.3 MB/s) - ‘fast5/index.html.tmp’ saved [2586/2586]

Loading robots.txt; please ignore errors.
--2022-09-05 15:39:23--  https://itrop.ird.fr/robots.txt
Reusing existing connection to itrop.ird.fr:443.
HTTP request sent, awaiting response... 404 Not Found
2022-09-05 15:39:23 ERROR 404: Not Found.

Removing fast5/index.html.tmp since it should be rejected.

--2022-09-05 15:39:23--  https://itrop.ird.fr/ont-training-2022/fast5/?C=N;O=D
Reusing existing connection to itrop.ird.fr:443.
HTTP request sent, awaiting

In [9]:
ls ~/work/SG-ONT-2022/DATA

[0m[01;35mback.gif[0m  [01;35mblank.gif[0m  [01;34mfast5[0m  [01;35munknown.gif[0m


In [None]:
mkdir -p ~/work/SG-ONT-2022/DATA
cd ~/work/SG-ONT-2022/DATA

# download the fastq sample data, already basecalled
wget --no-check-certificate -m -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2022/data.fastq

--2022-09-05 15:44:03--  https://itrop.ird.fr/ont-training-2022/data.fastq
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1226812840 (1.1G)
Saving to: ‘data.fastq’



## <span style="color: #4CACBC;"> 2. Basecalling <a class="anchor" id="guppy"></a></span> 

When DNA molecules are sequenced, the electrical signals are stored in fast5 files.

These signals must be converted to standard fastq files for downstream analysis.

Basecallers perform this fast5-to-fastq conversion with trained models, and several models are usually available.

### <span style="color: #4CACBC;"> 2.1 Basecalling with Guppy</span>


Guppy is a data processing toolkit that contains the Oxford Nanopore Technologies basecalling algorithms and several bioinformatic post-processing features.

Basecalling is launched with the `guppy_basecaller` command.

Guppy takes raw fast5 read files and converts their electrical signals into fastq files.

## Basecalling some available fast5 files 

In [None]:
#To see all the documentation of guppy:
  guppy_basecaller --help

To run guppy, you need to choose the configuration file that matches the flowcell and the kit you used for sequencing.

In [None]:
#List supported flowcells and kits:
  guppy_basecaller --print_workflows

There are 3 types of config files:
- sup (super accuracy): highest accuracy, very slow
- hac (high accuracy): intermediate accuracy, moderate resources needed
- fast: lowest accuracy, very fast

We recommend basecalling your dataset on a GPU to obtain results quickly.

Guppy can also demultiplex reads and trim adapters or barcodes.

#### Basecall fast5 files with `guppy`

In [None]:
# your turn! Replace the XXX placeholders with the speed and model matching your run, and add an output folder with -s
guppy_basecaller -c dna_r9.4.1_XXXbps_XXX.cfg -i fast5/ 

## <span style="color: #4CACBC;"> 3. Quality Control on Long Reads <a class="anchor" id="qc"></a></span> 


Calculating data quantity

In [None]:
cd ~/work/SG-ONT-2022/DATA
pwd

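Remember how a fastq file is formatted. Each read is written in the fastq file using four lines:

* The first line is the header. It starts with `@` and holds the read identifier and information from the sequencer

* The second line is the sequence

* The third line starts with a `+` character and may optionally repeat the header

* The fourth line contains the quality score of each base, one character per base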

In [None]:
head -n 4 data.fastq
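
To make the quality line concrete, here is a minimal sketch that decodes it into numeric scores. It assumes the standard Phred+33 encoding (score = ASCII code of the character minus 33); the read name, sequence and quality string below are made up for illustration and do not come from `data.fastq`.

```shell
# Toy example: one hypothetical FASTQ record written to a temporary file.
tmp_fastq=$(mktemp)
printf '@read1 hypothetical_header\nACGT\n+\n!+5I\n' > "$tmp_fastq"

# Line 4 of each record holds one ASCII character per base.
# Assumed encoding: Phred score = ASCII code - 33.
quals=$(awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # character -> ASCII code
    NR % 4 == 0 {
        out = ""
        for (i = 1; i <= length($0); i++) {
            q = ord[substr($0, i, 1)] - 33
            out = (i == 1) ? q : out " " q
        }
        print out
    }' "$tmp_fastq")

echo "$quals"   # prints: 0 10 20 40
rm -f "$tmp_fastq"
```

A score Q corresponds to an error probability of 10^(-Q/10), so Q10 means 1 error in 10 bases and Q20 means 1 error in 100.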

Calculating how many reads are in the fastq file

In [None]:
awk '{s++}END{print s/4}' data.fastq

Calculating how many bases were sequenced using `seqtk`

In [None]:
# tr removes the newlines so that wc -m counts only the bases
seqtk seq -A data.fastq | grep -v "^>" | tr -d '\n' | wc -m
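
As a cross-check that does not depend on `seqtk`, both counts can also be obtained with `awk` alone. This is only a sketch, assuming every read occupies exactly four lines (no wrapped sequences); the toy file and its two reads are hypothetical.

```shell
# Toy FASTQ with two hypothetical reads (4 and 6 bases long).
toy=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$toy"

# Reads: total number of lines divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$toy")

# Bases: sum the length of every sequence line (line 2 of each record).
n_bases=$(awk 'NR % 4 == 2 { n += length($0) } END { print n }' "$toy")

echo "reads=$n_reads bases=$n_bases"   # prints: reads=2 bases=10
rm -f "$toy"
```

On the real data, replace `"$toy"` with `data.fastq`; the two numbers should agree with the commands above.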

What is the sequencing depth?
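
Sequencing depth is commonly estimated as the total number of sequenced bases divided by the genome length. The sketch below shows the arithmetic only: both values are hypothetical placeholders, not results from this dataset. In a metavirome, only the reads that belong to the target virus contribute to its depth, so a figure based on all reads is an upper bound.

```shell
# Hypothetical values for illustration only; replace them with your own numbers.
total_bases=1500000   # assumed: base count obtained in the previous step
genome_size=7500      # assumed: length of the target genome, in bases

# Depth = total sequenced bases / genome length.
depth=$(LC_ALL=C awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.1f", b / g }')

echo "Estimated depth: ${depth}x"   # prints: Estimated depth: 200.0x
```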


### <span style="color: #4CACBD;"> 3.1 Quality Control of FASTQ with NanoPlot  <a class="anchor" id="nanoplot"></a></span> 

Check read quality with NanoPlot. Its options are listed with `--help`.

In [None]:
NanoPlot --help

Run NanoPlot. It accepts either sequencing summary files or fastq files as input.

In [None]:
# create a folder to save results
mkdir -p ~/work/SG-ONT-2022/QC
cd ~/work/SG-ONT-2022/QC

In [None]:
# run NanoPlot on the available fastq file data.fastq (write the results to a folder named NANOPLOT)
NanoPlot 

Check the statistics in the NanoStats file that was created.

In [None]:
cat NANOPLOT/NanoStats.txt

* What do you think about the data?

* What about the read quality scores (Q-scores)?

Open `NanoPlot-report.html`.

To view it in Jupyter, you need to click on "Trust HTML".

* What do you conclude about this dataset?

###  Other complementary tools: 

https://github.com/wdecoster/NanoPlot#companion-scripts

* NanoComp: comparing multiple runs

* NanoStat: statistic summary report of reads or alignments

* NanoFilt: filtering and trimming of reads

* NanoLyse: removing contaminant reads (e.g. lambda control DNA) from fastq

* Filtlong: filtering long reads by quality https://github.com/rrwick/Filtlong
