
# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

Adapted by J. Orjuela (DIADE-IRD) - May 2023

# <span style="color:#006E7F">__TP1 - BASECALLING and QC__ <a class="anchor" id="data"></span>  
    
## <span style="color: #4CACBC;"> 1. Creating the folder, downloading data and so on</span>  


### <span style="color: #4CACBD;"> 1.1 Real data </span>

Before starting, please download the data prepared for this practical training. The data are available from the I-Trop server.

Each participant will analyse algae samples from Louis Dennui.


In [None]:
cd ~/work
mkdir -p DATA
cd DATA
# download your compressed algae 4222, B8 and G11 genotypes
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/algae_data/4222_RB2.fastq.gz
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/algae_data/B8_RB11.fastq.gz
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/algae_data/G11_RB6_2022.fastq.gz
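
Because the downloads run without certificate checking, it can be worth confirming that each archive arrived intact. The sketch below assumes the three files sit in `~/work/DATA`, as the commands above should leave them; `check_gz` is a hypothetical helper name, not part of any tool used in this training.

```shell
# check_gz: hypothetical helper; tests each gzip archive and reports OK/FAILED
check_gz() {
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "FAILED  $f"
        fi
    done
}

# Files are expected in ~/work/DATA after the wget commands above
check_gz ~/work/DATA/4222_RB2.fastq.gz \
         ~/work/DATA/B8_RB11.fastq.gz \
         ~/work/DATA/G11_RB6_2022.fastq.gz
```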

Download the available reference genome fasta file (GCF_002220235.1). We will use this reference genome for comparison with our results.
It was previously downloaded from https://www.ncbi.nlm.nih.gov/assembly/GCF_002220235.1

In [None]:
cd ~/work
mkdir -p DATA/REF
cd DATA/REF
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/algae_data/GCA_002220235.1_ASM222023v1_genomic.fna

## <span style="color: #4CACBC;"> 2. Basecalling </span>  

When DNA molecules are sequenced, the electrical signals are stored in fast5 files.

These signals need to be converted into standard fastq files for downstream analysis.

Several models, each trained on a different dataset, are commonly used to convert fast5 to fastq.
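
For orientation, a fastq record has four lines: a header starting with `@`, the base sequence, a `+` separator, and one quality character per base. The sketch below writes a toy file with two invented reads (the file name `toy.fastq` is hypothetical) and counts reads as lines divided by four, assuming no record is wrapped over several lines.

```shell
# Toy FASTQ with two invented reads (hypothetical file name)
toy=$(mktemp -d)/toy.fastq
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\n!!!!\n' > "$toy"

# Each record spans exactly four lines, so reads = lines / 4
n_reads=$(awk 'END { print NR / 4 }' "$toy")
echo "reads: $n_reads"

# Sequence and quality lines must have the same length in every record
awk 'NR % 4 == 2 { len = length($0) }
     NR % 4 == 0 && length($0) != len { bad++ }
     END { print "records with length mismatch: " (bad + 0) }' "$toy"
```

For a compressed file, the same awk commands can read from `gzip -dc file.fastq.gz`.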

### <span style="color: #4CACBD;"> 2.1 Basecalling with Guppy  <span style="color:red"> (Don't run it! ) </span> </span>

Guppy is a data processing toolkit that contains the 'Oxford Nanopore Technologies' basecalling algorithms, and several bioinformatic post-processing features.

Basecalling can be launched with the `guppy_basecaller` command.

Guppy takes raw fast5 read files and converts the electrical signal into fastq files.

We recommend basecalling your dataset on a GPU to obtain results quickly.

For kits earlier than R10 (<R10), documentation on how to run Guppy on the I-Trop GPU can be found at https://bioinfo.ird.fr/index.php/tutorials-fr/gpu/

In [None]:
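# Guppy configuration file (R9.4.1 chemistry, high-accuracy model)
MODEL="dna_r9.4.1_450bps_hac.cfg"
# placeholder paths: adapt them to your own data
INPUT="path/to/FAST5"
OUTPUT="/path/to/FASTQ"
# echo only prints the command, so nothing is actually run here
echo "guppy_basecaller -c ${MODEL} -i ${INPUT} --recursive -s ${OUTPUT} --num_callers 8 --gpu_runners_per_device 8 --device auto --min_qscore 7 --compress_fastq"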

## <span style="color: #4CACBC;"> 3. Quality Control on Long Reads </span>  


Calculating data quantity

In [None]:
cd ~/work/DATA
pwd

Calculating how many bases were sequenced

In [None]:
# remove newlines first, otherwise wc -m counts one extra character per line
seqtk seq -A /home/jovyan/work/DATA/4222_RB2.fastq.gz | grep -v "^>" | tr -d '\n' | wc -m

In [None]:
# remove newlines first, otherwise wc -m counts one extra character per line
seqtk seq -A /home/jovyan/work/DATA/REF/GCA_002220235.1_ASM222023v1_genomic.fna | grep -v "^>" | tr -d '\n' | wc -m
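
As a cross-check that does not depend on seqtk, the number of sequenced bases can also be obtained by summing the length of every sequence line (line 2 of each four-line fastq record). This is a sketch on an invented toy file; `count_bases` is a hypothetical helper name, and it assumes unwrapped records.

```shell
# count_bases: hypothetical helper; sums the lengths of the sequence lines
# (line 2 of each 4-line record) in a gzip-compressed FASTQ file
count_bases() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# Toy input with two invented reads of 8 and 4 bases
toy_gz=$(mktemp -d)/toy.fastq.gz
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\n!!!!\n' | gzip > "$toy_gz"

toy_bases=$(count_bases "$toy_gz")
echo "total bases: $toy_bases"
```

On the real data this would be called as `count_bases ~/work/DATA/4222_RB2.fastq.gz`.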

### What is the sequencing depth of the sample?
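
Sequencing depth is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below uses placeholder numbers, not results from this dataset: replace `TOTAL_BASES` and `GENOME_SIZE` (both hypothetical variable names) with the two counts obtained in the previous cells.

```shell
# Placeholder values, NOT measured on the algae data:
TOTAL_BASES=650000000   # bases counted in the fastq file
GENOME_SIZE=13000000    # bases counted in the reference fasta

# depth = sequenced bases / genome size
DEPTH=$(awk -v b="$TOTAL_BASES" -v g="$GENOME_SIZE" 'BEGIN { printf "%.1f", b / g }')
echo "Estimated depth: ${DEPTH}x"
```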

### <span style="color: #4CACBD;"> 3.1 Quality Control of FASTQ with Nanoplot (Hh real data --fastq ) </span>


Check read quality using NanoPlot. Its options are listed with `--help`.

In [None]:
NanoPlot --help

Check read quality by running NanoPlot on the Hh real data fastq file

In [None]:
# create a folder to save results
mkdir -p ~/work/RESULTS
cd ~/work/RESULTS

In [None]:
time NanoPlot -t 8 --fastq ~/work/DATA/4222_RB2.fastq.gz --outdir NANOPLOT_4222

Inspect the `NanoStats.txt` file

In [None]:
cat NANOPLOT_4222/NanoStats.txt

## Inspect the `report.html` file

* What do you think about the data?

* What about the read quality (Q-score)?
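
When reading the quality plots, it helps to remember that a Phred Q-score is a log-scaled error probability: Q = -10 log10(P), so P = 10^(-Q/10). A minimal sketch of the conversion follows; `q_to_error` is a hypothetical helper name.

```shell
# q_to_error: hypothetical helper; converts a Phred Q-score
# into the corresponding per-base error probability P = 10^(-Q/10)
q_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

for q in 7 10 20; do
    echo "Q$q -> error probability $(q_to_error "$q")"
done
```

Under this relation, a Q-score of 7, the value passed to `--min_qscore` in the Guppy command, corresponds to an error probability of roughly 20%.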

## Run NanoComp using the three algae fastq files and compare them.

In this notebook we use the fastq files directly, but using sequencing summary files is recommended.

In [None]:
NanoComp

In [None]:
mkdir -p ~/work/RESULTS/NANOCOMP
cd ~/work/RESULTS/NANOCOMP

In [None]:
time NanoComp --fastq /home/jovyan/work/DATA/4222_RB2.fastq.gz  /home/jovyan/work/DATA/B8_RB11.fastq.gz  /home/jovyan/work/DATA/G11_RB6_2022.fastq.gz 

### ... discuss the results

### <span style="color: #4CACBD;"> Other complementary tools </span>
 

https://github.com/wdecoster/NanoPlot#companion-scripts

* NanoComp: comparing multiple runs

* NanoStat: statistic summary report of reads or alignments

* NanoFilt: filtering and trimming of reads

* NanoLyse: removing contaminant reads (e.g. lambda control DNA) from fastq

* Filtlong: filtering long reads by quality https://github.com/rrwick/Filtlong
