
# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></a></span>  

* Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD), A. Comte (PHIM-IRD) and E. Tibiri (WAVE-INERA), September 2023
* Adapted by E. Tibiri (WAVE-INERA), May 2024

# <span style="color:#006E7F">__Practical 1 - BASECALLING and QC__ <a class="anchor" id="data"></a></span>  
    
## <span style="color: #4CACBC;"> 1. Creating the working folder and downloading data</span>  


### <span style="color: #4CACBD;"> 1.1 Real data </span>


These data were generated by WAVE and are not yet published.




In [None]:
mkdir -p  ~/work/DATA
cd ~/work/DATA

In [None]:

# Copy a directory or file from the host system to a Docker container
docker cp directory/ container_name:/path/destination/

## <span style="color: #4CACBC;"> 2. Basecalling </span>  

When DNA molecules are sequenced, the electrical signals are stored in fast5 files.

These signals need to be converted into standard fastq files for further analysis.

Basecallers usually offer several models, each trained on a different dataset, to convert fast5 to fastq.

### <span style="color: #4CACBD;"> 2.1 Basecalling with Guppy  <span style="color:red">  </span> </span>

Guppy is a data processing toolkit that contains the 'Oxford Nanopore Technologies' basecalling algorithms, and several bioinformatic post-processing features.

Basecalling is launched with the `guppy_basecaller` command.

It takes raw fast5 read files and converts their electrical signals into fastq files.

If possible, we recommend basecalling your dataset on a GPU to obtain results quickly.

Documentation on how to run Guppy on the I-trop GPU can be found at https://bioinfo.ird.fr/index.php/tutorials-fr/gpu/


In [None]:
MODEL="dna_r9.4.1_450bps_hac.cfg"
INPUT="path/to/FAST5"
OUTPUT="/path/to/FASTQ"
guppy_basecaller ...

## <span style="color: #4CACBC;"> 3. Quality Control on Long Reads </span>  


Calculating data quantity

In [None]:
cd ~/work/DATA
pwd

Calculating how many bases were sequenced (`tr -d '\n'` removes the newlines so that `wc -m` counts bases only)

In [None]:
seqtk seq -A /home/jovyan/work/DATA/barcode93.fastq.gz | grep -v ">" | tr -d '\n' | wc -m

In [None]:
seqtk seq -A /home/jovyan/work/DATA/barcode96.fastq.gz | grep -v ">" | tr -d '\n' | wc -m

In [None]:
seqtk seq -A /home/jovyan/work/REF/JN165089.fasta | grep -v ">" | tr -d '\n' | wc -m
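As a cross-check that needs only `awk`, the minimal sketch below counts reads and bases in a toy fastq file. The toy file and the variable names are illustrative assumptions, not part of the course data. It also shows why sequence lines piped into `wc -m` without removing newlines give an inflated count: every line contributes one extra newline character.

```shell
# Toy fastq with two reads (4 and 6 bases), a stand-in for a real barcode file
toy_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n' > "$toy_fastq"

# In a 4-line fastq record the sequence is line 2: sum the length of each such line
counts=$(awk 'NR % 4 == 2 { reads++; bases += length($0) } END { print reads, bases }' "$toy_fastq")
echo "reads bases: $counts"            # prints: reads bases: 2 10

# wc -m also counts one newline per sequence line, so this count is too high by the number of reads
naive=$(awk 'NR % 4 == 2' "$toy_fastq" | wc -m | tr -d ' ')
echo "character count with newlines: $naive"   # prints: character count with newlines: 12

rm -f "$toy_fastq"
```

On a compressed file, the same `awk` program can read from `gzip -dc file.fastq.gz`. This assumes each record spans exactly four lines, which is the usual layout of basecaller output.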

### What is the sequencing depth if barcode93 contains about 100 Mb (100,376,085 bp)?
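Mean sequencing depth is the total number of sequenced bases divided by the length of the reference. The sketch below shows the arithmetic only: `ref_length` is a hypothetical placeholder, not the real length of JN165089, and should be replaced with the value counted from the reference fasta.

```shell
total_bases=100376085   # bases sequenced for barcode93, as given in the question
ref_length=10000        # HYPOTHETICAL reference length in bp; replace with the real count

# The shell only does integer arithmetic, so let awk do the floating-point division
depth=$(awk -v t="$total_bases" -v r="$ref_length" 'BEGIN { printf "%.1f", t / r }')
echo "Mean depth: ${depth}x"
```

This is an average over the whole reference. It assumes that every read comes from the reference, so it gives an upper bound when the sample also contains host or contaminant reads.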

### <span style="color: #0D6657;"> 3.1 Quality Control of FASTQ with NanoPlot </span>


Check read quality with NanoPlot. Its options are listed with `--help`.

In [None]:
NanoPlot --help

Run NanoPlot on the barcode93 and barcode96 real-data fastq files.

In [None]:
# create a folder to save results
mkdir -p ~/work/RESULTS/QC
cd ~/work/RESULTS/QC

In [None]:
NanoPlot ...

Inspect the `NanoStats.txt` file

In [None]:
cat NANOPLOT_93/NanoStats.txt

## Inspect the `report.html` file

* What do you think about the data?

* What about the read quality (Q-score)?
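To help interpret Q-scores, recall that a Phred quality Q corresponds to an error probability of 10^(-Q/10), and that fastq files encode Q as the ASCII code of a character minus 33. The sketch below computes a mean read quality by averaging error probabilities rather than raw Q values. The function name `mean_qscore` is hypothetical, and this is an illustration of the principle, not a reimplementation of NanoPlot.

```shell
mean_qscore() {
    # $1: a Phred+33 quality string (line 4 of a fastq record)
    printf '%s\n' "$1" | awk '
        BEGIN {
            # Lookup table: printable ASCII character -> character code
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        {
            n = length($0); sum_p = 0
            for (i = 1; i <= n; i++) {
                q = ord[substr($0, i, 1)] - 33   # Phred score of this base
                sum_p += 10 ^ (-q / 10)          # matching error probability
            }
            # Convert the mean error probability back to a Phred score
            printf "%.1f\n", -10 * log(sum_p / n) / log(10)
        }'
}

mean_qscore 'IIII'       # four bases at Q40 -> 40.0
mean_qscore 'IIII!!!!'   # four at Q40 and four at Q0 -> 3.0
```

The second call shows why the two averages differ: the plain arithmetic mean of the Q values would be 20, but half of the bases are almost certainly wrong, so the probability-based mean is close to Q3.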

### <span style="color: #4CACBD;"> Other complementary tools </span>
 

https://github.com/wdecoster/NanoPlot#companion-scripts

* NanoComp: comparing multiple runs

* NanoStat: statistical summary report of reads or alignments

* NanoFilt: filtering and trimming of reads

* NanoLyse: removing contaminant reads (e.g. lambda control DNA) from fastq

* Filtlong: filtering long reads by quality https://github.com/rrwick/Filtlong
