
# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022


# <span style="color:#006E7F">__TP1 - BASECALLING and QC__ <a class="anchor" id="data"></span>  
    
## <span style="color: #4CACBC;"> 1. Creating the folder, downloading data and so on</span>  

### <span style="color: #4CACBD;"> 1.1 Simulated clones </span>

Before starting, please download the dataset created for this practical training. The data are available from the I-Trop server.

Each participant will analyse one clone.

To generate the clone data, a **1 Mb** contig was extracted from chromosome 1 of rice.

Twenty levels of variation were generated, and long reads were simulated for each.

We introduced different variations (SNPs, indels, indels + translocations) and also some contamination.

In [None]:
CLONE=Clone10

In [None]:
cd ~/work
mkdir -p DATA
cd DATA
# download your compressed CloneX archive
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/${CLONE}.tar.gz

In [None]:
#decompress file
tar zxvf ${CLONE}.tar.gz

In [None]:
# check data 
cd ~/work/DATA
ls -l ${CLONE}

### <span style="color: #4CACBD;"> 1.2 Real data </span>

Some steps in this practical training do not work on the simulated clone datasets, so a second dataset will be downloaded.

These data were generated for this [paper](https://www.biorxiv.org/content/10.1101/2021.07.04.451066v1.full).

Please download and decompress the Hh dataset.


In [None]:
cd ~/work/DATA
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/real_Hh.tar.gz

In [None]:
#decompress file
tar zxvf real_Hh.tar.gz

## <span style="color: #4CACBC;"> 2. Basecalling </span>  

When DNA molecules are sequenced, the electrical signals are stored in FAST5 files.

These signals must be converted into standard FASTQ files for downstream analysis.

Several trained models are available to convert FAST5 to FASTQ.
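To make the target format concrete, here is a minimal sketch of what a FASTQ record looks like and how reads and bases can be counted from it. The two reads are invented for illustration, and the helper name `fastq_stats` is an assumption, not part of the training material. The sketch assumes the usual four-lines-per-record layout.

```shell
# A FASTQ record has four lines: @header, sequence, '+', per-base qualities.
# The two toy reads below are made up for illustration.
toy_fastq=$(mktemp)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\n!!!!\n' > "$toy_fastq"

# Hypothetical helper: print "<reads> <bases>" for an uncompressed FASTQ file.
fastq_stats() {
    # Line 2 of every 4-line record is the sequence.
    awk 'NR % 4 == 2 { reads++; bases += length($0) } END { print reads + 0, bases + 0 }' "$1"
}

fastq_stats "$toy_fastq"   # prints: 2 12
rm -f "$toy_fastq"
```

For a compressed file, the same helper can read from standard input, for example `gunzip -c reads.fastq.gz | fastq_stats -`, where `reads.fastq.gz` is a placeholder name.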

### <span style="color: #4CACBD;"> 2.1 Basecalling with Guppy  <span style="color:red"> (Don't run it! ) </span> </span>

Guppy is a data processing toolkit that contains the 'Oxford Nanopore Technologies' basecalling algorithms, and several bioinformatic post-processing features.

Basecalling is launched with the `guppy_basecaller` command.

Guppy takes raw FAST5 read files and converts the electrical signal into FASTQ files.

We recommend basecalling your dataset on a GPU to obtain results quickly.

Documentation on how to run Guppy on the I-Trop GPU can be found at https://bioinfo.ird.fr/index.php/tutorials-fr/gpu/


In [None]:
MODEL="dna_r9.4.1_450bps_hac.cfg"
INPUT="path/to/FAST5"
OUTPUT="/path/to/FASTQ"
echo "guppy_basecaller -c ${MODEL} -i ${INPUT} --recursive -s ${OUTPUT} --num_callers 8 --gpu_runners_per_device 8 --device auto --min_qscore 7 --compress_fastq"
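The command above passes `--min_qscore 7`. As far as we understand Guppy's behaviour, this is the mean read quality threshold used to separate passing from failing reads. To give a feel for what such a value means, the sketch below converts a Phred quality score into an error probability using the standard relation P = 10^(-Q/10). The function name `qscore_to_error` is hypothetical.

```shell
# Phred scale: Q = -10 * log10(P), so P = 10^(-Q/10).
# Hypothetical helper: print the error probability for a given Q-score.
qscore_to_error() {
    # exp(x * log(10)) is 10^x; LC_ALL=C keeps '.' as the decimal separator.
    LC_ALL=C awk -v q="$1" 'BEGIN { printf "%.3f\n", exp(-q / 10 * log(10)) }'
}

qscore_to_error 7    # prints 0.200 (roughly one error in five bases)
qscore_to_error 10   # prints 0.100
qscore_to_error 20   # prints 0.010
```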

## <span style="color: #4CACBC;"> 3. Quality Control on Long Reads </span>  


Calculating data quantity

In [None]:
cd ~/work/DATA
pwd

Calculating how many bases were sequenced

In [None]:
# count bases only: drop FASTA headers and newlines before counting characters
seqtk seq -A ${CLONE}/ONT/${CLONE}.fastq.gz | grep -v ">" | tr -d '\n' | wc -m

What is the sequencing depth if the clone length is about 1 Mb?
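Sequencing depth is the total number of sequenced bases divided by the length of the target. A minimal sketch of that arithmetic follows. The helper name `sequencing_depth` and the numbers are invented for illustration, so replace the first argument with the base count you obtained for your clone.

```shell
# Hypothetical helper: depth = total sequenced bases / length of the target.
sequencing_depth() {
    # $1 = total bases, $2 = target length in bases
    LC_ALL=C awk -v bases="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bases / size }'
}

# Made-up example: 25,000,000 bases over a 1 Mb (1,000,000 bp) clone
sequencing_depth 25000000 1000000   # prints 25.0
```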

### <span style="color: #4CACBD;"> 3.1 Quality Control of FASTQ with NanoPlot (one clone --summary) </span>

Check read quality with NanoPlot. The available parameters are listed with `--help`.

In [None]:
NanoPlot --help

Run NanoPlot on your clone. NanoPlot accepts either sequencing summaries or FASTQ files as input.

In [None]:
# create a folder to save results
mkdir -p ~/work/RESULTS
cd ~/work/RESULTS

In [None]:
# run nanoplot 
time NanoPlot -t 1 --summary ../DATA/${CLONE}/ONT/${CLONE}_DeepSimu_sequencing_summary.txt --outdir NANOPLOT_${CLONE}

* NOTE: Clones are simulated data, so this step may not work as expected on them. <span style="color:red"> (This is normal!) </span>

Check the statistics in the NanoStats file that was created.

In [None]:
cat NANOPLOT_${CLONE}/NanoStats.txt

* What do you think about the data?

* Estimate the coverage.

* What about the read quality (Q-score)?
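To answer these questions, it can help to pull individual values out of `NanoStats.txt`. The sketch below works on a mock file: the field labels are assumed from typical NanoPlot output and may differ between versions, every number is invented, and `nanostats_value` is a hypothetical helper name.

```shell
# Mock NanoStats.txt: labels follow typical NanoPlot output (an assumption),
# and every number below is invented.
mock_stats=$(mktemp)
cat > "$mock_stats" <<'EOF'
General summary:
Mean read length:      8,500.0
Mean read quality:     9.8
Number of reads:       3,000.0
Read length N50:       12,000.0
Total bases:           25,500,000.0
EOF

# Hypothetical helper: print one field's value without spaces or thousands separators.
nanostats_value() {
    # $1 = NanoStats file, $2 = field label without the trailing colon
    grep "^$2:" "$1" | cut -d: -f2 | tr -d ' ,'
}

nanostats_value "$mock_stats" "Total bases"        # prints 25500000.0
nanostats_value "$mock_stats" "Mean read quality"  # prints 9.8
rm -f "$mock_stats"
```

Dividing the total bases by the clone length (about 1 Mb) then gives an estimate of the coverage.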

### <span style="color: #4CACBD;"> 3.2 Quality Control of FASTQ with NanoPlot (Hh real data --fastq ) </span>


Check the quality of the Hh real-data FASTQ file with NanoPlot. This can take a while.

In [None]:
cd ~/work/RESULTS/
pwd

**The following NanoPlot command will run for a few minutes**

In [None]:
time NanoPlot -t 8 --fastq ../DATA/real_Hh/H_M1C132_1.fastq.gz --outdir NANOPLOT_Hh

Open the HTML report created in `NANOPLOT_Hh`.

* What do you think about this dataset?

### <span style="color: #4CACBD;"> 3.3 Compare reads QC statistics with NanoComp (Summaries dataset) </span>

Compare long reads sequencing datasets using **NanoComp**.

NanoComp compiles quality information in a useful html report.

You can launch NanoComp using summaries or fastq files.

In [None]:
NanoComp --help


So far, we have used only one clone of simulated data.

Please download the sequencing summaries obtained from real Paspalum data.

Download the available 'Real_PSummaries' archive to compare them.


In [None]:
cd ~/work/DATA
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/real_summaries.tar.gz

In [None]:
#decompress 
tar -zxvf real_summaries.tar.gz

In [None]:
cd ~/work/DATA/real_summaries/
ls -lh  *txt

Compare these three summary files with NanoComp.

In [None]:
time NanoComp --summary OLD_summary.txt FF_summary.txt BON_summary.txt --outdir ~/work/RESULTS/NANOCOMP-realsummaries

### <span style="color: #4CACBD;"> Other complementary tools </span>
 

https://github.com/wdecoster/NanoPlot#companion-scripts

* NanoComp: comparing multiple runs

* NanoStat: statistic summary report of reads or alignments

* NanoFilt: filtering and trimming of reads

* NanoLyse: removing contaminant reads (e.g. lambda control DNA) from fastq

* Filtlong: filtering long reads by quality https://github.com/rrwick/Filtlong
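To illustrate what read filtering does in principle, here is a minimal sketch that keeps only reads at or above a minimum length. It is a plain `awk` stand-in, not how NanoFilt or Filtlong are implemented, and the function name `filter_by_length` and the toy reads are invented. It assumes four-line FASTQ records on standard input.

```shell
# Hypothetical helper: keep FASTQ records whose sequence is at least $1 bases long.
filter_by_length() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }   # @header line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line: the record is complete
            if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0
        }
    '
}

# Two made-up reads of 8 and 3 bases; only the first passes a 5-base minimum.
printf '@a\nACGTACGT\n+\nIIIIIIII\n@b\nACG\n+\nIII\n' | filter_by_length 5
```

For real datasets, the dedicated tools listed above are the better choice, since they also handle quality-based filtering and trimming.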
