# What is basecalling?

Basecalling is the process of obtaining the DNA sequence of each read from the squiggle (the raw electrical signal) produced by Nanopore sequencing (see the Nanopore Sequencing notebook).
It relies on neural networks trained on large datasets. The Oxford Nanopore basecalling models are trained on large datasets of bacterial, C. elegans, and human genomic sequences.

In this study we used two basecallers to detect methylation and transposable elements:
- Guppy
- Dorado

Guppy is the former reference basecaller from Oxford Nanopore, while Dorado is the most recent one. Dorado takes pod5 files as input and promises to reduce computation time without reducing basecalling accuracy.

Classical basecalling and basecalling of modified bases use different algorithms and options in both Guppy and Dorado. Therefore, to detect methylation we had to run basecalling a second time with a modified-base model.

For both basecalling tools we use the sup (super accuracy) model, which is expected to give the highest basecalling accuracy but is also the most computationally intensive.

# Guppy basecalling

>***Our Guppy 6.5.7 model***


    📚 Basecalling : dna_r9.4.1_450bps_sup.cfg  

    📚 Modified Basecalling : dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_sup.cfg


  >***Running Guppy 6.5.7 classical basecalling***

In [None]:
for i in ./Gd*; do
    /home/data/ont-guppy_6.5.7/bin/guppy_basecaller -i "$i" -c dna_r9.4.1_450bps_sup.cfg -s "./$(basename "$i")_Guppy_basecalling" --bam_out --recursive --device cuda:all:100%
done
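
The loop above assumes that at least one folder matches `./Gd*`. When nothing matches, the shell usually passes the literal pattern to the basecaller. A minimal sketch of a guard is shown below; `list_samples` is a hypothetical helper name, not part of Guppy.

```shell
# list_samples is a hypothetical helper: it prints one sample folder per line
# and prints nothing when no folder matches the Gd* pattern.
list_samples() {
    for d in "$1"/Gd*; do
        # With no match the pattern stays literal; the -d test filters it out,
        # along with any regular file whose name starts with Gd.
        [ -d "$d" ] || continue
        printf '%s\n' "$d"
    done
}

# Preview the folders that the basecalling loop would process.
list_samples .
```

Running this preview before a long GPU job is a cheap way to confirm that the working directory is the expected one.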

  >***Running Guppy 6.5.7 modified basecalling***

In [None]:
for i in ./Gd*; do
    # Write to a separate folder so the classical basecalling output is not mixed with this run
    /home/data/ont-guppy_6.5.7/bin/guppy_basecaller -i "$i" -c dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_sup.cfg -s "./$(basename "$i")_Guppy_modbasecalling" --bam_out --recursive --device cuda:all:100%
done

# Dorado basecalling

>***Converting fast5 to pod5: why is the conversion needed?***

* pod5 is a new file format from the Oxford Nanopore team that is supposed to be more compact than fast5 and therefore more computationally efficient for Dorado basecalling.
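
Once both formats are present on disk, the size difference can be measured directly. The sketch below is only an illustration: `total_bytes` is a hypothetical helper, and it assumes the fast5 and pod5 files sit under the current folder.

```shell
# total_bytes is a hypothetical helper: it sums the size in bytes of every file
# with the given extension found under a folder (subfolders included).
total_bytes() {
    find "$1" -type f -name "*.$2" -exec wc -c {} + |
        awk '$2 != "total" { s += $1 } END { print s + 0 }'
}

fast5_size=$(total_bytes . fast5)
pod5_size=$(total_bytes . pod5)
echo "fast5: $fast5_size bytes, pod5: $pod5_size bytes"
```

The helper skips the summary lines that `wc` prints for each batch of files, so the result stays correct even when `find` splits a large folder into several batches.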

>***Installation of pod5***

In [None]:
$ sudo apt install python3-pip      ## Install pip
$ pip install pod5                  ## Install pod5
$ sudo find / -type f -name pod5    ## Locate the pod5 executable

>***Conversion of a single folder containing fast5 files***

In [None]:
$ pod5 convert fast5 *.fast5 --output . --one-to-one . ## Convert every fast5 to a pod5 of the same name,
                                                       ## while keeping the original fast5 files
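
Because the conversion is one-to-one, the folder should hold as many pod5 files as fast5 files once it finishes. A quick sanity check can be sketched as follows; `count_ext` and `check_conversion` are hypothetical helper names, and the check only looks at the top level of the folder.

```shell
# count_ext is a hypothetical helper: number of files with a given extension
# directly inside a folder (subfolders are ignored).
count_ext() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# check_conversion is a hypothetical helper: it succeeds only when the folder
# holds the same number of pod5 and fast5 files.
check_conversion() {
    n_fast5=$(count_ext "$1" fast5)
    n_pod5=$(count_ext "$1" pod5)
    if [ "$n_fast5" -eq "$n_pod5" ]; then
        echo "OK: $n_pod5 pod5 for $n_fast5 fast5"
    else
        echo "MISMATCH: $n_pod5 pod5 for $n_fast5 fast5"
        return 1
    fi
}

# Run from inside the folder that was converted.
check_conversion . || echo "fast5 and pod5 counts differ, inspect the folder before basecalling"
```

Matching counts do not prove that every read was converted, but a mismatch is a reliable sign that the conversion stopped early.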

>***Installation of Dorado 0.5.3***

In [None]:
$ wget https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.5.3-linux-x64.tar.gz   ## Download Dorado v0.5.3
$ tar -xf dorado-0.5.3-linux-x64.tar.gz  ## Extract the archive

>***Our Dorado 0.5.3 model***

In [None]:
📚 Basecalling : dna_r9.4.1_e8_sup@v3.3

📚 Modified Basecalling : dna_r9.4.1_e8_sup@v3.3 together with dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0 (passed through --modified-bases-models)

>***Downloading the model***

In [None]:
$ dorado download --model dna_r9.4.1_e8_sup@v3.3 ## for our specific model
$ dorado download --model all ## for all models
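
The basecalling commands in this notebook refer to the model by its full path, so it can be worth confirming that the model folder is where they expect it before starting a long run. The sketch below assumes the model was downloaded into the Dorado bin folder, as those paths suggest; `has_model` is a hypothetical helper.

```shell
# Paths as they appear in the basecalling commands of this notebook.
DORADO_BIN=/home/data/dorado-0.5.3-linux-x64/bin
MODEL="$DORADO_BIN/dna_r9.4.1_e8_sup@v3.3"

# has_model is a hypothetical helper: it succeeds when the path is a folder
# that contains at least one entry.
has_model() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if has_model "$MODEL"; then
    echo "model found: $MODEL"
else
    echo "model missing: $MODEL"
fi
```

If the model is reported missing, either move the downloaded folder or adjust the path given to the basecaller.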

>***Running Dorado 0.5.3 basecalling***

In [None]:
for i in ./Gd*; do
    /home/data/dorado-0.5.3-linux-x64/bin/dorado basecaller /home/data/dorado-0.5.3-linux-x64/bin/dna_r9.4.1_e8_sup@v3.3 "$i" -b 320 > "$(basename "$i")_Dorado_basecalling.bam"
done

Note: if Dorado stops with the following error:

CUDA out of memory. Tried to allocate ... GiB

reduce --batchsize (-b) to a smaller value.
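
The advice in the note above can be automated by retrying with a smaller batch size after each failure. This is only a sketch: `run_with_smaller_batches` is a hypothetical helper, and it retries after any failure, not only after an out-of-memory error, so the Dorado log should still be checked.

```shell
# run_with_smaller_batches is a hypothetical helper.
# Arguments: starting batch size, smallest batch size to try, output file,
# then the command to run. "-b <size>" is appended to the command, and the
# output file is rewritten from scratch on every attempt.
run_with_smaller_batches() {
    batch=$1
    min_batch=$2
    out_file=$3
    shift 3
    while [ "$batch" -ge "$min_batch" ]; do
        if "$@" -b "$batch" > "$out_file"; then
            echo "succeeded with batch size $batch" >&2
            return 0
        fi
        echo "failed with batch size $batch, trying a smaller one" >&2
        batch=$((batch / 2))
    done
    return 1
}

# Usage sketch with the paths of this notebook (left commented out, needs a GPU):
# run_with_smaller_batches 320 40 "$(basename "$i")_Dorado_basecalling.bam" \
#     /home/data/dorado-0.5.3-linux-x64/bin/dorado basecaller \
#     /home/data/dorado-0.5.3-linux-x64/bin/dna_r9.4.1_e8_sup@v3.3 "$i"
```

Redirecting inside the helper, instead of around it, keeps the partial output of a failed attempt out of the final BAM file.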

>***Running Dorado 0.5.3 modified basecalling***

In [None]:
for i in ./Gd*; do
    /home/data/dorado-0.5.3-linux-x64/bin/dorado basecaller /home/data/dorado-0.5.3-linux-x64/bin/dna_r9.4.1_e8_sup@v3.3 "$i" --modified-bases-models /home/data/dorado-0.5.3-linux-x64/bin/dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0 --recursive -b 320 > "$(basename "$i")_Dorado_modbasecalling.bam"
done
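
Modified-base calls are stored in the BAM records as MM and ML tags, so a quick way to confirm that the modified basecalling worked is to count the records that carry both. The sketch below assumes samtools is installed for the commented usage line; `count_modbase_records` is a hypothetical helper and the file name in the usage line is only an example.

```shell
# count_modbase_records is a hypothetical helper: it reads SAM records on stdin
# and prints how many of them carry both an MM and an ML tag.
count_modbase_records() {
    awk -F '\t' '
        /^@/ { next }                       # skip header lines, if any
        {
            mm = 0; ml = 0
            for (i = 12; i <= NF; i++) {    # optional tags start at column 12
                if ($i ~ /^MM:Z:/) mm = 1
                if ($i ~ /^ML:B:/) ml = 1
            }
            if (mm && ml) n++
        }
        END { print n + 0 }
    '
}

# Usage sketch (left commented out, needs samtools and an existing BAM file):
# samtools view Gd1_Dorado_modbasecalling.bam | head -n 1000 | count_modbase_records
```

A count of 0 on the first thousand reads suggests that the run used the classical model only.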