# Basecalling with Dorado

As of December 2022, [Dorado](https://github.com/nanoporetech/dorado) is software we can use to convert the Nanopore [fast5](https://medium.com/@shiansu/a-look-at-the-nanopore-fast5-format-f711999e2ff6) files into the [fastq](https://en.wikipedia.org/wiki/FASTQ_format) format. Nanopore will be making changes to its file formats in the near future, but for now, this is how we will make the conversion. 

Recall that in nanopore sequencing, the movement of DNA through a nanopore produces an electrical signal which encodes the DNA sequence. Basecalling is the translation of that electrical signal into DNA bases. This process is imperfect, and as we will learn, each base has a probability of being incorrectly determined.

<figure>
    <img src="/media/cc0-images/elephant-660-480.jpg"
         alt="DNA passing through a nanopore and generating an electrical signal">
    <figcaption>DNA passing through a nanopore and generating an electrical signal</figcaption>
</figure>


**Important** 

This software will work with a machine that has [GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit) processors. These processors can do the math needed very quickly. You will be given a machine that is equipped with GPUs, so don't try to run this notebook on another type of computer.

## Preparing the virtual machine

There are some commands we can use to check the machine we are using. We can see what GPUs we have access to (assuming this is a machine running NVIDIA GPUs). 

First, we can load updated [drivers](https://en.wikipedia.org/wiki/Device_driver):

In [None]:
module load nvhpc/22.3/nvhpc

In [None]:
module load nvhpc/22.3/nvhpc-nompi

This next command should confirm we have a GPU to run the software:

In [None]:
nvidia-smi -L

We can also see which version of the CUDA software is installed for this GPU:

In [None]:
nvcc --version

The version of Dorado we use should be compatible with
`Cuda compilation tools, release 11.6, V11.6.112`
so look for that line in the output of the command above.
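If we would rather not read the version by eye, the check can be scripted. The following is a minimal sketch, not part of the original workflow: the helper name `cuda_release` and the variable `expected` are made up for illustration, and it assumes the `nvcc --version` output contains a line shaped like the one quoted above.

```bash
# Pull the release number (for example 11.6) out of `nvcc --version` text on stdin.
# `cuda_release` is a hypothetical helper name used only in this sketch.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

expected="11.6"   # the release this tutorial was written against

if command -v nvcc >/dev/null 2>&1; then
  found=$(nvcc --version | cuda_release)
  if [ "$found" = "$expected" ]; then
    echo "CUDA release $found matches"
  else
    echo "CUDA release is '$found' but this tutorial expects $expected" >&2
  fi
else
  echo "nvcc not found; check that the module load step succeeded" >&2
fi
```

A mismatch does not necessarily mean Dorado will fail, but it is worth knowing about before troubleshooting later steps.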

## Connecting to the data share and configuring the machine

If you have not used this virtual machine before you will need to connect to your data share. 

Recall that we recommended creating a `project` directory for all of our work at `/home/exouser/project` and linking it to the shared data storage, which should be located at `/mnt/ceph`. Be sure to change `YOURUSERNAME` to the directory name we saw earlier, i.e. your initial and last name.

In [None]:
ln -s /mnt/ceph/YOURUSERNAME /home/exouser/project

We will also need to be able to read from the small tutorial data which is at `/mnt/ceph/chamecrista_fast5/0831_np_ac_small`. Let's create a link so we can easily read in this data. 

In [None]:
sudo ln -s /mnt/ceph/chamecrista_fast5/0831_np_ac_small /home/exouser/project/fast5_small
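Symbolic links can silently point at nothing, for example when `YOURUSERNAME` was not replaced. Note also that if `/home/exouser/project` already exists as a regular directory, `ln -s` places the new link inside it rather than replacing it. Here is a small sketch for checking both links; `check_link` is a hypothetical helper, not something provided by the system.

```bash
# Report whether a path is a symbolic link and whether its target exists.
# `check_link` is a hypothetical helper name used only in this sketch.
check_link() {
  if [ ! -L "$1" ]; then
    echo "$1: not a symbolic link"
  elif [ -e "$1" ]; then
    echo "$1 -> $(readlink "$1")"
  else
    echo "$1: broken link (target missing)"
  fi
}

check_link /home/exouser/project
check_link /home/exouser/project/fast5_small
```

On a correctly configured machine, both lines should show an arrow pointing into `/mnt/ceph`.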

Let's check on what we have

In [None]:
sudo ls -R /home/exouser/project/fast5_small

You should get an output showing 11 fast5 files that we can work with:

```
/home/exouser/project/fast5_small:
FAU30260_pass_3910fb5d_0.fast5	 FAU30260_pass_3910fb5d_5.fast5
FAU30260_pass_3910fb5d_1.fast5	 FAU30260_pass_3910fb5d_6.fast5
FAU30260_pass_3910fb5d_10.fast5  FAU30260_pass_3910fb5d_7.fast5
FAU30260_pass_3910fb5d_2.fast5	 FAU30260_pass_3910fb5d_8.fast5
FAU30260_pass_3910fb5d_3.fast5	 FAU30260_pass_3910fb5d_9.fast5
FAU30260_pass_3910fb5d_4.fast5
```
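Rather than counting the file names by eye, we could ask the shell to count them. This is a sketch under a few assumptions: `count_fast5` is a made-up helper name, and your user must be able to read the linked directory without `sudo`.

```bash
# Count the .fast5 files under a directory, following symbolic links (-L).
# `count_fast5` is a hypothetical helper name used only in this sketch.
count_fast5() {
  find -L "$1" -type f -name '*.fast5' | wc -l | tr -d ' '
}

dir=/home/exouser/project/fast5_small
if [ -d "$dir" ]; then
  echo "fast5 files found: $(count_fast5 "$dir")"
else
  echo "directory not found: $dir" >&2
fi
```

For the tutorial data, the count should agree with the 11 files listed above.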

Let's create a folder to hold the rest of our tutorial outputs, including the fastq files we will hopefully produce.

In [None]:
sudo mkdir -p /home/exouser/project/tutorial/fastq_output

Let's check what we have made. The `-p` option should have made both the `tutorial` and the `fastq_output` directories.

In [None]:
sudo ls -R /home/exouser/project/

Finally, let's make the permissions on these folders permissive so we don't need to constantly use sudo.

In [None]:
sudo chmod -R 777 project
sudo chmod -R 777 project/tutorial
sudo chmod -R 777 project/tutorial/fastq_output

## Download the Dorado software

We will have to install Dorado following the instructions on its [GitHub](https://github.com/nanoporetech/dorado) site. We will use `wget` to download the software, but first let's make a place for it.

*Note*: The file extension `.tar.gz` is often seen when data is compressed or when a collection of software files is bundled together. Compressed data can often be used without decompressing it first, but software (i.e. a program) usually needs to be decompressed before we can run it.

In [None]:
mkdir -p project/software

In [None]:
wget https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.1.0-linux-x64.tar.gz -O project/software/dorado-0.1.0-linux-x64.tar.gz

Let's move to this new directory so we can decompress the downloaded [tar](https://en.wikipedia.org/wiki/Tar_(computing)) file. 

In [None]:
cd ~/project/software && ls

We need to decompress the file before using the software inside. 

In [None]:
tar -xvf dorado-0.1.0-linux-x64.tar.gz 

The actual software is now in `/home/exouser/project/software/dorado-0.1.0+4b0e9a6-Linux/bin`.
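The name of the unpacked directory does not have to be guessed: `tar` can list an archive's contents without extracting anything. Below is a minimal sketch, where `tar_top_level` is a hypothetical helper and the archive path assumes the download location used above.

```bash
# Print the top-level entries of a .tar.gz archive without extracting it.
# `tar_top_level` is a hypothetical helper name used only in this sketch.
tar_top_level() {
  tar -tzf "$1" | cut -d/ -f1 | sort -u
}

archive=/home/exouser/project/software/dorado-0.1.0-linux-x64.tar.gz
if [ -f "$archive" ]; then
  tar_top_level "$archive"
else
  echo "archive not found: $archive" >&2
fi
```

Listing first is also a cheap way to confirm that a download finished properly, since a truncated archive usually makes `tar` report an error.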

Let's make the name of the directory a bit easier to work with:

In [None]:
mv /home/exouser/project/software/dorado-0.1.0+4b0e9a6-Linux /home/exouser/project/software/dorado-0.1.0

We need to set permissions so we can access the files in it:

In [None]:
sudo chmod -R 777 /home/exouser/project/software/dorado-0.1.0/bin

We can see the files with `ls`:

In [None]:
ls /home/exouser/project/software/dorado-0.1.0/bin

We can add this to the computer's [PATH](http://www.linfo.org/path_env_var.html)

In [None]:
PATH=$PATH:/home/exouser/project/software/dorado-0.1.0/bin
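A `PATH` change made this way only lasts for the current shell session, so it can be handy to check whether the directory is already listed before adding it again. This is a sketch only; `on_path` and `dorado_bin` are names invented for the example.

```bash
# Return success if directory $1 is already listed in PATH.
# `on_path` is a hypothetical helper name used only in this sketch.
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

dorado_bin=/home/exouser/project/software/dorado-0.1.0/bin

# Append the directory only if it is missing, so PATH does not grow on every run.
on_path "$dorado_bin" || PATH=$PATH:$dorado_bin

# Show where the shell finds dorado, or warn if it cannot.
command -v dorado || echo "dorado not found on PATH" >&2
```

Wrapping `PATH` in colons before matching avoids false hits on directories whose names merely begin with the same text.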

These commands work around a known issue with the software. We won't explain them here; for now, just trust that they work.

In [None]:
DIR=/home/exouser/project/software/dorado-0.1.0/lib/libcublasLt-17d45838.so.11
LD_LIBRARY_PATH=${LD_LIBRARY_PATH/${LD_LIBRARY_PATH/#$DIR:*/$DIR:}/}${LD_LIBRARY_PATH/${LD_LIBRARY_PATH/*:$DIR*/:$DIR}/}

We can now run `dorado`:

In [None]:
#Run dorado command
dorado

You should get an error message, which is useful because it tells us which subcommands we can run:

```
Usage: dorado [options] subcommand

Positional arguments:
basecaller
download
duplex

Optional arguments:
-h --help               shows help message and exits
-v --version            prints version information and exits
```

## Use Dorado to generate basecalls

Following the instructions on the software's [GitHub page](https://github.com/nanoporetech/dorado), we will download a [model](https://learn.microsoft.com/en-us/windows/ai/windows-ml/what-is-a-machine-learning-model) that tells Dorado how to decode the information stored in the fast5 files.

When we performed DNA sequencing, we knew some information about the flow cell and protocols we were using:

1. We sequenced DNA (not RNA) data
2. We used a flow cell that was version 10.4.1
3. The speed at which bases were read was 260 bases per second.

We could therefore choose (according to the [GitHub page](https://github.com/nanoporetech/dorado) instructions): 

- `dna_r10.4.1_e8.2_260bps_fast@v4.0.0`
- `dna_r10.4.1_e8.2_260bps_hac@v4.0.0`
- `dna_r10.4.1_e8.2_260bps_sup@v4.0.0`

These three differ in which version of the model they use: fast (`fast`), high accuracy (`hac`), or super high accuracy (`sup`).

For our test data we can choose:
`dna_r10.4.1_e8.2_260bps_sup@v4.0.0`
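The model names pack the sequencing details into one string separated by underscores, with the model version after the `@`. As a sketch of how the pieces line up, the snippet below splits the name apart. The field labels are our own reading of the name rather than anything Dorado reports, and the `chemistry` label for `e8.2` in particular is an assumption.

```bash
model="dna_r10.4.1_e8.2_260bps_sup@v4.0.0"

# Split "name@version" at the @ sign.
name=${model%@*}
version=${model#*@}

# Split the name on underscores into five fields.
# The variable names are labels chosen for this sketch.
IFS=_ read -r analyte pore chemistry speed accuracy <<EOF
$name
EOF

echo "analyte=$analyte pore=$pore chemistry=$chemistry"
echo "speed=$speed accuracy=$accuracy version=$version"
```

Reading the name this way makes it easy to check a model against what we know about our run: DNA, a 10.4.1 flow cell, and 260 bases per second.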

We can make a directory for the model and then download the model:

In [None]:
mkdir -p /home/exouser/project/software/dorado-0.1.0/models

In [None]:
dorado download --model dna_r10.4.1_e8.2_260bps_sup@v4.0.0\
 --directory /home/exouser/project/software/dorado-0.1.0/models
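Basecalling can take a while, so before starting it can be worth confirming that the model, the input reads, and the output folder are all in place. The sketch below assumes the download step stores the model as a directory named after the model. `preflight` is a hypothetical helper, not a Dorado command.

```bash
# Check that inputs exist and the output directory is writable.
# `preflight` is a hypothetical helper name used only in this sketch.
preflight() {
  model_dir=$1 input_dir=$2 output_dir=$3
  [ -d "$model_dir" ]  || { echo "model directory not found: $model_dir" >&2; return 1; }
  [ -d "$input_dir" ]  || { echo "input directory not found: $input_dir" >&2; return 1; }
  [ -w "$output_dir" ] || { echo "cannot write to: $output_dir" >&2; return 1; }
  echo "ready to basecall"
}

preflight \
  /home/exouser/project/software/dorado-0.1.0/models/dna_r10.4.1_e8.2_260bps_sup@v4.0.0 \
  /home/exouser/project/fast5_small \
  /home/exouser/project/tutorial/fastq_output \
  || echo "fix the problem reported above before running dorado" >&2
```

Each check reports the first thing that is missing, which tends to be easier to act on than an error from partway through a long run.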

Recall, our reads in **fast5** format are located at:

In [None]:
ls /home/exouser/project/fast5_small

Using this information, we will use the downloaded model and the `basecaller` command to take all of the reads in the **fast5** files and translate them into the **fastq** format.

In [None]:
dorado basecaller --emit-fastq\
 /home/exouser/project/software/dorado-0.1.0/models/dna_r10.4.1_e8.2_260bps_sup@v4.0.0\
 /home/exouser/project/fast5_small\
 > /home/exouser/project/tutorial/fastq_output/called_reads.fastq 

We can now see the output fastq file here:

In [None]:
ls /home/exouser/project/tutorial/fastq_output/
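Seeing the file is a good start, but it is more reassuring to know how many reads it holds. Here is a minimal sketch, assuming every record occupies exactly four lines (the usual layout, though the fastq format does not strictly require it). `count_reads` is a made-up helper name.

```bash
# Estimate the number of reads in a fastq file as (line count / 4).
# `count_reads` is a hypothetical helper name used only in this sketch.
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

fastq=/home/exouser/project/tutorial/fastq_output/called_reads.fastq
if [ -f "$fastq" ]; then
  echo "reads called: $(count_reads "$fastq")"
else
  echo "fastq file not found: $fastq" >&2
fi
```

A count of zero would suggest the basecalling step did not write any reads, even though the output file exists.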

Using the `head` command with `-n 4` gives the first fastq record in the file, since each record takes up four lines:

In [None]:
head -n 4 /home/exouser/project/tutorial/fastq_output/called_reads.fastq 
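The fourth line of each record is where the "probability of being incorrectly determined" lives. In the standard fastq encoding, each character stands for a Phred quality score Q equal to its ASCII code minus 33, and the error probability is 10^(-Q/10). The sketch below decodes a quality string. The record is made up for illustration and is not real Dorado output, and `phred_scores` is a hypothetical helper.

```bash
# A made-up four-line fastq record, for illustration only.
record='@example_read
ACGT
+
!+5I'

# The fourth line holds one quality character per base.
qual=$(printf '%s\n' "$record" | sed -n '4p')

# For each character print: Phred score, then error probability 10^(-Q/10).
# `phred_scores` is a hypothetical helper name used only in this sketch.
phred_scores() {
  printf '%s' "$1" | od -An -v -tu1 | tr -s ' ' '\n' |
    awk 'NF { q = $1 - 33; printf "%d %.4f\n", q, 10^(-q / 10) }'
}

phred_scores "$qual"
```

For this made-up record the scores come out as 0, 10, 20, and 40, which correspond to error probabilities of 1, 0.1, 0.01, and 0.0001. Higher scores mean more confident basecalls.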