## EANBiT Residential Training 2021


### Link to the data
[This](https://www.dropbox.com/s/nkyunpxtwa4s0td/recov_data.zip?dl=0) is the link to the fast5 files we will use for this training.


[This](https://drive.google.com/file/d/1Jyf_yMIhor7NnqhbIfBJSBlRTuKvJvhx/view?usp=sharing) is the link to the basecalled fastq files.


In [1]:
#Install bash kernel
import sys
!{sys.executable} -m pip install bash-kernel



In [2]:
#jupyter console --kernel bash
!jupyter kernelspec list

Available kernels:
  python3    C:\Users\SecondFiddle\Anaconda3\share\jupyter\kernels\python3


### Visualizing fast5 files using Bulkvis
BulkVis is a program written in Python3 using Bokeh to visualise raw squiggle data from Oxford Nanopore Technologies (ONT) bulkfiles. [Bulkvis](https://github.com/LooseLab/bulkvis) scans a folder containing bulk FAST5 files at startup. An individual file is selected and specific channels plotted. Basic metadata are displayed to the user. To navigate coordinates are input in the format `channel: start_time-end_time`. Alternatively pasting a FASTQ read header from a base called read will jump to its channel, time and raw signal from the bulk FAST5 file (Payne et al., 2019). Further information on how to strt bulkvis can be found [here](https://bulkvis.readthedocs.io/en/latest/quickstart.html).

Before you install bulkvis, make sure you have pip installed.
To check whether you have pip installed, run this command:

`pip --version`

If pip is not installed, no worries! Let's fix that.
Let's get on with the installation.
To install pip:

In [None]:
## On Debian/Ubuntu
apt install python3-pip
## On CentOS
yum install epel-release 
yum install python-pip
## Using conda 
conda install -c anaconda pip

Once you have `pip` installed, you will be required to create and activate a virtual environment to work on.
To do this, follow [these](https://bulkvis.readthedocs.io/en/latest/installation.html) simple instructions. Please edit the paths to match yours where need be, and make sure you are using the **bulk fast5 file** not a single fast5 file. This bulk fast 5 file is obtained by setting up the sequencing run as shown [here](https://github.com/LooseLab/bulkvis/issues/28) on MinKNOW.

### Basecalling
fast5 files generated from the sequencing step are basecalled
##### Basecalling script
This script was developed by [Sirselim](https://gist.github.com/sirselim) for faster basecalling with the free GPU resources on [Google collaboratory](https://research.google.com/colaboratory/). If you would like to try out/use this resource, please use your google account to open google collaboratory, and  create a notebook of your own to implement the script below.

[This](https://gist.github.com/sirselim/13f70ae69f2a512e7d9e1f00f9704f53) is the link to his basecalling script on his github page. However, there is an alternative to use CPU resources, whose basecalling capabilities are way slower.

Alternatively, we can perform CPU basecalling as shown below. It is recommended you run this step on the HPC where more threads can be allocated per caller. 

Remotely sign into your HPC using your ssh address: 

In [None]:
ssh <username>@host.address

Guppy can be installed using the code shown below:

In [None]:
%%shell
#Install Guppy
GuppyBinary=("https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_5.0.11_linux64.tar.gz")
wget $GuppyBinary

Unpack the guppy binary files using the following command:

In [None]:
%%shell
tar -xvzf ont-guppy_5.0.11_linux64.tar.gz

To perform the basecalling step you need to know the [flowcell and ONT kit](https://denbi-nanopore-training-course.readthedocs.io/en/latest/basecalling/basecalling.html) used to generate your fast5 files. In this case the kit used was SQK-PCB109, and FLO-MIN106 flowcell. The config file used for this combination is:
**dna_r9.4.1_450bps_hac**

In [None]:
guppy_basecaller --compress_fastq -i ~/workdir/data/fast5_tiny/ \
-s ~/workdir/basecall_tiny/ --cpu_threads_per_caller 14 --num_callers 1 -c dna_r9.4.1_450bps_hac.cfg

If you encounter a `libzmq.so.5` error go [here](https://github.com/bulwark-crypto/Bulwark/issues/28) and use the code chunk to recompile zeromq library 

We will be using the script [ONT_analysis.sh](https://github.com/eKariuki-sleepy/ONT_Analysis/blob/main/ONT_analysis.sh) for basecalling. Please edit the variables accordingly to suit your directory structure. The exercise will take approximately 25 minutes running on 16 threads and one-caller. [This](https://denbi-nanopore-training-course.readthedocs.io/en/latest/basecalling/basecalling.html) resource provides more information about basecalling with guppy.

### Filter Low-quality reads with nanofilt

This tool is written for Python 3. It performs filtering on quality and/or read length, and optional trimming after passing filters.
Intended to be used:

 - directly after fastq extraction
 - prior to mapping
 - in a stream between extraction and mapping

In [None]:
#Install nanofilt
#conda
conda install -c bioconda nanofilt
#pip
pip install nanofilt

In [None]:
#This below runs Nanofilt, trimming quality at <int>7> (based on pycoQc analysis).
#Remember a quality of 10 indicates 90% accuracy.
gunzip -c reads.fastq.gz | NanoFilt -q 10 | gzip > highQuality-reads.fastq.gz

### Adapter and Quality Trimming with Porechop

Begin by cloning the repo (below). Running the setup.py script will compile the C++ components of Porechop and install a porechop executable:

In [None]:
#Clone the repo
git clone https://github.com/rrwick/Porechop.git
# move to the porechop working directory
cd Porechop
#Run the script setup.py to compile and install porechop
python3 setup.py install
#Check whether porechop has been successfully installed, as well as how its used
porechop -h

Basic adapter trimming can be done as follows:

In [None]:
porechop -i $fastq_dir -o output_reads.fastq.gz

In [None]:
python utils/set_confi.py -b recov_data.zip -i /home/ekariuki/eanbitRT21 -e /home/ekariuki/eanbitRT21/fastqs -m /home/ekariuki/eanbitRT21/fastqs -c config.ini

### Getting the Reference Genome for mapping.
Our data is sourced from the Black Soldier Fly (BSF). The BSF reference genome can be found [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_905115235.1/)

### References
1. Payne, A., Holmes, N., Rakyan, V., & Loose, M. (2019). BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics, 35(13), 2193–2198. https://doi.org/10.1093/BIOINFORMATICS/BTY841

