<hr>

# <font color='Purple'>GENETIC ANALYSIS: ONT SEQUENCING DATA ANALYSIS</font>

<hr>
    
Dr Graham S Sellers *g.sellers@hull.ac.uk*

![image](./images/unix.png)

# <font color='Purple'>Overview</font>

This practical will introduce you to the bioinformatic analysis pipeline used to generate results from ONT MinION sequencing data.  
At the end of this practical session you will have bioinformatically analysed a small test set of sequencing data.  
Importantly, you will have experienced the steps used in this kind of analysis.  

# <font color='Purple'>1. Introduction to the bioinformatic analysis workflow</font>

<hr>

This section of the practical assumes the completion of the previous section: **Introduction to the Unix command line**.  
You will use your newly acquired skills to explore the sequencing data and run the analysis in this section.  

## <font color='Purple'>1.1. What has happened so far</font>
Last week we created a final bacterial 16S rRNA library for sequencing on Oxford Nanopore Technologies' MinION.  
The final stages of library preparation have been performed. The prepared DNA sequencing library has been loaded onto the MinION and sequenced. The sequencing output has been "basecalled", so we now have data in a meaningful file format, ready for us to analyse bioinformatically.

## <font color='Purple'>1.2. What happens now?</font>
**We can at last analyse the data!**  

This is done in 3 stages:
1. Quality control
2. Taxonomic assignment
3. Format outputs

## <font color='Purple'>1.3. Workflow programs</font>

As a brief outline, we will use 3 specific command-line programs for data analysis.  
These are all pre-installed and ready to use, so do not worry.  

`seqkit` is used to look at the stats of the data (read length, read count etc.).  
`fastp` performs quality control.  
`kraken2` performs taxonomic assignment.  
We will use a custom `python` script to format the outputs.


# <font color='Purple'>2. Organising the working directory</font>

<hr>

Before we get started, we need to set up our working directory so that it is organised like a bioinformatic analysis project.  
We need to create a directory for each stage of the analysis to output results to.  
To do this we will use a new command line utility, `mkdir`.  

`mkdir` creates a directory at the path you give it. If you give just a name, the directory is created inside your current working directory (the one shown by `pwd`).

## <font color='Purple'>2.1. Create a *results* directory</font>


We need to create a **results** directory where all the outputs from the analysis go.  
Use the following command:

```
$ mkdir results
```
**Do it now:**

## <font color='Purple'>2.2. *fastp* and *Kraken 2* directories</font>

In the **results** directory we need to create **qc** (quality control) and **kraken2** (taxonomic assignment) directories.  

These are for `fastp` and `kraken2` to output to respectively.  
Use the following command to make the **qc** directory:

```
$ mkdir results/qc
```

**Note:** you did not have to change directory to do this. Giving the path to the new directory is sufficient. Another Unix command line skill learned :)

**Run the required commands to create both the qc and kraken2 directories:**

You should now have two directories: **qc** and **kraken2**, in your **results** directory. Use your command line skills to check:  
*hint* `ls`
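
As a side note, the same layout can also be created in one step per directory with the `-p` option of `mkdir`, which creates any missing parent directories along the path. The sketch below is for illustration only: it works inside a throwaway scratch directory made with `mktemp -d` (an assumption of this example), so it does not touch your real **results** directory.

```shell
# Work in a throwaway scratch directory (illustrative only)
scratch=$(mktemp -d)

# -p creates missing parent directories and does not complain if they already exist
mkdir -p "$scratch/results/qc"
mkdir -p "$scratch/results/kraken2"

# List the new directory tree recursively to check the result
ls -R "$scratch/results"
```

Without `-p`, `mkdir` fails if the parent directory (here **results**) does not exist yet, which is why the steps above created **results** first.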

### <font color='Purple'>Important: Before continuing, check with a demonstrator that you have done this all correctly!</font>

# <font color='Purple'>3. The bioinformatic analysis workflow</font>

<hr>

## <font color='Purple'>3.1. The fastq format</font>

The data we have is in "fastq" format. This is similar to "fasta" format: it is another way to store genetic data.  
However, fastq files hold more information: specifically, a quality score for every sequenced base.

A fastq file has 4 lines per sequence:  
1. Sequence name 
2. DNA sequence 
3. Spacer ("+")  
4. Quality score 

For example:
```
@sequence_name
GCGAACTTTGCTAGCGGCAAGGCGCTTACAGCAAGTCGAGCGAGGAC
+
%$%%$$$&''&&'((),,*%$%$#$%'''&&&'(((((-/////)((
```
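
To make the 4-line structure concrete, here is a minimal sketch that labels each line of a record by its position. The scratch file and the labels are assumptions made for this example; the record itself is the one shown above.

```shell
# Write the one-record example to a scratch file (illustrative only)
toy=$(mktemp)
cat > "$toy" <<'EOF'
@sequence_name
GCGAACTTTGCTAGCGGCAAGGCGCTTACAGCAAGTCGAGCGAGGAC
+
%$%%$$$&''&&'((),,*%$%$#$%'''&&&'(((((-/////)((
EOF

# Label each line by its position within the 4-line record
labelled=$(awk '
    NR % 4 == 1 { print "name:     " $0 }
    NR % 4 == 2 { print "sequence: " $0 }
    NR % 4 == 3 { print "spacer:   " $0 }
    NR % 4 == 0 { print "quality:  " $0 }
' "$toy")
printf '%s\n' "$labelled"
```

Notice that the sequence and quality lines have the same length: there is one quality character for each base.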

For quality control, the quality of each base is an important factor.  
Base quality is given as a "Q-Score". Each character in the quality line encodes the Q-Score of the base at the same position in the DNA sequence.  
Go to this link to see the actual values (look at the "Symbol" and "Q-Score" columns):

https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm  
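
The table in the link can also be reproduced with a little arithmetic. The sketch below assumes the common "Phred+33" encoding, where the Q-Score is the ASCII code of the symbol minus 33. The short quality string and the variable names are made up for this example.

```shell
# A short, made-up quality string (Phred+33 encoding assumed)
qual='$%(-/'

# Decode each symbol to its Q-Score (ASCII code minus 33) and report the mean
decoded=$(printf '%s\n' "$qual" | awk '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
        n = length($0); total = 0; scores = ""
        for (i = 1; i <= n; i++) {
            q = ord[substr($0, i, 1)] - 33
            total += q
            scores = (i > 1 ? scores " " : "") q
        }
        printf "%s\nmean Q: %.1f\n", scores, total / n
    }
')
printf '%s\n' "$decoded"
```

For this string the sketch prints the scores `3 4 7 12 14` and a mean of `8.0`. The symbol `(` decodes to Q7, a value you will meet again in the quality control step.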

Now that you understand the format of the data you need to analyse, use your command line skills to look at the first 4 lines of "test_sample.fastq" in the **data** directory:

Does this look like the format you expected?  

Discuss with the person next to you and check with a demonstrator.

## <font color='Purple'>3.2. Quality control of fastq data</font>
We have the raw reads from the MinION sequencer but we need to know which reads to keep.  
Which are of good enough quality to consider for further analysis?  

We will use `fastp` to do this. We will also use `seqkit` to inspect the data before and after quality control.  

### <font color='Purple'>3.2.1. *SeqKit* to inspect raw fastq data</font>
  
Let's look at the vital statistics of the "test_sample.fastq" file using `seqkit` prior to quality control:

In [None]:
seqkit stats data/test_sample.fastq

Can you describe exactly what the output is showing? Check with a demonstrator if you are unsure.
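
If the columns are unclear, it can help to compute a few of them by hand. The sketch below uses `awk` on a made-up three-read fastq file to calculate a read count and the total, minimum, mean and maximum read length. It is only a rough approximation of the kind of summary `seqkit stats` prints, not a replacement for it; the toy file and the variable names are assumptions of this example.

```shell
# A made-up fastq file with reads of length 8, 4 and 12 (illustrative only)
toy=$(mktemp)
cat > "$toy" <<'EOF'
@read_1
ACGTACGT
+
((((((((
@read_2
ACGT
+
((((
@read_3
ACGTACGTACGT
+
((((((((((((
EOF

# The sequence is the 2nd line of every 4-line record
stats=$(awk '
    NR % 4 == 2 {
        len = length($0)
        n++; sum += len
        if (n == 1 || len < min) min = len
        if (len > max) max = len
    }
    END { printf "num_seqs=%d sum_len=%d min_len=%d avg_len=%.1f max_len=%d\n", n, sum, min, sum / n, max }
' "$toy")
printf '%s\n' "$stats"
```

For the toy file this prints `num_seqs=3 sum_len=24 min_len=4 avg_len=8.0 max_len=12`.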

### <font color='Purple'>3.2.2. *fastp* qc of raw sequencing data</font>

OK, let's quality control the fastq data using `fastp`:

In [None]:
fastp \
    --disable_adapter_trimming \
    --in1 data/test_sample.fastq \
    --out1 results/qc/test_sample.qc.fastq \
    -j results/qc/test_sample.json \
    -h results/qc/test_sample.html \
    --qualified_quality_phred 7 \
    --unqualified_percent_limit 40 \
    --average_qual 7 \
    --length_required 1000 \
    --max_len1 1700

There is a lot going on here. **Don't panic!**  

`fastp` has a lot of functions and consequently a lot of "flags".  
It is a very powerful tool and well worth knowing how to use for quality control of sequencing data!  

Refer to the `fastp` docs to see what was just done:  

https://github.com/OpenGene/fastp#readme  

See if you can understand what the command above did. Check with a demonstrator if unsure.  
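
To give a feel for what two of these flags mean, here is a conceptual sketch in `awk`. It keeps a read only if it is long enough (the idea behind `--length_required`) and its average Q-Score is high enough (the idea behind `--average_qual`). This is not how `fastp` is implemented and it ignores the other flags. The toy reads, the small length threshold, the use of a simple arithmetic mean and the Phred+33 encoding are all assumptions of this example.

```shell
# Three made-up reads: one good, one too short, one of poor quality
toy=$(mktemp)
cat > "$toy" <<'EOF'
@long_good
ACGTACGT
+
((((((((
@too_short
ACGT
+
((((
@long_poor
ACGTACGT
+
########
EOF

# Keep a record only if it is long enough AND its mean Q-Score is high enough
kept=$(awk -v min_len=6 -v min_q=7 '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    { rec[NR % 4] = $0 }                      # buffer the current 4-line record
    NR % 4 == 0 {
        seq = rec[2]; qual = rec[0]; total = 0
        for (i = 1; i <= length(qual); i++) total += ord[substr(qual, i, 1)] - 33
        if (length(seq) >= min_len && total / length(qual) >= min_q)
            print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }
' "$toy")
printf '%s\n' "$kept"
```

Only the `@long_good` record survives: `@too_short` fails the length check, and `@long_poor` has a mean Q-Score of 2 (the symbol `#`), which is below 7.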



### <font color='Purple'>3.2.3. *SeqKit* to inspect quality controlled fastq data</font>
  
Let's look at the vital statistics of the quality controlled fastq data we have just generated.  
It is now in the **results/qc** directory as "test_sample.qc.fastq".  

Shall we see what quality control did? `seqkit` to the rescue:

In [None]:
seqkit stats results/qc/test_sample.qc.fastq

Compare the post quality control `seqkit` output to that of the `seqkit` output for the raw sequencing data.  

Does the difference make sense, given what `fastp` has done?  

Discuss with the person next to you or check with a demonstrator if unsure.
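
One simple way to summarise the comparison is the percentage of reads that survived quality control. The sketch below counts reads in two made-up fastq files by dividing the number of lines by 4, which assumes every record occupies exactly 4 lines. The files and the `count_reads` helper are hypothetical names used only for this example.

```shell
# Two made-up fastq files standing in for the raw and quality controlled data
raw=$(mktemp); qc=$(mktemp)
printf '@r1\nACGT\n+\n((((\n@r2\nAC\n+\n##\n' > "$raw"
printf '@r1\nACGT\n+\n((((\n' > "$qc"

# A fastq record is 4 lines, so reads = lines / 4
count_reads() { echo $(( $(wc -l < "$1") / 4 )); }

raw_n=$(count_reads "$raw")
qc_n=$(count_reads "$qc")
echo "retained $qc_n of $raw_n reads ($(( 100 * qc_n / raw_n ))%)"
```

For the toy files this reports that 1 of 2 reads (50%) was retained.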

## <font color='Purple'>3.3. Taxonomic assignment with *Kraken 2*</font>
The raw reads have been quality controlled with `fastp`. They are now considered of sufficient quality for taxonomic assignment.  

`kraken2` is an ideal program to use for analysing Oxford Nanopore Technologies' MinION sequencing data. It is a "k-mer" based taxonomic assignment method. Again, it is an incredibly powerful tool, worth knowing how to use correctly.  

In your spare time, and if you are interested, read the supplied literature in the Genetic Analysis module's "to read" section for an insight into what `kraken2` does. Search online for further examples of k-mer based methods to get a better overview of the approach.
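
A k-mer is simply a stretch of k consecutive bases. As a minimal sketch of the idea, the code below slides a window of width k along a short made-up sequence and prints every k-mer. The value k=4 is chosen only for readability; `kraken2` itself works with much longer k-mers and a far more elaborate database lookup than is shown here.

```shell
# A short, made-up sequence and a small k for readability
read_seq="GCGAAC"
k=4

# Slide a window of width k along the sequence, one base at a time
kmers=$(printf '%s\n' "$read_seq" | awk -v k="$k" '{
    for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
}')
printf '%s\n' "$kmers"
```

A sequence of length 6 contains 6 - 4 + 1 = 3 overlapping 4-mers: `GCGA`, `CGAA` and `GAAC`.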

### <font color='Purple'>3.3.1. *Kraken 2* database</font>


`kraken2` requires a database with which to perform its analysis. This is supplied.  
The database *is* the directory **bact_db**: a `kraken2` database specifically created for this example analysis.  

### <font color='Purple'>3.3.2. *Kraken 2* taxonomic assignment</font>

Now we have a database and some quality controlled fastq data.  

Unleash the Kraken 2:

In [None]:
kraken2 \
    results/qc/test_sample.qc.fastq \
    --db bact_db \
    --report results/kraken2/test_sample.txt \
    --output results/kraken2/test_sample.krk \
    --minimum-base-quality 7 \
    --confidence 0.02

Again, there is a lot going on here. **Don't panic!**  

The important bit is that it has done its job. The terminal should have shown a lot of text as the program ran. Have a look at this and see if you can make sense of it.  

Check with a demonstrator if unsure.

The **results/kraken2** directory should now contain two files: the report "test_sample.txt" and the output "test_sample.krk".

**Explore them using your command line skills:**
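
While exploring, it helps to know that the standard `kraken2` output file has one tab-separated line per read, and the first column is `C` (classified) or `U` (unclassified). The sketch below counts both on a made-up three-line file. The read names, lengths and k-mer counts are invented for this example and are not results from the practical.

```shell
# A made-up output file in the standard Kraken 2 layout:
# status, read name, taxonomy ID, read length, k-mer mapping
krk=$(mktemp)
printf 'C\tread_1\t562\t1450\t562:30 0:5\n'  > "$krk"
printf 'U\tread_2\t0\t1200\t0:40\n'         >> "$krk"
printf 'C\tread_3\t1280\t1500\t1280:25\n'   >> "$krk"

# Count classified and unclassified reads from the first column
summary=$(awk -F '\t' '
    $1 == "C" { classified++ }
    $1 == "U" { unclassified++ }
    END { printf "classified=%d unclassified=%d\n", classified, unclassified }
' "$krk")
printf '%s\n' "$summary"
```

To try the same count on your own data, replace `"$krk"` with `results/kraken2/test_sample.krk`.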

## <font color='Purple'>3.4. Formatting the outputs</font>

`kraken2` has given us outputs, but they are not easily readable, nor are they ideal for downstream analysis with *R*, for example.  

The next step is to modify these outputs to be in a format for our use.  

To do this we will use a custom `python` script to transform the output.  
`python` is its own coding language, but regardless of the language it is written in, this script serves an essential purpose. We shall now use it.  

In the cell below we call the interpreter `python`, followed by the path to the script and its required arguments: `-i` and `-o`.

Run the command below to execute the `python` script:

In [None]:
python scripts/to_tsv.py -i results/kraken2/test_sample.txt -o results/test_sample.tsv

Can you determine what it has done from the command?  

For those who are really interested, look at the script itself. It is located in the **scripts** directory.  
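
To illustrate the general idea of reformatting, and not the behaviour of `scripts/to_tsv.py` itself, here is a hypothetical sketch that uses `awk` rather than `python`. It assumes the standard six-column `kraken2` report layout (percentage, reads in clade, reads assigned directly, rank code, taxonomy ID, indented name), keeps only the species rows (rank code `S`) and writes a small table with a header. The report contents and the output column names are invented for this example.

```shell
# A made-up, simplified report in the standard Kraken 2 report layout
report=$(mktemp)
printf ' 66.67\t2\t0\tG\t561\t    Escherichia\n'               > "$report"
printf ' 66.67\t2\t2\tS\t562\t      Escherichia coli\n'       >> "$report"
printf ' 33.33\t1\t1\tS\t1280\t      Staphylococcus aureus\n' >> "$report"

# Keep species rows, strip the name indentation and add a header line
tsv=$(awk -F '\t' -v OFS='\t' '
    BEGIN { print "taxon", "taxid", "reads" }
    $4 == "S" {
        name = $6
        sub(/^ +/, "", name)      # remove the indentation used to show hierarchy
        print name, $5, $2
    }
' "$report")
printf '%s\n' "$tsv"
```

A flat table with a header and one row per taxon is the kind of structure that tools such as *R* can read directly.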

## <font color='Purple'>3.5. The final output</font>

We have now generated an output that is suitable for our requirements.  

Here we will use the `cat` command to view the whole file **results/test_sample.tsv**:

In [None]:
cat results/test_sample.tsv

Is this a more approachable format for downstream analysis?  

Discuss with the person next to you or with a demonstrator.

## <font color='Purple'>-- PAUSE --</font>

You have just experienced the workflow for taxonomic assignment of sequencing data.  
You have used `fastp` to quality control data and `kraken2` to taxonomically assign it.  
Using the command line you have explored the data at different stages of analysis.

### <font color='Purple'>**WELL DONE!**</font>

# <font color='Purple'>4. Future reference</font>


**Important:** Before you leave this Jupyter notebook, use your web browser's menu to `print as PDF`. Ask a demonstrator if you need help.  

This can then be kept as a reference for you in the future. Something to look back on.  
    
# <font color='Purple'>End of session</font>


**You are now ready for the next session:**
### <font color='Purple'>*R* analysis of sequencing data</font>