# Module: Taxonomic Annotation of Metagenomic Reads

Taxonomic annotation of metagenomic data is crucial for characterizing the species present in a community. It can be done at either the read level or the contig (assembled data) level. Read-level annotation provides a quick way to profile the taxonomic composition of a community, although the short sequences limit taxonomic resolution. It is useful for an initial exploration of who is in your samples and for a rough estimate of their relative proportions.

Below, we will see how to taxonomically annotate clean reads from shotgun metagenomic data using Kraken2 + Bracken, and generate an interactive visual using Krona.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Make sure the tools are already installed (see **Tools Used** below if not).
2. Activate the environment, replacing the environment name as needed.
```bash
conda activate read-tax-annot-env
```
3. Open Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
4. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. **Kraken2**
2. **Bracken**
3. **Krona**

To install these tools, locate the `read-tax-annot.yaml` file in the same folder as this notebook in the repository, then run the command below in a terminal:

```bash
conda env create -f read-tax-annot.yaml
```
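
Once the environment is created and activated, it can help to confirm that the executables are visible before running any cells. The snippet below is a minimal sketch that assumes the environment provides the commands `kraken2`, `bracken`, and `ktImportTaxonomy` on the `PATH`; the helper name `find_tools` is only illustrative.

```python
import shutil

# Executable names assumed to be provided by the conda environment.
TOOLS = ["kraken2", "bracken", "ktImportTaxonomy"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools(TOOLS).items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

If any tool is reported as not found, check that the correct environment is active in the session that launched Jupyter.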

---
## Starting Files 

1. Clean paired-end reads (FASTQ format; see **Quality Control Module**).

---
## Expected Outputs

1. Kraken2 + Bracken output taxonomy files.
2. Krona interactive visual.

---
## Table of Contents
 * [**Taxonomic Annotation**](#Taxonomic-Annotation)
     * [Kraken2](#Kraken2)
     * [Bracken](#Bracken)
 * [**Visualize**](#Visualize)

----
# <font color = 'gray'>Taxonomic Annotation</font>

## Kraken2

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

Source: _[Kraken2 GitHub Wiki](https://github.com/DerrickWood/kraken2/wiki/Manual)_
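
The k-mer lookup described above can be illustrated with a toy example. This is only a simplified sketch, not Kraken2's implementation: the taxonomy, the k-mer table, the value of k, and the function names are all made up for illustration, and the real tool uses minimizers, a compact hash table, and additional tie-breaking rules.

```python
from collections import Counter

# Hypothetical toy taxonomy: taxon -> parent taxon ("root" has no parent).
PARENT = {
    "root": None,
    "Genus_A": "root",
    "species_A1": "Genus_A",
    "species_A2": "Genus_A",
}

# Hypothetical k-mer database (k = 4). Each k-mer maps to the LCA of all
# genomes containing it; "ACGT" occurs in both species, so its LCA is the genus.
KMER_TO_LCA = {
    "ACGT": "Genus_A",
    "CGTA": "species_A1",
    "GTAC": "species_A1",
    "TTTT": "species_A2",
}


def kmers(seq, k=4):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def classify(read, k=4):
    """Label a read with the hit taxon whose lineage collects the most k-mer hits."""
    hits = Counter(KMER_TO_LCA[m] for m in kmers(read, k) if m in KMER_TO_LCA)
    if not hits:
        return "unclassified"
    # Score each hit taxon by summing the hits along its path to the root.
    scores = {t: sum(hits[a] for a in lineage(t)) for t in hits}
    return max(scores, key=scores.get)


print(classify("ACGTAC"))  # k-mers: ACGT, CGTA, GTAC
print(classify("GGGGGG"))  # no k-mer is in the toy database
```

In this sketch, a read whose k-mers all map to the genus is labeled at the genus level, which is why short reads from closely related species often cannot be resolved to species.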

`kraken2` has several pre-built indexed databases. Many of these are very large and impractical to use on a personal desktop. Below, we demonstrate its command-line usage with a much smaller reference database (MiniKraken v2). Note, however, that the larger databases are strongly recommended because they improve the sensitivity of taxonomic assignment (you can find the reference databases [here](https://benlangmead.github.io/aws-indexes/k2)). To use them, you can run the analysis on an HPC cluster or on the [Galaxy Webserver](https://usegalaxy.eu/).

The arguments used below are:

| option/input | description |
| :-: | :- |
| `--db` | Directory/folder containing indexed database. |
| `--paired` | Paired end mode. Specify the forward read FASTQ file (`PE_1.fastq`) first followed by the reverse read (`PE_2.fastq`). |
| `--report` | Output report file. |
| `--output` | Per-read taxonomic classification file. |

In [None]:
!kraken2 \
    --db minikraken2_v2_8GB_201904_UPDATE \
    --paired 'PE_1.fastq' 'PE_2.fastq' \
    --report kraken2.kreport \
    --output kraken2.classification

The `kraken2.kreport` output file contains a summary of the number of reads assigned to each taxonomic lineage. The `kraken2.classification` file is a tab-separated file with one line per read (or read pair) giving its taxonomic classification. The columns are described here: [Kraken2 classification output](https://github.com/DerrickWood/kraken2/wiki/Manual#standard-kraken-output-format).
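
To get a quick feel for the per-read file, it can be summarized in a few lines of Python. The sketch below assumes the standard five-column Kraken output layout (status, read ID, taxonomy ID, read length, k-mer LCA mapping); the example lines and the `summarize` helper are made up for illustration.

```python
import io
from collections import Counter

# Made-up lines in the standard five-column Kraken output layout:
# status (C/U), read ID, taxonomy ID, read length(s), k-mer LCA mapping.
EXAMPLE = (
    "C\tread_1\t562\t100|100\t562:40 561:10 0:16 |:| 562:66\n"
    "U\tread_2\t0\t100|100\t0:66 |:| 0:66\n"
    "C\tread_3\t1280\t100|98\t1280:30 0:36 |:| 1279:20 0:44\n"
)


def summarize(handle):
    """Count classified reads and tally classified reads per taxonomy ID."""
    total = 0
    per_taxid = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        status, read_id, taxid, length, lca_map = line.split("\t", 4)
        total += 1
        if status == "C":  # "C" = classified, "U" = unclassified
            per_taxid[taxid] += 1
    classified = sum(per_taxid.values())
    return {"total": total, "classified": classified, "per_taxid": dict(per_taxid)}


summary = summarize(io.StringIO(EXAMPLE))
print(summary)
```

To run it on real output, replace `io.StringIO(EXAMPLE)` with an open file handle, for example `open("kraken2.classification")`.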

## Bracken

Kraken2 classifies reads to the best-matching location in the taxonomic tree but does not estimate species abundances. Bracken uses the Kraken database itself to derive probabilities that describe how much sequence from each genome is identical to other genomes in the database, and combines this information with the read assignments for a particular sample to estimate abundance at the species level, the genus level, or above. Combined with the Kraken classifier, Bracken produces accurate species- and genus-level abundance estimates even when a sample contains two or more near-identical species.

Source: _[CCB JHU](https://ccb.jhu.edu/software/bracken/)_
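
The idea of redistributing ambiguous reads can be sketched with a toy calculation. This is a simplified illustration under stated assumptions, not Bracken's actual code: it handles a single genus, and the probabilities `p_genus_given_species` (the chance that a read from each species is only classifiable at the genus level) are made-up numbers that Bracken would instead derive from the database.

```python
def redistribute(genus_reads, species_reads, p_genus_given_species):
    """Push genus-level reads down to species with a Bayes-style weighting.

    Each species is weighted by (probability that its reads stop at the genus)
    multiplied by (reads already assigned to it), then weights are normalized.
    """
    weights = {
        s: p_genus_given_species[s] * species_reads[s] for s in species_reads
    }
    total = sum(weights.values())
    return {
        s: species_reads[s] + genus_reads * w / total for s, w in weights.items()
    }


# Hypothetical sample: 100 reads stuck at the genus level, two species below it.
species_reads = {"species_A1": 300, "species_A2": 100}
p_genus_given_species = {"species_A1": 0.25, "species_A2": 0.75}

estimates = redistribute(100, species_reads, p_genus_given_species)
print(estimates)
```

Here the less abundant species receives as many redistributed reads as the more abundant one because more of its genome is shared within the genus, which is the kind of correction a raw Kraken2 count would miss.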

Below, we demonstrate how to use `bracken` to get a more accurate estimate of the abundances of taxonomic groups. `bracken` is also available on the [Galaxy Webserver](https://usegalaxy.eu/), so if you ran `kraken2` there, you can easily continue with the workflow.

| option/input | description |
| :-: | :- |
| `-d` | Directory/folder containing indexed database. |
| `-i` | Kraken2 report file. |
| `-o` | Bracken report file. |
| `-l` | Level to estimate abundance at {options: D,P,C,O,F,G,S,S1,etc} (default: S = Species) |

In [None]:
!bracken \
    -d minikraken2_v2_8GB_201904_UPDATE \
    -i kraken2.kreport \
    -o bracken.report \
    -l S

This will generate a new tab-separated file (`bracken.report`) listing, for each taxonomic group, the number of reads assigned by Kraken2 and the new read count estimated by Bracken. Bracken also writes a file in the same format as `kraken2.kreport`, but with the re-estimated read counts.
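
The Bracken table can be inspected directly in the notebook. The sketch below assumes the usual Bracken column headers (`name`, `taxonomy_id`, `taxonomy_lvl`, `kraken_assigned_reads`, `added_reads`, `new_est_reads`, `fraction_total_reads`); the example rows and the `top_taxa` helper are made up for illustration.

```python
import csv
import io

# Made-up rows using the column headers assumed for a Bracken output table.
EXAMPLE = (
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
    "added_reads\tnew_est_reads\tfraction_total_reads\n"
    "Species B\t1002\tS\t100\t50\t150\t0.30000\n"
    "Species A\t1001\tS\t300\t50\t350\t0.70000\n"
)


def top_taxa(handle, n=10):
    """Return the n taxa with the highest Bracken-estimated read counts."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda r: int(r["new_est_reads"]), reverse=True)
    return [
        (r["name"], int(r["new_est_reads"]), float(r["fraction_total_reads"]))
        for r in rows[:n]
    ]


for name, reads, fraction in top_taxa(io.StringIO(EXAMPLE), n=5):
    print(f"{name}\t{reads}\t{fraction:.2%}")
```

To use it on real output, pass an open file handle such as `open("bracken.report", newline="")` instead of the example string.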

----
# <font color = 'gray'>Visualize</font>

Before generating the visualization, if Krona is newly installed, you must first set up the NCBI taxonomy mapping files.

In [None]:
!ktUpdateTaxonomy.sh 

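Then, to visualize the taxonomic profile inferred by `kraken2` + `bracken`, we use the `ktImportTaxonomy` utility from the Krona package. It generates an interactive Krona chart that makes it easy to explore the proportions of taxonomic groups at different taxonomic levels. The input is the Kraken-style report containing Bracken's re-estimated read counts (here, `kraken2_bracken_species.tabular`); adjust the file name to match your own output.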

| option/input | description |
| :-: | :- |
| `-t` | Column number of the NCBI taxonomy ID. |
| `-m` | Column number of the magnitude, here the read count. |
| `-o` | Output visualization. |

In [None]:
!ktImportTaxonomy \
    -t 5 \
    -m 3 \
    -o krona.html \
    kraken2_bracken_species.tabular
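
To see what `-t 5` and `-m 3` select, the same two columns can be pulled out in Python. This is a minimal sketch that assumes the input follows the standard six-column Kraken report layout (percentage, clade reads, directly assigned reads, rank code, NCBI taxonomy ID, name); the example lines and the `select_columns` helper are made up for illustration.

```python
import io

# Made-up lines in the six-column Kraken report layout assumed here.
EXAMPLE = (
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t20\tR\t1\troot\n"
    " 88.00\t880\t30\tG\t1001\t  Genus A\n"
    " 85.00\t850\t850\tS\t1002\t    Species A1\n"
)


def select_columns(handle, taxid_col=5, magnitude_col=3):
    """Return (taxonomy ID, magnitude) pairs using 1-based column numbers."""
    pairs = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < max(taxid_col, magnitude_col):
            continue  # skip blank or truncated lines
        taxid = int(fields[taxid_col - 1])
        magnitude = int(fields[magnitude_col - 1])
        pairs.append((taxid, magnitude))
    return pairs


print(select_columns(io.StringIO(EXAMPLE)))
```

Under this layout, column 3 holds the reads assigned directly to each taxon, so the magnitudes across all rows add up to the total number of reads in the report.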