# Module: Taxonomic Annotation

Taxonomic annotation of contigs (and bins) is a crucial step in metagenomic analysis, enabling the identification and classification of microbial communities from sequencing data. Contigs, which are contiguous sequences assembled from reads, can be annotated by comparing them to reference databases to determine their likely taxonomic origin. By doing so, we gain insights into the diversity of microorganisms within a sample.

In this module, we explore different tools commonly used for taxonomic classification of contigs and bins.

Created by: Microbial Oceanography Laboratory (MOLab)

---
## How to Use This Notebook

1. Make sure the required tools are installed (see **Tools Used** below if they are not).
2. Activate the conda environment, replacing the environment name as needed.
```bash
conda activate tax-annot-env
```
3. Start Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
4. To run a cell in this notebook, press Shift+Enter.

---
## Tools Used
1. **EukRep**
2. **VirSorter2**
3. **MMSeqs2**
4. **Bin Annotation Tool (BAT)**

To install tools (1) to (3), locate the `tax-annot.yaml` file in the same folder as this notebook in the repository, then run the command below in a terminal:
```bash
conda env create -f tax-annot.yaml
```

Since tool (4) requires a large amount of memory (due to the size of the database used), it is better to run it on an HPC cluster or through the [EU Galaxy Webserver](https://usegalaxy.eu/).

---
## Starting Files

1. A contig FASTA file (`assembly_contigs.fa`) generated from assembly (see **Assembly Module**).
2. A FASTA file of a metagenome-assembled genome or bin (`bin_7.fa`; see **Binning Module**).

---
## Expected Outputs

Outputs depend on the tool used and are described in the "What are its outputs?" subsection of each tool.

---
## Table of Contents
 * [**EukRep**](#EukRep)  
 * [**VirSorter2**](#VirSorter2)  
 * [**MMSeqs2**](#MMSeqs2)  
 * [**Bin Annotation Tool (BAT)**](#Bin-Annotation-Tool-(BAT))

---
# <font color = 'gray'>EukRep</font>

`EukRep` is a tool designed to identify eukaryotic contigs in metagenomic datasets by analyzing k-mer profiles. While it does not provide specific taxonomic classifications for the identified contigs, it excels in isolating the eukaryotic fraction of the dataset. This makes it particularly useful for downstream analyses that focus on diversity, functional potential, or evolutionary relationships of eukaryotes within a metagenomic sample.
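
To make the idea of a k-mer profile concrete, the sketch below computes the relative frequency of every DNA k-mer in a sequence. It is only an illustration: the function name `kmer_profile` and the default `k=5` are assumptions made for this example, and the sketch does not reproduce EukRep's actual features or classifier.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=5):
    """Return the relative frequency of every unambiguous DNA k-mer in seq."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k k-mers built from A, C, G and T only,
    # so windows containing N or other ambiguity codes are ignored.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


# Toy sequence (made up for illustration), using 2-mers to keep the profile small.
profile = kmer_profile("ACGTACGTAC", k=2)
print(len(profile))   # number of possible 2-mers
print(profile["AC"])  # share of windows equal to "AC"
```

A classifier can then be trained on such fixed-length frequency vectors, which is why contigs of different lengths can be compared on the same footing.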

### How to run?

To run `EukRep`, you simply need to provide the FASTA file of contigs.

| option/input | description |
| :-: | :- |
| `-i` | Input contig FASTA file. |
| `-o` | Output filename of predicted eukaryotic sequences. |

In [None]:
!EukRep \
    -i assembly_contigs.fa \
    -o euk_contigs.fa 

### What are its outputs?

As mentioned above, `EukRep` will output a FASTA file containing the contigs inferred as eukaryotic.

<div class="alert alert-block alert-warning">
<b>Warning</b> 

You can also extract the prokaryotic contigs using the <code>--prokarya</code> argument. However, for <code>EukRep</code>, prokaryotic contigs are simply the sequences its model did not infer as eukaryotic. This label is not entirely accurate, since some of those contigs may be viral.
</div>
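
As a quick sanity check on the EukRep output, you may want to know what share of the input contigs ended up in the eukaryotic FASTA file. The sketch below is one way to compute this, assuming that the output file keeps the original contig headers; the helper names `fasta_ids` and `eukaryotic_fraction` are hypothetical, and the toy records are made up.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word of each header line) from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def eukaryotic_fraction(all_lines, euk_lines):
    """Fraction of input contigs that also appear in the eukaryotic FASTA."""
    all_ids = fasta_ids(all_lines)
    euk_ids = fasta_ids(euk_lines)
    if not all_ids:
        return 0.0
    return len(all_ids & euk_ids) / len(all_ids)


# Toy in-memory records standing in for the two FASTA files.
all_contigs = [">c1\n", "ACGT\n", ">c2\n", "GGCC\n", ">c3\n", "TTAA\n", ">c4\n", "CCGG\n"]
euk_contigs = [">c3\n", "TTAA\n"]
print(eukaryotic_fraction(all_contigs, euk_contigs))
```

With real data, the same functions can be given open file handles, for example `eukaryotic_fraction(open("assembly_contigs.fa"), open("euk_contigs.fa"))`.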

---
# <font color = 'gray'>VirSorter2</font>

VirSorter2 uses multiple classifiers to identify diverse DNA and RNA viral sequences.

### How to run?

Before identifying viral contigs, the database must be set up. The command below downloads the database and sets it up in the `db` directory using 4 threads. Alternatively, you can download the database manually from this link: https://osf.io/v46sc/download

In [None]:
!virsorter setup -d db -j 4

If you downloaded the database manually instead, extract the compressed file (assumed here to be saved as `db.tgz`).

In [None]:
!tar -xzf db.tgz

Then run the configuration command so that VirSorter2 knows where the database is located.

In [None]:
!virsorter config --init-source --db-dir=./db

<div class="alert alert-block alert-info">
<b>Note</b>: 
    
You only need to set up the database the first time you use VirSorter2.
</div>

After setting up the database, you can identify viral contigs in your dataset. A sample command is shown below; replace `test.fa` with your own contig FASTA file (e.g., `assembly_contigs.fa`).

| option/input | description |
| :-: | :- |
| `-w` | Output (working) directory. |
| `-i` | Input contig FASTA file. |
| `--min-length` | Minimum contig length (in bp) to include in the analysis. |
| `-j` | Number of threads. |
| `all` | Run the whole pipeline. |

In [None]:
!virsorter run \
    -w test.out \
    -i test.fa \
    --min-length 1500 \
    -j 4 \
    all

### What are its outputs?

Detailed descriptions of outputs are discussed here: [VirSorter2 outputs](https://github.com/jiarong/VirSorter2#detailed-description-on-output-files).
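
Once the run has finished, a common next step is to shortlist the contigs with the strongest viral signal. The sketch below assumes that the score table described in the linked documentation is tab-separated and has columns named `seqname` and `max_score`; please confirm this against your own output before relying on it. The function name `high_confidence_viral`, the 0.9 cutoff, and the sample rows are all made up for illustration.

```python
import csv
import io


def high_confidence_viral(tsv_text, min_score=0.9):
    """Return sequence names whose max_score is at least min_score."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # A score of "nan" never passes the comparison, so such rows are skipped.
    return [row["seqname"] for row in reader if float(row["max_score"]) >= min_score]


# Made-up rows that only mimic the assumed column layout.
sample = (
    "seqname\tmax_score\tmax_score_group\n"
    "contig_1\t0.98\tdsDNAphage\n"
    "contig_2\t0.55\tssDNA\n"
    "contig_3\tnan\tRNA\n"
)
print(high_confidence_viral(sample))
```

With a real run, the text of the score table can be passed in by reading the file first, for example `high_confidence_viral(open(path).read())`, where `path` points to the table inside the output directory.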

---
# <font color = 'gray'>MMSeqs2</font>

`mmseqs` predicts contig taxonomy by first comparing all 6-frame translated sequences within a contig to a reference database. Afterwards, it identifies the Lowest Common Ancestor (LCA) of the taxonomies associated with each coding DNA sequence (CDS).
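
The LCA idea can be illustrated with a few lines of Python. The sketch below computes a strict lowest common ancestor from root-to-leaf lineages; it is a simplified illustration rather than MMseqs2's implementation, which is more elaborate. The function name `lowest_common_ancestor` and the toy lineages are assumptions made for this example.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages, or None."""
    if not lineages:
        return None
    lca = None
    # zip(*lineages) walks the lineages level by level and stops at the shortest one.
    for taxa_at_level in zip(*lineages):
        if all(taxon == taxa_at_level[0] for taxon in taxa_at_level):
            lca = taxa_at_level[0]
        else:
            break
    return lca


# Toy lineages for three translated fragments from the same contig.
hits = [
    ["Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Enterobacterales"],
    ["Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Vibrionales"],
    ["Bacteria", "Pseudomonadota", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hits))
```

In this toy case the three fragments agree only down to the phylum, so the contig would be labeled at that level rather than with any of the more specific taxa.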

### How to run?

#### a. Building or downloading the reference database

First, select a reference database. A list of available pre-processed databases is provided here: [MMSeqs2 databases](https://github.com/soedinglab/mmseqs2/wiki#downloading-databases). The databases marked "yes" in the "taxonomy" column are the relevant ones for this module. Below, we download the SwissProt database, a highly curated protein database. Curation ensures that the reference sequences are accurate; the downside is that, because it contains fewer sequences, searches against it may be less sensitive.

In [None]:
!mmseqs databases UniProtKB/Swiss-Prot taxdb/swissprot tmp

<div class="alert alert-block alert-info">
<b>Note</b>: 
    
The available pre-processed databases may not always suit your needs. If you wish to create a custom MMSeqs2 taxonomy database, instructions are provided here: <a href="https://github.com/soedinglab/MMseqs2/wiki#creating-a-seqtaxdb">Creating a seqTaxDB</a>.
</div>

#### b. Build a query database

Next, we create a query database which is basically just an indexed form of your contigs.

In [None]:
!mmseqs createdb \
    assembly_contigs.fa \
    querydb/assembly_contigs_idx

#### c. Assign taxonomies to the query sequences

This step runs the core mmseqs taxonomy assignment algorithm briefly mentioned above.

In [None]:
!mmseqs taxonomy \
    querydb/assembly_contigs_idx \
    taxdb/swissprot \
    mmseqs_tax_res/mmseqs_tax_result \
    mmseqs_tax_res/tmp \
    --tax-lineage 1 \
    --majority 0.4 \
    --vote-mode 1 \
    --lca-mode 3 \
    --orf-filter 0

#### d. Filter contigs by taxonomy assignment (optional)

If you are interested in studying a specific taxonomic group, you can use the `mmseqs filtertaxdb` command. For example, if you want to retrieve the viral contigs only, you can use the command below. The number specified in the `--taxon-list` argument corresponds to the NCBI taxon ID for viruses. You can search for taxonomy IDs in the [NCBI taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).

In [None]:
!mmseqs filtertaxdb \
    taxdb/swissprot \
    mmseqs_tax_res/mmseqs_tax_result \
    mmseqs_tax_res/mmseqs_tax_result_virus \
    --taxon-list 10239

#### e. Generate reports

We can summarize the results of the taxonomy assignment above in a TSV file and a Krona chart.

In [None]:
# Create a TSV file of the contig taxonomy assignments
!mmseqs createtsv \
    querydb/assembly_contigs_idx \
    mmseqs_tax_res/mmseqs_tax_result \
    mmseqs_tax_res/mmseqs_tax_result.tsv

In [None]:
# Create a Krona chart
!mmseqs taxonomyreport \
    taxdb/swissprot \
    mmseqs_tax_res/mmseqs_tax_result \
    mmseqs_tax_res/mmseqs_tax_result_report.html \
    --report-mode 1

### What are its outputs?

As discussed above, the final outputs are:

1. `mmseqs_tax_result.tsv` - TSV file displaying the taxonomic annotation of each FASTA entry (i.e., contig). More details about the format of the output TSV are given [here](https://github.com/soedinglab/MMseqs2/wiki#taxonomy-output-and-tsv).
2. `mmseqs_tax_result_report.html` - Krona chart to visualize taxonomic annotations. Note that it does not represent the relative abundances of these taxonomic groups, only the proportion of contigs with specific taxonomic assignments.
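
For a quick overview of the TSV output, you could tally how many contigs were assigned at each taxonomic rank. The sketch below assumes that the first four tab-separated columns are the contig ID, the NCBI taxon ID, the rank, and the scientific name, as described in the linked documentation; the function name `count_by_rank` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_rank(tsv_text):
    """Count contigs per taxonomic rank (third column of the taxonomy TSV)."""
    counts = Counter()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank or truncated lines that lack the first four columns.
        if len(row) >= 4:
            counts[row[2]] += 1
    return counts


# Made-up rows that only mimic the assumed column layout.
sample_tsv = (
    "contig_1\t562\tspecies\tEscherichia coli\n"
    "contig_2\t0\tno rank\tunclassified\n"
    "contig_3\t2\tsuperkingdom\tBacteria\n"
    "contig_4\t1224\tphylum\tPseudomonadota\n"
    "contig_5\t562\tspecies\tEscherichia coli\n"
)
rank_counts = count_by_rank(sample_tsv)
print(rank_counts.most_common())
```

A large share of contigs resolved only at high ranks, or left unclassified, can be a hint that a broader reference database is worth trying.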

Depending on your objectives, you may want to further explore sequences from specific taxonomic groups. To do so, create a TSV file (`mmseqs createtsv`) from the filtered result produced in step d above. This gives you a list of contigs assigned to your group of interest. Afterwards, use the custom Python script below to filter your query contigs. We assume that the output TSV file of the filtered data is named `filt_tsv.tsv`.

In [None]:
!./filter_fasta_by_id.py \
    --filt_file_in filt_tsv.tsv \
    --fasta_in assembly_contigs.fa \
    --filt_file_type GENERAL \
    --fasta_out filt_assembly_contigs.fa

The custom script will produce a filtered FASTA file (`filt_assembly_contigs.fa`) containing only contigs with hits to your taxonomy of interest.
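
If the custom script is not at hand, the same kind of filtering can be sketched in a few lines of Python. The code below is not the `filter_fasta_by_id.py` script itself; it is a minimal illustration that assumes the contig ID is the first tab-separated column of the TSV and the first word of each FASTA header. The helper names `ids_from_tsv` and `filter_fasta` and the toy records are hypothetical.

```python
def ids_from_tsv(tsv_lines):
    """Collect the first tab-separated field of every non-empty line."""
    return {line.rstrip("\n").split("\t")[0] for line in tsv_lines if line.strip()}


def filter_fasta(fasta_lines, keep_ids):
    """Yield the lines of FASTA records whose ID is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # A new record starts: decide whether to keep it.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Toy in-memory inputs standing in for filt_tsv.tsv and assembly_contigs.fa.
tsv = ["contig_2\t10239\tsuperkingdom\tViruses\n"]
fasta = [
    ">contig_1 len=8\n", "ACGTACGT\n",
    ">contig_2\n", "GGGG\n", "CCCC\n",
    ">contig_3\n", "TTTT\n",
]
kept = list(filter_fasta(fasta, ids_from_tsv(tsv)))
print("".join(kept), end="")
```

Because both functions accept any iterable of lines, they can be given open file handles directly when working with real files.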

---
# <font color = 'gray'>Bin Annotation Tool (BAT)</font>

"... Bin Annotation Tool (BAT) are pipelines for the taxonomic classification of ... metagenome assembled genomes (MAGs / bins) of both known and (highly) unknown microorganisms, as generated by contemporary metagenomics studies. The core algorithm of both programs involves gene calling, mapping of predicted ORFs against a protein database, and voting-based classification of the entire contig / MAG based on classification of the individual ORFs."

Source: [CAT/BAT](https://github.com/MGXlab/CAT_pack)

### How to run?

BAT requires substantial computational resources because of its large database. Besides running it on an HPC cluster, you can run BAT through the [EU Galaxy Webserver](https://usegalaxy.eu/).

1. On the tool search panel (left side), search for "CAT bins".

![tools](images/tool_search.png)

2. Assuming you have already uploaded your bin FASTA file, select it under the "metagenome assembled genomes (MAGs/bins)" section. Then choose the latest database version under the "Use a built-in CAT database" option.

![inputs](images/inputs.png)

3. You can leave the other parameters as default and execute by clicking "Run Tool".

### What are its outputs?

There are several outputs, but the most relevant will likely be `BAT.bin2classification.txt`, which reports the inferred lineage of the bin (as NCBI taxonomy IDs) and the confidence score of the assignment at each taxonomic level.
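
To read this file programmatically, the lineage and its scores can be paired level by level. The sketch below assumes a tab-separated layout with the bin name, the classification status, the reason, a semicolon-separated lineage of taxonomy IDs, and a semicolon-separated list of scores, in that order; please check this against your own file, since the layout may differ between versions. The function name `parse_bat_line` and the sample line are made up for illustration.

```python
def parse_bat_line(line):
    """Split one data line into (bin name, classification, [(taxid, score), ...])."""
    fields = line.rstrip("\n").split("\t")
    bin_name, classification = fields[0], fields[1]
    # Unclassified bins may lack the lineage and score columns.
    lineage = fields[3].split(";") if len(fields) > 3 and fields[3] else []
    scores = fields[4].split(";") if len(fields) > 4 and fields[4] else []
    return bin_name, classification, list(zip(lineage, map(float, scores)))


# Made-up line that only mimics the assumed column layout.
example = "bin_7.fa\ttaxid assigned\tbased on 100/120 ORFs\t1;131567;2\t1.00;1.00;0.97\n"
name, status, levels = parse_bat_line(example)
for taxid, score in levels:
    print(taxid, score)
```

Header lines starting with `#` should be skipped before calling the function, and the taxonomy IDs can then be looked up in the NCBI taxonomy browser to obtain names.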