# Module: Functional Annotation of Metagenomic Reads

Functional annotation of metagenomic data is crucial in characterizing the functional potential of a community. Typically, this is inferred from assembled reads (i.e. contigs/scaffolds) where (nearly) full-length genes are more easily located. However, there are also tools that infer functional profiles of samples directly from the reads. This provides a relatively quick and easy way of describing the metabolic capabilities of the community of interest.

Below, we will see how to functionally annotate clean reads from shotgun metagenomic data using SUPER-FOCUS.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Make sure the tools are already installed (see **Tools Used** below if not).
2. Activate the environment, replacing the environment name as needed.
```bash
conda activate read-func-annot-env
```
3. Open Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
4. To run a cell in this notebook, press Shift+Enter.

---
## Tools Used
1. **SUPER-FOCUS**
2. **PEAR**
3. **Diamond**

To install these tools, follow the list of commands below:

Create an environment and install tools (2) and (3). You can find the `read-func-annot.yaml` file in the same directory as this notebook.
```bash
conda env create -f read-func-annot.yaml
```

Activate the environment.
```bash
conda activate read-func-annot-env
```

Install tool (1) using `pip`.
```bash
pip install superfocus
```

---
## Starting Files 

1. Clean paired-end reads (FASTQ format; see **Quality Control Module**).

---
## Expected Outputs

1. Count table of functions (rows) versus samples (columns).

---
## Table of Contents
 * [**Prepare Paired-End Data**](#Prepare-Paired-End-Data)
     * [Merge Reads](#Merge-Reads)
     * [Concatenate Reads](#Concatenate-Reads)
 * [**Functional Annotation**](#Functional-Annotation)
     * [Setup Database](#Setup-Database)
     * [Annotate](#Annotate)

----
# <font color = 'gray'>Prepare Paired-End Data</font>

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
This step is only applicable if you are working with paired-end data.
</div>

## Merge Reads

The functional annotation tool that we will be using (`superfocus`) cannot handle paired-end (PE) data directly. Hence, before running `superfocus`, we first merge the PE reads. Merging also yields longer sequences, which can give slightly better annotation resolution.

| option/input | description |
| :-: | :- |
| `-f` | Forward reads. |
| `-r` | Reverse reads. |
| `-o` | Prefix of output files. |

In [None]:
!pear \
    -f PE_1.fastq \
    -r PE_2.fastq \
    -o pear_out

This will produce four outputs:
1. `pear_out.assembled.fastq` - PE reads successfully merged.
2. `pear_out.discarded.fastq` - discarded reads.
3. `pear_out.unassembled.forward.fastq` - forward reads that could not be merged.
4. `pear_out.unassembled.reverse.fastq` - reverse reads that could not be merged.
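
To gauge how well the merging worked, it can help to count the reads in each PEAR output. Below is a minimal Python sketch, assuming uncompressed FASTQ files with exactly four lines per record. The helper names `count_fastq_reads` and `merge_rate` are illustrative and not part of PEAR, and the rate ignores discarded reads.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file.

    Assumption: every record spans exactly four lines (no wrapped sequences).
    """
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def merge_rate(prefix="pear_out"):
    """Fraction of retained read pairs that PEAR merged (discarded reads ignored)."""
    assembled = count_fastq_reads(f"{prefix}.assembled.fastq")
    unassembled = count_fastq_reads(f"{prefix}.unassembled.forward.fastq")
    total = assembled + unassembled
    return assembled / total if total else 0.0
```

Calling `merge_rate("pear_out")` in a notebook cell after the PEAR step returns a value between 0 and 1. A low value may suggest that the read pairs overlap poorly.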

## Concatenate Reads

Now, we concatenate the merged and unmerged reads into one file (`concat_reads.fastq`), which will then be used as input to `superfocus`.

In [None]:
!cat \
    pear_out.assembled.fastq \
    pear_out.unassembled.forward.fastq \
    pear_out.unassembled.reverse.fastq > concat_reads.fastq

----
# <font color = 'gray'>Functional Annotation</font>

## Setup Database

If this is your first time running `superfocus`, you first need to set up the reference database for functional annotation. You can find the list of available databases [here](https://github.com/metageni/SUPER-FOCUS?tab=readme-ov-file#recomendations). Each aligner requires its own database. Below, we will be using `diamond`, so make sure to select the database that corresponds to the version of `diamond` installed in your environment (run `diamond help` to check). For demonstration, we will use the 90% cluster database, but the 100% cluster database should (in theory) provide the best sensitivity.

You can either manually download the database from the link provided, or run the `wget` command below (assuming you have `diamond` version 2).

In [None]:
!wget "https://open.flinders.edu.au/ndownloader/files/44075207" -O 90_clusters.db.dmnd.zip

Afterwards, unzip the database into the directory where `superfocus` will look for it.

In [None]:
%%bash
# Adjust python3.8 below to match the Python version in your environment.
DB_DIR=$(which superfocus | sed -e 's#bin/superfocus$#lib/python3.8/site-packages/superfocus_app/db/static/diamond#')
mkdir -p "$DB_DIR" &&
unzip -d "$DB_DIR" 90_clusters.db.dmnd.zip
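
The path in the cell above hard-codes `python3.8`, so it may not match your environment. As a rough alternative, the Python sketch below asks the interpreter where the `superfocus_app` package is installed and derives the database directory from that. It assumes the `db/static/<aligner>` layout implied by the path above and must be run with the same environment (kernel) in which `superfocus` was installed. `superfocus_db_dir` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def superfocus_db_dir(aligner="diamond", package="superfocus_app"):
    """Return <package dir>/db/static/<aligner>, or None if the package is absent.

    Assumption: the database layout matches the path used in the bash cell.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "db" / "static" / aligner


# Prints None when superfocus_app is not importable from this interpreter.
print(superfocus_db_dir())
```

If the printed path differs from the one produced by the `sed` expression, the printed path is more likely the one to use as the `unzip -d` target.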

## Annotate

`superfocus` aligns reads to the reference database using the specified aligner, then classifies the hits using the SEED subsystems functional hierarchy. The hierarchy has four levels: "Subsystem Level 1" is the broadest category (e.g. Amino Acids and Derivatives, Carbohydrates), followed by Subsystem Levels 2 and 3, and "Function" is the most specific level (e.g. Glutamate_racemase_(EC_5.1.1.3), Agmatinase_(EC_3.5.3.11)).

An example of a `superfocus` command is shown below. If you are working with multi-sample data, you can specify the `-q` argument multiple times, once per sample. Just make sure you re-run the steps above ([Merge Reads](#Merge-Reads) and [Concatenate Reads](#Concatenate-Reads)) for each sample, so that every sample has its own pooled read file.

| option/input | description |
| :-: | :- |
| `-q` | Query reads. FASTA or FASTQ format. Can be specified multiple times to accommodate reads from multiple samples. |
| `-dir` | Name of output directory. |
| `-a` | Aligner {options: rapsearch, diamond, blast; default=rapsearch}. |
| `-db` | Database {options: DB_90, DB_95, DB_98, DB_100; default=DB_90}. |

In [None]:
!superfocus \
    -q concat_reads.fastq \
    -dir superfocus_out \
    -a diamond \
    -db DB_90
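
For multi-sample data, typing one `-q` per sample by hand becomes error-prone. The following Python sketch assembles the same command from a list of read files, using only the options listed in the table above. The function name `build_superfocus_command` and the sample file names are hypothetical.

```python
import shlex


def build_superfocus_command(queries, out_dir="superfocus_out",
                             aligner="diamond", database="DB_90"):
    """Assemble a superfocus argument list with one -q per input file."""
    if not queries:
        raise ValueError("at least one query file is required")
    command = ["superfocus"]
    for query in queries:
        command += ["-q", str(query)]
    command += ["-dir", out_dir, "-a", aligner, "-db", database]
    return command


# Hypothetical sample names; substitute your own concatenated read files.
samples = ["sampleA", "sampleB"]
command = build_superfocus_command([f"{name}_concat_reads.fastq" for name in samples])
print(shlex.join(command))
```

The printed string can be pasted into a `!` cell, or the list can be passed to `subprocess.run(command, check=True)` to launch the run from Python.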

The command will produce several outputs inside the `superfocus_out` directory, namely:
    
1. `concat_reads.fastq_alignments.m8` - alignment of reads and reference.
2. `output_all_levels_and_function.xls` - count table showing full SEED subsystems hierarchy.
3. `output_subsystem_level_1.xls` - count table showing Subsystem Level 1 only.
4. `output_subsystem_level_2.xls` - count table showing Subsystem Level 2 only.
5. `output_subsystem_level_3.xls` - count table showing Subsystem Level 3 only.
6. `output_binning.xls` - summary of per-sample and per-read SEED assignments and alignments.

If you intend to perform statistical analysis afterwards, output (2) can be used as input to the STAMP software. For more details about this tool, see the following link: [STAMP software](https://beikolab.cs.dal.ca/software/STAMP).
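
To inspect a count table directly in the notebook, a sketch along the following lines may be enough. It assumes that the `.xls` outputs are tab-delimited text files, possibly preceded by a few single-field run-information lines; please verify this against your own files before relying on it. `read_count_table` is an illustrative helper, not part of `superfocus`.

```python
import csv


def read_count_table(path):
    """Read a tab-delimited count table into (header, rows).

    Rows with fewer than two fields (blank lines, single-field preamble)
    are skipped; the first multi-field row is taken as the header.
    """
    header, rows = None, []
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if len(fields) < 2:
                continue  # blank line or single-field preamble
            if header is None:
                header = fields
            else:
                rows.append(fields)
    if header is None:
        raise ValueError(f"{path}: no tabular rows found")
    return header, rows
```

For example, `header, rows = read_count_table("superfocus_out/output_subsystem_level_1.xls")` would return the column names and one list of strings per Subsystem Level 1 category, under the assumptions stated above.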