<h1>Functional Prediction From Marker Genes With PICRUSt2</h1>

This notebook outlines the steps involved in running the PICRUSt2 functional prediction pipeline.

This workflow is based mainly on two references: the <a href='https://github.com/picrust/picrust2/wiki'>PICRUSt2 GitHub Wiki</a> and the <a href='https://github.com/LangilleLab/microbiome_helper/wiki/CBW-2021-PICRUSt2-Tutorial'>LangilleLab SOP</a>.

## How to Use This Notebook
1. Activate the PICRUSt2 conda environment. If your environment has a different name, replace `picrust2` with that name.
>`conda activate picrust2`
2. Start Jupyter Notebook and open this notebook.
>`jupyter notebook`
3. To run a cell, press Shift+Enter.

## File Source
<i>Raes, E. J., Karsh, K., Sow, S. L., Ostrowski, M., Brown, M. V., van de Kamp, J., ... & Waite, A. M. (2021). Metabolic pathways inferred from a bacterial marker gene illuminate ecological changes across South Pacific frontal boundaries. Nature communications, 12(1), 1-12.</i>

Data repository:
https://zenodo.org/record/4567694#.YhnG6ehBxPZ

## Starting Files
1. This Jupyter notebook
2. Representative sequences
3. Feature table
4. The <b>resources</b> folder, which contains all input and output files
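
Before running the pipeline, it can help to confirm that the representative sequences and the feature table refer to the same IDs. The block below is a minimal sketch of such a check, not part of PICRUSt2. It assumes a FASTA file whose IDs are the text after `>` up to the first whitespace, and a tab-separated feature table with one header row and the IDs in the first column. The helper names and the tiny in-memory inputs are made up for illustration.

```python
import csv
import io


def fasta_ids(handle):
    """Collect sequence IDs (text after '>' up to the first whitespace)."""
    return {line[1:].split()[0] for line in handle if line.startswith(">")}


def table_ids(handle):
    """Collect row IDs from a TSV feature table, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # header: ID column name followed by the sample names
    return {row[0] for row in reader if row}


# Made-up stand-ins for rep_seqs.fa and feature_table.tsv (assumed layout).
fasta = io.StringIO(">ASV_a\nACGT\n>ASV_b\nGGCC\n")
table = io.StringIO("#OTU ID\tSample1\tSample2\nASV_a\t10\t0\nASV_c\t3\t5\n")

seq_ids, tab_ids = fasta_ids(fasta), table_ids(table)
print("only in FASTA:", sorted(seq_ids - tab_ids))
print("only in table:", sorted(tab_ids - seq_ids))
```

To check real files, pass `open("rep_seqs.fa")` and `open("feature_table.tsv")` in place of the `io.StringIO` objects, after confirming that the files follow the layout assumed here.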

---
## Table of Contents
 * [**Step 1: Phylogenetic placement**](#Step-1:-Phylogenetic-placement)  
 * [**Step 2: Predicting Abundances of EC Gene Families**](#Step-2:-Predicting-Abundances-of-EC-Gene-Families)  
 * [**Step 3: Predicting Number of 16S rRNA genes per representative sequence**](#Step-3:-Predicting-Number-of-16S-rRNA-genes-per-representative-sequence)
 * [**Step 4: Predict metagenome of the samples**](#Step-4:-Predict-metagenome-of-the-samples)
 * [**Step 5: Infer pathways**](#Step-5:-Infer-pathways)  
 * [**Step 6: Add descriptions**](#Step-6:-Add-descriptions)  
---

In [None]:
!mkdir -p picrust2_out

## Single-command Run
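Run the entire PICRUSt2 pipeline with a single command. `[other options]` is a placeholder: before running the cell, replace it with any additional flags you need, or delete that line together with the trailing backslash on the line above it.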

In [None]:
!picrust2_pipeline.py \
    -s rep_seqs.fa \
    -i feature_table.tsv \
    -o output_dir \
    [other options]

## Step-by-step Run
Run the PICRUSt2 pipeline step by step, using a separate script for each stage.

# <font color = 'gray'>Step 1: Phylogenetic placement</font>

This step aligns our representative sequences (OTUs or ASVs) to the sequences in the reference database using HMMER. It then runs a phylogenetic placement tool to place the query sequences on the reference phylogenetic tree. The default tool is EPA-NG; the cell below selects SEPP instead through `-t sepp`.

If you want to use custom reference files, please refer to this <a href='https://github.com/picrust/picrust2/wiki/Sequence-placement'>page</a>.

In [None]:
!place_seqs.py \
    -s rep_seqs.fa \
    -o picrust2_out/out_tree.tre \
    -p 1 \
    -t sepp \
    --verbose

# <font color = 'gray'>Step 2: Predicting Abundances of EC Gene Families</font>

This step predicts the abundance of each gene family, identified by its Enzyme Commission (EC) number, in the predicted genome of each representative sequence. The `-i` option selects the pre-calculated count table to predict from. Besides 'EC', the available options are '16S', 'COG', 'KO', 'PFAM', 'TIGRFAM', and 'PHENO'.

In [None]:
!hsp.py \
    -i EC \
    -t picrust2_out/out_tree.tre \
    -o picrust2_out/out_EC_prediction.tsv \
    --verbose

Let's take a peek at the output (`out_EC_prediction.tsv`)

In [None]:
!head picrust2_out/out_EC_prediction.tsv

# <font color = 'gray'>Step 3: Predicting Number of 16S rRNA genes per representative sequence</font>

This step calculates the nearest-sequenced taxon index (NSTI) and predicts the number of 16S genes per predicted genome.

The NSTI quantifies how close each query sequence is to its nearest 16S sequence in the reference database. Small NSTI values yield more accurate predictions, while larger NSTI values result in less accurate predictions. By default, sequences with an NSTI greater than 2 are omitted from the predictions.

The 16S gene counts per predicted genome are used to normalize the abundance of each predicted gene family.

In [None]:
!hsp.py \
    -i 16S \
    -t picrust2_out/out_tree.tre \
    -o picrust2_out/out_16S_NSTI_calc.tsv \
    --calculate_NSTI \
    --verbose

Let's take a peek at the output (`out_16S_NSTI_calc.tsv`)

In [None]:
!head -n 35 picrust2_out/out_16S_NSTI_calc.tsv
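
The default NSTI cutoff described above can be previewed with a few lines of Python. This is only a sketch: it assumes the table has `sequence`, `16S_rRNA_Count`, and `metadata_NSTI` columns (check the header printed by the cell above), and the three example rows are made up rather than taken from the dataset.

```python
import csv
import io

# Made-up excerpt in the layout assumed for out_16S_NSTI_calc.tsv.
# The column names are an assumption; check them against the real header.
example_tsv = (
    "sequence\t16S_rRNA_Count\tmetadata_NSTI\n"
    "ASV_a\t1\t0.05\n"
    "ASV_b\t4\t2.70\n"
    "ASV_c\t2\t0.31\n"
)


def split_by_nsti(handle, max_nsti=2.0, nsti_col="metadata_NSTI"):
    """Return (kept, dropped) rows according to the NSTI cutoff."""
    kept, dropped = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Rows at or below the cutoff are kept; rows above it are dropped.
        (kept if float(row[nsti_col]) <= max_nsti else dropped).append(row)
    return kept, dropped


kept, dropped = split_by_nsti(io.StringIO(example_tsv))
print("kept:", [r["sequence"] for r in kept])
print("dropped:", [r["sequence"] for r in dropped])
```

To inspect the real table, pass `open("picrust2_out/out_16S_NSTI_calc.tsv")` instead of the `io.StringIO` object. The sketch only reports which sequences fall above the cutoff; it does not change any PICRUSt2 output.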

# <font color = 'gray'>Step 4: Predict metagenome of the samples</font>

From the predicted genome of each representative sequence, we can now predict the metagenome of each sample. A genome can carry multiple copies of the 16S marker gene, so the 16S copy numbers calculated in Step 3 are used to correct the sequence abundances. This gives a better estimate of the abundance of each gene family in the metagenome.

<h4>Detailed explanation of the normalization:</h4>

Consider the table below.

| Representative Sequences in Sample 1 | Abundance of OTU | Number of 16S Genes per Genome | Number of EC 1.1.1.1 per Genome |  Number of EC 1.1.1.2 per Genome |
| --- | --- | --- | --- | --- |
| OTU1 | 20 | 2 | 5 | 3 |
| OTU2 | 35 | 5 | 3 | 6 |
| OTU3 | 16 | 4 | 4 | 10 |

Dividing the abundance of each OTU by the predicted number of 16S genes per genome (generated by Step 3) gives the expected number of genomes for each representative sequence.

| Representative Sequences in Sample 1 | Abundance of OTU | Number of 16S Genes per Genome | Expected Number of Genomes | Number of EC 1.1.1.1 per Genome | Number of EC 1.1.1.2 per Genome |
| --- | --- | --- | --- | --- | --- |
| OTU1 | 20 | 2 | 20/2 = 10 | 5 | 3 |
| OTU2 | 35 | 5 | 35/5 = 7 | 3 | 6 |
| OTU3 | 16 | 4 | 16/4 = 4 | 4 | 10 |

Finally, we multiply the expected number of genomes by the number of gene copies per genome to get the total count of each gene family, which gives the functional profile of our metagenome.

| Representative Sequences in Sample 1 | Abundance of OTU | Number of 16S Genes per Genome | Expected Number of Genomes | Number of EC 1.1.1.1 per Genome | Number of EC 1.1.1.2 per Genome | Total Number of EC 1.1.1.1 | Total Number of EC 1.1.1.2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OTU1 | 20 | 2 | 20/2 = 10 | 5 | 3 | 5x10 = 50 | 3x10 = 30 |
| OTU2 | 35 | 5 | 35/5 = 7 | 3 | 6 | 3x7 = 21 | 6x7 = 42 |
| OTU3 | 16 | 4 | 16/4 = 4 | 4 | 10 | 4x4 = 16 | 10x4 = 40 |

And so, Sample 1 is predicted to have 87 (50+21+16) copies of the EC 1.1.1.1 gene family and 112 (30+42+40) copies of the EC 1.1.1.2 gene family.
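
The normalization can also be recomputed in a few lines of Python, which is a convenient way to check the arithmetic. This is a minimal sketch of the calculation as described in this section, not the code of `metagenome_pipeline.py`. The input values are copied from the first table, and the dictionary keys and function name are illustrative.

```python
# Inputs copied from the Sample 1 table: OTU abundance, predicted 16S copies
# per genome, and predicted copies of each EC gene family per genome.
otus = {
    "OTU1": {"abundance": 20, "n_16S": 2, "EC:1.1.1.1": 5, "EC:1.1.1.2": 3},
    "OTU2": {"abundance": 35, "n_16S": 5, "EC:1.1.1.1": 3, "EC:1.1.1.2": 6},
    "OTU3": {"abundance": 16, "n_16S": 4, "EC:1.1.1.1": 4, "EC:1.1.1.2": 10},
}


def predict_sample_profile(otus, functions):
    """Sum (abundance / 16S copies) * gene copies over all OTUs."""
    totals = {f: 0.0 for f in functions}
    for row in otus.values():
        # Expected number of genomes contributed by this OTU.
        genomes = row["abundance"] / row["n_16S"]
        for f in functions:
            totals[f] += genomes * row[f]
    return totals


profile = predict_sample_profile(otus, ["EC:1.1.1.1", "EC:1.1.1.2"])
for function, total in profile.items():
    print(function, total)
```

Each OTU contributes its expected genome count multiplied by the per-genome copy number, so an OTU with many 16S copies is weighted down relative to its raw read abundance.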

In [None]:
!metagenome_pipeline.py \
    -i feature_table.tsv \
    -m picrust2_out/out_16S_NSTI_calc.tsv \
    -f picrust2_out/out_EC_prediction.tsv \
    -o picrust2_out/out_EC_metagenome \
    --strat_out

<b>Output Files:</b>
1. `seqtab_norm.tsv.gz` - Feature table of the representative sequences, normalized by the predicted number of 16S genes
2. `weighted_nsti.tsv.gz` - Table containing the weighted NSTI value for each sample
3. `pred_metagenome_unstrat.tsv.gz` - Abundance of each gene family per sample
4. `pred_metagenome_contrib.tsv.gz` - Contribution of each representative sequence to each gene family in each sample (the stratified output requested with `--strat_out`), similar to the per-OTU rows in the tables above

Let's extract the output files.

In [None]:
!gunzip picrust2_out/out_EC_metagenome/*

# <font color = 'gray'>Step 5: Infer pathways</font>

PICRUSt2 regroups the predicted EC numbers into MetaCyc reactions, uses MinPath to infer which MetaCyc pathways are present from those reactions, and then calculates the abundance of each pathway.

In [None]:
!pathway_pipeline.py \
    -i picrust2_out/out_EC_metagenome/pred_metagenome_contrib.tsv \
    -o picrust2_out/out_pathways

<b>Output Files:</b>
1. `path_abun_contrib.tsv.gz` - Contribution of each representative sequence to each pathway in each sample, analogous to `pred_metagenome_contrib.tsv.gz`
2. `path_abun_unstrat.tsv.gz` - Abundance of each pathway per sample

In [None]:
!gunzip picrust2_out/out_pathways/*

# <font color = 'gray'>Step 6: Add descriptions</font>

By default, PICRUSt2 output tables list only functional IDs. Adding a description for each ID makes the tables more informative and saves us from looking up every ID in its source database.
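
Conceptually, adding descriptions is a lookup of each functional ID in a mapping table. The sketch below illustrates that idea only and is not the implementation of `add_descriptions.py`. The two EC names are the standard enzyme names, while the abundance values, the `EC:9.9.9.9` ID, the `not_found` placeholder, and the position of the `description` column are assumptions made for illustration.

```python
# Conceptual sketch only; this is not add_descriptions.py itself.
descriptions = {
    "EC:1.1.1.1": "Alcohol dehydrogenase",
    "EC:1.1.1.2": "Alcohol dehydrogenase (NADP(+))",
}

# Hypothetical unstratified table: function ID followed by one value per sample.
header = ["function", "Sample1", "Sample2"]
rows = [
    ["EC:1.1.1.1", 12.5, 3.0],
    ["EC:1.1.1.2", 0.0, 7.25],
    ["EC:9.9.9.9", 1.0, 1.0],  # made-up ID with no description available
]


def add_descriptions(header, rows, descriptions, missing="not_found"):
    """Insert a description column right after the function ID column."""
    new_header = [header[0], "description"] + header[1:]
    new_rows = [[r[0], descriptions.get(r[0], missing)] + r[1:] for r in rows]
    return new_header, new_rows


new_header, new_rows = add_descriptions(header, rows, descriptions)
print("\t".join(new_header))
for row in new_rows:
    print("\t".join(str(value) for value in row))
```

IDs without a known description keep their abundance values and receive a placeholder, so no rows are lost in the lookup.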

In [None]:
!add_descriptions.py \
    -i picrust2_out/out_EC_metagenome/pred_metagenome_unstrat.tsv \
    -m EC \
    -o picrust2_out/out_EC_metagenome/pred_metagenome_unstrat_desc.tsv

In [None]:
!add_descriptions.py \
    -i picrust2_out/out_pathways/path_abun_unstrat.tsv \
    -m METACYC \
    -o picrust2_out/out_pathways/path_abun_unstrat_desc.tsv