![Biofilm image](../images/Biofilm_Website_2.png)

# Submodule #3: Biomarker Discovery

## Overview

Gene prediction and functional annotation of the microbial community are critical steps in the biofilm metagenomics workflow. Functional annotation of shotgun metagenomic data has become an increasingly popular method for identifying the aggregate functional capacities encoded by a biofilm community. This analysis relies on comparing predicted genes with existing, previously annotated sequences; here, the predictions are derived from 16S rRNA samples. Functional profiling provides insight into what functions a given biofilm community carries out.

## Learning Objectives:
At the completion of this module, the learner will be able to:
- Discover biomarkers in a microbiome
- Run metagenomics marker gene discovery tools
- Predict and evaluate the resulting gene, protein, and pathway biomarkers using the following tools:
    - PICRUSt2
    - Qiime2-PICRUSt2 plugin

## Get Started
### Step 4 - Biomarker Discovery (PICRUSt2, q2-PICRUSt2):
The primary tool for functional annotation of metagenomic data is PICRUSt2. It can be run as a standalone tool, as a Qiime2 plugin, or through the MicrobiomeAnalystR wrapper workflow. This submodule walks through the standalone tool and the Qiime2 plugin.

### Install PICRUSt2

In [None]:
%%capture
%%bash

wget https://github.com/picrust/picrust2/archive/v2.5.1.tar.gz
tar xvzf  v2.5.1.tar.gz
rm v2.5.1.tar.gz
mamba env create -f picrust2-2.5.1/picrust2-env.yaml
mamba run -n picrust2 pip install --editable picrust2-2.5.1/

### Biomarker Analysis with PICRUSt2 as a standalone tool (duration ~10 mins)

PICRUSt2 uses machine learning to predict functional abundance and capabilities within microbial communities from 16S rRNA marker genes. We will first run PICRUSt2 as a standalone tool. To start, we define the locations of the PICRUSt2 inputs and outputs as environment variables so that the PICRUSt2 scripts can find them automatically. This is good practice: it makes file locations easy to track and avoids retyping common paths.

### Assign File Paths as ENV Variables

In [None]:
%env PICRUST_IN=qiime2_analysis/qiime2_Output/rep-seqs-unzipped/data/dna-sequences.fasta
%env BIOM=qiime2_analysis/qiime2_Output/table-unzipped/data/feature-table.biom
%env PICRUST_OUT=BioMarker_Discovery/picrust2_output

You will notice that our fasta and biom files are both outputs of the DADA2 denoising analysis from submodule 2. To break it down:
- The FeatureData is our **fasta** file (also written as **fna**); it contains the **amplicon sequence variants (ASVs)** of 16S rRNA reads and the IDs found across the human samples.
- The FeatureTable is our **biom** file; it contains the ASV IDs and the number of times each ASV was found per sample.
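
To make the two inputs concrete, the sketch below parses a tiny FASTA string and checks it against a matching count table. The IDs, sequences, and counts are invented for illustration. Real biom files are usually stored as HDF5 and need a dedicated reader, so a plain dictionary stands in for the feature table here.

```python
# Toy illustration of the two PICRUSt2 inputs (all IDs, sequences, and counts are invented).

def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the ID is the first word after ">".
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap, so append to the current record.
            records[current_id] += line
    return records

toy_fasta = """>asv_1
ACGTACGTAC
GTTAGC
>asv_2
TTGACCGTAA
"""

# Stand-in for the feature table: ASV ID -> {sample: read count}
toy_table = {
    "asv_1": {"sample_A": 120, "sample_B": 0},
    "asv_2": {"sample_A": 30, "sample_B": 75},
}

asvs = parse_fasta(toy_fasta)
# Every ASV ID in the table should have a sequence in the FASTA file.
missing = [asv_id for asv_id in toy_table if asv_id not in asvs]
print(f"{len(asvs)} ASVs parsed, {len(missing)} table IDs without a sequence")
```

The ID check mirrors what the pipeline relies on: the sequence file and the feature table must refer to the same ASV IDs.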

Next we will assign an environment variable with the number of available cores on this VM. Since the number of cores will change with each machine type, it is important to capture this with a variable rather than pass a hard-coded integer as an argument for each multi-threaded step.

In [None]:
#define number of cores to use.
numthreads=!nproc
numthreadsint = int(numthreads[0])
%env CORES = $numthreadsint
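
Outside a notebook, where the `!nproc` and `%env` shortcuts are not available, the same idea can be sketched in plain Python. This is a minimal sketch, not part of the original workflow; `detect_cores` is a hypothetical helper name.

```python
import os

def detect_cores(default=1):
    """Return the number of logical CPUs, falling back to `default` if unknown."""
    return os.cpu_count() or default

cores = detect_cores()
os.environ["CORES"] = str(cores)  # environment values must be strings
print(f"CORES={os.environ['CORES']}")
```

Note that `os.cpu_count()` reports all logical CPUs on the machine, so it may differ from `nproc` when the process is restricted to a subset of CPUs.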

### Run the PICRUSt2 Pipeline

The commands below will do two things. Let's discuss each step as we run them:
1. place_seqs.py inserts our ASV reads into a reference tree based on the Integrated Microbial Genomes database. This produces the out.tre file, which is the input for the next command.
2. hsp.py predicts the copy number of gene families for each ASV. The script is run twice: once to predict the 16S rRNA marker gene copy number (along with the NSTI values, via the -n flag) and once to predict the copy number of each Enzyme Commission (EC) number.

In [None]:
%%bash
source activate picrust2

python picrust2-2.5.1/scripts/place_seqs.py -s ${PICRUST_IN} -o ${PICRUST_OUT}/out.tre -p ${CORES} --intermediate ${PICRUST_OUT}/intermediate/place_seqs
python picrust2-2.5.1/scripts/hsp.py -i 16S -t ${PICRUST_OUT}/out.tre -o ${PICRUST_OUT}/marker_predicted_and_nsti.tsv.gz -p ${CORES} -n
python picrust2-2.5.1/scripts/hsp.py -i EC -t ${PICRUST_OUT}/out.tre -o ${PICRUST_OUT}/EC_predicted.tsv.gz -p ${CORES}

3. metagenome_pipeline.py builds on the hsp.py predictions: it predicts gene family abundances per sample by weighting each ASV's predicted gene copy numbers by that ASV's abundance in the community.
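
The weighting can be illustrated with a toy calculation. The sketch below assumes that each ASV count is first divided by its predicted 16S copy number and then multiplied by the predicted EC copy numbers. All numbers are invented, and `predict_metagenome` is a hypothetical helper, not the PICRUSt2 implementation.

```python
# Toy sketch of abundance-weighted gene family prediction (all numbers are invented).

asv_counts = {            # reads per ASV in each sample (feature table)
    "asv_1": {"sample_A": 120, "sample_B": 0},
    "asv_2": {"sample_A": 30, "sample_B": 60},
}
marker_copies = {"asv_1": 4, "asv_2": 2}   # predicted 16S copies per genome
ec_copies = {                               # predicted EC copies per genome
    "asv_1": {"EC:1.1.1.1": 2, "EC:2.7.7.7": 1},
    "asv_2": {"EC:1.1.1.1": 0, "EC:2.7.7.7": 3},
}

def predict_metagenome(asv_counts, marker_copies, ec_copies):
    """Sum per-ASV EC copies weighted by copy-number-normalized ASV abundance."""
    result = {}
    for asv, per_sample in asv_counts.items():
        for sample, count in per_sample.items():
            # Dividing by 16S copies approximates the number of organisms.
            normalized = count / marker_copies[asv]
            sample_totals = result.setdefault(sample, {})
            for ec, copies in ec_copies[asv].items():
                sample_totals[ec] = sample_totals.get(ec, 0.0) + normalized * copies
    return result

metagenome = predict_metagenome(asv_counts, marker_copies, ec_copies)
for sample in sorted(metagenome):
    print(sample, metagenome[sample])
```

In sample_A, for example, asv_1 contributes 120 / 4 = 30 normalized units, which yields 60 copies of the first EC number and 30 of the second before asv_2 is added.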

In [None]:
%%bash
source activate picrust2

python picrust2-2.5.1/scripts/metagenome_pipeline.py -i ${BIOM} -m ${PICRUST_OUT}/marker_predicted_and_nsti.tsv.gz -f ${PICRUST_OUT}/EC_predicted.tsv.gz -o ${PICRUST_OUT}/EC_metagenome_out --strat_out

The output of metagenome_pipeline.py should report that some sequences are above the max NSTI cut-off of 2.0. The **nearest-sequenced taxon index (NSTI)** is the branch length between each ASV and the nearest 16S reference sequence. The lower the NSTI value, the closer the relationship between the ASV and the corresponding reference 16S sequence. Anything above 2 is considered noise and is not used in the analysis. Here, 11 out of 751 ASVs had an NSTI value above the cut-off, so they were removed to avoid skewing the downstream analysis.
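
The NSTI cut-off amounts to a simple filter, sketched below on invented rows. The column names `sequence`, `16S_rRNA_Count`, and `metadata_NSTI` are assumed to match the marker prediction table written by hsp.py; check them against your own file before reusing this.

```python
import csv
import io

MAX_NSTI = 2.0  # same cut-off as in the text

# Invented rows; the column names are an assumption about the hsp.py marker output.
toy_tsv = (
    "sequence\t16S_rRNA_Count\tmetadata_NSTI\n"
    "asv_1\t4\t0.03\n"
    "asv_2\t2\t0.45\n"
    "asv_3\t1\t2.71\n"
)

def split_by_nsti(tsv_text, max_nsti=MAX_NSTI):
    """Return (kept_ids, removed_ids) based on the NSTI cut-off."""
    kept, removed = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if float(row["metadata_NSTI"]) > max_nsti:
            removed.append(row["sequence"])
        else:
            kept.append(row["sequence"])
    return kept, removed

kept, removed = split_by_nsti(toy_tsv)
print(f"{len(removed)} of {len(kept) + len(removed)} ASVs were above the NSTI cut-off")
```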

4. convert_table.py converts the contribution table, which links the functional and taxonomic data, into the legacy format.
5. pathway_pipeline.py predicts pathway-level abundances from the EC number abundances generated in step 3, using the MetaCyc pathway database to determine which pathways are associated with these ASVs.
6. add_descriptions.py adds a description of each functional ID to the gene family and pathway abundance tables.

In [None]:
%%bash
source activate picrust2

python picrust2-2.5.1/scripts/convert_table.py ${PICRUST_OUT}/EC_metagenome_out/pred_metagenome_contrib.tsv.gz -c contrib_to_legacy -o ${PICRUST_OUT}/EC_metagenome_out/pred_metagenome_contrib.legacy.tsv.gz
python picrust2-2.5.1/scripts/pathway_pipeline.py -i ${PICRUST_OUT}/EC_metagenome_out/pred_metagenome_contrib.tsv.gz -o ${PICRUST_OUT}/pathways_out -p ${CORES}
python picrust2-2.5.1/scripts/add_descriptions.py -i ${PICRUST_OUT}/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o ${PICRUST_OUT}/EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
python picrust2-2.5.1/scripts/add_descriptions.py -i ${PICRUST_OUT}/pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o ${PICRUST_OUT}/pathways_out/path_abun_unstrat_descrip.tsv.gz

Finally, unzip the .gz output files.

In [None]:
# Postprocess Data
! gunzip -k ${PICRUST_OUT}/*.gz
! gunzip -k ${PICRUST_OUT}/EC_metagenome_out/*.gz
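
As an alternative to unzipping, gzip-compressed tables can be read directly in Python. The sketch below writes a small invented pathway table to a temporary file and ranks pathways by total abundance. The `pathway` and `description` column names are assumed to match the described pathway table, and the pathway IDs are made up.

```python
import csv
import gzip
import tempfile
from pathlib import Path

def read_tsv_gz(path):
    """Read a gzip-compressed, tab-separated table into a list of dicts."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

# Write a small invented pathway table so the example runs anywhere.
toy_rows = (
    "pathway\tdescription\tsample_A\tsample_B\n"
    "PWY-0001\thypothetical pathway one\t10.5\t2.0\n"
    "PWY-0002\thypothetical pathway two\t1.0\t30.0\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_path_abun.tsv.gz"
    with gzip.open(toy_path, mode="wt") as handle:
        handle.write(toy_rows)
    rows = read_tsv_gz(toy_path)

# Rank pathways by total predicted abundance across samples.
sample_columns = [name for name in rows[0] if name not in ("pathway", "description")]
totals = {row["pathway"]: sum(float(row[s]) for s in sample_columns) for row in rows}
ranked = sorted(totals, key=totals.get, reverse=True)
print(ranked)
```

Reading the compressed file in place avoids keeping two copies of every table on disk.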

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b> Unfortunately, PICRUSt2 does not let you overwrite output files. If you would like to rerun this analysis, first delete the output folder with the command:
    
    rm -r BioMarker_Discovery/picrust2_output
    
The PICRUSt2 script will make your output directory automatically.
</div>
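
The same cleanup can be done from Python, which makes it easy to skip the step when the folder does not exist yet. This is a minimal sketch; `reset_output_dir` is a hypothetical helper, and the demonstration runs on a throwaway directory rather than on real results.

```python
import shutil
import tempfile
from pathlib import Path

def reset_output_dir(path):
    """Delete `path` if it exists; return True when something was removed."""
    path = Path(path)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False

# Demonstration on a throwaway directory rather than real results.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_out = Path(tmp_dir) / "picrust2_output"
    (demo_out / "intermediate").mkdir(parents=True)
    first = reset_output_dir(demo_out)    # directory existed, so it is removed
    second = reset_output_dir(demo_out)   # nothing left to remove
    still_there = demo_out.exists()
    print(first, second, still_there)
```

In the notebook, the call would be `reset_output_dir("BioMarker_Discovery/picrust2_output")` before rerunning the pipeline.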

### Biomarker Analysis with PICRUSt2 as a Qiime2 plugin (duration ~ 30 mins)

### Install q2-picrust2

In [None]:
%%capture
! mamba create  -n qiime2 -c https://packages.qiime2.org/qiime2/2022.11/passed/core/ -c conda-forge -c bioconda qiime2-core -y
! mamba install -n qiime2  q2-picrust2 -c conda-forge -c bioconda -c picrust -y

Now that we understand each step of the PICRUSt2 pipeline, we can bridge our Qiime2 and PICRUSt2 analyses via Qiime2's PICRUSt2 plugin (q2-picrust2). This plugin allows the user to run PICRUSt2 as part of a larger Qiime2 workflow without installing the two separately. We have to re-define the environment variables since we are in a different kernel.

### Assign File Paths as ENV Variables

In [None]:
%env Q2_PI_IN=qiime2_analysis/qiime2_Output/rep-seqs.qza
%env Q2_META=Core_Dataset_Prep/sample-metadata.tsv
%env Q2_BIOM=qiime2_analysis/qiime2_Output/table.qza
%env Q2_PI_OUT=BioMarker_Discovery/q2-picrust2_output

In [None]:
#define number of cores to use.
numthreads=!nproc
numthreadsint = int(numthreads[0])
%env CORES = $numthreadsint

### Run the Qiime2-PICRUSt2 Pipeline

This process is largely the same as the standalone PICRUSt2 steps above, with a few additions:
1. **picrust2 full-pipeline** runs the full PICRUSt2 pipeline with one command.
2. **feature-table summarize** summarizes the results from step 1 and creates visuals, histograms, and statistics on how many sequences are associated with each sample and feature.
3. **diversity core-metrics** rarefies the feature table to the given sampling depth and computes non-phylogenetic diversity metrics from it.

In [None]:
%%bash
source activate qiime2

qiime picrust2 full-pipeline --i-table "${Q2_BIOM}" --i-seq "${Q2_PI_IN}" --output-dir "${Q2_PI_OUT}" --p-placement-tool epa-ng --p-threads ${CORES} --p-hsp-method pic --p-max-nsti 2 --verbose
qiime feature-table summarize --i-table "${Q2_PI_OUT}/pathway_abundance.qza" --o-visualization "${Q2_PI_OUT}/pathway_abundance.qzv"
qiime diversity core-metrics --i-table "${Q2_PI_OUT}/pathway_abundance.qza" --p-sampling-depth 226702 --m-metadata-file "${Q2_META}" --output-dir "${Q2_PI_OUT}/pathabun_core_metrics_out" --p-n-jobs 1
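
Bray-Curtis dissimilarity is one example of a non-phylogenetic metric of the kind computed in this step. The sketch below works it out by hand for two invented samples that are assumed to be already rarefied to the same depth; the pathway IDs and abundances are made up.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance dicts keyed by feature."""
    features = set(a) | set(b)
    # Abundance shared by both samples, feature by feature.
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in features)
    total = sum(a.values()) + sum(b.values())
    return 1 - (2 * shared) / total

# Invented pathway abundances, each summing to a depth of 100.
sample_A = {"PWY-0001": 60, "PWY-0002": 40}
sample_B = {"PWY-0001": 20, "PWY-0002": 50, "PWY-0003": 30}

print(round(bray_curtis(sample_A, sample_B), 3))
```

The two samples share 20 + 40 = 60 units out of a combined 200, so the dissimilarity is 1 - 120/200 = 0.4. Identical samples give 0 and samples with no features in common give 1.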

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b> Unfortunately, the Qiime2-PICRUSt2 plugin does not let you overwrite output files. If you would like to rerun this analysis, first run the following command:
    
    rm -r BioMarker_Discovery/q2-picrust2_output
    
The plug-in will make your output directory automatically.
</div>

### Postprocess Data

The **qiime tools export** command extracts the underlying data, such as feature tables, from qza or qzv files. **biom convert** converts the exported biom tables into other formats such as tsv, the format that standalone PICRUSt2 typically produces. This is useful for the next submodule, where one of our PICRUSt2 outputs is used to query the UniProt database.

In [None]:
%%bash
source activate qiime2

# Export Abundance
qiime tools export --input-path "${Q2_PI_OUT}/pathway_abundance.qza" --output-path "${Q2_PI_OUT}/pathabun_exported"
biom convert -i "${Q2_PI_OUT}/pathabun_exported/feature-table.biom" -o "${Q2_PI_OUT}/pathabun_exported/feature-table.biom.tsv" --to-tsv
qiime tools export --input-path "${Q2_PI_OUT}/pathway_abundance.qzv" --output-path "${Q2_PI_OUT}/pathabun_qzv_exported"

# Export EC Metagenome
qiime tools export --input-path "${Q2_PI_OUT}/ec_metagenome.qza" --output-path "${Q2_PI_OUT}/ec_metagenome_exported"
qiime feature-table summarize --i-table "${Q2_PI_OUT}/ec_metagenome.qza" --o-visualization "${Q2_PI_OUT}/ec_metagenome.qzv"
biom convert -i "${Q2_PI_OUT}/ec_metagenome_exported/feature-table.biom" -o "${Q2_PI_OUT}/ec_metagenome_exported/feature-table.biom.tsv" --to-tsv
qiime tools export --input-path "${Q2_PI_OUT}/ec_metagenome.qzv" --output-path "${Q2_PI_OUT}/ec_metagenome_qzv_exported"

# Export KEGG Orthologs (KO) Metagenome
qiime tools export --input-path "${Q2_PI_OUT}/ko_metagenome.qza" --output-path "${Q2_PI_OUT}/ko_metagenome_exported"
qiime feature-table summarize --i-table "${Q2_PI_OUT}/ko_metagenome.qza" --o-visualization "${Q2_PI_OUT}/ko_metagenome.qzv"
biom convert -i "${Q2_PI_OUT}/ko_metagenome_exported/feature-table.biom" -o "${Q2_PI_OUT}/ko_metagenome_exported/feature-table.biom.tsv" --to-tsv
qiime tools export --input-path "${Q2_PI_OUT}/ko_metagenome.qzv" --output-path "${Q2_PI_OUT}/ko_metagenome_qzv_exported"
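
The exported tsv files can then be loaded with a few lines of Python. The sketch below assumes the usual layout written by `biom convert --to-tsv`: a comment line followed by a header row that starts with `#OTU ID`. The feature IDs and values are invented, and `read_biom_tsv` is a hypothetical helper.

```python
import csv
import io

# Invented content mimicking the layout that `biom convert --to-tsv` writes.
toy_export = (
    "# Constructed from biom file\n"
    "#OTU ID\tsample_A\tsample_B\n"
    "feature_1\t12.0\t0.0\n"
    "feature_2\t3.5\t8.0\n"
)

def read_biom_tsv(text):
    """Return {feature_id: {sample: abundance}} from biom-style TSV text."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Skip comment lines, but keep the header row that starts with "#OTU ID".
    body = [
        line for line in lines
        if not line.startswith("#") or line.startswith("#OTU ID")
    ]
    reader = csv.reader(io.StringIO("\n".join(body)), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    return {row[0]: dict(zip(samples, map(float, row[1:]))) for row in reader}

table = read_biom_tsv(toy_export)
print(table["feature_2"])
```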

In [None]:
#run the following command to take the quiz!
from IPython.display import IFrame
IFrame("../Quiz/QS14.html", width=800, height=350)


## Conclusion

In this submodule you learned how to extract microbiome biomarkers using several computational tools and a pre-trained machine learning model. You learned how to use the Qiime2 output to predict relevant proteins and pathways from a 16S dataset with PICRUSt2's pre-trained machine learning model.

## Clean up

Remember to stop your notebook instance when you are done!