
CBW 2018 PICRUSt2 Tutorial

Gavin Douglas edited this page Oct 19, 2018 · 70 revisions

PLEASE NOTE THAT THIS TUTORIAL IS NOT FOR THE LATEST VERSION OF PICRUST2. You can find links to the latest workflow here: https://github.com/picrust/picrust2/wiki

This tutorial is for Module 3 of the Canadian Bioinformatics Workshop running from June 5-7 2018 in Toronto.

Authors: Gavin Douglas and Morgan Langille


Introduction

PICRUSt2 is a tool that predicts the abundance of gene families and higher-level pathways present in a microbial community based on amplicon sequencing data. This tutorial will demonstrate each major step in the PICRUSt2 pipeline. Throughout the tutorial there will be questions that are meant to aid understanding. The answers are on this page, which you might want to open in a different tab.

For this lab we will be processing 16S sequencing data from ileum and rectum biopsy samples from the Inflammatory Bowel Disease Multi'omics Database. Although shotgun metagenomic sequencing is becoming cheaper, it typically isn't done with biopsy samples since it requires an enormous amount of sequencing (due to the large amount of host DNA!). 16S sequencing doesn't suffer from this issue since only the 16S gene is amplified, so there is typically no host contamination.

This tutorial was written for PICRUSt2 v2.0.0-b.2. The input files described below can be downloaded here, but they are already present on the student server!

Teaching objectives

  1. Run and understand major steps of PICRUSt2 workflow.
  2. Understand how to assess the prediction quality based on NSTI values.
  3. Understand what files PICRUSt2 outputs and what the abundances in each one refer to.


Activating the conda environment

PICRUSt2 is installed on the student server in a conda environment. Conda is a tool for managing software environments, which is helpful since different tools sometimes require different versions of packages as dependencies. It also allows users to easily install software with a simple command. The environment we'll be using for this tutorial is called picrust2-dev.

A different version of Python is installed in this environment. Before activating this environment you can check where the global version of Python is installed and its version.

This command will identify where Python is installed.

which python

And this command will let you know its version:

python --version

You can activate this environment with this command:

source /usr/local/miniconda3/bin/activate picrust2-dev

Compare the location and version of Python in this environment to what was outputted above.

Note that although conda is often used for managing Python versions it can be used to manage any environment. For example, PICRUSt2 is dependent on certain R packages and a different version of R is installed in this environment.

Question 1: What version of R is installed in the picrust2-dev environment and where is the R script located (note that it's capitalized unlike python)?

When you're finished this tutorial you can deactivate the environment with this command:

source /usr/local/miniconda3/bin/deactivate

Exploring input files

Two input files are required to run PICRUSt2: (1) a FASTA file of each study sequence and (2) the abundance of each study sequence in each sample. The workflow will work with any sequences that can be reliably placed in the reference tree (which is based on 16S rRNA genes [16S] by default), but typically users will input either the representative sequences of OTUs or amplicon sequence variants (ASVs).
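As a toy illustration of the two required inputs (hypothetical IDs and counts, not the actual tutorial files), the commands below build a tiny FASTA and abundance table and cross-check that the sequence IDs in the FASTA match the row IDs in the table:

```shell
# Toy versions of the two required input files (made-up IDs and counts)
printf '>asv1\nACGT\n>asv2\nGGCC\n' > toy_ASVs.fna
printf 'ASV_ID\tS1\tS2\nasv1\t5\t0\nasv2\t2\t9\n' > toy_abun.tsv

# Sanity check: FASTA headers (minus '>') should match the table's first column
grep '^>' toy_ASVs.fna | sed 's/^>//' | sort > fasta_ids.txt
tail -n +2 toy_abun.tsv | cut -f 1 | sort > table_ids.txt
diff fasta_ids.txt table_ids.txt && echo "IDs match"
```

If the IDs disagree, diff will print the mismatching lines, which is a quick way to catch inconsistencies between your FASTA and abundance table before running the pipeline.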

These two files are in the /home/ubuntu/CourseData/metagenomics/Module3/input_files folder:

  • ASV_abun.tsv
  • ASVs.fna

We will make a new directory in our workspace for this tutorial and then copy the input_files to this new directory:

mkdir ~/workspace/Module3
cd ~/workspace/Module3
cp -r /home/ubuntu/CourseData/metagenomics/Module3/input_files ./

The ASV_abun.tsv file represents each sample as a column (sample IDs are in the first row), each ASV as a row (ASV IDs are in the first column), and each number in the table as the count of an ASV in a sample.

View the ASV_abun.tsv file with the less command and the option -S which turns off line wrapping (use the left and right arrow keys to scroll to see the entire file):

less -S input_files/ASV_abun.tsv

There is also a table of sample metadata in this folder named picrust2_lab_metadata.tsv, which you can take a peek at; we won't use it unless you get to the optional section on plotting the weighted NSTI values.

You can also see how many lines there are in the sequence table with this command:

wc -l input_files/ASV_abun.tsv

There are 100 sequences in this table (the first line is the header, so it isn't counted!)

Question 2: How many samples are in this dataset?

Question 3: As a sanity check, how could you check how many sequences are in the FASTA file input_files/ASVs.fna?

Read placement

The first step of the workflow involves placing the study sequences into the pre-existing phylogenetic tree, which is based on the 16S sequences from the reference genomes, with the script place_seqs.py. As with all of the scripts in this workflow, you can see the arguments and options for this tool by typing place_seqs.py --help. The default reference tree is based on full-length 16S sequences parsed from genomes in the Integrated Microbial Genomes database. The multiple-sequence alignment (MSA) of these reference 16S sequences is also required as input since it is needed for the first step of the placement pipeline (aligning study sequences against the reference MSA, done by PaPaRa).

Question 4: Based on the output of place_seqs.py --help where are the default reference files and how many sequences are in the reference MSA?

place_seqs.py -s input_files/ASVs.fna -o tutorial.tre --threads 4 --intermediate place_seqs_working

These are the inputs to this tool (besides the options for changing the default reference files):

  • -s FASTA: your study sequences (i.e. FASTA of amplicon sequence variants or operational taxonomic units)
  • -o TREEFILE: Output tree with placed study sequences.
  • --threads INT: Number of threads to use.
  • --intermediate: Option to specify a folder where intermediate files will be written (otherwise they will not be kept).
  • --print_cmds: Option to specify that wrapped commands will be printed to screen (useful for troubleshooting!).

The output of this tool is the tree tutorial.tre. Take a look at this file with less. The tree is in newick format, which captures how individual tips on the tree are related topologically and also the branch-length distances between them. Importantly, place_seqs.py takes the most likely placement for each input sequence, but that doesn't mean it's correct!
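Newick format can be illustrated with a toy three-tip tree (made up for illustration, not part of the tutorial output). Each tip label is followed by a colon and its branch length, and nesting of parentheses encodes the topology:

```shell
# A toy Newick tree: tips A, B, C; A and B form a clade whose
# internal branch has length 0.05
echo '((A:0.1,B:0.2):0.05,C:0.3);' > toy.tre

# Each tip label is directly followed by ':', so counting labels
# followed by a colon gives the number of tips
grep -Eo '[A-Za-z0-9_]+:' toy.tre | wc -l   # prints 3
```

The internal node in this toy tree is unlabeled (only ")" precedes its colon), which is why the pattern counts tips only.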

Note that we set the option --intermediate place_seqs_working above. In this folder you can find the intermediate output files from the placement pipeline. One particularly useful file is the JPLACE file (place_seqs_working/epa_out/epa_result.jplace). This file contains information on where each sequence was placed in the reference tree: often there will be multiple candidate placement positions. The JPLACE file is in JSON format and contains the reference tree followed by each placed sequence. The name of each placed sequence is in the field starting with "n" and the placement locations are in the preceding "p" field.
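A heavily simplified, hypothetical JPLACE-like snippet (made-up sequence names and numbers) shows the "n" and "p" fields described above, and how you might pull them out with grep:

```shell
# Toy JPLACE-like file: asv_1 has two candidate positions, asv_2 has one
cat > toy.jplace <<'EOF'
{"placements":
  [{"p": [[0, -100.5, 0.9], [3, -102.1, 0.1]], "n": ["asv_1"]},
   {"p": [[7, -98.2, 1.0]], "n": ["asv_2"]}],
 "tree": "((A:0.1,B:0.2):0.05,C:0.3);"}
EOF

# List the placed sequence names (the "n" fields)
grep -o '"n": \[[^]]*\]' toy.jplace

# Count candidate placement positions across all sequences
# (each inner "p" entry starts with '[' followed by an edge number)
grep -o '\[[0-9]' toy.jplace | wc -l   # prints 3
```

Real JPLACE files contain extra fields (likelihoods, like-weight ratios, metadata), but the overall structure, with "p" listing candidate positions before "n", is the same.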

Visualizing this output can be very informative. If you have time you can optionally view the placements using the interactive tree of life (iTol) website as described below.

Hidden-state prediction

Now that we have the study sequences placed in the reference tree we can proceed to the heart of the PICRUSt2 pipeline: hidden-state prediction (HSP). For each placed study sequence the abundances of gene families of interest will be predicted, which below will be Enzyme Classification (EC) numbers.

We will also predict how many 16S copies are expected to be in the genome corresponding to each study sequence. Predicting how many 16S copies there are is important since this will be used in the next step to normalize the abundance of each sequence. We will also calculate the nearest-sequenced taxon index (NSTI), which is a measure of how distant each study sequence is from the nearest reference sequence in the tree.

This command will run HSP using maximum parsimony to predict the number of 16S gene copies per predicted genome and will calculate NSTI values for each study sequence:

hsp.py -i 16S -t tutorial.tre -o 16S_predicted -m mp -n --seed 163

Similarly, this command will generate predictions of how many copies of each gene family are found in each predicted genome. We won't bother calculating NSTI values again since they would be the same as above. Note that this command will take a lot longer (~5-10 minutes) since predictions are being made for thousands of gene families instead of just one as above. While the below command is running you can get started on the R commands to explore the 16S predictions file.

hsp.py -i EC -t tutorial.tre -o EC_predicted -p 4 -m mp --seed 163

The input arguments and options are:

  • -t TREEFILE: Newick tree with study sequences placed amongst reference sequences.
  • -i TRAIT_OPTION: Which default pre-calculated count table to use (one of '16S', 'COG', 'EC', 'KO', 'PFAM', 'TIGRFAM')
  • -o PREFIX: Prefix for output files: RDS (R object containing state probabilities), predicted counts, and optionally a table of CIs.
  • --observed_trait_table TRAIT_COUNTFILE: Trait file to use if a non-default file is needed (most users should use one of the default trait options above).
  • -m METHOD: Hidden-state prediction method to use.
  • -p INT: Number of processes to run in parallel.
  • --chunk_size INT: Number of gene families to read in for a given processor.
  • -n: Indicates that Nearest-sequenced taxon index (NSTI) values should be calculated.
  • -c: Specifies that pseudo-confidence intervals will be output.

Question 5: Note that a random seed (--seed 163) was specified for each above command. Why is this important to set?

It's a good idea to check whether any of your input sequences have very high NSTI values. It's possible that sequences with high NSTI values (e.g. > 1) could be uncharacterized taxa, but often they are simply garbage sequences.

To take a look at the NSTI values open RStudio in your browser and run these commands in R:

hsp_16S_nsti <- read.table("/path/to/16S_predicted.tsv", header=T, sep="\t", row.names=1)
hist(hsp_16S_nsti$metadata_NSTI, breaks=100, col="grey", main="", xlab="Nearest-sequenced taxon index", ylim=c(0,40))

You should get a histogram of the NSTI values across sequences.

The majority of sequences have reasonably low NSTI values (< 0.3). There is one outlier with a value >0.3. We could identify which sequence this is with these commands:

max_nsti_val <- max(hsp_16S_nsti$metadata_NSTI)

hsp_16S_nsti[which(hsp_16S_nsti$metadata_NSTI == max_nsti_val), ]

The above commands should have returned:

# X16S_rRNA_Count metadata_NSTI
#70334369bb15777be95200a4edaa90be               1      0.338791

Which identifies sequence 70334369bb15777be95200a4edaa90be as having the highest NSTI value of 0.338791.

Question 6: Using a similar approach as above how could you figure out what the lowest NSTI value is?

Question 7: Either within RStudio or by simply viewing the 16S_predicted.tsv file with less, which sequence is predicted to have the most copies of the 16S gene? How many copies does it have?

Metagenome pipeline

Now that we have the predicted genomes for each study sequence we can predict the metagenomes for each sample. The basic approach here is to multiply the predicted number of genes by the abundance of each corresponding "genome" (i.e. the abundance of the study sequence). However, before predicting the metagenomes the sequence abundances are normalized by the number of marker gene copies in each predicted genome (i.e. how many 16S copies they have).

metagenome_pipeline.py -i input_files/ASV_abun.tsv -f EC_predicted.tsv -m 16S_predicted.tsv -p 4

The input arguments/options are:

  • -i STUDY.tsv - Table of ASV abundances across all samples (either TSV or BIOM format).
  • -m MARKER_PREDICTED.txt - Output predicted 16S copy numbers (or other marker) for all study sequences.
  • -f FUNC_PREDICTED.txt - Output predicted functional abundances for all study sequences.
  • -p INT - Number of processes to run in parallel to speed up execution.
  • --max_nsti INT - Max NSTI value per study sequence (sequences with values above this cut-off will be removed). Default: 2.
  • -o OUTPUT_DIR - Output directory.
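The normalization and multiplication performed at this step can be sketched with made-up numbers: suppose an ASV has 12 reads in a sample, its predicted genome has 4 copies of the 16S gene, and 2 copies of some gene family. Its normalized abundance is 12/4 = 3, so it contributes 3 × 2 = 6 to that gene family's abundance in the predicted metagenome:

```shell
# Toy arithmetic (hypothetical values, not from the tutorial dataset)
awk 'BEGIN {
  reads = 12; marker_copies = 4; gene_copies = 2
  norm = reads / marker_copies        # 16S-normalized abundance of the ASV
  contrib = norm * gene_copies        # its contribution to the metagenome
  printf "normalized=%g contribution=%g\n", norm, contrib
}'
# prints: normalized=3 contribution=6
```

The real pipeline repeats this calculation for every ASV, gene family, and sample, then sums the per-ASV contributions to get the unstratified table.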

The above command outputs the predicted metagenomes to the folder metagenome_out by default. There are four output files in this directory:

  • pred_metagenome_strat.tsv: predicted metagenomes stratified by which sequence is contributing each function.
  • pred_metagenome_unstrat.tsv: predicted metagenomes over all sequences.
  • seqtab_norm.tsv: sequence table normalized by predicted number of marker genes.
  • weighted_nsti.tsv: weighted NSTI values per sample (over all sequences weighted by the abundance in that sample).
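The weighted NSTI calculation can be sketched on made-up values: it is the abundance-weighted mean of the per-sequence NSTI values within a sample, i.e. sum(abundance × NSTI) / sum(abundance):

```shell
# Toy per-ASV abundances and NSTI values for one sample (hypothetical)
printf 'asv1\t10\t0.02\nasv2\t30\t0.10\nasv3\t60\t0.01\n' > toy_nsti.tsv

# Weighted NSTI = sum(abundance_i * NSTI_i) / sum(abundance_i)
awk -F'\t' '{ num += $2 * $3; den += $2 } END { printf "%.4f\n", num / den }' toy_nsti.tsv
# prints: 0.0380
```

Note how the abundant low-NSTI asv3 pulls the weighted value well below the simple mean of the three NSTIs, which is the point of weighting by abundance.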

Take a look at the normalized sequence abundance table with less. The abundances in this file are the number of reads divided by the predicted number of marker genes.

Question 8: What is the abundance of sequence 132b62bcbd3f280208e314eeab566a4d in sample HMP.2.206659 in this normalized table? Given that the unnormalized abundance was 11 how many predicted 16S sequences must there have been?

The output weighted NSTI values are also useful to explore. You can check out the optional section at the bottom of this page if you have time!

The key output files are the stratified and unstratified metagenome predictions. The difference is that the stratified table (metagenome_out/pred_metagenome_strat.tsv) contains an additional column called "sequence", which breaks down the number of genes contributed by each individual ASV. Both of these predicted metagenome files contain the abundances of gene families in each sample.

Note that these values are derived from the input read depth; therefore, the magnitude of the values will depend on the sequencing depth per sample. Also, note that the sum of gene family abundances per sample (i.e. column sums) will not be equal.

Question 9: Why would the column sums not be equal even if the input sequence abundance table was rarefied so that all samples had the same depth?

Pathway inference

The final step in the PICRUSt2 pipeline is to infer pathway abundances based on the presence of gene families. This step is done by wrapping MinPath, which determines the minimum number of pathways that can be explained given the presence of gene families.

You can run MinPath over all samples with this command:

run_minpath.py -i metagenome_out/pred_metagenome_strat.tsv -m /usr/local/picrust2/MinPath/ec2metacyc_picrust_prokaryotic.txt -p 4 --intermediate minpath_intermediate -o minpath_out

The inputs to this tool are:

  • -i metagenome_out/pred_metagenome_strat.tsv - Stratified output of metagenome_pipeline.py step.
  • -m MINPATH_MAPFILE - path to mapfile of gene families to pathways of interest.
  • -o OUT_PREFIX - Prefix for the two output files.
  • --intermediate - Optional folder where intermediate files will be written.
  • -p INT - Number of processes to run simultaneously.

Once again both stratified and unstratified output files are created: minpath_out_strat_path.tsv and minpath_out_unstrat_path.tsv respectively.

The abundance of pathways in these files is based on the mean abundance of the required gene families.
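As a toy illustration of this mean (hypothetical abundances): if a pathway requires three EC numbers whose abundances in a sample are 4, 10, and 1, its reported abundance would be their mean, 5:

```shell
# Pathway abundance as the mean of its required gene-family abundances
# (made-up numbers for three hypothetical EC families)
awk 'BEGIN { split("4 10 1", a, " ")
             for (i = 1; i <= 3; i++) sum += a[i]
             printf "%.2f\n", sum / 3 }'
# prints: 5.00
```

This averaging only happens for pathways that MinPath has already deemed present; gene families mapping to absent pathways don't contribute.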

The mapping between pathways and the underlying gene families was given as an input file to the above command (/usr/local/picrust2/MinPath/ec2metacyc_picrust_prokaryotic.txt). The intermediate MinPath output files were saved in minpath_intermediate. There is a lot of information in this folder that won't be needed by most users, but it's useful to have these raw files in case you want more information on how MinPath predicted the presence of a pathway in a sample.

Note that the stratified abundances sum to the unstratified abundances, which are the overall abundance of that pathway in the sample (NOTE: this is a different approach than what is output by HUMAnN2, which will be discussed in a future module!).
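This relationship can be checked on a toy stratified table (hypothetical pathway, sequence, and abundance values): summing the stratified rows within each pathway recovers the unstratified abundances:

```shell
# Toy stratified table: pathway, contributing sequence, abundance
printf 'pwy1\tasv1\t2.5\npwy1\tasv2\t1.5\npwy2\tasv1\t4.0\n' > toy_strat.tsv

# Sum stratified abundances per pathway -> unstratified abundances
awk -F'\t' '{ sum[$1] += $3 } END { for (p in sum) printf "%s\t%g\n", p, sum[p] }' toy_strat.tsv | sort
# prints:
# pwy1	4
# pwy2	4
```

The same awk pattern (grouping on the pathway column and summing the abundance column) works as a sanity check on the real minpath_out_strat_path.tsv output.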

Question 10: Given the above, which one of the two following statements is correct?

1. The stratified pathway abundances represent the abundances expressed by each predicted genome independently.
2. The stratified pathway abundances represent the abundances of the community-wide pathway levels contributed by an individual predicted genome.

Explore JPLACE file with iTOL (optional)

If there is time remaining you can try reading in the JPLACE file containing placed reads into the interactive Tree of Life website.

First, go to this website: https://itol.embl.de/

Then click on "Upload" at the top of the page.

Choose the JPLACE file written to the intermediate folder produced at the sequence placement step: place_seqs_working/epa_out/epa_result.jplace. This file contains the possible placement positions for each ASV on the reference tree along with each placement's likelihood.

After a minute or so you should see a circular tree! Any long branches you see are outlier reference 16S sequences that will likely be pruned out in a later release of PICRUSt2.

  • Under the "Basic" tab change the "Branch lengths" option to "Ignore".
  • Click on the switch beside "Phylogenetic Placements"

You should now see red dots where the placements are located within the tree. We hope to improve this functionality in the future (e.g. add taxonomic labels on the tips, colour clades, etc.).

  • To search for individual placements, click on the "show query form" button within the "Datasets" tab. A search box opens up on the left which can be used to search by ASV ids.

Question 11: Search for the ASV d3d5bc15a5f947217d626ad4a99c5757 - how many possible placements are there for this sequence?

Explore weighted NSTI values (optional)

If you still have time you can explore the weighted NSTI values per sample in R. These metrics are useful measures of how close the ASVs in each sample are to reference 16S sequences on average.

The weighted NSTIs are output automatically at the metagenome_pipeline.py step above to this file: metagenome_out/weighted_nsti.tsv.

You can read this file and the sample metadata into R with these commands run in RStudio (you will need to change the paths to point to your working directory, which you can identify by typing pwd on the command line):

weighted_nsti <- read.table("/path/to/metagenome_out/weighted_nsti.tsv", header=T, sep="\t", row.names=1)
in_meta <- read.table("/path/to/input_files/picrust2_lab_metadata.tsv", header=T, sep="\t")

R provides straightforward commands for getting basic descriptions of your data. For instance, you can get summary statistics with this command:

summary(weighted_nsti$weighted_NSTI)

The max weighted NSTI value per sample is ~0.05 in this dataset, which is quite low (anything less than 0.06 is considered low).

Question 12: Why would it be surprising to see high weighted NSTI values in this dataset?

Let's take a look at the metadata file we read in. The two key columns are biopsy_location (either Rectum or Ileum) and diagnosis (either CD or nonIBD). Note that the biopsies were taken from the same individuals, so the ileum and rectum samples aren't independent of each other. There could be potential biases if ASVs with higher NSTI values were concentrated in samples from one particular biopsy location or disease state, so it's a good idea to check whether the weighted NSTI values differ depending on sample features and conditions of interest. We'll do so by visualizing the weighted NSTI values per sample across both of these key variables.

To begin we'll need to combine the metadata and weighted NSTI values into the same dataframe. These commands will combine the two dataframes and remove unnecessary columns:

rownames(in_meta) <- in_meta$External.ID
combined_tab <- in_meta[rownames(weighted_nsti), ]
combined_tab$weighted_NSTI <- weighted_nsti$weighted_NSTI
combined_tab <- combined_tab[, -c(1,2)]

To make quick boxplots of how weighted NSTI values differ depending on the conditions described above you can use this command:

boxplot(weighted_NSTI ~ ., data=combined_tab, ylim=c(0, 0.055), ylab="Weighted NSTI")

Based on these plots it looks like there could be a difference in weighted NSTI values depending on disease state (this isn't significant with the tutorial dataset, but is significant with the full set of samples)! This isn't necessarily surprising since an increase in the relative abundance of Proteobacteria and other less common gut bacteria is a key signature of Crohn's disease. These microbes may have slightly higher NSTI values on average, which could make the PICRUSt2 predictions slightly less accurate. In this example the difference is quite subtle and unlikely to have much of an effect, but hopefully this hammers home that this is an important bias to check for in your own data!
