Amplicon Sequencing Data Analysis with QIIME 2

Christian Diener, Gibbons Lab

from the 2022 ISB Virtual Microbiome Series

CC-BY-SA gibbons.isbscience.org gibbons-lab @thaasophobia

Hold your horses 🐴

Let's get the slides first (use your computer, phone, TV, fridge, anything with a 16:9 screen)

https://gibbons-lab.github.io/isb_course_2022/16S

Organization of the course

Note:

this allows asynchronous work / different timezones
questions on Slack not on Zoom please

Setup

💻 Let's switch to the notebook and get started

Click me to open the notebook!

Wait... what?

All output we generate can be found in the treasure_chest folder at

https://github.com/gibbons-lab/isb_course_2022

or materials/treasure_chest in the Colaboratory notebook.

The human gut microbiome 🦠

38 trillion bacterial cells (~1/2 pound) vs 30 trillion human cells 😅
Supplies about 90 percent of the body's serotonin.
The gut microbiome can be the only road to curing C. difficile infections.
Success of PD-1 cancer therapy can be modulated by probiotics.

How do we "see" bacteria?

many bacteria are difficult to culture outside of their resident environment
as a proxy we can study bacteria in diverse environments by their DNA
thus a large part of microbiome research involves analyzing sequencing data

So what can we use to analyze sequencing data?

Photo by Nadine Shaabana

Note:

sequencing/culture-free approaches have allowed us to vastly expand our knowledge about bacteria and their evolution
however, harder to map to phenotypes / ecology
sequencing data needs to be transformed first to be useful
what tools can we use for that?

QIIME

Pronounced like wind chime.

Created ~2010 during the Human Microbiome Project (2007 - 2016) under the leadership of Greg Caporaso and Rob Knight.

What is QIIME?

QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data processing and analysis transparency.

Quantitative Insights into Microbial Ecology

So what is it really?

Essentially, QIIME is a set of commands to transform microbiome data into intermediate outputs and visualizations.

It's commonly used via the command line.

QIIME 2 was introduced in 2016 and improves upon QIIME 1, based on user experiences during the HMP.

Major changes:

integrated tracking of data provenance
semantic type system
extendable plugin system
multiple user interfaces (in progress)

Where to find help?

QIIME 2 comes with a lot of help, including a wide range of tutorials, general documentation and a user forum where you can ask questions.

Artifacts, actions and visualizations

QIIME 2 manages artifacts, which are basically intermediate data that feed into actions to either produce other artifacts or visualizations.

https://docs.qiime2.org/2022.8/tutorials/overview/

Remember

Artifacts often represent intermediate steps, but Visualizations are end points meant for human consumption ☝️.

Artifacts and Visualizations in QIIME 2 are just zip files with annotations and a data folder that contains the actual output data.
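
Because artifacts are zip files, they can be inspected with any zip tool. Here is a minimal sketch using Python's standard `zipfile` module. The archive is a toy built in memory, and its folder and file names are assumptions for illustration, not the contents of a real artifact.

```python
import io
import zipfile

# Hypothetical top-level folder name; toy stand-in, not from a real QIIME 2 run.
ROOT = "00000000-aaaa-bbbb-cccc-000000000000"

# Build a toy archive in memory with an annotation file and a data folder.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(f"{ROOT}/metadata.yaml", "type: FeatureTable[Frequency]\n")
    archive.writestr(f"{ROOT}/data/feature-table.biom", "placeholder")


def list_data_files(zip_source):
    """Return the archive paths that live inside a data folder."""
    with zipfile.ZipFile(zip_source) as archive:
        return [name for name in archive.namelist() if "/data/" in name]


buffer.seek(0)
data_files = list_data_files(buffer)
print(data_files)
```

The same function should work on a path to a `.qza` or `.qzv` file instead of the in-memory buffer.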

What is amplicon sequencing?

Note:

very efficient, every paired read covers the full area of interest
great sensitivity
but not genomics (not even a full gene)

Why the 16S gene?

The 16S gene is universal and contains interspersed conserved regions perfect for PCR priming and hypervariable regions with phylogenetic heterogeneity.

The data set

Photo by Hu Chen.

Note:

the advent of cheap sequencing has generated a lot of publicly available data
however do we really know the human microbiome?

A few countries account for the majority of microbiome data

RJ Abdill et al., https://doi.org/10.1371/journal.pbio.3001536

Note:

not really, many populations are not represented
where the $/institutions/research teams are based
institutionalized biases
heavily skewed towards populations from a few countries
this propagates to reference databases, functional annotations, etc.
see the symposium talks for a much more thorough discussion

Who is studying who?

Hadza in Tanzania - 3 samples
Hunter-gatherer ethnic group living in northern Tanzania. Obtain food exclusively through hunting and foraging edible plants.
Chepang in Nepal - 3 samples
Tibeto-Burman ethnic group living in the Mahabharat mountain range. Have recently transitioned from a semi-nomadic lifestyle to a more settled lifestyle.
Me'Phaa in Mexico - 3 samples
Also known as Tlapanec, live in the mountains of Guerrero state. Subsist primarily upon maize, beans, and chili peppers that they grow themselves.

Photos by Ben Preater, Giuseppe Mondi, Daniel Apodaca.

Note:

variety of cultures and lifestyles
distinct geographic regions

Though all of those manuscripts studied marginalized indigenous communities, none of them was led by members of the community, and the studies were often conducted by foreign institutions. Thus, the fact that indigenous communities were studied does not necessarily mean that the interests of the community were represented.

figure and content courtesy of Emily Wissel
Also check out the Microbes and Social Equity Working group
https://doi.org/10.1128/mSystems.00471-21
https://doi.org/10.1128/msystems.01240-21

What will we do today?

Illumina FastQ files (Basespace)

@SRR2143527.13917 13917 length=251
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAA...
+
BBBBAF?A@D2BEEEGGGFGGGHGGGCFGFHHCFHCEFGGH...
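
Each FastQ record has four lines: a header, the sequence, a separator, and one quality character per base. Here is a minimal sketch of reading one record, assuming the common Phred+33 quality encoding. Only the visible prefix of the truncated record above is used, and the function names are made up for this example.

```python
def parse_fastq_record(lines):
    """Split one four-line FastQ record into read id, sequence and quality."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FastQ record")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode quality characters into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


# Visible prefix of the record shown above (the original lines are truncated).
record = [
    "@SRR2143527.13917 13917 length=251",
    "TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAA",
    "+",
    "BBBBAF?A@D2BEEEGGGFGGGHGGGCFGFHHCFHCEFGGH",
]

read_id, sequence, quality = parse_fastq_record(record)
scores = phred_scores(quality)

# A Phred score Q corresponds to an error probability of 10^(-Q/10).
error_probabilities = [10 ** (-q / 10) for q in scores]
print(read_id.split()[0], scores[:5])
```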

We have our raw sequencing data, but QIIME 2 only operates on artifacts. How do we convert our data into an artifact?

🐥 or 🥚?

Our first QIIME 2 commands

💻 Let's switch to the notebook and get started

Time to bring out the big guns 💣⚡

We will now run the DADA2 plugin, which will do 4 things:

filter and trim the reads
find the most likely original sequences in the sample (ASVs)
remove chimeras
count the abundances

Preprocessing sequencing reads

trim low quality regions
remove reads with low average quality
remove reads with ambiguous bases (Ns)
remove PhiX (bacteriophage genome commonly added as a control to sequencing runs)
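
The first three filters can be sketched in a few lines of Python. This is a simplified illustration, not the DADA2 implementation: the function names and thresholds are made up, quality is assumed to be Phred+33 encoded, and PhiX removal is left out because it needs a reference genome.

```python
def mean_quality(quality, offset=33):
    """Average Phred score of a quality string (assumes Phred+33)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def preprocess(reads, trunc_len, min_mean_quality):
    """Truncate reads to trunc_len and drop reads that fail simple filters."""
    kept = []
    for sequence, quality in reads:
        if len(sequence) < trunc_len:
            continue  # too short to truncate
        sequence, quality = sequence[:trunc_len], quality[:trunc_len]
        if "N" in sequence:
            continue  # ambiguous base
        if mean_quality(quality) < min_mean_quality:
            continue  # low average quality
        kept.append((sequence, quality))
    return kept


# Toy reads: (sequence, quality); "I" decodes to Phred 40 and "#" to Phred 2.
toy_reads = [
    ("ACGTACGTAC", "IIIIIIIIII"),      # passes
    ("ACGTNCGTAC", "IIIIIIIIII"),      # contains an N
    ("ACGTACGTAC", "##########"),      # low quality throughout
    ("ACGT", "IIII"),                  # shorter than trunc_len
    ("ACGTACGTACGG", "IIIIIIII####"),  # passes once the bad tail is cut off
]

filtered = preprocess(toy_reads, trunc_len=8, min_mean_quality=20)
print(len(filtered), "of", len(toy_reads), "reads kept")
```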

Identifying amplicon sequence variants (ASVs)

Expectation-Maximization (EM) algorithm used to build a dataset-specific error model and find true amplicon sequence variants (ASVs), all at once.

PCR chimeras

The primers used in this study were F515/R806. The numbers denote positions along the 16S gene. So, how long is the amplified fragment?
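
A rough answer can be worked out from the primer positions alone. The sketch below ignores primer lengths, and it assumes paired reads of 251 bases (the length of the example read on the FastQ slide) that start at opposite ends of the fragment.

```python
FORWARD_PRIMER_POSITION = 515
REVERSE_PRIMER_POSITION = 806
READ_LENGTH = 251  # taken from the example FastQ header (length=251)

# Approximate fragment length: distance between the two primer positions.
fragment_length = REVERSE_PRIMER_POSITION - FORWARD_PRIMER_POSITION

# Bases covered by both reads of a pair under the assumptions above.
overlap = 2 * READ_LENGTH - fragment_length

print(f"fragment ~{fragment_length} bp, read overlap ~{overlap} bp")
```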

We now have a table containing the counts for each ASV in each sample. We also have a list of ASVs.

🤔 Do you have an idea for what we could do with these two data sets? What quantities might we be interested in?

How do the organisms in our samples relate to one another?

One of the basic things we might want to look at is how the ASVs across all samples are evolutionarily related to one another. That is, we are often interested in their phylogeny.

How to build a phylogenetic tree?

Phylogenetic trees are built from multiple sequence alignments and sequences are arranged by sequence similarity (branch length).

We can visualize this tree with EMPRESS.

First glance at the ASVs

💻 Let's switch to the notebook, look at our data, and build a tree.

Ecological Diversity metrics

In microbial community analysis we are usually interested in two different families of diversity metrics, alpha diversity (ecological diversity within a sample) and beta diversity (ecological differences between samples).

Alpha diversity

How diverse is a single sample?

richness: how many taxa do we observe?
→ total number of observed taxa
evenness: how evenly are abundances distributed across taxa?
→ Evenness index
mixtures: metrics that combine both richness and evenness
→ Shannon index, Simpson's Index
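
The metrics above can be computed directly from the ASV counts of one sample. This is a minimal sketch with made-up counts. It uses the natural logarithm for the Shannon index (tools differ in the base they use) and Pielou's index as one possible evenness index.

```python
import math


def richness(counts):
    """Number of taxa with at least one observation."""
    return sum(1 for count in counts if count > 0)


def shannon(counts):
    """Shannon index: -sum(p * ln(p)) over observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson's index as 1 - sum(p^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(richness)."""
    observed = richness(counts)
    if observed <= 1:
        return 0.0  # convention chosen here for a single-taxon sample
    return shannon(counts) / math.log(observed)


# Two toy samples with the same richness but different evenness.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, richness(counts), round(shannon(counts), 3),
          round(simpson(counts), 3), round(pielou_evenness(counts), 3))
```

Both samples have four taxa, but the skewed sample scores lower on every metric that takes evenness into account.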

Statistical tests for alpha diversity

Alpha diversity will provide a single value/covariate for each sample.

It can be treated as any other sample measurement and is suitable for classic univariate tests (t-test, Mann-Whitney U test).

Beta diversity

How different are two or more samples/donors/sites from one another?

unweighted: how many taxa are shared between samples?
→ Jaccard index, unweighted UniFrac
weighted: do shared taxa have similar abundances?
→ Bray-Curtis distance, weighted UniFrac
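
The two non-phylogenetic metrics can be sketched for a pair of samples as follows. The samples and ASV names are made up, and both samples are assumed to contain at least one observed taxon.

```python
def jaccard_distance(sample_a, sample_b):
    """1 - (shared taxa / taxa present in either sample); ignores abundances."""
    present_a = {taxon for taxon, count in sample_a.items() if count > 0}
    present_b = {taxon for taxon, count in sample_b.items() if count > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


def bray_curtis(sample_a, sample_b):
    """1 - 2 * (sum of shared abundance) / (total abundance of both samples)."""
    taxa = set(sample_a) | set(sample_b)
    shared = sum(min(sample_a.get(t, 0), sample_b.get(t, 0)) for t in taxa)
    return 1 - 2 * shared / (sum(sample_a.values()) + sum(sample_b.values()))


# Toy count tables: taxon -> count.
sample_1 = {"ASV1": 10, "ASV2": 10, "ASV3": 0}
sample_2 = {"ASV1": 10, "ASV2": 0, "ASV3": 10}

print(round(jaccard_distance(sample_1, sample_2), 3))
print(round(bray_curtis(sample_1, sample_2), 3))
```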

UniFrac

Do samples share genetically similar taxa?

Weighted UniFrac further scales phylogenetic branch lengths by abundances.
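
Both variants can be illustrated on a toy tree. This is a sketch under simplifying assumptions: the four-tip tree and the samples are made up, each branch is stored as its length plus the set of tips below it, and the weighted version is the unnormalized form.

```python
# Toy tree ((A:1,B:1):2,(C:1,D:1):2) as (branch length, tips below the branch).
BRANCHES = [
    (1.0, {"A"}), (1.0, {"B"}), (2.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (2.0, {"C", "D"}),
]


def unweighted_unifrac(sample_a, sample_b, branches=BRANCHES):
    """Fraction of observed branch length that leads to only one sample."""
    present_a = {tip for tip, count in sample_a.items() if count > 0}
    present_b = {tip for tip, count in sample_b.items() if count > 0}
    unique = total = 0.0
    for length, tips in branches:
        in_a = bool(tips & present_a)
        in_b = bool(tips & present_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total


def weighted_unifrac(sample_a, sample_b, branches=BRANCHES):
    """Branch lengths scaled by the difference in relative abundance below them."""
    total_a = sum(sample_a.values())
    total_b = sum(sample_b.values())
    distance = 0.0
    for length, tips in branches:
        fraction_a = sum(sample_a.get(tip, 0) for tip in tips) / total_a
        fraction_b = sum(sample_b.get(tip, 0) for tip in tips) / total_b
        distance += length * abs(fraction_a - fraction_b)
    return distance


# Toy samples: tip -> count.
sample_1 = {"A": 5, "B": 5}
sample_2 = {"A": 5, "C": 5}
sample_3 = {"C": 5, "D": 5}

print(round(unweighted_unifrac(sample_1, sample_2), 3))
print(unweighted_unifrac(sample_1, sample_3))
print(weighted_unifrac(sample_1, sample_2))
```

Samples 1 and 3 share no branch of the tree, so their unweighted distance is the maximum possible, while samples 1 and 2 share the branches leading to tip A.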

Principal Coordinate Analysis

Statistical tests for beta diversity

More complicated. Distances between samples are usually not normally distributed and very heterogeneous. PERMANOVA can deal with that.
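
The idea behind PERMANOVA can be sketched with the standard library alone: compute a pseudo-F statistic from the distance matrix, then shuffle the group labels to see how often chance alone gives a value at least as large. The distance matrix, group labels, and function names below are made up for illustration. This is a sketch, not the implementation used by QIIME 2.

```python
import random
from itertools import combinations


def pseudo_f(distances, groups):
    """Pseudo-F: between-group over within-group sums of squared distances."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(distances[i][j] ** 2
                   for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, group in enumerate(groups) if group == label]
        ss_within += sum(distances[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(distances, groups, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(distances, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(distances, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: 8 samples in 2 groups, close within a group and far between groups.
groups = ["A"] * 4 + ["B"] * 4
distances = [
    [0.0 if i == j else (0.1 if groups[i] == groups[j] else 0.9)
     for j in range(len(groups))]
    for i in range(len(groups))
]

observed_f, p_value = permanova(distances, groups)
print(round(observed_f, 1), p_value)
```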

Run the diversity analyses

💻 Let's switch to the notebook and calculate the diversity metrics

What organisms are present in our samples?

We are still just working with sequences and we have no idea what organisms those sequences correspond to.

🤔 What would you do to go from a sequence to an organism's name?

Taxonomic ranks

Even though directly aligning our sequences to a database of known genes seems most intuitive, this does not always work well in practice. Why?

Multinomial Naive Bayes

Instead, use subsequences (k-mers) and their counts to predict the lineage/taxonomy with machine learning methods. For 16S amplicon fragments, this approach often provides better generalization and faster results.
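
The k-mer idea can be sketched as a tiny multinomial Naive Bayes classifier. Everything here is a toy: the reference sequences, the labels `Genus_A` and `Genus_B`, and the class name are made up, and real classifiers are trained on full reference databases with longer k-mers.

```python
import math
from collections import Counter


def kmers(sequence, k):
    """All overlapping subsequences of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts with additive smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha

    def fit(self, sequences, labels):
        self.kmer_counts = {}          # label -> Counter of k-mers
        self.label_counts = Counter(labels)
        vocabulary = set()
        for sequence, label in zip(sequences, labels):
            words = kmers(sequence, self.k)
            self.kmer_counts.setdefault(label, Counter()).update(words)
            vocabulary.update(words)
        self.vocabulary_size = len(vocabulary)
        return self

    def predict(self, sequence):
        n_sequences = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.kmer_counts.items():
            total = sum(counts.values())
            denominator = total + self.alpha * self.vocabulary_size
            # log prior + sum of smoothed log likelihoods of each k-mer
            score = math.log(self.label_counts[label] / n_sequences)
            for word in kmers(sequence, self.k):
                score += math.log((counts[word] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy reference sequences with hypothetical labels.
reference = [
    ("ACGTACGTACGTACGT", "Genus_A"),
    ("ACGTACGAACGTACGT", "Genus_A"),
    ("TTGGCCTTGGCCTTGG", "Genus_B"),
    ("TTGGCCTAGGCCTTGG", "Genus_B"),
]
classifier = KmerNaiveBayes(k=3).fit(
    [seq for seq, _ in reference], [label for _, label in reference]
)

print(classifier.predict("ACGTACGTACGA"))
print(classifier.predict("GGCCTTGGCC"))
```

Neither query matches a reference sequence exactly, but each shares most of its k-mers with one of the two groups, which is what lets this approach generalize beyond exact alignments.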

Let's assign taxonomy to our samples

💻 Let's switch to the notebook and assign taxonomy to our ASVs

Your turn

Is there phylogenetic diversity between taxa from different populations?

And we are done 👏

Christian Diener
Nick Bohmann
Sean Gibbons
Sue Ishaq
Emily Wissel
Alex Carr
Noa Rappaport
Samantha Piekos
James Johnson
Kathryn Stephenson

Dominic Lewis
Allison Kudla
Audri Hubbard
Joe Myxter
Thea Swanson
Victoria Uhl
Connor Kelly
Shanna Braga
ISB Facilities Team

Thanks! ❤️