# Submodule 01: Introduction to 16S rRNA Sequencing and Microbial Community Analysis

# Table of Contents
- [Overview](#overview)
- [Learning Objectives](#learning-objectives)
- [Background](#background)
   - [1. 16S rRNA Sequencing](#16s-rrna-sequencing)
   - [2. Gut Microbiome](#gut-microbiome)
      - [Microbiota: The "Who" in the microbiome](#microbiota-the-who-in-the-microbiome)
      - [Microbiome: What the microbes actually do](#microbiome-what-the-microbes-actually-do)
   - [3. Sequencing Technologies for 16S rRNA](#sequencing-technologies-for-16s-rrna)
   - [4. Taxonomy Hierarchy](#taxonomy-hierarchy)
      - [How does 16S rRNA sequencing use the taxonomy hierarchy?](#how-does-16s-rrna-sequencing-use-the-taxonomy-hierarchy)
   - [5. Microbial Community Profiling and Taxonomic Classification](#microbial-community-profiling-and-taxonomic-classification)
   - [6. Study Design and Hypothesis Development](#study-design-and-hypothesis-development)
      - [Hypothesis and Research Questions](#hypothesis-and-research-questions)
      - [Sample Population](#sample-population)
   - [7. Introduction to Using AWS for Cloud-Based Data Analysis Labs](#introduction-to-using-aws-for-cloud-based-data-analysis-labs)
      - [AWS S3 for Data Storage and Access](#aws-s3-for-data-storage-and-access)
      - [AWS Lambda for Data Processing Automation](#aws-lambda-for-data-processing-automation)
- [Quiz](#quiz)
- [Conclusion](#conclusion)
- [References](#references)

<center>
    <img src="./images/workflow.png" alt="workflow" width="1000"/>
</center>


## Overview <a name="ov"></a>
This module introduces participants to the principles of 16S rRNA sequencing and its applications in microbial community analysis. 16S rRNA gene sequencing is a commonly used method to study the diversity and composition of microbial communities by focusing on a specific region of the ribosomal RNA gene that is present in all bacteria. Understanding how to process and analyze this data is critical in studies related to the human gut microbiome, among others. We will review key concepts such as taxonomic classification, microbial profiling, and sequencing technologies.


## Learning Objectives <a name="lo"></a>
+ Learning Objective 1: Understand the basic principles of 16S rRNA sequencing
+ Learning Objective 2: Gain insights into microbial community profiling and its applications

# Background 
<center>
    <img src="./images/microbiome.jpg" alt="micro"/>
</center>
<!-- image reference: https://www.genome.gov/about-nhgri/Director/genomics-landscape/june-6-2019-Human-Microbiome_Project-->

## 1. 16S rRNA Sequencing
We will analyze microbiota community composition using 16S sequencing. 16S is the ribosomal RNA (rRNA) found in prokaryotes and is distinct from eukaryotic ribosomal RNA, so targeting it efficiently selects for prokaryotes in our samples<sup>1</sup>. The 16S rRNA is part of the small subunit of the prokaryotic ribosome, and the gene that encodes it has many regions that are highly conserved across species, which makes it possible to design primers that bind and amplify the gene in almost all bacterial species. The gene also contains highly variable regions, which allow us to differentiate the taxa present. 16S rRNA analysis estimates the diversity and relative abundance of microbes in a sample. We can then associate differences in diet and lifestyle with microbiome community structure.

<center>
    <img src="./images/16S.png" alt="16S" width="600"/>
</center>

<!-- image reference: https://microbenotes.com/wp-content/uploads/2024/07/16S-rRNA-Gene-Variable-Regions-1907x2048.jpeg-->

The 16S rRNA gene is widespread, has a relatively slow mutation rate in conserved regions, and high variability in other parts of the sequence, which makes it suitable for taxonomic identification. Researchers can sequence the 16S rRNA gene from mixed microbial samples, such as those from soil, water, or the human gut, to identify and quantify the microbes present without needing to culture them in a lab. This gene provides a snapshot of the microbial diversity within a sample, enabling researchers to explore how microbial communities influence human health, environmental processes, and other biological systems. This method has been widely used to investigate the gut microbiome’s relationship with dietary patterns, lifestyle, and various diseases.
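
The contrast between conserved and variable regions can be illustrated with a small sketch. The sequences below are hypothetical, already-aligned toy strings (not real 16S data), and per-column Shannon entropy is just one simple way to score variability. Columns where every sequence agrees score 0, and columns that differ between sequences score higher.

```python
import math
from collections import Counter

# Hypothetical, already-aligned toy sequences (NOT real 16S data).
# Columns 0-5 and 10-15 are identical in every sequence ("conserved");
# columns 6-9 differ between sequences ("variable").
aligned = [
    "ACGTACGGTAACGTAC",
    "ACGTACTTCAACGTAC",
    "ACGTACAAGCACGTAC",
    "ACGTACCCATACGTAC",
]


def column_entropy(column):
    """Shannon entropy (in bits) of the bases observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(sequences):
    """Entropy of every column of a set of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]


profile = entropy_profile(aligned)

# Positions with non-zero entropy are the ones that can tell sequences apart.
variable_positions = [i for i, h in enumerate(profile) if h > 0]
print(variable_positions)
```

In this toy alignment, primers would be designed against the flanking zero-entropy columns, while the high-entropy columns in the middle carry the signal used for identification.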

## 2. Gut Microbiome

The gut microbiome refers to the vast community of microorganisms living in the digestive tracts of humans and other animals, as well as their interactions with that environment. These microbes are affected by their host and, in turn, affect their host<sup>2</sup>.

### Microbiota: The "Who" in the microbiome.

The microbiota are the community of microorganisms making up the microbiome. This includes:
 - *Bacteria*
 - *Archaea*
 - *Fungi*
 - *Protists*
 - *Algae*

### Microbiome: What the microbes actually do.

Microorganisms can affect their environment (or their host) in various ways. The term "theater of activity" refers to what they are doing in that environment. Microbes contain and produce structural elements (proteins, lipids, polysaccharides), nucleic acids (DNA and RNA), and metabolites, through which the microbiome interacts dynamically within itself and with its environment (in this case, you).

These microbes play essential roles in digestion, nutrient absorption, immune system function, and even mental health<sup>2-3</sup>. An imbalance in the gut microbiome, known as dysbiosis, has been linked to a range of health conditions, including inflammatory bowel disease, obesity, diabetes, and even mental health disorders like depression<sup>4</sup>.

<center>
    <img src="./images/Microbiome_SF.png" alt="microbiome_details" width="600"/>
</center>
<!-- image reference: Freese lab slides-->

## 3. Sequencing Technologies for 16S rRNA

Several sequencing technologies are available for studying microbial communities. The most commonly used platforms are **Illumina, PacBio, and Oxford Nanopore Technologies.** These platforms offer high-throughput capabilities that allow researchers to sequence hundreds to thousands of microbial DNA samples simultaneously.

Illumina sequencing is the most widely used technology for 16S rRNA sequencing. It provides short paired-end reads (150–300 bp) and is highly accurate, making it ideal for targeting specific regions of the 16S rRNA gene<sup>5</sup>. The platform is cost-effective and allows for the generation of large amounts of data in a short time.

+ **Drawback:** Illumina typically only sequences the variable regions of the 16S gene, rather than the full-length gene, which can limit resolution at lower taxonomic levels.

PacBio and Oxford Nanopore Technologies offer longer read lengths and allow for full-length 16S rRNA gene sequencing. This can increase the resolution of taxonomic identification and provide more comprehensive data on microbial diversity<sup>6</sup>.

+ **Drawback:** These platforms are generally more expensive and can have higher error rates than Illumina sequencing, though error correction methods have improved this.

The 16S rRNA gene sequencing process involves:

+ DNA extraction from microbial communities in the sample.
+ Amplification of the 16S rRNA gene using polymerase chain reaction (PCR) with primers targeting conserved regions of the gene.
+ Library preparation, including the addition of adapters.
+ Sequencing of the amplified gene region.
+ Bioinformatics analysis to process the sequence data, identify microbial taxa, and quantify their relative abundances.
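
The amplification step above depends on primers that bind conserved regions flanking a variable region. The following is a minimal in-silico sketch of that idea, assuming exact matching and a single strand. The primers and template are made-up toy sequences, not real 16S primers, and the function names are illustrative. Degenerate primer positions are expanded using the standard IUPAC nucleotide codes.

```python
import re

# Standard IUPAC nucleotide codes written as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)


def in_silico_amplicon(template, forward, reverse):
    """Return the region between the two primer sites (primers excluded).

    Both primers are given 5'->3'; the reverse primer binds the opposite
    strand, so its reverse complement is searched for on the template.
    Returns None if either primer site is missing.
    """
    pattern = (
        primer_regex(forward)
        + "(.*?)"
        + primer_regex(reverse_complement(reverse))
    )
    match = re.search(pattern, template)
    return match.group(1) if match else None


# Toy example (hypothetical sequences chosen only for illustration).
template = "GGGACGCTGATATCCGGATTGTCAACCC"
amplicon = in_silico_amplicon(template, forward="ACGYTG", reverse="TTGMCA")
print(amplicon)
```

Real primers tolerate some mismatches and real pipelines work on both strands, so this exact-match version is only meant to show why conserved primer sites make it possible to capture the variable region between them.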


<center>
    <img src="./images/seq_steps.png" alt="sequencing steps" width="500"/>
</center>
<!-- image reference: https://microbenotes.com/amplicon-sequencing/-->




## 4. Taxonomy Hierarchy

Taxonomy is the science of classifying organisms based on shared characteristics. In microbiome studies, taxonomy helps us organize microorganisms (like bacteria) into different classification levels, allowing us to group similar organisms and identify specific types of bacteria within a sample.

The **taxonomy hierarchy** follows a standardized structure that classifies organisms from broad groupings to more specific ones. The main levels of taxonomy used in microbiome studies are:

 - **Kingdom (Domain):** The highest level. In 16S rRNA studies, we are mainly interested in Bacteria, but Archaea can also be included.
 - **Phylum:** Groups of related classes; for example, the phylum Firmicutes includes many bacteria commonly found in the human gut.
 - **Class:** A division within phyla; for example, the class Clostridia falls under Firmicutes.
 - **Order:** A further subdivision, grouping similar families.
 - **Family:** Groups of closely related genera. For example, Lactobacillaceae is a family within the order Lactobacillales.
 - **Genus:** A group of species with shared characteristics; for example, Lactobacillus is a genus within the family Lactobacillaceae.
 - **Species:** The most specific level, identifying individual types of bacteria. For example, Lactobacillus acidophilus is a species within the genus Lactobacillus.

In microbiome analysis, each level gives us a different resolution of the microbial community, from broad to specific<sup>7</sup>. The deeper we go in the hierarchy, the more specific the classification becomes, allowing us to identify bacteria at finer levels. 
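
As a rough illustration of the hierarchy above, taxonomic assignments are often stored as a single delimited lineage string. The sketch below assumes a semicolon-separated format with rank prefixes such as `k__` and `g__`. The exact format differs between reference databases and tools, so treat the format and the function names as assumptions.

```python
# Ranks in order from broadest to most specific, as listed above.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(lineage):
    """Split a lineage string into a {rank: name} dictionary.

    Assumes fields look like 'p__Firmicutes' and are separated by ';'.
    Empty or missing fields are stored as None (unclassified at that rank).
    """
    levels = {}
    for rank, field in zip(RANKS, lineage.split(";")):
        name = field.strip().partition("__")[2]
        levels[rank] = name or None
    for rank in RANKS:
        levels.setdefault(rank, None)
    return levels


def deepest_rank(levels):
    """Name of the most specific rank that received an assignment."""
    assigned = [rank for rank in RANKS if levels[rank]]
    return assigned[-1] if assigned else None


# Hypothetical assignment that stops at genus (species field left empty).
lineage = (
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__"
)
levels = parse_lineage(lineage)
print(levels["genus"], "| classified down to:", deepest_rank(levels))
```

Keeping unassigned ranks as `None` makes it easy to see how far down the hierarchy each sequence could be classified.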

### How does 16S rRNA sequencing use the taxonomy hierarchy?

When we perform 16S rRNA sequencing, we can analyze our data at different levels of this taxonomy hierarchy, depending on our research questions and the primers we select.

1. **Higher levels (Phylum, Class, Order):**

    - These levels provide a broad overview of the bacterial community and help us see general patterns, like shifts in major bacterial groups due to diet, disease, or environment.
    - Useful for general comparisons, such as comparing the overall diversity between different environments or conditions (e.g., comparing the gut microbiome composition at the phylum level between healthy and diseased individuals).
    - Classification at these levels is often more reliable, as there are fewer ambiguities in classifying bacteria at broad levels.

2. **Lower levels (Family, Genus, Species):**

    - These levels give us more specific information about the bacteria present and are more informative for understanding specific bacterial functions or identifying particular bacterial strains that may be beneficial or harmful.
    - Useful for targeted studies, such as identifying specific genera or species associated with particular health conditions or environmental changes.
    - Looking only at the species level may overcomplicate the data, making differences between populations difficult to detect.
    - However, 16S sequencing often struggles to resolve species-level distinctions, especially if two species have very similar 16S rRNA sequences. In these cases, we may only be able to confidently identify bacteria down to the genus level.

In practice, researchers often examine multiple levels to get a comprehensive picture. For example, we might examine broad patterns at the phylum level, then zoom in to the genus level to identify specific bacteria of interest.
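
Examining the same data at several levels amounts to summing counts over everything that shares a name at the chosen rank. Here is a minimal sketch with a hypothetical three-feature table for one sample. The counts are invented for illustration, and real tools store this as a feature table plus a taxonomy table.

```python
from collections import defaultdict

RANK_INDEX = {
    "kingdom": 0, "phylum": 1, "class": 2,
    "order": 3, "family": 4, "genus": 5,
}

# Hypothetical (lineage, read count) pairs for a single sample.
features = [
    (("Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
      "Lactobacillaceae", "Lactobacillus"), 120),
    (("Bacteria", "Firmicutes", "Clostridia", "Clostridiales",
      "Clostridiaceae", "Clostridium"), 80),
    (("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
      "Bacteroidaceae", "Bacteroides"), 200),
]


def collapse(features, rank):
    """Sum read counts over all features sharing a name at the given rank."""
    idx = RANK_INDEX[rank]
    totals = defaultdict(int)
    for lineage, count in features:
        totals[lineage[idx]] += count
    return dict(totals)


def relative_abundance(totals):
    """Convert raw counts to fractions of the sample total."""
    grand_total = sum(totals.values())
    return {taxon: count / grand_total for taxon, count in totals.items()}


by_phylum = collapse(features, "phylum")
by_genus = collapse(features, "genus")
print(relative_abundance(by_phylum))
print(relative_abundance(by_genus))
```

At the phylum level the two Firmicutes features merge into one group, while the genus-level view keeps them apart, which is the broad-versus-specific trade-off described above.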


<center>
    <img src="./images/taxonomy.png" alt="taxonomy" width="500"/>
</center>
<!-- image reference: https://www.mometrix.com/academy/biological-classification-systems/-->



## 5. Microbial Community Profiling and Taxonomic Classification

Microbial community profiling refers to identifying the composition and structure of microbial populations within a given environment. Through 16S rRNA sequencing, researchers can generate a comprehensive profile of the different bacterial species present in a sample and their relative abundances<sup>1</sup>. This profiling helps to characterize complex microbial ecosystems, such as the human gut, which contains thousands of bacterial species that play crucial roles in health and disease.

Key steps in microbial community profiling include:

1. **Data preprocessing:** Quality filtering, trimming, and chimera checking of sequences to remove low-quality reads and errors.
2. **Operational Taxonomic Unit (OTU) or Amplicon Sequence Variant (ASV) assignment:** Grouping sequences based on similarity to form clusters that represent distinct microbial taxa. Traditionally, sequences are grouped into OTUs based on a similarity threshold (e.g., 97%), but newer methods like DADA2 generate ASVs, which provide higher resolution by differentiating sequences down to single-nucleotide differences.
3. **Taxonomic classification:** Assigning taxonomy to OTUs or ASVs based on comparison with reference databases such as SILVA, Greengenes, or RDP (Ribosomal Database Project). This allows researchers to classify the microbes at various taxonomic levels, including phylum, genus, and species.
4. **Diversity analysis:** Understanding the diversity within a sample (alpha diversity) and between samples (beta diversity). Alpha diversity measures such as the Shannon diversity index<sup>8</sup> or Pielou’s evenness index<sup>9</sup> provide insights into species richness and evenness. Beta diversity metrics such as Bray-Curtis dissimilarity<sup>10</sup>, which takes abundances into account, and Jaccard distance<sup>11</sup>, which is based on the presence or absence of taxa, are used to compare microbial communities between samples and, often, between conditions.
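
The diversity measures named in step 4 can be written out in a few lines. This is a plain-Python sketch using the common textbook definitions (natural-log Shannon index, Pielou's evenness as Shannon divided by the log of observed richness, Bray-Curtis on counts, Jaccard on presence or absence). The count vectors are hypothetical, and dedicated packages would normally be used in a real analysis.

```python
import math


def shannon(counts):
    """Shannon diversity index H = -sum(p * ln p) over observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou(counts):
    """Pielou's evenness J = H / ln(S), where S is the number of observed taxa."""
    observed = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(observed) if observed > 1 else 0.0


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors (abundance-based)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def jaccard_distance(a, b):
    """Jaccard distance based only on which taxa are present in each sample."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0


# Hypothetical counts for the same four taxa in three samples.
sample_a = [10, 10, 10, 10]   # perfectly even
sample_b = [37, 1, 1, 1]      # same taxa, dominated by one
sample_c = [20, 20, 0, 0]     # two taxa missing

print("Shannon:", shannon(sample_a), shannon(sample_b))
print("Pielou:", pielou(sample_a), pielou(sample_b))
print("Bray-Curtis a vs b:", bray_curtis(sample_a, sample_b))
print("Jaccard a vs b:", jaccard_distance(sample_a, sample_b))
print("Jaccard a vs c:", jaccard_distance(sample_a, sample_c))
```

Samples `a` and `b` contain the same taxa, so their Jaccard distance is 0 even though their Bray-Curtis dissimilarity is large. That difference is why abundance-based and presence-based metrics are usually reported together.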

The outcome of taxonomic classification and microbial community profiling provides a detailed map of the microbial landscape in a sample, enabling researchers to explore relationships between microbial communities and external factors, such as diet, lifestyle, and health outcomes. This analysis is crucial for understanding how changes in microbial populations correlate with various physiological states or environmental conditions.


## 6. Study Design and Hypothesis Development
<center>
    <img src="./images/wolfpack.png" alt="wolfpack" width="200"/>
</center>

In this training module, we will be analyzing gut microbiome data from a University of Nevada, Reno study. The **WOLFPACK Study** (Wide Open Local Fecal sample collection comparing Pharmaceutical intake, ACtivity, and dietary intaKe) is designed to explore how diet, health, and many lifestyle aspects impact the gut microbiome of adults living in Northern Nevada. By examining the bacterial composition of fecal samples using 16S rRNA sequencing, and linking these findings to lifestyle and dietary information collected through surveys, this study aims to provide insights into how daily habits and health status influence gut health.

The study design involves collecting data through three main sources:

 - **Lifestyle Questionnaire:** This is a lengthy survey capturing information on participants' habits, including physical activity, home environment, social connectedness, socioeconomic status, and health history.
 - **[Food Frequency Questionnaire:](https://epi.grants.cancer.gov/diet/usualintakes/ffq.html)** This is a robust survey that assesses a variety of dietary patterns, asking about consumption frequency of many foods.
 - **16S rRNA Sequencing:** Participants also provided a fecal sample, which was processed and sequenced.


### Hypothesis and Research Questions
This study aims to explore the overarching question:

**How do diet and lifestyle influence the composition and diversity of the gut microbiome in Northern Nevadans?**

Based on this question, we can formulate several hypotheses and sub-questions to guide our analysis. 

 1. **Hypothesis 1:** Dietary patterns significantly impact the diversity and composition of the gut microbiome.
    - **Question:** How does the frequency of protein consumption correlate with the abundance of specific bacteria, such as those in the phylum Firmicutes, which has been associated with obesity?
    - **Question:** Are there specific dietary patterns (e.g., high-protein) associated with distinct gut microbiome profiles?
      

 2. **Hypothesis 2:** Gut microbiome composition is shaped by host factors such as sex and body mass index (BMI).
    - **Question:** Is there a correlation between lower BMI and greater microbiome diversity?
    - **Question:** Are there distinct differences between sexes in the abundance of specific bacteria?
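
A question such as the BMI and diversity one above is typically checked with a rank-based correlation. The sketch below implements Spearman's correlation in plain Python as one possible approach. The BMI and Shannon values are invented purely to show the calculation and are not WOLFPACK results.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five participants (illustration only).
bmi = [21.0, 24.5, 27.0, 30.5, 33.0]
shannon_diversity = [3.9, 3.6, 3.7, 3.1, 2.8]

rho = spearman(bmi, shannon_diversity)
print(f"Spearman rho = {rho:.2f}")
```

A negative coefficient here would be consistent with lower BMI going along with higher diversity, but a real analysis would also need a significance test and adjustment for other variables collected in the questionnaires.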

Hypotheses defined in advance guide the research workflow and lead to well-defined results. Take some time to think about your own hypothesis and sub-questions to explore.

### Sample Population
This study's sample population is **adult residents of Northern Nevada** who are interested in participating. To ensure a broad representation, participation is open to adults of various ages, genders, ethnic backgrounds, and health statuses. The study's open enrollment and anonymous data collection aim to make participation accessible and encourage a wide range of community members to join. By focusing on the Northern Nevada population, the study can provide insights into region-specific factors that may influence gut health, allowing for comparisons with other populations in future research.

## 7. Introduction to Using AWS for Cloud-Based Data Analysis Labs

This training module will use Amazon Web Services (AWS) for computational resources. AWS provides powerful cloud computing tools that enable researchers to perform large-scale data analysis without the need to maintain physical infrastructure. In a cloud-based lab environment, AWS allows users to store, process, and analyze complex datasets using scalable resources on demand. This flexibility is particularly useful for bioinformatics and microbial studies, where large datasets, such as 16S rRNA sequencing data, require significant computational power and storage.

<div style="padding: 20px;">
</div>

<center>
    <img src="./images/aws.png" alt="aws" width="250"/>
</center>

### AWS S3 for Data Storage and Access
Amazon Simple Storage Service (S3) is a highly scalable and durable cloud-based storage service that provides a secure and efficient way to store and retrieve large amounts of data. It is designed to handle various data types, making it particularly useful in data-intensive applications, where large datasets from sequencing experiments need to be stored and accessed quickly.

Key features of AWS S3 for data storage and access include:

+ **Object Storage:** Amazon S3 stores data as objects within buckets. An *object* is a file and any metadata that describes the file. A *bucket* is a container for objects. To store your data in Amazon S3, you first create a bucket and specify a bucket name and AWS Region. Then, you upload your data to that bucket as objects in Amazon S3. Each object has a *key* (or *key name*), which is the unique identifier for the object within the bucket.

+ **Durability and Availability:** AWS S3 offers high durability and availability, ensuring that your data is protected and available when you need it. This is particularly important in research settings where losing large datasets could compromise entire projects.

+ **Scalability:** AWS S3 is elastic and scales automatically, so you don’t have to worry about running out of storage space or paying for more than you need as your datasets grow or shrink. You can store virtually unlimited data, making it suitable for long-term projects.

+ **Cost-Effective Storage Tiers:** S3 offers multiple storage tiers, such as Standard, Infrequent Access, and Glacier (for archival purposes), allowing you to optimize costs based on how frequently data is accessed. For instance, frequently used data can be stored in Standard storage, while historical data can be archived in Glacier.

+ **Data Access and Management:** S3 provides a variety of tools to manage and access your data:

    - *AWS CLI and SDKs:* You can interact with S3 through the AWS Command Line Interface (CLI) or Software Development Kits (SDKs) for Python, R, and other languages, which allows programmatic access to store, retrieve, and organize data.
    - *Versioning:* S3 supports versioning, enabling you to keep multiple versions of the same object, which is useful when working with evolving projects.
    - *Lifecycle Policies:* Automate the movement of objects between different storage classes (e.g., moving less-used data to a cheaper storage class) to optimize costs.

+ **Data Security:** AWS S3 includes multiple layers of security:

    - *Encryption:* Amazon S3 supports both server-side encryption and client-side encryption for data uploads.
    - *Access Control:* You can control who has access to your data through bucket policies and IAM roles, which allows researchers and collaborators to work on the same datasets without compromising security.
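
The bucket, object, and key concepts and the lifecycle policies described above can be sketched with the AWS SDK for Python (`boto3`). The bucket name, key layout, and helper names below are hypothetical choices for illustration. Running the upload requires `boto3` to be installed and AWS credentials to be configured, so the import is deferred until the upload function is actually called.

```python
def fastq_key(sample_id, read, prefix="raw"):
    """Build an object key such as 'raw/S001/S001_R1.fastq.gz'.

    The layout is a hypothetical convention; S3 only requires that the key
    is unique within the bucket.
    """
    return f"{prefix}/{sample_id}/{sample_id}_R{read}.fastq.gz"


def upload_fastq(bucket, local_path, sample_id, read):
    """Upload one FASTQ file to S3 and return the key it was stored under."""
    import boto3  # deferred so the helpers above work without AWS set up

    s3 = boto3.client("s3")
    key = fastq_key(sample_id, read)
    s3.upload_file(local_path, bucket, key)
    return key


def archive_rule(prefix, days_to_ia=30, days_to_glacier=365):
    """Lifecycle configuration that moves objects under a prefix to cheaper tiers.

    The day thresholds are example values, not recommendations.
    """
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix}",
                "Filter": {"Prefix": f"{prefix}/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


print(fastq_key("S001", 1))
print(archive_rule("raw")["Rules"][0]["Transitions"])
```

With credentials in place, `upload_fastq("example-16s-bucket", "S001_R1.fastq.gz", "S001", 1)` would store the file, and the dictionary from `archive_rule` could be passed as `LifecycleConfiguration` to the client's `put_bucket_lifecycle_configuration` call.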


### AWS Lambda for Data Processing Automation
AWS Lambda is a serverless computing service that allows you to run code without provisioning or managing servers. In scientific research, where you may need to process large amounts of sequencing data, Lambda can be used to automate routine tasks such as data preprocessing, quality control, and formatting for downstream analysis. Lambda’s serverless nature makes it highly efficient and cost-effective, as it only charges for the compute time used during function execution.

Key features of AWS Lambda for data processing automation include:

+ **Serverless Computing:** Lambda operates without the need to set up or maintain servers. This makes it particularly attractive in workflows that require scalable, on-demand processing. You simply write and deploy your code (as a Lambda function), and AWS handles the execution environment, scaling the infrastructure automatically based on the workload.

+ **Automatic Scaling:** Lambda automatically responds to code execution requests at any scale, from a dozen events per day to hundreds of thousands per second.

+ **Pay-as-you-go Pricing:** With Lambda, you pay only for the compute time you use, billed by the millisecond, instead of provisioning infrastructure upfront for peak capacity.

+ **Optimized Performance:** You can tune code execution time and performance by choosing an appropriate function memory size. With Provisioned Concurrency, functions can respond to high demand in double-digit milliseconds.


AWS Lambda can streamline complex workflows, allowing researchers to focus on analysis and interpretation rather than manual data management. By integrating with S3 for storage and EC2 for compute-heavy tasks (if needed), Lambda can become a key component of a scalable, automated, and cost-efficient analysis pipeline.
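
To show how Lambda and S3 can fit together, here is a minimal sketch of a handler that reacts to S3 upload notifications and picks out FASTQ files for later processing. The handler name, the file-extension filter, and the example event are assumptions for illustration. Only the nested `Records`, `s3`, `bucket`, and `object` fields follow the S3 event notification structure.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the FASTQ objects mentioned in an S3 event notification.

    Object keys arrive URL-encoded in S3 events, so they are decoded first.
    A real function would start preprocessing here; this sketch only
    reports what it found.
    """
    fastq_objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith((".fastq", ".fastq.gz")):
            fastq_objects.append({"bucket": bucket, "key": key})
    return {"fastq_objects": fastq_objects}


# Hypothetical event with one FASTQ upload and one unrelated file.
example_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-16s-bucket"},
                "object": {"key": "raw/S001/S001_R1.fastq.gz"}}},
        {"s3": {"bucket": {"name": "example-16s-bucket"},
                "object": {"key": "notes/read+me.txt"}}},
    ]
}

result = lambda_handler(example_event, None)
print(result)
```

Because the handler is an ordinary Python function, it can be tested locally with a hand-written event like the one above before it is deployed.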

## Quiz

In [None]:
#Install jupyterquiz library
%pip install jupyterquiz

In [None]:
#Load jupyterquiz library
from jupyterquiz import display_quiz

In [None]:
#Display quiz as html
#Instructions for creating quiz .json files and converting to html provided in the links below
from IPython.display import IFrame
IFrame('questions/Quiz_Submodule1.html', width=800, height=400)

## Conclusion
In this module, we explored using 16S rRNA sequencing analysis as a powerful tool to investigate the diversity and composition of the microbiome. By leveraging this technology, we can explore the intricate relationships between microbial populations and their host’s health, lifestyle, and diet. Sequencing analysis enables comprehensive profiling, providing insights into species richness and community composition.

Through microbial community profiling and taxonomic classification, we can better understand how microbiota influences biological processes, contributing to fields such as gut health, disease research, and environmental microbiology. This knowledge has far-reaching implications, helping to uncover the role of the microbiome in health and disease, guiding future research, and informing personalized approaches to healthcare and lifestyle interventions.

## References
1. Weinroth MD, et al. (2022). Considerations and best practices in animal science 16S ribosomal RNA gene sequencing microbiome studies. Journal of Animal Science 100(2), 1525-3163. doi:10.1093/jas/skab346
2. Clemente JC, Ursell LK, Parfrey LW, et al. (2012) The impact of the gut microbiota on human health: an integrative view. Cell 148, 1258–1270
3. Xu Z, & Knight R. (2015). Dietary effects on human gut microbiome diversity. British Journal of Nutrition, 113(S1), S1–S5. doi:10.1017/S0007114514004127
4. Turnbaugh PJ, Ley RE, Hamady M, et al. (2007) The Human Microbiome Project. Nature 449, 804–810
5. Caporaso, J. G., C. L. Lauber, W. A. Walters, D. Berg-Lyons, C. A. Lozupone, P. J. Turnbaugh, N. Fierer, and R. Knight. 2011. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl. Acad. Sci. U. S. A. 108 (Suppl 1):4516–4522. doi:10.1073/pnas.1000080107
6. Johnson, J.S., Spakowicz, D.J., Hong, BY. et al. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun 10, 5029 (2019). doi:10.1038/s41467-019-13036-1
7. Shah, N., Meisel J., and Pop M. (2019) Embracing Ambiguity in the Taxonomic Classification of Microbiome Sequencing Data. Frontiers in Genetics. 10, 1664-8021. doi:10.3389/fgene.2019.01022
8. Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27:379–423. doi:10.1002/j.1538-7305.1948.tb01338.x
9. Pielou, E. C. 1966. The measurement of diversity in different types of biological collections. J. Theor. Biol. 13:131–144. doi:10.1016/0022-5193(66)90013-0
10. Bray, J. R., and J. T. Curtis. 1957. An ordination of the upland forest communities of Southern Wisconsin. Ecol. Monogr. 27:325–349. doi:10.2307/1942268
11. Jaccard, P. 1901. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. doi:10.5169/seals-266450