# Submodule 3: Taxonomic Classification and Diversity Analysis


## Overview


## Learning Objectives
+ Perform taxonomic classification of 16S rRNA sequences
+ Evaluate and explore rarefaction curves to assess sampling depth and species richness
+ Analyze alpha and beta diversity within microbial communities


# 1. Import Data

### Files
 
 1. **Food Frequency Questionnaire (FFQ):**
    - This CSV file contains data from a food frequency questionnaire completed by study participants. It includes columns for participant identifiers (e.g., SampleID), demographic information (e.g., SEX, AGE), and food consumption frequencies and quantities (e.g., BREAKFASTSANDWICHFREQ, EGGSFREQ). This data will allow us to analyze dietary habits and link them with microbiome profiles.
 2. **Lifestyle Questionnaire:**
    - This text file contains responses to lifestyle questions, such as physical activity levels, smoking status, and other lifestyle factors. It has a row for each question, an answer key, and responses for individual participants in columns.
   
 3.  **FASTQ Files:**
    - These files contain raw sequencing data from 16S rRNA sequencing. Each read in a FASTQ file has a sequence identifier, nucleotide sequence, and quality scores. 
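
To make the four-line FASTQ record layout concrete, here is a minimal sketch in base R. The record below is made up for illustration (it is not taken from the study data), and the decoding assumes the common Phred+33 quality encoding, which may not hold for every sequencing platform.

```r
# Toy FASTQ record (hypothetical, for illustration only)
record <- c(
  "@read_001",   # line 1: sequence identifier, starts with "@"
  "ACGTTGCA",    # line 2: nucleotide sequence
  "+",           # line 3: separator line
  "IIIIHH#5"     # line 4: one quality character per base
)

read_id  <- sub("^@", "", record[1])
sequence <- record[2]

# Assumption: Phred+33 encoding, so quality score = ASCII code - 33
quality <- utf8ToInt(record[4]) - 33L

# Every base must have exactly one quality score
stopifnot(nchar(sequence) == length(quality))

per_base <- data.frame(base  = strsplit(sequence, "")[[1]],
                       phred = quality)
per_base
```

Low scores (such as the `#` character here) mark bases the sequencer was unsure about, which is why quality filtering happens before taxonomic classification.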

Let's start by loading the survey files into DataFrames and performing some initial quality checks to ensure data integrity.

## Import Survey Data

### Step 1: Load the Survey Data
Let's load both survey files:

In [None]:
# Install and load required packages
if (!requireNamespace("dplyr", quietly = TRUE)) {
    install.packages("dplyr")
}

if (!requireNamespace("ggplot2", quietly = TRUE)) {
    install.packages("ggplot2")
}


library("dplyr")
library("ggplot2")  

In [None]:
#Load in the data
ffq <- read.csv('datach/FFQ_Data.csv')
lsq <- read.csv("cleanMetaData.txt", sep = '\t')
# Display the first few rows of the dataframe to understand its structure
head(ffq)

Our Food Frequency Questionnaire has columns for the Sample ID, the Respondent ID number, and every question asked. The column headings are abbreviations of the questions. Now let's look at the dimensions of our data.

In [None]:
# Full dimensions
dim(ffq)
# Number of rows
nrow(ffq)
# Number of columns
ncol(ffq)

We can see that we have 100 participants (each row is a sample) and 1074 columns, which means over 1,000 food questions! Now let's look at the lifestyle survey.

In [None]:
# Display the first few rows of the dataframe to understand its structure
head(lsq)

Our Lifestyle Questionnaire has columns for the Sample ID and every lifestyle question asked. The column headings are abbreviations of the questions. Now let's look at the dimensions of our data.

In [None]:
# Full dimensions
dim(lsq)
# Number of rows
nrow(lsq)
# Number of columns
ncol(lsq)

This survey has fewer questions (only 33) and 96 participants.

### Step 2: Quick Quality Control Checks
Let's perform several initial checks to assess data quality:

 1. **Collect Intersection of Samples:** We need to find the Sample IDs that appear in both the FFQ data and the LSQ data.


In [None]:
# Find sample intersection
samples <- intersect(ffq$SampleID, lsq$SampleID)

In [None]:
length(samples)

 2. **Basic Summary Statistics:** We can get an overview of the data, which can help identify outliers or unexpected values. We will look at the demographic questions in the FFQ and Lifestyle Questionnaire separately.

In [None]:
# Summary statistics for the first 9 columns
summary(ffq[, 1:9])


In [None]:
# Summary statistics for columns 3 to 5
summary(lsq[, 3:5])


3. **Visualize Data:** Ensure the data looks accurate with exploratory visualizations.

In [None]:
# Load necessary library
library(ggplot2)

# Set up the plotting theme
theme_set(theme_minimal())

# 1. Bar plot for Sex distribution
# The labels are applied inside the plot call, so the original data is not modified
ggplot(ffq, aes(x = factor(SEX, levels = c(1, 2), labels = c("Male", "Female")))) +
  geom_bar(fill = "#0073C2FF") +  # Blue color for bars
  labs(title = "Distribution of Sex", x = "Sex", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
    axis.text = element_text(size = 18))


In [None]:
# Load necessary library
library(ggplot2)

# Set up the plotting theme
theme_set(theme_minimal())

ggplot(lsq, aes(x = factor(Sex))) +
  geom_bar(fill = "#0073C2FF") +  # Blue color for bars
  labs(title = "Distribution of Sex", x = "Sex", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
        axis.text = element_text(size = 18))


In [None]:
# 2. Histogram for Age distribution
# Rows with a missing AGE are dropped by ggplot2 (with a warning)
ggplot(ffq, aes(x = AGE)) +
  geom_histogram(bins = 10, fill = "lightgreen", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
    axis.text = element_text(size = 18))


In [None]:
ggplot(lsq, aes(x = Age)) +
  geom_histogram(bins = 10, fill = "lightgreen", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
        axis.text = element_text(size = 18))

In [None]:
# 3. Histogram for Weight distribution
# Rows with a missing WEIGHT are dropped by ggplot2 (with a warning)
ggplot(ffq, aes(x = WEIGHT)) +
  geom_histogram(bins = 15, fill = "lightcoral", color = "black") +
  labs(title = "Weight Distribution", x = "Weight", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
    axis.text = element_text(size = 18))

In [None]:
ggplot(lsq, aes(x = Weight)) +
  geom_histogram(bins = 15, fill = "lightcoral", color = "black") +
  labs(title = "Weight Distribution", x = "Weight", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
        axis.text = element_text(size = 18))
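
The checks in this step can be complemented with a few lines that list which IDs are missing from each survey, count missing values, and flag participants whose answers disagree between the two surveys. The sketch below uses small made-up data frames (`ffq_toy` and `lsq_toy` are hypothetical stand-ins, not the study data) so it runs on its own; the same calls should carry over to `ffq` and `lsq`, assuming both have a `SampleID` column.

```r
# Toy stand-ins for the two surveys (hypothetical values)
ffq_toy <- data.frame(SampleID = c("WP_01", "WP_02", "WP_03", "WP_04"),
                      AGE      = c(34, NA, 51, 29),
                      stringsAsFactors = FALSE)
lsq_toy <- data.frame(SampleID = c("WP_02", "WP_03", "WP_05"),
                      Age      = c(41, 52, 60),
                      stringsAsFactors = FALSE)

# IDs present in both surveys, and IDs present in only one of them
shared   <- intersect(ffq_toy$SampleID, lsq_toy$SampleID)
only_ffq <- setdiff(ffq_toy$SampleID, lsq_toy$SampleID)
only_lsq <- setdiff(lsq_toy$SampleID, ffq_toy$SampleID)

# Number of missing values in each FFQ column
na_ffq <- colSums(is.na(ffq_toy))

# For shared participants, flag ages that disagree between surveys.
# which() skips comparisons that are NA because one age is missing.
ages     <- merge(ffq_toy, lsq_toy, by = "SampleID")
mismatch <- ages[which(ages$AGE != ages$Age), ]

shared
only_ffq
only_lsq
na_ffq
mismatch
```

Listing the non-shared IDs with `setdiff()` is often more informative than the intersection alone, because it shows exactly which participants would be lost or left incomplete when the surveys are merged.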


### Step 3: Combine Metadata

We have two different surveys, so let's combine them into a single metadata table to work with.

In [None]:
# Load in data from Submodule 2
sample.names <- readRDS('sampleNames.rds')
taxa         <- readRDS("taxa.rds")
reads        <- readRDS("seqtabnochim.rds")

In [None]:
# Merge data
metadata <- merge(ffq, lsq, by = "SampleID", all = TRUE)
# Rename samples to match FastQ files
metadata$SampleID <- gsub("WP_", "WP-", metadata$SampleID)

Our metadata contains many more samples than were sequenced, so we need to keep only the samples that have sequencing reads.

In [None]:
# Flag the samples that have sequencing reads
dfseqIndo <- data.frame(sampleNames = sample.names,
                       inSeq       = rep('yes', length(sample.names)))
# Keep only the flagged samples (left join on the sequenced sample names)
metadata <- merge(dfseqIndo, metadata, by.x = 'sampleNames', by.y = 'SampleID', all.x = TRUE)
colnames(metadata)[1] <- 'SampleID'
rownames(metadata)    <- metadata$SampleID
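
Because the merge keeps every sequenced sample (`all.x = TRUE`), a sequenced sample with no survey answers still gets a row, filled with `NA`. The following sketch shows that behavior on made-up tables (`seq_toy` and `meta_toy` are hypothetical) and one way to list such samples; it is an illustration under those assumptions, not a check taken from the original analysis.

```r
# Hypothetical sequenced samples and survey metadata
seq_toy  <- data.frame(sampleNames = c("WP-01", "WP-02", "WP-09"),
                       inSeq       = "yes",
                       stringsAsFactors = FALSE)
meta_toy <- data.frame(SampleID = c("WP-01", "WP-02", "WP-03"),
                       Sex      = c("Female", "Male", "Female"),
                       stringsAsFactors = FALSE)

# Left join: every sequenced sample is kept, survey-only samples are dropped
kept <- merge(seq_toy, meta_toy,
              by.x = "sampleNames", by.y = "SampleID", all.x = TRUE)

# Sequenced samples without survey data show up with NA in the survey columns
no_meta <- kept$sampleNames[is.na(kept$Sex)]

kept
no_meta
```

Here `WP-03` is dropped because it was never sequenced, while `WP-09` is kept with missing metadata. Samples like `WP-09` are worth reviewing before any grouping by sex or BMI.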

### Step 4: Assign Taxonomy to our Sequencing Data

We will be using the package `phyloseq` to analyze our microbiomes.

In [None]:
# Combine the ASV counts, taxonomy table, and metadata into a single phyloseq object
library(phyloseq)
ps <- phyloseq(otu_table(reads, taxa_are_rows = FALSE),
               tax_table(taxa),
               sample_data(metadata))

ps

In [None]:
# Export the ASV count table (loaded above as `reads`) to a tab-delimited file
write.table(otu_table(reads, taxa_are_rows = FALSE), 
            "ASV_table.txt", row.names = TRUE, quote = FALSE, sep = '\t')

In [None]:
library(vegan)
library(dplyr)
library(ggpubr)

### Step 5: Data Analysis and Visualization 

#### Rarefaction Curves

Rarefaction curves are essential tools in microbiome analysis used to assess the richness of microbial communities and the adequacy of sequencing depth. These curves plot the number of observed species against the number of sequencing reads, providing insight into species diversity within a sample.

The shape of a rarefaction curve indicates whether sequencing captured the community’s full diversity: a plateau suggests that most species have been identified, while a steep incline indicates that additional sequencing may uncover more species. Rarefaction also enables comparisons across samples with varying sequencing depths, ensuring that observed differences in diversity are not artifacts of unequal sampling effort.
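
To see where the plateau comes from, the sketch below computes the expected number of species in a random subsample of `n` reads using the standard hypergeometric rarefaction formula. The count vector is a toy example and `rarefied_richness()` is a hypothetical helper written for this illustration; it is not the function used in the analysis below.

```r
# Expected number of species observed in a random subsample of n reads.
# counts: reads per species in one sample; n: subsample size.
rarefied_richness <- function(counts, n) {
  counts <- counts[counts > 0]
  N <- sum(counts)
  stopifnot(n >= 1, n <= N)
  # P(species i is absent from the subsample) = choose(N - N_i, n) / choose(N, n),
  # computed on the log scale to avoid overflow
  p_absent <- exp(lchoose(N - counts, n) - lchoose(N, n))
  sum(1 - p_absent)
}

# Toy sample: 1000 reads spread unevenly over 7 species
counts <- c(500, 300, 150, 40, 8, 1, 1)
depths <- c(1, 10, 100, 500, 1000)
curve  <- sapply(depths, function(n) rarefied_richness(counts, n))

data.frame(depth = depths, expected_species = round(curve, 2))
```

The expected richness rises quickly while the abundant species are being found and then flattens as only rare species remain undiscovered. At the full depth it equals the number of species actually observed in the sample.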

Let's start by coloring the curves by the number of animals in the house and using the line type to show sex.

In [None]:
# First, recode some variables
# Treat the number of animals as discrete, not continuous
metadata$num_animals <- factor(metadata$num_animals, levels = c('0', '1', '2', '3', '4'))
# Order the BMI categories
metadata$BMI_ordinal <- factor(metadata$BMI_ordinal, levels = c('Healthy weight', 'Overweight', 'Obesity Class 1', 'Obesity Class 2 & 3'))

In [None]:
rare <- rarecurve(reads,step = 1000, tidy = TRUE)
rare <- inner_join(rare, metadata, by=c("Site"="SampleID"))

ggplot(rare, aes(x=Sample, y=Species, col=num_animals, linetype=Sex, group=Site))+
  geom_line(linewidth = 2, alpha = 0.8)+
  scale_color_manual(values = c('0' = '#f7b801', '1' = '#f18701', 
                                '2' = '#f35b04', '3' = '#780116', 
                                '4' = '#52006A')) +
  theme_pubr(legend="right") +
  labs(x="Sample Size")

Now let's plot BMI and sex.

In [None]:
ggplot(rare, aes(x=Sample, y=Species, col=BMI_ordinal, linetype=Sex, group=Site))+
  geom_line(linewidth = 2, alpha = 0.8) +
  scale_color_manual(values = c('Healthy weight' = '#f7b801', 'Overweight' = '#f18701', 
                                'Obesity Class 1' = '#f35b04', 'Obesity Class 2 & 3' = '#780116')) +
  theme_pubr(legend="right") +
  labs(x="Sample Size")

#### Diversity in Microbiome

Alpha and beta diversity are key concepts in microbiome studies used to understand the composition of microbial communities.
 - **Alpha diversity** refers to the diversity within a single sample, capturing the richness (number of species) and evenness (distribution of species). It provides insights into how diverse a community is at a specific site or condition. Common metrics include species richness, Simpson’s index, and the Shannon index.
 - **Beta diversity**, on the other hand, measures the variation in microbial communities between samples. It highlights differences in composition and can reveal patterns of microbial distribution across environments or conditions. Common beta diversity metrics include Bray-Curtis dissimilarity and UniFrac distances.

##### Alpha Diversity

Let's look at the alpha diversity in our samples with the **Shannon index**. The Shannon index measures both the richness and evenness of species within a sample. A higher Shannon index indicates greater diversity, while a value of 0 means only one species is present in the community.
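
As a worked example, the Shannon index is $H = -\sum_i p_i \ln p_i$, where $p_i$ is the proportion of reads belonging to species $i$. The sketch below computes it by hand on toy count vectors; `shannon_index()` is a hypothetical helper for illustration, and it uses the natural logarithm, which is also the default base in `vegan::diversity()`.

```r
# Shannon index from a vector of per-species counts
shannon_index <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # proportions of the species present
  -sum(p * log(p))
}

one_species <- shannon_index(c(100, 0, 0, 0))    # a single species
even_four   <- shannon_index(c(25, 25, 25, 25))  # four equally abundant species
skewed_four <- shannon_index(c(97, 1, 1, 1))     # four species, one dominant

c(one_species = one_species, even_four = even_four, skewed_four = skewed_four)
```

The two four-species communities have the same richness, but the dominated one scores much lower than the even one. For `S` equally abundant species the index reaches its maximum, `log(S)`.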

In [None]:
shannon <- diversity(reads, index = "shannon") %>%
  as_tibble(rownames = "SampleID")%>%
  inner_join(., metadata, by="SampleID")

In [None]:
ggplot(shannon, aes(x=Sex, y=value, col=Sex))+
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width = 0.2) +
  theme_pubr(legend="right") +
  labs(x="",y="Shannon Weaver Index") +
  stat_compare_means(aes(x=Sex, y=value, col=Sex), hide.ns = F, method="t.test",label = "p.signif",
                     label.x = 1.5, label.y= 4.8, show.legend = F) +
  theme(legend.position="none")

Let's look at BMI within sex.

In [None]:
ggplot(shannon, aes(x=Sex, y=value, col=BMI_ordinal))+
  geom_boxplot(outlier.shape = NA)+
  # Jitter the points within each dodged box so they line up with their BMI group
  geom_point(position = position_jitterdodge(jitter.width = 0.2)) +
  facet_grid(cols = vars(Sex), scales = "free_x")+
  scale_color_manual(values = c('Healthy weight' = '#f7b801', 'Overweight' = '#f18701', 
                                'Obesity Class 1' = '#f35b04', 'Obesity Class 2 & 3' = '#780116')) +
  theme_pubr(legend="right") +
  labs(x="",y="Shannon Weaver Index") 

Next, let's look at the alpha diversity with the **Inverse Simpson Index**. The Inverse Simpson Index reflects both species richness and evenness within a community. It accounts for the number of species present and their relative abundances, emphasizing the dominance of common species. A higher Inverse Simpson value indicates greater diversity, as it signifies more species evenly distributed in the community. This index is particularly useful for understanding community composition and detecting dominance patterns in microbiome studies.
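
The Inverse Simpson Index is $1 / \sum_i p_i^2$, where $p_i$ is the proportion of species $i$. A small hand computation on toy counts shows how strongly it responds to dominance; `inv_simpson()` is a hypothetical helper written for this sketch, not part of the analysis code.

```r
# Inverse Simpson index from a vector of per-species counts
inv_simpson <- function(counts) {
  p <- counts / sum(counts)
  1 / sum(p^2)
}

even_four      <- inv_simpson(c(25, 25, 25, 25))  # four equally abundant species
dominated_four <- inv_simpson(c(97, 1, 1, 1))     # four species, one dominant

c(even_four = even_four, dominated_four = dominated_four)
```

For a perfectly even community the index equals the number of species (4 here), so it can be read as an "effective number of species". The dominated community scores only slightly above 1 even though it also contains four species.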

In [None]:
invsimp<-diversity(reads, index = "invsimpson")%>%
  as_tibble(rownames = "SampleID")%>%
  inner_join(., metadata, by="SampleID")

In [None]:
ggplot(invsimp, aes(x=Sex, y=value, col=Sex))+
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width = 0.2) +
  theme_pubr(legend="right") +
  labs(x="",y="Inverse Simpson Index") +
  stat_compare_means(aes(x=Sex, y=value, col=Sex), hide.ns = F, method="t.test",label = "p.signif",
                     label.x = 1.5, label.y= 60, show.legend = F) +
  theme(legend.position="none")

##### Beta Diversity

Non-metric Multidimensional Scaling (NMDS) is an ordination method used to visualize differences in community composition between samples, making it a key tool for analyzing beta diversity. It represents pairwise dissimilarities (e.g., Bray-Curtis distances) in a low-dimensional space, preserving the rank order of distances rather than their exact values.

The result is a visual representation where samples with similar microbial communities appear closer together, while dissimilar communities are farther apart. NMDS is particularly useful for identifying patterns and grouping structures in complex ecological or microbiome datasets. The quality of the ordination is assessed using a stress value, with lower values indicating a better representation of the original data.
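
The pairwise dissimilarities that NMDS starts from can be computed by hand. For two samples with count vectors $x$ and $y$, the Bray-Curtis dissimilarity is $\sum_i |x_i - y_i| / \sum_i (x_i + y_i)$. The sketch below applies this to a toy count matrix (samples in rows, as in `reads`); `bray_curtis()` and the data are hypothetical and only illustrate the formula.

```r
# Bray-Curtis dissimilarity between two count vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Toy community matrix: 4 samples (rows) by 4 taxa (columns)
toy <- rbind(A = c(10, 20,  0, 5),
             B = c(10, 20,  0, 5),   # identical to A
             C = c( 0,  0, 30, 0),   # shares no taxa with A
             D = c( 5, 25,  5, 0))   # partly overlaps with A

# Fill the full pairwise dissimilarity matrix
n <- nrow(toy)
d <- matrix(0, n, n, dimnames = list(rownames(toy), rownames(toy)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(toy[i, ], toy[j, ])
  }
}

round(d, 3)
```

Identical samples get 0, samples with no shared taxa get 1, and everything else falls in between. NMDS then places the samples so that the rank order of these values is preserved as well as possible.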

In [None]:
library(tidyverse)
library(dada2)
library(vegan)
library(ggvegan) 
library(ggpubr)
library(rstatix)
library(stringr)

In [None]:
nmds <- metaMDS(reads, distance = "bray", trymax = 40, autotransform = TRUE)
nmds

In [None]:
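# Refit with the default metaMDS settings and plot with vegan's base graphics
nmds2 <- metaMDS(reads)
plot(nmds2, type = "t")
# ggplot2-based plot of the first ordination (autoplot method from ggvegan)
autoplot(nmds)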

In [None]:
f_nmds <- fortify(nmds) %>%
  subset(., score=="sites") %>%
  inner_join(., metadata, by=c("label"="SampleID"))

In [None]:
ggplot(f_nmds)+
  geom_point(aes(x=NMDS1,y=NMDS2, col=BMI_ordinal))+
  theme_pubr(legend = "right")+
  geom_abline(intercept = 0,slope = 0,linetype="dashed", linewidth=0.3)+
  geom_vline(aes(xintercept=0), linetype="dashed", linewidth=0.3)+
  geom_text(aes(x=NMDS1,y=NMDS2, col=BMI_ordinal, label=label))+
  scale_color_manual(values = c('Healthy weight' = '#f7b801', 'Overweight' = '#f18701', 
                                'Obesity Class 1' = '#f35b04', 'Obesity Class 2 & 3' = '#780116')) +
  labs(caption = "NMDS ordination of Bray-Curtis dissimilarities, colored by BMI category")


ggplot(f_nmds)+
  geom_point(aes(x=NMDS1,y=NMDS2, col=num_animals))+
  theme_pubr(legend = "right")+
  geom_abline(intercept = 0,slope = 0,linetype="dashed", linewidth=0.3)+
  geom_vline(aes(xintercept=0), linetype="dashed", linewidth=0.3)+
  scale_color_manual(values = c('0' = '#e9d8a6', '1' = '#c7e9b4', 
                                '2' = '#41b6c4', '3' = '#225ea8', '4' = '#081d58')) +
  geom_text(aes(x=NMDS1,y=NMDS2, col=num_animals, label=label))
  

---------------------------------------------------

## Conclusion
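In this submodule, we loaded and quality-checked the FFQ and lifestyle survey data, merged them into a single metadata table, and combined that table with the ASV counts and taxonomy from Submodule 2 in a `phyloseq` object. We then used rarefaction curves to assess sequencing depth and compared alpha diversity (Shannon and Inverse Simpson indices) and beta diversity (NMDS on Bray-Curtis dissimilarities) across sex, BMI category, and the number of animals in the house.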

## Clean up
Remember to shut down your VM and delete any resources you no longer need.