# Submodule 3: Taxonomic Classification and Diversity Analysis


## Overview


## Learning Objectives
+ Perform taxonomic classification of 16S rRNA sequences
+ Evaluate and explore rarefaction curves to assess sampling depth and species richness
+ Analyze alpha and beta diversity within microbial communities


# 1. Import Data

### Files
 
 1. **Food Frequency Questionnaire (FFQ):**
    - This CSV file contains data from a food frequency questionnaire completed by study participants. It includes columns for participant identifiers (e.g., SampleID), demographic information (e.g., SEX, AGE), and food consumption frequencies and quantities (e.g., BREAKFASTSANDWICHFREQ, EGGSFREQ). This data will allow us to analyze dietary habits and link them with microbiome profiles.
 2. **Lifestyle Questionnaire:**
    - This text file contains responses to lifestyle questions, such as physical activity levels, smoking status, and other lifestyle factors. It has a row for each question, an answer key, and responses for individual participants in columns.
 3. **FASTQ Files:**
    - These files contain raw sequencing data from 16S rRNA sequencing. Each read in a FASTQ file has a sequence identifier, a nucleotide sequence, and per-base quality scores.
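
To make the FASTQ layout concrete, here is a minimal sketch that takes one record apart in base R. The record is made up for illustration (it is not from the study data), and the quality decoding assumes the common Phred+33 encoding, which may not match every sequencing platform.

```r
# A single made-up FASTQ record (4 lines); not taken from the study data
record <- c(
  "@read_001",   # line 1: sequence identifier, starts with "@"
  "ACGTTGCA",    # line 2: nucleotide sequence
  "+",           # line 3: separator line
  "IIIIHH#5"     # line 4: one quality character per base
)

read_id  <- sub("^@", "", record[1])
sequence <- record[2]

# Assumption: Phred+33 encoding, so quality score = ASCII code - 33
quality <- utf8ToInt(record[4]) - 33

# Every base should have exactly one quality score
stopifnot(nchar(sequence) == length(quality))

print(data.frame(base = strsplit(sequence, "")[[1]], phred = quality))
```

In practice a dedicated package would read whole FASTQ files; this sketch only illustrates what each line of a record holds.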

Let's start by loading the survey files into DataFrames and performing some initial quality checks to ensure data integrity.

## Import Survey Data

### Step 1: Load the Survey Data
Let's load both survey files:

In [None]:
# Function to check and install required packages
install_if_missing <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}

# Install and load required packages
install_if_missing("dplyr")
install_if_missing("ggplot2")  

In [None]:
# Load the data
ffq <- read.csv('datach/FFQ_Data.csv')
# The lifestyle file is tab-delimited rather than comma-separated, so we set the separator explicitly
lsq <- read.csv('datach/LS_Data.txt', sep = '\t', fileEncoding = "ISO-8859-1", fill = TRUE)

# Display the first few rows of the dataframe to understand its structure
head(ffq)

Our Food Frequency Questionnaire has columns for the Sample ID, the Respondent ID number, and every question asked. The column headers are abbreviations of the questions. Now let's look at the dimensions of our data.

In [None]:
# Full dimensions
dim(ffq)
# Number of rows
nrow(ffq)
# Number of columns
ncol(ffq)

We can see that we have 100 participants (every row is a sample) and 1074 columns, which means over 1000 food questions! Now let's look at the lifestyle survey.

In [None]:
# Display the first few rows of the dataframe to understand its structure
head(lsq)

We can see that this table is laid out differently: the questions are in rows and the participants are in columns. Each column whose name starts with "WP_" represents one of our sample participants. Let's look at the dimensions.

In [None]:
# Full dimensions
dim(lsq)
# Number of rows
nrow(lsq)
# Number of columns
ncol(lsq)

We have fewer questions (rows) in this survey.
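
Because the two surveys are oriented differently, it can help to reshape the lifestyle table so that each participant is a row, as in the FFQ. The following is a minimal sketch on toy data; the column names `Question` and `AnswerKey` and all values are hypothetical stand-ins, not the real headers of `lsq`.

```r
# Toy stand-in for lsq; "Question" and "AnswerKey" are hypothetical column names
lsq_toy <- data.frame(
  Question  = c("SMOKE", "EXERCISE"),
  AnswerKey = c("1=Yes;2=No", "1=Low;2=High"),
  WP_001    = c(2, 1),
  WP_002    = c(1, 2),
  stringsAsFactors = FALSE
)

# Keep only the participant columns, then transpose so each row is a participant
participant_cols <- grep("^WP_", colnames(lsq_toy), value = TRUE)
lsq_wide <- as.data.frame(t(lsq_toy[, participant_cols]))

# Label the new columns with the question names and keep the IDs as a column
colnames(lsq_wide) <- lsq_toy$Question
lsq_wide$SampleID <- rownames(lsq_wide)

print(lsq_wide)
```

Note that `t()` converts the data to a matrix first, so if the real responses mix numbers and text, every value will come back as a character string and may need converting afterwards.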

### Step 2: Quick Quality Control Checks
Let's perform several initial checks to assess data quality:

 1. **Collect Intersection of Samples:** We need to find the Sample IDs that appear both in the FFQ data (as rows) and in the LSQ data (as columns).


In [None]:
# Participant IDs in lsq are the column names after the first two (non-participant) columns
lsq.samples <- colnames(lsq)[-(1:2)]

# Find the sample IDs present in both surveys
idx.samples <- intersect(ffq$SampleID, lsq.samples)
idx.samples
length(idx.samples)


We have 96 samples that appear in both surveys.
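
Once the shared IDs are known, a natural next step is to see which samples are missing from each survey and to restrict both tables to the shared set. Here is a minimal sketch on toy data; the data frames, column names other than `SampleID`, and values are all hypothetical.

```r
# Toy stand-ins for ffq and lsq (hypothetical values)
ffq_toy <- data.frame(
  SampleID = c("WP_001", "WP_002", "WP_003"),
  AGE      = c(34, 51, 28)
)
lsq_toy <- data.frame(
  Question  = "SMOKE",
  AnswerKey = "1=Yes;2=No",
  WP_002    = 1,
  WP_003    = 2,
  WP_004    = 2
)

# Participant IDs are the lsq column names after the first two columns
lsq_ids <- colnames(lsq_toy)[-(1:2)]
shared  <- intersect(ffq_toy$SampleID, lsq_ids)

# Samples that appear in only one of the two surveys
only_ffq <- setdiff(ffq_toy$SampleID, lsq_ids)
only_lsq <- setdiff(lsq_ids, ffq_toy$SampleID)

# Restrict both tables to the shared samples
ffq_shared <- ffq_toy[ffq_toy$SampleID %in% shared, ]
lsq_shared <- lsq_toy[, c(colnames(lsq_toy)[1:2], shared)]

print(only_ffq)
print(only_lsq)
```

Listing the unmatched IDs with `setdiff()` makes it easier to tell whether samples were dropped because of genuine missing data or because of an ID formatting mismatch.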

 2. **Basic Summary Statistics:** We can get an overview of the data, which can help identify outliers or unexpected values. We will only be looking at the demographic questions in the FFQ.

In [None]:
# Summary statistics for the first 9 columns
summary(ffq[, 1:9])
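
Alongside `summary()`, two other quick checks are often useful at this stage: counting missing values per column and looking for duplicated sample IDs. The sketch below uses a small toy table with hypothetical values; the same calls could be applied to `ffq[, 1:9]`.

```r
# Toy stand-in for the demographic columns of ffq (hypothetical values)
ffq_toy <- data.frame(
  SampleID = c("WP_001", "WP_002", "WP_002", "WP_004"),
  SEX      = c(1, 2, 2, NA),
  AGE      = c(34, 51, 51, 28)
)

# Number of missing values in each column
na_counts <- colSums(is.na(ffq_toy))
print(na_counts)

# Sample IDs that occur more than once
dup_ids <- ffq_toy$SampleID[duplicated(ffq_toy$SampleID)]
print(dup_ids)
```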


3. **Visualize Data:** Ensure the data looks accurate with exploratory visualizations.

In [None]:
# Load necessary library
library(ggplot2)

# Set up the plotting theme
theme_set(theme_minimal())

# 1. Bar plot for Sex distribution (SEX is relabeled for display only; ffq is not modified)
ggplot(ffq, aes(x = factor(SEX, levels = c(1, 2), labels = c("Male", "Female")))) +
  geom_bar(fill = "#0073C2FF") +  # Blue color for bars
  labs(title = "Distribution of Sex", x = "Sex", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
    axis.text = element_text(size = 18))

# 2. Histogram for Age distribution
# na.rm = TRUE silently drops NA values for AGE
ggplot(ffq, aes(x = AGE)) +
  geom_histogram(bins = 10, fill = "lightgreen", color = "black", na.rm = TRUE) +
  labs(title = "Age Distribution", x = "Age", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
    axis.text = element_text(size = 18))

# 3. Histogram for Weight distribution
# na.rm = TRUE silently drops NA values for WEIGHT
ggplot(ffq, aes(x = WEIGHT)) +
  geom_histogram(bins = 15, fill = "lightcoral", color = "black", na.rm = TRUE) +
  labs(title = "Weight Distribution", x = "Weight", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5), axis.title = element_text(size = 20),
    axis.text = element_text(size = 18))

---------------------------------------------------

## Conclusion
Provide an overview of the lessons and skills learned from the module.

## Clean up
Remember to shut down your VM and delete any relevant resources.
