# Notebook 1.1 for Systems Biology of Aging
This notebook is part of the first session of the 2024 Systems Biology of Aging Workshop. Don't worry about setup: all of the files and R packages should already be installed in the Sagemaker environment.

This Notebook will use the "R" kernel. Double check that this is correct by looking in the top right corner.

> Outline
> * Data exploration and cleaning
>   * Dispersion of frailty measures
>   * Create baseline proteins, metabolites and clinical
>   * Dimensionality reduction
>   * Scale and impute for PCA
> * Single-Omic WGCNA
>   * Correlate with frailty
>   * Enrichment of select modules
> * Single-Omics DE
>   * DE of Proteins and Metabolites quintiles
>   * Volcano plots
>   * Enrichment

## Setup
The next couple of code blocks will load the R packages into our environment and set some options for nicer visualizations.

In [None]:
# Load packages, one per line for clarity
suppressMessages(library("tidyverse", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("ggplot2", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("WGCNA", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("org.Hs.eg.db", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("clusterProfiler", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("limma", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("DT", quietly = TRUE, warn.conflicts=FALSE))
suppressMessages(library("ggpubr", quietly = TRUE, warn.conflicts=FALSE))
# Other options
source("Scripts/Workshop_scripts.R") # Functions for plotting
options(stringsAsFactors=FALSE)#Required for WGCNA
#enableWGCNAThreads(nThreads=2) # 
options(repr.plot.width=7, repr.plot.height=10)#Default=7x7
options(repr.matrix.max.rows=75, repr.matrix.max.cols=20)
options(warn=-1)

# Let's Get Started: Frailty Measures

Now that our environment is ready, let's start digging into our data. In your Sagemaker environment, you should see folders with tsv files for each of the omics and outcomes. We'll go over all of these during the analysis, but let's start by looking at our outcome: the frailty indices.

In [None]:
fr_measures = read_delim("../data/frailty/combination_fi_040124.csv")
dim(fr_measures)
head(fr_measures)
colnames(fr_measures)

In [None]:
fr_measures_key = read_delim("../Useful_Files/FI_Features.txt", show_col_types = FALSE, delim="\t")
fr_measures_key

When were the participants assessed?

In [None]:
hist(fr_measures$days_in_program, main = "Histogram of days in program",xlab = 'Number of Days')

Next, let's look at the distribution of each frailty measure: lab, self-assessment, and merged.

In [None]:
options(repr.plot.width=10, repr.plot.height=7)#Default=7x7
hist(fr_measures$lab_fi,  xlim=c(0,.7), col=rgb(0,0,1,2/4), breaks=25, main = "Histogram of Frailty Measures",xlab = 'Index Value') # Blue
hist(fr_measures$self_fi, add=T, xlim=c(0,.7), col=rgb(0,1,0,2/4), breaks=25) # Green
hist(fr_measures$merge_fi, add=T, xlim=c(0,.7), col=rgb(1,0,0,2/4), breaks=25) # Red

legend("topright",  c("Lab_FI", "Self_FI", "Merged_FI"), lwd=4, col=c("blue","green", "red"))

Sanity check: frailty, in general, should increase with age. Is this what we see?

In [None]:
p1 <- ggscatter(fr_measures, x = "age", y = "merge_fi",
          add = "reg.line",                                 
          conf.int = TRUE,                                
          add.params = list(color = "blue",
                            fill = "dimgrey")
          )+
  stat_cor(method = "pearson")
facet(p1, facet.by = "sex")

# Data Cleaning

In [None]:
# Load the data
# Proteins
meta_protein = read_delim("../data/arivale_snapshot_ISB_2019-05-10_0053/proteomics_metadata.tsv", skip=13, delim='\t', show_col_types = FALSE)
proteins = read_delim("../data/arivale_snapshot_ISB_2019-05-10_0053/proteomics_corrected.tsv", skip=13, delim='\t', show_col_types = FALSE)

# Metabolites
meta_metabolites = read_delim("../data/arivale_snapshot_ISB_2019-05-10_0053/metabolomics_metadata.tsv", skip=13, delim='\t', show_col_types = FALSE)
metabolites = read_delim("../data/arivale_snapshot_ISB_2019-05-10_0053/metabolomics_corrected.tsv", skip=13, delim='\t', show_col_types = FALSE)

# Clinical labs
labs = read_delim("../data/arivale_snapshot_ISB_2019-05-10_0053/chemistries.tsv", skip=13, delim='\t', show_col_types = FALSE)


#Add Biochemical name to metabolite dataframe
new_names <- sapply(colnames(metabolites), function(x) {
  if (x %in% meta_metabolites$CHEMICAL_ID) {newrow <- meta_metabolites[meta_metabolites$CHEMICAL_ID %in% x,] 
                                            paste0(x, "(", newrow[,'BIOCHEMICAL_NAME'], ")")
  } else {x}
})
colnames(metabolites) <- new_names
#Add gene name to protein dataframe
new_names <- sapply(colnames(proteins), function(x) {
  if (x %in% meta_protein$name) {newrow <- meta_protein[meta_protein$name %in% x,] 
                                 paste0(x, "(", newrow[,'gene_name'], ")")
  } else {x}
})
colnames(proteins) <- new_names

## Metabolomics
Let's look at the data!

In [None]:
#Load the data
print("Metabolite metadata")
head(meta_metabolites)
print("Metabolites")
head(metabolites)
tail(meta_metabolites)

### Drop metabolites without annotations

In [None]:
print("Metabolites")
print(str_c("- columns: ", ncol(metabolites)))
metabolites <- metabolites[,!grepl("X -", colnames(metabolites))]
print("Metabolites after filtering")
print(str_c("- columns: ", ncol(metabolites)))

### Get baseline metabolites

Because these participants underwent wellness coaching, which may change their measurements over time, we use only the baseline draw when looking at associations with frailty indices.
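
To make the baseline rule concrete, here is a minimal base R sketch on a made-up table (the participants and values are hypothetical; only the column names and the 75-day window come from this notebook). It keeps each participant's earliest record and then drops participants whose earliest record falls outside the window.

```r
# Toy data: hypothetical participants with repeated blood draws
draws <- data.frame(
  public_client_id      = c("A", "A", "B", "B", "C"),
  days_since_first_draw = c(0, 180, 90, 200, 10),
  value                 = c(1.2, 1.5, 0.8, 0.9, 2.1)
)

# Sort by draw day, then keep the first row seen for each participant
draws_sorted <- draws[order(draws$days_since_first_draw), ]
baseline <- draws_sorted[!duplicated(draws_sorted$public_client_id), ]

# Keep only baselines taken within the window (75 days, as in the notebook)
baseline <- baseline[baseline$days_since_first_draw < 75, ]
baseline
```

In this toy example participant B is dropped entirely, because their earliest record for this data type was taken on day 90.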

In [None]:
barplot(table(table(metabolites$public_client_id)), main = "Records per Participant", xlab = 'Number of Records per Participant', ylab = 'Number of Participants')

In [None]:
hist(metabolites$days_since_first_draw, main = "Days Since Draw",xlab = 'Day')

In [None]:
print("Participants metabolite records")
print(str_c("- rows: ", nrow(metabolites)))
# Keep each participant's earliest record, then require it to fall within 75 days
metabolites <- metabolites %>% 
    group_by(public_client_id) %>%
    arrange(days_since_first_draw) %>%
    filter(row_number()==1) %>%
    filter(days_since_first_draw < 75)

print("Participants metabolite records after filtering")
print(str_c("- rows: ", nrow(metabolites)))

### Evaluate Missingness

In [None]:
mets_missing <- colSums(is.na(metabolites))/nrow(metabolites)
hist(mets_missing, main = "Histogram of Missingness",xlab = 'Fraction NA')

In [None]:
mets_missing[order(mets_missing, decreasing = TRUE)[1:10]]

### Impute and Scale

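When dealing with metabolites, it's important to consider the method of imputation! For example, imputing with the mean or median may produce implausible values for xenobiotics, which are often missing because they are absent or below the detection limit. More sophisticated imputation methods, such as random forest, can be computationally costly. Here we drop metabolites that are missing in more than half of the participants and impute the rest with k-nearest neighbors (KNN).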
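
As a rough illustration of why the choice matters, the sketch below compares median imputation with half-minimum imputation on a toy matrix. The data and helper functions are hypothetical, and half-minimum is shown only as one common convention for values assumed to be below detection; it is not the method used in this notebook.

```r
# Toy metabolite matrix (rows = participants, columns = hypothetical metabolites)
mets <- cbind(
  endogenous = c(10, 12, NA, 11, 9),
  xenobiotic = c(NA, NA, 8, NA, 6)   # mostly undetected
)

# Median imputation: replace each NA with the column median
impute_median <- function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }

# Half-minimum imputation: replace each NA with half the smallest observed value
impute_halfmin <- function(x) { x[is.na(x)] <- min(x, na.rm = TRUE) / 2; x }

med  <- apply(mets, 2, impute_median)
half <- apply(mets, 2, impute_halfmin)

# Compare the column means produced by the two strategies
colMeans(med)
colMeans(half)
```

For the mostly undetected column, median imputation assigns every missing participant a typical detected level, while half-minimum keeps them below the observed range.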

In [None]:
metabolites_filter <- metabolites[,c(9:ncol(metabolites))] # Drop non metabolite values
metabolites_filter <- metabolites_filter[, colMeans(is.na(metabolites_filter)) <= .5]
metabolites_filter_impute <- as_tibble(impute::impute.knn(as.matrix(metabolites_filter))$data)
head(metabolites_filter_impute)

### PCA

In [None]:
PCA_mets_all <- prcomp(metabolites_filter_impute, center=TRUE, scale.=TRUE)
PCA_mets <- cbind(metabolites[,c(1:8)], PCA_mets_all$x[,c(1:4)])
PCA_mets <- merge(PCA_mets, fr_measures, by="public_client_id")

In [None]:
percentVar <- PCA_mets_all$sdev^2 / sum(PCA_mets_all$sdev^2 )

p1 <- ggplot(PCA_mets, aes(x=PC1, y=PC2, color=sex)) +
    geom_point(size=3) + scale_color_brewer(palette="Set1") +
    xlab(paste0("PC1: ",round(percentVar[1] * 100),"% variance")) +
    ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) +
     theme_minimal() +
    coord_fixed()
p2 <- ggplot(PCA_mets, aes(x=PC3, y=PC4, color=sex)) +
    geom_point(size=3) + scale_color_brewer(palette="Set1") +
    xlab(paste0("PC3: ",round(percentVar[3] * 100),"% variance")) +
    ylab(paste0("PC4: ",round(percentVar[4] * 100),"% variance")) +
     theme_minimal() +
    coord_fixed()
options(repr.plot.width=20, repr.plot.height=10)#Default=7x7
ggarrange(p1, p2,  common.legend = TRUE)


In [None]:
options(repr.plot.width=10, repr.plot.height=7)#Default=7x7
ggplot(PCA_mets, aes(x=PC1, y=PC2, color=merge_fi)) +
    geom_point(size=2) + scale_color_gradient(low = "yellow", high = "darkblue") +
    xlab(paste0("PC1: ",round(percentVar[1] * 100),"% variance")) +
    ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) +
    theme_minimal() +
    coord_fixed()



We'll leave it there for now, but will come back to PCA in session three.

## Proteins

Proteins follow the same analysis pattern as metabolites, so we won't go over each step in detail.

In [None]:
print("Participants protein records")
print(str_c("- rows: ", nrow(proteins)))
# Keep each participant's earliest record, then require it to fall within 75 days
proteins <- proteins %>% 
    group_by(public_client_id) %>%
    arrange(days_since_first_draw) %>%
    filter(row_number()==1) %>%
    filter(days_since_first_draw < 75)

print("Participants protein records after filtering")
print(str_c("- rows: ", nrow(proteins)))

## *Exercise - Proteins PCA*
What, if anything, needs to be considered when imputing and filtering proteins?
Try three imputation methods: zero, mean, and KNN. Does the choice change the PCA?

## Clinical Labs

In [None]:
print("Participants clinical labs records")
print(str_c("- rows: ", nrow(labs)))
# Keep each participant's earliest record, then require it to fall within 75 days
labs <- labs %>% 
    group_by(public_client_id) %>%
    arrange(days_since_first_draw) %>%
    filter(row_number()==1) %>%
    filter(days_since_first_draw < 75)

print("Participants clinical labs records after filtering")
print(str_c("- rows: ", nrow(labs)))

In [None]:
colnames(labs)

In [None]:
# Save these so the analysis below can be re-run without repeating the code above.
dir.create("./Session1_files/", showWarnings = FALSE)
write_delim(metabolites[,c(1,9:ncol(metabolites))], "./Session1_files/metabolites_baseline.tsv", delim="\t")
write_delim(proteins[,c(1,22:ncol(proteins))], "./Session1_files/proteins_baseline.tsv", delim="\t")
write_delim(labs[,c(1,13:ncol(labs))], "./Session1_files/chemistries_baseline.tsv", delim="\t")

# WGCNA Proteins

## Prepare data

In [None]:
dataset_label <- "Proteomics"#For label purpose; Fix throughout this notebook
omics <- proteins[,c(1,22:ncol(proteins))]

#Select the participants having both data
##Analyte data
data_df <- omics %>%
    dplyr::filter(public_client_id %in% fr_measures$public_client_id) %>%
    dplyr::arrange(public_client_id) %>%#Sort row order
    #Transform tibble for easily applying to the WGCNA functions
    column_to_rownames(var="public_client_id")
data_df <- data_df[, order(colnames(data_df))]#Sort column order
print("Data")
print(str_c("- nrow: ", nrow(data_df)))
head(data_df)
##Sample metadata
sample_tbl <- fr_measures %>%
    dplyr::filter(public_client_id %in% omics$public_client_id) %>%
    dplyr::arrange(public_client_id)#Sort row order
print("Sample metadata")
print(str_c("- nrow: ", nrow(sample_tbl)))
head(sample_tbl)

# Prepare metadata
analyte_tbl <- meta_protein %>%
    #Prepare the same analyte IDs within the data table
    dplyr::mutate(AnalyteID=str_c(name,"(",gene_name,")"), Dataset=dataset_label) %>%
    #Clean
    dplyr::rename(AnalyteID_original=name, PanelID=panel, UniProtID=uniprot, GeneSymbol=gene_name) %>%
    dplyr::select(AnalyteID, Dataset, AnalyteID_original, PanelID, UniProtID, GeneSymbol)

#Filter analytes within the data table
analyte_tbl <- analyte_tbl %>%
    dplyr::filter(AnalyteID %in% colnames(data_df)) %>%
    dplyr::arrange(AnalyteID)#Sort row order
print("Analyte metadata")
print(str_c("- nrow: ", nrow(analyte_tbl)))
head(analyte_tbl)

## Missingness

WGCNA provides a function to filter for missingness. This filters both columns (features) and rows (participants).
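
A simplified, single-pass view of this kind of filter is sketched below in base R on made-up data. It only applies a 50% missingness threshold to columns and rows; the WGCNA function performs additional checks, so treat this as an approximation of the idea rather than a reimplementation.

```r
set.seed(1)
# Toy matrix: 5 hypothetical samples x 8 hypothetical proteins
toy <- matrix(rnorm(40), nrow = 5,
              dimnames = list(paste0("s", 1:5), paste0("p", 1:8)))
toy[1:4, "p3"] <- NA   # a protein missing in most samples
toy["s5", 1:6] <- NA   # a sample missing most proteins

# Fraction of missing values per protein (column) and per sample (row)
keep_cols <- colMeans(is.na(toy)) <= 0.5
keep_rows <- rowMeans(is.na(toy)) <= 0.5

filtered <- toy[keep_rows, keep_cols]
dim(filtered)
```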

In [None]:
#Filter samples and features based on the default WGCNA NA criteria (50%)
gsg = goodSamplesGenes(data_df, verbose = 3);
gsg$allOK
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0) 
     printFlush(paste("Removing genes:", paste(names(data_df)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0) 
     printFlush(paste("Removing samples:", paste(rownames(data_df)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  data_df = data_df[gsg$goodSamples, gsg$goodGenes]
}

print("After the filter:")
print(dim(data_df))

#Filter metadata to match columns in filtered data frame
sample_tbl <- sample_tbl %>%
    dplyr::filter(public_client_id %in% rownames(data_df))
print("Sample metadata after the filter")
print(str_c("- nrow: ", nrow(sample_tbl)))

analyte_tbl <- analyte_tbl %>%
    dplyr::filter(AnalyteID %in% colnames(data_df))
print("Analyte metadata after the filter")
print(str_c("- nrow: ", nrow(analyte_tbl)))

Why are so many proteins filtered out? The majority of Arivale participants were run with 3 Olink panels; however, a small subset were run with a larger number of panels. Since we need to remove these proteins, we will update the metadata for easier handling.

## Network construction

### Choose the soft-thresholding power

In [None]:
#Choose a set of soft-thresholding powers
powers <- c(c(1:10), seq(from=11, to=15, by=1))
cutoff <- 0.8

#Call the network topology analysis function
sft <- pickSoftThreshold(data_df, powerVector=powers, verbose=5,
                         corOptions=list(use="p", method="spearman"), networkType="signed")

#Plot the results
options(repr.plot.width=9, repr.plot.height=5)
par(mfrow=c(1,2))
cex1 <- 0.8
##Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)", ylab="Scale Free Topology Model Fit, signed R^2", type="n",
     main=paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers, cex=cex1, col="black")
##Red line marks the R^2 cut-off h=cutoff; note that sft$powerEstimate uses pickSoftThreshold's RsquaredCut (default 0.85)
abline(h=cutoff, col="red")
##Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)", ylab="Mean Connectivity", type="n",
     main=paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1, col="black")

print(str_c("Estimated soft-thresholding power: ", sft$powerEstimate))

### Co-expression similarity and adjacency

> Calculate correlation network adjacency.  
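
For intuition, the sketch below builds a signed adjacency by hand on toy data: the correlation is shifted from [-1, 1] to [0, 1] and raised to the soft-thresholding power. The data and the power of 6 are arbitrary choices for illustration, not values from this analysis.

```r
set.seed(42)
# Toy data: 30 samples x 4 hypothetical analytes; x2 tracks x1, x3 opposes x1
x1 <- rnorm(30)
toy <- cbind(x1 = x1,
             x2 = x1 + rnorm(30, sd = 0.3),
             x3 = -x1 + rnorm(30, sd = 0.3),
             x4 = rnorm(30))

power <- 6  # arbitrary power for illustration
r <- cor(toy, use = "pairwise.complete.obs", method = "spearman")

# Signed network: map correlation from [-1, 1] to [0, 1], then raise to the power
adj_signed <- ((1 + r) / 2)^power
round(adj_signed, 3)
```

In a signed network, strongly anti-correlated pairs such as x1 and x3 end up with an adjacency near zero, so they are not placed in the same module.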

In [None]:
#Choose the power that best approximates a scale free topology while still maintaining high level of connectivity in the network
softPower <- sft$powerEstimate
print(softPower)
#Generate the adjacency matrix using the chosen soft-thresholding power
adjacency <- adjacency(data_df, power=softPower,
                       corOptions=list(use="p", method="spearman"), type="signed")

print(str_c("nrow: ", nrow(adjacency)))
head(adjacency)

### Topological Overlap Matrix (TOM)

> To minimize effects of noise and spurious associations, the adjacency is transformed into the topological overlap measure, and the corresponding dissimilarity is calculated.  
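
The topological overlap between two nodes combines their direct connection with the neighbors they share. The sketch below computes the basic (unsigned) form of the measure on a small, made-up adjacency matrix; `TOMsimilarity` has further options that this sketch does not try to reproduce.

```r
# Toy adjacency matrix for 4 hypothetical analytes (symmetric, values in [0, 1])
a <- matrix(c(1.0, 0.8, 0.6, 0.1,
              0.8, 1.0, 0.7, 0.1,
              0.6, 0.7, 1.0, 0.2,
              0.1, 0.1, 0.2, 1.0), nrow = 4, byrow = TRUE)
diag(a) <- 0                      # self-connections are excluded

k <- rowSums(a)                   # connectivity of each node
shared <- a %*% a                 # sum over u of a[i,u] * a[u,j]; the zero diagonal removes u = i and u = j
tom <- (shared + a) / (outer(k, k, pmin) + 1 - a)
diag(tom) <- 1

# Dissimilarity used for clustering
diss_tom <- 1 - tom
round(tom, 3)
```

Analytes 1 and 2 are directly connected and share analyte 3 as a strong neighbor, so their overlap is high; analyte 4 has weak connections to everything and a low overlap with all others.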

In [None]:
#Turn adjacency into topological overlap
##You can input whatever matrix you want here!
TOM <- TOMsimilarity(adjacency, TOMType="signed")

#Turn into distance matrix
dissTOM <- 1 - TOM

print(str_c("nrow: ", nrow(dissTOM)))
head(dissTOM)

## Module detection

### Hierarchical clustering using TOM dissimilarity

> Cluster the TOM distance matrix to find modules. You can call whatever clustering method you want here.  

In [None]:
#Call the hierarchical clustering function
geneTree <- hclust(as.dist(dissTOM), method="average")

#Plot the resulting clustering tree (dendrogram)
options(repr.plot.width=12, repr.plot.height=6)
plot(geneTree, xlab="", sub="", main="Gene clustering on TOM-based dissimilarity",
     labels=FALSE, hang=0.04)

> In the clustering tree (dendrogram), each leaf corresponds to a gene. Branches of the dendrogram group together densely interconnected, highly co-expressed genes. Module identification amounts to the identification of individual branches ("cutting the branches off the dendrogram"). There are several methods for branch cutting; the standard method of the WGCNA package is Dynamic Tree Cut from the package dynamicTreeCut.  

In [None]:
#Larger modules can be easier to interpret, so we set the minimum module size relatively high
minModuleSize <- max(c(15, round(ncol(data_df)/200, digits=0)))
print(str_c("minClusterSize = ", minModuleSize))

#Module identification using dynamic tree cut
dynamicMods <- cutreeDynamic(dendro=geneTree, distM=dissTOM,
                             deepSplit=4, pamStage=TRUE, pamRespectsDendro=FALSE,
                             minClusterSize=minModuleSize)
table(dynamicMods)

> The above list shows each module size. Label 0 is reserved for unassigned genes.  

In [None]:
#Convert numeric labels into colors
dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)

#Plot with TOM dissimilarity matrix
options(repr.plot.width=7, repr.plot.height=7)
TOMplot(dissTOM^softPower,#For better visualization
        geneTree, as.character(dynamicColors), main="Network heatmap plot, all genes")

### Merge similar modules based on eigengenes

> Dynamic Tree Cut may identify modules whose expression profiles are very similar. It would be prudent to merge such modules since their genes are highly co-expressed. To quantify co-expression similarity of entire modules, their eigengenes are calculated and clustered based on their correlation.  
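
A module eigengene can be thought of as the first principal component of the module's standardized profiles across samples. The base R sketch below illustrates that idea on simulated data (all names and values are hypothetical); it is a conceptual illustration, not a substitute for `moduleEigengenes`.

```r
set.seed(7)
# Toy module: 20 samples x 5 hypothetical analytes sharing one latent signal
latent <- rnorm(20)
module_expr <- sapply(1:5, function(i) latent + rnorm(20, sd = 0.5))
colnames(module_expr) <- paste0("analyte", 1:5)

# Standardize each analyte, then take the first left singular vector across samples
z <- scale(module_expr)
eigengene <- svd(z)$u[, 1]

# The sign of a singular vector is arbitrary; align it with the average profile
if (cor(eigengene, rowMeans(z)) < 0) eigengene <- -eigengene

# The eigengene should track the shared latent signal closely
cor(eigengene, latent)
```

Two modules whose eigengenes are highly correlated carry nearly the same signal, which is the motivation for merging them.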

In [None]:
#Calculate eigengenes
MEList <- moduleEigengenes(data_df, colors=dynamicColors, impute=TRUE, nPC=2)
MEs <- MEList$eigengenes
print(str_c("nrow: ", nrow(MEs)))
head(MEs)

#Calculate dissimilarity of module eigengenes
MEDiss <- 1 - cor(MEs, use="pairwise.complete.obs")

#Cluster module eigengenes
METree <- hclust(as.dist(MEDiss), method="average")

#Plot the result
options(repr.plot.width=10, repr.plot.height=5)
plot(METree, main="Clustering of module eigengenes",
     xlab="", sub="")
MEDissThres <- 0.2
abline(h=MEDissThres, col="red")

In [None]:
#Call an automatic merging function
merge <- mergeCloseModules(data_df, dynamicColors, cutHeight=MEDissThres, verbose=0)

#Eigengenes of the new merged modules
mergedMEs <- merge$newMEs

#The merged module colors
mergedColors <- merge$colors
table(mergedColors)

#Plot the dendrogram and module colors
options(repr.plot.width=12, repr.plot.height=6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels=FALSE, hang=0.03,
                    addGuide=TRUE, guideHang=0.05,
                    main="Gene dendrogram and module colors")

In [None]:
#Rename
moduleColors <- mergedColors
MEs <- mergedMEs

#Prepare the module assignment table
module_tbl <- tibble(AnalyteID=colnames(data_df),
                     ModuleID=str_to_title(moduleColors)) %>%
    dplyr::left_join(analyte_tbl, ., by="AnalyteID")
print("Module assignment table (temp)")
print(str_c("- nrow: ", nrow(module_tbl)))
head(module_tbl)

#Clean the module eigengene table
eigengene_df <- MEs %>%
    rownames_to_column(var="public_client_id")
names(eigengene_df)[2:ncol(eigengene_df)] <- names(eigengene_df)[2:ncol(eigengene_df)] %>%
    str_replace(., "^ME", "") %>%
    str_to_title(.)
print("Module eigengene table")
print(str_c("- nrow: ", nrow(eigengene_df)))
head(eigengene_df)

## Relationship between each phenotype and module

### Clean phenotype data

In [None]:
#Check phenotypes
print("Sample metadata")
print(str_c("- nrow: ", nrow(sample_tbl)))
print("- Contingency of sex")
table(sample_tbl$sex)
print("- Contingency of race")
table(sample_tbl$race)
print(colnames(sample_tbl))

In [None]:
#Code sex and race
phenotype_tbl <- sample_tbl %>%
    dplyr::mutate(BinarySex=ifelse(sex=="F", 0, 1),
                  BinaryRace=ifelse(race=="white", 0, 1)) %>%
    dplyr::mutate(BinaryRace=tidyr::replace_na(.$BinaryRace, 1)) %>%#Due to the existence of NA
    dplyr::select(public_client_id, BinarySex, BinaryRace, age, self_fi, lab_fi, merge_fi) %>%
    #Transform tibble for easily applying to the WGCNA functions
    column_to_rownames(var="public_client_id")

#Check phenotypes
print("Sample metadata")
print(str_c("- nrow: ", nrow(phenotype_tbl)))
print("- Contingency of BinarySex")
table(phenotype_tbl$BinarySex)
print("- Contingency of BinaryRace")
table(phenotype_tbl$BinaryRace)

### Module–trait relationship

> For each module and each phenotype, a quantitative measure of module–trait relationship (MTR) is defined as the correlation between the module eigengene and the phenotype profile.  
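
The p-value attached to each correlation comes from a Student t-test on the correlation coefficient. The sketch below works through that calculation on simulated values (hypothetical eigengene and trait) and checks it against base R's `cor.test`; to my understanding this is the same test that `corPvalueStudent` applies.

```r
set.seed(3)
n <- 50
eigengene <- rnorm(n)                       # hypothetical module eigengene
trait     <- 0.4 * eigengene + rnorm(n)     # hypothetical phenotype

r <- cor(eigengene, trait)

# Student t-test for a Pearson correlation with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Base R's cor.test() uses the same test
p_cortest <- cor.test(eigengene, trait)$p.value
c(r = r, p_manual = p_manual, p_cortest = p_cortest)
```

Note that the test depends on the number of samples actually used in the correlation, so missing phenotype values would change the appropriate `n`.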

In [None]:
#Calculate the numbers of modules and samples
nSamples <- nrow(phenotype_tbl)
modNames = substring(names(MEs), 3)

##Check ID order before the cor() function
print(str_c("Matched IDs?: ", all(rownames(MEs)==rownames(phenotype_tbl))))

#Calculate module–trait relationship
moduleTraitCor <- as.data.frame(cor(MEs, phenotype_tbl, use="p")) #Pearson correlation
rownames(moduleTraitCor) <- str_to_title(modNames)

#Calculate statistical significance of module–trait relationship
MTRpval <- as.data.frame(corPvalueStudent(as.matrix(moduleTraitCor), nSamples)) # Pvalue
rownames(MTRpval) <- str_to_title(modNames)

#Eliminate the dummy module (Grey)
moduleTraitCor <- moduleTraitCor[rownames(moduleTraitCor)!="Grey",]
MTRpval <- MTRpval[rownames(MTRpval)!="Grey",]

#P-value adjustment across modules (per trait) using Benjamini–Hochberg method
MTRpval_adj <- as.data.frame(apply(MTRpval, 2, function(x){p.adjust(x, method="BH")}))

#Prepare text labels as matrix
textMatrix <- paste("r = ",signif(as.matrix(moduleTraitCor), 3),"\n(P = ",
                    signif(as.matrix(MTRpval_adj), 2),")", sep="")
dim(textMatrix) <- dim(moduleTraitCor)
#Revert module names back to apply color conversion
temp_c <- rownames(moduleTraitCor) %>%
    str_to_lower(.) %>%
    str_c("ME",.)

#Visualize
options(repr.plot.width=10, repr.plot.height=7)
par(mar=c(5, 5, 3, 2))
labeledHeatmap(Matrix=moduleTraitCor,
               xLabels=colnames(moduleTraitCor),
               yLabels=temp_c,
               #ySymbols=rownames(moduleTraitCor),
               colorLabels=FALSE,
               colors=blueWhiteRed(50),
               textMatrix=textMatrix,
               setStdMargins=FALSE,
               cex.text=1,
               zlim=c(-1,1),
               main=paste("Module–trait relationships"))

> Because Grey is not a module but the set of unassigned analytes, it is removed before testing.  

## *Exercise - WGCNA Race*
Are any modules related to race? We binary-coded race (white vs. all others) for this example code. This is not ideal, because the non-white category combines very different groups. Try the analysis again with better racial categories. 

### Regression analysis

The analysis above looks at the correlation between the module eigengene and our outcomes. What if we want to control for covariates? Let's look at a regression analysis and compare. 
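
To see why adjustment can matter, the sketch below simulates a hypothetical eigengene and frailty index that are related only through age. The variable names and effect sizes are made up for illustration. The unadjusted model shows an association that largely disappears once age is included.

```r
set.seed(11)
n <- 2000
age <- rnorm(n)
# Hypothetical frailty index and module eigengene that both depend on age
fi <- 0.7 * age + rnorm(n)
me <- 0.5 * age + rnorm(n)   # no direct dependence on fi

unadjusted <- lm(me ~ fi)
adjusted   <- lm(me ~ fi + age)

# Compare the frailty coefficient with and without the covariate
coef(summary(unadjusted))["fi", c("Estimate", "Pr(>|t|)")]
coef(summary(adjusted))["fi", c("Estimate", "Pr(>|t|)")]
```

A module that correlates with frailty in the heatmap may therefore be reflecting age or sex, which is what the regression with covariates helps to check.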

In [None]:
#Processing data
temp_df <- phenotype_tbl %>%
    #Standardize continuous variables
    dplyr::mutate(age=c(scale(age, center=TRUE, scale=TRUE)),
                  self_fi=c(scale(self_fi, center=TRUE, scale=TRUE)),
                  lab_fi=c(scale(lab_fi, center=TRUE, scale=TRUE)),
                  merge_fi=c(scale(merge_fi, center=TRUE, scale=TRUE))) %>%
    #Encode categorical variables -> Already done
    #Clean
    rownames_to_column(var="public_client_id")
temp_df <- eigengene_df %>%
    dplyr::select(-Grey) %>%#Remove the dummy module
    dplyr::left_join(., temp_df, by="public_client_id")

head(temp_df)

In [None]:
#Regression analysis per module and per age/FI
temp_c <- c("age", "BinarySex", "BinaryRace")#Covariates
module = 'Brown'
xvar <- 'merge_fi'

#OLS regression
model <- lm(formula=str_c(module," ~ ",xvar," + ",str_flatten(temp_c, collapse=' + ')), data=temp_df)
print(str_c(module," ~ ",xvar," + ",str_flatten(temp_c, collapse=' + ')))
print(summary(model))
#Visualize the result
##Calculate the covariate-adjusted module eigengene
model_covar <- lm(formula=str_c(module," ~ ",str_flatten(temp_c, collapse=' + ')), data=temp_df)
plot_df <- temp_df %>%
    dplyr::mutate(AdjME=mean(temp_df[[module]])+model_covar$residuals)
##Prepare model result texts
pval_text <- str_c("P = ",scales::scientific(summary(model)$coefficients[,4][xvar], digits=2))
##Plot correlation
p <- ggplot(data=plot_df) +
    geom_point(aes(x=!!as.name(xvar), y=AdjME)) +
    geom_smooth(aes(x=!!as.name(xvar), y=AdjME), method="lm", formula="y~x", se=TRUE) +
    annotate("text", x=Inf, y=Inf, hjust=1, vjust=1, label=pval_text) +
    labs(x=str_c(xvar," (Z-score)"),
         y="Module eigengene (adjusted)",
         title=str_c(module," module vs. ",xvar)) +
    theme_gray(base_size=16, base_family="Helvetica")
options(repr.plot.width=7, repr.plot.height=7)
plot(p)

## Intramodule connectivity

> Intramodular connectivity can be used to find "hub" analytes in each of the modules. This can be useful to understand potential drivers of each module. 
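
In its simplest form, intramodular connectivity is the sum of a node's adjacencies to the other members of its own module. The sketch below computes it by hand on a made-up adjacency matrix with hypothetical analyte names and module labels.

```r
# Toy adjacency for 5 hypothetical analytes; the first three form one module
ids <- paste0("a", 1:5)
adj <- matrix(0.1, 5, 5, dimnames = list(ids, ids))
adj[1:3, 1:3] <- c(1.0, 0.9, 0.8,
                   0.9, 1.0, 0.5,
                   0.8, 0.5, 1.0)
diag(adj) <- 1
module <- c("blue", "blue", "blue", "grey", "grey")

# Within-module connectivity: sum of adjacencies to other members of the same module
in_mod <- module == "blue"
sub <- adj[in_mod, in_mod]
k_within <- rowSums(sub) - diag(sub)   # drop the self-connection
sort(k_within, decreasing = TRUE)
```

Here `a1` has the highest within-module connectivity and would be the candidate hub of this toy module.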

In [None]:
#Prepare target modules
targets <- modNames[modNames!="grey"]

#Repeat for each module
temp_tbl <- tibble()
for (module in targets) {
    #Select module probes
    probes <- colnames(data_df)
    inModule <- (moduleColors==module)#Use the merged module colors, which match modNames and module_tbl
    modProbes <- probes[inModule]
    
    #Select the corresponding Topological Overlap
    modTOM <- TOM[inModule, inModule]
    dimnames(modTOM) <- list(modProbes, modProbes)
    
    #Calculate intramodular connectivity
    IMConn <- intramodularConnectivity(adjacency[modProbes, modProbes], rep(module, length(modProbes)), scaleByMax=FALSE)$kWithin
    
    #Summary table
    connectivity <- tibble(ModuleID=str_to_title(module),
                           AnalyteID=modProbes,
                           IntramodularConnectivity=IMConn,
                           TOMsimilaritySum=rowSums(modTOM))
    print(str_c(module," module: ", nrow(connectivity)))
    #print(head(connectivity))#Explicitly print due to within for-loop
    
    #Add to the overall table
    temp_tbl <- dplyr::bind_rows(temp_tbl, connectivity)
}
print(str_c("-> Total nrow: ", nrow(temp_tbl)))
head(temp_tbl)

#Update the module assignment table
module_tbl <- dplyr::left_join(module_tbl, temp_tbl, by=c("AnalyteID", "ModuleID"))
print("Module assignment table (updated)")
print(str_c("- nrow: ", nrow(module_tbl)))
head(module_tbl)

## Enrichment analysis
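
Over-representation analysis asks whether a gene set appears in a module more often than expected given the background of measured analytes. The sketch below shows the underlying hypergeometric calculation with made-up counts; it illustrates the general test and is not a description of the internals of `enrichGO`.

```r
# Hypothetical counts, for illustration only
universe_size <- 1000   # analytes measured (the background)
term_size     <- 50     # background analytes annotated to the GO term
module_size   <- 40     # analytes in the module of interest
overlap       <- 8      # module analytes annotated to the term

# P(X >= overlap) under the hypergeometric distribution
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  module_size, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on a 2x2 table
tab <- matrix(c(overlap, term_size - overlap,
                module_size - overlap,
                universe_size - term_size - module_size + overlap), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
c(hypergeometric = p_hyper, fisher = p_fisher)
```

The choice of background matters: with a targeted proteomics panel, using only the measured proteins as the universe (as the cells below do) avoids inflating enrichment for terms the panel was designed around.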

In [None]:
brown_module <- module_tbl %>%
    dplyr::filter(ModuleID %in% "Brown")
blue_module <- module_tbl %>%
    dplyr::filter(ModuleID %in% "Blue")
turq_module <- module_tbl %>%
    dplyr::filter(ModuleID %in% "Turquoise")

In [None]:
orBP_Brown <- enrichGO(gene          = brown_module$GeneSymbol,
                universe      = module_tbl$GeneSymbol,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                keyType       = "SYMBOL",
                pAdjustMethod = "BH",
                pvalueCutoff  = 1,
                qvalueCutoff  = 1,
        readable      = TRUE)


In [None]:
barplot(orBP_Brown, showCategory=10) 
head(orBP_Brown[,1:8])

### Hub gene enrichment

In [None]:
topQ <- 0.10 #Focus on the top hubs
hubs_brown <- module_tbl %>%
    dplyr::filter(ModuleID=="Brown") %>%
    dplyr::filter(IntramodularConnectivity>=quantile(IntramodularConnectivity, 1-topQ, na.rm=TRUE))

In [None]:
quantile(module_tbl$IntramodularConnectivity, 1-topQ, na.rm=TRUE)
hist(module_tbl$IntramodularConnectivity, main = "Histogram of Intramodular Connectivity",xlab = 'Connectivity')
abline(v=quantile(module_tbl$IntramodularConnectivity, 1-topQ, na.rm=TRUE))

In [None]:
orBP_Brown <- enrichGO(gene          = hubs_brown$GeneSymbol,
                universe      = module_tbl$GeneSymbol,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                keyType       = "SYMBOL",
                pAdjustMethod = "BH",
                pvalueCutoff  = 1,
                qvalueCutoff  = 1,
        readable      = TRUE)

In [None]:
barplot(orBP_Brown, showCategory=10) 
head(orBP_Brown[,1:8])

## *Exercise - Metabolites and clinical labs*

We have only touched on proteins for the WGCNA. Perform WGCNA with metabolites and/or clinical labs. Do you find informative modules? How would you do enrichment analysis? 

# Differential Quantification

In [None]:
# Split into quintiles to compare distribution extremes. 
num_df_med <- scale(data_df, scale = FALSE)
meta_indices_quin <- sample_tbl %>%
    dplyr::mutate(merge_quin = dplyr::ntile(merge_fi, 5)) %>%
    dplyr::mutate(merge_quin = dplyr::case_when(merge_quin == 1 ~'1st', 
                                            merge_quin == 2 ~'2nd',
                                            merge_quin == 3 ~'3rd',
                                            merge_quin == 4 ~'4th',
                                            merge_quin == 5 ~'5th')) %>%
    dplyr::mutate(dplyr::across(c(merge_quin, age, sex), factor))#Replaces the deprecated mutate_each_()/funs()

num_df_med_t <- t(num_df_med)

In [None]:
# Create our groups, as factors (required by LIMMA)
group <- as.factor(meta_indices_quin$merge_quin)
age <- as.factor(meta_indices_quin$age)
sex <- as.factor(meta_indices_quin$sex)
# Set design comparing 5th to 1st quintile
design = model.matrix(~0+group + sex + age)
contrast <- makeContrasts(HighvsLow = group5th-group1st, levels=design)
fit1 <- lmFit(num_df_med_t, design)
fit2 <- contrasts.fit(fit1,contrasts = contrast)
fit3 <- eBayes(fit2)
res <- topTable(fit3, sort.by = "P", n = Inf)

In [None]:
# Plot results
colnames(res) <- c("log2FC", "AveExpr", "t", "pvalue", "padjust", "B")
res$Name <- rownames(res)

x <- res
x$id <- rownames(x)
x$id[(x$padjust > 0.01 | abs(x$log2FC) < 1)] <- NA

fc <- max(abs(x$log2FC), na.rm=TRUE) +.5
clrs = "RdBu"
clrs <- vc_palette(clrs, ramp=FALSE)
#color_ramp <- c(rev(RColorBrewer::brewer.pal(7,"Reds")), "White", RColorBrewer::brewer.pal(7,"Reds"))

p1 <- ggplot2::ggplot(data=x,
              ggplot2::aes(x=log2FC, y= -log10(padjust), label = id)) +
         ggplot2::geom_point(ggplot2::aes(fill = log2FC),
             color="gray20", shape = 21, size=3) +
         ggplot2::xlim( -fc, fc) + ggplot2::theme_light() +
         ggplot2::xlab("Log2 Fold Change") +
         ggplot2::ylab("-Log10 Adjusted P-value") +
         ggplot2::scale_fill_gradientn(colors=clrs, limits=c(-fc, fc),
             guide="none")
options(repr.plot.width=7, repr.plot.height=7)
p1 + geom_text(check_overlap = TRUE)

In [None]:
res$AnalyteID <- rownames(res)
res_modules <- merge(res, module_tbl, by="AnalyteID")
print("Number of DEPs")
print(str_c("- nrow: ", nrow(res_modules %>% filter(padjust <= 0.05))))

print("Number of DEPs overlapping Blue module")
print(str_c("- nrow: ", nrow(res_modules %>% filter(padjust <= 0.05 & ModuleID == 'Blue'))))

print("Number of DEPs overlapping Brown module")
print(str_c("- nrow: ", nrow(res_modules %>% filter(padjust <= 0.05 & ModuleID == 'Brown'))))

print("Number of DEPs overlapping Turquoise module")
print(str_c("- nrow: ", nrow(res_modules %>% filter(padjust <= 0.05 & ModuleID == 'Turquoise'))))


In [None]:
datatable(res_modules, options = list(searching = TRUE))

In [None]:
# Look at enrichment
DE_Prots <- res_modules %>%
    dplyr::filter(padjust<=0.05)#Keep the significant proteins

orBP_DE <- enrichGO(gene          = DE_Prots$GeneSymbol,
                universe      = module_tbl$GeneSymbol,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                keyType       = "SYMBOL",
                pAdjustMethod = "BH",
                pvalueCutoff  = 1,
                qvalueCutoff  = 1,
        readable      = TRUE)


In [None]:
barplot(orBP_DE, showCategory=10) 
head(orBP_DE[,1:8])