# Multivariate analysis

In this notebook, we will apply principal component analysis (PCA) and compare different types of cluster analysis.


## Installation of libraries and necessary software

Copy the files _mmc5_vclust_in.csv_, _MetaboIonsNormed.csv_, _ExampleFile.csv_ and _FcmClustPEst.R_ into the folder that contains this Jupyter notebook, or upload them to http://localhost:8888/tree

Install the necessary libraries (only needed once) by executing (shift-enter) the following cell:


In [None]:
install.packages("DAAG", repos='http://cran.us.r-project.org')
install.packages("MASS", repos='http://cran.us.r-project.org')
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos='http://cran.us.r-project.org')
BiocManager::install("Biobase")
BiocManager::install("Mfuzz")
install.packages("e1071", repos='http://cran.us.r-project.org')
install.packages("matrixStats", repos='http://cran.us.r-project.org')

## Loading data and libraries
This requires that the installation above has finished without errors.

In [None]:
library(DAAG)
library(MASS)
library(Biobase)
library(e1071)
library(matrixStats)

# load data file (you need to place the file into the same folder)
ExampleData <- read.csv("ExampleFile.csv")
MetabolomicsData <- read.csv("MetaboIonsNormed.csv")
source("FcmClustPEst.R")


### Exercise 1
We will use dimensionality reduction to simplify a given data set. For a more extensive description of PCA in R, see e.g. https://www.datacamp.com/community/tutorials/pca-analysis-r

Carry out principal component analysis for the ```possum``` data. Rows with missing values must be removed first. Plot the scores of the PCA, using a different color for each location where the possums were trapped (defined by ```site```).
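
The workflow described above can be tried out first on a different data set. The following is a minimal sketch on the built-in `iris` data, not the solution for `possum`. The missing value is introduced artificially for demonstration, and using `prcomp` with `scale. = TRUE` is an assumption about a reasonable default rather than a prescribed setting.

```r
# Illustrative sketch on the built-in iris data (NOT the possum solution)
X <- iris[, 1:4]
X[3, 2] <- NA                        # artificial missing value, for demonstration only

# keep only rows without missing values
keep <- complete.cases(X)
sum(keep)                            # number of complete rows

# PCA on standardized variables (scaling is an assumption, see text)
pca <- prcomp(X[keep, ], scale. = TRUE)

# fraction of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained, 1)

# score plot of the first two components, colored by a grouping factor
plot(pca$x[, 1:2], col = as.integer(iris$Species[keep]), pch = 19)
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)

# loadings: variables with large absolute values contribute most to a component
pca$rotation[, 1:2]
```

For the exercise, the grouping factor would be `site` instead of `Species`, restricted to the same complete rows.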


In [None]:
data(possum)
A <- possum[,5:ncol(possum)]
## How many rows without missing values


## data.frame without missing values

## PCA ...



##### Question I:  <u>What percentage of the variance is already explained by principal component 1?</u>

_Answer_

##### Question II:  <u>Which are the most discriminating traits?</u>

_Answer_

##### Question III:  <u>Which sites (provide numbers) can be separated in the score plot of the PCA?</u>

_Answer_



### Exercise 2

We will now compare different types of cluster analyses, applied to a proteomics data set (phosphorylated peptides) and a transcriptomics data set.

Carry out hierarchical clustering, k-means and fuzzy c-means on the table from the file "mmc5_vclust_in.csv" and on the ```geneData``` data in R (use 10 clusters for all methods).
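
The example code in the next cell covers k-means and fuzzy c-means only. For the hierarchical part, a minimal sketch on simulated data could look as follows. The toy matrix, the average-linkage method and the Euclidean distance are assumptions chosen for illustration, not settings taken from the lecture.

```r
# Hypothetical toy data: 60 features (rows) measured in 6 conditions (columns)
set.seed(1)
toy <- matrix(rnorm(60 * 6), nrow = 60)

# scale each row to mean 0 and s.d. 1, as done for geneData in this notebook
toy_scaled <- t(scale(t(toy)))

# hierarchical clustering of the rows (distance and linkage are assumptions)
hc <- hclust(dist(toy_scaled, method = "euclidean"), method = "average")

# cut the dendrogram into 10 groups to make it comparable with k-means / c-means
groups <- cutree(hc, k = 10)
table(groups)

# heatmap of the rows; columns are kept in their original order
heatmap(toy_scaled, Colv = NA, scale = "none")
```

Cutting the tree with `cutree` gives a hard assignment like k-means, which makes the group sizes of the methods directly comparable.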



In [None]:
data("geneData")

protData <- as.matrix(read.csv("mmc5_vclust_in.csv", row.names=1))

# heatmap here:

## example code for the geneData set
# For the visualization copy the code from the script of the lecture
scaled_geneData <- t(scale(t(geneData))) # this scales each row to have mean 0 and s.d. 1
nclust <- 10
kmean.out <- kmeans(scaled_geneData,nclust)
cm.out <- cmeans(scaled_geneData, nclust, m=1.1)
par(mfrow=c(3,4))
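for (c in 1:nclust) {
  # plot centroid
  plot(kmean.out$centers[c,], type="l", lwd=2, col=2, ylim=c(-4,4))
  # drop=FALSE keeps a matrix even if the cluster has only one member
  clustc <- scaled_geneData[kmean.out$cluster==c, , drop=FALSE]
  # plot genes; invisible() suppresses the list of NULLs returned by apply
  invisible(apply(clustc, 1, lines, col="#00000033"))
}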
par(mfrow=c(1,1))


## fuzzy c-means clustering
#cm.out$cluster
par(mfrow=c(3,4))
for (c in 1:nclust) {
  plot(cm.out$centers[c,], type="l", lwd=2, col=2, ylim=c(-4,4), xlab="Condition", ylab="Expression pattern")
  # get members of cluster c
  c_indices <- cm.out$cluster==c
  if (sum(c_indices)>1) {
   #  print(sum(c_indices))
   clustc <- scaled_geneData[c_indices,]
   # get membership values, multiply by 100 and round -> number between 0..100
   clustmem <- round(cm.out$membership[c_indices,c]*100)
   # color for each of 100 levels
   colors <- rainbow(100)
   for (m in 1:nrow(clustc)) {
     lines(clustc[m,], col=colors[clustmem[m]])
   }
  }
}
par(mfrow=c(1,1))








##### Question I:  <u>Read the help describing ```geneData```. What does this dataset contain?</u>

_Answer_

##### Question II:  <u>Why should fuzzy c-means be superior to k-means?</u>

_Answer_

##### Question III:  <u>How many parameters are required for fuzzy c-means? What are they called?</u>

_Answer_

##### Question IV:  <u>What differences do you see between the three clustering methods?</u>

_Answer_

##### Question V:  <u>What is a membership value?</u>

_Answer_

##### Question VI:  <u>Do you see any specific pattern in the proteomics data? What could explain this behavior?</u>

_Answer_




### Exercise 3


Extract the columns corresponding to the first replicate of _protData_. Normalize the data to the median and apply the cluster analysis again (all methods from the previous exercise) to the resulting four-dimensional data set.
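
The normalization in the cell below subtracts a vector of column medians from a matrix using a double transpose. A small sketch on a hypothetical toy matrix shows that this idiom is equivalent to `sweep`. Base-R `apply(..., 2, median)` stands in here for `colMedians` so that the example runs without extra packages.

```r
# Hypothetical toy matrix: 5 features (rows) x 3 samples (columns), one NA
toy <- matrix(c(1, 2, 3, 4, NA,
                2, 4, 6, 8, 10,
                0, 1, 1, 2, 3), ncol = 3)

# one median per column (sample), ignoring missing values
col_med <- apply(toy, 2, median, na.rm = TRUE)

# idiom used in this notebook: transpose, subtract, transpose back
norm_a <- t(t(toy) - col_med)

# equivalent formulation: subtract col_med along the column margin
norm_b <- sweep(toy, 2, col_med, FUN = "-")

all.equal(norm_a, norm_b)
```

The transpose is needed because R recycles a vector down the columns of a matrix. Subtracting `col_med` from `toy` directly would align the medians with rows instead of samples.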




In [None]:
# Show first lines of example file
head(ExampleData)

colnames(ExampleData)
ExampleDataLog <- as.matrix(log2(ExampleData[,19:22]))

# Normalization by median
NormalizedData <- t(t(ExampleDataLog) - colMedians(ExampleDataLog, na.rm=TRUE))

# remove rows with missing values for kmeans and cmeans
NormalizedRedData <- NormalizedData[complete.cases(NormalizedData),]

# heatmap here


# kmeans + cmeans (10 clusters)
StandardizedData <- t(scale(t(NormalizedRedData)))



##### Question I:  <u>What does the function colMedians return?</u>

_Answer_

##### Question II:  <u>What do the row names of protData stand for?</u>

_Answer_

##### Question III:  <u>Is this data log-transformed? If yes, what tells us that it is already transformed?</u>

_Answer_

##### Question IV:  <u>How do we check whether the median normalization was correctly executed?</u>

_Answer_

##### Question V:  <u>Which samples are most similar and how does this show?</u>

_Answer_

##### Question VI:  <u>Why do we have to _scale_ the data before using k-means and fuzzy c-means?</u>

_Answer_



### Exercise 4
We will now look into the consequences of using different parameters for fuzzy c-means clustering. The fuzzifier will be set automatically to an optimal value, which is much higher than the previously used $m=1.1$.

Carry out fuzzy c-means using the parameter estimation from the lecture on ```StandardizedData```. Compare the results to the ones in the exercise above.
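
To get a feeling for what a larger fuzzifier does, the following sketch evaluates the standard fuzzy c-means membership formula, $u_{ij} = 1 / \sum_k (d_{ij}/d_{ik})^{2/(m-1)}$, for fixed cluster centers. The helper `fcm_membership`, the points and the centers are hypothetical and only serve as illustration. This is not the parameter estimation from `FcmClustPEst.R`.

```r
# Hypothetical helper: memberships of points (rows of x) to fixed centers
# for a given fuzzifier m. Assumes no point lies exactly on a center.
fcm_membership <- function(x, centers, m) {
  # squared Euclidean distance of every point to every center (n x k matrix)
  d2 <- sapply(seq_len(nrow(centers)),
               function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  # (d^2)^(-1/(m-1)) equals d^(-2/(m-1)), the weight in the FCM formula
  w <- d2^(-1 / (m - 1))
  # normalize so that the memberships of each point sum to 1
  w / rowSums(w)
}

# three toy points on a line between two toy centers
pts <- rbind(c(0.2, 0), c(1, 0), c(1.8, 0))
ctr <- rbind(c(0, 0), c(2, 0))

u_sharp <- fcm_membership(pts, ctr, m = 1.1)  # close to a hard assignment
u_soft  <- fcm_membership(pts, ctr, m = 2)    # softer memberships

round(u_sharp, 3)
round(u_soft, 3)
```

With a fuzzifier close to 1, memberships approach 0 or 1 and the result resembles k-means. A larger fuzzifier lowers the maximum membership of points that are not close to any center.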

In [None]:
PExpr <- new("ExpressionSet", exprs=as.matrix(StandardizedData))
parameters <- FcmClustPEst(PExpr, maxc = 25)

# fuzzy c-means clustering with these here:


##### Question I:  <u>Do the validation indices agree on the number of clusters?</u>

_Answer_

##### Question II:  <u>What are the main differences between the results of fuzzy c-means clustering in the exercise above and the results here?</u>

_Answer_

##### Question III:  <u>What is the total number of clustered proteins when counting only proteins with a maximum membership value $>$ 0.5?</u>

_Answer_



### Exercise 5
We will now look into a metabolomics data set with strong temporal behavior and use a version of fuzzy c-means clustering that includes the variance of the replicates, which is usually discarded.

Carry out hierarchical clustering on metabolomics data (paper: https://www.ncbi.nlm.nih.gov/pubmed/26373870) and test different distance measures. For that, check the help pages of ```heatmap``` and ```dist```.
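
As a starting point for testing distance measures, the sketch below compares two `dist` methods with a correlation-based distance on simulated data. The toy matrix and the helper `cor_dist` are hypothetical. Defining the correlation distance as `1 - cor` is one common choice and an assumption here.

```r
# Hypothetical toy data: 20 features (rows) x 8 samples (columns)
set.seed(2)
toy <- matrix(rnorm(20 * 8), nrow = 20,
              dimnames = list(paste0("feat", 1:20), paste0("s", 1:8)))

# distances between rows using two built-in methods of dist()
d_euc <- dist(toy, method = "euclidean")
d_man <- dist(toy, method = "manhattan")

# correlation-based distance between rows: 0 = same profile shape, 2 = opposite
cor_dist <- function(x) as.dist(1 - cor(t(x)))
d_cor <- cor_dist(toy)

# heatmap() calls distfun on the rows and on the transposed matrix for the columns
heatmap(toy, distfun = cor_dist, scale = "none")
```

Euclidean and Manhattan distances are sensitive to absolute abundance, while the correlation-based distance compares only the shape of the profiles.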

Load the file into VSClust (http://computproteomics.bmb.sdu.dk/Apps/VSClust) and carry out the analysis there (the app can become unresponsive when multiple users run the analysis at the same time). Use the PCA plot to check whether you read the file with the correct number of replicates and conditions. Estimate the parameter values and then apply the variance-based clustering.

In [None]:
# create the heatmap here:
head(MetabolomicsData)
rownames(MetabolomicsData) <- MetabolomicsData$X
MetabolomicsDataM <- as.matrix(MetabolomicsData[,2:ncol(MetabolomicsData)])
heatmap(MetabolomicsDataM,cexRow = 0.2, cexCol= 0.5, distfun = function(x) dist(x,method = 'euclidean'))

##### Question I:  <u>What are the main differences between heatmap and variance-sensitive clustering?</u>

_Answer_

##### Question II:  <u>Do you recognize the same groups?</u>

_Answer_

##### Question III:  <u>Why can the calculation of the heatmap take long?</u>

_Answer_

##### Question IV:  <u>Do the replicates of all 12 time points cluster together? If not, when do they fail to group and why do you think this happens?</u>

_Answer_

##### Question V:  <u>Does this improve when using another distance measure?</u>

_Answer_



In [None]:
?dist
