In [None]:
library(ggplot2)
library(tidyverse)
library(knitr)
library(DT)
library(Matrix)

# Analysis of High Dimensional Data

## What is high dimensional data?
### * Capturing many data points on the same experimental unit
### * Challenge we face in "Omics" data
### - **Phenomics** - Many phenotypes measured
### - **Genomics** - Gene expression, DNA Sequencing
### - **Metabolomics** - HPLC-MS

# Problems arising in high dimensional data 

## **Hypothesis testing**
### - Conducting significance tests for multiple variables
## **Visualization**
### - Clustering of samples in N-dimensional space
## **Modeling**
### - Few Observations relative to parameters (N<<P) 

# Multiple testing problem
## Common Examples
### - Genome-wide association (GWAS)
### - Gene expression analysis

## * If you have 10,000 tests for which the null hypothesis is true, how many times would you reject with a cut-off of $\alpha$ = 0.05?
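
As a rough guide, assuming the tests are independent and every null hypothesis is true, the expected number of rejections is $m\alpha$, which is 500 here. A minimal sketch of that arithmetic (the variable names are illustrative):

```r
m <- 10000      # number of tests, all with a true null hypothesis
alpha <- 0.05   # significance cut-off

# Under the null, p-values are (approximately) uniform on [0, 1],
# so each test is rejected with probability alpha
expectedFalsePos <- m * alpha
expectedFalsePos

# Probability of at least one false positive (assumes independent tests)
fwer <- 1 - (1 - alpha)^m
fwer
```

The second quantity is the family-wise error rate; with this many tests it is essentially 1, which is what motivates a correction.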


# Multiple Test Correction

### Let's assume we are drawing samples from the same population (null hypothesis is true). How often will we find a significant difference between samples?

In [None]:
# Pre-allocate the result vector instead of growing it inside the loop
pValsNull <- numeric(10000)
for(i in 1:10000){
  # Draw two samples from the same distribution (standard normal)
  sample1 <- rnorm(10)
  sample2 <- rnorm(10)
  # Use a t-test to test if the two samples have statistically different means
  pVal <- t.test(sample1, sample2)$p.value
  # Save result in the vector pValsNull
  pValsNull[i] <- pVal
}
# Plot distribution of p values
hist(pValsNull)
# Count how many p values are less than alpha (0.05)
sum(pValsNull < 0.05)

## How often will we find a significant difference if the samples are drawn from two different populations (alternative hypothesis)?


In [None]:
# Pre-allocate the result vector instead of growing it inside the loop
pValsAlt <- numeric(10000)
for(i in 1:10000){
  # Draw sample one from a standard normal
  sample1 <- rnorm(10)
  
  # Draw sample two from a normal with a different mean
  sample2 <- rnorm(10, mean=2)
  # Use a t-test to test if the two samples have statistically different means
  pVal <- t.test(sample1, sample2)$p.value
  # Save result in the vector pValsAlt
  pValsAlt[i] <- pVal
}
# Plot distribution of p values
hist(pValsAlt)
# Count how many p values are less than alpha (0.05)
sum(pValsAlt < 0.05)

## In reality you have a mixture of null and alternative tests.

In [None]:
pValsReality <- c()
# One thousand tests where the null is true
for(i in 1:1000){
  pValsReality <- c(pValsReality,t.test(rnorm(10), rnorm(10))$p.value)
}
# One hundred tests alt
for(i in 1:100){
  pValsReality <- c(pValsReality,t.test(rnorm(10, mean=2.5), rnorm(10))$p.value)
}
hist(pValsReality)

# Controlling false positives
- ### Multiple test corrections need to strike a balance between Type 1 error (false positives) and Type 2 error (lack of power)

- ### Which type of error is worse?

| Decision | Null hypothesis true | Null hypothesis false |
|------|----|-----|
|Reject|Type 1 error|Correct|
|Fail to reject|Correct|Type 2 error|


## Bonferroni correction
### - $\frac{\alpha}{m}$
### - Often too conservative

## Benjamini-Hochberg correction
### - Rank p-values
### - $\frac{rank}{m}q$
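
The two rules can be sketched by hand on simulated p-values. This is an illustrative example, not the notebook's own data: the names `pVals`, `alpha`, `q`, `nBonf` and `nBH` are assumptions, and the mixture (1000 null tests, 100 alternative tests) is chosen only to mirror the simulation above.

```r
set.seed(1)
# Simulated p-values: 1000 true nulls followed by 100 true alternatives
pVals <- c(replicate(1000, t.test(rnorm(10), rnorm(10))$p.value),
           replicate(100,  t.test(rnorm(10, mean = 2.5), rnorm(10))$p.value))
m <- length(pVals)
alpha <- 0.05   # family-wise error rate targeted by Bonferroni
q <- 0.05       # false discovery rate targeted by Benjamini-Hochberg

# Bonferroni: compare every p-value to alpha / m
nBonf <- sum(pVals < alpha / m)

# Benjamini-Hochberg: sort the p-values, compare the k-th smallest
# to (k / m) * q, and reject everything up to the largest k that passes
pSorted <- sort(pVals)
passes <- which(pSorted <= (seq_len(m) / m) * q)
nBH <- if (length(passes) > 0) max(passes) else 0

c(bonferroni = nBonf, BH = nBH)
```

The hand-computed Benjamini-Hochberg count should agree with `sum(p.adjust(pVals, method = "BH") <= q)`, and Bonferroni can never reject more tests than Benjamini-Hochberg at the same level.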


## Bonferroni

In [None]:
# Multiply by the number of tests (m) to get Bonferroni-adjusted p values, capped at 1
m <- length(pValsReality)
pValsRealityAdj <- pmin(1, pValsReality * m)
sum(pValsRealityAdj < 0.05)

In [None]:
# Using built-in R function p.adjust
sum(p.adjust(pValsReality, method="bonferroni") <  0.05)

## Benjamini-Hochberg correction

In [None]:
sum(p.adjust(pValsReality, method="BH") <  0.05)
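
Because the data are simulated, the counts above can be split into true and false positives. The sketch below regenerates a mixture of the same shape and keeps a label for each test; `isAlt`, `countHits` and `results` are illustrative names, not part of the original analysis.

```r
set.seed(2)
# Label each test: 1000 true nulls followed by 100 true alternatives
isAlt <- rep(c(FALSE, TRUE), times = c(1000, 100))
pVals <- sapply(isAlt, function(alt) {
  t.test(rnorm(10, mean = if (alt) 2.5 else 0), rnorm(10))$p.value
})

# Count true and false positives at a 0.05 cut-off for one adjustment method
countHits <- function(method) {
  sig <- p.adjust(pVals, method = method) < 0.05
  c(truePos = sum(sig & isAlt), falsePos = sum(sig & !isAlt))
}

# One column per method: unadjusted, Bonferroni, Benjamini-Hochberg
results <- sapply(c("none", "bonferroni", "BH"), countHits)
results
```

Reading across the columns shows the trade-off from the error table: stricter corrections remove false positives at the cost of some true positives.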

# Visualizing data

In [None]:
kable(cars[1:10,])
plot(cars[1:10,], xlab="Speed (mph)", ylab="Stopping distance (ft)", main="Speed vs. Stopping Distance")

# Visualizing high dimensional data


In [None]:
fakeGene <- data.frame(Gene1=rpois(10,3),
                       Gene2=rpois(10,4),
                       Gene3=rpois(10,4.5),
                       Gene4=rpois(10,6),
                       Gene5=rpois(10,6),
                       Gene6=rpois(10,10))
kable(fakeGene)

# Principal components analysis

- ### Finds combination of variables that produces a new axis (PC)
- ### Designed to maximize variance explained by each PC
- ### PCs are orthogonal to each other
- ### Useful for visualization and can be used in regression contexts



![](images/pca.webp)


https://doi.org/10.1038/srep25696 

# Principal components analysis

- ### $\mathbf{X}$ is an n x m incidence matrix
- ### $\mathbf{X}'\mathbf{X}$ is m x m
- ### Spectral decomposition: $\mathbf{X}'\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{U}'$ 
- ### $\mathbf{U}$ contains **eigenvectors**
- ### $\mathbf{D}$ is a diagonal matrix of **eigenvalues**
- ### We want the **scores** in matrix $\mathbf{T}$
 + ### $\mathbf{T} = \mathbf{X}\mathbf{U}$
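
The steps above can be sketched directly with `eigen()`. This is a minimal illustration on four columns of the built-in `mtcars` data (an arbitrary choice); the object names are assumptions, and the score matrix is stored as `scoresT` so that R's `T` shorthand for `TRUE` is not masked.

```r
# Column-center and scale a small built-in data set (n = 32, m = 4)
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")]))

# Spectral decomposition of X'X
E <- eigen(t(X) %*% X)
U <- E$vectors        # eigenvectors (one column per PC)
D <- diag(E$values)   # diagonal matrix of eigenvalues

# Scores: T = XU
scoresT <- X %*% U

# prcomp() should give the same scores up to the sign of each column
pc <- prcomp(X)
all.equal(abs(scoresT), abs(pc$x), check.attributes = FALSE)
```

The sign of an eigenvector is arbitrary, which is why the comparison uses absolute values.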


## Example

In [None]:
diamSub <-diamonds[1:40, c("carat", "depth", "table",
                                  "price", "x","y","z")]
diamSubLabs <- diamonds[1:40,"price"]
diamSubScaled <- scale(diamSub)
pcaRes <- princomp(diamSubScaled)

## Make a scree plot

In [None]:
plot(pcaRes)
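
A scree plot is often read alongside the proportion of variance each component explains. The sketch below shows one way to compute it from a `princomp` fit; it uses the built-in `USArrests` data purely for illustration, and `pcaDemo` and `propVar` are hypothetical names.

```r
# Illustrative data: the built-in USArrests data set, scaled
pcaDemo <- princomp(scale(USArrests))

# Each component's variance is its squared standard deviation
propVar <- pcaDemo$sdev^2 / sum(pcaDemo$sdev^2)
round(propVar, 3)

# Cumulative proportion, useful for choosing how many PCs to keep
cumsum(propVar)
```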

## Plot the scores

In [None]:
# diamSubLabs holds the price of each diamond, so name the column price
pcs <- data.frame(pc1=pcaRes$scores[,1], 
                  pc2=pcaRes$scores[,2],
                  price=diamSubLabs$price)
pca.gg <- ggplot(pcs, aes(x=pc1, y=pc2, color=price)) +
          geom_point()
pca.gg

# Analyzing high dimensional data (N<<P)

## Simulate data

The following section of code simulates multiple continuous covariates and the corresponding phenotypes. This data set will be used to explore various methods for dealing with high-dimensional data. The simulated data set is small for demonstration purposes, but is generated to show the issues that arise from over-parameterization and how various modeling approaches deal with these challenges.

In [None]:
set.seed(100)
# Generate multiple continuous independent variables for 100 observations
# Initialize the full incidence matrix
Xfull=matrix(0,100,70)
for(i in c(1:70)){
  Xfull[,i]=rnorm(100,0,1)
}
# Xfull now has 70 independent covariates that were generated with no correlation structure
# Generate phenotypes that are a function of the randomly generated covariates
# Sample beta values
betaTrue=runif(70,.1,1)
# Generate phenotypes; the residual variance is half the variance of Xfull %*% betaTrue
y=Xfull%*%betaTrue+rnorm(100,0,(.5*var(Xfull%*%betaTrue))**.5)
var(y)
var(Xfull%*%betaTrue)

## Creating an over-parameterized situation

The dataset as simulated is not over-parameterized: the number of parameters is less than the number of independent observations used to solve for the unknown effects. We will generate an over-parameterized dataset by taking a subset of the data such that the number of independent observations is less than the number of parameters.

## First let's get the OLS estimates for beta using the full dataset

In [None]:
XfXf=t(Xfull)%*%(Xfull)
#XfXf matrix is full rank so we can invert and solve for bhat
rankMatrix(XfXf)
# solving for bhat
bhat=solve(XfXf)%*%t(Xfull)%*%y
hist(bhat)
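
As a sanity check on the normal equations, the same estimates can be obtained from `lm()`. The sketch below uses a small simulated data set of its own (50 observations, 5 covariates, no intercept); all names and values in it are illustrative assumptions.

```r
set.seed(1)
# Small illustrative data set: 50 observations, 5 covariates, no intercept
X <- matrix(rnorm(50 * 5), 50, 5)
y <- as.vector(X %*% c(1, 0.5, -0.5, 0.2, 0)) + rnorm(50)

# Normal equations: (X'X) bhat = X'y
# solve(A, b) solves the system without forming the inverse explicitly
bhatNE <- solve(t(X) %*% X, t(X) %*% y)

# Same model fitted with lm(); "- 1" drops the intercept to match
bhatLM <- coef(lm(y ~ X - 1))

all.equal(as.numeric(bhatNE), as.numeric(bhatLM))
```

Passing the right-hand side to `solve()` is generally preferred to `solve(XtX) %*% ...` because it is more numerically stable, although both give the same answer on well-conditioned data.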

Now let's take a subset of this data to create an over-parameterized dataset where p > n. To do this we will create y.train, which we will use to estimate bhat, and y.validation, which we will use to test how well the estimates predict observations that were not used to get the solution.
Validating on held-out data in this way is a good way to examine overfitting, a common problem with over-parameterized data.

In [None]:
# Create training set from initial data set
Xt=Xfull[1:30,]
y.train=y[1:30]

# Create validation set from initial data set
Xv=Xfull[31:100,]
y.validation=y[31:100]

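# Now let's try to solve for bhat using the reduced data set
XtXt=t(Xt)%*%(Xt)
rankMatrix(XtXt)
# XtXt is rank deficient, so the OLS solution below is expected to fail (left commented out)
#bhat=solve(XtXt)%*%t(Xt)%*%y.train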

# Ridge Regression
XtXt is a 70x70 matrix with a rank of 30 (we only have 30 observations). As a result XtXt is singular meaning we cannot invert it and we cannot get solutions for bhat. We need to find a way to solve for bhat as best we can while dealing with the issue of over-parameterization. Feature selection is one option, but in this case we know each covariate explains variation in y, so eliminating 40 of these covariates to get a full rank XtXt is not an ideal solution.

Ridge regression works by adding a constant value to the diagonal of XtXt so that we can invert it and solve. Adding a value to the diagonal means our solutions are no longer unbiased. That is a problem if we are using the estimates for hypothesis testing, but bias in the estimates is acceptable if the goal is prediction and introducing the bias leads to good predictors.

In [None]:
# Arbitrarily choosing the constant to add to the diagonal. In practice you would want to find the constant that maximizes prediction accuracy (usually measured using cross-validation)
RRconstant=10
# Adding constant
RRXtXt=XtXt + diag(RRconstant,70,70)
# testing to see if it is full rank
rankMatrix(RRXtXt)

# Given RRXtXt is full rank let's go ahead and solve for bhat
bhatRR=solve(RRXtXt)%*%t(Xt)%*%y.train

# Generating predicted yhat values for the 70 observations we left out of the training dataset
yhatv=Xv%*%bhatRR

# calculating the correlation
cor(yhatv,y.validation)
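
The ridge estimate is $\hat{\beta} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$, so the constant $\lambda$ can be tuned by trying several values and comparing prediction accuracy. A minimal sketch, assuming data simulated in the same way as above; `ridgeFit`, `lambdaGrid` and `valCor` are hypothetical names and the grid is arbitrary.

```r
set.seed(100)
# Simulate data in the same way as above: 100 observations, 70 covariates
X <- matrix(rnorm(100 * 70), 100, 70)
beta <- runif(70, 0.1, 1)
signal <- as.vector(X %*% beta)
y <- signal + rnorm(100, 0, sqrt(0.5 * var(signal)))

train <- 1:30
valid <- 31:100

# Ridge solution for a given constant lambda
ridgeFit <- function(lambda) {
  XtX <- crossprod(X[train, ])
  solve(XtX + diag(lambda, ncol(X)), crossprod(X[train, ], y[train]))
}

# Correlation between predicted and observed values for each constant
lambdaGrid <- c(0.1, 1, 10, 100, 1000)
valCor <- sapply(lambdaGrid, function(lambda) {
  cor(as.vector(X[valid, ] %*% ridgeFit(lambda)), y[valid])
})
data.frame(lambda = lambdaGrid, valCor = valCor)

# Constant with the highest correlation on the held-out data
lambdaGrid[which.max(valCor)]
```

Selecting the constant on the same observations used to report accuracy gives an optimistic estimate; a more careful design would tune it by cross-validation within the training set only.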

In [None]:
hist(bhatRR)

In [None]:
cor(bhatRR,bhat)
plot(bhatRR,bhat)

## Principal Component Regression

Unlike penalized methods that fit all covariates but use a penalty to prevent overfitting, PCR seeks to reduce the dimensions of the data. Rather than removing the columns of X as would be done using feature selection, PCR decomposes X into orthogonal components and retains components that explain the majority of variation in X.

In [None]:
# In this example we will use a spectral decomposition of X'X to form a matrix T which contains orthogonal vectors (columns)
# First let's center X. Given the values in X were simulated to have a mean of zero this won't have much of an impact, but it is best practice to do so.
Xtc=scale(Xt, center=TRUE, scale=FALSE)
# Here we use the function eigen to decompose X'X
XtcXtc=t(Xtc)%*%Xtc
E=eigen(XtcXtc)
# E is an object that contains both the eigenvalues and the eigenvectors
# Let's plot the eigenvalues
plot(E$values)
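
A numerical check of how many eigenvalues are effectively non-zero can complement the plot. The sketch below uses an illustrative random matrix with the same shape as the training set; `Xdemo`, `Edemo`, `tol` and `cumVar` are hypothetical names.

```r
set.seed(1)
# Illustrative matrix with the same shape as the training set: n = 30, p = 70
Xdemo <- scale(matrix(rnorm(30 * 70), 30, 70), center = TRUE, scale = FALSE)
Edemo <- eigen(crossprod(Xdemo), symmetric = TRUE)

# Count eigenvalues that are non-zero up to a numerical tolerance.
# Column-centering removes one degree of freedom, so at most n - 1 are non-zero.
tol <- 1e-8 * max(Edemo$values)
sum(Edemo$values > tol)

# Cumulative proportion of variance explained by the leading components
cumVar <- cumsum(Edemo$values) / sum(Edemo$values)
cumVar[c(10, 20, 29)]
```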

Because Xtc is column-centered, Xtc'Xtc has a rank of 29 (one less than the 30 observations), so the first 29 eigenvectors explain all of the variation in Xtc. Our next step is to construct a full rank matrix with orthogonal columns (T) and then solve for OLS using T.

In [None]:
# First let's pull the eigenvectors out of the object E
U=E$vectors
# We now want to set up P such that T'T can be inverted. For starters let's take the first 29 vectors in U
P=U[,1:29]
# Now we calculate T as T=XP (stored as Tt so R's built-in T, shorthand for TRUE, is not masked)
Tt=Xtc%*%P
# Solving OLS using T
bpcr=solve(t(Tt)%*%Tt)%*%t(Tt)%*%y.train
# Now let's validate on the held-out data to see how good our predictor is
# Center the validation covariates with the training column means
Xvc=scale(Xv, center=attr(Xtc,"scaled:center"), scale=FALSE)
Tv=Xvc%*%P
yhatvpcr=Tv%*%bpcr
cor(yhatvpcr,y.validation)
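
The number of retained components is a tuning choice, much like the ridge constant. A minimal sketch of how it could be explored, assuming data simulated in the same way as above; `pcrCor`, `kGrid` and `pcrCorByK` are hypothetical names and the grid of component counts is arbitrary.

```r
set.seed(100)
# Simulate data in the same way as above: 100 observations, 70 covariates
X <- matrix(rnorm(100 * 70), 100, 70)
signal <- as.vector(X %*% runif(70, 0.1, 1))
y <- signal + rnorm(100, 0, sqrt(0.5 * var(signal)))
train <- 1:30
valid <- 31:100

# Center both sets with the training column means
Xtc <- scale(X[train, ], center = TRUE, scale = FALSE)
Xvc <- scale(X[valid, ], center = attr(Xtc, "scaled:center"), scale = FALSE)
U <- eigen(crossprod(Xtc), symmetric = TRUE)$vectors

# Validation correlation for PCR using the first k components
pcrCor <- function(k) {
  P <- U[, 1:k, drop = FALSE]
  scoresTrain <- Xtc %*% P
  bpcr <- solve(crossprod(scoresTrain), crossprod(scoresTrain, y[train]))
  cor(as.vector(Xvc %*% P %*% bpcr), y[valid])
}

kGrid <- c(5, 10, 20, 29)
pcrCorByK <- sapply(kGrid, pcrCor)
data.frame(k = kGrid, valCor = pcrCorByK)
```

In this simulation every covariate contributes to y and the covariates are uncorrelated, so dropping components discards signal; PCR tends to be more helpful when the covariates are strongly correlated.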

## Ridge regression is a penalized method
- ### Random effects in a mixed model are also a penalized method and can be used to prevent overfitting
- ### When working with high-dimensional data penalized methods are commonly used
    - ### Methods differ based on the type of penalty and how it is estimated
    - ### For mixed models the penalty is $\frac{\sigma^2_{e}}{\sigma^2_{u}}$ 
- ### Genomic prediction is a common breeding application for penalized methods
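
The link between ridge regression and mixed models can be sketched numerically. The example assumes a model with random effects only (no fixed effects) and variance components that are known; in practice they would be estimated, for example by REML. All names and values are illustrative.

```r
set.seed(1)
n <- 30
p <- 70
sigma2u <- 0.5   # assumed variance of the random effects
sigma2e <- 2     # assumed residual variance
Z <- matrix(rnorm(n * p), n, p)
u <- rnorm(p, 0, sqrt(sigma2u))
y <- as.vector(Z %*% u) + rnorm(n, 0, sqrt(sigma2e))

# The penalty is the ratio of the variance components
lambda <- sigma2e / sigma2u

# Ridge form: (Z'Z + lambda I)^-1 Z'y, a p x p system
uRidge <- solve(crossprod(Z) + diag(lambda, p), crossprod(Z, y))

# Equivalent BLUP form: Z'(ZZ' + lambda I)^-1 y, an n x n system
uBlup <- crossprod(Z, solve(tcrossprod(Z) + diag(lambda, n), y))

all.equal(as.vector(uRidge), as.vector(uBlup))
```

The second form only requires solving an n x n system, which is convenient when there are far more covariates than observations.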
  