# Lecture 11

## Principal Component Analysis: Directions of Highest Variance

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/a9551ad2" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

Let us first load the wine data and look at it with the `pairs` function.

In [None]:
wine <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header = FALSE, sep = ",") # the file has no header row
colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")
pairs(wine[,-1], cex = .5) # the first column contains the label of the cultivar; we pretend here that we don't know it.

We first standardize the data with `scale`, then perform a principal component
analysis on this data set with the function `prcomp`, and finally plot the scores
of the data on the first two components.

In [None]:
wine.sc <- scale(wine[,-1])
pca <- prcomp(wine.sc, scale. = FALSE)
# pca <- prcomp(wine[,-1], scale. = TRUE) # is an alternative, where we don't scale manually
plot(pca$x[,1:2])
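
The two ways of scaling mentioned in the comment above can be compared directly. The following is a minimal sketch that uses the numeric columns of the built-in `iris` data as a stand-in (an assumption, so that the cell runs without downloading anything); the names `pca.manual` and `pca.auto` are only for illustration.

```r
# Stand-in data (assumption): numeric columns of the built-in iris data set.
X <- as.matrix(iris[, 1:4])

pca.manual <- prcomp(scale(X), scale. = FALSE)  # scale by hand, then PCA
pca.auto   <- prcomp(X, scale. = TRUE)          # let prcomp do the scaling

# The standard deviations should agree up to numerical error.
max(abs(pca.manual$sdev - pca.auto$sdev))
# We compare absolute values of the loadings and scores, because the sign
# of each loading vector (and of the matching scores) is arbitrary.
max(abs(abs(pca.manual$rotation) - abs(pca.auto$rotation)))
max(abs(abs(pca.manual$x) - abs(pca.auto$x)))
```

All three numbers should be tiny, which suggests that it does not matter whether we scale manually or let `prcomp` do it.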

The function `prcomp` returns the scores in the component `x` and the loadings in
the component `rotation`. We can verify that there are $p$ loading vectors of
length $p$ that have norm 1 and are orthogonal to each other.

In [None]:
dim(pca$rotation)
norm(pca$rotation[,1], "2") # use the standard Euclidean 2-norm
sum(pca$rotation[,1] * pca$rotation[,2]) # the scalar product is approximately 0
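
Instead of checking one norm and one scalar product, all of them can be checked at once: the loadings matrix $\Phi$ is orthogonal exactly when $\Phi^T\Phi$ is the identity matrix. The sketch below uses the built-in `USArrests` data as a stand-in (an assumption); `pca.demo`, `Phi` and `gram` are illustrative names.

```r
# Stand-in data (assumption): the built-in USArrests data set.
pca.demo <- prcomp(USArrests, scale. = TRUE)
Phi <- pca.demo$rotation            # p x p matrix, one loading vector per column
gram <- crossprod(Phi)              # t(Phi) %*% Phi, all pairwise scalar products
max(abs(gram - diag(ncol(Phi))))    # close to 0 if the columns are orthonormal
```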

Let us now produce the `biplot` from the slides.

In [None]:
biplot(pca, col = c('gray', 'red'), scale = 0)

You can verify by looking at `pca$x[,1:2]` and `pca$rotation[,1:2]` that the
labels are placed at the right positions.

To produce a figure similar to the one in the slides without scaling we can use the following code:

In [None]:
biplot(prcomp(wine[,-1], scale. = FALSE), col = c('gray', 'red'), scale = 0)

Let us check the relation between Singular Value Decomposition and PCA.

In [None]:
s <- svd(wine.sc)
sum((s$v - pca$rotation)^2)
sum((s$u %*% diag(s$d) - pca$x)^2)

We can see in the output above that the loadings matrix `pca$rotation` is the
same as the $V$ matrix of SVD (up to numerical errors) and the scores `pca$x`
are the same as $U\Sigma$ where $\Sigma$ is given by `diag(s$d)`.
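
Strictly speaking, singular vectors and loading vectors are only determined up to a sign, so a comparison that does not rely on matching signs is a bit more robust. The following sketch aligns the signs column by column before comparing; it uses the built-in `USArrests` data as a stand-in (an assumption), and `signs`, `V.aligned` and `Z.aligned` are illustrative names.

```r
# Stand-in data (assumption): the scaled built-in USArrests data set.
X <- scale(as.matrix(USArrests))
pca.demo <- prcomp(X)
s <- svd(X)

# The scalar product of matching columns is +1 or -1; its sign tells us
# whether a column of V has to be flipped to match the loading vector.
signs <- sign(colSums(s$v * pca.demo$rotation))
V.aligned <- sweep(s$v, 2, signs, "*")
Z.aligned <- sweep(s$u %*% diag(s$d), 2, signs, "*")

max(abs(V.aligned - pca.demo$rotation))  # loadings vs. V
max(abs(Z.aligned - pca.demo$x))         # scores vs. U Sigma
```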

You can now answer the questions on the first page of the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).

## Principal Component Analysis: Closest Low-Dimensional Subspace

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/7242b3e4" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

In the next cell you can see that we can perfectly reconstruct the original data
$X$ by computing either $Z\Phi^T$ or $X\Phi\Phi^T$.

In [None]:
wine.sc.rec1 <- pca$x %*% t(pca$rotation)
wine.sc.rec2 <- wine.sc %*% pca$rotation %*% t(pca$rotation)
sum((wine.sc - wine.sc.rec1)^2)
sum((wine.sc - wine.sc.rec2)^2)

Now we reconstruct the data using only the first two principal components and
compare the reconstruction error to the one we obtain by using the first and the
third principal component. You can see that the reconstruction error is higher
in the latter case. No matter which other pair of principal components we use to
reconstruct the original data, the reconstruction error will never be lower
than the one with the first two principal components. The first two principal
components indeed span the two-dimensional plane in this 13-dimensional space
that is closest to the original data.

In [None]:
wine.sc.rec.2 <- wine.sc %*% pca$rotation[,1:2] %*% t(pca$rotation[,1:2])
sum((wine.sc - wine.sc.rec.2)^2)
wine.sc.rec.2b <- wine.sc %*% pca$rotation[,c(1, 3)] %*% t(pca$rotation[,c(1, 3)])
sum((wine.sc - wine.sc.rec.2b)^2)
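
The claim about all pairs of components can be checked by brute force. The sketch below does this on the built-in `USArrests` data as a stand-in (an assumption); the helper `rec.error` and the other names are hypothetical and only for illustration.

```r
# Stand-in data (assumption): the scaled built-in USArrests data set.
X <- scale(as.matrix(USArrests))
pca.demo <- prcomp(X)
Phi <- pca.demo$rotation

# Reconstruction error when only the components in idx are used.
rec.error <- function(idx) {
    P <- Phi[, idx, drop = FALSE]
    sum((X - X %*% P %*% t(P))^2)
}

pairs.idx <- combn(ncol(Phi), 2)           # every pair of components, one per column
errors <- apply(pairs.idx, 2, rec.error)
rbind(pairs.idx, error = round(errors, 2))
pairs.idx[, which.min(errors)]             # the pair with the smallest error

# The error also equals (n - 1) times the summed variance of the dropped components.
(nrow(X) - 1) * sum(pca.demo$sdev[-(1:2)]^2)
```

The pair with the smallest error should be the first two components, and the last number should match the error of that pair.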

You can now answer the questions on the second page of the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).

## Proportion Variance Explained

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/6531e78a" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

The standard deviations of the scores of each component are in `pca$sdev`; we can
get the variances by squaring these numbers and the proportion of variance
explained by dividing by the total variance.

In [None]:
pca.var <- pca$sdev^2
pca.vare <- pca.var / sum(pca.var)
plot(pca.vare, xlab = "Principal Component", ylab = "Prop. Variance Explained", col = "blue", type = "b")
plot(cumsum(pca.vare), xlab = "Principal Component", ylab = "Cumulative Prop. Variance Explained", col = "blue", type = "b")
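
The cumulative curve can also be used to pick a number of components. The following sketch finds the smallest number of components that explains at least a given share of the variance; the threshold of 0.9 is an arbitrary choice and the built-in `USArrests` data is a stand-in (both are assumptions).

```r
# Stand-in data (assumption): the built-in USArrests data set.
pca.demo <- prcomp(USArrests, scale. = TRUE)
pve <- pca.demo$sdev^2 / sum(pca.demo$sdev^2)   # proportion of variance explained

threshold <- 0.9                                # hypothetical target, choose as needed
n.comp <- which(cumsum(pve) >= threshold)[1]    # first component that reaches the target
n.comp
round(cumsum(pve), 3)
```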

In the following cell we download and plot the image of the boat you saw in the
video.

In [None]:
library(tiff)
fname <- tempfile()
download.file("http://sipi.usc.edu/database/download.php?vol=misc&img=boat.512", fname, mode = "wb") # binary mode, needed on Windows
img <- readTIFF(fname)
image(t(img)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)

Now, let us scale the data, perform PCA and plot the variance explained, the
original image and the reconstructed image with the first 25 components.

In [None]:
img.s <- scale(img)
pZ <- prcomp(img.s)
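# undo the scaling by hand with the attributes stored by scale (the DMwR package, which provided unscale, has been archived on CRAN)
rec <- pZ$x[,1:25] %*% t(pZ$rotation[,1:25])
newimg <- sweep(sweep(rec, 2, attr(img.s, "scaled:scale"), "*"), 2, attr(img.s, "scaled:center"), "+")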
par(mfrow = c(1, 3), oma = rep(1, 4), cex = 1)
plot(pZ$sdev^2/sum(pZ$sdev^2), ylab = 'Prop. Variance Explained')
image(t(img)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)
image(t(newimg)[,512:1], col = gray.colors(100, 0, 1, 1), axes = F, asp = 1)
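
To get a feeling for how much is saved by keeping only 25 components, we can count the numbers that need to be stored. This is a back-of-the-envelope sketch that assumes a 512 x 512 image (as in the plotting code above) and that scores, loadings, column means and column standard deviations are all stored as plain numbers.

```r
n <- 512; p <- 512   # rows and columns of the image
k <- 25              # number of principal components kept

full.storage <- n * p                        # all pixels
reduced.storage <- n * k + p * k + 2 * p     # scores + loadings + means and sds
reduced.storage / full.storage               # fraction of the original storage
```

Under these assumptions the reconstruction needs roughly a tenth of the numbers of the original image.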

You can now answer the questions on the third page of the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1117364).

## Limitations of PCA

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/32c8c376" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

Below we generate an artificial dataset with 8 point clouds.

In [None]:
set.seed(517)
library(MASS)
x1 <- mvrnorm(100, c(1, 1, 1), .1*diag(3))
x2 <- mvrnorm(100, c(-1, 1, 1), .1*diag(3))
x3 <- mvrnorm(100, c(1, -1, 1), .1*diag(3))
x4 <- mvrnorm(100, c(-1, -1, 1), .1*diag(3))
x5 <- mvrnorm(100, c(1, 1, -.8), .1*diag(3))
x6 <- mvrnorm(100, c(-1, 1, -.8), .1*diag(3))
x7 <- mvrnorm(100, c(1, -1, -.8), .1*diag(3))
x8 <- mvrnorm(100, c(-1, -1, -.8), .1*diag(3))
x <- rbind(x1, x2, x3, x4, x5, x6, x7, x8)
cols <- c(rep(1, 400), rep(2, 400))
library(plotly)
plot_ly(x = x[,1], y = x[,2], z = x[,3], type = 'scatter3d', mode = 'markers', marker = list(size = 2), color = cols)

Without scaling the data you get an image very similar to the one shown in the
slides.

In [None]:
pca <- prcomp(x)
order <- sample(1:800)
plot(pca$x[order, 1:2], col = cols[order], xlab = "PC1", ylab = "PC2")

With scaling before applying PCA the result looks a bit better, but there are
still overlaps of the point clouds after projection to the lower-dimensional
space.

In [None]:
x <- scale(x)
pca <- prcomp(x)
order <- sample(1:800)
plot(pca$x[order, 1:2], col = cols[order], xlab = "PC1", ylab = "PC2")
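
Why do the two colors overlap after the projection? The colors differ only in the third coordinate, and in the unscaled data this is the coordinate with the smallest variance (about $0.9^2 + 0.1$ compared to about $1 + 0.1$ for the other two), so the first two principal components roughly span the plane of the first two coordinates. The sketch below illustrates this on a freshly generated copy of the point clouds; it uses `rnorm` instead of `mvrnorm` and an arbitrary seed (both are assumptions made to keep the cell self-contained).

```r
set.seed(1)  # arbitrary seed for this sketch
# The 8 cluster centers: all combinations of x, y in {1, -1} and z in {1, -0.8}.
centers <- as.matrix(expand.grid(x = c(1, -1), y = c(1, -1), z = c(1, -.8)))
# 100 points per center with isotropic Gaussian noise of variance 0.1.
pts <- centers[rep(1:8, each = 100), ] + sqrt(.1) * matrix(rnorm(800 * 3), ncol = 3)

apply(pts, 2, var)            # the z coordinate has the smallest variance
pca.demo <- prcomp(pts)
round(pca.demo$rotation, 2)   # PC3 points almost exactly along z
```

After scaling, all three coordinates have variance 1, so there is no clearly preferred direction any more and the first two components are no longer close to the plane of the first two coordinates.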

The result is very different when we apply t-SNE instead.

In [None]:
library(Rtsne)
tsne <- Rtsne(x)
plot(tsne$Y, col = cols)

## Principal Component Regression

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/f20de63b" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

The code in this section may look complicated. Don't worry if you don't
understand the details. I included the code for transparency, so that you can see
how I produced the figures in the slides. It is totally fine if you do not want
to spend time on reading it. The only thing that you may want to remember is
that the `pls` package (loaded with `library(pls)`) contains the function `pcr`
to do principal component regression; however, the Lasso or Ridge Regression is
usually preferable.

In the following cell we define a data generator function that samples
potentially low-dimensional, hidden predictors `z` and linearly transforms them
to a higher-dimensional representation `x` using the transformation matrix `P`.
For the first dataset we leave the transformation matrix `P` as the identity
matrix, i.e. `z` is as high-dimensional as `x`, but we will use a sparse `beta`,
i.e. a `beta` with some zero entries. For the second artificial dataset we will
use a low-dimensional `z` and a dense `beta`.
The response `y` is then sampled following the standard assumptions of multiple
linear regression (a linear function of `x` plus Gaussian noise).
The function `bias.and.variance` will be used to compute the bias and variance
of the different methods, and the functions `glm.fit` and `pcr.fit` fit the data
with the Lasso and PCR, respectively.

In [None]:
data.generator <- function(beta, n = 50, p = 45, P = diag(rep(1, p))) {
    z <- matrix(rnorm(n * nrow(P)), nrow = n)
    x <- z %*% P
    y = x %*% beta + rnorm(n)
    list(x = x, y = y, data = data.frame(X = x, Y = y))
}
bias.and.variance <- function(beta, beta.hats, x0) {
    m.beta.hats <- rowMeans(beta.hats)
    list(test.error = mean(colMeans((x0 %*% as.matrix(sweep(beta.hats, 1, beta)))^2)),
         bias = mean(rowSums(x0 %*% (beta - m.beta.hats))^2),
         variance = mean(colMeans(((x0 %*% as.matrix(sweep(beta.hats, 1, m.beta.hats))))^2)))
}
library(glmnet)
# note: this definition masks stats::glm.fit in the global environment
glm.fit <- function(data, ...) as.matrix(coef(glmnet(data$x, data$y, intercept = F, ...)))[seq(2, ncol(data$x)+1)]
library(pls)
pcr.fit <- function(data, ...) as.matrix(coef(pcr(Y ~ ., data = data$data, ...)))
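
The idea behind `bias.and.variance` is that the error of the fitted coefficients on test inputs splits exactly into a squared bias and a variance term when we average over many training sets. The following sketch checks this on a small problem with ordinary least squares in base R; the sizes, the seed and all names are hypothetical and chosen only to keep the example fast and self-contained.

```r
set.seed(1)
p <- 3; n <- 30; R <- 200                  # hypothetical small problem
beta <- c(1, -2, 0.5)                      # true coefficients
x0 <- matrix(rnorm(100 * p), ncol = p)     # test inputs

# Each column is one least-squares estimate of beta from a fresh training set.
beta.hats <- replicate(R, {
    x <- matrix(rnorm(n * p), ncol = p)
    y <- x %*% beta + rnorm(n)
    drop(solve(crossprod(x), crossprod(x, y)))
})
m <- rowMeans(beta.hats)                   # average estimate

err   <- mean((x0 %*% (beta.hats - beta))^2)  # error of the predictions (without noise)
bias2 <- mean((x0 %*% (m - beta))^2)          # squared bias
vari  <- mean((x0 %*% (beta.hats - m))^2)     # variance
c(error = err, bias2 = bias2, variance = vari, sum = bias2 + vari)
```

The first and the last number should agree. Note that this error does not contain the irreducible noise of the response.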

In this cell we fit multiple training sets with the Lasso for different values
of the regularization parameter `lambda`, and we repeat the same for PCR with
different numbers of components. Here we use data with 45-dimensional
predictors and a sparse `beta`.

In [None]:
set.seed(123)
beta <- rnorm(45)
beta[sample(45, 10)] <- 0             # like this we obtain a sparse beta, where 10 entries are 0
x0 <- data.generator(beta, n = 500)$x # test data
lambdas <- 10^seq(-5, 0, length = 20) # lambdas to try
res1 <- sapply(lambdas, function(lambda) {
                bias.and.variance(beta, data.frame(replicate(500, glm.fit(data.generator(beta), alpha = 1, lambda = lambda))), x0) })
res1 <- data.frame(res1)
ncomps <- seq(1, 45, 2)
res2 <- sapply(ncomps, function(ncomp) {
                bias.and.variance(beta, data.frame(replicate(500, pcr.fit(data.generator(beta), ncomp = ncomp))), x0) })
res2 <- data.frame(res2)

par(mfrow = c(1, 2))
plot(log10(lambdas), unlist(res1[1,]), main = "Lasso", type = 'l', col = 'red', ylim = c(0, 20), ylab = 'MSE')
lines(log10(lambdas), unlist(res1[2,]), col = 'blue')
lines(log10(lambdas), unlist(res1[3,]))
legend("topleft", c("bias^2", "variance", "test"),
       col = c("blue", "black", "red"), lty = 1)

plot(ncomps, unlist(res2[1,]), main = "PCR", type = 'l', col = 'red', ylim = c(0, 20), ylab = "MSE")
lines(ncomps, unlist(res2[2,]), col = 'blue')
lines(ncomps, unlist(res2[3,]))

The plots above show typical curves for bias, variance and test error as a function
of the flexibility of the method. The Lasso is most flexible for `lambda = 0` (=
unregularised linear regression), and we see that already for `lambda = 1e-5`, the
smallest value we tried, the bias is smallest and the variance is highest. PCR is
most flexible when all components are used (= unregularised linear regression) and
decreases in flexibility with a decreasing number of components. The lowest test
error is lower for the Lasso than for PCR.
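
That PCR with all components is the same as unregularised linear regression can be checked with a hand-written version of PCR. The sketch below is not the implementation used by the `pls` package; `manual.pcr` is a hypothetical helper that regresses the response on the first `ncomp` scores and maps the coefficients back to the original predictors. Data sizes, coefficients and seed are arbitrary assumptions.

```r
set.seed(1)
n <- 50; p <- 5
x <- scale(matrix(rnorm(n * p), ncol = p), scale = FALSE)  # centered predictors
y <- drop(x %*% c(1, -1, 2, 0, 0.5) + rnorm(n))
y <- y - mean(y)                                           # centered response

manual.pcr <- function(x, y, ncomp) {
    pca <- prcomp(x, center = FALSE)                       # x is already centered
    Z <- pca$x[, 1:ncomp, drop = FALSE]                    # first ncomp scores
    gamma <- solve(crossprod(Z), crossprod(Z, y))          # least squares on the scores
    drop(pca$rotation[, 1:ncomp, drop = FALSE] %*% gamma)  # back to the original predictors
}

beta.ols <- drop(solve(crossprod(x), crossprod(x, y)))     # ordinary least squares
max(abs(manual.pcr(x, y, p) - beta.ols))                   # all components: same as OLS
round(rbind(ols = beta.ols, pcr2 = manual.pcr(x, y, 2)), 2)  # fewer components differ
```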

Next, we run the same analysis for data with 10-dimensional predictors embedded
in a 45-dimensional space (if you want to visualize this situation think of data
points that lie in a plane that is arbitrarily oriented in 3D space).

In [None]:
set.seed(123)
beta <- rnorm(45)
P <- matrix(rnorm(10*45), nrow = 10) # this transforms the predictors from 10 to 45 dimensions.
x0 <- data.generator(beta, n = 500, P = P)$x
lambdas <- 10^seq(-5, 0, length = 20)
res1 <- sapply(lambdas, function(lambda) {
                bias.and.variance(beta, data.frame(replicate(500, glm.fit(data.generator(beta, P = P), alpha = 1, lambda = lambda))), x0) })
res1 <- data.frame(res1)
ncomps <- seq(1, 45, 3)
res2 <- sapply(ncomps, function(ncomp) {
                bias.and.variance(beta, data.frame(replicate(500, pcr.fit(data.generator(beta, P = P), ncomp = ncomp))), x0) })
res2 <- data.frame(res2)

par(mfrow = c(1, 2))
plot(log10(lambdas), unlist(res1[1,]), main = "Lasso", type = 'l', col = 'red', ylim = c(0, 5), ylab = 'MSE')
lines(log10(lambdas), unlist(res1[2,]), col = 'blue')
lines(log10(lambdas), unlist(res1[3,]))
legend("topleft", c("bias^2", "variance", "test"),
       col = c("blue", "black", "red"), lty = 1)

plot(ncomps, unlist(res2[1,]), main = "PCR", type = 'l', col = 'red', ylim = c(0, 5), ylab = "MSE")
lines(ncomps, unlist(res2[2,]), col = 'blue')
lines(ncomps, unlist(res2[3,]))

As expected, the bias of PCR vanishes as soon as at least 10 components are included.
Including more than 10 components leads to higher variance.
The Lasso reaches the same minimal test error for many values of `lambda`.
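
The reason why 10 components are enough is that the 45-dimensional predictors only vary in 10 directions. The following sketch generates predictors in the same way as `data.generator` with a 10 x 45 matrix `P` and counts the components with non-zero standard deviation; the seed and the tolerance `1e-8` are arbitrary assumptions.

```r
set.seed(1)
n <- 50
P <- matrix(rnorm(10 * 45), nrow = 10)   # maps 10 hidden dimensions to 45
z <- matrix(rnorm(n * 10), nrow = n)     # hidden 10-dimensional predictors
x <- z %*% P                             # observed 45-dimensional predictors

sdev <- prcomp(x)$sdev
round(sdev, 3)
sum(sdev > 1e-8)                         # number of directions with variance
```

Only the first 10 standard deviations should be noticeably different from zero.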

## Exercises

### Conceptual

**Q1.**

(a) Estimate from the figure below the principal component loadings $\phi_{11}, \phi_{21}, \phi_{12}, \phi_{22}$.

In [None]:
set.seed(1)
data = data.frame(as.matrix(data.frame(X1 = rnorm(400, sd = 6), X2 = rnorm(400, sd = 4))) %*% matrix(rnorm(4), 2, 2))
data = scale(data, scale = F)
plot(data, xlim = c(-30, 30), ylim = c(-15, 15))

(b) In the figure below you see the scores of $n = 40$ measurements and the loadings of a principal component analysis. How many features $p$ were measured in this data set?

In [None]:
set.seed(913)
data = data.frame(as.matrix(data.frame(X1 = rnorm(40, sd = 6), X2 = rnorm(40, sd = 4), X3 = rnorm(40, sd = 2), X4 = rnorm(40, sd = 2))) %*% matrix(rnorm(16), 4, 4))
pca = prcomp(data)
biplot(pca, scale = 0)

(c) Estimate from the figure in (b) the loadings of the first two principal components and the scores of data point 28.

(d) In which direction does the data in the figure in (b) vary the most?

**Q2.** In this exercise you will explore the connection between PCA and SVD.

(a) In the lecture we said that the first principal component is the eigenvector $\phi_1$ with the largest eigenvalue $\lambda_1$ of the matrix $X^T X$. From linear algebra you know that a real, symmetric matrix $A$ can be diagonalized such that $A = W \Lambda W^T$ holds, where $\Lambda$ is a diagonal matrix that contains the eigenvalues and $W$ is an orthogonal matrix that satisfies $W^T W = I$, where $I$ is the identity matrix. The columns of $W$ are the eigenvectors of $A$. Let us assume we have found the diagonalization $X^T X = W \Lambda W^T$. Multiply this equation from the right with $W$ and show that the columns of $W$ contain the eigenvectors of $X^T X$.

(b) The singular value decomposition (SVD) of $X$ is $X = U\Sigma V^T$ where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. Use the singular value decomposition to express the right-hand-side of $X^TX = ...$ in terms of $U$, $V$ and $\Sigma$, show how $U, V, \Sigma, W$ and $\Lambda$ relate to each other and prove the statements on slide 11 of the lecture. *Hint*: the product of two diagonal matrices is again a diagonal matrix.

### Applied

**Q3.** Look at the following three artificial data sets.

In [None]:
data1 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*3), nrow = 2))
data2 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*3), nrow = 2) + 0.1*rnorm(50*3))
data3 <- data.frame(matrix(rnorm(50*2), nrow = 50) %*% matrix(rnorm(2*10), nrow = 2))

(a) Plot the datasets `data1` and `data2`. Hint: you can find an example in
section "Limitations of PCA" of this notebook.

(b) Write down your expectations for the curve of the proportion of variance
explained for the three datasets.

(c) Perform PCA and plot the proportion of variance explained.

**Q4.** In this exercise we look at a food consumption dataset with the relative
consumption of certain food items in European and Scandinavian countries. The
numbers represent the percentage of the population consuming that food type.
You can load the data with the following code.

In [None]:
data <- read.csv(url("https://openmv.net/file/food-consumption.csv"))
data.sc <- scale(na.omit(data[,2:21]))

(a) Perform PCA on this dataset and produce two biplots: one with PC1 versus
PC2 and another with PC1 versus PC3. Hint: see `?biplot`.

(b) Which are the important variables in the first three components?

(c) What does it mean for a country to have a high score in the first component?
Pick one country that has a high score in the first component and verify that
this country does indeed have this interpretation.

**Q5.** (optional) PCA has some nice history in human face recognition research
(see for example [here](https://en.wikipedia.org/wiki/Eigenface)). In this
exercise you will look at a small dataset of face images and perform an
analysis similar to the one we did with the image of the boat in section
"Proportion Variance Explained". However, instead of treating the rows of a
single image as individual data points, each face image is treated as a single
data point by transforming the two-dimensional image into one vector
(concatenating the columns of the pixel matrix into one long vector). You can
load the dataset with the following code.

In [None]:
download.file('https://lcnwww.epfl.ch/bio322/att_faces.zip', 'faces.zip')
unzip('faces.zip')
library(png)
faces <- sapply(list.files('att_faces', full.names = T), readPNG)
data <- t(faces)

To look at a single image you can use the following function.

In [None]:
show.image <- function(x) image(t(matrix(x, nrow = 112))[,112:1], col = gray.colors(256), axes = F, asp = 1)
show.image(data[101,])

(a) Perform PCA on this dataset and plot the first few principal components by
using the `show.image` function above.

(b) Plot the curve of the proportion of variance explained.

(c) Plot some of the original images together with their low-dimensional
reconstruction based on a reasonable number of principal components.