# Lecture 2: Dimensionality Reduction

## Overview 

### Main goal

Finding information-preserving or "interesting" projections from high-dimensional feature space to low-dimensional space.

### Applications
 * compression (loss: reconstruction error)
 * feature selection (loss: classification/generalization error)
 * complexity reduction
 * signal recovery (noise removal)
 * data visualization

## Motivation

The following serve to showcase the **intrinsic lower-dimensionality** of high dimensional data (non-obvious at first glance, at least to me).

Most high-dimensional data (e.g. full-HD images of cats in $\mathbb{R}^{6,220,800}$) actually lie on a lower-dimensional non-linear manifold in this high-dimensional space.

### Example: Simple generative model

Look at an artificial dataset's pairwise distances as a function of the **dataset dimensionality**.

Hint: know the properties of Gaussians, and know them well!

Assume data vector $x = (x_1, \dots, x_D)^\top, \> x_d \sim \mathcal{N}(0, 1)$ (standard normal component distribution).

This means:

$$\mathbf{x}, \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Applying the difference/sum properties of Gaussian-distributed RVs, we get:

$$ \mathbf{x} - \mathbf{y} \sim \mathcal{N}(\mathbf{0}, 2\mathbf{I}) \iff 
\frac{1}{\sqrt{2}}(\mathbf{x} - \mathbf{y}) \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) 
$$

The first statement holds because $\mathbf{x}$ and $\mathbf{y}$ are independent, so their variances add. The equivalence holds because scaling an RV by $c$ scales its variance by $c^2$: dividing by $\sqrt{2}$ halves the variance.


**Definition:** The chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent **standard normal** random variables.

Taking the squared norm of the standardized difference, we get the following, knowing that the squared norm of a **standard** multivariate normal RV is $\chi^2(D)$-distributed (properties of Gaussians; $D$ is the dimensionality of our data):

$$ \frac{1}{2} \| \mathbf{x} - \mathbf{y} \|^2 \sim \chi^2(D) $$

Since the difference $\mathbf{x} - \mathbf{y}$ has covariance $2\mathbf{I}$, we halve its squared norm, hence the $\frac{1}{2}$: the definition of the $\chi^2$ distribution only applies to **standard normal** distributions (0 mean, unit variance).


$\chi^2$ is a special case of the $\Gamma$ distribution (written here with shape $k = \frac{D}{2}$ and scale $\theta = 2$):

$$ \chi^2(D) = \Gamma\left(\frac{D}{2}, 2\right) $$

We define the average squared difference between $\mathbf{x}$ and $\mathbf{y}$ (per dimension). Practically speaking, this is just a scaled version of the halved squared norm, $\delta := \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|^2$.

$$\Delta(x, y) := \frac{1}{D}\sum_{d=1}^{D}(x_d - y_d)^2 = \frac{2}{D}\delta $$

We know that $\delta$ is a chi-squared RV. We scale that and get the following ([properties of $\Gamma$ RVs](https://en.wikipedia.org/wiki/Gamma_distribution#Scaling); yeah, I don't know them by heart either :/):

$$\delta \sim \chi^2(D) \implies \Delta(x, y) = \frac{2}{D}\chi^2(D) \sim \Gamma\left( \frac{D}{2}, \frac{2}{D} \cdot 2 \right) = \Gamma\left( \frac{D}{2}, \frac{4}{D} \right)$$

So we know how the average squared difference (per dimension) between two datapoints is distributed.

Let's plot that as a function of $D$, the dimensionality of our data!

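Mean and variance as a function of $D$ (using the shape-scale formulas from Wikipedia, with shape $k = \frac{D}{2}$ and scale $\theta = \frac{4}{D}$):

$$ \mathbb{E}[\Delta] = k\theta = \frac{D}{2} \cdot \frac{4}{D} = 2 $$
$$ \mathbb{V}[\Delta] = k\theta^2 = \frac{D}{2}\left(\frac{4}{D}\right)^2 = \frac{8}{D} $$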

We notice that the variance shrinks as the dimensionality $D$ of our data increases!

**Result: we now have a way to verify whether the intrinsic dimensionality of the data is actually the same as its nominal dimensionality: compare the empirical distribution of $\Delta$ with the $\Gamma$ distribution predicted for $D$.**
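A minimal Monte Carlo sketch of the result above, assuming NumPy is available. It samples pairs of standard normal vectors and compares the empirical mean and variance of $\Delta$ with the predicted values $2$ and $\frac{8}{D}$. The function name, sample size, seed, and the chosen values of $D$ are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility


def mean_squared_difference(D, n_pairs=5000):
    """Sample n_pairs of (x, y) with i.i.d. N(0, 1) entries; return Delta per pair."""
    x = rng.standard_normal((n_pairs, D))
    y = rng.standard_normal((n_pairs, D))
    return np.mean((x - y) ** 2, axis=1)


results = {}
for D in (2, 10, 100, 1000):
    delta = mean_squared_difference(D)
    results[D] = (delta.mean(), delta.var())
    print(f"D={D:5d}  mean={delta.mean():.3f} (theory 2)  "
          f"var={delta.var():.4f} (theory {8 / D:.4f})")
```

The empirical mean should stay close to 2 for every $D$, while the empirical variance should shrink roughly like $\frac{8}{D}$.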

In [1]:
# TODO(andrei): Plot this!

Oil data: D = 12. The Gamma-distribution fit is not horrible, but not good either (higher STD than the theory would indicate). The intrinsic dimensionality is smaller than D (2 degrees of freedom for the gas-water-oil mixture).

Motion capture data: D = 102, no fit; intrinsic dimensionality must be smaller.

In [2]:
# TODO(andrei): Read corresponding section in Elements of Statistical Learning.

In [3]:
# TODO(andrei): StackOverflow answer comparing OLS and PCA.
# TLDR: OLS computes distances perpendicular to axes.
#       PCA computes distances perpendicular to model.

## Principal Component Analysis

**Goal:** Project data onto $K \le D$ dimensional space while maximizing variance of the projected data.

PCA seeks a space of lower dimensionality, known as the **principal subspace**, such that the **orthogonal** projection of the data points onto this subspace **maximizes the variance** of the projected points.

PCA minimizes the sum-of-squares of the projection errors.

The next few slides start from this goal, and show that **the optimal solution (which maximizes variance) is achieved via the eigendecomposition of the data covariance matrix.**

Our approach has **two** main objectives: ensuring that the variance of the lower-dimensional data is maximal, and that the projection error is minimal. The following two sections, A and B, will show that these objectives are actually equivalent!

### Objective A: Variance maximization

#### K = 1 (First principal direction)

Start with just one component, $u_1$. We take it to have unit length, since only its direction matters. The sample covariance matrix of the data is:

$$
\Sigma = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x}) (x_n - \bar{x})^T
$$

Mean of projected data: $u_1^T\bar{x}$ ($\bar{x}$ is the sample mean).

Proof:

$$
\frac{1}{N}\sum_{n=1}^N u_1^T x_n = u_1^T \left( \frac{1}{N} \sum_{n=1}^N x_n \right) = u_1^T \bar{x}
$$

Variance of projected data: $u_1^T\Sigma u_1$. (Similar proof.)
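One way to spell out the "similar proof" is sketched here, using only the definitions of $\bar{x}$ and $\Sigma$ given above:

$$
\frac{1}{N}\sum_{n=1}^N \left( u_1^T x_n - u_1^T \bar{x} \right)^2
= \frac{1}{N}\sum_{n=1}^N u_1^T (x_n - \bar{x})(x_n - \bar{x})^T u_1
= u_1^T \Sigma u_1
$$

The first step uses the fact that $u_1^T (x_n - \bar{x})$ is a scalar, so its square equals the scalar times its own transpose.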

The variance is what we want to maximize:

$$ u_1^{*} = \arg\max_{u_1} u_1^T \Sigma u_1, \quad \text{s.t.} \quad \|u_1\|_2=1 $$

Note that the constraint on the magnitude of $u_1$ is necessary, since otherwise the maximization would be trivial with $\| u_1 \|_2 \to \infty$.

This is a constrained optimization problem. To incorporate the constraint, we write its Lagrangian:

$$ \mathcal{L} := u_1^T \Sigma u_1 + \lambda \left(1 - \|u_1\|^2_2\right) = u_1^T \Sigma u_1 + \lambda \left(1 - u_1^Tu_1 \right)  $$

$$ \frac{\partial}{\partial{\mathbf{u}_1}}\mathcal{L} \overset{!}{=} 0 \iff 2\Sigma u_1 - 2\lambda u_1 = 0 \iff \Sigma u_1 = \lambda u_1$$

Does this look familiar?

The solutions to this equation are the eigenpairs of the $\Sigma$ matrix!

TODO(andrei): Explain this better. The formula for the variance comes from applying $u_1^T x_n$ to the formula for the variance of $X$.

$\lambda$ is the variance of the projected data: $\lambda = u_1^T \Sigma u_1$.

To maximize the variance, we just have to pick the eigenvector with the largest eigenvalue! It is called the **principal direction**:

$$\Sigma u_1 = \lambda u_1 \iff u_1^T \Sigma u_1 = \lambda u_1^T u_1 \overset{u_1^T u_1 = 1}{\iff} u_1^T \Sigma u_1 = \lambda
$$

The variance ($=u_1^T \Sigma u_1$) is thus maximized when we pick the eigenvector belonging to the largest eigenvalue $\lambda$.

Plain English explanation: Starting off with projecting onto just one direction for simplicity, we write the variance maximization objective. We then write its Lagrangian, since it's a constrained objective, and setting the Lagrangian's gradient to zero leads us to a closed form solution: the eigenvector equation itself!
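The claim can be checked numerically. The following is a small sketch on hypothetical toy data (the data, seed, and variable names are assumptions for illustration). It compares the variance along the top eigenvector of $\Sigma$ with the variance along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: N points in D = 3 dimensions with correlated features.
N, D = 500, 3
mixing = rng.standard_normal((D, D))
X = rng.standard_normal((N, D)) @ mixing.T      # rows are data points

X_centered = X - X.mean(axis=0)
Sigma = X_centered.T @ X_centered / N           # sample covariance, 1/N convention

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
u1 = eigenvectors[:, -1]                        # eigenvector of the largest eigenvalue
lambda_1 = eigenvalues[-1]

projected_variance = np.var(X @ u1)             # np.var also uses 1/N by default

# Variance u^T Sigma u along 1000 random unit directions u.
random_dirs = rng.standard_normal((1000, D))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_variances = np.einsum("id,de,ie->i", random_dirs, Sigma, random_dirs)

print("variance along u1:      ", projected_variance)
print("largest eigenvalue:     ", lambda_1)
print("best random direction:  ", random_variances.max())
```

The first two numbers should agree up to floating-point error, and no random direction should exceed them.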

#### K = 2 (Second principal direction)

Want to find a unit-length $u_2$ s.t. $u_2^Tu_1 = 0$ that maximizes the variance $ = u_2^T \Sigma u_2$.

We do this incrementally. We assume previous principal components are fixed, and choose new directions which maximize the projected variance **among all possible directions orthogonal to those already considered**.

Same as before, we optimize for the best $u_2$ by writing the Lagrangian.

$$ 
\mathcal{L} = u_2^T \Sigma u_2 + \lambda\left(1 - u_2^T u_2 \right) + 
\eta(u_2^T u_1)
$$

$$
\frac{\partial}{\partial u_2}\mathcal{L} = 2 \Sigma u_2 - 2\lambda u_2 + \eta u_1
\overset{!}{=} 0
$$
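A sketch of how this stationarity condition can be resolved (not taken from the slides). It assumes that $\Sigma$ is symmetric and that $u_1$ is a unit-length eigenvector of $\Sigma$, with its eigenvalue written here as $\lambda_1$. Left-multiplying the condition by $u_1^T$ gives:

$$
2 u_1^T \Sigma u_2 - 2\lambda u_1^T u_2 + \eta u_1^T u_1 = 0
$$

The first two terms vanish, because $u_1^T \Sigma u_2 = (\Sigma u_1)^T u_2 = \lambda_1 u_1^T u_2 = 0$, and $u_1^T u_1 = 1$. Therefore $\eta = 0$, and the condition reduces to:

$$
\Sigma u_2 = \lambda u_2
$$

So $u_2$ is again an eigenvector of $\Sigma$ with projected variance $u_2^T \Sigma u_2 = \lambda$. Among the eigenvectors orthogonal to $u_1$, the variance is maximized by the one with the second-largest eigenvalue.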

TODO(andrei): Why is it allowed to first omit the second constraint?

### Objective B: Error minimization

(shown to be formally equivalent to objective A)

(show "Thus any eigenvector will define a stationary point of the distortion measure.")

It can be shown that the distortion J for projecting down to K dimensions is:

$$
J = \sum_{i=K+1}^D \lambda_i
$$

The distortion, which depends on all the components we "cut off", is minimized by cutting off the directions with the smallest $D - K$ eigenvalues. We should therefore project onto the eigenvectors with the $K$ largest eigenvalues! And this solution is equivalent to the one reached by optimizing objective A.
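The formula for $J$ can be checked numerically. This is a sketch on hypothetical random data (seed, sizes, and variable names are illustrative assumptions). It takes $J$ to be the mean squared distance between the points and their rank-$K$ reconstructions, which matches the $\frac{1}{N}$ convention used for $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: rows are data points.
N, D, K = 1000, 5, 2
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))

x_bar = X.mean(axis=0)
X_centered = X - x_bar
Sigma = X_centered.T @ X_centered / N

# Sort eigenpairs by decreasing eigenvalue (eigh returns ascending order).
eigenvalues, U = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Orthogonal projection onto the span of the top-K eigenvectors.
U_K = U[:, :K]
X_tilde = x_bar + (X_centered @ U_K) @ U_K.T

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))  # mean squared projection error
discarded = eigenvalues[K:].sum()                # sum of the D - K smallest eigenvalues

print("distortion J:              ", J)
print("sum of discarded eigenvals:", discarded)
```

Both printed values should agree up to floating-point error.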

### Conclusion

Therefore, maximizing variance and minimizing error $\|x - \tilde{x}\|_2$ lead to the same solution, which is awesome for us! Our two main objectives do not diverge! (Which would have required painful tradeoffs.)

**The sum of the discarded eigenvalues is equal to the mean squared distance between the points and their projections** (the mean rather than the sum, because $\Sigma$ carries a $\frac{1}{N}$ factor). This is beautiful!

WHY THE FLIP DO THE SLIDES NOT CONTAIN THIS BEAUTIFUL CONCLUSION EXPLICITLY? I know it was likely stated during the lecture (haven't checked physical notes yet) and at the beginning of the process, but this is just silly. It should be between slides 26 and 27...

## PCA as a matrix factorization

Represent data as matrix: **columns are data points**, rows are features.

 1. Mean-center data.
 2. Compute the eigenvalue spectrum of the covariance matrix $\Sigma$.
 3. Take the eigenvectors with the K largest eigenvalues as the columns of matrix $U_K$. 
 4. Project X onto the space spanned by the eigenvectors and get approximation $Z_K$.
 5. K = D $\implies$ perfect reconstruction. **PCA is a Matrix Factorization**.
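The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the lecture's reference implementation: the function name `pca_factorize` is hypothetical, and it assumes that $Z_K$ denotes the $K \times N$ matrix of projection coefficients $U_K^T (X - \bar{x})$, so that the approximation of $X$ is $\bar{x} + U_K Z_K$.

```python
import numpy as np


def pca_factorize(X, K):
    """PCA of X with shape (D, N): columns are data points, rows are features.

    Returns the mean x_bar (D, 1), the basis U_K (D, K) and the coefficients
    Z_K (K, N), so that X is approximated by x_bar + U_K @ Z_K.
    """
    x_bar = X.mean(axis=1, keepdims=True)       # step 1: mean-center
    X_centered = X - x_bar
    Sigma = X_centered @ X_centered.T / X.shape[1]
    eigenvalues, U = np.linalg.eigh(Sigma)      # step 2: eigenvalue spectrum
    top = np.argsort(eigenvalues)[::-1][:K]     # step 3: K largest eigenvalues
    U_K = U[:, top]
    Z_K = U_K.T @ X_centered                    # step 4: project
    return x_bar, U_K, Z_K


rng = np.random.default_rng(3)
X = rng.standard_normal((4, 50))                # hypothetical data: D = 4, N = 50

# Step 5: with K = D the factorization reconstructs X exactly.
x_bar, U_full, Z_full = pca_factorize(X, K=4)
X_reconstructed = x_bar + U_full @ Z_full
print("perfect reconstruction for K = D:", np.allclose(X, X_reconstructed))

# With K < D we only get a low-rank approximation.
_, U_2, Z_2 = pca_factorize(X, K=2)
print("shapes for K = 2:", U_2.shape, Z_2.shape)
```

Using `np.linalg.eigh` rather than `np.linalg.eig` is a design choice: $\Sigma$ is symmetric, so `eigh` returns real eigenvalues and orthonormal eigenvectors.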

How to pick a good K? Look for knee in eigenvalue spectrum.

## Eigenpairs and eigendecomposition

(Mostly taken from Appendix C of Bishop's PRML.)

The eigendecomposition $A = U\Lambda U^T$ of a matrix arises naturally from (a) first selecting a set of orthonormal eigenvectors and (b) generalizing the eigenvector equation for all eigenvalues in matrix notation:

Let $U \in \mathbb{R}^{K \times K}$ be a matrix whose columns are orthonormal eigenvectors of a real symmetric matrix $A$ of rank $K$. ($A$ is a square $K\times K$ matrix.)

For the orthogonal matrix $U$ it holds that $UU^T = I \land U^TU = I \iff U^T = U^{-1}$.

We can then "stack" all the eigenvector equations as

$$ Au_i = \lambda_i u_i \mapsto AU = U \Lambda \iff
AUU^T = U \Lambda U^T \iff A = U \Lambda U^T
$$

Here $U$ can be interpreted as a rigid rotation of the coordinate system, possibly combined with a reflection (vectors aren't scaled and all angles remain unchanged).
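A short numerical sketch of these identities, assuming a randomly generated symmetric matrix (the matrix, seed, and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical symmetric matrix: B + B^T is symmetric for any square B.
B = rng.standard_normal((4, 4))
A = B + B.T

eigenvalues, U = np.linalg.eigh(A)   # columns of U are orthonormal eigenvectors
Lambda = np.diag(eigenvalues)

is_orthogonal = np.allclose(U.T @ U, np.eye(4)) and np.allclose(U @ U.T, np.eye(4))
stacked_equation_holds = np.allclose(A @ U, U @ Lambda)
decomposition_holds = np.allclose(A, U @ Lambda @ U.T)

# U preserves lengths and inner products (and therefore angles).
v = rng.standard_normal(4)
w = rng.standard_normal(4)
preserves_length = np.isclose(np.linalg.norm(U @ v), np.linalg.norm(v))
preserves_inner_product = np.isclose((U @ v) @ (U @ w), v @ w)

print(is_orthogonal, stacked_equation_holds, decomposition_holds)
print(preserves_length, preserves_inner_product)
```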