# Dimensionality Reduction and PCA

When the feature space is very high dimensional, a model is more prone to overfitting.

It is often desirable to reduce the dimensionality of the data for
- data visualization
- data exploration
- efficient resource use
- better generalization
- noise removal
- preprocessing

The two main approaches are
- projection
- manifold learning



## Projection

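To reduce dimensionality by projection, we project every data point onto a lower dimensional subspace and use its coordinates there as the new features. For example, to go from three features to two, we project each point onto a 2D plane and keep its two coordinates in that plane. The new features are not exactly the same as the original features.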

Projection is not well suited to many datasets. In the Swiss roll dataset, for example, projecting the points onto a plane squashes the layers of the roll on top of each other, so we gain no information and lose a lot of the relative information about the data points.

## Manifold Learning

In manifold learning we model the manifold on which the training instances lie. This relies on the manifold hypothesis, which holds that most real-world high dimensional datasets lie close to a much lower dimensional manifold. It is usually accompanied by another implicit assumption: that the task at hand will be simpler if it is expressed in the lower dimensional space of the manifold.



## Data Projection

Recall that if
\begin{equation}
x \in R^D, \quad w \in R^D, \quad ||w|| = 1
\end{equation}

then $(w^Tx)w$ is the orthogonal projection of $x$ onto $w$.
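As a small illustration of the formula above, here is a minimal NumPy sketch. The helper name `project_onto_direction` and the example vectors are made up for this note; the function normalizes `w` first so that the assumption $||w|| = 1$ holds.

```python
import numpy as np

def project_onto_direction(x, w):
    """Orthogonal projection of x onto the line spanned by w."""
    w = w / np.linalg.norm(w)      # enforce ||w|| = 1
    return (w @ x) * w             # (w^T x) w

x = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])
x_proj = project_onto_direction(x, w)
residual = x - x_proj              # the part of x that the projection discards
```

The residual is orthogonal to `w`, which is what makes the projection orthogonal.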

## Dimensionality Reduction

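Dimensionality reduction is the problem of defining a map $M: R^D \to R^k$, with $k < D$, that sends each data point to a lower dimensional representation.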

## Principal Component Analysis

PCA is a data-driven, unsupervised method. Given a sample

$$S = (x_1, \dots, x_n)$$

we want to derive a dimensionality reduction defined by a linear map M. PCA can be derived from several perspectives; here we give a geometric derivation.

### PCA Problem Definition

First consider k = 1. For a unit vector w, the reconstruction error of a point x is
$$||x - (w^Tx)w||^2$$

We want to find the direction p that allows the best reconstruction of the training set, i.e. the minimal reconstruction error.

Let $S^{D-1} = \{w \in R^D \mid ||w|| = 1\}$ be the unit sphere in D dimensions. Then the empirical reconstruction error minimization problem is
\begin{equation}
\min_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n ||x_i - (w^Tx_i)w||^2
\end{equation}

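The solution p is known as the first principal component. Because $||x_i - (w^Tx_i)w||^2 = ||x_i||^2 - (w^Tx_i)^2$ when $||w|| = 1$, and $||x_i||^2$ does not depend on w, this is equivalent to
\begin{equation}
\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T x_i)^2
\end{equation}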
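The equivalence between the two problems can be checked numerically. The sketch below uses random Gaussian data and a random unit vector, both chosen only for illustration, and confirms that the mean reconstruction error and the mean squared projection always add up to a quantity that does not depend on w.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # rows are the samples x_i
w = rng.normal(size=3)
w /= np.linalg.norm(w)                   # put w on the unit sphere

coeffs = X @ w                           # w^T x_i for each i
recon = np.outer(coeffs, w)              # (w^T x_i) w for each i
recon_error = np.mean(np.sum((X - recon) ** 2, axis=1))
projected_energy = np.mean(coeffs ** 2)
total_energy = np.mean(np.sum(X ** 2, axis=1))   # independent of w
```

Since `recon_error + projected_energy` equals `total_energy` for every unit vector, minimizing the first term is the same as maximizing the second.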

## Reconstruction and Variance

We assume that the data has been centered, so that $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i = 0$. We can then interpret the average $\frac{1}{n}\sum_{i=1}^n (w^Tx_i)^2$ as the variance of the data in the direction of w. Thus the first principal component p is the direction along which the data has maximum variance.

### Centering

If the data points are not centered, we should instead consider
\begin{equation}
\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T (x_i- \bar{x}))^2
\end{equation}
which is the same as
\begin{equation}
\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T x_i^c)^2
\end{equation}

with $x_i^c = x_i - \bar{x}$.
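A short NumPy sketch of the centering step, using made-up data with a nonzero mean and an arbitrary unit vector. It checks that the centered objective is exactly the variance of the projections $w^T x_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 3))   # hypothetical data with a nonzero mean
w = np.array([1.0, 2.0, 2.0]) / 3.0      # unit vector: sqrt(1 + 4 + 4) = 3

x_bar = X.mean(axis=0)
X_c = X - x_bar                          # centered samples x_i^c
objective = np.mean((X_c @ w) ** 2)      # (1/n) sum_i (w^T x_i^c)^2
variance_along_w = np.var(X @ w)         # variance of the projections w^T x_i
```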


## Projecting

There are infinitely many choices of direction for projecting the data, so we need a criterion to pick one. We want to find the direction with minimum reconstruction error, i.e. minimum loss of variance. Equivalently, we can find the direction with maximum variance.

### Preserving the variance

Before you project the training set onto a lower dimensional hyperplane, you first need to choose the right hyperplane. In the k = 1 case above, that hyperplane is the line spanned by the direction w.

## Eigen Problem

A further manipulation shows that the PCA optimization problem corresponds to an eigenvalue problem.

Using the symmetry of the inner product, we can see that
\begin{equation}
\frac{1}{n}\sum_{i=1}^n (w^Tx_i)^2 = 
\frac{1}{n}\sum_{i=1}^n w^T x_i w^T x_i = 
\frac{1}{n}\sum_{i=1}^n w^T x_i x_i^T w =
w^T \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^T\right) w
\end{equation}

This leads to the problem
$$\max_{w \in S^{D-1}} w^T C_n w, \quad C_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$$

whose solution is the eigenvector of $C_n$ with the largest eigenvalue.
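The maximizer of $w^T C_n w$ over the unit sphere is the eigenvector of $C_n$ with the largest eigenvalue. The sketch below illustrates this with NumPy on made-up centered data; the comparison against random unit vectors is only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical centered data: 500 samples in R^3 with unequal spread per axis.
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)

C_n = (X.T @ X) / X.shape[0]             # (1/n) sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(C_n)   # eigh returns eigenvalues in ascending order
p = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
top_value = p @ C_n @ p                  # equals the largest eigenvalue

# No random unit vector should beat p on the objective w^T C_n w.
W = rng.normal(size=(1000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
random_values = np.einsum("ij,jk,ik->i", W, C_n, W)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because $C_n$ is symmetric.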

## Principal Components

How do we find the principal components of a training set?

We can use Singular Value Decomposition (SVD), which decomposes the centered training set matrix X into the product of three matrices $U \Sigma V^T$, where the columns of V are the unit vectors that define all the principal components we are looking for.

- U is an n by k matrix with orthonormal columns
- V is a D by k matrix with orthonormal columns
- $\Sigma$ is a k by k diagonal matrix

The columns of U and V are the left and right singular vectors, and the diagonal entries of $\Sigma$ are the singular values.
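A minimal sketch of obtaining the principal components with NumPy's SVD, on made-up data. It assumes the thin SVD (`full_matrices=False`), for which k = min(n, D), and cross-checks the result against the eigendecomposition of $C_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # hypothetical data
X_centered = X - X.mean(axis=0)                           # PCA assumes centered data

# Thin SVD: U is (n, k), s holds the singular values, Vt is (k, D).
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
first_pc = Vt[0]                    # row j of Vt (column j of V) is the j-th component

# Cross-check against the eigendecomposition of C_n.
C_n = X_centered.T @ X_centered / X_centered.shape[0]
eigvals, eigvecs = np.linalg.eigh(C_n)
top_eigvec = eigvecs[:, -1]
```

Note that `np.linalg.svd` returns $V^T$, not V, and that the squared singular values divided by n are the eigenvalues of $C_n$.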

## The Direction of Principal Components

For each principal component, PCA finds a unit vector, anchored at the origin, pointing in the direction of that component. Since two opposite unit vectors lie on the same axis, the sign of the selected unit vector may flip, but it will still lie on the same axis.

### Projecting down to d dimensions

Once you have identified all of the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components.

This hyperplane preserves as much variance as possible.
$$X_{d\text{-proj}} = XW_d$$

where $W_d$ is the matrix whose columns are the first d principal components.
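The projection can be sketched in a few lines of NumPy. The data and the choice d = 2 are arbitrary; `W_d` holds the first d principal components as columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # hypothetical data
X_centered = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
d = 2
W_d = Vt[:d].T                  # (D, d): first d principal components as columns
X_d_proj = X_centered @ W_d     # (n, d): coordinates in the reduced space
```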


## Explained Variance Ratio

The explained variance ratio indicates the proportion of the dataset's variance that lies along each principal component. It can tell us when we have enough principal components. We can also plot the explained variance against the number of dimensions and see where we get diminishing returns.
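One way to compute the ratio is from the singular values of the centered data, since the variance along the j-th component is proportional to the j-th squared singular value. In the sketch below the data and the 95% threshold are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data whose variance is concentrated in the first few axes.
X = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
X_centered = X - X.mean(axis=0)

_, s, _ = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained_variance_ratio)

threshold = 0.95                                   # assumed target, not from the notes
d = int(np.argmax(cumulative >= threshold)) + 1    # smallest d reaching the threshold
```

Plotting `cumulative` against the number of components gives the curve in which to look for diminishing returns.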

## PCA for compression

We can also use PCA to compress data, since most of the variance is preserved by data points with far fewer dimensions.
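A sketch of compression and lossy decompression with NumPy. The data here is synthetic (3 latent factors embedded in 10 dimensions plus small noise), so keeping d = 3 components is known in advance to be enough; real data would need the explained variance to guide that choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 3 latent factors embedded in 10 dimensions plus small noise.
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(400, 10))
mean = X.mean(axis=0)
X_centered = X - mean

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
W_d = Vt[:3].T                              # keep d = 3 components

X_compressed = X_centered @ W_d             # (400, 3): the stored representation
X_recovered = X_compressed @ W_d.T + mean   # (400, 10): lossy reconstruction

relative_error = np.linalg.norm(X - X_recovered) / np.linalg.norm(X_centered)
```

The reconstruction is not exact: whatever lay along the dropped components is lost.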


## Randomized PCA

In scikit-learn's `PCA`, we can set `svd_solver` to `"randomized"`, which quickly finds an approximation of the first d principal components.
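A minimal sketch using scikit-learn's `PCA` class, which exposes the `svd_solver` parameter. The synthetic data (5 strong factors in 50 dimensions) and d = 5 are assumptions chosen so that the approximation can be compared with the exact solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 5 strong latent factors in 50 dimensions plus noise.
latent = rng.normal(size=(1000, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(1000, 50))

d = 5
rnd_pca = PCA(n_components=d, svd_solver="randomized", random_state=0)
X_reduced = rnd_pca.fit_transform(X)

# Exact solver on the same data, for comparison only.
full_pca = PCA(n_components=d, svd_solver="full").fit(X)
```

On data like this, with a clear gap after the fifth component, the two solvers should agree closely; with a flatter spectrum the approximation can be looser.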


## Incremental PCA

Standard PCA must use the entire training set at once to find the principal components, so it does not scale as well as Incremental PCA, where we split the training set into mini-batches and update the principal components one mini-batch at a time.
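A sketch of the mini-batch approach with scikit-learn's `IncrementalPCA`. The small in-memory array stands in for a dataset that would not fit in memory, and the number of batches is arbitrary.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))             # hypothetical stand-in for a large dataset

n_batches = 10
inc_pca = IncrementalPCA(n_components=5)
for X_batch in np.array_split(X, n_batches):
    inc_pca.partial_fit(X_batch)            # update the estimate with one mini-batch

X_reduced = inc_pca.transform(X)
```

Each mini-batch must contain at least as many samples as `n_components`.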


## Kernel PCA

With PCA, the implicit assumption is that the data lies close to a linear subspace. We can apply the kernel trick to PCA so that it performs complex nonlinear projections for dimensionality reduction.

kPCA is good for preserving clusters of instances after a projection, or sometimes even unrolling datasets that lie close to a twisted manifold.

Summary: kPCA makes it possible to perform complex nonlinear projections for dimensionality reduction.
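A minimal kPCA sketch using scikit-learn's `KernelPCA` on a generated Swiss roll. The RBF kernel and the value of `gamma` are illustrative, untuned choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# X has shape (500, 3); t is each point's position along the roll.
X, t = make_swiss_roll(n_samples=500, noise=0.1, random_state=0)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)  # gamma is an assumption
X_reduced = rbf_pca.fit_transform(X)
```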


### Selecting a Kernel and Tuning Hyperparameters

Because kPCA is unsupervised, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task, so we can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.
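The grid search idea can be sketched with a scikit-learn `Pipeline` that chains `KernelPCA` and a classifier. Everything specific here is an assumption for illustration: the binary label derived from the Swiss roll position, the logistic regression classifier, and the small parameter grid.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, t = make_swiss_roll(n_samples=300, noise=0.1, random_state=0)
y = t > np.median(t)          # hypothetical binary label derived from roll position

pipeline = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("clf", LogisticRegression()),
])
param_grid = {
    "kpca__kernel": ["rbf", "poly"],
    "kpca__gamma": np.linspace(0.03, 0.05, 3),
}
# Each kernel/gamma pair is scored by the downstream classifier's CV accuracy.
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)
best_params = grid_search.best_params_
```

The `kpca__` prefix routes each parameter to the pipeline step with that name.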

## LLE

Locally Linear Embedding

This is a __nonlinear__ dimensionality reduction technique. It is a manifold learning technique that does not rely on projections like the previous algorithms do.

It measures how each training instance linearly relates to its closest neighbors, then looks for a low dimensional representation of the training set where those local relationships are best preserved.

It is very good at unrolling twisted manifolds, especially when there is little noise.
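A minimal LLE sketch using scikit-learn's `LocallyLinearEmbedding` on a generated Swiss roll. The number of neighbors and the noise level are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# n_neighbors controls how many closest neighbors each instance is related to.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=0)
X_unrolled = lle.fit_transform(X)
```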