# Chapter 8 Notes: Dimensionality Reduction

## Intro
- There is usually a trade-off: dimensionality reduction speeds up training but tends to reduce model performance somewhat. 
- Sometimes dimensionality reduction eliminates noise and actually improves performance.
- useful for data visualization and avoiding the curse of dimensionality.
- two main methods: projection and manifold learning





## The Curse of Dimensionality
 - In high dimensional spaces almost all points will be "extreme" values. 
     - i.e. they lie near the borders of the state space
 - Additionally, randomly chosen points in a high-dimensional space are, on average, much farther from each other than points in a lower-dimensional space. This has two drawbacks:
     - It is much harder to sample a high-dimensional space, so far more training data is required
     - The relative variance of the distances between points shrinks, so all points look roughly equally far apart
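
Both effects can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from a unit hypercube; the function names, the `margin` of 0.001, and the sample sizes are illustrative choices, not taken from the chapter.

```python
import numpy as np

def mean_pairwise_distance(n_dims, n_pairs=10_000, seed=0):
    """Average Euclidean distance between random point pairs in the unit hypercube."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

def fraction_near_border(n_dims, margin=0.001, n_points=2_000, seed=0):
    """Fraction of random points lying within `margin` of any face of the hypercube."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, n_dims))
    near = ((pts < margin) | (pts > 1 - margin)).any(axis=1)
    return near.mean()

# Compare a few dimensionalities (hypothetical choices).
for d in (2, 3, 100, 1000):
    print(d, mean_pairwise_distance(d, n_pairs=1000), fraction_near_border(d))
```

Both printed quantities should grow with the number of dimensions: points drift apart and almost all of them end up near a border.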

## The Main Approaches for Dimensionality Reduction
### Projection
 - In real world data the training instances are not distributed evenly across all dimensions
     - some dimensions will have little variance
     - others will be highly correlated with other dimensions
 - the data may be restricted to a smaller dimensionality subspace of the original state space
 
### Manifold Learning
 - if the lower dimensional subspace is twisted or bent within the original higher dimensional space then we call this a manifold. 
     - projection will squash the manifold and not produce a good output
 - For example, the swiss roll is a 2D plane twisted in the 3rd dimension. 
 - Manifold Assumption - most real world datasets lie close to a much lower dim manifold. 
     - e.g. if you were to randomly generate images a vanishingly small number would look like digits. So the dimensionality of the MNIST dataset is restricted to some manifold. 
 - Reducing the dimensionality often simplifies the task for an ML algorithm (e.g. it yields a simpler decision boundary), but this is not always true. Sometimes the decision boundary is simpler in the original, higher-dimensional space. 

## PCA 
 - identifies the lower-dimensional hyperplane that lies closest to the data, then projects the data onto it. 
 
### Preserving the Variance
 - To choose the hyperplane to project onto, PCA picks the one that minimizes the mean squared distance between the original points and their projections onto the hyperplane. 
 - This is also the hyperplane that preserves the largest amount of variance in the data
 
### Principal Components
  - PCA finds the axis that accounts for the largest amount of variance in the data, then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance. It repeats this process once for each dimension of the data. 
  - Each identified axis is called a principal component (PC). It is represented by a unit vector lying along that axis. 
  - Singular Value Decomposition (SVD) - decomposes the training matrix __X__ into the product of three matrices, __U $\Sigma$ $V^T$__
  - __V__ contains the unit vectors of all the PCs as its columns
  - PCA assumes the data is centered around the origin. sklearn's PCA centers the data automatically, but if you compute the SVD yourself you must subtract the mean first. 
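
As a rough illustration of the SVD route, here is a minimal NumPy sketch. The toy dataset and the variable names (`X_centered`, `c1`, `c2`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 instances, 3 features, most variance along the first feature.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# Center the data manually, since we are calling SVD directly.
X_centered = X - X.mean(axis=0)

# np.linalg.svd returns V transposed, so the PCs are the ROWS of Vt.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

c1 = Vt[0]  # first principal component (unit vector)
c2 = Vt[1]  # second principal component, orthogonal to c1
```

The singular values in `s` come back sorted in decreasing order, so the rows of `Vt` are already ordered from the most to the least variance explained.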
  
### Projecting Down to d Dimensions
 - __$X_{d-proj}$__ = __X$W_d$__
 - Where __$X_{d-proj}$__ is the reduced-dimensionality form of __X__ and __$W_d$__ is the matrix containing the first d columns of __V__
 - __X__ is m by n, __V__ is n by n, __$W_d$__ is n by d. Output of __X$W_d$__ is m by d
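
The shapes above can be confirmed with a short NumPy sketch. The values of `m`, `n` and `d` and the random data are hypothetical, chosen only to make the dimensions visible.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 100, 5, 2                      # illustrative sizes
X = rng.normal(size=(m, n))
X_centered = X - X.mean(axis=0)          # center before SVD

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

W_d = Vt[:d].T                           # first d columns of V: n by d
X_d_proj = X_centered @ W_d              # projected data: m by d

print(W_d.shape, X_d_proj.shape)
```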
 
### Explained Variance Ratio
 - The proportion of variance which lies along each PC's axis
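
For reference, when the PCs come from the SVD of the centered data, the ratio for the i-th PC can be written in terms of the singular values $\sigma_i$ (the diagonal entries of $\Sigma$). This is a standard identity rather than something stated in these notes:

$$
\text{explained variance ratio}_i = \frac{\sigma_i^2}{\sum_{j} \sigma_j^2}
$$

The ratios over all PCs therefore sum to 1.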
 
### Choosing the Right Number of Dimensions
 - One common tactic is to keep enough dimensions to explain 95% of the variance.
 - in sklearn's PCA, setting n_components to a float between 0.0 and 1.0 makes it the ratio of variance to preserve, and the number of dimensions is chosen automatically. 
 - Alternatively you can plot the cumulative sum of explained variance as a function of d and look for an elbow. 
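
Both tactics can be sketched with sklearn as follows. The synthetic dataset is an assumption for illustration, and the plotting step is left out, so only the cumulative sum that you would plot is computed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 500 instances, 20 features with steadily decreasing variance.
X = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

# Option 1: fit a full PCA, then find the smallest d that reaches 95% cumulative variance.
pca = PCA().fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1

# Option 2: pass the target ratio directly and let sklearn pick d.
pca_95 = PCA(n_components=0.95).fit(X)

print(d, pca_95.n_components_)
```

The `cumsum` array is what you would plot against the number of dimensions to look for an elbow.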
 
### PCA for Compression
 - For MNIST, retaining 95% of the variance reduces the features from 784 to roughly 150.
 - Reconstruction Error - the mean squared distance between the original data and the data you get when you compress then decompress the data. In sklearn this is done with the inverse_transform method
 - inverse transform equation: __$X_{recovered}$__ = __$X_{d-proj}W_d^T$__
 - __$X_{d-proj}$__ is m by d, __$W_d^T$__ is d by n, the output will be m by n
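
A minimal sketch of compressing, decompressing and measuring the reconstruction error with sklearn is shown below. The random data and the choice of 3 components are illustrative assumptions. Note that sklearn's `inverse_transform` also adds the training mean back, which the equation above leaves out because it assumes centered data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical data: 300 instances, 10 features with decreasing variance.
X = rng.normal(size=(300, 10)) * np.linspace(4.0, 0.2, 10)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                 # compress: m by d
X_recovered = pca.inverse_transform(X_reduced)   # decompress: m by n

# Reconstruction error: mean squared distance between original and recovered instances.
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print(reconstruction_error)
```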
 
### Randomized PCA 
 - much faster than full PCA when d is much smaller than n. 
 - rapidly finds an approximation of d PCs
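
In sklearn the randomized solver is selected through the `svd_solver` argument of `PCA`. A minimal sketch, with hypothetical data sizes chosen so that d is much smaller than n:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 100))   # illustrative data: n = 100 features

# Approximate the first d = 5 PCs with the randomized solver.
rnd_pca = PCA(n_components=5, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)

print(X_reduced.shape)
```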
 
### Incremental PCA
 - Lets you run PCA on mini-batches, so the whole training set does not need to fit in memory at once.
 - requires calling the partial_fit() method on each mini-batch instead of fit()
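
A minimal sketch of the mini-batch loop, assuming an in-memory array split with `np.array_split` purely for illustration (in practice the batches would come from disk or a stream):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 20))   # hypothetical training set

n_batches = 10
inc_pca = IncrementalPCA(n_components=5)

# Feed one mini-batch at a time; each batch must contain at least n_components instances.
for X_batch in np.array_split(X, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X)
print(X_reduced.shape)
```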


## Kernel PCA
 - Can use kernels (linear, RBF, sigmoid, etc) to implicitly map the data into a higher-dimensional feature space before projecting it down to d dimensions. This makes nonlinear projections possible. 
 - This can help preserve clusters of instances in lower dimensional projections. 
 
### Selecting a Kernel and Tuning Hyperparameters
 - Kernel PCA is unsupervised, so there is no obvious performance measure. One option is to use the performance of a supervised learning algorithm trained on the output of kernel PCA to determine the best kernel and hyperparameters. 
 - Alternatively you can select the hyperparams which yield the lowest reconstruction error. 
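
The first approach can be sketched as a pipeline plus a grid search. The dataset, the binary target derived from the swiss roll position `t`, the threshold 6.9, and the parameter grid are all illustrative assumptions, not values prescribed by these notes.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical task: classify swiss roll points by their position along the roll.
X, t = make_swiss_roll(n_samples=300, noise=0.2, random_state=42)
y = t > 6.9

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

# Search over the kernel and its gamma, scoring by downstream classifier accuracy.
param_grid = [{
    "kpca__gamma": [0.03, 0.05],
    "kpca__kernel": ["rbf", "sigmoid"],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

For the reconstruction-error approach, `KernelPCA` needs `fit_inverse_transform=True` so that `inverse_transform` becomes available.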

## Locally Linear Embedding (LLE)
 - nonlinear 
 - manifold technique which does not rely on projections
 - measures how each training instance relates to its closest neighbors (cn)
 - looks for a low dim representation where these local linear relationships are preserved. 
 - For each training instance __x$^i$__ , LLE identifies its k closest neighbors and approximates __x$^i$__ as a linear function of those k instances
 - seeks to minimize the squared distance between __x$^i$__ and $\sum_{j=1}^{m} w_{i,j}x^j$
 - if $x^j$ is not one of the k closest neighbors of $x^i$ then the weight $w_{i,j}$ is set to 0
 - produces a weights matrix __W__
 - The second step is to project the instances into a d dimensional space while preserving as much of the local relationships as possible. 
 - __z__$^i$ is the image of __x__$^i$ in the d dimensional space
 - therefore minimize the squared distance between __z__$^i$ and $\sum_{j=1}^{m} \widehat{w}_{i,j}z^j$, this time keeping the weights fixed and optimizing the positions __z__$^i$
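
sklearn implements both steps in `LocallyLinearEmbedding`. Below is a minimal sketch that unrolls a swiss roll; the sample size, the noise level, and `n_neighbors=10` (the k above) are illustrative choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Hypothetical 3D swiss roll dataset.
X, t = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

# n_neighbors is k, n_components is the target dimensionality d.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)

print(X_unrolled.shape)
```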