# Dimensionality Reduction
## Data Compression
- Speeds up the learning algorithm
- Reduces the memory needed to store the data
- Trade-off: the compressed features are harder to interpret

![W8-DATA-COMP1](Plots/W8-DATA-COMP1.png)
![W8-DATA-COMP2](Plots/W8-DATA-COMP2.png)

## Principal Component Analysis
The goal: reduce the dimension (number of attributes) of the dataset from $R^n$ to $R^k$, with $k < n$
- Find $k$ directions (vectors $u^{(1)}, \dots, u^{(k)} \in R^n$) onto which to project the data so as to **minimize the projection error**
    - The projection error is the distance between each data point and its projection onto those directions
    
PCA is not linear regression
- Linear Regression: minimize the squared **vertical distance** between the points and the line
- PCA: minimize the squared **orthogonal distance** between the points and the line
    - And there is no y to be predicted
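
To make the difference concrete, here is a minimal Octave sketch that computes both distances for a single point. The point `p` and the line $y = x$ are hypothetical values chosen only for illustration.

```octave
% Hypothetical 2-D point and a line through the origin with unit direction u
p = [3; 1];
u = [1; 1] / sqrt(2);            % direction of the line y = x

% Vertical distance (what linear regression penalizes):
% compare p's y-value with the line's y-value at the same x
vertical = abs(p(2) - p(1));     % on y = x, the line's y-value at x = p(1) is p(1)

% Orthogonal distance (what PCA penalizes):
% length of what is left after projecting p onto u
proj = (u' * p) * u;
orthogonal = norm(p - proj);
```

For this point the orthogonal distance is shorter than the vertical one, which is why the two methods generally fit different lines to the same data.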

### Implementation
Preprocessing:
- Mean Normalization/Feature Scaling
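
A minimal sketch of this preprocessing step, assuming the data is stored as an $n \times m$ matrix with one sample per column (the layout used later in these notes). The data values and the names `mu`, `sd`, and `X_norm` are illustrative assumptions.

```octave
% Hypothetical data: n = 2 features (rows), m = 4 samples (columns)
X = [1 2 3 4; 10 20 30 40];

mu = mean(X, 2);                          % per-feature mean, n-by-1
sd = std(X, 0, 2);                        % per-feature standard deviation, n-by-1

% Subtract the mean and divide by the standard deviation, feature by feature
X_centered = bsxfun(@minus, X, mu);
X_norm = bsxfun(@rdivide, X_centered, sd);
```

After this step every feature has zero mean and unit standard deviation, so no single feature dominates the covariance matrix because of its units.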

Compute the new directions: reduce $R^n$ to $R^k$ 
- Compute the "covariance matrix": $\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T$ (an $n \times n$ matrix)
- Compute the "eigenvectors" of matrix $\Sigma$ (the columns of $U$, also an $n \times n$ matrix)
    - Use **Singular Value Decomposition**: `[U,S,V] = svd(Sigma)`
- If we want $k$ dimensions, we select the first $k$ columns of the $U$ matrix (an $n \times k$ matrix, $U_{reduced}$)
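
Why the SVD can stand in for an eigendecomposition here: the covariance matrix is symmetric and positive semidefinite, so its singular values coincide with its eigenvalues and the columns of $U$ are its eigenvectors. Written out, with $u^{(j)}$ denoting the $j$-th column of $U$ (notation assumed for this note):

$$
\Sigma = U S U^T, \qquad \Sigma u^{(j)} = S_{jj} u^{(j)}, \qquad S_{11} \geq S_{22} \geq \dots \geq S_{nn} \geq 0
$$

Because the diagonal of $S$ is sorted in decreasing order, the first $k$ columns of $U$ are the $k$ directions with the largest variance.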

Project the original data onto the new dimensions ($z^{(i)} \in R^k$)
- $X_{PCA} = U_{reduced}^T X$
- $X$ is an $n \times m$ matrix
    - Each column represents a sample and each row is a dimension
- Output: a $k \times m$ matrix whose columns are the $z^{(i)}$

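```octave
% X is n-by-m: each column is a mean-normalized sample
Sigma = (1/m) * (X * X');   % n-by-n covariance matrix
[U, S, V] = svd(Sigma);     % columns of U are the principal directions
Ureduce = U(:, 1:k);        % keep the first k directions (n-by-k)
Z = Ureduce' * X;           % k-by-m compressed data
```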

### Reconstruction from compressed representation
Map the compressed $z \in R^k$ back to an approximation of the original $x \in R^n$: $x_{approx} = U_{reduced} z$
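
A minimal sketch of the round trip for one sample, assuming `Ureduce` has already been computed. The 2-D values below are hypothetical and chosen so the result is easy to check by hand.

```octave
% Hypothetical 2-D example compressed to k = 1
Ureduce = [1; 1] / sqrt(2);   % n-by-k matrix of principal directions (assumed given)
x = [3; 1];                   % a mean-normalized sample in R^n

z = Ureduce' * x;             % compressed representation in R^k
x_approx = Ureduce * z;       % reconstruction in R^n, on the line spanned by Ureduce
```

The reconstruction is the projection of `x` onto the retained directions, so `x_approx` equals `x` only when `x` already lies in that subspace.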

Some variance is lost in the process. The fraction of variance retained is $\frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}}$, and the fraction lost is one minus this ratio
- $S$ comes from the Singular Value Decomposition of the covariance matrix
- $S$ is a square diagonal matrix; its diagonal elements are $S_{ii}$

### Choose Number of Principal Components ($k$)
Criterion: keep as much of the variance in the data as possible

#### Method 1: % Variance Retained
Two Terms
- Average Squared Projection Error: $\frac{1}{m} \sum_{i=1}^m ||x^{(i)} - x^{(i)}_{approx}||^2$
    - $x^{(i)}_{approx}$ comes from the reconstruction from the compressed representation
- Total Variation in the data: $\frac{1}{m} \sum_{i=1}^m ||x^{(i)}||^2$

Set the threshold of **retaining 99% of variance**:
- Average Squared Projection Error/Total Variation $\leq (1-99\%)$, or:
- $\frac{\frac{1}{m} \sum_{i=1}^m ||x^{(i)} - x^{(i)}_{approx}||^2}{\frac{1}{m} \sum_{i=1}^m ||x^{(i)}||^2} \leq 0.01$

Try increasing values of $k$, starting from $k = 1$, and stop at the **smallest $k$ that meets the threshold**
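
The search can be sketched as a loop, assuming mean-normalized data stored as an $n \times m$ matrix with one sample per column. The data values are hypothetical, and the variable names are illustrative.

```octave
% Hypothetical mean-normalized data: n = 3 features (rows), m = 4 samples (columns)
X = [2.0 -2.0  1.0 -1.0;
     1.9 -2.1  1.2 -1.0;
     0.1 -0.1 -0.1  0.1];
[n, m] = size(X);

Sigma = (1/m) * (X * X');
[U, S, V] = svd(Sigma);

total_variation = sum(sum(X .^ 2)) / m;
for k = 1:n
  Ureduce = U(:, 1:k);
  X_approx = Ureduce * (Ureduce' * X);              % project, then reconstruct
  proj_error = sum(sum((X - X_approx) .^ 2)) / m;   % average squared projection error
  if proj_error / total_variation <= 0.01
    break;                                          % k is the smallest value meeting the threshold
  end
end
```

When the loop exits, `k` holds the chosen number of components. At `k = n` the projection error is zero up to rounding, so the loop always ends with a valid `k`.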

#### Method 2: Loss of Variance
Same logic as Method 1, but computed directly from $S$, so the SVD only has to be run once
- $1-\frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \leq 0.01$
- $S$ comes from the Singular Value Decomposition of the covariance matrix: `[U,S,V] = svd(Sigma)`
- $S$ is a square diagonal matrix; its diagonal elements are $S_{ii}$
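
A minimal sketch of this check. The vector `s` below is a hypothetical stand-in for `diag(S)`; in practice it would come from `[U,S,V] = svd(Sigma)`.

```octave
% Hypothetical diagonal of S, sorted in decreasing order as svd returns it
s = [5.0; 3.0; 1.5; 0.45; 0.05];

retained = cumsum(s) / sum(s);     % fraction of variance retained for k = 1..n
k = find(retained >= 0.99, 1);     % smallest k retaining at least 99% of the variance
```

Because `cumsum` evaluates every candidate $k$ at once, no reconstruction of the data is needed to choose $k$.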

### Advice
PCA is not a good way to prevent overfitting
- PCA does not take $y$ into account, so it may discard information that matters for prediction
- Regularization is a more suitable method

PCA should not be used by default
- Before implementing PCA, first try running whatever you want to do with the original data. Only if that doesn't do what you want should you implement PCA and consider using $z^{(i)}$