# Dimensionality Reduction

1. PCA
2. PCR
3. PLS
4. PCA vs. Factor Analysis

The lecture draws from Chapter 6 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."

---
# 1. Principal Component Analysis (PCA)


In the previous lectures we tried to reduce the dimensionality of $X$ by learning to choose which variables to keep in $X$ and which to remove (either by throwing them out of the model, or by making the regression coefficients so small that they functionally remove the variable from consideration). This is known as _feature selection_. 

However, another way to reduce the dimensionality of your model is to remove the correlation between the variables in $X$. You find what is called the "lower dimensional subspace" of your data, where $X$ is best explained by a smaller set of new variables that are uncorrelated with each other.

PCA is a fast and efficient method for finding the lower dimensional subspace of your data. The idea is to project your _n x p_ data matrix $X$ into a new _n x p_ matrix $Z$ where all _p_ variables in $Z$ are perfectly orthogonal to each other. 

For example, the figure below shows a case where you have two variables (i.e., _p=2_) that clearly have a strong correlation to each other.

![PCA example](imgs/L18_PCAexample.png)

Using PCA, we return a new matrix $Z$ that reflects two principal components (PCs). The first PC is the one that explains the most variance between the two variables (the green line in both panels). The second PC is the one that explains the second most amount of variance, which in this case is the residual between each observation and the first PC. 

Formally we say that $Z$ has _m_ columns, where $m \leq p$, each of which is a linear combination of the columns of the original predictors.

$$ Z_m = \sum_{j=1}^{p} \phi_{j,m} X_j $$

Each column in $Z$ is a new PC and the constants $\phi_{1,m}$, ..., $\phi_{p,m}$ are the weights or _loadings_ of each variable in $X$ onto the $m^{th}$ component in $Z$.
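
To make the projection concrete, here is a minimal sketch in Python with NumPy. The toy data, the variable names, and the choice to get the loadings from an SVD of the centered data are all assumptions for illustration; they are not taken from the lecture figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n = 200 observations of p = 2 correlated variables.
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Center the columns before computing the components.
Xc = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the loading vectors.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Phi = Vt.T                      # p x p, column m holds phi_{1,m}, ..., phi_{p,m}

# Scores: Z_m = sum_j phi_{j,m} X_j, which is Z = X Phi in matrix form.
Z = Xc @ Phi                    # n x p

# The columns of Z are uncorrelated and the loading vectors are orthonormal.
print(np.round(np.corrcoef(Z, rowvar=False), 6))
print(np.round(Phi.T @ Phi, 6))
```

The first printed matrix should be (numerically) the identity, which is what "perfectly orthogonal" means for the new variables in $Z$.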

<br>

## PCA algorithm

<br>

The algorithm for PCA iteratively goes through each component and removes the variance accounted for in $X$ by the previous components. 

_For the first component_ this boils down to solving the following optimization problem.

$$\phi_1 = \arg \max \{ \frac{\phi^T_1 X^T X \phi_1}{\phi^T_1 \phi_1} \} $$

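What this says is that you find the values of the vector $\phi_1$ that maximize the expression above. When the columns of $X$ are centered, that expression is proportional to the variance of the data projected onto $\phi_1$.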

_For all other components_ you need to subtract the contribution of the first _k-1_ PCs from $X$ and re-solve. So for PC _k_, where _k>1_, you make a new data matrix $\hat{X}_k$

$$ \hat{X}_k = X - \sum_{s=1}^{k-1} X \phi_s \phi_{s}^T $$

and then solve for the next loading vector $\phi_k$ such that

$$\phi_k = \arg \max \{ \frac{\phi^T_k \hat{X}_k^T \hat{X}_k \phi_k}{\phi^T_k \phi_k} \} $$

If you play this out for all components, you'll see that by the time _k=p_, the final component explains all the residual variance not accounted for by the previous _p-1_ components. This means that all _m_ components, when _m=p_, together explain 100% of the variability in $X$.
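
The iterative procedure above can be sketched as follows. This is an illustration under assumptions, not the lecture's own implementation: the function name `pca_by_deflation` and the toy data are hypothetical, and each maximization is solved by taking the top eigenvector of $\hat{X}_k^T \hat{X}_k$, which is one standard way to maximize that ratio.

```python
import numpy as np

def pca_by_deflation(X):
    """Sketch of PCA that finds one loading vector at a time via deflation."""
    Xc = X - X.mean(axis=0)              # work with centered columns
    p = Xc.shape[1]
    Phi = np.zeros((p, p))               # column k will hold loading vector k
    for k in range(p):
        # Remove what the first k components already explain:
        # X_hat = X - sum_s X phi_s phi_s^T
        X_hat = Xc - Xc @ Phi[:, :k] @ Phi[:, :k].T
        # The ratio phi^T X_hat^T X_hat phi / phi^T phi is maximized by the
        # eigenvector of X_hat^T X_hat with the largest eigenvalue.
        # np.linalg.eigh returns eigenvalues in ascending order.
        _, eigvecs = np.linalg.eigh(X_hat.T @ X_hat)
        Phi[:, k] = eigvecs[:, -1]
    return Xc, Phi

# Hypothetical toy data with p = 4 variables of different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X[:, 1] += 0.5 * X[:, 0]

Xc, Phi = pca_by_deflation(X)
Z = Xc @ Phi

# Share of the total variance in X carried by each component.
var_explained = Z.var(axis=0) / Xc.var(axis=0).sum()
print(np.round(var_explained, 3))
print(round(float(var_explained.sum()), 6))
```

The shares should come out in decreasing order and sum to 1, matching the claim that all _p_ components together account for all the variability in $X$.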

So an easy way to reduce the dimensionality of $Z$ is to only include the first _m_ components that together explain a large portion (e.g., 90% or 95%) of the variance in $X$. Ideally $m \ll p$, which makes $Z$ a more compact data object to analyze.
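
One way to pick _m_ from a variance threshold is sketched below. The helper name `n_components_for`, the 90% default, and the simulated data (10 variables driven by 2 hidden sources plus noise) are assumptions made for this example.

```python
import numpy as np

def n_components_for(X, threshold=0.90):
    """Smallest m whose first m PCs explain at least `threshold` of the variance."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance explained by each component.
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)
    # First position where the cumulative share reaches the threshold.
    m = int(np.searchsorted(cumulative, threshold)) + 1
    return min(m, len(cumulative)), cumulative

# Hypothetical data: p = 10 observed variables generated from 2 hidden sources.
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(300, 10))

m, cumulative = n_components_for(X, threshold=0.90)
print(m, np.round(cumulative[:3], 3))
```

Because the simulated variables are driven by only two sources, a small _m_ is enough here, which is the $m \ll p$ situation described above.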

<br>

### Assumption 


It is worth noting that PCA assumes that $X$ is filled with quantitative variables. Standard PCA doesn't work when $X$ is a mixture of quantitative and qualitative variables, or just qualitative variables.

---
# 2. Principal Component Regression (PCR)

Let's see how we can use PCA to help us in a simple regression context. Given a set of predictor variables $X$, where $X$ is _n x p_, let's use PCA to find the first _m_ components that explain 90% of the variance of $X$. This gives us an _n x m_ matrix $Z$.

We can now use $Z$ in place of $X$ in our regression model, since its columns are uncorrelated quantitative predictors that linear least squares regression can handle directly.

$$\hat{y}_i = \hat{\theta}_0 + \sum_{m=1}^{M} \hat{\theta}_m z_{i,m}$$

Here we are using the notation $\hat{\theta}$ instead of $\hat{\beta}$ to indicate the estimated regression coefficients because we want to be clear that this is estimated from the output of the PCA dimensionality reduction, rather than from $X$ itself. This is known as _principal component regression_ (PCR).

Now the beauty of PCA being a linear projection of the original data is that we can see how the equation above boils down to a modification of OLS regression. Since $Z_m = \sum_{j=1}^{p} \phi_{j,m} X_j$, we can rewrite the equation above as

$$\sum_{m=1}^{M} \hat{\theta}_m z_{i,m} = \sum_{m=1}^{M} \hat{\theta}_m \sum_{j=1}^p \phi_{j,m} x_{i,j}$$

$$= \sum_{m=1}^{M} \sum_{j=1}^p \hat{\theta}_m \phi_{j,m} x_{i,j} $$

$$ = \sum_{j=1}^p \hat{\beta}_{j} x_{i,j}$$

Thus, using PCR, you can map the fit back to coefficients on the original predictors (when $M = p$ these match the OLS solution) as

$$ \hat{\beta}_j = \sum_{m=1}^{M} \hat{\theta}_m \phi_{j,m} $$

Pay attention to the use of the "^" symbol that we use to indicate parameters that are estimated. Since PCA is used at the beginning to find the lower dimensional subspace of $X$, without looking at $Y$, the only free parameters that need to be fit in the model are the regression coefficients $\theta$.
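
The steps of PCR, including the mapping from $\hat{\theta}$ back to $\hat{\beta}$, can be sketched as follows. The function name `pcr_fit` and the simulated data are hypothetical, and the loadings are taken from an SVD of the centered predictors, which is an implementation choice for this example.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Sketch of PCR: PCA on X, least squares on the first M scores."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Phi = Vt[:M].T                          # p x M loadings (no use of y here)
    Z = Xc @ Phi                            # n x M scores
    # Fit theta_0 and theta_1..theta_M by ordinary least squares on Z.
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    theta = coef[1:]
    beta = Phi @ theta                      # beta_j = sum_m theta_m * phi_{j,m}
    intercept = coef[0] - x_mean @ beta     # undo the centering of X
    return intercept, beta

# Hypothetical data: 5 predictors, two of them strongly correlated.
rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] += 0.9 * X[:, 0]
y = 2.0 + X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)

# With all p components, PCR reproduces the OLS coefficients.
b0_full, beta_full = pcr_fit(X, y, M=p)
ols_coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
print(np.allclose(beta_full, ols_coef[1:]))

# With fewer components the coefficients are constrained to a subspace.
b0_2, beta_2 = pcr_fit(X, y, M=2)
print(np.round(beta_2, 3))
```

Note that `Phi` is computed before `y` is ever touched, so the only quantities fit to the response are the entries of `theta`.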

<br>
    
### Bias-Variance Tradeoff

Just like in OLS regression, the flexibility of a PCR model is determined by the number of predictor variables used. Increasing _m_ (i.e., increasing the number of PCs in your model) increases flexibility, which lowers bias but raises variance. Just like with OLS, you'd find the optimal number of components in your model using cross-validation approaches.
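
A minimal sketch of choosing the number of components by K-fold cross-validation is shown below. Everything here is an assumption for illustration: the helper names `pcr_fit` and `cv_mse`, the use of 5 folds, and the simulated data. For simplicity the loadings are re-estimated inside each training fold.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Compact PCR fit: returns the intercept and coefficients on the original X."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Phi = Vt[:M].T
    A = np.column_stack([np.ones(len(y)), Xc @ Phi])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    beta = Phi @ coef[1:]
    return coef[0] - x_mean @ beta, beta

def cv_mse(X, y, M, n_folds=5, seed=0):
    """Average held-out mean squared error of PCR with M components."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        b0, beta = pcr_fit(X[train], y[train], M)
        pred = b0 + X[test_idx] @ beta
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Hypothetical data: 8 predictors, the last 4 are noisy copies of the first 4.
rng = np.random.default_rng(5)
n, p = 120, 8
X = rng.normal(size=(n, p))
X[:, 4:] = X[:, :4] + 0.2 * rng.normal(size=(n, 4))
y = X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)

scores = [cv_mse(X, y, M) for M in range(1, p + 1)]
best_M = int(np.argmin(scores)) + 1
print(np.round(scores, 3), best_M)
```

The same fold assignment is reused for every value of `M` (fixed `seed`), so the comparison across model sizes is made on identical splits.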

---
# 3. Partial least squares (PLS) regression

As mentioned in the last section, PCR only fits one class of parameters: the regression coefficients that best explain the relationship between $Z$ and $Y$. But there might be better ways of finding the lower dimensional subspace of $X$ that take into consideration its relationship with $Y$. This is the idea behind _partial least squares (PLS)_ regression.

PLS is a supervised learning approach where $Y$ is considered during the dimensionality reduction step. 

Where PCR minimizes this objective function over $\hat{\theta}$ alone, with the loadings $\phi$ fixed in advance by PCA,

$$  \sum_{i=1}^n (y_i - \hat{\theta}_0 - \sum_{m=1}^{M} \sum_{j=1}^p \hat{\theta}_m \phi_{j,m} x_{i,j})^2 $$ 

PLS minimizes this objective function, where the loadings $\hat{\phi}$ are estimated as well

$$ \sum_{i=1}^n (y_i - \hat{\theta}_0 - \sum_{m=1}^{M} \sum_{j=1}^p \hat{\theta}_m \hat{\phi}_{j,m} x_{i,j})^2 $$ 

Thus, PLS searches for values of both $\theta$ _and_ $\phi$ that give the best fit to $Y$. In the end, PLS finds a subspace of $X$ whose components covary strongly with $Y$, rather than one that only captures the variance in $X$.

Now the output from PLS can be transformed in the same way as PCR. 

$$ \hat{\beta}_j = \sum_{m=1}^{M} \hat{\theta}_m \hat{\phi}_{j,m} $$
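
For a single response, one common way to compute PLS is an iterative scheme that builds one direction at a time from the covariance between the (deflated) predictors and $Y$. The sketch below follows that scheme; it is not necessarily the exact procedure used to make the lecture figure. The function name `pls1_fit` and the data are hypothetical, and the predictors are only centered here, not standardized.

```python
import numpy as np

def pls1_fit(X, y, M):
    """Sketch of PLS for one response: returns intercept, beta, Phi, theta."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk = X - x_mean                        # working copy, deflated each round
    yc = y - y_mean
    p = X.shape[1]
    W = np.zeros((p, M))                   # directions in the deflated X
    P = np.zeros((p, M))                   # regressions of X columns on each score
    theta = np.zeros(M)
    for m in range(M):
        w = Xk.T @ yc                      # direction driven by covariance with y
        w /= np.linalg.norm(w)
        z = Xk @ w                         # score vector Z_m
        zz = z @ z
        P[:, m] = Xk.T @ z / zz
        theta[m] = yc @ z / zz             # regress y on Z_m
        Xk = Xk - np.outer(z, P[:, m])     # remove Z_m from the predictors
        W[:, m] = w
    Phi = W @ np.linalg.inv(P.T @ W)       # loadings expressed on the original X
    beta = Phi @ theta                     # beta_j = sum_m theta_m * phi_{j,m}
    return y_mean - x_mean @ beta, beta, Phi, theta

# Hypothetical data with 4 predictors.
rng = np.random.default_rng(6)
n, p = 100, 4
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]
y = 1.0 + X @ np.array([0.5, 1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=n)

# In-sample residual sum of squares for M = 1, ..., p components.
rss = []
for M in range(1, p + 1):
    b0, beta, Phi, theta = pls1_fit(X, y, M)
    rss.append(float(np.sum((y - b0 - X @ beta) ** 2)))
print(np.round(rss, 3))
```

Unlike the PCA loadings, the first PLS direction is proportional to $X^T y$ (for centered data), so it depends on the response from the very first step.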

To see how PCR and PLS perform when projected back into the original data space, we return to the original data example shown in the PCA section above. The plot below shows the estimated regression line $\hat{\beta}$ that best explains the relationship between Population and Ad spending from PCR (dotted line) and PLS (solid line).

![PCR vs. PLS](imgs/L18_PCR_v_PLS.png)

This gives a qualitative sense of the subtle, but important, differences between these two methods.

---
# 4. PCA vs. Factor Analysis

To some of you, PCA might sound very similar to another dimensionality reduction method called [Factor Analysis](https://en.wikipedia.org/wiki/Factor_analysis). The logic of Factor Analysis is quite similar to PCA: $X$ can be described by a set of _m_ factors (F) that are linearly related to $X$. However, Factor Analysis works in a different way.

In factor analysis, we begin with the assumption of how many factors describe the lower dimensional space in $X$. In other words, **we assume _m_.** 

Given _m_ factors, we can redescribe $X$ as 

$$ X^T = LF + \epsilon$$

Here $X$ is the _n x p_ original data matrix, $L$ is the _p x m_ loading matrix, and $F$ is the _m x n_ matrix of factors.

Notice that the form of the equation above resembles the description of the PCs in PCA: $Z_m = \sum_{j=1}^{p} \phi_{j,m} X_j$. This is where the logic of the two methods overlaps.

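Beyond assuming a set value for _m_, the implementation is also very different. In factor analysis the goal is to find the loading matrix $L$ whose product $LL^T$ best resembles the covariance matrix of your data (i.e., $X^TX$, up to scaling, when the columns of $X$ are centered). Writing $\lVert \cdot \rVert^2$ for the sum of squared entries of a matrix, this is

$$ L = \arg \min \{ \lVert X^T X - LL^T \rVert^2 \} $$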

Notice that this is very different from the objective function of PCA. Essentially, Factor Analysis works by trying to recreate the covariance matrix of your data. As a result, the factors in $F$ are not ordered the way PCs are ordered in PCA (i.e., the first factor doesn't necessarily explain the most variance in $X$).
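
The covariance-matching objective can be illustrated with a small sketch. It rests on several assumptions: the data are simulated, the covariance is scaled by $1/(n-1)$, the loss is the sum of squared entries of the difference, and the minimizer is built from the top _m_ eigenvectors of the covariance matrix (one valid solution of this simplified objective). Full factor analysis models usually also include a separate noise variance for each variable, which is left out here.

```python
import numpy as np

# Hypothetical data generated from m = 2 factors, following X^T = L F + noise.
rng = np.random.default_rng(4)
n, p, m = 500, 6, 2
F_true = rng.normal(size=(m, n))
L_true = rng.normal(size=(p, m))
X = (L_true @ F_true + 0.1 * rng.normal(size=(p, n))).T    # n x p

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)                 # sample covariance matrix

def loss(L):
    """Sum of squared entries of S - L L^T."""
    return float(np.sum((S - L @ L.T) ** 2))

# One minimizer: top-m eigenvectors of S scaled by the square roots
# of their eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
top = np.argsort(eigvals)[::-1][:m]
L = eigvecs[:, top] * np.sqrt(eigvals[top])     # p x m

# Rotating the loadings leaves L L^T, and therefore the loss, unchanged.
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
L_rot = L @ R

print(loss(L), loss(L_rot))
print(np.round(np.sum(L**2, axis=0), 3), np.round(np.sum(L_rot**2, axis=0), 3))
```

Both loading matrices fit the covariance equally well, yet the variance carried by each factor (the column sums of squares) differs between them. This is why factors have no built-in ordering by variance explained, whereas PCs do.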

<br>

### PCA vs. Factor Analysis

You can think of the pros & cons of each method as follows

* PCA is better for exploratory analysis
* Factor Analysis is better for hypothesis testing
* PCA returns all components that explain all the variance in your data
* Factor Analysis assumes a subset of components and tries to best explain the covariance of your data