**Dimension reduction**: a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.

# Dimension Reduction Methods
Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of our original $p$ predictors. That is,

\begin{align}
Z_m=\sum_{j=1}^p\phi_{jm}X_j
\end{align}

for some constants $\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}$, $m = 1, \ldots, M$. We can then fit the linear regression model

\begin{align}
y_i=\theta_0+\sum_{m=1}^M\theta_m z_{im}+\epsilon_i, \quad i=1,\ldots,n
\end{align}

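**Dimension reduction**: this approach reduces the problem of estimating the $p+1$ coefficients $\beta_0, \beta_1, \ldots, \beta_p$ to the simpler problem of estimating the $M+1$ coefficients $\theta_0, \theta_1, \ldots, \theta_M$, where $M < p$. In other words, the dimension of the problem has been reduced from $p+1$ to $M+1$. The reduced model is a special case of the original linear regression model, with the coefficients $\beta_j$ constrained to take a particular form: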

\begin{align}
\sum_{m=1}^M\theta_m z_{im}&=\sum_{m=1}^M\theta_m \sum_{j=1}^p\phi_{jm}x_{ij}=\sum_{m=1}^M\sum_{j=1}^p\theta_m \phi_{jm}x_{ij}=\sum_{j=1}^p \beta_jx_{ij}  \\
\beta_j&=\sum_{m=1}^M\theta_m \phi_{jm}
\end{align}
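
A small numerical check of this identity, as a sketch on synthetic data. The loading matrix `Phi` below is an arbitrary random matrix chosen only for illustration (it is not a PCA or PLS solution), and all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 50, 4, 2

# Synthetic predictors and response (made-up data).
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=n)

# Hypothetical loadings phi_jm, stored as a p x M matrix.
Phi = rng.normal(size=(p, M))
Z = X @ Phi                      # z_im = sum_j phi_jm * x_ij

# Least squares fit of y on Z with an intercept: theta_0, theta_1..theta_M.
design = np.column_stack([np.ones(n), Z])
theta = np.linalg.lstsq(design, y, rcond=None)[0]
theta0, theta_m = theta[0], theta[1:]

# Implied coefficients on the original predictors: beta_j = sum_m theta_m * phi_jm.
beta = Phi @ theta_m

pred_from_Z = theta0 + Z @ theta_m
pred_from_X = theta0 + X @ beta
print(np.allclose(pred_from_Z, pred_from_X))
```

Both prediction vectors agree, so the model in the $M$ transformed variables is an ordinary linear model in the $p$ original predictors with constrained coefficients.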

**Dimension reduction methods work in two steps:**

1. Transform the predictors into $Z_1, Z_2, \ldots, Z_M$, which are linear combinations of the original features.
2. Fit the model using these $M$ new predictors.

The choice of $Z_1, Z_2, \ldots, Z_M$, or equivalently the selection of the $\phi_{jm}$'s, can be made in different ways. These notes cover two: principal components and partial least squares.

# Principal Components Regression

**PCA** is a technique for reducing the dimension of an $n \times p$ data matrix $X$.

## 1st Principal Component

### The maximum variance

**The first principal component** direction of the data is that along which there is **the greatest variability** in the data.

That is, if we projected the 100 observations onto this line, then the resulting projected observations would have **the largest possible variance**; projecting the observations onto any other line would yield projected observations with lower variance. 

<img src="./images/67.png" width=600>


### The minimum error
**The first principal component vector** also defines the line that is **as close as possible to the data**: it minimizes the sum of the squared perpendicular distances between each point and the line. Equivalently, if we reconstruct the original features from the single combined feature (the projections of the observations onto the fitted line), the total reconstruction error, measured by those squared perpendicular distances, is as small as possible.

<img src="./images/68.png" width=600>

***It turns out that "the maximum variance" and "the minimum error" are reached at the same time.***
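
A short sketch of why the two criteria coincide, assuming the observations $x_i$ have been centered and $\phi$ is a unit vector along the candidate line (this notation is introduced here only for the argument). By the Pythagorean theorem, each observation splits into its projection onto the line and a perpendicular residual:

\begin{align}
\|x_i\|^2 &= (\phi^T x_i)^2 + \|x_i - (\phi^T x_i)\phi\|^2 \\
\sum_{i=1}^n \|x_i\|^2 &= \sum_{i=1}^n (\phi^T x_i)^2 + \sum_{i=1}^n \|x_i - (\phi^T x_i)\phi\|^2
\end{align}

The left-hand side does not depend on $\phi$. The first term on the right is proportional to the variance of the projected observations, and the second is the total squared perpendicular distance. Since their sum is fixed, maximizing the first term minimizes the second.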

**In mathematics**, it turns out that the eigenvector of the covariance matrix with the **largest eigenvalue** gives the first principal component direction of the data set.
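
The following sketch checks these claims numerically on synthetic two-dimensional data (not the pop/ad data shown in the figures). It scans unit directions on a one-degree grid, which is an assumption made for simplicity, and compares the best direction with the top eigenvector of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, correlated 2-D data (made up for this example).
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)
Xc = X - X.mean(axis=0)                      # center each column

# Eigendecomposition of the sample covariance; eigh returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
v1 = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue

def projected_variance(d):
    """Sample variance of the data projected onto unit direction d."""
    return np.var(Xc @ d, ddof=1)

def reconstruction_error(d):
    """Sum of squared perpendicular distances to the line spanned by d."""
    proj = np.outer(Xc @ d, d)
    return np.sum((Xc - proj) ** 2)

# Candidate unit directions at 0, 1, ..., 179 degrees.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
variances = np.array([projected_variance(d) for d in dirs])
errors = np.array([reconstruction_error(d) for d in dirs])

best_var_dir = dirs[np.argmax(variances)]
best_err_dir = dirs[np.argmin(errors)]
print(best_var_dir, best_err_dir, v1)
```

The direction with the largest projected variance is also the one with the smallest reconstruction error, and both match `v1` up to sign and grid resolution.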

### 1st Principal Component Formula

The first principal component is given by the formula:

\begin{align}
Z_1 = 0.839 \times (\text{pop} - \overline{\text{pop}}) + 0.544 \times (\text{ad} - \overline{\text{ad}})
\end{align}


Here the weights $\phi_{11} = 0.839$ and $\phi_{21} = 0.544$ are the **principal component loadings**, which define the principal component direction referred to above.


>The idea is that out of every possible linear combination of pop and ad such that $\phi_{11}^2+\phi_{21}^2=1$, this particular linear combination yields the highest variance: i.e. this is the linear combination for which $\text{Var}(\phi_{11} \times (\text{pop} - \overline{\text{pop}}) + \phi_{21} \times (\text{ad} - \overline{\text{ad}}))$ is maximized.



**Principal Component Scores**


\begin{align}
z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}})
\end{align}


The values of $z_{11}, \ldots, z_{n1}$ are known as the **principal component scores**. The first principal component score can be seen in the right-hand panel of Figure 6.15 as the distance in the x-direction of the $i$th cross from zero. That is, if we project the $n$ data points $x_{1}, \ldots, x_{n}$ onto this direction, the **projected values** are the principal component scores $z_{11}, \ldots, z_{n1}$ themselves.
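
As an illustration of how scores are computed, here is a minimal sketch on made-up data. The arrays `pop` and `ad` are synthetic stand-ins, so the loadings will not equal 0.839 and 0.544. The sign of a loading vector is arbitrary, so the code fixes it by convention.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
# Synthetic stand-ins for the pop and ad columns (values are invented).
pop = rng.normal(40.0, 10.0, size=n)
ad = 0.6 * pop + rng.normal(0.0, 4.0, size=n)

X = np.column_stack([pop, ad])
Xc = X - X.mean(axis=0)                  # center each column

# SVD of the centered data: the rows of Vt are the loading vectors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]
if phi1[0] < 0:                          # sign convention: positive weight on pop
    phi1 = -phi1

# Scores written out as in the formula: z_i1 = phi_11*(pop_i - mean) + phi_21*(ad_i - mean).
z1 = phi1[0] * (pop - pop.mean()) + phi1[1] * (ad - ad.mean())

print(phi1, z1[:5])
```

The same scores are obtained by the matrix product `Xc @ phi1`, that is, by projecting each centered observation onto the first principal component direction.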

<img src="./images/104.png" width=400>


We can think of the values of the principal component $Z_1$ as single-number summaries of the joint pop and ad budgets for each location. In this example, if $z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}}) < 0$, then this indicates a city with below-average population size and below-average ad spending.

<img src="./images/69.png" width=650>

In this case, Figure 6.14 indicates that pop and ad have approximately a linear relationship, and so we might expect that a single-number summary will work well. 

## 2nd Principal Component

The second principal component $Z_2$ is a linear combination of the variables that is **uncorrelated** with the first principal component $Z_1$, and has the largest variance subject to this constraint.

It turns out that the zero-correlation condition of $Z_1$ with $Z_2$ is equivalent to the condition that the second direction must be perpendicular, or orthogonal, to the first principal component direction.

<img src="./images/67.png" width=600>


The second principal component is given by the formula:

\begin{align}
Z_2 = 0.544 \times (\text{pop} - \overline{\text{pop}}) - 0.839 \times (\text{ad} - \overline{\text{ad}}).
\end{align}
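
A quick check of the orthogonality claim, as a sketch. The first part uses the two loading vectors quoted in the formulas for $Z_1$ and $Z_2$. The second part uses synthetic data (an assumption of this example) to show that orthogonal directions obtained from PCA give uncorrelated scores.

```python
import numpy as np

# Loadings quoted in the text for Z_1 and Z_2.
phi1 = np.array([0.839, 0.544])
phi2 = np.array([0.544, -0.839])
dot = phi1 @ phi2                        # orthogonal directions have zero dot product
norm_sq = phi1 @ phi1                    # close to 1, up to rounding of the quoted values

# Synthetic correlated data to check that PCA scores are uncorrelated.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=500)
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = eigvecs[:, ::-1]                     # columns ordered by decreasing eigenvalue
Z = Xc @ V                               # scores on the first and second components
score_corr = np.corrcoef(Z, rowvar=False)[0, 1]

print(dot, norm_sq, score_corr)
```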

## More PCs

With two-dimensional data, we can construct at most two principal components. 

However, if we had other predictors, then additional components could be constructed. They would **successively maximize variance, subject to the constraint of being uncorrelated with the preceding components**.

In theory, there is one PC per variable. In practice, the number of PCs with nonzero variance is at most **the smaller of the number of variables $p$ and $n - 1$**, where $n$ is the number of samples (centering the data removes one degree of freedom).
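
A small sketch of this limit on synthetic data with fewer samples than variables. The tolerance used to decide which singular values count as nonzero is an assumption of this example. After centering, the data matrix loses one degree of freedom, so the count comes out as $\min(n-1, p)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 10                             # fewer samples than variables
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                  # centering makes the rows sum to zero

# Singular values of the centered data; each nonzero one corresponds to a PC
# with nonzero variance.
s = np.linalg.svd(Xc, compute_uv=False)
n_components = int(np.sum(s > 1e-10 * s[0]))

print(n_components)
```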

## Importance of each PC

By construction, the first component will contain the most information. **But how do we quantify the importance of each principal component?**

<img src="./images/70.png" width=600>
<img src="./images/71.png" width=600>

1. The variance of any projection is given by a weighted average of the eigenvalues.
2. Compute the total variance and the variance along each principal component.
3. The importance of each principal component is the share of the total variance (summed over all PCs) that this PC accounts for.
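
The steps above can be sketched as follows on synthetic data. The names `pve` (proportion of variance explained) and `cumulative_pve` are labels chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic data with correlated columns (invented for illustration).
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix are the variances along each PC.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in decreasing order

total_variance = eigvals.sum()
pve = eigvals / total_variance           # importance of each PC
cumulative_pve = np.cumsum(pve)

# Variance along an arbitrary unit direction d is a weighted average of the
# eigenvalues, with weights given by the squared components of d in the PC basis.
d = rng.normal(size=4)
d /= np.linalg.norm(d)
weights = (eigvecs.T @ d) ** 2           # nonnegative, sum to 1
var_along_d = np.var(Xc @ d, ddof=1)

print(pve, cumulative_pve)
```

A common use of `cumulative_pve` is to look for the smallest number of components that explains a chosen share of the variance. The threshold itself is a judgment call.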



# The Principal Components Regression (PCR)

The principal components regression (PCR) approach involves constructing **the first M principal components**, $Z_1, \ldots, Z_M$, and then using these components as the predictors **in a linear regression model** that is fit using least squares.
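
A minimal NumPy sketch of PCR, assuming centered (but not standardized) predictors and synthetic data. The function name `pcr_fit` is hypothetical, and this is one straightforward way to write the procedure rather than a reference implementation.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Fit PCR with M components.

    Returns the intercept and the implied coefficients on the original predictors.
    """
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    # Rows of Vt are the principal component directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Phi = Vt[:M].T                       # p x M loading matrix
    Z = Xc @ Phi                         # scores on the first M components
    y_mean = y.mean()
    # Least squares of the centered response on the scores.
    theta = np.linalg.lstsq(Z, y - y_mean, rcond=None)[0]
    beta = Phi @ theta                   # beta_j = sum_m theta_m * phi_jm
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(6)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, 0.5, 0.0, -1.0, 0.3]) + rng.normal(scale=0.5, size=n)

b0_2, beta_2 = pcr_fit(X, y, M=2)        # reduced model with two components
b0_pcr, beta_pcr = pcr_fit(X, y, M=p)    # using all p components

# Ordinary least squares on the original predictors, for comparison.
ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0]
print(np.allclose(np.r_[b0_pcr, beta_pcr], ols))
```

With $M = p$ no dimension reduction takes place, so PCR reproduces the ordinary least squares fit. The interesting cases are those with $M < p$.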


**Why PCR**

Often a small number of principal components is enough to explain most of the variability in the data, as well as the relationship with the response. And by estimating only $M \ll p$ coefficients, we can mitigate overfitting.


**Standardisation**

When performing PCR, we generally recommend **standardizing each predictor** prior to generating the principal components. In the absence of standardization, the **high-variance variables** will tend to play a **larger** role in the principal components obtained, and the scale on which the variables are measured will ultimately have an effect on the final PCR model.
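
The effect of scale can be seen in a small sketch on synthetic data, where one column is multiplied by 1000 (as if it were recorded in different units). The helper `first_loading` is a hypothetical name used only here.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
# Second column is on a much larger scale than the first.
X = np.column_stack([x1, 1000.0 * x2])

def first_loading(A):
    """First principal component loading vector of the columns of A."""
    Ac = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(Ac, full_matrices=False)
    return Vt[0]

raw_loading = first_loading(X)

# Standardize: subtract the mean and divide by the standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_loading = first_loading(X_std)

print(raw_loading, std_loading)
```

Without standardization the first component is almost entirely the large-scale column. After standardization both variables receive equal weight in absolute value, which is what always happens with two standardized, correlated variables.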
 

**Drawbacks**

We created the PCs to explain the original predictors. These PCs are identified in an **unsupervised** way, since the response $Y$ is not used to help determine the PC directions. There is therefore no guarantee that the directions that best explain the predictors are also related to the response or predict it well.


**Note**

Even though PCR provides a simple way to perform regression using $M < p$ predictors, it is **not a feature selection** method! This is because each of the $M$ principal components used in the regression is a **linear combination** of all $p$ of the original features.
> PCR is more closely related to ridge regression than to the lasso. One can even think of ridge regression as a continuous version of PCR!

------- 

**Example**

<img src="./images/72.png" width=650>

- As more principal components are used in the regression model, the bias decreases, but the variance increases. This results in a typical U-shape for the mean squared error. **Performing PCR with an *appropriate choice of the number of principal components* can result in a substantial improvement over least squares.**

> How to choose the number of PCs: In PCR, the number of principal components, $M$, is typically chosen by **cross-validation**.


- PCR will tend to do well in cases when the **first few principal components are sufficient to capture most of the variation** in the *predictors* as well as the relationship with the *response*.
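
Choosing $M$ by cross-validation, as mentioned above, can be sketched like this. The example uses 5-fold cross-validation on synthetic data driven by two latent factors. The fold count, the seeds, and the helper names `pcr_predict` and `cv_mse` are all assumptions of this sketch. Standardization and the PC directions are computed on the training folds only, so the held-out fold does not leak into the fit.

```python
import numpy as np

def pcr_predict(X_train, y_train, X_test, M):
    """Fit PCR with M components on the training data and predict the test data."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    Xs = (X_train - mu) / sd                       # standardize with training statistics
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    Phi = Vt[:M].T
    theta = np.linalg.lstsq(Xs @ Phi, y_train - y_train.mean(), rcond=None)[0]
    Z_test = ((X_test - mu) / sd) @ Phi
    return y_train.mean() + Z_test @ theta

def cv_mse(X, y, M, k=5, seed=0):
    """Mean squared prediction error of PCR with M components, by k-fold CV."""
    rng = np.random.default_rng(seed)              # same seed gives the same folds for every M
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = pcr_predict(X[train_mask], y[train_mask], X[test_idx], M)
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(8)
n, p = 120, 8
# Synthetic predictors generated from two latent factors plus noise.
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(n, p))
y = latent @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

cv_errors = {M: cv_mse(X, y, M) for M in range(1, p + 1)}
best_M = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_M)
```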

# Partial Least Squares (PLS)

Partial least squares (PLS) is a supervised alternative to PCR. Unlike PCR, PLS identifies these new features in a *supervised* way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response. **Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.**

**Steps**:

First PLS component:
1. Standardize all $p$ predictors.
2. Run a simple linear regression of the response $Y$ on each predictor $X_j$ separately.
3. Compute the first direction $Z_1$ by **setting each $\phi_{j1}$** equal to **the coefficient from the simple linear regression** of $Y$ onto $X_j$. This coefficient is proportional to the correlation between $Y$ and $X_j$. Hence, in computing $Z_1=\sum_{j=1}^p\phi_{j1}X_j$, PLS places the highest weight on the variables that are most strongly related to the response.

Second PLS component:
4. To identify the second PLS direction, we first regress each variable on $Z_1$ and take the residuals. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction $Z_1$.
5. We then compute Z2 using this orthogonalized data in exactly the same fashion as Z1 was computed based on the original data. 

Multiple PLS components:
6. Repeat this approach M times to identify multiple PLS components Z1,...,ZM. 
7. The tuning parameter M is typically chosen by cross-validation. 
8. At the end of this procedure, we use least squares to fit a linear model to predict Y using Z1,...,ZM
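
The steps above can be sketched in NumPy as follows. This is a literal transcription of the procedure as described in these notes, on synthetic data. The function name `pls_fit` is hypothetical, and library implementations of PLS may scale the weights differently, so treat it as an illustration rather than a reference implementation.

```python
import numpy as np

def pls_fit(X, y, M):
    """Compute M PLS components and regress y on them.

    Returns the component scores Z, the list of weight vectors, and the
    least squares coefficients (intercept first).
    """
    Xk = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # step 1: standardize
    yc = y - y.mean()
    Z = np.empty((X.shape[0], M))
    directions = []
    for m in range(M):
        # Steps 2-3: slope of the simple regression of y on each (centered) column.
        phi = (Xk.T @ yc) / np.sum(Xk ** 2, axis=0)
        z = Xk @ phi
        Z[:, m] = z
        directions.append(phi)
        # Step 4: regress each column on z and keep the residuals.
        Xk = Xk - np.outer(z, (z @ Xk) / (z @ z))
    # Step 8: least squares of y on Z_1..Z_M with an intercept.
    design = np.column_stack([np.ones(len(y)), Z])
    theta = np.linalg.lstsq(design, y, rcond=None)[0]
    return Z, directions, theta

rng = np.random.default_rng(9)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -1.0, 0.2]) + rng.normal(scale=0.3, size=n)

Z, directions, theta = pls_fit(X, y, M=2)

# The first weight vector is proportional to the correlations between y and each X_j.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
print(directions[0] / np.linalg.norm(directions[0]), corr / np.linalg.norm(corr))
```

Because each later component is built from residuals that are orthogonal to the earlier scores, the columns of `Z` are mutually orthogonal.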


<img src="./images/73.png" width=650>
PLS has chosen a direction that has less change in the ad dimension per unit change in the pop dimension, relative to PCA. This suggests that pop is more highly correlated with the response than is ad. 