# Unsupervised Learning

Unsupervised learning is a set of statistical tools intended for the setting in which we have only a set of features X1, X2, . . . , Xp measured on n observations. 
- We are **not interested in prediction**, because we do not have an associated response variable Y.
- Rather, the goal is to **discover interesting things** about the measurements on X1, X2, . . . , Xp. Is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?


## Challenges of unsupervised learning

- Unsupervised learning is often performed as part of an **exploratory data analysis**. 
- Furthermore, it can be **hard to assess the results** obtained from unsupervised learning methods, since there is no universally accepted mechanism for performing cross-validation or validating results on an independent data set.

# Principal Components Analysis

- Principal component analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. **Principal components** are a low-dimensional representation of a data set that contains as much as possible of the variation in the data set. Each of these components is a **linear combination** of the features in the data.


- PCA is an unsupervised approach, since it involves only a set of features X1, X2, . . . , Xp, and no associated response Y.


- Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization. We can produce low-dimensional views of the data by plotting the principal component scores against each other. Geometrically, this amounts to projecting the original data down onto the subspace spanned by the loading vectors, for example $\phi_1$, $\phi_2$, and $\phi_3$, and plotting the projected points.


## Construction of the First PC

Assume that each of the variables in X has been centered to have mean zero.


### The maximum variance


The first principal component of a set of features X1, X2, . . . , Xp is the **normalized** linear combination of the features

\begin{align*}
& Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + ... + \phi_{p1}X_p \\
& \text{For each observation } i = 1, ..., n \text{:  } z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip} = \sum_{j=1}^p\phi_{j1}x_{ij}
\end{align*}
that has the **largest variance**.

We refer to the elements $\phi_{11},...,\phi_{p1}$ as the **loadings** of the first principal component; together, the loadings make up the **principal component loading vector**, $\phi_{1} = (\phi_{11},...,\phi_{p1})^T$. The loading vector defines a direction in feature space along which the **data vary the most**.

Equivalently, projecting all the observations onto the line defined by this loading vector yields the projected observations with the highest possible variance.
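
As a rough numerical check of this claim, the sketch below computes the first loading vector and the values $z_{i1}$ for a small synthetic data set, then compares their variance with the variance along random unit directions. The data, the seed, and the use of the SVD to obtain the loadings are assumptions made for illustration; they are not part of the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): n = 200 observations, p = 3 features.
n, p = 200, 3
mixing = np.array([[3.0, 1.0, 0.5],
                   [0.0, 1.0, 0.2],
                   [0.0, 0.0, 0.3]])
X = rng.normal(size=(n, p)) @ mixing
X = X - X.mean(axis=0)            # center each feature to mean zero

# The right singular vectors of the centered data are the loading vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                      # first loading vector (unit length)
z1 = X @ phi1                     # z_i1 = sum_j phi_j1 * x_ij
var_pc1 = np.mean(z1 ** 2)        # (1/n) * sum_i z_i1^2

# Variance of the projections onto 1000 random unit directions.
directions = rng.normal(size=(1000, p))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
var_random = np.mean((X @ directions.T) ** 2, axis=0)

print("variance along phi1:          ", var_pc1)
print("best random direction reaches:", var_random.max())
```

No random direction should exceed the variance obtained along `phi1`.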

<img src="./images/104.png" width=400>

**The optimization function of PC 1:**

***maximize ${\frac{1}{n}\sum_{i=1}^n(\sum_{j=1}^p\phi_{j1}x_{ij})^2} = {\frac{1}{n}\sum_{i=1}^n z_{i1}^2}$ (largest variance of the linear combination = largest mean of the squared projected values)***
> We refer to z11, . . . , zn1 as the **scores** of the first principal component; they are exactly the projected values of the observations onto the first loading vector.


### The minimum error
**The first principal component loading vector** also defines the line that is **as close as possible to the data**: the first principal component line minimizes the sum of the squared perpendicular distances between each point and the line, so it provides a good summary of the data. Put differently, if we reconstruct the original features from the single combined feature (the projections of the observations onto the fitted line), the total reconstruction error, which is the sum of these squared perpendicular distances, is at its minimum.

<img src="./images/68.png" width=650>

<img src="./images/106.png" width=650>

***It turns out that "the maximum variance" and "the minimum error" are reached at the same time.***

**Mathematically**, the eigenvector of the covariance matrix of the data with the **largest eigenvalue** is the first principal component loading vector.
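
The link between the two views can be checked numerically. The sketch below, on assumed synthetic two-dimensional data, sweeps over unit directions and records both the variance of the projections and the mean squared perpendicular distance to the line. For centered data the two quantities add up to the total variance for every direction, so the direction that maximizes one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D data (an assumption for illustration), centered to mean zero.
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

# Candidate directions: unit vectors at angles in [0, pi).
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
var_along = np.empty_like(angles)
err = np.empty_like(angles)
for k, a in enumerate(angles):
    phi = np.array([np.cos(a), np.sin(a)])       # unit-length direction
    z = X @ phi                                  # projections onto the line
    residual = X - np.outer(z, phi)              # perpendicular part of each point
    var_along[k] = np.mean(z ** 2)               # variance of the projections
    err[k] = np.mean(np.sum(residual ** 2, axis=1))  # mean squared distance to the line

total = np.mean(np.sum(X ** 2, axis=1))          # total variance of the data

print("variance + error is constant:", np.allclose(var_along + err, total))
print("angle of maximum variance:", angles[np.argmax(var_along)])
print("angle of minimum error:   ", angles[np.argmin(err)])
```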

### Normalization (Optimization constraints)


PCA should be performed after **standardizing** each variable to have mean zero and standard deviation one.

By **normalized**, we mean that we constrain the loadings so that their sum of squares is equal to one, $\sum_{j=1}^p\phi_{j1}^2=1$, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
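
A one-line check makes the need for the constraint concrete. Replacing the loadings $\phi_{j1}$ by $c\,\phi_{j1}$ for any constant $c$ (a symbol introduced here only for illustration) multiplies every score by $c$ and therefore the variance by $c^2$:

\begin{align*}
\frac{1}{n}\sum_{i=1}^n\Big(\sum_{j=1}^p c\,\phi_{j1}x_{ij}\Big)^2 = c^2 \cdot \frac{1}{n}\sum_{i=1}^n\Big(\sum_{j=1}^p \phi_{j1}x_{ij}\Big)^2
\end{align*}

Without the unit-norm constraint the objective can be made as large as we like by increasing $c$, so the maximization problem would have no solution.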


**The optimization function of PC 1:**

**maximize ${\frac{1}{n}\sum_{i=1}^n(\sum_{j=1}^p\phi_{j1}x_{ij})^2} = {\frac{1}{n}\sum_{i=1}^n z_{i1}^2}$ subject to $\sum_{j=1}^p \phi_{j1}^2 = 1$ (normalization)**


> This optimization problem can be solved via an eigen decomposition. The resulting loading vectors and score vectors are unique only up to a sign flip, since replacing $\phi_1$ by $-\phi_1$ replaces $Z_1$ by $-Z_1$ but leaves the variance unchanged.
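
A minimal sketch of the eigen decomposition route, on assumed synthetic data: the covariance matrix is formed with the $\frac{1}{n}$ convention used in these notes, and the eigenvector with the largest eigenvalue is taken as the first loading vector. The last lines illustrate the sign ambiguity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (an assumption for illustration): 150 observations, 4 features.
X = rng.normal(size=(150, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
n = X.shape[0]

cov = X.T @ X / n                        # covariance matrix of the centered data
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, eigenvalues ascending
phi1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
z1 = X @ phi1                            # scores of the first principal component

# The maximized objective equals the largest eigenvalue.
print(np.isclose(np.mean(z1 ** 2), eigvals[-1]))

# Flipping the sign of the loadings flips the scores but not their variance.
z1_flipped = X @ (-phi1)
print(np.allclose(z1_flipped, -z1))
print(np.isclose(np.mean(z1_flipped ** 2), np.mean(z1 ** 2)))
```

Because of this ambiguity, two PCA routines can return loadings of opposite sign for the same data and both be correct.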

## 2nd Principal Component

The second principal component Z2 is a linear combination of the variables that is **uncorrelated** with the first principal component Z1, and has **largest variance** subject to this constraint. **It turns out that the zero correlation condition of Z1 with Z2 is equivalent to the condition that the direction must be perpendicular to the first principal component direction.**

The second principal component scores z12, z22, . . . , zn2 take the form $z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + ... + \phi_{p2}x_{ip}$, where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, ..., \phi_{p2}$.

To find $\phi_2$, we solve a problem similar to the PC1 optimization with $\phi_2$ replacing $\phi_1$, and with the additional constraint that $\phi_2$ is orthogonal to $\phi_1$.
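
The properties of the second component can be checked on assumed synthetic data. The sketch below takes the first two loading vectors from the SVD of the centered data (one way to solve the problem, not necessarily the one used elsewhere in these notes) and verifies that they are orthogonal, that the scores are uncorrelated, and that no random unit direction orthogonal to $\phi_1$ gives a larger variance than $\phi_2$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (an assumption for illustration): 250 observations, 3 features.
X = rng.normal(size=(250, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]          # first and second loading vectors
z1, z2 = X @ phi1, X @ phi2        # corresponding score vectors

print("phi1 orthogonal to phi2:", np.isclose(phi1 @ phi2, 0.0))
print("z1 uncorrelated with z2:", np.isclose(np.mean(z1 * z2), 0.0))

# Random unit directions constrained to be orthogonal to phi1.
d = rng.normal(size=(1000, 3))
d -= np.outer(d @ phi1, phi1)      # remove the component along phi1
d /= np.linalg.norm(d, axis=1, keepdims=True)
var_d = np.mean((X @ d.T) ** 2, axis=0)

print("variance along phi2:", np.mean(z2 ** 2))
print("best orthogonal random direction:", var_d.max())
```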

# Visualization example


<img src="./images/105.png" width=550>

The figure represents both the principal component scores and the loading vectors in a single **biplot** display.


- The first loading vector places approximately equal weight on Assault, Murder, and Rape, with much less weight on UrbanPop. Hence this component roughly corresponds to a measure of **overall rates of serious crimes**. 

- The second loading vector places most of its weight on UrbanPop and much less weight on the other three features. Hence, this component roughly corresponds to the level of **urbanization** of the state.

 - Overall, we see that the crime-related variables (Murder, Assault, and Rape) are located close to each other, and that the UrbanPop variable is far from the other three. 
 - This indicates that the **crime-related variables are correlated** with each other—states with high murder rates tend to have high assault and rape rates—and that the UrbanPop variable is less correlated with the other three.
 
> - States with large positive scores on the first component, such as California, Nevada and Florida, have high crime rates, while states like North Dakota, with negative scores on the first component, have low crime rates. 

> - California also has a high score on the second component, indicating a high level of urbanization, while the opposite is true for states like Mississippi.

> - States close to zero on both components, such as Indiana, have approximately average levels of both crime and urbanization.

# The Proportion of Variance Explained (PVE)

Generally, we are interested in knowing the **proportion of variance explained** (PVE) by each principal component.


- The **total variance** present in a data set (assuming that the variables have been centered to have mean zero) is defined as 

\begin{align}
\sum_{j=1}^pVar(X_j) = \frac{1}{n}\sum_{j=1}^p\sum_{i=1}^nx_{ij}^2
\end{align}


- The **variance explained by the mth principal component** is 

\begin{align}
\frac{1}{n}\sum_{i=1}^nz_{im}^2 = \frac{1}{n}\sum_{i=1}^n(\sum_{j=1}^p\phi_{jm}x_{ij})^2
\end{align}


- Therefore, the PVE of the mth principal component is given by 

\begin{align}
\frac{\sum_{i=1}^n(\sum_{j=1}^p\phi_{jm}x_{ij})^2}{\sum_{j=1}^p\sum_{i=1}^nx_{ij}^2}
\end{align}
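
The PVE formula translates directly into a few lines of NumPy. This is a sketch on assumed synthetic data; the loadings are taken from the SVD of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data (an assumption for illustration): 100 observations, 4 features.
X = rng.normal(size=(100, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                              # column m holds the scores z_1m, ..., z_nm

explained = np.sum(Z ** 2, axis=0)        # numerator: sum_i z_im^2 (the 1/n cancels)
total = np.sum(X ** 2)                    # denominator: sum_j sum_i x_ij^2
pve = explained / total

print("PVE per component:", pve)
print("cumulative PVE:   ", np.cumsum(pve))
```

The PVEs are non-negative, decrease from one component to the next, and sum to one when all components are kept.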


# Deciding How Many Principal Components to Use

In general, an n × p data matrix X has min(n − 1, p) distinct principal components. But we would like to use **the smallest number of principal components** required to get a good understanding of the data.
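
The min(n − 1, p) count can be seen in a small case with fewer observations than features. In the sketch below (assumed synthetic data, with a relative tolerance chosen for illustration), centering removes one degree of freedom, so only n − 1 singular values of the centered matrix are non-zero.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (an assumption for illustration): more features than observations.
n, p = 5, 10
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                    # centering makes the rows sum to zero

s = np.linalg.svd(X, compute_uv=False)    # singular values, largest first
n_components = int(np.sum(s > 1e-10 * s[0]))  # count the numerically non-zero ones

print("components with non-zero variance:", n_components)
print("min(n - 1, p):", min(n - 1, p))
```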

We typically decide on the number of principal components required by examining a scree plot. We look for a point at which the proportion of variance explained by each subsequent principal component drops off. This is often referred to as an **elbow** in the scree plot.

<img src="./images/107.png" width=550>

- A fair amount of variance is explained by the first two principal components, and there is an elbow after the second component.
- The third principal component explains less than ten percent of the variance in the data, and the fourth principal component explains less than half that and so is essentially worthless.


**Principal components regression**

If we compute principal components for use in a supervised analysis, such as principal components regression (PCR), then there is a simple and objective way to determine how many principal components to use: we can select the number of principal component score vectors to be used in the regression by **cross-validation** or a related approach.

- Advantages of PCR: This can lead to **less noisy** results, since it is often the case that the signal (as opposed to the noise) in a data set is concentrated in its first few principal components.
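
A minimal sketch of choosing the number of components for PCR by cross-validation, written with NumPy only. The synthetic data, the helper name `pcr_cv_mse`, and the choice of 5 folds are assumptions made for illustration. The features are only centered here, not standardized, and the centering and loadings are recomputed inside each training fold so that the held-out fold does not influence the fit.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data (an assumption): the response depends on the two highest-variance features.
n, p = 120, 6
X = rng.normal(size=(n, p)) * np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.2])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

def pcr_cv_mse(X, y, n_components, n_folds=5):
    """Cross-validated mean squared error of PCR using the first n_components PCs."""
    all_idx = np.arange(len(y))
    sq_errors = []
    for test_idx in np.array_split(all_idx, n_folds):
        train_idx = np.setdiff1d(all_idx, test_idx)
        # Center with training-fold statistics only.
        x_mean, y_mean = X[train_idx].mean(axis=0), y[train_idx].mean()
        X_tr, y_tr = X[train_idx] - x_mean, y[train_idx] - y_mean
        # Loading vectors from the training fold.
        _, _, Vt = np.linalg.svd(X_tr, full_matrices=False)
        V = Vt[:n_components].T
        # Least squares regression of the response on the score vectors.
        coef, *_ = np.linalg.lstsq(X_tr @ V, y_tr, rcond=None)
        pred = (X[test_idx] - x_mean) @ V @ coef + y_mean
        sq_errors.append((y[test_idx] - pred) ** 2)
    return np.mean(np.concatenate(sq_errors))

cv_mse = {m: pcr_cv_mse(X, y, m) for m in range(1, p + 1)}
best_m = min(cv_mse, key=cv_mse.get)

for m, mse in cv_mse.items():
    print(f"M = {m}: CV MSE = {mse:.3f}")
print("selected number of components:", best_m)
```

In this constructed example the error should drop sharply once the second component is included, since the response was built from the two highest-variance directions.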