# Unsupervised Learning

Unsupervised learning is a set of statistical tools for the setting in which we have only a set of features $X_1, X_2, \ldots, X_p$ measured on $n$ observations. We are not interested
in prediction, because we do not have an associated response variable $Y$.
Rather, the goal is to discover interesting things about the measurements
on $X_1, X_2, \ldots, X_p$.

## Principal Components Analysis

Principal component analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data.

### What Are Principal Components?

The
idea is that each of the n observations lives in p-dimensional space, but not
all of these dimensions are equally interesting. PCA seeks a small number
of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each
dimension. Each of the dimensions found by PCA is a linear combination
of the p features.

The first principal component of a set of features $X_1, X_2, \ldots , X_p$ is the
normalized linear combination of the features:
$$
Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p, \qquad \sum_{j=1}^p \phi_{j1}^2 = 1
$$
The elements $\phi_{11}, \ldots, \phi_{p1}$ are called the *loadings* of the first principal component. They are found by maximizing the variance of $Z_1$ subject to the constraint that the squared loadings sum to 1.

>The loadings form a vector $\phi_1$ that points in the direction in feature space along which the data vary the most, and the principal component scores are the projections of the observations onto this direction.
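
As a minimal sketch of the definition above, the first loading vector can be obtained from the singular value decomposition of the centered data matrix. The toy data, the seed, and the comparison against random unit vectors are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): n = 200 observations, p = 3 features with unequal spread.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)  # center each feature to mean zero

# Rows of Vt are the right singular vectors of X, ordered by decreasing
# singular value; the first row is the first loading vector.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]
z1 = X @ phi1  # first principal component scores z_11, ..., z_n1

# No other unit-norm direction should give scores with larger variance.
best_random = 0.0
for _ in range(1000):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    best_random = max(best_random, np.mean((X @ v) ** 2))

print("sum of squared loadings:", np.sum(phi1 ** 2))
print("variance along phi1:", np.mean(z1 ** 2))
print("best variance among random directions:", best_random)
```

The sum of squared loadings is 1 up to rounding, and the variance along `phi1` is at least as large as that of any random direction tried.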

After the first principal component $Z_1$ of the features has been determined, we can find the second principal component $Z_2$. The second principal component is the linear combination of $X_1, X_2, \ldots , X_p$ that has maximal
variance out of all linear combinations that are uncorrelated with $Z_1$. The
second principal component scores $z_{12}, z_{22}, \cdots , z_{n2}$ take the form:
$$
z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \ldots + \phi_{p2}x_{ip}
$$
It turns out that constraining $Z_2$ to be uncorrelated with
$Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$.
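
The equivalence between uncorrelated scores and orthogonal loading vectors can be checked numerically. This is a small sketch on assumed synthetic data with correlated features, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed toy data: mixing independent columns makes the features correlated.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]      # first and second loading vectors
z1, z2 = X @ phi1, X @ phi2    # corresponding score vectors

dot = phi1 @ phi2                  # orthogonality of the directions
corr = np.corrcoef(z1, z2)[0, 1]   # correlation of the scores

print("phi1 . phi2 =", dot)
print("corr(z1, z2) =", corr)
print("var(z1) >= var(z2):", np.var(z1) >= np.var(z2))
```

Both the dot product and the correlation are zero up to floating-point error.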

Once we have computed the principal components, we can plot them
against each other in order to produce low-dimensional views of the data.
For instance, we can plot the score vector $Z_1$ against $Z_2$, $Z_1$ against $Z_3$,
$Z_2$ against $Z_3$, and so forth. Geometrically, this amounts to projecting
the original data down onto the subspace spanned by $\phi_{1}$, $\phi_{2}$, and $\phi_{3}$, and
plotting the projected points.

### More on PCA

#### Scaling the Variables

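The results of PCA depend on whether the variables have been
individually scaled (for example, each divided by its standard deviation). Because PCA looks for directions of maximal variance, a variable measured on a larger scale has a larger variance and tends to dominate the leading loading vectors, so rescaling the features changes the principal components.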
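
The effect of scaling can be seen on a two-feature example. The sketch below assumes synthetic data in which the second feature is recorded in units 1000 times larger than the first; `first_loading` is a hypothetical helper introduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
# Second feature is correlated with the first but on a much larger scale.
x2 = 1000.0 * (0.5 * x1 + rng.normal(size=n))
X = np.column_stack([x1, x2])

def first_loading(data):
    """First principal component loading vector of `data` (centered here)."""
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]

raw = first_loading(X)                      # unscaled features
scaled = first_loading(X / X.std(axis=0))   # each feature scaled to unit std

print("unscaled loadings:", raw)
print("scaled loadings:  ", scaled)
```

Without scaling, the first loading vector places almost all of its weight on the large-scale feature. After scaling, both features receive equal weight in absolute value.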

#### Uniqueness of the Principal Components

Each principal component loading vector is unique, up to a sign flip. This
means that two different software packages will yield the same principal
component loading vectors, although the signs of those loading vectors
may differ. The signs may differ because each principal component loading
vector specifies a direction in p-dimensional space: flipping the sign has no
effect as the direction does not change.
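
A quick numerical check of the sign ambiguity, on assumed random data. The `align_sign` helper is hypothetical: it shows one possible convention for making loading vectors from different sources comparable, not something any particular package is claimed to do.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi = Vt[0]

# Flipping the sign flips the scores but leaves their variance unchanged.
var_pos = np.mean((X @ phi) ** 2)
var_neg = np.mean((X @ -phi) ** 2)

def align_sign(v):
    """Flip v so that its largest-magnitude entry is positive (assumed convention)."""
    return v if v[np.argmax(np.abs(v))] > 0 else -v

print(np.isclose(var_pos, var_neg))
print(np.allclose(align_sign(phi), align_sign(-phi)))
```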

#### The Proportion of Variance Explained

The total variance present in a data set (assuming
that the variables have been centered to have mean zero) is defined as:
$$
\sum_{j=1}^p \text{Var}(X_j) = \sum_{j=1}^p\frac{1}{n}\sum_{i=1}^n x_{ij}^2
$$
The variance explained by the $m$th principal component is:
$$
\frac{1}{n}\sum_{i=1}^n z_{im}^2 = \frac{1}{n}\sum_{i=1}^n\left( \sum_{j=1}^p\phi_{jm}x_{ij} \right)^2
$$
Hence the proportion of variance explained (PVE) by the $m$th principal component is:
$$
\frac{\sum_{i=1}^n \left( \sum_{j=1}^p\phi_{jm}x_{ij} \right)^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2}
$$
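
The PVE can be computed directly from its definition as the ratio of the variance of each score vector to the total variance. The data below are an assumed toy example.

```python
import numpy as np

rng = np.random.default_rng(4)
# Assumed toy data: four features with decreasing spread.
X = rng.normal(size=(100, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
n = X.shape[0]

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T  # column m holds the scores of principal component m + 1

total_var = np.sum(X ** 2) / n               # sum over features of Var(X_j)
pc_var = np.sum(Z ** 2, axis=0) / n          # variance explained by each component
pve = pc_var / total_var

print("PVE per component:", pve)
print("sum of PVEs:", pve.sum())
```

The PVEs are non-increasing across components and sum to 1 when all components are kept.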

#### Deciding How Many Principal Components to Use

In general, an $n\times p$ data matrix $X$ has $\min(n - 1, p)$ distinct principal
components. However, we usually are not interested in all of them. In fact, we would like to use the smallest
number of principal components required to get a good understanding of the
data. We typically decide on the number of principal components required
to visualize the data by examining a *scree plot*. We choose the smallest number of
principal components that are required in order to explain a sizable amount
of the variation in the data. This is done by eyeballing the scree plot, and
looking for a point at which the proportion of variance explained by each
subsequent principal component drops off. This is often referred to as an
elbow in the scree plot.
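
The sketch below illustrates the $\min(n - 1, p)$ bound on an assumed data set with fewer observations than features, and prints the values a scree plot would display. It relies on the fact that the squared singular values of the centered data are proportional to the variances of the components. Picking the elbow itself remains a visual judgment and is not automated here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 10  # assumed sizes, chosen so that n - 1 < p
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)  # centering removes one degree of freedom

# Singular values of the centered data, in decreasing order.
s = np.linalg.svd(X, compute_uv=False)

# Count the components with non-negligible variance.
n_components = int(np.sum(s > 1e-10 * s[0]))

pve = s ** 2 / np.sum(s ** 2)   # heights of the scree plot
cumulative = np.cumsum(pve)     # cumulative proportion of variance explained

print("distinct components:", n_components)
print("PVE:", np.round(pve, 3))
print("cumulative PVE:", np.round(cumulative, 3))
```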

## Clustering Methods

Clustering refers to a very broad set of techniques for finding subgroups, or
clusters, in a data set. When we cluster the observations of a data set, we
seek to partition them into distinct groups so that the observations within
each group are quite similar to each other, while observations in different
groups are quite different from each other.

>Both clustering and PCA seek to simplify the data via a small number
of summaries, but their mechanisms are different:
>* PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance;
>* Clustering looks to find homogeneous subgroups among the observations.

### K-Means Clustering

K-means clustering is a simple and elegant approach for partitioning a
data set into K distinct, non-overlapping clusters. To perform K-means
clustering, we must first specify the desired number of clusters K; then the
K-means algorithm will assign each observation to exactly one of the K
clusters. 

![](images/10_01.png)

The algorithm for K-means clustering is as follows:
1. Randomly assign a number, from 1 to K, to each of the observations.
These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
    - For each of the K clusters, compute the cluster centroid. The
kth cluster centroid is the vector of the p feature means for the
observations in the kth cluster.
    - Assign each observation to the cluster whose centroid is closest
(where closest is defined using Euclidean distance)

The K-means algorithm finds a local rather than a global optimum, so the results depend on the initial (random) cluster assignment of each observation in Step 1 of the algorithm. For this reason,
it is important to run the algorithm multiple times from different random initial configurations and then select the best solution, i.e. the one for which
the objective (the total within-cluster variation) is smallest.
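
The two steps and the restart strategy described above can be sketched in NumPy as follows. The function name `kmeans`, the iteration cap, the re-seeding of empty clusters, and the toy data are all assumptions. The objective used here is the sum of squared Euclidean distances from each observation to its cluster centroid.

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    """One run of K-means from a random assignment; returns (labels, objective)."""
    n = X.shape[0]
    labels = rng.integers(K, size=n)  # Step 1: random labels in 0..K-1
    for _ in range(max_iter):
        # Step 2a: each centroid is the vector of feature means of its cluster.
        # Assumption: an empty cluster is re-seeded at a random observation.
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(n)]
            for k in range(K)
        ])
        # Step 2b: assign each observation to the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    # Squared distance of each observation to its nearest (own) centroid.
    objective = dists.min(axis=1).sum()
    return labels, objective

rng = np.random.default_rng(6)
# Assumed toy data: three well-separated groups of 30 points in 2-D.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])

# Run from 20 random starts and keep the run with the smallest objective.
runs = [kmeans(X, K=3, rng=rng) for _ in range(20)]
best_labels, best_obj = min(runs, key=lambda run: run[1])

print("objectives across runs:", sorted(round(float(obj), 1) for _, obj in runs))
print("best objective:", best_obj)
```

Printing the objectives across runs makes any spread between local optima visible, which is the reason for keeping only the best run.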

### Hierarchical Clustering

One potential disadvantage of K-means clustering is that it requires us to
pre-specify the number of clusters K. Hierarchical clustering is an alternative approach which does not require that we commit to a particular
choice of K. Hierarchical clustering has an added advantage over K-means
clustering in that it results in an attractive tree-based representation of the
observations, called a dendrogram.

![](images/10_03.png)

In the left-hand panel of the figure above, each leaf of the dendrogram represents one of the 45 observations. However, as we move
up the tree, some leaves begin to fuse into branches. These correspond to
observations that are similar to each other. As we move higher up the tree,
branches themselves fuse, either with leaves or other branches. The earlier
(lower in the tree) fusions occur, the more similar the groups of observations are to each other. On the other hand, observations that fuse later
(near the top of the tree) can be quite different. In fact, this statement
can be made precise: for any two observations, we can look for the point in
the tree where branches containing those two observations are first fused.
The height of this fusion, as measured on the vertical axis, indicates how different the two observations are. Thus, observations that fuse at the very
bottom of the tree are quite similar to each other, whereas observations
that fuse close to the top of the tree will tend to be quite different.

#### The Hierarchical Clustering Algorithm

The hierarchical clustering dendrogram is obtained via an extremely simple
algorithm. We begin by defining some sort of dissimilarity measure between
each pair of observations. Most often, Euclidean distance is used. The algorithm proceeds iteratively. Starting out at the bottom of the dendrogram,
each of the n observations is treated as its own cluster. The two clusters
that are most similar to each other are then fused so that there now are
$n - 1$ clusters. Next the two clusters that are most similar to each other are
fused again, so that there now are $n - 2$ clusters. The algorithm proceeds
in this fashion until all of the observations belong to one single cluster, and
the dendrogram is complete.

However, for this algorithm to work, the concept of dissimilarity
between a pair of observations needs to be extended to a pair of groups
of observations. This extension is achieved by developing the notion of
*linkage*, which defines the dissimilarity between two groups of observations. The four most common types of linkage are complete, average, single,
and centroid.
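
To make the four linkage types concrete, the sketch below computes each one for two small groups of points using Euclidean distance. The groups `A` and `B` are made-up examples.

```python
import numpy as np

# Two assumed groups of 2-D observations.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 4.0]])

# All pairwise Euclidean distances: rows index A, columns index B.
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

complete = D.max()    # largest pairwise dissimilarity
single = D.min()      # smallest pairwise dissimilarity
average = D.mean()    # mean of all pairwise dissimilarities
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between group means

print("complete:", complete)            # 5.0
print("single:  ", single)              # 3.0
print("average: ", round(average, 4))   # 3.8028
print("centroid:", round(centroid, 4))  # 3.1623
```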

![](images/10_04.png)

The algorithm proceeds in the following manner:
1. Begin with $n$ observations and a measure (such as Euclidean distance) of all the $\binom{n}{2} = n(n - 1)/2$ pairwise dissimilarities. Treat each
observation as its own cluster.
2. For $i = n, n - 1, \ldots , 2$:
    - Examine all pairwise inter-cluster dissimilarities among the i
clusters and identify the pair of clusters that are least dissimilar
(that is, most similar). Fuse these two clusters. The dissimilarity
between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
    - Compute the new pairwise inter-cluster dissimilarities among
the $i - 1$ remaining clusters.
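
The steps above can be sketched as a naive implementation that recomputes every inter-cluster dissimilarity at each iteration. The function `agglomerative` and the one-dimensional toy data are assumptions. Passing Python's built-in `max` gives complete linkage and `min` gives single linkage.

```python
import numpy as np

def agglomerative(X, linkage=max):
    """Return the fusions as a list of (cluster_a, cluster_b, height) tuples."""
    n = len(X)
    # Step 1: pairwise Euclidean dissimilarities; each observation is a cluster.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]
    merges = []
    # Step 2: fuse the least dissimilar pair until one cluster remains.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage applied to all pairwise dissimilarities between clusters.
                d = linkage(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))  # d is the fusion height
        fused = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [fused]
    return merges

# Assumed toy data: five observations on a line.
X = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

complete_heights = [h for _, _, h in agglomerative(X, linkage=max)]
single_heights = [h for _, _, h in agglomerative(X, linkage=min)]

print("complete linkage heights:", complete_heights)  # [1.0, 2.0, 7.0, 20.0]
print("single linkage heights:  ", single_heights)    # [1.0, 2.0, 4.0, 13.0]
```

There are $n - 1 = 4$ fusions in each case. The same observations fuse in the same order here, but the fusion heights differ with the linkage, which changes the shape of the resulting dendrogram.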

![](images/10_05.png)

#### Choice of Dissimilarity Measure

The choice of dissimilarity measure is very important, as it has a strong
effect on the resulting dendrogram. In general, careful attention should be
paid to the type of data being clustered and the scientific question at hand.
These considerations should determine what type of dissimilarity measure
is used for hierarchical clustering.
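
As a small illustration of how much the choice can matter, the sketch below compares Euclidean distance with a correlation-based distance, defined here as one minus the Pearson correlation. The correlation-based measure is not named in the text above and is introduced only as an example alternative. The three observations are made up.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])      # similar magnitudes, opposite profile
c = np.array([10.0, 20.0, 30.0, 40.0])  # much larger magnitudes, same profile as a

def euclidean(u, v):
    """Euclidean distance between two observations."""
    return float(np.linalg.norm(u - v))

def correlation_distance(u, v):
    """One minus the Pearson correlation between two observations (assumed measure)."""
    return float(1.0 - np.corrcoef(u, v)[0, 1])

print("Euclidean   a-b:", euclidean(a, b), " a-c:", euclidean(a, c))
print("Correlation a-b:", correlation_distance(a, b), " a-c:", correlation_distance(a, c))
```

Under Euclidean distance `a` is closer to `b`, while under the correlation-based distance `a` is closer to `c`, so the two measures would lead to different fusions.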