# 6. Unsupervised Learning

The two main tasks are **clustering** (together with outlier detection) and dimension reduction. This unit focuses on clustering.

Points are represented in high-dimensional Euclidean spaces or, more generally, in any metric space.

## Types of clustering

### Hierarchical clustering
Build a tree representing distances among the data points.
*Examples:* single-, average-linkage **agglomerative clustering**.
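
To make the tree-building idea concrete, here is a minimal sketch of single-linkage agglomerative clustering in plain Python. The function name `single_linkage`, the `num_clusters` stopping rule, and the toy points are assumptions made for illustration; the naive search over all cluster pairs is only meant for small inputs.

```python
import math


def single_linkage(points, num_clusters):
    """Naive single-linkage agglomerative clustering (illustrative sketch).

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, distance) tuples, which encodes the tree.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters by their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical toy data: two points on the left, three on the right.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (5.0, 3.0)]
clusters, merges = single_linkage(points, num_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Reading `merges` from first to last gives the tree bottom-up: early entries join nearby points, later entries join whole groups at larger distances.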

### Partitional approaches
Define an objective function over partitions of the data and optimize it.
A partition splits the data points into disjoint, non-empty groups (the clusters) that together cover the whole data set.
*Examples:* spectral clustering, graph-cut based approaches.

### Model-based approaches (main focus)
Maintain cluster "models" and infer cluster membership.
Viewed as the "standard" clustering methods.
*Examples:* k-means, Gaussian mixture models (GMMs), etc.

## The k-means problem
 * Points in Euclidean space, $x_i \in \mathbb{R}^d$.
 * Clusters as centers $\mu_j \in \mathbb{R}^d$.
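 * Each point is assigned to its closest center.
 * **Goal**: pick centers to minimize the total squared distance from each point to its closest center (equivalent to minimizing the average squared distance, which differs only by a constant factor):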
 
\begin{equation}
\begin{aligned}
L(\mu) & = L(\mu_1, \dots, \mu_k) = \sum_{i = 1}^{n} \min_{j} \| x_i - \mu_j \|_2^2 \\
\mu^{*} & = \arg \min_{\mu} L(\mu)
\end{aligned}
\end{equation}
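
As a small illustration, the objective can be evaluated directly for a given set of centers. This is a sketch in plain Python; the helper names `squared_distance` and `kmeans_objective` and the toy data are assumptions, not part of the notes.

```python
def squared_distance(x, mu):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def kmeans_objective(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(squared_distance(x, mu) for mu in centers) for x in points)


# Hypothetical toy data: two pairs of points, far apart along the first axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center per pair
bad_centers = [(5.0, 0.0), (5.0, 2.0)]    # centers placed between the pairs

loss_good = kmeans_objective(points, good_centers)
loss_bad = kmeans_objective(points, bad_centers)
print(loss_good, loss_bad)  # 4.0 100.0
```

Centers placed inside the two groups give a much smaller value of $L(\mu)$ than centers placed between them, which is what the minimization rewards.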

This optimization problem is NP-hard in general, so in practice we use **Lloyd's heuristic**, which is commonly (though not entirely accurately) referred to as the k-means algorithm.

The algorithm:

* Initialize the cluster centers (randomly or in a smarter way): $\mu^{(0)} = [\mu_1^{(0)}, \dots, \mu_k^{(0)}]$
* Assign every point to its closest center: $z_i = \arg \min_{j} \| x_i - \mu_j^{(t - 1)} \|_2^2$
* Move each center to the mean of the points assigned to it: $\mu_j^{(t)} = \frac{1}{n_j} \sum_{i : z_i = j} x_i$, where $n_j$ is the number of points with $z_i = j$
* Repeat the assignment and update steps until convergence (e.g. until no center moves by more than a small threshold).
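
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `lloyd`, the parameters `max_iter`, `tol` and `seed`, the choice to initialize from randomly sampled data points, and the rule that an empty cluster keeps its previous center are all choices made here.

```python
import random


def squared_distance(x, mu):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def lloyd(points, k, max_iter=100, tol=1e-9, seed=0):
    """Sketch of Lloyd's heuristic; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Initialization (assumption): k distinct data points chosen at random.
    centers = [tuple(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        assignments = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [x for x, z in zip(points, assignments) if z == j]
            if not members:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
                continue
            dim = len(members[0])
            new_centers.append(
                tuple(sum(x[c] for x in members) / len(members) for c in range(dim))
            )
        # Convergence check: largest squared movement of any center.
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, assignments


# Hypothetical toy data: three points near x = 0 and three near x = 10.
points = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0), (10.0, 0.0), (10.0, 2.0), (11.0, 1.0)]
centers, assignments = lloyd(points, k=2)
print(sorted(centers), assignments)
```

On this well-separated toy data the two groups are recovered; on harder data the result depends on the initialization, so several restarts with different seeds are a common precaution.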

### Properties
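 * Guaranteed (can be shown) to *monotonically decrease the objective* (more precisely, never increase it) in each iteration: $L(\mu^{(t+1)}) \le L(\mu^{(t)})$
 * Converges to a *local* optimum, which is not necessarily the global one; the result depends on the initialization.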
 * $O(nkd)$ per iteration ($n$ points, $k$ clusters, $d$ dimensions); the entire data set has to be processed in every iteration, which makes the algorithm hard to scale to very large data sets.
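
One way to see the first property is the following sketch. The joint objective $\tilde{L}$ and the assignment vector $z^{(t+1)}$ are notation introduced here for the argument, not part of the notes above. Write $\tilde{L}(\mu, z) = \sum_{i} \| x_i - \mu_{z_i} \|_2^2$, so that $L(\mu) = \min_{z} \tilde{L}(\mu, z)$. The assignment step picks $z^{(t+1)}$ to minimize $\tilde{L}(\mu^{(t)}, \cdot)$, and the update step picks $\mu^{(t+1)}$ to minimize $\tilde{L}(\cdot, z^{(t+1)})$, because the mean minimizes the sum of squared distances to a fixed set of points. Chaining these two facts:

\begin{equation}
\begin{aligned}
L(\mu^{(t+1)}) & = \min_{z} \tilde{L}(\mu^{(t+1)}, z) \\
& \le \tilde{L}(\mu^{(t+1)}, z^{(t+1)}) \\
& \le \tilde{L}(\mu^{(t)}, z^{(t+1)}) \\
& = L(\mu^{(t)})
\end{aligned}
\end{equation}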

## Scaling up k-means
TODO