# Lecture 4: Data Clustering via K-means

## Overview and motivation

Goal: find a meaningful partition of the data. "Meaningful" usually means that similar points end up in the same cluster.

A partition is described by an **assignment** of each of the $N$ data points to one of $K$ clusters:

$$ \pi : \{ 1, \dots, N \} \to \{1, \dots, K\} $$

We can easily recover the members of the $k$-th cluster as follows:

$$ \pi^{-1}(k) = \{ n : \pi(n) = k \} \subseteq \{ 1, \dots, N \} $$
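
To make the notation concrete, here is a small sketch of $\pi$ and its preimage. The label vector `pi` and the helper `cluster_members` are made up for illustration, and clusters are 0-indexed here while the notes index them from 1.

```python
import numpy as np

# Hypothetical assignment pi for N = 6 points and K = 3 clusters (0-indexed).
pi = np.array([0, 2, 1, 0, 2, 2])

def cluster_members(pi, k):
    """Return the preimage pi^{-1}(k): indices of the points assigned to cluster k."""
    return np.flatnonzero(pi == k)

# Collect the members of every cluster.
members = {k: cluster_members(pi, k).tolist() for k in range(3)}
print(members)
```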

## K-means clustering

Objective function:

$$
J(U, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{kn} \| x_n - u_k \|^2_2
$$

Sum over data points, and sum over clusters. Here $z_{kn} = 1$ if point $n$ is assigned to cluster $k$ and $0$ otherwise, so the inner sum **always** has exactly one nonzero term. This will change once we *upgrade to Mixture Models*!

- $X$ contains the data points as columns ($D \times N$).
- $Z$ contains the cluster assignments ($K \times N$). Every column has exactly one nonzero entry: a 1 in the row of the assigned cluster.
- $U$ contains the cluster centroids as columns ($D \times K$).
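
As an illustration of the structure of $Z$, the following sketch builds it from a label vector. The helper name `one_hot_assignments` and the example labels are assumptions for this note, with clusters 0-indexed.

```python
import numpy as np

def one_hot_assignments(labels, K):
    """Build the K x N matrix Z with Z[k, n] = 1 iff point n is in cluster k."""
    labels = np.asarray(labels)
    Z = np.zeros((K, labels.size))
    # One fancy-indexing write sets exactly one entry per column.
    Z[labels, np.arange(labels.size)] = 1.0
    return Z

# Four points, three clusters (hypothetical labels).
Z = one_hot_assignments([0, 2, 1, 0], K=3)
print(Z)
```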

The objective function can be written as a matrix factorization problem! We want to approximate $X \approx UZ$ with a centroid matrix $U$ and an assignment matrix $Z$.

$$
J(U, Z) = \| X - UZ \|_F^2
$$
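
The two forms agree because column $n$ of $UZ$ is exactly the centroid that point $n$ is assigned to. A quick numerical check, as a sketch on random data with made-up sizes and labels:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 2, 5, 3
X = rng.normal(size=(D, N))          # data as columns
U = rng.normal(size=(D, K))          # centroids as columns
labels = np.array([0, 1, 2, 1, 0])   # hypothetical assignments, 0-indexed
Z = np.zeros((K, N))
Z[labels, np.arange(N)] = 1.0

# Double-sum form of the objective.
J_sum = sum(
    Z[k, n] * np.sum((X[:, n] - U[:, k]) ** 2)
    for n in range(N)
    for k in range(K)
)
# Matrix factorization form: squared Frobenius norm of the residual.
J_frob = np.linalg.norm(X - U @ Z, ord="fro") ** 2

print(J_sum, J_frob)
```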

Currently using the Euclidean metric ($\ell_2$-norm). Other norms are possible too! Remember the DM exam prep!

### Solving K-means

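Alternate between finding the optimal assignments given the centroids and the optimal centroids given the assignments.
Start with random centroids. The objective is non-convex, so this only reaches a local minimum that depends on the initialization, but the result is usually good enough.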

#### A. Assignments given centroids

Fix $U$ and compute the optimal assignment column $z_{:,n}$ for every data point $n$ separately.
(We can do this because our objective function decomposes additively over the data points.)

$$
z^*_{:,n} = \arg \min_{z} \sum_{k=1}^{K} z_{kn} \| x_n - u_k \|_2^2
$$

Note that this just sets $z^*_{kn} = 1$ for the closest centroid, $k = \arg \min_{j} \| x_n - u_j \|_2^2$, and $0$ for all other $k$.
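
A minimal sketch of this step, assuming the column conventions used above ($X$ is $D \times N$, $U$ is $D \times K$). The function name `assign_to_closest` and the toy data are made up, and ties are broken in favor of the lowest cluster index.

```python
import numpy as np

def assign_to_closest(X, U):
    """Return the K x N one-hot matrix Z assigning each column of X to its nearest centroid."""
    # sq_dists[k, n] = ||x_n - u_k||^2, via broadcasting over a (D, K, N) array.
    sq_dists = np.sum((X[:, None, :] - U[:, :, None]) ** 2, axis=0)
    closest = np.argmin(sq_dists, axis=0)   # ties go to the lowest index
    Z = np.zeros((U.shape[1], X.shape[1]))
    Z[closest, np.arange(X.shape[1])] = 1.0
    return Z

# Two points near the origin, two near (5, 5).
X = np.array([[0.0, 0.1, 5.0, 5.2],
              [0.0, -0.1, 5.0, 4.9]])
U = np.array([[0.0, 5.0],
              [0.0, 5.0]])
Z = assign_to_closest(X, U)
print(Z)
```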


#### B. Centroids given assignments

Fix $Z$. Each optimal centroid is the center of mass of the points assigned to it (proof in slides using $\nabla_{u_k}J(U, Z) \overset{!}{=} 0$).
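
The slides are not reproduced here, so the following is a sketch of the standard argument rather than necessarily the exact proof from the lecture. Only the terms with $z_{kn} = 1$ depend on $u_k$, so:

$$
\nabla_{u_k} J(U, Z) = -2 \sum_{n=1}^{N} z_{kn} (x_n - u_k) \overset{!}{=} 0
\quad \Longrightarrow \quad
u_k^* = \frac{\sum_{n=1}^{N} z_{kn} x_n}{\sum_{n=1}^{N} z_{kn}}
$$

This assumes cluster $k$ is non-empty, so that the denominator is positive. Since $J$ is a convex quadratic in $u_k$, the stationary point is indeed the minimum. In matrix form, with no empty clusters, this reads $U^* = X Z^\top (Z Z^\top)^{-1}$, where $Z Z^\top$ is the diagonal matrix of cluster sizes.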

Each iteration costs $O(KND)$: computing all $KN$ point-to-centroid distances in $D$ dimensions dominates.
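
Putting steps A and B together gives the full alternation. This is a minimal sketch under a few assumptions that the notes do not specify: centroids are initialized with $K$ randomly chosen data points, the loop stops once the assignments no longer change, and an empty cluster simply keeps its previous centroid. The function name `kmeans` and its parameters are illustrative.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate steps A and B on a D x N data matrix X. Returns (U, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    # Assumed initialization: K distinct data points chosen at random.
    U = X[:, rng.choice(N, size=K, replace=False)].copy()
    labels = np.full(N, -1)
    for _ in range(n_iters):
        # Step A: K x N squared distances, O(KND) work.
        sq_dists = np.sum((X[:, None, :] - U[:, :, None]) ** 2, axis=0)
        new_labels = np.argmin(sq_dists, axis=0)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the centroids would not move either
        labels = new_labels
        # Step B: mean of the assigned points; an empty cluster keeps its old centroid.
        for k in range(K):
            members = labels == k
            if members.any():
                U[:, k] = X[:, members].mean(axis=1)
    return U, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0.0, 0.1, size=(2, 20)),
               rng.normal(5.0, 0.1, size=(2, 20))])
U, labels = kmeans(X, K=2)
print(U)
```

On data this well separated the result does not depend on the initialization. In general it does, which is why several restarts are commonly used.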

## Choosing K

### A. Use stability

1. Generate several perturbed versions of the dataset.
2. Cluster each with fixed K.
3. Compute pairwise distances between all clusterings.
4. Compute **instability** = mean distance between all clusterings.
5. Repeat for different values of $K$, and choose the most stable one.
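
The procedure above can be sketched end to end as follows. Several choices here are assumptions, since the notes do not fix them: the perturbation is additive Gaussian noise on the same $N$ points (so that clusterings can be compared point by point), the distance is the minimum number of label disagreements over all label permutations (times 2, matching the squared Frobenius norm on one-hot matrices), and the permutation search is brute force, which only works for small $K$. All function names and parameters are hypothetical.

```python
import itertools
import numpy as np

def kmeans_labels(X, K, seed, n_iters=50):
    """Minimal K-means on a D x N matrix X; returns a length-N label vector."""
    rng = np.random.default_rng(seed)
    U = X[:, rng.choice(X.shape[1], size=K, replace=False)].copy()
    for _ in range(n_iters):
        sq_dists = np.sum((X[:, None, :] - U[:, :, None]) ** 2, axis=0)
        labels = np.argmin(sq_dists, axis=0)
        for k in range(K):
            if np.any(labels == k):
                U[:, k] = X[:, labels == k].mean(axis=1)
    return labels

def clustering_distance(a, b, K):
    """Min over label permutations of ||Z - Pi(Z')||_F^2, for label vectors a and b."""
    # For one-hot columns, each disagreeing point contributes 2 to the squared norm.
    return min(
        2 * int(np.sum(a != np.asarray(perm)[b]))
        for perm in itertools.permutations(range(K))
    )

def instability(X, K, n_perturbations=5, noise=0.05, seed=0):
    """Mean pairwise distance between clusterings of noisy copies of X."""
    rng = np.random.default_rng(seed)
    clusterings = [
        kmeans_labels(X + noise * rng.normal(size=X.shape), K, seed=seed + i)
        for i in range(n_perturbations)
    ]
    pairs = itertools.combinations(clusterings, 2)
    return float(np.mean([clustering_distance(a, b, K) for a, b in pairs]))

# Two well-separated synthetic blobs; compare a few candidate values of K.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0.0, 0.1, size=(2, 20)),
               rng.normal(5.0, 0.1, size=(2, 20))])
scores = {K: instability(X, K) for K in (2, 3, 4)}
best_K = min(scores, key=scores.get)
print(scores, best_K)
```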

#### Distance between clusterings

$$
d(C, C') := \min_{\Pi} \|Z - \Pi(Z')\|^2_F
$$
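
The notes do not define $\Pi$. The sketch below assumes it ranges over permutations of the cluster labels, i.e. of the rows of $Z'$, so that two clusterings that differ only in how the clusters are named have distance zero. The minimum is found by brute force over all $K!$ permutations, which is only feasible for small $K$; an assignment solver such as the Hungarian algorithm would be the usual replacement for larger $K$.

```python
import itertools
import numpy as np

def clustering_distance(Z, Z_prime):
    """Brute-force min over row permutations of ||Z - Pi(Z')||_F^2."""
    K = Z.shape[0]
    return min(
        float(np.sum((Z - Z_prime[list(perm), :]) ** 2))
        for perm in itertools.permutations(range(K))
    )

Z = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
# Same partition with the two cluster labels swapped.
Z_swapped = Z[::-1]
# One point moved to the other cluster.
Z_moved = np.array([[1, 0, 0, 0],
                    [0, 1, 1, 1]])

d_same = clustering_distance(Z, Z_swapped)
d_moved = clustering_distance(Z, Z_moved)
print(d_same, d_moved)
```

Each point on which the two clusterings disagree (after the best relabeling) contributes 2 to the squared Frobenius norm, so the distance is twice the number of disagreements.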


K-means can be seen as a matrix factorization problem, but $Z$ does have some unusual constraints. We can relax $Z$ from $z_{k,n} \in \{0,1\}$ to $z_{k,n} \in [0,1]$ with $\sum_{k=1}^{K} z_{k,n} = 1 \; \forall n$.
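
One way to produce a $Z$ that satisfies the relaxed constraints is a softmax over negative squared distances. This is only an illustrative sketch, not a method from these notes: the function name `soft_assignments` and the stiffness parameter `beta` are assumptions introduced here.

```python
import numpy as np

def soft_assignments(X, U, beta=1.0):
    """K x N matrix with entries in [0, 1] and columns summing to 1.

    `beta` is a hypothetical stiffness parameter: larger values push each
    column towards the hard one-hot assignment.
    """
    sq_dists = np.sum((X[:, None, :] - U[:, :, None]) ** 2, axis=0)
    logits = -beta * sq_dists
    logits -= logits.max(axis=0, keepdims=True)   # for numerical stability
    Z = np.exp(logits)
    return Z / Z.sum(axis=0, keepdims=True)

# Same toy data as a hard assignment would use: two points per centroid.
X = np.array([[0.0, 0.1, 5.0, 5.2],
              [0.0, -0.1, 5.0, 4.9]])
U = np.array([[0.0, 5.0],
              [0.0, 5.0]])
Z_soft = soft_assignments(X, U, beta=0.01)
Z_stiff = soft_assignments(X, U, beta=50.0)
print(Z_soft)
```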