# clustering

goal of clustering: find good representatives

e.g., numerical dataset, find distinct *communities* (labels) of butterflies based on *wing size, mass, color* (features) 

e.g., digits image dataset, find *representative digits* (labels) based on *pixels* (features)

# K-means clustering

## motivation

we have a dataset $X \in \mathbb{R}^{n \times p}$ whose $i$-th row (observation) is the vector $\mathbf{x}_i \in \mathbb{R}^{p}$, $i \in [n]$

- goal of K-means clustering: find centroid vectors $\mu_j$, $j \in [K]$, that best represent the data by minimizing the sum of squared Euclidean distances between the $n$ observations and their assigned centroids

    the centroid vectors $\mu_j$ are also called centers or representatives

$$
\hat \mu, \hat \pi = \underset{\mu, \pi}{\arg \min} \sum_{i=1}^n \left \| \mathbf{x}_i - \mu_{\pi(i)} \right \|_2^2
$$

$\pi: [n] \to [K]$ is an assignment function: $\pi(i) = j$ means observation $i$ is assigned to center vector $\mu_j$
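a minimal sketch of evaluating this objective with NumPy, assuming observations are stored as rows of `X` and clusters are indexed from 0 rather than 1 (the names `kmeans_objective`, `mu`, `pi` are illustrative, not from the notes):

```python
import numpy as np

def kmeans_objective(X, mu, pi):
    """Sum of squared Euclidean distances from each row of X to its assigned center.

    X  : (n, p) array, one observation per row
    mu : (K, p) array, one center per row
    pi : (n,) integer array with pi[i] in {0, ..., K-1} (0-based, an assumption)
    """
    diffs = X - mu[pi]              # row i is x_i - mu_{pi(i)}
    return float(np.sum(diffs ** 2))

# toy data: 4 points in 2-D and 2 centers
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
mu = np.array([[0.0, 1.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])       # each point goes to its nearby center
bad = np.array([0, 1, 0, 1])        # two points go to the far center

print(kmeans_objective(X, mu, good))  # 4.0
print(kmeans_objective(X, mu, bad))   # 204.0
```

the same centers give very different objective values under different assignments, which is why $\mu$ and $\pi$ have to be chosen together.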

- so we need to find both $\mu$ and $\pi$; in general this joint minimization is intractable (NP-hard)

    but given $\pi$, it's easy to find each $\mu_j$

    also, given the $\mu_j$, it's easy to find $\pi$

- given $\pi$, the minimizer $\hat \mu_j$ is the **mean vector** of the data points in cluster $j$

$$
\hat \mu_j = \frac{1}{n_j} \sum_{i|\pi(i)=j} \mathbf{x}_i
$$

where $n_j = \sum_{i|\pi(i)=j} 1 = |\text{cluster } j|$ is the number of data points in cluster $j$

this explains the means part of K-**means**
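a small sketch of this center update, assuming rows of `X` are observations, cluster indices are 0-based, and every cluster has at least one point (`update_centers` is a hypothetical helper name):

```python
import numpy as np

def update_centers(X, pi, K):
    """Return a (K, p) array whose row j is the mean of the rows of X with pi == j.

    Assumes every cluster is non-empty; an empty cluster would produce a NaN row.
    """
    return np.stack([X[pi == j].mean(axis=0) for j in range(K)])

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
pi = np.array([0, 0, 1, 1])

mu_hat = update_centers(X, pi, K=2)
print(mu_hat)  # rows [0, 1] and [10, 2]
```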

- given the $\mu_j$, $\hat \pi(i)$ is the index of the center that minimizes the Euclidean distance to the $i$-th observation $\mathbf{x}_i$, i.e., map observation $i$ to its closest representative $\mu_{j}$

$$
\hat \pi(i) = \underset{j \in [K]}{\arg \min} \left \| \mathbf{x}_i - \mu_{j} \right \|_2^2
$$
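a sketch of this assignment step using NumPy broadcasting, assuming rows of `X` are observations and 0-based cluster indices (`assign_clusters` is an illustrative name; ties go to the lowest index because that is how `np.argmin` behaves):

```python
import numpy as np

def assign_clusters(X, mu):
    """Return pi with pi[i] = index of the center closest to row i of X."""
    # sq_dists[i, j] = ||x_i - mu_j||_2^2, via shapes (n, 1, p) - (1, K, p)
    sq_dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
mu = np.array([[0.0, 1.0], [10.0, 2.0]])

pi_hat = assign_clusters(X, mu)
print(pi_hat)  # [0 0 1 1]
```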

these yield a natural **alternating minimization algorithm**

## algorithm

1. start with $K$ initial random guesses for $\mu_j$, $j \in [K]$

2. compute $\pi$: assign each observation to the closest $\mu_j$

3. compute $\mu_j$: mean vector of each cluster

4. repeat steps 2 and 3 until convergence (e.g., the assignments stop changing)
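the steps above can be sketched in NumPy as follows. this is one possible implementation, not a reference one: initializing from $K$ randomly chosen observations, stopping when the assignments stop changing, and keeping the old center when a cluster becomes empty are all assumptions, and `kmeans`, `n_iter`, `seed` are illustrative names.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternating minimization for K-means. Returns (mu, pi) with 0-based labels."""
    rng = np.random.default_rng(seed)
    # step 1: use K distinct observations as the initial centers (an assumption)
    mu = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    pi = None
    for _ in range(n_iter):
        # step 2: assign each observation to its closest center
        sq_dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
        new_pi = np.argmin(sq_dists, axis=1)
        if pi is not None and np.array_equal(new_pi, pi):
            break                       # assignments unchanged: converged
        pi = new_pi
        # step 3: move each center to the mean of its cluster
        for j in range(K):
            members = X[pi == j]
            if len(members) > 0:        # keep the old center if the cluster is empty
                mu[j] = members.mean(axis=0)
    return mu, pi

# synthetic demo data: two Gaussian blobs in 2-D (not the digit data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

mu, pi = kmeans(X, K=2)
print(mu)                 # one center per row
print(np.bincount(pi))    # cluster sizes
```

the result depends on the random initialization, so running with several values of `seed` and keeping the run with the smallest objective is a common precaution.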

# digit example

we have an image dataset $X \in \mathbb{R}^{12593 \times 784}$

use PCA to plot 2 interesting directions of the data (2-D)

and color the points by their 2 labels: digit 1 or 8 (in practice, we don't know the labels)

PCA is a baseline method for computing interesting directions of data

now we have a new matrix $X \in \mathbb{R}^{n \times 2}$ whose $i$-th row is the vector $\mathbf{x}_i \in \mathbb{R}^{2}$, $i \in [n]$
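a sketch of how such a 2-D projection can be computed, assuming PCA is done by centering the data and taking an SVD (`pca_project` is a hypothetical helper, and the random matrix below is only a stand-in with 784 columns, not the digit data):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal directions."""
    Xc = X - X.mean(axis=0)             # center each feature (column)
    # rows of Vt are principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T     # (n, n_components) coordinates

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 784))    # synthetic stand-in for the image matrix
Z = pca_project(X_demo)
print(Z.shape)  # (200, 2)
```

each row of `Z` gives the 2-D coordinates of one observation, which is what gets scattered in the plot.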


## first visualize data in 2-D

![image.png](attachment:image.png)

## then do K-means clustering

K is a hyperparameter

set $K=2$, we choose to have 2 clusters

set $K=3$, we choose to have 3 clusters

the 2 output cluster centers look like fuzzy images because people write digits in different ways

each center is the average over the many different handwritings in its cluster
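a toy illustration of why averaging gives fuzzy centers, using synthetic $28 \times 28$ images ($784 = 28 \times 28$ pixels) of a vertical stroke drawn at slightly different positions. the stroke images are made up for illustration and are not the digit data:

```python
import numpy as np

# synthetic stand-in for one cluster: the same stroke, shifted sideways
images = np.zeros((5, 28, 28))
for k, col in enumerate([11, 12, 13, 14, 15]):
    images[k, 4:24, col:col + 3] = 1.0      # stroke 3 pixels wide

X_cluster = images.reshape(5, 784)          # flatten: one image per row
center = X_cluster.mean(axis=0)             # the cluster's mean vector
center_img = center.reshape(28, 28)         # back to an image for display

print(np.unique(images))         # [0. 1.]  every input image is sharp
print(center_img[10, 10:19])     # [0. 0.2 0.4 0.6 0.6 0.6 0.4 0.2 0.]
```

the inputs only contain 0 and 1, but the mean has soft edges with in-between values, which is the blur seen in the center images.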

![image.png](attachment:image.png)