## Lecture 13: Clustering 1 (MIT Notes)

The deck ```./decks/KMeans.pptx``` is companion to the original MIT content. This will summarise the mathematical theory behind clustering in general and kmeans clustering in particular. 

**Clustering definition**

1. The input is a dataset $S_n = \{x^i|i=1,...,n\}$
2. The number of clusters to be found $K$

The clusters are defined as a partition $C_j$ such that each partition contains the indices of datapoints belonging to that partition. Also $\bigcup C_j = \{1,2,3..n\} \forall j=1,2..K$ and $C_j \cap C_k = \phi; \forall j \neq k$

**Similarity Measures and Cost**

$Cost(C_1,z) = \sum_{i \in C_1}dist(x^i,z)$

$Cost(C_1,C_2...C_k,z_1,z_2...z_k) = \sum^{k}_{j=1}\sum_{i \in C_j}||x^i-z_j||^2$

**Kmeans algorithm: Big Picture**

1. Randomly select $z_1,2_2,...,z_K$
2. 
    - Given $z_1,z_2,...,z_k$, assign each datapoint to $x^i$ to the $z_j$ if
    $$Cost(z_1,z_2,...,z_k) = \sum^{n}_{i=1}min_{j = 1,2...,k}||x^i-z_j||^2$$
    
    - Given $C_1,C_2,...,C_K$ find the best representatives $z_1,z_2,...,z_K$ such that:
    $$z_j = argmin\sum_{i \in C_j}||x^i-z||^2$$
    
**Kmeans algorithm specifics**

How to find the $z_j$? For each $C_j$ the assignment is made independently, so we can try to find the derivative of $\sum_{i \in C_j}||x^i-z_j||^2$

$\frac{\partial(\sum_{i \in C_j}||x^i-z_j||^2)}{\partial(z_j)}=0 => \sum_{i \in C_j}\frac{\partial||x^i-z_j||^2)}{\partial(z_j)} = 0 => \sum_{i \in C_j}\frac{\partial(x^i-z_j)^2}{\partial(z_j)} = 0 => \sum_{i \in C_j} (-2(x^i - z_j)) = 0 =>\sum_{i \in C_j}(x^i - z_j) = 0 => \sum_{i \in C_j}(x^i) - \sum_{i \in C_j}(z_j) = 0 =>\sum_{i \in C_j}(z_j) = \sum_{i \in C_j}(x^i) => |C_j|z_j = \sum_{i \in C_j}(x^i) => z_j = \frac{\sum_{i \in C_j}(x^i)}{|C_j|} $

## KMeans Demo:
- Show what happens in step 2.2 and how $z_j$ may not be from the dataset itself

1. Select randomly 2 centers $z_j$ from the dataset
2. Find the points which are closest to each of the $z_j$ and assign cluster labels
3. Recompute the clusters again by computing the centroids

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
def find_distance(z,x):
    return np.linalg.norm(z-x)
def get_data():
    X = load_diabetes()['data'][0:50]
    col_names = load_diabetes()['feature_names']
    data = pd.DataFrame(X,columns=col_names)
    zj = data.sample(n=2).values
    return data,zj
def get_label(row,z1,z2):
    d1 = find_distance(z1,row.values)
    d2 = find_distance(z2,row.values)
    if d1>d2:
        return 'z1'
    else:
        return 'z2'
def colorize(row):
    if row['label']=="z1":
        color = "#5F4B8BFF"
    else:
        color = "#E69A8DFF"
    return [f"background-color: {color}"]*len(row.values)
def compute_new_z(z="z1"):
    return data[data['label']==z].drop('label',axis=1).mean().values 

### Randomly select $z_j$

In [2]:
data,zj = get_data()
data.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641


In [3]:
zj

array([[-0.09996055, -0.04464164, -0.06764124, -0.10895673, -0.07449446,
        -0.07271173,  0.01550536, -0.03949338, -0.04986847, -0.00936191],
       [-0.06726771,  0.05068012, -0.01267283, -0.04009932, -0.01532849,
         0.00463594, -0.0581274 ,  0.03430886,  0.01919903, -0.03421455]])

### Assign cluster labels to other points

In [4]:
data['label'] = data.apply(get_label,z1=zj[0],z2=zj[1],axis=1)
data.style.apply(colorize,axis=1)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,label
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,z1
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,z2
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,z1
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,z1
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,z1
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.04118,-0.096346,z2
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062913,-0.038357,z1
7,0.063504,0.05068,-0.001895,0.06663,0.09062,0.108914,0.022869,0.017703,-0.035817,0.003064,z1
8,0.041708,0.05068,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014956,0.011349,z1
9,-0.0709,-0.044642,0.039062,-0.033214,-0.012577,-0.034508,-0.024993,-0.002592,0.067736,-0.013504,z1


In [5]:
compute_new_z("z1")

array([ 0.00163701,  0.02387088,  0.01295889,  0.00663048,  0.00066696,
        0.00262005, -0.01095641,  0.00971196,  0.00703496, -0.00754974])

In [6]:
compute_new_z("z2")

array([-0.03417125, -0.04464164, -0.03704336, -0.04162948, -0.04223601,
       -0.0479556 ,  0.03166366, -0.05179376, -0.0311442 , -0.03099291])

In [7]:
compute_new_z("z1") in data.drop('label',axis=1).values

False

In [8]:
compute_new_z("z2") in data.drop('label',axis=1).values

False