## Lecture 14: Clustering II (MIT Notes)

**Introduction to K-Mediod Algorithm**

1. Randomly select $z_1,2_2,...,z_K$
2. 
    - Given $z_1,z_2,...,z_k$, assign each datapoint to $x^i$ to the $z_j$ if
    $$Cost(z_1,z_2,...,z_k) = \sum^{n}_{i=1}min_{j = 1,2...,k}dist(x^i,z_j)$$
    
    - Given $C_1,C_2,...,C_K$ find the best representatives $z_1,z_2,...,z_K$ such that:
    $$z_j = \underset{z} {\arg \min}\sum_{i \in C_j}dist(x^i-z)$$ such that $$z_j \in \{x^1,x^2,...,x^n\}$$
    

This algorithm does two things:

1. It finds $z_j$ that are part of the data
2. It allows us to use any distance function


**Computational Complexity**

For the step 2.1, the cost complexity is $O(ndk)$ as there are n data points, k clusters and each datapoint has d dimensions.

For the step 2.2 the cost complexity is $O(n^2dk)$ because we need to check if $z_j \in \{x^1,x^2,...,x^n\}$, this adds an extra cost complexity of $O(n)$


### K Mediods Implimentation with Eucledian Distance

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
def find_distance(z,x):
    return np.linalg.norm(z-x)
def get_data():
    X = load_diabetes()['data'][0:50]
    col_names = load_diabetes()['feature_names']
    data = pd.DataFrame(X,columns=col_names)
    zj = data.sample(n=2).values
    return data,zj
def get_label(row,z1,z2):
    d1 = find_distance(z1,row.values)
    d2 = find_distance(z2,row.values)
    if d1>d2:
        return 'z1'
    else:
        return 'z2'
def colorize(row):
    if row['label']=="z1":
        color = "#5F4B8BFF"
    else:
        color = "#E69A8DFF"
    return [f"background-color: {color}"]*len(row.values)

def compute_dist_dict(subset):
    distance_dict = {}
    for idx,z in enumerate(subset.values):
        distance_dict[idx]=0
        for row in subset.values:
            distance_dict[idx]+=find_distance(z,row)
    return distance_dict

def compute_new_zj(data):
    sub1 = data[data['label']=="z1"].drop('label',axis=1)
    sub2 = data[data['label']=="z2"].drop('label',axis=1)
    distance_dict1 = compute_dist_dict(sub1)
    distance_dict2 = compute_dist_dict(sub2)
    idx_z1 = pd.Series(distance_dict1).values.argmin()
    idx_z2 = pd.Series(distance_dict2).values.argmin()
    return (sub1.values[idx_z1],sub2.values[idx_z2])

### Randomly select $z_j$

In [2]:
data,zj = get_data()
data.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641


In [3]:
zj

array([[-0.05637009, -0.04464164, -0.01159501, -0.03321358, -0.0469754 ,
        -0.04765985,  0.00446045, -0.03949338, -0.0079794 , -0.08806194],
       [ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613]])

### Assign cluster labels to other points

In [4]:
data['label'] = data.apply(get_label,z1=zj[0],z2=zj[1],axis=1)
data.style.apply(colorize,axis=1)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,label
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,z1
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,z2
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,z1
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,z2
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,z2
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.04118,-0.096346,z2
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062913,-0.038357,z2
7,0.063504,0.05068,-0.001895,0.06663,0.09062,0.108914,0.022869,0.017703,-0.035817,0.003064,z1
8,0.041708,0.05068,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014956,0.011349,z1
9,-0.0709,-0.044642,0.039062,-0.033214,-0.012577,-0.034508,-0.024993,-0.002592,0.067736,-0.013504,z2


### Compute New Cluster centers 

In [5]:
zj = compute_new_zj(data)

In [6]:
zj[0] in data.drop('label',axis=1).values ## check that the new centers belong to the original data

True

In [7]:
zj[1] in data.drop('label',axis=1).values ## check that the new centers belong to the original data

True