# K-Means Clustering

For the following dataset, perform the clustering:

17 28 50 60 80 89 150 167 171 189

1. Use the K-means algorithm with K=3 to cluster the data.
2. What will the final clusters be after 3 iterations if K=3 and the initial centers are 150, 171 and 189?


In [1]:
# import packages
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# 1. Use the K-means algorithm with K=3 to cluster the data

In [2]:
List=[17, 28, 50, 60, 80, 89, 150, 167, 171, 189]
data = np.array(List)
k=3

In [3]:
# initialize the cluster centers by sampling k distinct data points at random
import random
u=random.sample(List, k)
u.sort()
print("Initial cluster centers: ")
print(u)

# or manually, for example:
# u_initial=[17,50,60]
# print("Initial cluster centers: ")
# print(u_initial)
# u=u_initial

Initial cluster centers: 
[28, 80, 189]


In [4]:
# compute the distance from every data point to each initial center
distance = {}
for center in u:
    distance[center] = abs(data - center)
distance

{28: array([ 11,   0,  22,  32,  52,  61, 122, 139, 143, 161]),
 80: array([ 63,  52,  30,  20,   0,   9,  70,  87,  91, 109]),
 189: array([172, 161, 139, 129, 109, 100,  39,  22,  18,   0])}
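
The same distances can also be computed in one step with NumPy broadcasting, which avoids the Python loop. The sketch below is an alternative to the dictionary above, not the notebook's own method; the names `centers` and `dist` are illustrative, and the center values are copied from the printed output.

```python
import numpy as np

data = np.array([17, 28, 50, 60, 80, 89, 150, 167, 171, 189])
centers = np.array([28, 80, 189])  # the initial centers printed above

# data[:, None] has shape (10, 1) and centers[None, :] has shape (1, 3);
# broadcasting yields a (10, 3) matrix with dist[i, j] = |data[i] - centers[j]|
dist = np.abs(data[:, None] - centers[None, :])

# transpose to get one row per center, the same layout as the dictionary
print(dist.T)
```

Because the matrix is indexed by position rather than by center value, it stays well defined even if two centers happen to have the same value, which a dictionary keyed by center cannot represent.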

In [5]:
# convert the dictionary to a DataFrame for visualization purposes
df=pd.DataFrame.from_dict(distance, orient='index',
                       columns=data)
df

     17   28   50   60   80   89   150  167  171  189
28    11    0   22   32   52   61  122  139  143  161
80    63   52   30   20    0    9   70   87   91  109
189  172  161  139  129  109  100   39   22   18    0


In [6]:
# for each data point, assign it to the cluster such that the distance between it and the cluster center is minimal

# initialize one empty list per cluster
cluster = {x: [] for x in range(k)}

# each DataFrame column is a data point; idxmin() returns the row label
# (the center value) with the smallest distance in that column
columns = list(df)
for i in columns:
    if df[i].idxmin() == u[0]:
        cluster[0].append(i)
    elif df[i].idxmin() == u[1]:
        cluster[1].append(i)
    else:
        cluster[2].append(i)

print("After the first assignment, cluster: " + str(dict(cluster))) 

After the first assignment, cluster: {0: [17, 28, 50], 1: [60, 80, 89], 2: [150, 167, 171, 189]}
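
The assignment step can be written without the `if`/`elif` chain, so that it works for any number of clusters. The following is a minimal sketch under the assumption that distances are held in a NumPy matrix rather than a DataFrame; `labels` is an illustrative name, and the centers are the ones printed earlier in this run.

```python
import numpy as np

data = np.array([17, 28, 50, 60, 80, 89, 150, 167, 171, 189])
centers = np.array([28, 80, 189])  # initial centers from the run above

# distance of every point to every center, shape (10, 3)
dist = np.abs(data[:, None] - centers[None, :])

# labels[i] is the position (0, 1 or 2) of the center closest to data[i];
# on a tie, argmin picks the center with the lowest position
labels = dist.argmin(axis=1)

# group the points by label
cluster = {j: data[labels == j].tolist() for j in range(len(centers))}
print(cluster)
```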


In [7]:
# for each cluster, update the cluster center
u[0]=np.mean(cluster.get(0))
u[1]=np.mean(cluster.get(1))
u[2]=np.mean(cluster.get(2))
print("Updated cluster centers:")
print(u)

Updated cluster centers:
[31.666666666666668, 76.33333333333333, 169.25]
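
The update step simply replaces each center with the mean of the points assigned to it. A generic version for any k might look like the sketch below. The helper name `update_centers` is hypothetical, and keeping the old center when a cluster is empty is an assumption made here (the notebook's data never produces an empty cluster, so it does not address that case).

```python
import numpy as np

def update_centers(cluster, old_centers):
    """Return the mean of each cluster; keep the old center if a cluster is empty."""
    new_centers = []
    for j, old in enumerate(old_centers):
        points = cluster.get(j, [])
        # np.mean of an empty list would give nan, so fall back to the old center
        new_centers.append(float(np.mean(points)) if points else float(old))
    return new_centers

cluster = {0: [17, 28, 50], 1: [60, 80, 89], 2: [150, 167, 171, 189]}
print(update_centers(cluster, [28, 80, 189]))
```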


In [8]:
# second iteration: repeat the distance, assignment and update steps

distance = {}
for center in u:
    distance[center] = abs(data - center)
df = pd.DataFrame.from_dict(distance, orient='index', columns=data)
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)
# initialize one empty list per cluster
cluster = {x: [] for x in range(k)}
# assign each data point (column) to its nearest center
columns = list(df)
for i in columns:
    if df[i].idxmin() == u[0]:
        cluster[0].append(i)
    elif df[i].idxmin() == u[1]:
        cluster[1].append(i)
    else:
        cluster[2].append(i)
print("After the second assignment, cluster: " + str(dict(cluster)))
u[0]=np.mean(cluster.get(0))
u[1]=np.mean(cluster.get(1))
u[2]=np.mean(cluster.get(2))
print("Updated cluster centers:")
print(u)

                   17          28          50          60         80   \
31.666667    14.666667    3.666667   18.333333   28.333333  48.333333   
76.333333    59.333333   48.333333   26.333333   16.333333   3.666667   
169.250000  152.250000  141.250000  119.250000  109.250000  89.250000   

                  89          150         167         171         189  
31.666667   57.333333  118.333333  135.333333  139.333333  157.333333  
76.333333   12.666667   73.666667   90.666667   94.666667  112.666667  
169.250000  80.250000   19.250000    2.250000    1.750000   19.750000  
After the second assignment, cluster: {0: [17, 28, 50], 1: [60, 80, 89], 2: [150, 167, 171, 189]}
Updated cluster centers:
[31.666666666666668, 76.33333333333333, 169.25]


In [9]:
# third iteration: repeat the distance, assignment and update steps
distance = {}
for center in u:
    distance[center] = abs(data - center)
df = pd.DataFrame.from_dict(distance, orient='index', columns=data)
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)
# initialize one empty list per cluster
cluster = {x: [] for x in range(k)}

# assign each data point (column) to its nearest center
columns = list(df)
for i in columns:
    if df[i].idxmin() == u[0]:
        cluster[0].append(i)
    elif df[i].idxmin() == u[1]:
        cluster[1].append(i)
    else:
        cluster[2].append(i)

print("After the third assignment, cluster: " + str(dict(cluster)))
u[0]=np.mean(cluster.get(0))
u[1]=np.mean(cluster.get(1))
u[2]=np.mean(cluster.get(2))
print("Updated cluster centers:")
print(u)

                   17          28          50          60         80   \
31.666667    14.666667    3.666667   18.333333   28.333333  48.333333   
76.333333    59.333333   48.333333   26.333333   16.333333   3.666667   
169.250000  152.250000  141.250000  119.250000  109.250000  89.250000   

                  89          150         167         171         189  
31.666667   57.333333  118.333333  135.333333  139.333333  157.333333  
76.333333   12.666667   73.666667   90.666667   94.666667  112.666667  
169.250000  80.250000   19.250000    2.250000    1.750000   19.750000  
After the third assignment, cluster: {0: [17, 28, 50], 1: [60, 80, 89], 2: [150, 167, 171, 189]}
Updated cluster centers:
[31.666666666666668, 76.33333333333333, 169.25]


In [10]:
print("The clusters no longer change after 3 iterations, so the algorithm has converged. The K-means clustering result is:")
print(str(dict(cluster)))

The clusters no longer change after 3 iterations, so the algorithm has converged. The K-means clustering result is:
{0: [17, 28, 50], 1: [60, 80, 89], 2: [150, 167, 171, 189]}
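
The three hand-written iterations above can be folded into a single loop that stops as soon as the centers stop moving. This is a minimal sketch for one-dimensional data, not the notebook's original code; the function name `kmeans_1d`, the `max_iter` cap and the rule of leaving an empty cluster's center unchanged are all assumptions.

```python
import numpy as np

def kmeans_1d(data, centers, max_iter=100):
    """Run K-means on 1-D data until the centers stop changing.

    Returns the labels from the last assignment step, the final centers,
    and the number of iterations that were run.
    """
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for iteration in range(1, max_iter + 1):
        # assignment step: nearest center for every point
        dist = np.abs(data[:, None] - centers[None, :])
        labels = dist.argmin(axis=1)
        # update step: mean of each non-empty cluster
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):
                new_centers[j] = data[labels == j].mean()
        # converged when the update leaves the centers unchanged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers, iteration

data = [17, 28, 50, 60, 80, 89, 150, 167, 171, 189]
labels, centers, n_iter = kmeans_1d(data, [28, 80, 189])
print(labels, centers, n_iter)
```

With the initial centers from this run, the loop detects convergence in its second pass, which matches the observation that the second and third manual iterations reproduce the first assignment.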


Different initial cluster centers can lead to different clustering results in this example, which illustrates that K-means is sensitive to its initialization. A common remedy is to run several trials with different initial centers and keep the one with the lowest sum of squared errors (SSE). I will explore this further in a separate notebook.
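
As a rough illustration of the multiple-trials idea, the sketch below runs plain K-means from several random initializations and keeps the trial with the lowest SSE. The helper names `run_kmeans` and `sse`, the number of trials (10) and the seed (0) are arbitrary choices for this example, not part of the original notebook.

```python
import numpy as np

data = np.array([17, 28, 50, 60, 80, 89, 150, 167, 171, 189], dtype=float)

def run_kmeans(data, centers, max_iter=100):
    """Plain K-means on 1-D data; returns (labels, centers)."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        labels = np.abs(data[:, None] - centers[None, :]).argmin(axis=1)
        # mean of each cluster; an empty cluster keeps its old center (assumption)
        new_centers = np.array([
            data[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def sse(data, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return float(((data - centers[labels]) ** 2).sum())

rng = np.random.default_rng(0)  # fixed seed so the trials are repeatable
trials = []
for _ in range(10):
    init = rng.choice(data, size=3, replace=False)  # 3 distinct data points
    labels, centers = run_kmeans(data, init)
    trials.append((sse(data, labels, centers), labels, centers))

# keep the trial with the smallest SSE
best_sse, best_labels, best_centers = min(trials, key=lambda t: t[0])
print(best_sse, np.sort(best_centers))
```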

For comparison, scikit-learn's KMeans with its default settings gives the following cluster labels:

In [12]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3,random_state=0).fit(data.reshape(-1,1))
kmeans.predict(data.reshape(-1,1))

array([2, 2, 2, 0, 0, 0, 1, 1, 1, 1], dtype=int32)
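
The label numbers in this array are arbitrary identifiers, so they need not match the numbering used in the manual run. One way to compare the two results is to group the points by label and compare the groups themselves. The sketch below hard-codes the label array from the output above instead of re-running scikit-learn; `by_label` and `manual` are illustrative names.

```python
import numpy as np

data = np.array([17, 28, 50, 60, 80, 89, 150, 167, 171, 189])
labels = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1, 1])  # copied from the output above

# collect the data points that share each label
by_label = {int(j): data[labels == j].tolist() for j in np.unique(labels)}
print(by_label)

# result of the manual run; compare the groups while ignoring label numbers
manual = {0: [17, 28, 50], 1: [60, 80, 89], 2: [150, 167, 171, 189]}
same_partition = sorted(by_label.values()) == sorted(manual.values())
print(same_partition)
```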

# 2. For K-means, what will the final clusters be after 3 iterations if K=3 and the initial centers are 150, 171 and 189?

In [14]:
data = np.array([17, 28, 50, 60, 80, 89, 150, 167, 171, 189])
k = 3
n_iterations = 3  # the question asks for exactly 3 iterations
u = [150, 171, 189]

for j in range(n_iterations):
    # distance from every data point to each current center
    distance = {}
    for center in u:
        distance[center] = abs(data - center)
    df = pd.DataFrame.from_dict(distance, orient='index', columns=data)
    pd.set_option("display.max_rows", None, "display.max_columns", None)
    print(df)
    # initialize one empty list per cluster
    cluster = {x: [] for x in range(k)}
    # assign each data point (column) to its nearest center
    columns = list(df)
    for i in columns:
        if df[i].idxmin() == u[0]:
            cluster[0].append(i)
        elif df[i].idxmin() == u[1]:
            cluster[1].append(i)
        else:
            cluster[2].append(i)
    print("After the assignment, cluster: " + str(dict(cluster)))
    # update each center to the mean of its cluster
    u[0] = np.mean(cluster.get(0))
    u[1] = np.mean(cluster.get(1))
    u[2] = np.mean(cluster.get(2))
    print("Updated cluster centers:")
    print(u)

     17   28   50   60   80   89   150  167  171  189
150  133  122  100   90   70   61    0   17   21   39
171  154  143  121  111   91   82   21    4    0   18
189  172  161  139  129  109  100   39   22   18    0
After the assignment, cluster: {0: [17, 28, 50, 60, 80, 89, 150], 1: [167, 171], 2: [189]}
Updated cluster centers:
[67.71428571428571, 169.0, 189.0]
                   17          28          50          60          80   \
67.714286    50.714286   39.714286   17.714286    7.714286   12.285714   
169.000000  152.000000  141.000000  119.000000  109.000000   89.000000   
189.000000  172.000000  161.000000  139.000000  129.000000  109.000000   

                   89         150        167         171         189  
67.714286    21.285714  82.285714  99.285714  103.285714  121.285714  
169.000000   80.000000  19.000000   2.000000    2.000000   20.000000  
189.000000  100.000000  39.000000  22.000000   18.000000    0.000000  
After the assignment, cluster: {0: [17, 28, 50, 60, 8