# The k-means method

In [1]:
import numpy as np

In [2]:
data = np.random.random(90).reshape(30,3)
data

array([[0.89090551, 0.56126478, 0.87989139],
       [0.64128864, 0.11903265, 0.50539863],
       [0.75172952, 0.22686654, 0.79464664],
       [0.88163954, 0.03313545, 0.95246562],
       [0.81560491, 0.96473145, 0.84889113],
       [0.29591591, 0.96225747, 0.93968252],
       [0.99010556, 0.63916651, 0.77655312],
       [0.60781298, 0.04837202, 0.15407631],
       [0.33537468, 0.92786875, 0.68071715],
       [0.70389511, 0.66399689, 0.89461629],
       [0.26914764, 0.02215808, 0.97016708],
       [0.31313033, 0.40164826, 0.33578608],
       [0.60826113, 0.45227979, 0.22570639],
       [0.41639882, 0.25050891, 0.51967904],
       [0.89509144, 0.74224393, 0.53046497],
       [0.35365218, 0.09588486, 0.81038405],
       [0.5492481 , 0.70022281, 0.41710388],
       [0.15448678, 0.23672393, 0.52279437],
       [0.01225707, 0.47199699, 0.8525722 ],
       [0.5794768 , 0.76735035, 0.29068061],
       [0.00127137, 0.32236848, 0.13008494],
       [0.52353428, 0.45048684, 0.70119818],
       [0.

Suppose an expert decides that the data should be split into two clusters. We pick two observations at random to serve as the initial cluster centers.

In [3]:
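# Draw two distinct row indices so both starting centers are different points
c1, c2 = np.random.choice(len(data), size=2, replace=False)
# Stack the two chosen observations into a (2, 3) array of initial centers
clust_centers = np.vstack([data[c1], data[c2]])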
clust_centers

array([[0.90614291, 0.02255103, 0.78307771],
       [0.89090551, 0.56126478, 0.87989139]])

In [4]:
from scipy.cluster.vq import vq

In [5]:
vq(data, clust_centers)

(array([1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
        0, 1, 1, 0, 1, 1, 1, 0]),
 array([0.        , 0.39567937, 0.25636333, 0.17147801, 0.41160242,
        0.71998801, 0.16305866, 0.69664211, 0.69475416, 0.21387753,
        0.66390176, 0.80953825, 0.72091858, 0.60099328, 0.393535  ,
        0.55800492, 0.59178662, 0.82377448, 0.88359389, 0.69758759,
        1.15545773, 0.42327844, 0.19461566, 0.26153993, 0.89819723,
        0.75277009, 0.77827   , 0.79352704, 0.24911536, 0.        ]))

The first array tells us which cluster each observation is assigned to, given as the row index of the nearest center in clust_centers. The second array gives the Euclidean distance from each observation to that center. The two observations chosen as centers have distance 0.
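To make the two returned arrays concrete, here is a minimal sketch that recomputes them with plain NumPy and compares the result with vq. The names points and centers are stand-ins for data and clust_centers, and the fixed seed is an assumption added only so the example is reproducible.

```python
import numpy as np
from scipy.cluster.vq import vq

rng = np.random.default_rng(0)      # fixed seed (assumption, for reproducibility)
points = rng.random((30, 3))        # stand-in for `data`
centers = points[[0, 29]]           # stand-in for `clust_centers`

# Distance from every point to every center, shape (30, 2)
all_dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

labels = all_dists.argmin(axis=1)   # index of the nearest center for each point
nearest = all_dists.min(axis=1)     # distance to that nearest center

codes, dists = vq(points, centers)
print(np.array_equal(labels, codes), np.allclose(nearest, dists))
```

If both comparisons print True, vq is doing nothing more than a nearest-center lookup: it does not move the centers.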

In [6]:
from scipy.cluster.vq import kmeans

In [7]:
kmeans(data, clust_centers)

(array([[0.45406543, 0.2198752 , 0.64250279],
        [0.72257738, 0.69766675, 0.55144163]]),
 0.40393385951066174)

The first element of the result holds the two centroids: the first cluster has its centroid at [0.45406543, 0.2198752, 0.64250279] and the second at [0.72257738, 0.69766675, 0.55144163]. The number at the end is the distortion, which SciPy defines as the mean (non-squared) Euclidean distance between each observation and its nearest centroid. It is not the sum of squared errors that is often used as the k-means objective, and it is not normalized by the spread of the points around the global centroid.
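As a rough check on what the second returned value means, the sketch below compares it with the mean of the distances that vq reports for the final centroids, and prints the sum of squared distances next to it for contrast. The array points and the seed are assumptions standing in for the notebook's data. The two mean values should agree closely, though not necessarily to the last digit, because kmeans stops once the distortion changes by less than its threshold.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)      # fixed seed (assumption, for reproducibility)
points = rng.random((30, 3))        # stand-in for `data`
initial = points[[0, 29]]           # two observations as starting centroids

centroids, distortion = kmeans(points, initial)

# Distances from each point to its nearest final centroid
_, dists = vq(points, centroids)

print("distortion returned by kmeans:", distortion)
print("mean distance from vq:        ", dists.mean())
print("sum of squared distances:     ", (dists ** 2).sum())
```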

K-means also works if we give it the number of clusters we want instead of the centroids:

In [8]:
kmeans(data, 2)

(array([[0.45406543, 0.2198752 , 0.64250279],
        [0.72257738, 0.69766675, 0.55144163]]),
 0.40393385951066174)

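Here the result is the same. So we can call kmeans either with explicit initial centroids or with only the number of clusters k. In the second case SciPy picks k observations at random as the starting centroids and repeats the whole run several times (the iter parameter, 20 by default), keeping the centroids with the lowest distortion. Because the starting points are random, the two calls are not guaranteed to agree on every dataset.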
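The iteration itself can be sketched in a few lines. This is a simplified illustration of the usual assign-then-update loop (Lloyd's algorithm), not SciPy's actual implementation; the helper lloyd_kmeans, its parameters, and the sample array are all hypothetical names introduced for this example.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Hypothetical helper: plain Lloyd iterations from random starting centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points keeps its previous position)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break                   # centers stopped moving
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)      # fixed seed (assumption, for reproducibility)
sample = rng.random((30, 3))        # stand-in for `data`
centers, labels = lloyd_kmeans(sample, k=2)
print(centers.shape, np.bincount(labels, minlength=2))
```

A single run like this can settle in a poor local optimum when the starting centers are unlucky, which is one reason to repeat the procedure from several random starts and keep the best result.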