## k-Means Clustering

k-Means clustering is an unsupervised learning algorithm, so we start without labels and without knowing exactly how many clusters the data contains. Let's say we have a list of N house prices. Looking at such an unlabeled list, you might first think of calculating summary statistics such as the mean, median, and standard deviation. These metrics describe the location and spread of the dataset. However, to cluster or group the house prices, for example into cheap and expensive houses, we need to look at the data differently.

In [2]:
# let's say each value in the list is a house price in units of $10K
data_list = [10, 12, 22, 23, 25, 30, 40, 50, 80, 90, 120, 140]
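
The summary statistics mentioned above can be computed directly with NumPy. This is only a small sketch on the same list, and the variable names are illustrative:

```python
import numpy as np

data_list = [10, 12, 22, 23, 25, 30, 40, 50, 80, 90, 120, 140]

mean_price = np.mean(data_list)      # 53.5
median_price = np.median(data_list)  # 35.0 (average of the two middle values, 30 and 40)
std_price = np.std(data_list)        # population standard deviation

print("mean:", mean_price, "median:", median_price, "std:", round(float(std_price), 2))
```

These numbers summarize the list as a whole, but they do not say which houses belong together, which is what clustering is for.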


The k-Means clustering algorithm starts by deciding how many clusters we want. Let's say we start with K = 2 clusters.

1. We initialize K centroids randomly.
2. We calculate the distance from every data point to each centroid.
3. We assign each data point to its closest centroid. This gives us our initial clusters.
4. We move each centroid to the mean of the data points in its cluster.
5. We repeat steps 2-4 until `convergence`. At convergence the centroids stop changing, which marks the end of training.
Representing each data point by its nearest centroid, as k-Means does, is known as vector `quantization`.
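
Steps 2-4 above can be sketched for a single iteration with NumPy. This is a minimal illustration, not the implementation used in this notebook, and the starting centroids are example values chosen by hand rather than drawn at random:

```python
import numpy as np

points = np.array([10, 12, 22, 23, 25, 30, 40, 50, 80, 90, 120, 140], dtype=float)
centroids = np.array([131.0, 139.0])  # example starting centroids (K = 2)

# Step 2: distance of every point to every centroid, shape (n_points, K)
distances = np.abs(points[:, None] - centroids[None, :])

# Step 3: index of the closest centroid for each point
labels = np.argmin(distances, axis=1)

# Step 4: move each centroid to the mean of the points assigned to it
new_centroids = np.array([points[labels == k].mean() for k in range(len(centroids))])

print(labels)         # every point except 140 is closest to the first centroid
print(new_centroids)  # approximately [45.64, 140.0]
```

Repeating these three steps with `new_centroids` as the input is what step 5 describes.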

However, the random initial centroids can lead to different clustering outcomes, because k-Means is only guaranteed to converge to a local minimum, not the global one.
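
A common way to reduce the effect of a bad start is to run the algorithm several times and keep the run with the lowest within-cluster sum of squared distances. The sketch below assumes that approach; `run_kmeans_1d` is a hypothetical helper written for this illustration, and K = 3 and the fixed seed are arbitrary choices:

```python
import random
import numpy as np


def run_kmeans_1d(points, k, rng, max_iter=100):
    """Hypothetical helper: plain 1-D k-means from one random start."""
    points = np.asarray(points, dtype=float)
    # start from k different data points so no cluster begins empty
    centroids = np.array(rng.sample(points.tolist(), k))
    for _ in range(max_iter):
        labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
        new_centroids = np.array([
            points[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # final assignment and within-cluster sum of squared distances
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    inertia = float(np.sum((points - centroids[labels]) ** 2))
    return centroids, inertia


data_list = [10, 12, 22, 23, 25, 30, 40, 50, 80, 90, 120, 140]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable

runs = [run_kmeans_1d(data_list, k=3, rng=rng) for _ in range(10)]
best_centroids, best_inertia = min(runs, key=lambda run: run[1])

print("inertia per run:", [round(inertia, 1) for _, inertia in runs])
print("best centroids:", np.sort(best_centroids))
```

If the printed values differ between runs, those runs ended in different local minima. Keeping the best of several runs does not guarantee the global minimum, but it makes a poor result less likely.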

Note that I assume the data points are 1-dimensional, so the distance is simply the absolute difference between two points. The algorithm can later be extended to multi-dimensional datasets.
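
For multi-dimensional data, one option is to replace the absolute difference with the Euclidean distance. The function below is a sketch under that assumption; `euclidean_distance` is an illustrative name and is not part of the class in this notebook:

```python
import numpy as np


def euclidean_distance(p1, p2):
    """Straight-line distance between two points of any dimension."""
    p1 = np.atleast_1d(p1).astype(float)
    p2 = np.atleast_1d(p2).astype(float)
    return float(np.linalg.norm(p1 - p2))


print(euclidean_distance(10, 25))          # 1-D: same as abs(10 - 25) = 15.0
print(euclidean_distance([0, 0], [3, 4]))  # 2-D: 5.0
```

In one dimension this gives the same result as the absolute difference, so the 1-D case stays a special case of the general one.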

In [3]:
class ClusterDataPoint():
    def __init__(self, _point, _centroid):
        self._point = _point
        self._centroid = _centroid

    def update_centroid(self, _centroid):
        self._centroid = _centroid


In [29]:
import numpy as np
import random


class KMeans():
    def __init__(self, _X, _K):
        self._X = _X
        self._ClusterPoints = []
        self._min = np.min(self._X)
        self._max = np.max(self._X)
        self._K = _K
        self._centroids = []
        self._clusters = []
        self._converged = False

        self.initiate_centroids()


    def calculate_distance(self, p1, p2):
        return abs(p1 - p2)


    def initiate_centroids(self):
        for i in range(0, self._K):
            rnumber = random.randint(self._min, self._max)
            self._centroids.append(rnumber)

        print("Generated initial random centroids: ", self._centroids)

        # assign points to the initial centroids
        self.assign_centroids()


    def assign_centroids(self):
        self._ClusterPoints.clear()
        for p in self._X:
            cdp = ClusterDataPoint(p, None)
            minDist = -1
            for c in self._centroids:
                dist = self.calculate_distance(c, p)
                if minDist == -1 or dist < minDist:
                    minDist = dist
                    cdp.update_centroid(c)
            self._ClusterPoints.append(cdp)


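    def recalculate_centroids(self):
        new_centroids = []
        for c in self._centroids:
            # filter cluster points for current centroid
            cluster_points = [cp._point for cp in self._ClusterPoints if cp._centroid == c]
            if cluster_points:
                new_centroids.append(float(np.mean(cluster_points)))
            else:
                # no point was assigned to this centroid: keep it in place,
                # because the mean of an empty list would be NaN
                new_centroids.append(c)
        self._centroids = new_centroids

        # reassign points to the updated centroids
        self.assign_centroids()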


    def fit(self):
        # stop once no centroid moves by more than stop_threshold between iterations
        stop_threshold = 1
        while not self._converged:
            prev_centroids = self._centroids
            self.recalculate_centroids()
            converged = True
            print("Convergence check: self._centroids: ", self._centroids, ", prev_centroids: ", prev_centroids)
            for i in range(0, len(self._centroids)):
                # compare the absolute shift so that movement in either direction counts
                if abs(self._centroids[i] - prev_centroids[i]) > stop_threshold:
                    converged = False
                    break
            self._converged = converged

        print("KMeans centroids converged at: ", self._centroids)


    def predict(self, _X_test):
        pass
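
The `predict` method is left as a stub. One plausible way to fill it in, assuming that prediction means assigning each new point to its nearest fitted centroid, is sketched below as a standalone function. `predict_nearest_centroid` is a hypothetical name, and the centroids are example values; inside the class they would come from `self._centroids` after `fit()`:

```python
def predict_nearest_centroid(centroids, new_points):
    """Hypothetical helper: return the closest centroid for each new 1-D point."""
    return [min(centroids, key=lambda c: abs(c - p)) for p in new_points]


fitted_centroids = [26.5, 107.5]  # example centroids for a two-cluster fit
predictions = predict_nearest_centroid(fitted_centroids, [15, 60, 70, 200])
print(predictions)  # [26.5, 26.5, 107.5, 107.5]
```

The boundary between the two groups sits halfway between the centroids, at 67, which is why 60 and 70 end up in different clusters.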


In [31]:
model = KMeans(data_list, 2)


Generated initial random centroids:  [131, 139]


In [32]:
model.fit()


Convergence check: self._centroids:  [45.63636363636363, 140.0] , prev_centroids:  [131, 139]
Convergence check: self._centroids:  [38.2, 130.0] , prev_centroids:  [45.63636363636363, 140.0]
Convergence check: self._centroids:  [32.44444444444444, 116.66666666666667] , prev_centroids:  [38.2, 130.0]
Convergence check: self._centroids:  [26.5, 107.5] , prev_centroids:  [32.44444444444444, 116.66666666666667]
Convergence check: self._centroids:  [26.5, 107.5] , prev_centroids:  [26.5, 107.5]
KMeans centroids converged at:  [26.5, 107.5]


In [33]:
for cluster_point in model._ClusterPoints:
    print(cluster_point._point, ", centroid: ", cluster_point._centroid)


10 , centroid:  26.5
12 , centroid:  26.5
22 , centroid:  26.5
23 , centroid:  26.5
25 , centroid:  26.5
30 , centroid:  26.5
40 , centroid:  26.5
50 , centroid:  26.5
80 , centroid:  107.5
90 , centroid:  107.5
120 , centroid:  107.5
140 , centroid:  107.5
