# [K-Means Clustering](https://en.wikipedia.org/wiki/K-means_clustering)

K-means clustering aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean. That mean, called the centroid, serves as a prototype of the cluster. The result is a partitioning of the data space into Voronoi cells.
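
Stated as an optimization problem, this is the standard within-cluster sum of squares objective. The symbols \\(S_j\\) (the set of observations in cluster j) and \\(\mu_j\\) (the mean of that set) are introduced here for illustration and are not used elsewhere in this notebook:

\\[ \underset{S_1, \dots, S_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2 \\]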

The algorithm works as follows, given inputs \\(x_1, x_2, x_3, ..., x_n \\) and a value of K:

* Step 1 - Pick K random points as the initial cluster centers, called centroids.
* Step 2 - Assign each \\(x_i\\) to the nearest cluster by calculating its distance to each centroid.
* Step 3 - Compute each new cluster center as the average of the points assigned to it.
* Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.
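
The four steps can be sketched compactly with NumPy broadcasting. This is a minimal illustration, not the notebook's own implementation: the function name `kmeans_sketch`, its `seed` and `max_iter` parameters, and the toy data are all assumptions made for the example. It also picks the initial centroids from the observations themselves, which is one common reading of Step 1.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Hypothetical helper: returns (centroids, labels) for X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: distance from every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        # Step 4: stop once no assignment changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated groups of made-up points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
```

On this toy data the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random initialization.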

Let's use the [code](https://github.com/mubaris/friendly-fortnight) to learn the algorithm. More information is available on [this blog](https://mubaris.com/posts/kmeans-clustering/).

In [None]:
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('ggplot')

In [None]:
data = pd.read_csv('data/xclara.csv')

In [None]:
data.shape

In [None]:
data.head()

In [None]:
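# Pull the two feature columns out of the DataFrame
f1 = data['V1'].values
f2 = data['V2'].values
# Stack them into an (n, 2) array with one row per observation
X = np.column_stack((f1, f2))
plt.scatter(f1, f2, c='black', s=7)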

In [None]:
X.shape

In [None]:
# Euclidean distance: ax=1 gives one distance per row, ax=None a single overall norm
def dist(a, b, ax=1):
    return np.linalg.norm(a-b, axis=ax)
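
To make the role of the `ax` argument concrete, here is a small sketch with made-up values (the arrays `point` and `centers` are illustrative only). With the default `ax=1`, broadcasting subtracts the point from every row and returns one distance per centroid. With `ax=None`, the whole difference array is reduced to a single number, which is how the notebook later measures how far the centroids moved.

```python
import numpy as np

def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

point = np.array([0.0, 0.0])
centers = np.array([[3.0, 4.0], [6.0, 8.0]])  # made-up centroids

# ax=1: one distance per row of `centers`
per_centroid = dist(point, centers)

# ax=None: a single norm over the whole array of differences
total = dist(centers, np.zeros((2, 2)), None)
```

Here `per_centroid` holds the distances 5 and 10, and `total` is the square root of 125.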

In [None]:
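# Number of clusters
k = 3
# Random integer coordinates for the initial centroids;
# randint expects integer bounds, so the upper bound is cast to int
C_x = np.random.randint(0, int(np.max(X))-20, size=k)
C_y = np.random.randint(0, int(np.max(X))-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
C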
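
Random coordinates can land far from any data, in which case a centroid may attract no points at all. A common alternative, shown here only as a sketch and not as what this notebook does, is to use k distinct observations as the starting centroids. The names `rng`, `X_demo`, `idx`, and `C_init` are assumptions for the example, and `X_demo` is synthetic stand-in data.

```python
import numpy as np

rng = np.random.default_rng(42)      # seeded only to make the sketch repeatable
X_demo = rng.normal(size=(100, 2))   # stand-in for the notebook's X
k = 3

# Choose k distinct row indices, then use those observations as centroids
idx = rng.choice(len(X_demo), size=k, replace=False)
C_init = X_demo[idx].astype(np.float32)
```

Because every starting centroid coincides with a data point, each cluster is guaranteed at least one member on the first assignment pass, provided the chosen points are not duplicates of one another.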

In [None]:
plt.scatter(f1, f2, c="#050505", s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')

In [None]:
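# Centroids from the previous iteration (none yet, so start at zeros)
C_old = np.zeros(C.shape)
# Cluster label (0..k-1) for each point
clusters = np.zeros(len(X))
# How far the centroids moved since the last iteration; the loop stops at 0
error = dist(C, C_old, None)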

In [None]:
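# Iterate until the centroids stop moving
while error != 0:
    # Assignment step: label each point with the index of its nearest centroid
    for i in range(len(X)):
        distances = dist(X[i], C)
        clusters[i] = np.argmin(distances)
    # Keep the previous centroids to measure how far they move
    C_old = deepcopy(C)
    # Update step: move each centroid to the mean of its assigned points
    for i in range(k):
        points = X[clusters == i]
        if len(points) > 0:  # an empty cluster keeps its old centroid (avoids NaN)
            C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)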
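
One way to put a number on the result is the within-cluster sum of squared distances, which is the quantity k-means tries to make small. The sketch below uses made-up arrays (`X_demo`, `C_demo`, `labels`) in place of the notebook's `X`, `C`, and `clusters`. Note that `clusters` in this notebook is a float array, so it would need `.astype(int)` before being used as an index in this way.

```python
import numpy as np

# Made-up points, centroids and labels standing in for X, C and clusters
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
C_demo = np.array([[1.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# Squared distance from each point to the centroid it is assigned to
sq = np.sum((X_demo - C_demo[labels]) ** 2, axis=1)
wcss = sq.sum()
```

Each of the four points sits one unit from its centroid, so `wcss` comes out to 4.0 here. Lower values mean tighter clusters for a fixed k.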

In [None]:
colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
# Plot each cluster's points in its own color
for i in range(k):
    points = X[clusters == i]  # boolean mask; stays 2-D even if the cluster is empty
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
# Mark the final centroids with stars
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')