# K-MEANS Clustering

In this notebook we implement the K-means clustering algorithm. K-means is an unsupervised learning technique that takes a set of datapoints and partitions them into k clusters. Each cluster is represented by its mean, called the "centroid", and every datapoint belongs to the cluster whose centroid is closest to it.

The K-means algorithm amounts to the following steps:
- Initialise k centroids
- Calculate the distance from each datapoint to every centroid and assign the datapoint to the closest one
- Move each centroid to the mean location of the datapoints assigned to it
- Repeat the process until the results converge, i.e. the centroids no longer move. I check this convergence with a rounding function, which can be made as sensitive as we desire
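
The steps above can be sketched in a few lines of NumPy. This is a minimal, vectorized illustration of a single assignment-and-update iteration on a hand-made dataset, not the implementation developed in this notebook; the names `points`, `centroids` and `labels` are assumptions chosen for the example.

```python
import numpy as np

# Four 2-D points forming two obvious groups (illustrative data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
# Two starting centroids, chosen by hand for this sketch.
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# Assignment step: distance from every point to every centroid, shape (4, 2).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)

# Update step: move each centroid to the mean of the points assigned to it.
new_centroids = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)         # cluster index of each point
print(new_centroids)  # updated centroid positions
```

In a full run, the two steps would be repeated with `new_centroids` as the next starting point until the centroids stop moving.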

## Preparing the necessary functions for our clustering algorithm:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sl
from sklearn.model_selection import train_test_split

The functions below perform the following tasks:
- Calculate the distance between two vectors
- Initialise the centroids
- Calculate the closest centroid for each datapoint
- Calculate the new mean of a cluster and move the corresponding centroid to the mean

In [12]:
def distance(a, b):
    #row-wise Euclidean distance; a must be a 2-D array (one row per vector) because of axis=1
    dis = np.linalg.norm(a - b, axis=1)
    return dis
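
As a quick check of how `distance` behaves, the sketch below repeats the function and applies it to a hand-made matrix of centroids and a single datapoint. NumPy broadcasting subtracts the datapoint from every row, so the result holds one distance per centroid. The example values are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np

def distance(a, b):
    # Row-wise Euclidean distance; a must be 2-D because of axis=1.
    return np.linalg.norm(a - b, axis=1)

# Two centroids (one per row) and one datapoint, both in 2-D.
centroids = np.array([[0.0, 0.0], [3.0, 4.0]])
datapoint = np.array([0.0, 0.0])

d = distance(centroids, datapoint)
print(d)  # one distance per centroid: [0. 5.]
```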

In [21]:
def ini_centroids(n_centroids, length):
    #initialize a matrix where each row is one centroid:
    mu = np.random.rand(n_centroids,length)
    
    return mu
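
`np.random.rand` draws the centroids uniformly from [0, 1) in every dimension, which suits data that already lies in that range. For data on another scale, a common alternative is to pick k distinct datapoints as the starting centroids. The sketch below shows that variant; `ini_centroids_from_data` and its `seed` parameter are hypothetical names, not part of this notebook.

```python
import numpy as np

def ini_centroids_from_data(X, n_centroids, seed=None):
    # Hypothetical helper: choose n_centroids distinct rows of X as starting centroids.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_centroids, replace=False)
    return X[idx].copy()

# Ten illustrative 2-D datapoints.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
mu_demo = ini_centroids_from_data(X_demo, 3, seed=0)
print(mu_demo.shape)  # (3, 2)
```

Starting from actual datapoints guarantees that every centroid begins inside the range of the data, whatever its scale.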

In [43]:
def a_centroid(centroids, datapoint):
    # calculate the closest centroid for a specific datapoint
    a = distance(centroids, datapoint)
    cluster = np.argmin(a)
    
    return cluster

In [119]:
def move_centroids(data, cluster, centroid_number):
    #we calculate the average of all the points belonging to the cluster of one centroid
    #and then move the centroid to that spot
    points = data[cluster == centroid_number, :]
    if len(points) != 0:
        new_c = np.mean(points, axis=0)
    else:
        #empty cluster: return a placeholder, the caller checks the point count
        new_c = 0
    return new_c, len(points)

## The fitting algorithm

In [120]:
def fit(X, n_centroids, decimal):
    #initialize our centroids: one row per centroid, one column per feature
    length = X.shape[1]
    mu = ini_centroids(n_centroids, length)
    group = np.zeros(X.shape[0])
    print(mu)
    while True:
        #keep a copy of the previous centroids for the convergence check below
        mu_backup = mu.copy()
        #assign every datapoint to its closest centroid
        for k in range(X.shape[0]):
            group[k] = a_centroid(mu, X[k, :])

        #move each centroid to the mean of its cluster;
        #a centroid with an empty cluster stays where it was
        new_mu = []
        for i in range(n_centroids):
            new_c, n_points = move_centroids(X, group, i)
            new_mu.append(new_c if n_points != 0 else mu_backup[i, :])
        mu = np.array(new_mu)

        #converged when no centroid has moved, up to `decimal` decimal places
        if np.all(np.round(mu - mu_backup, decimals=decimal) == 0):
            break

    return group, mu
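
A convergence check that compares the current state with the previous one only works if the previous state is a real copy. Plain assignment of a NumPy array creates a second name for the same memory, so in-place updates change both. A minimal sketch of the difference, with illustrative variable names:

```python
import numpy as np

current = np.zeros(3)

alias = current            # second name for the same underlying array
snapshot = current.copy()  # independent copy of the values

current[0] = 2.0           # in-place update, like filling an array element by element

print(np.array_equal(alias, current))     # True: the alias changed too
print(np.array_equal(snapshot, current))  # False: the copy kept the old values
```

Comparing against `alias` would always report "no change", so a loop relying on it would stop after its first pass.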

In [121]:
X = np.random.rand(4,4)

In [122]:
#third argument: number of decimals used in the convergence check (4 is an assumed value)
fit(X, 3, 4)

[[0.92688011 0.82579675 0.12234586 0.83468458]
 [0.51875652 0.00544407 0.71095672 0.14438704]
 [0.16527191 0.6458194  0.808197   0.49078459]]


(array([0., 0., 2., 0.]),
 array([[0.61181894, 0.57177062, 0.24837028, 0.79363518],
        [0.51875652, 0.00544407, 0.71095672, 0.14438704],
        [0.67857238, 0.56875037, 0.94678426, 0.55643573]]))