Clustering is one of the most common exploratory data analysis techniques used to get an intuition
about the structure of the data. It can be defined as the task of identifying subgroups in the data
such that data points in the same subgroup (cluster) are very similar, while data points in different
clusters are very different.

In other words, we try to find homogeneous subgroups within the data such that data points in each cluster
are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance.
The choice of similarity measure is application-specific.
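
To make the difference between the two measures concrete, here is a minimal sketch. It assumes that "correlation-based distance" means 1 minus the Pearson correlation, which is one common definition but not the only one; the function names and the vectors `a` and `b` are made up for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def correlation_distance(a, b):
    """Assumed definition: 1 - Pearson correlation of the two vectors."""
    r = np.corrcoef(a, b)[0, 1]
    return float(1.0 - r)

# b has the same profile as a, only on a ten times larger scale.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

print("Euclidean distance:  ", euclidean_distance(a, b))
print("Correlation distance:", correlation_distance(a, b))
```

The two vectors are far apart in Euclidean terms but essentially identical under the correlation-based measure, which is why the choice depends on whether magnitude or shape matters in the application.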

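Clustering can be done in two ways:
Based on features, where we look for subgroups of samples using their features.
Based on samples, where we look for subgroups of features using the samples.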

Clustering is considered an unsupervised learning method, since we don’t have ground-truth labels against which to compare
the output of the clustering algorithm and evaluate its performance.

K-means algorithm

K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping
subgroups (clusters).

Working

It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s
centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.
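
The quantity being minimized can be written as a small function. The sketch below is illustrative only: the name `within_cluster_ss` and the four-point toy dataset are assumptions, not part of the original notebook.

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared distances between each point and its own cluster's centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=int)
    centroids = np.asarray(centroids, dtype=float)
    diffs = X - centroids[labels]  # row i is X[i] minus the centroid of its cluster
    return float((diffs ** 2).sum())

# Toy data: two pairs of points, one pair around x = 0 and one around x = 10.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Grouping the nearby points together, with centroids at the cluster means.
good = within_cluster_ss(X, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])

# Grouping distant points together, again with centroids at the cluster means.
bad = within_cluster_ss(X, [0, 1, 0, 1], [[5.0, 0.0], [5.0, 2.0]])

print(good, bad)
```

The tighter grouping gives the smaller value, which is the sense in which K-means prefers it.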

The K-means algorithm works as follows:

Specify the number of clusters K.

Initialize the centroids by first shuffling the dataset and then randomly selecting K data points
for the centroids without replacement.

Keep iterating the following steps until there is no change to the centroids, i.e., the assignment of data points to clusters no longer changes:

Compute the squared distance between each data point and every centroid.

Assign each data point to the closest cluster (centroid).

Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
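
The steps above can be sketched compactly with NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the toy data, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch following the steps listed above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Initialization: pick k distinct data points (sampling without replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Squared distance from every point to every centroid, shape (n_samples, k).
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

        # Assign each point to its closest centroid.
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments did not change, so the centroids will not change either
        labels = new_labels

        # Recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)

    return centroids, labels

# Two well-separated groups of three points each (made-up data).
toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
toy_centroids, toy_labels = kmeans(toy, k=2)
print(toy_centroids)
print(toy_labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.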

import numpy as np
import pandas as pd
from copy import deepcopy

# With ord=None (the default), np.linalg.norm returns the 2-norm of (a - b) along `ax`:
# row-wise Euclidean distances when ax=1, and the Frobenius norm of the whole array when ax=None.
def euclidean(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)


def main():
    k = 3
    X = pd.read_csv('kmeans.csv',index_col=False)
    print(X)

    x1 = X['X1'].values
    x2 = X['X2'].values
    X = np.array(list(zip(x1, x2)))
   
    C_x = [6.2, 6.6 ,6.5]
    C_y = [3.2, 3.7, 3.0]
    Centroid = np.array(list(zip(C_x, C_y)), dtype=np.float32)
    print("Initial Centroids shape")
    print(Centroid.shape)
    print()
    print(Centroid)

    Centroid_old = np.zeros(Centroid.shape)  # previous centroids, initialised to zeros
    print(Centroid_old)
    # Cluster labels (0, 1, 2), one per data point
    clusters = np.zeros(len(X))
    print(clusters)
    # Total movement of the centroids since the previous iteration
    error = euclidean(Centroid, Centroid_old, None)
    print(error)
    iterr = 0
    # Loop until the centroids stop moving (error becomes zero)
    while error != 0:
        iterr = iterr + 1
        # Assign each point to its closest centroid
        for i in range(len(X)):
            distances = euclidean(X[i], Centroid)
            # np.argmin returns the index of the smallest distance
            clusters[i] = np.argmin(distances)

        Centroid_old = deepcopy(Centroid)

        # Recompute each centroid as the mean of the points assigned to it
        for p in range(k):
            points = [X[j] for j in range(len(X)) if clusters[j] == p]
            if points:  # an empty cluster keeps its previous centroid instead of becoming NaN
                Centroid[p] = np.mean(points, axis=0)
        print(" Centre of the clusters after ", iterr," Iteration \n", Centroid)
        error = euclidean(Centroid, Centroid_old, None)
        print("Error  ... ",error)
    

if __name__ == "__main__": 
    main()



    X1   X2
0  5.9  3.2
1  4.6  2.9
2  6.2  2.8
3  4.7  3.2
4  5.5  4.2
5  5.0  3.0
6  4.9  3.1
7  6.7  3.1
8  5.1  3.8
9  6.0  3.0
Initial Centroids shape
(3, 2)

[[6.2 3.2]
 [6.6 3.7]
 [6.5 3. ]]
[[0. 0.]
 [0. 0.]
 [0. 0.]]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
12.537144692236463
 Centre of the clusters after  1  Iteration 
 [[5.1714287 3.1714287]
 [5.5       4.2      ]
 [6.45      2.95     ]]
Error  ...  1.5886393
 Centre of the clusters after  2  Iteration 
 [[4.8   3.05 ]
 [5.3   4.   ]
 [6.2   3.025]]
Error  ...  0.5484787
 Centre of the clusters after  3  Iteration 
 [[4.8   3.05 ]
 [5.3   4.   ]
 [6.2   3.025]]
Error  ...  0.0
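
As a quick check on the run above, the following sketch re-enters the ten points and the final centroids exactly as printed in the output, recovers each point's cluster label, and confirms that the centroids are the means of their assigned points. The variable names are chosen for this example only.

```python
import numpy as np

# Data points and final centroids copied from the printed output above.
X = np.array([[5.9, 3.2], [4.6, 2.9], [6.2, 2.8], [4.7, 3.2], [5.5, 4.2],
              [5.0, 3.0], [4.9, 3.1], [6.7, 3.1], [5.1, 3.8], [6.0, 3.0]])
centroids = np.array([[4.8, 3.05], [5.3, 4.0], [6.2, 3.025]])

# Distance from every point to every centroid, shape (10, 3).
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dist.argmin(axis=1)
print(labels)  # [2 0 2 0 1 0 0 2 1 2]

# At convergence, recomputing the means from these labels reproduces the centroids.
recomputed = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
print(np.allclose(recomputed, centroids))  # True
```

Four points fall in cluster 0, two in cluster 1, and four in cluster 2, and a further iteration would leave the centroids unchanged, which matches the final error of 0.0.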
