### Clustering

1)	Clustering groups objects by the characteristics they share, so that objects within a group are similar to one another and different from the objects in other groups. These groups are known as "clusters".<br>
2)	Clustering is a form of unsupervised learning: we have only a set of unlabeled input data, from which we must extract information without knowing in advance what the output will be.<br>
3)<b>	There is no need to split the data into training and testing datasets.</b>


#### Clustering algorithms fall into the following categories
1) Centroid-based - KMeans<br>
2) Density-based - DBSCAN (Density-based spatial clustering of applications with noise)<br>
3) Hierarchical - Divisive and Agglomerative Clustering

### KMeans

Where K = number of clusters

1)	K-means is an iterative algorithm that tries to partition the dataset into<b> K pre-defined, distinct, non-overlapping subgroups (clusters)</b>, where each data point belongs to exactly one group.<br>
2)	It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the<b> sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.</b> The less variation there is within a cluster, the more homogeneous (similar) its data points are.<br>
3)	Because clustering algorithms such as KMeans use distance-based measurements to determine the similarity between data points, it’s recommended to standardize or scale the data: the features in a dataset almost always have different units of measurement, for instance age vs. income.<br>
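To illustrate the scaling point, here is a minimal sketch of z-score standardization. The Age and Income values and the helper name `zscore` are made up for illustration and are not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy data: columns are Age (years) and Income (currency units).
X = np.array([[25, 30000],
              [40, 52000],
              [58, 41000],
              [33, 75000]], dtype=float)

def zscore(X):
    """Standardize each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_scaled = zscore(X)

# Before scaling, Income dominates any Euclidean distance because its
# values are thousands of times larger than Age; after scaling, both
# columns have the same spread.
print(X_scaled.std(axis=0))
```

After this step a difference of one standard deviation in Age counts as much as a difference of one standard deviation in Income.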


### K-Means Algorithm

1) Specify the number of clusters K.<br>
2) Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as centroids, without replacement.<br>
3) Compute the squared distance between each data point and every centroid.<br>
4) Assign each data point to the closest cluster (centroid).<br>
5) Recompute the centroid of each cluster by taking the average of all the data points that belong to it.<br>
6) Repeat steps 3, 4 and 5 until the centroids no longer change.
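The steps above can be sketched compactly in NumPy. This is an illustrative version under stated assumptions, not the notebook's own code: the function name `kmeans`, its parameters, the empty-cluster handling, and the toy points are all hypothetical.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty (an assumption).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(labels)
```

The `max_iter` cap is a safety net; on data like this the loop stops after a few passes because the assignments stabilise.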

<img src="kmeans1.png">

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = {'Age': np.random.randint(18, 80, 20),
        'Expense': np.random.randint(200, 1000, 20)}
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Age,Expense
0,57,722
1,22,871
2,52,522
3,67,297
4,79,504


In [3]:
df1 = df.copy()
df1.head()

Unnamed: 0,Age,Expense
0,57,722
1,22,871
2,52,522
3,67,297
4,79,504


### Assume K = 4

In [4]:
df.sample(4)

Unnamed: 0,Age,Expense
5,31,503
4,79,504
3,67,297
13,59,213
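Because `df.sample(4)` draws different rows on every run, it can help to pin the draw. A minimal sketch, assuming a stand-in DataFrame named `demo` (hypothetical, generated with a fixed seed rather than taken from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame; the fixed seed makes
# the random values repeatable across runs.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.integers(18, 80, size=20),
    "Expense": rng.integers(200, 1000, size=20),
})

# Draw K = 4 distinct rows as initial centroids; random_state pins the draw.
init = demo.sample(n=4, random_state=0)
centroids = init[["Age", "Expense"]].to_numpy()
print(init)
```

With `random_state` set, rerunning the cell selects the same four rows, so any centroids written down afterwards stay consistent with the data.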


### Initial cluster centroids

In [5]:
# Each centroid is [Age, Expense]; the trailing comment is a row number
k1 = [68,316]  # 19
k2 = [54,452]  # 4
k3 = [74,835]  # 2
k4 = [22,944]  # 3

#### Euclidean distance: $d\big((x_1, y_1), (x_2, y_2)\big) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

In [6]:
df['dist_k1'] = np.sqrt((k1[0] - df['Age'])**2 + (k1[1] - df['Expense'])**2) 
df['dist_k2'] = np.sqrt((k2[0] - df['Age'])**2 + (k2[1] - df['Expense'])**2)
df['dist_k3'] = np.sqrt((k3[0] - df['Age'])**2 + (k3[1] - df['Expense'])**2)
df['dist_k4'] = np.sqrt((k4[0] - df['Age'])**2 + (k4[1] - df['Expense'])**2)
df.head()

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4
0,57,722,406.148987,270.016666,114.271606,224.742074
1,22,871,556.903044,420.22018,63.245553,73.0
2,52,522,206.620425,70.028566,313.77221,423.065007
3,67,297,19.026298,155.544206,538.045537,648.563027
4,79,504,188.321534,57.697487,331.037762,443.676684
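The four distance columns can also be computed in one step with NumPy broadcasting. The sketch below reuses the first five rows and the four initial centroids shown above; the array names `points`, `centroids`, and `dists` are illustrative.

```python
import numpy as np

# First five (Age, Expense) rows and the four initial centroids from above.
points = np.array([[57, 722], [22, 871], [52, 522], [67, 297], [79, 504]],
                  dtype=float)
centroids = np.array([[68, 316], [54, 452], [74, 835], [22, 944]],
                     dtype=float)

# Broadcasting: (5, 1, 2) - (1, 4, 2) gives every point-centroid difference.
diff = points[:, None, :] - centroids[None, :, :]   # shape (5, 4, 2)
dists = np.sqrt((diff ** 2).sum(axis=2))            # shape (5, 4)
print(dists.round(6))
```

Each row of `dists` corresponds to one data point and each column to one centroid, so it lines up with the `dist_k1` to `dist_k4` columns of the table.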


In [7]:
df.head(20)

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4
0,57,722,406.148987,270.016666,114.271606,224.742074
1,22,871,556.903044,420.22018,63.245553,73.0
2,52,522,206.620425,70.028566,313.77221,423.065007
3,67,297,19.026298,155.544206,538.045537,648.563027
4,79,504,188.321534,57.697487,331.037762,443.676684
5,31,503,190.625287,55.946403,334.773057,441.091827
6,72,619,303.026401,167.967259,216.009259,328.823661
7,24,303,45.880279,151.990131,534.344458,641.00312
8,60,202,114.280357,250.07199,633.154799,742.972409
9,39,462,148.852276,18.027756,374.638492,482.299699


In [8]:
df[['dist_k1','dist_k2','dist_k3','dist_k4']].min(axis=1)

0     114.271606
1      63.245553
2      70.028566
3      19.026298
4      57.697487
5      55.946403
6     167.967259
7      45.880279
8     114.280357
9      18.027756
10     36.891733
11     38.327536
12     55.785303
13    103.392456
14     26.925824
15     11.704700
16    165.075740
17     48.826222
18     70.830784
19    173.141561
dtype: float64

In [9]:
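# Label each row with the cluster whose centroid is nearest.
r1 = []
for _, row in df.iterrows():
    nearest = row[['dist_k1','dist_k2','dist_k3','dist_k4']].min()
    # Compare the smallest distance with each column to find its cluster.
    if nearest == row['dist_k1']:
        r1.append('C1')
    elif nearest == row['dist_k2']:
        r1.append('C2')
    elif nearest == row['dist_k3']:
        r1.append('C3')
    else:
        r1.append('C4')
print(r1)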

['C3', 'C3', 'C2', 'C1', 'C2', 'C2', 'C2', 'C1', 'C1', 'C2', 'C1', 'C4', 'C1', 'C1', 'C4', 'C4', 'C3', 'C4', 'C3', 'C2']
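The same assignment can be done without an explicit loop using `DataFrame.idxmin`. A small sketch on the first three rows of the distance table; the names `dist`, `nearest`, and `labels` are illustrative.

```python
import pandas as pd

# Distances copied from the first three rows of the distance table.
dist = pd.DataFrame({
    "dist_k1": [406.148987, 556.903044, 206.620425],
    "dist_k2": [270.016666, 420.220180, 70.028566],
    "dist_k3": [114.271606, 63.245553, 313.772210],
    "dist_k4": [224.742074, 73.000000, 423.065007],
})

# idxmin(axis=1) returns, per row, the name of the column with the
# smallest value; map() turns that column name into a cluster label.
nearest = dist.idxmin(axis=1)
labels = nearest.map({"dist_k1": "C1", "dist_k2": "C2",
                      "dist_k3": "C3", "dist_k4": "C4"})
print(labels.tolist())  # ['C3', 'C3', 'C2']
```

On ties, `idxmin` returns the first matching column, which matches the order of the `if`/`elif` chain.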


In [10]:
df['Min_Dist'] = df[['dist_k1','dist_k2','dist_k3','dist_k4']].min(axis=1)
df['Cluster_Iter1']  = r1
df.head(20)

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4,Min_Dist,Cluster_Iter1
0,57,722,406.148987,270.016666,114.271606,224.742074,114.271606,C3
1,22,871,556.903044,420.22018,63.245553,73.0,63.245553,C3
2,52,522,206.620425,70.028566,313.77221,423.065007,70.028566,C2
3,67,297,19.026298,155.544206,538.045537,648.563027,19.026298,C1
4,79,504,188.321534,57.697487,331.037762,443.676684,57.697487,C2
5,31,503,190.625287,55.946403,334.773057,441.091827,55.946403,C2
6,72,619,303.026401,167.967259,216.009259,328.823661,167.967259,C2
7,24,303,45.880279,151.990131,534.344458,641.00312,45.880279,C1
8,60,202,114.280357,250.07199,633.154799,742.972409,114.280357,C1
9,39,462,148.852276,18.027756,374.638492,482.299699,18.027756,C2


In [11]:
df.sort_values(by='Cluster_Iter1')

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4,Min_Dist,Cluster_Iter1
13,59,213,103.392456,239.052296,622.180842,731.93579,103.392456,C1
12,54,262,55.785303,190.0,573.348934,682.75032,55.785303,C1
3,67,297,19.026298,155.544206,538.045537,648.563027,19.026298,C1
10,37,336,36.891733,117.239072,500.369863,608.185005,36.891733,C1
8,60,202,114.280357,250.07199,633.154799,742.972409,114.280357,C1
7,24,303,45.880279,151.990131,534.344458,641.00312,45.880279,C1
9,39,462,148.852276,18.027756,374.638492,482.299699,18.027756,C2
19,47,625,309.71277,173.141561,211.7286,319.978124,173.141561,C2
5,31,503,190.625287,55.946403,334.773057,441.091827,55.946403,C2
4,79,504,188.321534,57.697487,331.037762,443.676684,57.697487,C2


In [12]:
dfC1 = df[df['Cluster_Iter1']=='C1']
dfC2 = df[df['Cluster_Iter1']=='C2']
dfC3 = df[df['Cluster_Iter1']=='C3']
dfC4 = df[df['Cluster_Iter1']=='C4']
print(dfC1.shape)
print(dfC2.shape)
print(dfC3.shape)
print(dfC4.shape)

(6, 8)
(6, 8)
(4, 8)
(4, 8)


In [13]:
k1_iter2 = [dfC1['Age'].mean(),dfC1['Expense'].mean()]
k2_iter2 = [dfC2['Age'].mean(),dfC2['Expense'].mean()]
k3_iter2 = [dfC3['Age'].mean(),dfC3['Expense'].mean()]
k4_iter2 = [dfC4['Age'].mean(),dfC4['Expense'].mean()]

### Updated Cluster Centroids for Iteration 2

In [14]:
print(k1_iter2)
print(k2_iter2)
print(k3_iter2)
print(k4_iter2)

[50.166666666666664, 268.8333333333333]
[53.333333333333336, 539.1666666666666]
[40.0, 759.25]
[36.0, 918.25]
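The per-cluster means can also be computed with a single `groupby`. The sketch below uses only the C1 and C2 rows, since those are the clusters whose members all appear in the tables above; the names `assigned` and `centroids_iter2` are illustrative.

```python
import pandas as pd

# Rows assigned to C1 and C2 in iteration 1, copied from the tables above
# (the C3 and C4 rows are not all shown, so they are left out here).
assigned = pd.DataFrame({
    "Age":     [59, 54, 67, 37, 60, 24, 39, 47, 31, 79, 52, 72],
    "Expense": [213, 262, 297, 336, 202, 303, 462, 625, 503, 504, 522, 619],
    "Cluster_Iter1": ["C1"] * 6 + ["C2"] * 6,
})

# One groupby replaces the per-cluster filtering and .mean() calls.
centroids_iter2 = assigned.groupby("Cluster_Iter1")[["Age", "Expense"]].mean()
print(centroids_iter2)
```

The resulting C1 and C2 rows agree with the first two centroids printed above. These updated centroids then replace the initial ones in the next round of distance calculations.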
