## Tasks

### Task 1

In this task, you will compare the performance of two clustering algorithms, K-means and DBSCAN, under different parameter settings.

### Problem 1:

Run the K-means algorithm on the Image Segmentation data using 3, 5, 7, and 9 clusters with a 40% percentage split.

Is there any correlation between the number of clusters and the inertia (or the sum of the squared errors)? If there is, why?
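For reference, the inertia reported by scikit-learn's `KMeans` is the sum of squared Euclidean distances from each training point to its nearest cluster center. The sketch below checks this by recomputing the sum by hand and comparing it with `inertia_`. It runs on synthetic data: the `make_blobs` sample and the names `X_demo`, `km`, and `manual_inertia` are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 300 points with 4 features.
X_demo, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from every point to every center, shape (300, 3).
sq_dists = ((X_demo[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Inertia: for each point take the squared distance to its nearest center, then sum.
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point rounding.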





In [1]:
# Load the Image Segmentation data from segment.arff
import numpy as np
from scipy.io.arff import loadarff

with open("segment.arff", "r") as f:
    data, meta = loadarff(f)

print("There are %d data points:" % (data.size))       # number of instances

X = data[meta.names()[:-1]]                            # all attributes except the last (features)
Y = data[meta.names()[-1]]                             # last attribute (class label)

# Convert the structured array to a plain 2-D float array.
# np.float64 is used because the np.float_ alias was removed in NumPy 2.0.
X = np.asarray(X.tolist(), dtype=np.float64)
print(X.mean(axis=0))                                  # per-feature means




There are 2310 data points:
[ 1.24913853e+02  1.23417316e+02  9.00000000e+00  1.43338000e-02
  4.71380000e-03  1.89393964e+00  5.70932019e+00  2.42472330e+00
  8.24369225e+00  3.70515950e+01  3.28213087e+01  4.41878774e+01
  3.41456008e+01 -1.26908604e+01  2.14088501e+01 -8.71798933e+00
  4.51374684e+01  4.26892977e-01 -1.36289695e+00]


In [2]:
# Scaling puts all features on a comparable range, so that no single feature
# dominates the Euclidean distances used for clustering.
# StandardScaler and MinMaxScaler gave almost the same kind of result here.

# Option 1 (not used): standardization to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

#X_scaled = StandardScaler().fit_transform(X)
#print(X_scaled)

# after standardization, the mean of every feature is approximately 0
#print(X_scaled.mean(axis=0))

# Option 2 (used): min-max scaling of every feature to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)


[[0.85770751 0.69583333 0.         ... 0.49852673 0.318996   0.16848872]
 [0.44268775 0.49583333 0.         ... 0.01693669 1.         0.1546051 ]
 [0.7944664  0.125      0.         ... 0.92636309 0.199347   0.12494586]
 ...
 [0.31225296 0.25416667 0.         ... 0.49337195 0.314606   0.16015015]
 [0.38339921 0.50833333 0.         ... 0.01840943 1.         0.1546051 ]
 [0.07114625 0.56666667 0.         ... 0.04639172 0.713228   0.26332542]]


In [3]:
# Run the K-means algorithm on the Image Segmentation data using 3, 5, 7, and 9 clusters with a 40% percentage split.

# K-means implementation
from sklearn.cluster import KMeans

# plotting library
import matplotlib.pyplot as plt

# render plots inline in the notebook instead of in a separate window
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D

# split the dataset into training and test sets:
# test_size=0.60 holds out 60% for testing, leaving 40% for training
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.60, random_state=42)




In [4]:
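#clusters = 3
# fit K-means on the training split (40% of the data)
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
print(kmeans)

# assign each held-out test point to its nearest learned centroid
labels = kmeans.predict(X_test)

centroids = kmeans.cluster_centers_   # one row per cluster, one column per feature
inertia = kmeans.inertia_             # sum of squared distances on the training data

print(labels)
print(inertia)

print(centroids)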

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
[1 2 1 ... 1 0 1]
291.86590389039833
[[5.05717863e-01 5.38508304e-01 0.00000000e+00 3.91459075e-02
  4.44839858e-02 1.10475986e-01 1.20562889e-02 9.50465572e-02
  1.17036881e-02 3.34661186e-01 3.10638319e-01 3.93715080e-01
  2.94734811e-01 5.61275276e-01 4.93949940e-01 2.71914480e-01
  3.93738663e-01 3.07213477e-01 1.66276492e-01]
 [4.71590909e-01 5.19108073e-01 0.00000000e+00 4.62239583e-02
  1.26953125e-02 4.62400942e-02 2.22627788e-03 3.74369415e-02
  2.50847525e-03 7.48337026e-02 6.58665839e-02 8.61790685e-02
  7.13322318e-02 7.48149028e-01 2.03839238e-01 5.49769255e-01
  9.80689594e-02 5.39638215e-01 3.90957981e-01]
 [5.19295175e-01 1.54293893e-01 0.00000000e+00 2.79898219e-02
  3.81679389e-03 3.06504777e-02 5.67140727e-04 2.71016167e-02
  6.33649132e-04 8.19713467e-01 7.76455522e-01 8.90155591e-01
  7.

In [5]:
#clusters = 5
kmeans = KMeans(n_clusters=5, random_state=0).fit(X_train)


labels = kmeans.predict(X_test)


centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

print(labels)
print(inertia)

print(centroids)

[4 2 0 ... 3 1 4]
185.9153873767445
[[5.39601122e-01 7.95578880e-01 0.00000000e+00 6.36132316e-02
  3.81679389e-03 5.17227003e-02 1.33079928e-03 4.81248237e-02
  1.42782911e-03 1.10944563e-01 9.39971281e-02 9.34804656e-02
  1.45555349e-01 6.81511501e-01 7.42968329e-02 8.26483934e-01
  1.37556250e-01 4.13926840e-01 8.87380776e-01]
 [5.26952831e-01 5.38683528e-01 0.00000000e+00 4.28015564e-02
  4.86381323e-02 1.13099466e-01 1.29338046e-02 9.50093328e-02
  1.18855535e-02 3.48211414e-01 3.23909709e-01 4.08456852e-01
  3.07270745e-01 5.55026685e-01 5.02865409e-01 2.63890015e-01
  4.08482637e-01 3.00283673e-01 1.66404636e-01]
 [5.19295175e-01 1.54293893e-01 0.00000000e+00 2.79898219e-02
  3.81679389e-03 3.06504777e-02 5.67140727e-04 2.71016167e-02
  6.33649132e-04 8.19713467e-01 7.76455522e-01 8.90155591e-01
  7.85471669e-01 2.73684489e-01 6.63241867e-01 2.91365413e-01
  8.90155591e-01 2.12139465e-01 1.24881181e-01]
 [7.46585280e-01 4.33781646e-01 0.00000000e+00 1.89873418e-02
  1.58227848e-

In [6]:
#clusters = 7
kmeans = KMeans(n_clusters=7, random_state=0).fit(X_train)


labels = kmeans.predict(X_test)


centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

print(labels)
print(inertia)

print(centroids)

[1 2 4 ... 3 0 1]
153.92418681294316
[[7.45919929e-01 5.68783602e-01 0.00000000e+00 3.49462366e-02
  3.62903226e-02 8.93383430e-02 8.22571170e-03 8.87197188e-02
  7.16959776e-03 3.22933531e-01 3.02458798e-01 3.77458264e-01
  2.84406897e-01 5.89522426e-01 4.69468425e-01 2.82671219e-01
  3.77493892e-01 2.96437702e-01 1.70786010e-01]
 [2.92202183e-01 4.06208609e-01 0.00000000e+00 6.62251656e-02
  9.93377483e-03 7.44467237e-02 2.45303058e-03 5.73814588e-02
  3.62956904e-03 1.50639186e-01 1.41509458e-01 1.86605624e-01
  1.21114804e-01 7.22842842e-01 3.39773747e-01 3.56258646e-01
  1.86751923e-01 3.98536258e-01 1.99964087e-01]
 [5.19295175e-01 1.54293893e-01 0.00000000e+00 2.79898219e-02
  3.81679389e-03 3.06504777e-02 5.67140727e-04 2.71016167e-02
  6.33649132e-04 8.19713467e-01 7.76455522e-01 8.90155591e-01
  7.85471669e-01 2.73684489e-01 6.63241867e-01 2.91365413e-01
  8.90155591e-01 2.12139465e-01 1.24881181e-01]
 [7.53293808e-01 4.36197917e-01 0.00000000e+00 1.62037037e-02
  1.38888889e

In [7]:
#clusters = 9
kmeans = KMeans(n_clusters=9, random_state=0).fit(X_train)


labels = kmeans.predict(X_test)


centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

print(labels)
print(inertia)
print(centroids)

[6 1 3 ... 5 2 6]
134.3216394513502
[[2.51244607e-01 4.60591603e-01 0.00000000e+00 2.79898219e-02
  1.52671756e-02 3.84001724e-02 3.97963993e-03 3.07145446e-02
  3.60633189e-03 2.89655455e-02 2.04696622e-02 4.63411288e-02
  1.87001506e-02 7.66036356e-01 2.21894564e-01 5.02455009e-01
  4.73866681e-02 7.84916366e-01 1.83629613e-01]
 [5.19295175e-01 1.54293893e-01 0.00000000e+00 2.79898219e-02
  3.81679389e-03 3.06504777e-02 5.67140727e-04 2.71016167e-02
  6.33649132e-04 8.19713467e-01 7.76455522e-01 8.90155591e-01
  7.85471669e-01 2.73684489e-01 6.63241867e-01 2.91365413e-01
  8.90155591e-01 2.12139465e-01 1.24881181e-01]
 [7.55515016e-01 5.71901709e-01 0.00000000e+00 3.41880342e-02
  3.84615385e-02 8.83299829e-02 8.60465805e-03 9.03965844e-02
  7.32347788e-03 3.24988137e-01 3.04651917e-01 3.79466754e-01
  2.86373872e-01 5.89823593e-01 4.69733253e-01 2.81937732e-01
  3.79504513e-01 2.94381393e-01 1.71036551e-01]
 [7.93935461e-01 7.87189055e-01 0.00000000e+00 5.97014925e-02
  7.46268657e-

### Conclusion

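The inertia decreases as the number of clusters increases: 291.87 for 3 clusters, 185.92 for 5, 153.92 for 7, and 134.32 for 9. This is expected because K-means searches for the assignment with the minimum sum of squares, i.e. it minimizes the unnormalized within-cluster variance by assigning each point to its nearest cluster center. With more centers available, each point tends to lie closer to its nearest center, so the sum of squared distances shrinks.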
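The trend can be reproduced in a single loop instead of four separate cells. The following is a minimal sketch on synthetic data: the `make_blobs` sample and the names `X_demo` and `inertias` are assumptions for illustration, and in the notebook `X_train` would take the place of `X_demo`. Adding a center cannot increase the optimal K-means objective, although a single run of the heuristic is not strictly guaranteed to find that optimum, which is why several restarts (`n_init=10`) are used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled training data (assumption).
X_demo, _ = make_blobs(n_samples=500, n_features=5, centers=6, random_state=42)

inertias = {}
for k in (3, 5, 7, 9):
    # Fit K-means with k clusters and record the training inertia.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 2))
```

Because inertia keeps falling as clusters are added, it cannot be used on its own to pick the number of clusters; the usual heuristic is to look for the point where the decrease levels off.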

In [8]:
#Reference: https://stats.stackexchange.com/questions/48520/interpreting-result-of-k-means-clustering-in-r?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
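A caveat on the numbers reported in this problem: `inertia_` is measured on the data passed to `fit`, which here is the 40% training split, while `predict(X_test)` only assigns labels to the held-out points. One way to get the sum of squared errors on the held-out split is `KMeans.score`, which returns the negative of the K-means objective on the data it is given. The sketch below assumes synthetic data and illustrative names (`X_tr`, `X_te`, `train_sse`, `test_sse`).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split 40% train / 60% test.
X_demo, _ = make_blobs(n_samples=500, n_features=5, centers=4, random_state=1)
X_tr, X_te = train_test_split(X_demo, test_size=0.60, random_state=42)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_tr)

train_sse = km.inertia_      # sum of squared errors on the training split
test_sse = -km.score(X_te)   # score() is the negative objective, so negate it

# The splits differ in size, so compare per-point averages rather than totals.
print(train_sse / len(X_tr), test_sse / len(X_te))
```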