## **Lab 10 - Tasks**


Activity 1: Using the dataset given below, implement the K-Means clustering algorithm with K=2. Compute
the new centroid values as the mean of the coordinates of all the data instances in
the corresponding cluster. Also find the accuracy of the model (measured here by the
within-cluster sum of squares, WCSS) and compute it manually.
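
The assignment and centroid-update steps named in the task can be sketched by hand as a cross-check on the library call. This is a minimal sketch, not scikit-learn's implementation: the helper name `lloyd_kmeans` and the choice of the first two rows as starting centroids are assumptions made for illustration (scikit-learn defaults to k-means++ initialization).

```python
# Minimal sketch of Lloyd's algorithm on the Activity 1 data (pure Python).
# Assumption: the first k points serve as the initial centroids.

points = [(170, 56), (168, 60), (185, 72), (188, 77), (177, 76), (180, 71)]


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def lloyd_kmeans(points, k, max_iter=100):
    """Hypothetical helper: returns (labels, centroids) once assignments stop changing."""
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
print(centroids)
```

With this starting point the sketch settles on the centroids (169, 58) and (182.5, 74). Cluster numbering is arbitrary, so a library run may report the same two groups with the labels 0 and 1 swapped.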


In [14]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# convert the given data table into a pandas dataframe
data = {
    'A': [170, 168, 185, 188, 177, 180],
    'B': [56, 60, 72, 77, 76, 71]
}
df = pd.DataFrame(data)
print("Dataset:")
print(df)

# Implementing K-Means clustering
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(df)
# Getting the cluster assignments
labels = kmeans.labels_
print("\nCluster assignments:")
print(labels)

# Getting the centroid values
centroids = kmeans.cluster_centers_
print("\nCentroid values:")
print(centroids)

# Assigning data points to clusters
df['Cluster'] = labels
cluster_0 = df[df['Cluster'] == 0]
cluster_1 = df[df['Cluster'] == 1]

# Computing new centroid values
new_centroid_0 = cluster_0[['A', 'B']].mean().values
new_centroid_1 = cluster_1[['A', 'B']].mean().values
print("\nNew Centroid values computed manually:")
print(f"Centroid 0: {new_centroid_0}")
print(f"Centroid 1: {new_centroid_1}")

# Calculating Within-Cluster Sum of Squares (WCSS)
wcss = kmeans.inertia_
print("\nWithin-Cluster Sum of Squares (WCSS):")
print(wcss)

Dataset:
     A   B
0  170  56
1  168  60
2  185  72
3  188  77
4  177  76
5  180  71

Cluster assignments:
[1 1 0 0 0 0]

Centroid values:
[[182.5  74. ]
 [169.   58. ]]

New Centroid values computed manually:
Centroid 0: [182.5  74. ]
Centroid 1: [169.  58.]

Within-Cluster Sum of Squares (WCSS):
109.0
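
The task asks for the measure to be computed manually, while the cell above reads it from `kmeans.inertia_`. The following is a small sketch of the manual calculation, assuming the cluster labels printed above; the variable names are illustrative.

```python
import numpy as np

# Activity 1 data and the cluster labels printed above.
X = np.array([[170, 56], [168, 60], [185, 72],
              [188, 77], [177, 76], [180, 71]], dtype=float)
labels = np.array([1, 1, 0, 0, 0, 0])

wcss_manual = 0.0
for cluster_id in np.unique(labels):
    members = X[labels == cluster_id]
    centroid = members.mean(axis=0)  # mean of the member coordinates
    # Squared Euclidean distance of each member to its centroid.
    sq_dist = ((members - centroid) ** 2).sum(axis=1)
    print(f"Cluster {cluster_id}: centroid={centroid}, squared distances={sq_dist}")
    wcss_manual += sq_dist.sum()

print(f"Manual WCSS: {wcss_manual}")
```

Cluster 0 contributes 10.25 + 39.25 + 34.25 + 15.25 = 99 and cluster 1 contributes 5 + 5 = 10, so the total is 109, which agrees with the value reported by `inertia_`.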


Activity 2: A dataset (income.csv) has been provided. Implement the K-Means clustering algorithm on this
dataset with K=3 clusters. Compute the new centroid values as the mean of the coordinates
of all the data instances in the corresponding cluster. Also find the accuracy
of the model (measured here by inertia and the silhouette score).
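
K-Means is unsupervised, so there are no true labels to score against, and the silhouette score is one common substitute for accuracy. For each point, let a be the mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster; the point's score is (b - a) / max(a, b), and the overall score is the mean over all points. The sketch below works this out on a toy example. The helper name `silhouette_by_hand` and the toy data are assumptions for illustration, not part of the lab or of scikit-learn.

```python
import numpy as np


def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the zero self-distance is in the sum, hence the n_same - 1).
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Toy 1-D example: two tight, well-separated clusters.
X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
score_toy = silhouette_by_hand(X_toy, labels_toy)
print(score_toy)
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.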


In [15]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# load dataset
df = pd.read_csv('./data/income.csv')
print("Dataset Head:")
display(df.head())

print("\nDataset Description:")
display(df.describe())

print("\nDataset Info:")
print(df.info())

# Select features
X = df[['Age', 'Income']]

# Set the number of clusters
k = 3

# Create and fit the KMeans model
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(X)

# Get the cluster centroids
centroids = kmeans.cluster_centers_

# Assign clusters to each data point
df['Cluster'] = kmeans.labels_

print("\nCentroids:")
print(centroids)

print("\nData with Clusters:")
print(df)

# Compute inertia
inertia = kmeans.inertia_
print(f"\n\nInertia: {inertia}")

# Compute silhouette score
silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f"\nSilhouette Score: {silhouette_avg}")

Dataset Head:


      Name  Age  Income
0      Rob   27   70000
1  Michael   29   90000
2    Mohan   29   61000
3   Ismail   28   60000
4     Kory   42  150000



Dataset Description:


             Age        Income
count       22.0          22.0
mean   34.818182  90431.818182
std      5.90106  43505.964412
min         26.0       45000.0
25%         29.0       58500.0
50%         36.5       67500.0
75%        39.75      135250.0
max         43.0      162000.0



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    22 non-null     object
 1   Age     22 non-null     int64 
 2   Income  22 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 660.0+ bytes
None

Centroids:
[[3.29090909e+01 5.61363636e+04]
 [3.82857143e+01 1.50000000e+05]
 [3.40000000e+01 8.05000000e+04]]

Data with Clusters:
        Name  Age  Income  Cluster
0        Rob   27   70000        2
1    Michael   29   90000        2
2      Mohan   29   61000        0
3     Ismail   28   60000        0
4       Kory   42  150000        1
5     Gautam   39  155000        1
6      David   41  160000        1
7     Andrea   38  162000        1
8       Brad   36  156000        1
9   Angelina   35  130000        1
10    Donald   37  137000        1
11       Tom   26   45000        0
12    Arnold   27   48000        0
13     Jared   28   
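
The centroids printed above differ mainly in Income (about 56,000, 80,500 and 150,000), while their Age values are close together. This is what one would expect when the two features are on very different scales, since Euclidean distance is then dominated by the larger one. The sketch below illustrates the effect with min-max scaling on the first eight rows of the output above. Scaling is a possible refinement rather than something the lab asks for, and the helper name `income_share` is hypothetical.

```python
import numpy as np

# (Age, Income) for the first eight rows of the "Data with Clusters" output.
X = np.array([
    [27, 70000], [29, 90000], [29, 61000], [28, 60000],
    [42, 150000], [39, 155000], [41, 160000], [38, 162000],
], dtype=float)

# Min-max scaling: map each column to the range [0, 1].
# Note: the min and max here come from this eight-row subset only.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)


def income_share(p, q):
    """Fraction of the squared distance between p and q that comes from Income."""
    sq = (p - q) ** 2
    return sq[1] / sq.sum()


# Compare Rob (row 0) with Kory (row 4) before and after scaling.
raw_share = income_share(X[0], X[4])
scaled_share = income_share(X_scaled[0], X_scaled[4])
print(f"Income share of squared distance, raw:    {raw_share:.6f}")
print(f"Income share of squared distance, scaled: {scaled_share:.6f}")
```

On the raw values almost all of the squared distance comes from Income, so Age has practically no influence on the clustering. After scaling, both features contribute on a comparable footing.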

Activity 3: A dataset (ratings_small.csv) has been provided. Implement the K-Means clustering algorithm on
this dataset with K=5 clusters. Compute the new centroid values as the mean
of the coordinates of all the data instances in the corresponding cluster. Also find the
accuracy of the model (measured here by inertia and the silhouette score).


In [16]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# load dataset
df = pd.read_csv('./data/ratings_small.csv')
print("Dataset Head:")
display(df.head())

print("\nDataset Info:")
print(df.info())

# Select features
X = df[['movieId', 'rating']]

# Set the number of clusters
k = 5

# Create and fit the KMeans model
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(X)

# Get the cluster centroids
centroids = kmeans.cluster_centers_

# Assign clusters to each data point
df['Cluster'] = kmeans.labels_

print("\nCentroids:")
print(centroids)

print("\nData with Clusters:")
print(df)

# Compute inertia
inertia = kmeans.inertia_
print(f"\n\nInertia: {inertia}")

# Compute silhouette score
silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f"\nSilhouette Score: {silhouette_avg}")

Dataset Head:


   userId  movieId  rating
0       1       31     2.5
1       1     1029     3.0
2       1     1061     3.0
3       1     1129     2.0
4       1     1172     4.0



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100004 non-null  int64  
 1   movieId  100004 non-null  int64  
 2   rating   100004 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
None

Centroids:
[[2.46468362e+03 3.55349367e+00]
 [9.14045139e+04 3.45715419e+00]
 [3.80936262e+04 3.50282250e+00]
 [6.20758125e+04 3.52448042e+00]
 [1.22352480e+05 3.41396582e+00]]

Data with Clusters:
        userId  movieId  rating  Cluster
0            1       31     2.5        0
1            1     1029     3.0        0
2            1     1061     3.0        0
3            1     1129     2.0        0
4            1     1172     4.0        0
...        ...      ...     ...      ...
99999      671     6268     2.5        0
100000     671     6269     4.0        0
100001     671     6365     4.0        0
100002     671  
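
With 100,004 rows (per the Dataset Info output), the exact silhouette score compares every pair of points and can take a long time. `silhouette_score` accepts `sample_size` and `random_state` arguments, which estimate the score from a random subset instead. The sketch below shows the call on synthetic data, assuming scikit-learn is installed as in the rest of the lab. The synthetic blobs and the sample size of 500 are illustrative assumptions, not the ratings data or a recommended setting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a large feature matrix (NOT the ratings data):
# two Gaussian blobs centred at (0, 0) and (10, 10).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(2000, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(2000, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Estimate the silhouette score from 500 randomly chosen points.
score = silhouette_score(X, kmeans.labels_, sample_size=500, random_state=0)
print(f"Approximate silhouette score: {score:.3f}")
```

The sampled value is an estimate, so it varies slightly with `random_state`; a larger `sample_size` trades run time for a tighter estimate.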