STUDENT: Joel S. Mollel

NUMBER: C00313599

ALGORITHM: K-Means

We are provided with code that performs clustering on the digits dataset using the K-Means algorithm to group the data into 10 clusters, corresponding to the 10 digit classes.

It visualizes the cluster centers as images representing the average appearance of each cluster. 

Finally, it evaluates the clustering accuracy by comparing the predicted labels with the actual digit labels from the dataset.
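
The need for the label-matching step can be shown on a toy example. K-Means numbers its clusters arbitrarily, so cluster 0 need not contain the digit 0. A common workaround is to relabel each cluster with the most frequent true label among its members before computing accuracy. The sketch below uses plain Python; the function name `majority_vote_labels` and the toy data are hypothetical and not part of the provided code.

```python
from collections import Counter

def majority_vote_labels(cluster_ids, true_labels):
    """Relabel each cluster with the most common true label among its members."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    # Pick the most frequent true label in each cluster
    cluster_to_label = {
        cid: Counter(lbls).most_common(1)[0][0] for cid, lbls in members.items()
    }
    return [cluster_to_label[cid] for cid in cluster_ids]

# Made-up toy data: the cluster ids carry no meaning on their own
cluster_ids = [2, 2, 2, 0, 0, 0, 1, 1]
true_labels = [0, 0, 1, 1, 1, 1, 2, 2]

predicted = majority_vote_labels(cluster_ids, true_labels)
accuracy = sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
print(predicted)
print(accuracy)
```

One sample with true label 1 falls into the cluster dominated by label 0, so it is counted as an error after relabeling.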

Given the K-Means code, we are required to:

i) Make sure it runs

ii) Change some hyperparameters and see the impact

iii) Use another dataset and perform other operations, and simulate as an app

i) The code ran without errors once all the required libraries were installed.

Accuracy obtained: 0.7440

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Load the digits dataset: 1797 samples, each an 8x8 image flattened to 64 features
digits = load_digits()
print(digits.data.shape)

# Group the samples into 10 clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)

# Display each cluster center as an 8x8 image
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

# Assign each cluster the most common true digit among its members
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Calculate the accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)


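ii) Change some hyperparameters and see the impact

First, I changed the number of clusters from 10 to 8. Accuracy dropped from 0.7440 to 0.6466. A drop is expected: each cluster is assigned a single majority digit, so 8 clusters can cover at most 8 of the 10 digit classes, and every sample of the uncovered digits is counted as an error.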

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

kmeans = KMeans(n_clusters=8, random_state=0)  # Adjusted number of clusters to 8
clusters = kmeans.fit_predict(digits.data)

fig, ax = plt.subplots(2, 4, figsize=(8, 3))  # Adjusted for 8 clusters
centers = kmeans.cluster_centers_.reshape(8, 8, 8)  # Adjusted for 8 clusters
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(8):  # Adjusted for 8 clusters
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]


from sklearn.metrics import accuracy_score # Calculate the accuracy score
accuracy_score(digits.target, labels)
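
To compare several cluster counts in one run instead of editing the cell by hand, the experiment can be wrapped in a function. This is a minimal sketch, not the provided code: the helper name `clustering_accuracy` and the choice of k values are my own, `n_init=10` is set explicitly as an assumption so that the result does not depend on the library default, and `np.bincount` stands in for `scipy.stats.mode` as an equivalent majority vote. The accuracies it prints may differ from the figures reported above depending on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

digits = load_digits()

def clustering_accuracy(n_clusters, random_state=0):
    """Fit KMeans, relabel clusters by majority vote, and return the accuracy."""
    # n_init=10 is an explicit choice for this sketch (assumption, not from the original code)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    cluster_ids = km.fit_predict(digits.data)
    voted = np.zeros_like(cluster_ids)
    for i in range(n_clusters):
        mask = cluster_ids == i
        if mask.any():
            # Most frequent true digit among the members of cluster i
            voted[mask] = np.bincount(digits.target[mask]).argmax()
    return accuracy_score(digits.target, voted)

# Sweep over a few cluster counts (hypothetical selection)
results = {k: clustering_accuracy(k) for k in (8, 10, 12)}
for k, acc in results.items():
    print(f"n_clusters={k}: accuracy={acc:.4f}")
```

With fewer clusters than classes, some digits can never be predicted, which caps the achievable accuracy; with more clusters than classes, several clusters may share the same digit label.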


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits()

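# Define the KMeans model with the tolerance set explicitly (tol=1e-4).
# Note: 1e-4 is also scikit-learn's default tol, so this run is expected to match
# the first run; a different value (e.g. 1e-2) is needed to see any impact.
kmeans = KMeans(n_clusters=10, tol=1e-4, random_state=0)  # tol set explicitly to 1e-4 (the default)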
clusters = kmeans.fit_predict(digits.data)

# Reshape and display the cluster centers
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

# Assign labels to clusters based on the majority class in each cluster
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):  # One pass per cluster
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Calculate the accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
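
Whether a tolerance value actually differs from the default can be checked directly, and its effect is easiest to see in the number of iterations and the final inertia. The sketch below is an illustration under my own assumptions: the alternative value `1e-1` is chosen arbitrarily for contrast, `n_init=10` is set explicitly, and no particular output is claimed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Read the default tolerance from an unconfigured estimator
default_tol = KMeans().get_params()["tol"]
print("Default tol:", default_tol)

# Compare the default tolerance with a much looser one (1e-1 is an arbitrary choice)
runs = {}
for tol in (1e-4, 1e-1):
    km = KMeans(n_clusters=10, tol=tol, n_init=10, random_state=0).fit(digits.data)
    runs[tol] = (km.n_iter_, km.inertia_)
    print(f"tol={tol:g}: iterations={km.n_iter_}, inertia={km.inertia_:.1f}")
```

A looser tolerance lets the algorithm declare convergence earlier, so it is a trade-off between run time and how closely the centers settle.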


iii) Use another dataset and perform other operations, and simulate as an app

I will use the Iris dataset.

The Iris dataset consists of 150 samples from three iris species: Setosa, Versicolor, and Virginica. Each sample has four features (sepal length, sepal width, petal length, and petal width), all measured in centimeters. The label values are therefore Setosa, Versicolor, and Virginica.

(a) Let us start with training the model and testing its accuracy

Step 1: Load the libraries and dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from scipy.stats import mode

iris = load_iris()
iris_data = iris.data
iris_target = iris.target

Step 2: Print the first 5 rows of data to see the structure

The label values and their meaning:

0 = Setosa

1 = Versicolor

2 = Virginica

In [None]:
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['label'] = iris.target
print(iris_df.head())

Step 3: Perform clustering

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(iris_data)
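
Part (a) also calls for testing accuracy, which can be done with the same majority-vote relabeling used for the digits. The following is a self-contained sketch rather than the original notebook code: the variable names `km`, `cluster_ids`, and `voted_labels` are my own (chosen so they do not overwrite the notebook's `kmeans` and `clusters`), `n_init=10` is set explicitly as an assumption, and `np.bincount` stands in for `scipy.stats.mode`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()

# Cluster the 150 samples into 3 groups (n_init=10 is an explicit choice for this sketch)
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(iris.data)

# Relabel each cluster with the majority species among its members
voted_labels = np.zeros_like(cluster_ids)
for i in range(3):
    mask = cluster_ids == i
    voted_labels[mask] = np.bincount(iris.target[mask]).argmax()

iris_accuracy = accuracy_score(iris.target, voted_labels)
print(f"Iris clustering accuracy: {iris_accuracy:.4f}")
```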

Step 4: Visualize how the data points are distributed across the clusters

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the four features onto two principal components for plotting
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(iris_data)

plt.figure(figsize=(8, 6))

# KMeans cluster indices are arbitrary, so cluster 0 is not necessarily Setosa.
# Name each cluster after the majority species among its members instead.
for i in range(3):
    mask = clusters == i
    majority = np.bincount(iris_target[mask]).argmax()
    name = iris.target_names[majority].capitalize()
    plt.scatter(reduced_data[mask, 0], reduced_data[mask, 1], label=f'Cluster {i} (mostly {name})')

plt.title('KMeans Clustering on Iris Dataset (PCA Reduced)')  # title and axis labels

plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')

# Add a legend to map clusters to species names
plt.legend(title="Iris Species")

# Display the plot
plt.show()


(b) Let us prompt for user input so that a user can use our system to assign a flower to its appropriate cluster

The steps are the same as above, with additional code to prompt for user input and process the entered values.

In [None]:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from scipy.stats import mode

# Loading the Iris dataset, just as above
iris = load_iris()
iris_data = iris.data
iris_target = iris.target

# Train KMeans clustering, just as above
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(iris_data)

# Map each cluster to the majority species among its members
labels = np.zeros_like(clusters)
for i in range(3):
    mask = (clusters == i)
    labels[mask] = mode(iris_target[mask], axis=None)[0]  # Safely access the mode

# Prompting the user to enter the flower features 
print("Enter the values for the following features of the flower you want to identify (in cm):")
sepal_length = float(input("Sepal length (cm): "))
sepal_width = float(input("Sepal width (cm): "))
petal_length = float(input("Petal length (cm): "))
petal_width = float(input("Petal width (cm): "))

# Create a new data point based on user input
new_flower = np.array([[sepal_length, sepal_width, petal_length, petal_width]])

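# The actual prediction occurs here: find the nearest cluster center
predicted_cluster = kmeans.predict(new_flower)[0]

# Translate the cluster index into a species name via the majority species in that cluster
predicted_flower_name = iris.target_names[mode(iris_target[clusters == predicted_cluster], axis=None)[0]]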

# Display the result
print(f"The predicted flower type is: {predicted_flower_name.capitalize()}")
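
For the "app" part, the prediction logic can also be separated from the `input()` prompts so that it can be reused and checked without typing values by hand. The sketch below is one possible arrangement under my own assumptions: the function name `predict_species`, the `cluster_to_species` lookup, and the validation rules are hypothetical, `n_init=10` is set explicitly, and `np.bincount` is used for the majority vote.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Majority species index for each cluster, computed once after training
cluster_to_species = {
    c: int(np.bincount(iris.target[model.labels_ == c]).argmax()) for c in range(3)
}

def predict_species(measurements):
    """Return the species name for [sepal length, sepal width, petal length, petal width] in cm."""
    values = np.asarray(measurements, dtype=float)
    # Reject inputs that cannot be valid flower measurements
    if values.shape != (4,) or not np.all(np.isfinite(values)) or np.any(values <= 0):
        raise ValueError("Expected four positive, finite measurements in cm")
    cluster = int(model.predict(values.reshape(1, -1))[0])
    return str(iris.target_names[cluster_to_species[cluster]])

# Example call using the measurements of the first sample in the dataset
print(predict_species([5.1, 3.5, 1.4, 0.2]))
```

The interactive cell could then pass the four `float(input(...))` values to this function, keeping the prompt code and the model code independent.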
