# This is a K-Means Classification Project

The aim of this project is to use the K-Means clustering algorithm to group Iris flowers into 3 clusters using just two features (sepal length and petal length). After clustering, each cluster is mapped to a class label (setosa, versicolor, or virginica) by majority vote on a small labeled validation split, and accuracy is reported on a held-out test set. Note that the cells below fit K-Means on all four features, which is why the cluster centers have shape 3 x 4.

Important: K-Means doesn't use labels to learn. We only use labels after clustering to evaluate and to map clusters -> class names.
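
To make the "no labels" point concrete, here is a minimal sketch of the standard K-Means loop (Lloyd's algorithm) in NumPy. It is not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for illustration. Notice that the function only ever receives the feature matrix `X`, never the targets.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise with k distinct samples picked at random (no labels involved).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: every sample goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        cluster_ids = dists.argmin(axis=1)
        # Update step: every centroid moves to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[cluster_ids == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Toy data (hypothetical): two well-separated groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4],
                  [5.0, 5.0], [5.2, 4.7], [4.6, 5.3]])
centroids_toy, ids_toy = kmeans_sketch(X_toy, k=2)
print(ids_toy)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initialisation, which is exactly why a cluster -> class map is needed before any accuracy can be reported.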

Import the necessary libraries:

In [149]:
import numpy as np
import pandas as pd
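import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule
import sklearn
import sklearn.datasets  # load the submodule explicitly so sklearn.datasets.load_iris() resolves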
from sklearn.model_selection import train_test_split

print('Imported necessary libraries')


Imported necessary libraries


Randomly split the dataset into three parts:
- training (60%)
- validation (20%)
- test (20%)



In [150]:
dataset = sklearn.datasets.load_iris()

X, y = dataset.data, dataset.target

# First split: 60% training, 40% held back for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.4,     # hold back 40% of the samples
    random_state=42,   # fixed seed, so the split is the same on every run
    stratify=y,        # keep the class proportions of y in both parts
)

#Second split
X_validation, X_test, y_validation, y_test = train_test_split(
    X_temp, y_temp, test_size = 0.5, random_state = 42, stratify = y_temp
)

print('Training size: ', len(X_train))
print('Validation size: ',len(X_validation))
print('Test size: ', len(X_test))

print('Validation labels: ', y_validation)



Training size:  90
Validation size:  30
Test size:  30
Validation labels:  [2 0 2 1 1 1 2 0 2 0 0 2 1 2 1 2 1 2 2 0 0 0 2 0 1 0 1 1 0 1]
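
The split sizes follow from the two fractions: 40% of 150 samples is 60, and half of that is 30 each for validation and test. As a quick sanity check on the stratification, the class counts can be tallied directly from the validation labels printed above. This is only a sketch that re-types the printed array; in the notebook itself, the same check could be run on `y_validation`.

```python
from collections import Counter

# Validation labels exactly as printed by the split cell above.
y_validation = [2, 0, 2, 1, 1, 1, 2, 0, 2, 0, 0, 2, 1, 2, 1,
                2, 1, 2, 2, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 1]

class_counts = Counter(y_validation)
print(sorted(class_counts.items()))  # [(0, 10), (1, 10), (2, 10)]
```

Each class contributes 10 of the 30 validation samples, which is what `stratify` is meant to guarantee for a balanced dataset like Iris.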


Choose K. For this example, K = 3 is the sensible choice because the dataset has three target classes.

In [151]:
K = 3

Run K-Means on the training set
- Fit K-Means with k = 3 and a fixed random_state for reproducibility

We should get:
- Cluster centers (shape 3 x 4: one row per cluster, one column per feature)
- A cluster id (0, 1, or 2) for every training sample, available as kmeans.labels_

In [152]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=K, random_state=42)

kmeans.fit(X_train)

centroids = kmeans.cluster_centers_

print("centroids: ", centroids)

centroids:  [[4.98666667 3.43666667 1.48666667 0.24      ]
 [6.96363636 3.16363636 5.84090909 2.09090909]
 [5.92894737 2.71315789 4.42894737 1.45      ]]
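
Once the centroids are known, assigning a sample to a cluster is a nearest-centroid lookup under Euclidean distance, which is what `kmeans.predict` does. The sketch below re-types the centroids printed above and uses two hypothetical measurements (in cm, in the Iris column order: sepal length, sepal width, petal length, petal width). The helper name `nearest_centroid` is an assumption for illustration.

```python
import math

# Centroids exactly as printed above (one row per cluster).
centroids = [
    [4.98666667, 3.43666667, 1.48666667, 0.24],
    [6.96363636, 3.16363636, 5.84090909, 2.09090909],
    [5.92894737, 2.71315789, 4.42894737, 1.45],
]

def nearest_centroid(sample, centroids):
    """Return the index of the centroid closest to `sample` (Euclidean distance)."""
    distances = [math.dist(sample, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical samples, chosen to sit near cluster 0 and cluster 2 respectively.
print(nearest_centroid([5.1, 3.5, 1.4, 0.2], centroids))  # 0
print(nearest_centroid([6.0, 2.8, 4.5, 1.5], centroids))  # 2
```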


Build the cluster -> class label map (using validation set)
- Predict cluster ids for validation samples.
- For each cluster id, find which true class appears most often among its validation members -> that is the mapped class for that cluster.
- Example: cluster 0 contains mostly setosa, so map cluster 0 -> setosa.

In [153]:
def calculate_ordered_match_percentage(arr1, arr2):
    """Return the percentage of positions at which arr1 and arr2 hold equal values."""
    if len(arr1) != len(arr2):
        print("Arrays must have the same length for ordered comparison.")
        return 0.0

    # Count the positions where the two arrays agree.
    matching_elements = 0
    for a, b in zip(arr1, arr2):
        if a == b:
            matching_elements += 1

    return (matching_elements / len(arr1)) * 100.0


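# Cluster ids for the validation samples. These ids are arbitrary, so the
# comparison below only counts positions where a cluster id happens to equal
# the class id; no cluster -> class map has been applied at this point.
labels = kmeans.predict(X_validation)
print('Validation accuracy: ', calculate_ordered_match_percentage(y_validation, labels))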

print('Validation array: ', y_validation)
print('Predicted with validation array: ', labels)

Validation accuracy:  43.333333333333336
Validation array:  [2 0 2 1 1 1 2 0 2 0 0 2 1 2 1 2 1 2 2 0 0 0 2 0 1 0 1 1 0 1]
Predicted with validation array:  [2 0 1 2 2 2 1 0 1 0 0 1 2 1 2 1 2 1 2 0 0 0 1 0 2 0 2 1 0 2]
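
The 43.3% figure above compares raw cluster ids with class ids. The majority-vote mapping described earlier can be sketched directly from the two printed arrays. The variable names below (`val_clusters`, `members`, `cluster_to_class`) are assumptions for illustration; class ids follow the Iris order 0 = setosa, 1 = versicolor, 2 = virginica.

```python
from collections import Counter, defaultdict

# Both arrays exactly as printed above.
y_validation = [2, 0, 2, 1, 1, 1, 2, 0, 2, 0, 0, 2, 1, 2, 1,
                2, 1, 2, 2, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 1]
val_clusters = [2, 0, 1, 2, 2, 2, 1, 0, 1, 0, 0, 1, 2, 1, 2,
                1, 2, 1, 2, 0, 0, 0, 1, 0, 2, 0, 2, 1, 0, 2]

# Collect the true classes of the validation samples in each cluster.
members = defaultdict(list)
for cluster_id, true_class in zip(val_clusters, y_validation):
    members[cluster_id].append(true_class)

# Majority vote: the most common true class becomes the cluster's label.
cluster_to_class = {
    cluster_id: Counter(classes).most_common(1)[0][0]
    for cluster_id, classes in members.items()
}
print(sorted(cluster_to_class.items()))  # [(0, 0), (1, 2), (2, 1)]

# Translate cluster ids into class ids, then compare with the true labels.
mapped = [cluster_to_class[c] for c in val_clusters]
correct = sum(m == t for m, t in zip(mapped, y_validation))
print(correct, "of", len(y_validation))  # 27 of 30
```

With the map applied, 27 of the 30 validation samples (90%) match their true class: cluster 0 is setosa, cluster 1 is virginica, and cluster 2 is versicolor. Because the map is built and scored on the same validation samples, this number is somewhat optimistic; the held-out test set gives the fairer estimate.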


Predict using the test set

In [154]:
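# Raw cluster ids for the test samples. As in the validation cell, they are
# compared with y_test without applying a cluster -> class map, so the
# percentage below measures id agreement, not classification accuracy.
final_prediction = kmeans.predict(X_test)

print('Test prediction accuracy: ', calculate_ordered_match_percentage(y_test, final_prediction))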

Test prediction accuracy:  46.666666666666664
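
As with the validation figure, the 46.7% above is measured on raw cluster ids. A minimal end-to-end sketch of the intended pipeline (fit on train, build the cluster -> class map on validation, score mapped predictions on test) might look like the following. It repeats the notebook's splits and seed. The names `X_val`, `y_val`, `cluster_to_class`, and `test_pred`, and the use of -1 for a cluster that receives no validation votes, are assumptions. No expected accuracy is quoted because it depends on the installed scikit-learn version.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same 60/20/20 stratified splits as in the notebook.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Fit on features only; the labels are not passed to K-Means.
kmeans = KMeans(n_clusters=3, random_state=42).fit(X_train)

# Majority vote on the validation split: cluster id -> most common true class.
val_clusters = kmeans.predict(X_val)
cluster_to_class = {}
for cluster_id in np.unique(val_clusters):
    votes = y_val[val_clusters == cluster_id].tolist()
    cluster_to_class[int(cluster_id)] = Counter(votes).most_common(1)[0][0]

# Map test cluster ids to class ids; -1 marks a cluster with no validation votes.
test_clusters = kmeans.predict(X_test)
test_pred = np.array([cluster_to_class.get(int(c), -1) for c in test_clusters])

test_accuracy = float(np.mean(test_pred == y_test)) * 100.0
print('Cluster -> class map: ', cluster_to_class)
print('Mapped test accuracy: ', test_accuracy)
```

To cluster on only sepal length and petal length, as the introduction describes, one option is to slice the feature matrix before splitting, for example `X = X[:, [0, 2]]`.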
