# Subproject 3 – Clustering for Iris datasets
Fasegun Babatunde Oyeniyi (a91647)

Machine Learning – M.Sc. in Electrical and Computer Engineering, ISE,
University of Algarve, Faro, Portugal

## Introduction

In this subproject, clustering algorithms are used on the Iris dataset to group samples based on feature values rather than class labels during training. The core approach is K-Means clustering, utilizing feature scaling due to its distance-based nature. The Elbow Method is used to establish the appropriate number of clusters, while Principal Component Analysis (PCA) is employed for visualization. Finally, the clustering results are evaluated by calculating the Silhouette Score from the clusters.

Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

Loading the Iris Dataset

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

## Exploratory Data Analysis (EDA)

Before preprocessing, I examined the Iris dataset's basic structure and sample rows. I then used summary statistics and basic distribution plots to determine how each feature was distributed.

I extracted the features from the dataset.

In [None]:
X = iris.data

I used the shape attribute to check the number of samples and features.

In [None]:
#  shape of the data

X.shape

I created a pandas DataFrame with the feature names as columns and displayed the first 5 rows using .head().

In [None]:
feature_names = list(iris.feature_names)

df_features = pd.DataFrame(X, columns=iris.feature_names)
df_features.head()
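The summary statistics mentioned above could be produced along the following lines. This is only a minimal sketch: it rebuilds the DataFrame so that it runs on its own, and the names `summary` and `missing_per_feature` are introduced here purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Count, mean, standard deviation, min/max and quartiles for each feature
summary = df_features.describe()
print(summary.round(2))

# Number of missing values per feature, checked before any preprocessing
missing_per_feature = df_features.isna().sum()
print(missing_per_feature)
```

Comparing the `std`, `min` and `max` rows across columns gives a quick numerical view of how differently the features are spread.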

Iris data features visualization 

In [None]:
plt.figure(figsize=(10, 4))
plt.boxplot([df_features[col].values for col in feature_names], tick_labels=feature_names)
plt.title("Feature Distributions")
plt.xlabel("Features")
plt.ylabel("Value")
plt.show()

Petal features vary far more than sepal features, and sepal width has a few outliers. Since K-Means is distance-based, the features with bigger ranges would end up driving the clustering, so I standardize the features first to put them on the same scale.

# Data preprocessing

Before running K-Means, I standardized the Iris features because the variables have different ranges and K-Means relies on distance. Without scaling, the larger-variation features (especially the petal measurements) would dominate the clustering. Therefore, I used StandardScaler to transform each feature to zero mean and unit variance.

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

features_scaled = scaler.fit_transform(X)

features_scaled = pd.DataFrame(features_scaled, columns=iris.feature_names)
features_scaled
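As a sanity check, the effect of the scaling can be verified directly. The sketch below assumes only the default `StandardScaler` settings and compares its output with the z-score formula computed by hand; the variable names `X_scaled` and `manual` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# z = (x - mean) / std per column, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))   # do both computations agree?
print(X_scaled.mean(axis=0).round(6))  # column means after scaling
print(X_scaled.std(axis=0).round(6))   # column standard deviations after scaling
```

After scaling, every feature contributes on a comparable scale to the Euclidean distances that K-Means minimizes.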


# Model training

## K-Means: Elbow Method

Rather than choosing an arbitrary value for the n_clusters parameter of KMeans(), I used the Elbow Method to estimate a suitable number of clusters for the Iris dataset before clustering it with K-Means.

In [None]:
from sklearn.cluster import KMeans

inertias = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10
    )
    kmeans.fit(features_scaled)
    inertias.append(kmeans.inertia_)

k_values = np.array(list(k_range))
inertias_np = np.array(inertias, dtype=float)

# Treat each (k, inertia) pair as a point on the elbow curve
points = np.column_stack((k_values, inertias_np))

# Unit vector along the line joining the first and last points
p1, p2 = points[0], points[-1]
line_vec = p2 - p1
line_vec = line_vec / np.linalg.norm(line_vec)

# Perpendicular distance of every point from that line
vec_from_p1 = points - p1
proj = (vec_from_p1 @ line_vec)[:, None] * line_vec
perp = vec_from_p1 - proj
distances = np.linalg.norm(perp, axis=1)

k_value = int(k_values[np.argmax(distances)])
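The rule used in the cell above (pick the k whose point lies farthest from the straight line joining the first and last points of the inertia curve) can also be packaged as a small helper. The sketch below is one possible formulation using the 2D cross product; `find_elbow` is a hypothetical name, and the inertia values are made up for illustration rather than taken from the Iris run.

```python
import numpy as np

def find_elbow(k_values, inertias):
    """Return the k whose (k, inertia) point is farthest from the chord
    joining the first and last points of the curve."""
    points = np.column_stack((np.asarray(k_values, dtype=float),
                              np.asarray(inertias, dtype=float)))
    first, last = points[0], points[-1]
    chord = last - first
    offsets = points - first
    # |2D cross product| / chord length = perpendicular distance to the chord
    cross = chord[0] * offsets[:, 1] - chord[1] * offsets[:, 0]
    distances = np.abs(cross) / np.linalg.norm(chord)
    return int(k_values[int(np.argmax(distances))])

# Made-up inertia values for k = 1..10, for illustration only
example_k = list(range(1, 11))
example_inertias = [100, 40, 10, 8, 6, 5, 4, 3, 2, 1]
elbow_k = find_elbow(example_k, example_inertias)
print(elbow_k)
```

For these made-up values the helper selects k = 3, the point after which the curve flattens out. Plotting inertia against k alongside the selected value is a useful visual check of the result.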


## Calculating the Clusters using k from above

I set n_clusters to the number of clusters obtained from the Elbow Method. With n_init=10, K-Means is run 10 times with different centroid initializations, and the run with the lowest inertia is kept.

In [None]:
kmeans = KMeans(
    n_clusters=k_value,
    random_state=42,
    n_init=10
)

derived_clusters = kmeans.fit_predict(features_scaled)

# Store the cluster assigned to each sample in a DataFrame for display
derived = pd.DataFrame({"derived": derived_clusters})
derived


# PCA (Principal Component Analysis)

PCA (Principal Component Analysis) is a technique used in machine learning to reduce the dimensionality of the data, projecting a large number of features onto a smaller number of components while keeping as much of the information (variance) as possible. The first principal component captures the most variance, and each subsequent component captures less.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)

features_pca = pca.fit_transform(features_scaled)

features_pca = pd.DataFrame(
    features_pca,
    columns=pca.get_feature_names_out()
)

features_pca.head()
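How much information the two components keep can be checked through the `explained_variance_ratio_` attribute of the fitted PCA object. The sketch below refits PCA on the standardized features so that it runs on its own; the variable name `ratios` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2, random_state=42)
pca.fit(X_scaled)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print(ratios.round(4))
print(f"Total variance kept by 2 components: {ratios.sum():.2%}")
```

A high total suggests that the 2D scatter plots are a faithful summary of the 4D data, although the clustering itself is still computed in the full scaled feature space.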

## Visualization of the clusters in PCA space


In [None]:
plt.figure(figsize=(10, 6))

for cluster in np.unique(derived_clusters):
    plt.scatter(
        features_pca.loc[derived_clusters == cluster, "pca0"],
        features_pca.loc[derived_clusters == cluster, "pca1"],
        label=f"Cluster {cluster}"
    )

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA K-Means Clusters")
plt.legend()
plt.show()


## Visualization: True Species

In [None]:
y = iris.target
target_names = iris.target_names

plt.figure(figsize=(10, 6))

for label in np.unique(y):
    plt.scatter(
        features_pca.loc[y == label, "pca0"],
        features_pca.loc[y == label, "pca1"],
        label=target_names[label]
    )

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Iris Data (True Species)")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.3)
plt.tight_layout()
plt.show()


## Clusters Evaluation using Silhouette Score

In [None]:
from sklearn.metrics import silhouette_samples

sample_scores = silhouette_samples(features_scaled, derived_clusters)

cluster_scores = (
    pd.DataFrame({"clusters": derived_clusters, "sil": sample_scores})
      .groupby("clusters")["sil"]
      .mean()
)

print(cluster_scores)
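Besides the per-cluster averages, a single overall value is often reported. `silhouette_score` returns the mean silhouette coefficient over all samples, which the sketch below checks against the per-sample values. Note that `n_clusters=3` is an assumption made only to keep the example self-contained; in the notebook, `k_value` from the Elbow Method would be used instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# n_clusters=3 is assumed here for illustration only
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

overall = silhouette_score(X_scaled, labels)
per_sample = silhouette_samples(X_scaled, labels)

print(f"Overall silhouette score: {overall:.3f}")
# The overall score is the mean of the per-sample coefficients
print(np.isclose(overall, per_sample.mean()))
```

Silhouette values range from -1 to 1: values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.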


Comparison with Ground Truth

In [None]:
comparison = pd.crosstab(
    derived_clusters,
    y,
    rownames=["Cluster"],
    colnames=["True Species"]
)

display(comparison)
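The contingency table can be complemented with a single agreement figure. One option, not used in the original analysis, is the Adjusted Rand Index (ARI), which compares two labelings without depending on how the clusters happen to be numbered. In the sketch below, `n_clusters=3` is again an assumption made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# n_clusters=3 is assumed here for illustration only
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# ARI is 1.0 for identical partitions and close to 0.0 for random labelings
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand Index: {ari:.3f}")

# Renumbering the clusters does not change the score
relabeled = (labels + 1) % 3
print(f"ARI after renumbering: {adjusted_rand_score(iris.target, relabeled):.3f}")
```

Because the species labels are never used during training, this comparison serves only as an external check on the clustering.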
