# K-Means from Scratch: Module Demo & Testing


## 1. Introduction
K-Means is a widely used unsupervised learning algorithm for clustering. It partitions a dataset into k distinct clusters by minimizing the variance within each cluster. The algorithm works by iteratively assigning points to the nearest cluster centroid and then updating the centroids based on the mean of the assigned points.
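
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is only an illustration of the general technique (often called Lloyd's algorithm). The function name `lloyd_kmeans`, its parameters, and the toy data are assumptions made for this example and are not taken from the custom `KMeans` module used later in the notebook.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Illustrative Lloyd iteration; not the notebook's KMeans class."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two tight, well-separated groups of points.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = lloyd_kmeans(X_toy, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary: which group ends up as cluster 0 depends on the initial centroids, so only the grouping is meaningful.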

K-Means is simple, efficient, and scalable, which makes it practical for large datasets. It works best when the clusters are well separated and roughly spherical. However, the result depends on the initial centroids, and the algorithm implicitly assumes clusters of similar size and density.
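
One way to see the effect of initialization is to run K-Means several times with a single random start each and compare the final within-cluster sum of squares. The sketch below uses scikit-learn's `KMeans` as a stand-in, since the notebook does not show how the custom class is seeded. The number of seeds is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans as SklearnKMeans
from sklearn.preprocessing import StandardScaler

# Same data preparation as in this notebook.
X_scaled = StandardScaler().fit_transform(datasets.load_breast_cancer().data)

inertias = []
for seed in range(5):
    # init="random" with n_init=1 gives exactly one random start per run.
    km = SklearnKMeans(n_clusters=2, init="random", n_init=1, random_state=seed)
    km.fit(X_scaled)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
```

If the printed values differ, the runs ended in different local optima. If they agree, the starts tried here all led to the same solution on this dataset.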

In this notebook, we demonstrate and evaluate a pre-implemented custom K-Means clustering algorithm using the Breast Cancer Wisconsin dataset. The goal is to test how effectively the algorithm can uncover structure in the data without relying on class labels. We assess the clustering performance using internal evaluation metrics such as Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index.
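
To make the three metrics easier to read, the sketch below scores a sensible labelling and a random labelling of the same synthetic data. The toy data from `make_blobs` and the variable names are assumptions for illustration only. A higher Silhouette Score and Calinski-Harabasz Index indicate better-separated clusters, while a lower Davies-Bouldin Index is better.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Two compact blobs far apart; good_labels records which blob each point came from.
X_toy, good_labels = make_blobs(
    n_samples=200, centers=[[0, 0], [10, 10]], cluster_std=0.5, random_state=0
)
# A labelling that ignores the structure entirely.
random_labels = np.random.default_rng(0).integers(0, 2, size=len(X_toy))

scores = {}
for name, labels in [("good", good_labels), ("random", random_labels)]:
    scores[name] = {
        "silhouette": silhouette_score(X_toy, labels),
        "calinski_harabasz": calinski_harabasz_score(X_toy, labels),
        "davies_bouldin": davies_bouldin_score(X_toy, labels),
    }
    print(name, scores[name])
```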

## 2. Import Libraries

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

In [None]:
# Connect to Google Drive to access the custom K-Means model
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys
sys.path.append('/content/drive/MyDrive/scratch/')
from models.k_means import KMeans

## 3. Load Dataset

In [None]:
# Load the Breast Cancer Wisconsin dataset
data = datasets.load_breast_cancer()

X = data.data
y_true = data.target  # class labels, kept for reference; not used for clustering

### Scale the features

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
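
Standard K-Means measures closeness with Euclidean distance, so features with large numeric ranges would otherwise dominate the assignments. The custom implementation is assumed to behave the same way. The check below is a small sanity test that the scaled features have approximately zero mean and unit standard deviation.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Spread of the raw features: the per-feature standard deviations differ widely.
raw_std = X.std(axis=0)
print("raw std: min =", raw_std.min(), " max =", raw_std.max())

# After scaling, every feature should have mean ~0 and std ~1.
means_ok = np.allclose(X_scaled.mean(axis=0), 0.0, atol=1e-8)
stds_ok = np.allclose(X_scaled.std(axis=0), 1.0)
print("zero mean:", means_ok, " unit std:", stds_ok)
```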

## 4. Train the model

In [None]:
model = KMeans(k=2)
model.fit(X_scaled)

## 5. Predict

In [None]:
y_pred = model.predict(X_scaled)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,

## 6. Evaluate

In [None]:
# Internal clustering metrics (computed without ground-truth labels)
print("Silhouette Score:", silhouette_score(X_scaled, y_pred))
print("Calinski-Harabasz Score:", calinski_harabasz_score(X_scaled, y_pred))
print("Davies-Bouldin Index:", davies_bouldin_score(X_scaled, y_pred))

Silhouette Score: 0.3449740051034408
Calinski-Harabasz Score: 267.6964044280202
Davies-Bouldin Index: 1.312320210357444
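
As a sanity check on the scores above, the same metrics can be computed for a reference implementation. The sketch below uses scikit-learn's `KMeans` on the same scaled data. The settings `n_init=10` and `random_state=0` are arbitrary choices for this example, not values taken from the custom module.

```python
from sklearn import datasets
from sklearn.cluster import KMeans as SklearnKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Same data preparation as in the notebook.
X_scaled = StandardScaler().fit_transform(datasets.load_breast_cancer().data)

# Reference clustering with two clusters.
ref_labels = SklearnKMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

ref_scores = {
    "silhouette": silhouette_score(X_scaled, ref_labels),
    "calinski_harabasz": calinski_harabasz_score(X_scaled, ref_labels),
    "davies_bouldin": davies_bouldin_score(X_scaled, ref_labels),
}
for name, value in ref_scores.items():
    print(f"{name}: {value:.4f}")
```

Scores close to those of the custom model suggest that both implementations settle on a similar partition. A large gap would be worth investigating, for example by checking initialization or convergence criteria.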
