# K-Means Clustering — No-Framework Implementation

This notebook builds K-Means from scratch using only NumPy. Every component of the algorithm is coded by hand, which makes it the most educational implementation to read.

**Dataset**: Dry Beans — 13,543 samples, 16 geometric features, 7 bean types.

## What We'll Build From Scratch
- **K-Means++ initialization** — smart centroid seeding (probability ∝ distance²)
- **Cluster assignment** — vectorized Euclidean distance via broadcasting
- **Centroid update** — mean of assigned samples + empty cluster handling
- **Lloyd's algorithm** — the iterative assign → update → converge loop
- **Multi-init wrapper** — n_init runs, keep best by inertia
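
As a quick orientation before the full implementation, the sketch below runs one assignment step, the K-Means++ weighting, and one centroid update on a toy array. The data values are made up purely for illustration, and empty-cluster handling is left out here, so treat it as a minimal sketch and not as the notebook's implementation.

```python
import numpy as np

# Toy data: 4 points in 2-D and 2 centroids (values chosen only for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 3.0]])
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

# Broadcasting: (4, 1, 2) - (1, 2, 2) -> (4, 2, 2), then sum over the feature axis
sq_dist = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)

# Cluster assignment: index of the nearest centroid for each point
labels = sq_dist.argmin(axis=1)

# K-Means++ weights: squared distance to the nearest centroid, normalized to sum to 1
min_sq = sq_dist.min(axis=1)
probs = min_sq / min_sq.sum()

# Centroid update: mean of the points assigned to each cluster
new_centroids = np.array(
    [X[labels == j].mean(axis=0) for j in range(len(centroids))]
)

print(labels)         # nearest-centroid index per point
print(probs)          # sampling weights for the next K-Means++ centroid
print(new_centroids)  # updated centroid positions
```

Points that already coincide with a centroid get weight zero, so K-Means++ never picks the same location twice unless the data contains duplicates.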


In [1]:
import sys
import os
import numpy as np

# Add project root to path for utils
sys.path.insert(0, os.path.abspath('../..'))
from utils.data_loader import load_processed_data
from utils.performance import track_performance
from utils.visualization import (plot_elbow_curve, plot_silhouette_comparison,
                                  plot_silhouette_analysis, plot_convergence_curve)
from utils.results import save_results, add_result, print_comparison
from utils.metrics import inertia, silhouette_score, adjusted_rand_index

# Configuration
RANDOM_STATE = 113
K_RANGE = range(2, 13)       # Test K=2 through K=12
MAX_ITER = 300
TOL = 1e-4
N_INIT = 5                   # 5 random initializations, keep best by inertia
FRAMEWORK = 'No-Framework'

# Load preprocessed data
X_train, X_test, y_train, y_test, meta = load_processed_data('kmeans')

print("=" * 60)
print(f"K-MEANS — {FRAMEWORK}")
print("=" * 60)
print(f"Training: {X_train.shape[0]:,} samples, {X_train.shape[1]} features")
print(f"Test:     {X_test.shape[0]:,} samples")
print(f"Classes:  {meta['n_classes']} ({meta['class_names']})")

K-MEANS — No-Framework
Training: 10,834 samples, 16 features
Test:     2,709 samples
Classes:  7 (['BARBUNYA', 'BOMBAY', 'CALI', 'DERMASON', 'HOROZ', 'SEKER', 'SIRA'])


In [2]:
# Step 1: K-Means++ initialization

def kmeans_plus_plus_init(X, k, rng):
    """
    Select k initial centroids using the k-means++ algorithm.

    The first centroid is chosen uniformly at random. Each subsequent
    centroid is chosen with probability proportional to D(x)^2, the
    squared distance to the nearest existing centroid. This spreads
    centroids apart for faster, more reliable convergence.

    Args:
        X: Training data (n_samples, n_features)
        k: Number of clusters
        rng: np.random.RandomState for reproducibility

    Returns:
        centroids: initial centroid positions (k, n_features)
    """
    n_samples, n_features = X.shape
    centroids = np.empty((k, n_features))

    # First centroid: random data point
    first_idx = rng.randint(0, n_samples)
    centroids[0] = X[first_idx]

    # Remaining centroids: weighted by squared distance
    for c in range(1, k):
        # Squared distance from each point to nearest existing centroid
        diff = X[:, np.newaxis, :] - centroids[np.newaxis, :c, :]
        sq_distances = np.sum(diff ** 2, axis=2)

        # Distance to nearest centroid for each point
        min_sq_dist = np.min(sq_distances, axis=1)

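        # Convert to probabilities. If every point coincides with an existing
        # centroid the total is zero, so fall back to uniform sampling (p=None)
        total = min_sq_dist.sum()
        probs = min_sq_dist / total if total > 0 else None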

        # Choose next centroid using weighted probability
        next_idx = rng.choice(n_samples, p=probs)
        centroids[c] = X[next_idx]

    return centroids

# Quick test
rng = np.random.RandomState(RANDOM_STATE)
test_centroids = kmeans_plus_plus_init(X_train, k=3, rng=rng)
print(f"Centroids shape: {test_centroids.shape}")
print(f"First centroid: {test_centroids[0][:5]}...")
print(f"Centroids are unique: {len(np.unique(test_centroids, axis=0)) == 3}")

Centroids shape: (3, 16)
First centroid: [ 0.1160314   0.49730625  0.79295167 -0.35469757  2.0429446 ]...
Centroids are unique: True
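
The function above recomputes distances to every chosen centroid on each pass. One possible alternative, sketched below, keeps a running minimum so each pass only measures distances to the newest centroid. The name `kmeans_pp_incremental` and the synthetic blob data are hypothetical and introduced only for this illustration; this is a sketch under those assumptions, not the implementation used in the rest of the notebook.

```python
import numpy as np

def kmeans_pp_incremental(X, k, rng):
    """Hypothetical k-means++ seeding variant that keeps a running minimum."""
    n_samples, n_features = X.shape
    centroids = np.empty((k, n_features))

    # First centroid: a uniformly random data point
    centroids[0] = X[rng.randint(0, n_samples)]

    # Squared distance from every point to the first centroid
    min_sq_dist = np.sum((X - centroids[0]) ** 2, axis=1)

    for c in range(1, k):
        total = min_sq_dist.sum()
        # Fall back to uniform sampling if every point coincides with a centroid
        probs = min_sq_dist / total if total > 0 else None
        centroids[c] = X[rng.choice(n_samples, p=probs)]

        # Only the newest centroid can lower a point's nearest-centroid distance
        new_sq = np.sum((X - centroids[c]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, new_sq)

    return centroids

# Synthetic data: three well-separated 2-D blobs (illustrative values only)
data_rng = np.random.RandomState(0)
blobs = np.vstack(
    [data_rng.normal(loc=m, scale=0.1, size=(50, 2)) for m in (0.0, 5.0, 10.0)]
)

seeds = kmeans_pp_incremental(blobs, k=3, rng=np.random.RandomState(0))
print(seeds.shape)  # (3, 2)
```

This reduces the temporary array from shape `(n_samples, c, n_features)` to `(n_samples, n_features)` per pass, which may matter for larger values of K.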
