# K-Means Clustering

## What is K-Means?
K-Means is an iterative, centroid-based partitioning algorithm for unsupervised learning that:

Partitions n observations into k clusters

Minimizes within-cluster variance (inertia)

Assumes spherical, equally-sized clusters with similar densities

```text
K-Means = Partition + Minimize Distance

Unsupervised learning - no labels needed

Centroid-based - each cluster has a center point

Distance minimization - points assigned to nearest centroid

Iterative refinement - repeats until convergence
```
Objective Function (Inertia/WCSS):
```
Inertia = Σ_k Σ_{x ∈ C_k} ||x - μ_k||²
```

where:

C_k = set of points assigned to cluster k

x = data point in cluster k

μ_k = centroid of cluster k

Goal: Minimize this value
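
The objective can be computed directly from the points and their cluster assignments. The following is a minimal sketch in plain Python, assuming each point is a list of numbers and `labels[i]` is the cluster index of `X[i]`; the function name `inertia` and the sample data are illustrative and do not come from any library.

```python
def inertia(X, labels):
    """Within-cluster sum of squared distances (WCSS)."""
    total = 0.0
    for k in set(labels):
        members = [p for p, label in zip(X, labels) if label == k]
        # Centroid mu_k: per-dimension mean of the points in cluster k
        mu_k = [sum(dim) / len(members) for dim in zip(*members)]
        # Add ||x - mu_k||^2 for every point x in cluster k
        total += sum((a - m) ** 2 for p in members for a, m in zip(p, mu_k))
    return total


X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
print(inertia(X, [0, 0, 0, 1, 1, 1]))  # 16.0
print(inertia(X, [0, 1, 0, 1, 0, 1]))  # 124.0
```

The second labeling mixes the two groups, so its inertia is much larger. K-Means searches for the assignment that makes this number small.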

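Complexity:
```
Time:  O(n × k × d × i)
Space: O(n × k + k × d)
```
where n = number of points, k = number of clusters, d = number of features, i = number of iterations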

Clustering = grouping similar data points
No labels given
Algorithm finds structure by itself
That’s unsupervised learning


K-Means = partition data into K clusters using distance

K = number of clusters (chosen by us)

Means = centroid = average of points

Each cluster is represented by its mean

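Key parameters (scikit-learn):
KMeans(
    n_clusters=8,           # Most important: choose carefully
    init='k-means++',       # Smart initialization
    n_init=10,              # Run 10 times, keep the run with the lowest inertia
    max_iter=300,           # Maximum iterations per run
    tol=1e-4,               # Convergence tolerance
    random_state=42,        # Reproducibility
    algorithm='lloyd'       # 'lloyd' or 'elkan'; 'auto' was removed in scikit-learn 1.3
)
'elkan' can be faster when clusters are well separated, at the cost of more memory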


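Random init → can give bad clusters
K-Means++ → smart initialization

How?

First centroid is picked uniformly at random from the data points

Each next centroid is a data point picked with probability proportional to its squared distance from the nearest centroid chosen so far, so far-away points are favored

Faster convergence
Better clusters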
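
The seeding idea can be sketched in a few lines of plain Python. This is an illustrative version, not scikit-learn's implementation (which also tries several candidates per step); the names `squared_distance` and `kmeans_plus_plus_init` are made up for this example, and it assumes `X` contains at least `k` distinct points.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_plus_plus_init(X, k, seed=None):
    """Pick k starting centroids from X using k-means++ style seeding."""
    rng = random.Random(seed)
    # First centroid: uniformly at random from the data
    centroids = [rng.choice(X)]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        # Already-chosen points get weight 0, so they cannot be picked again.
        weights = [min(squared_distance(p, c) for c in centroids) for p in X]
        # Sample the next centroid with probability proportional to its weight
        centroids.append(rng.choices(X, weights=weights, k=1)[0])
    return centroids


X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
print(kmeans_plus_plus_init(X, k=2, seed=0))
```

Because points far from the first centroid carry most of the weight, the second centroid usually lands in the other group, which tends to give the iterations a better starting point than purely random picks.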

## When K-Means Fails (Very Important)

Non-spherical clusters
Different cluster sizes
Different densities
Sensitive to outliers
Need to choose K manually

Example where K-Means fails:
Moon-shaped data

K-Means is distance-based

Features must be scaled, otherwise features with large ranges dominate the distance
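
To see why scaling matters for a distance-based method, the sketch below compares distances before and after standardization. The customer data is hypothetical, the helper `standardize` is written here for illustration (it mirrors what a standard scaler does), and it assumes no feature is constant.

```python
import math


def standardize(X):
    """Rescale every feature to zero mean and unit standard deviation."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Hypothetical customers: [age in years, income in dollars]
X = [[25, 50_000], [60, 51_000], [26, 90_000]]

# Distances from customer 0 to customers 1 and 2, on the raw features
raw = [math.dist(X[0], X[1]), math.dist(X[0], X[2])]

# The same distances after standardizing both features
Z = standardize(X)
scaled = [math.dist(Z[0], Z[1]), math.dist(Z[0], Z[2])]

print(raw)     # income dominates: the 35-year age gap barely registers
print(scaled)  # after scaling, the age gap and the income gap count similarly
```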

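## Evaluation Metrics

Since there are no labels, use internal metrics:

Inertia (WCSS): lower is better, but it always decreases as K grows

Silhouette Score: ranges from -1 to 1, higher is better

Davies-Bouldin Index: 0 or greater, lower is better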
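
All three metrics are available through scikit-learn. The following is a small sketch on the six-point toy dataset used in these notes; `n_init=10` and `random_state=0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

inertia = model.inertia_               # WCSS of the fitted model, lower is better
sil = silhouette_score(X, labels)      # in [-1, 1], higher is better
dbi = davies_bouldin_score(X, labels)  # >= 0, lower is better

print("Inertia:", inertia)
print("Silhouette:", sil)
print("Davies-Bouldin:", dbi)
```

Inertia comes from the fitted model itself, while the other two scores only need the data and the labels, so they can also be used to compare different clustering algorithms.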

In [6]:
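import math
import random


def euclidean_distance(p1, p2):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def assign_clusters(X, centroids):
    """Assignment step: put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]

    for point in X:
        distances = [euclidean_distance(point, c) for c in centroids]
        cluster_index = distances.index(min(distances))
        clusters[cluster_index].append(point)

    return clusters


def update_centroids(clusters, old_centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []

    for cluster, old in zip(clusters, old_centroids):
        if not cluster:
            # An empty cluster has no mean; keep its previous centroid
            new_centroids.append(old)
            continue
        centroid = [sum(dim) / len(cluster) for dim in zip(*cluster)]
        new_centroids.append(centroid)

    return new_centroids


def kmeans(X, k, max_iters=100):
    # Random initialization: pick k distinct data points as starting centroids
    centroids = random.sample(X, k)

    for _ in range(max_iters):
        clusters = assign_clusters(X, centroids)
        new_centroids = update_centroids(clusters, centroids)

        # Converged: the centroids did not move in this iteration
        if new_centroids == centroids:
            break

        centroids = new_centroids

    return clusters, centroids


X = [
    [1, 2], [1, 4], [1, 0],
    [10, 2], [10, 4], [10, 0]
]

clusters, centroids = kmeans(X, k=2)

print("Centroids:", centroids)
print("Clusters:", clusters)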


Centroids: [[1, 2], [10, 2]]
Clusters: [[[1, 2], [1, 4], [1, 0]], [[10, 2], [10, 4], [10, 0]]]


In [7]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([
    [1, 2], [1, 4], [1, 0],
    [10, 2], [10, 4], [10, 0]
])

# K-Means is distance-based, so standardize each feature first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Named `model` so it does not shadow the kmeans() function defined earlier
model = KMeans(
    n_clusters=2,
    init='k-means++',
    random_state=42
)

model.fit(X_scaled)
labels = model.labels_
centroids = model.cluster_centers_   # in scaled units, not the original ones

print("Labels:", labels)
print("Centroids:", centroids)


Labels: [0 1 0 1 1 0]
Centroids: [[-0.33333333 -0.81649658]
 [ 0.33333333  0.81649658]]


In [9]:
from sklearn.metrics import silhouette_score

score = silhouette_score(X_scaled, labels)
print("Silhouette Score:", score)


Silhouette Score: 0.15910418698883563
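
A silhouette score of about 0.16 is low, and the labels printed above put points from both the x = 1 and x = 10 groups into the same cluster. One way to check whether a better partition exists is to score an alternative labeling on the same scaled data. The sketch below assumes the same six points and the same standardization; the `by_x` labeling is a hand-written alternative for comparison, not something the model produced.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
X_scaled = StandardScaler().fit_transform(X)

reported = np.array([0, 1, 0, 1, 1, 0])  # labels printed by the run above
by_x = np.array([0, 0, 0, 1, 1, 1])      # hypothetical alternative: split on the first feature

s_reported = silhouette_score(X_scaled, reported)
s_by_x = silhouette_score(X_scaled, by_x)

print("Reported labels:", s_reported)
print("Split on x:     ", s_by_x)
```

If the hand-written split scores higher, the fitted model most likely stopped at a local optimum. Running more initializations (for example `n_init=10`) is a common remedy.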


## Why use K-Means?
To find structure in unlabeled data

When:

You don’t have labels

You still want to group similar data

Example
You have customers, but no “type” column
→ K-Means discovers customer segments automatically

Simple, fast, and scalable

Why companies love it:

Easy to understand

Very fast even for large datasets

Handles many numeric features once they are scaled, though distances become less informative in very high dimensions

Used in:

Industry

Interviews

Baseline clustering

Clear objective function

K-Means minimizes the within-cluster sum of squared distances (WCSS)

This makes it:

Mathematically clean

Easy to optimize (no iteration increases WCSS, though the result may be a local optimum)

Easy to explain

Easy to interpret results

Output is:

Cluster labels

Centroids (means)

Centroid = “average representative” of group

Good baseline model

Before trying complex clustering:

Try K-Means first

If it works → great

If not → move to DBSCAN / Hierarchical

## When to use K-Means?

Use K-Means only if these conditions are mostly true:

Data is numerical

K-Means uses distance, so every feature must be a number.

Suitable: height, weight, income
Not suitable: color, category, text (without encoding)
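
As an illustration of what "encoding" can mean for a categorical feature, here is a minimal one-hot encoding sketch in plain Python. The helper `one_hot` and the color data are invented for this example. Integer codes such as red=0, green=1, blue=2 would wrongly imply that some colors are closer together than others, whereas one-hot vectors keep every pair of distinct categories equally far apart.

```python
def one_hot(values):
    """Map each category to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))  # fixed column order: alphabetical
    return [[1 if v == c else 0 for c in categories] for v in values]


colors = ["red", "green", "blue", "green"]
print(one_hot(colors))
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Even with encoding, K-Means on purely categorical data is a rough approximation, since a mean of 0/1 vectors is not itself a category.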

Clusters are roughly spherical

K-Means assumes:

Round-shaped clusters

Equal spread

Works well when clusters look like “balls”

Similar cluster sizes

If one cluster is huge and one is tiny → bad

You know (or can estimate) K

You can:

Use Elbow Method

Use Silhouette Score

Use domain knowledge
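
The elbow method and the silhouette score can both be computed in one loop over candidate values of K. The sketch below uses scikit-learn and the six-point toy dataset from these notes; the range of K and the settings `n_init=10`, `random_state=0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

inertias = {}
silhouettes = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    if k >= 2:
        # The silhouette score is only defined for 2 or more clusters
        silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(k, round(inertias[k], 2), silhouettes.get(k))
```

For the elbow method, look for the K after which inertia stops dropping sharply. For the silhouette score, prefer the K with the highest value.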

Few outliers

Outliers pull centroids badly

Always check:

Boxplot

Scatter plot
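
The pull of an outlier on a centroid is easy to demonstrate, because a centroid is just a mean. The sketch below uses made-up points and a helper `centroid` written for this example.

```python
def centroid(points):
    """Per-dimension mean of a list of points."""
    return [sum(dim) / len(points) for dim in zip(*points)]


cluster = [[1, 2], [1, 4], [1, 0]]
print(centroid(cluster))               # [1.0, 2.0]
print(centroid(cluster + [[40, 2]]))   # [10.75, 2.0]
```

A single far-away point moves the centroid to a location where none of the original points are, which is why it is worth checking for outliers before clustering.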

## When NOT to use K-Means
Non-spherical clusters

Example:

Moon shape

Spiral shape

Use DBSCAN
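
The moon-shaped case can be reproduced with scikit-learn's `make_moons`. The sketch below compares K-Means and DBSCAN using the adjusted Rand index against the known moon membership, which is used only for evaluation. The values `eps=0.2` and `min_samples=5` are assumptions tuned to this synthetic data and would need adjusting elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half circles with a little noise
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true moons
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print("K-Means ARI:", ari_kmeans)
print("DBSCAN ARI: ", ari_dbscan)
```

K-Means cuts the plane with a straight boundary between its two centroids, so it slices across both moons, while DBSCAN follows the dense curved regions.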

Different densities

One cluster dense, another sparse

K-Means fails

Categorical data only

Euclidean distance between categories doesn’t make sense

Use:

K-Modes

Hierarchical clustering (with a dissimilarity suited to categories, such as Hamming distance)

Many outliers

Centroid shifts incorrectly
Clean data or use DBSCAN


Customer Segmentation
Features: age, income, spending_score
Why K-Means?
✔ Numeric
✔ Want segments
✔ Fast

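Image Compression
Features: pixel color values (e.g. RGB); each pixel is replaced by its centroid color, leaving K colors

Other numeric features work the same way: sales, visits, revenue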

Document Clustering (after TF-IDF)
Vectors are numeric → K-Means works

Why do you use K-Means?
K-Means is used to cluster unlabeled numerical data by minimizing within-cluster variance. It is fast, simple, scalable, and works well when clusters are spherical and of similar size.

When do you use K-Means?
I use K-Means when the data is numerical, scaled, has few outliers, and when the number of clusters can be estimated using methods like the elbow method or the silhouette score.

K-Means = fast + numeric + spherical + known K



