# What is **Clustering** in Machine Learning?

**Clustering** is an **unsupervised learning** technique where:

- The goal is to **group data points** into **clusters**.
- Points **in the same cluster** are **similar** to each other.
- Points **in different clusters** are **different** from each other.

**Key point:**  
You don't have labeled data telling you which point belongs to which group — the algorithm figures it out on its own!


![image.png](attachment:ef6cedaf-d9f7-4da0-8b9b-4f059928c1f8.png)

### Intuitive Example:

Imagine you have a bunch of photos of animals — but you don't know their species.  
A clustering algorithm could automatically group:
- All **dogs** together,
- All **cats** together,
- All **birds** together,  
**even though it was never told** what a "dog" or "cat" is!


### Common Clustering Algorithms

| Algorithm | Main Idea |
|:----------|:----------|
| **K-Means** | Partition the data into K groups by minimizing the distance to group centers |
| **DBSCAN** | Group points that are closely packed together (density-based) |
| **Hierarchical Clustering** | Build a tree of clusters by progressively merging or splitting groups |
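
As a rough illustration of the table above, the three algorithms can be run side by side with scikit-learn. This is a minimal sketch on synthetic data: the blob centers and the `eps` and `min_samples` values are assumptions chosen for this toy example, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data: three well-separated blobs (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# K-Means and hierarchical clustering need the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

for name, labels in [
    ("K-Means", kmeans_labels),
    ("DBSCAN", dbscan_labels),
    ("Hierarchical", hier_labels),
]:
    n_found = len(set(labels) - {-1})  # ignore the DBSCAN noise label
    print(f"{name}: {n_found} clusters found")
```

Note that only DBSCAN can leave points unassigned, which is what makes it usable for the anomaly-detection use case.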


### Applications of Clustering

- **Customer segmentation** (group similar customers in marketing)
- **Image compression** (group similar pixels)
- **Anomaly detection** (find outliers that don't belong to any cluster)
- **Organizing large datasets** without labels


### Key Concepts to Understand

- **Distance metrics** (how to measure "similarity") — e.g., Euclidean distance
- **Number of clusters (K)** — sometimes you need to decide it manually
- **Intra-cluster similarity** (points in the same cluster are close)
- **Inter-cluster dissimilarity** (different clusters are well-separated)


### Quick Diagram

```
Data points  --->  Clustering Algorithm  --->  Groups (Clusters)
  (X1,X2)              (e.g., K-Means)          Cluster 1
                                                Cluster 2
                                                Cluster 3
```

### Important difference:
- **Classification** = supervised (you know the correct labels).
- **Clustering** = unsupervised (you discover structure without labels).

# The K-Means algorithm

K-Means is one of the most popular clustering algorithms.

Simplified steps for **K-Means** clustering:
1. Pick **K** random points as initial cluster centers (centroids).
2. Assign each data point to the **nearest centroid**.
3. Move each centroid to the **average** position of the points assigned to it.
4. Repeat steps 2–3 until the centroids **don’t move much** anymore.


![image.png](attachment:bc53f060-de2d-4887-8354-33f3a398afa7.png)
![image.png](attachment:353d4881-dae0-4eb8-a4d8-0ba240ff1caf.png)

### Reasons why K-Means is so popular

1. **Simplicity and Ease of Understanding**
    - The algorithm is very intuitive:  
      - Choose cluster centers (centroids).
      - Assign points to the nearest center.
      - Update centers and repeat.
    - You don’t need a deep math background to understand or implement it.
2. **Speed and Scalability**
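    - **K-Means is fast**, even on large datasets:
      - Each iteration involves only simple calculations (distances and means), so its cost grows linearly with the number of samples, the number of features, and K.
    - It **scales well** to datasets with millions of samples, although Euclidean distances become less informative as the number of dimensions grows.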
3. **Low Memory Usage**
    - K-Means needs to store only:
      - The centroids,
      - Assignments of points,
      - The dataset itself.
    - Very **memory-efficient** compared to more complex algorithms.
4. **Works Well in Many Practical Cases**
    - If clusters are roughly **spherical** and **similar in size**, K-Means often performs very well.
    - Good enough for many industrial, marketing, and basic scientific applications.
5. **Easy to Implement and Available Everywhere**
    - K-Means is **available in most major machine-learning toolkits**:  
      `scikit-learn`, `MATLAB`, `R`, etc., and community implementations exist for `TensorFlow` and `PyTorch`.
    - You can also write a basic version yourself in just a few lines of Python.
6. **Can be Improved Easily**
    - Many improvements/extensions exist:
      - **K-Means++** (smart initialization, better results)
      - **Mini-batch K-Means** (even faster for very big data)
    - These make it even more attractive while keeping its simplicity.
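
Both extensions are exposed by scikit-learn, so they can be tried with very little code. The sketch below is only an illustration on synthetic data; the dataset size, `batch_size`, and `n_init` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration: 5000 points around 4 centers.
X, _ = make_blobs(n_samples=5000, centers=4, n_features=2, random_state=42)

# Standard K-Means with K-Means++ initialization (scikit-learn's default init).
full = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)

# Mini-batch variant: each update uses a random batch instead of the full dataset.
mini = MiniBatchKMeans(
    n_clusters=4, init="k-means++", batch_size=256, n_init=10, random_state=42
).fit(X)

# inertia_ is the sum of squared distances of samples to their closest centroid.
print("K-Means inertia:           ", full.inertia_)
print("Mini-batch K-Means inertia:", mini.inertia_)
```

Mini-batch K-Means trades a little clustering quality for speed, so its inertia is typically close to, but not exactly the same as, the full algorithm's.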


### K-Means also has weaknesses
- It **assumes clusters are convex, spherical**, and similar-sized.
- It’s **sensitive to outliers** (a single outlier can badly affect centroids).
- You must **choose K manually** (number of clusters).

(That’s why sometimes people use more advanced methods like **DBSCAN**, **GMM**, or **Spectral Clustering**.)
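
The sensitivity to outliers follows directly from the centroid being a mean. A small hand-made example (the points are made up for illustration) shows how far one extreme value can drag it:

```python
import numpy as np

# Four points forming a tight cluster, plus one far-away outlier.
cluster = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
outlier = np.array([[50.0, 50.0]])

centroid_clean = cluster.mean(axis=0)
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)

print(centroid_clean)         # [1.5 1.5]
print(centroid_with_outlier)  # [11.2 11.2]
```

With the outlier included, the centroid no longer lies anywhere near the four points it is supposed to represent.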

### Detailed explanation of K-Means

We are given a dataset of $n\_samples$ samples: $x^{(1)}, x^{(2)}, \dots, x^{(n\_samples)} \in \mathbb{R}^{n\_features}$, where each sample has $n\_features$ features.

1. Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^{n\_features}$: centroids are vectors in the same vector space as the samples, and each one is the center of a cluster of samples in that space.
2. Assign each sample to its closest centroid (steps 2 and 3 are repeated until convergence):
   - for i = 1 to $n\_samples$
     - find the closest centroid to $x^{(i)}$ according to the squared L2 norm (this is the method we use to measure the distance to the centroids):
       $$
       c^{(i)} := \text{ index from 1 to K of centroid closest to } x^{(i)} = \arg\min_{k} \| x^{(i)} - \mu_k \|_2^2
       $$
3. Move cluster centroids:
   - for k = 1 to K:
        $$
        \mu_k := \text{ average (mean) of points assigned to cluster } k = \frac{1}{N_k} \sum_{i: c^{(i)} = k} x^{(i)}
        $$
     
        where:
        - $N_k$ is the number of points assigned to cluster $k$
4. Convergence Criterion
    - The algorithm **stops** when:
      - **Cluster assignments** no longer change, **or**
      - **Centroids** change very little (below a threshold), **or**
      - After a **maximum number of iterations**.
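
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, and the choice to leave an empty cluster's centroid unchanged are assumptions made for this example.

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: use K distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: squared L2 distance from every sample to every centroid,
        # shape (n_samples, K); each sample takes the index of the closest one.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[new_labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[k] = members.mean(axis=0)
        # Step 4: stop when assignments are stable or centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        stable = labels is not None and np.array_equal(new_labels, labels)
        centroids, labels = new_centroids, new_labels
        if stable or shift < tol:
            break
    return labels, centroids

# Usage on two synthetic, well-separated groups of points.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, K=2)
print(centroids)
```

The `max_iter` bound covers the third stopping rule: the loop ends even if neither of the other two criteria is met.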


### Distance Computation

Usually, we use the **Euclidean distance**, also called the L2 norm, which is the as-the-crow-flies distance:

$$
d(x_i, \mu_k) = \sqrt{ \sum_j (x_{ij} - \mu_{kj})^2 }
$$

![image.png](attachment:167a25a9-4468-4fa9-8a4c-cbb84b1c83de.png)

We usually square it to simplify the computation: the square root does not change the ordering of the distances, so we do not need to apply it when comparing them.
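
A quick numerical check of this point, using one made-up sample and three made-up centroids: the closest centroid is the same whether or not the square root is taken.

```python
import numpy as np

x = np.array([1.0, 2.0])
centroids = np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 4.0]])

# Euclidean (L2) distances: sqrt(5), 1, sqrt(13)
dists = np.linalg.norm(centroids - x, axis=1)
# Squared distances: 5, 1, 13 (no square root needed)
sq_dists = ((centroids - x) ** 2).sum(axis=1)

print(dists.argmin(), sq_dists.argmin())  # 1 1
```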



### Key Points to Know:

| Point | Explanation |
|:------|:------------|
| Sensitivity to initialization | Poor initial centroids → bad clustering (local minima). |
| Non-convex clusters | K-Means struggles when clusters are non-convex or far from spherical. |
| Number of clusters $K$ | Must be specified beforehand (can be chosen by heuristics like the Elbow method). |
| Speed | Very fast for small/medium datasets! |
| Scalability | K-Means scales well with data size (especially with optimizations like mini-batch K-Means). |
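
The Elbow method mentioned in the table can be sketched as follows: fit K-Means for several values of K, record the inertia (the sum of squared distances to the closest centroid), and look for the K after which the improvement flattens out. The synthetic data and the range of K tried here are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three true groups (illustrative only).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

ks = list(range(1, 8))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `ks` (for example with `plt.plot(ks, inertias, marker="o")`) makes the bend easier to see. On this toy data the drop is large up to K=3 and small afterwards.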

# KMeans with scikit-learn

Let's see how to perform clustering using the K-Means algorithm of scikit-learn.

In [1]:
# Install dependencies:
%pip install kagglehub
%pip install matplotlib
%pip install numpy
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.




In [2]:
# Import dependencies

import kagglehub
from kagglehub import KaggleDatasetAdapter
import matplotlib.pyplot as plt
import numpy as np

In [3]:
%matplotlib inline

### Loading the dataset

In [13]:
# Set the path to the file you'd like to load
file_path = "penguins.csv"

# Load the latest version
penguins_df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "youssefaboelwafa/clustering-penguins-species",
  file_path
).dropna()

# Encode sex as a number: 1 for 'MALE', 0 for any other value
penguins_df['sex'] = penguins_df['sex'].map(lambda sex: 1 if sex == 'MALE' else 0)





In [14]:
print(penguins_df)

     culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g  sex
0                39.1             18.7              181.0       3750.0    1
1                39.5             17.4              186.0       3800.0    0
2                40.3             18.0              195.0       3250.0    0
4                36.7             19.3              193.0       3450.0    0
5                39.3             20.6              190.0       3650.0    1
..                ...              ...                ...          ...  ...
338              47.2             13.7              214.0       4925.0    0
340              46.8             14.3              215.0       4850.0    0
341              50.4             15.7              222.0       5750.0    1
342              45.2             14.8              212.0       5200.0    0
343              49.9             16.1              213.0       5400.0    1

[335 rows x 5 columns]


In [18]:
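import os

# Limit OpenMP to one thread; set this before importing scikit-learn
# so that the OpenMP runtime picks it up.
os.environ['OMP_NUM_THREADS'] = '1'

from sklearn.cluster import KMeans

# Fit K-Means with 2 clusters on all five columns (fixed seed for reproducibility)
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(penguins_df)
print(kmeans.labels_)           # cluster index assigned to each row
print(kmeans.cluster_centers_)  # one centroid per cluster, in the original feature units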

[0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1
 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 1]
[[4.20206573e+01 1.80389671e+01 2.13704225e+02 3.68978873e+03
  4.36619718e-01]
 [4.74237705e+01 1.56516393e+01 2.15491803e+02 5.11598361e+03
  6.22950820e-01]]
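
In the centroids printed above, the two clusters differ mostly in `body_mass_g` (about 3690 g versus 5116 g). Because K-Means relies on Euclidean distance, a column measured in thousands can outweigh columns measured in tens. A common remedy is to standardize the features first. The sketch below shows the idea on a small synthetic stand-in (two made-up columns, not the penguin data); whether scaling improves the penguin clustering is left as an experiment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: the small-scale column separates two groups,
# while the large-scale column is pure noise.
group = np.repeat([0, 1], 100)
small = np.where(group == 0, 15.0, 19.0) + rng.normal(0, 0.5, size=200)
large = rng.normal(4000, 800, size=200)
X = np.column_stack([small, large])

def agreement(labels, truth):
    """Fraction of points matching the true groups, up to label swapping."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

# K-Means on raw values: distances are dominated by the large-scale column.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# K-Means after standardizing each column to zero mean and unit variance.
scaled_model = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0)
)
scaled_labels = scaled_model.fit_predict(X)

print("agreement without scaling:", agreement(raw_labels, group))
print("agreement with scaling:   ", agreement(scaled_labels, group))
```

On this toy example the unscaled run splits the data along the noisy large-scale column, while the scaled run recovers the two groups.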
