**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2025-08-16

**Last update:** 2025-10-06

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you find any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

* Your answers for the "⚡ Mandatory" sections of each lab <span style="color:red"><strong>must be submitted before the start of the next lab session</strong></span>.

ENJOY WORKING ON THIS LAB.
***

# 🛠️ Purpose and Learning Outcomes:

- Understand the concept of clustering.
- Apply the K-Means clustering algorithm.
- Evaluate clustering performance using metrics like inertia and silhouette score.
- Understand the DBSCAN clustering algorithm.
- Apply Gaussian Mixture Models (GMM) for clustering.
***

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../utils'))
from notebook_config import *

# Clustering and Unsupervised Learning

Clustering is a key technique in unsupervised ML used to group similar data points without predefined labels. Its goal is to discover natural groupings in data, where items in the same cluster are more similar to each other than to items in different clusters. This makes clustering especially useful for Exploratory Data Analysis (EDA) to identify patterns, reveal hidden structures, and flag outliers. Rather than relying on a single method, clustering includes various algorithms, each with its own definition of similarity and approach to forming clusters. Here, we practice on a few of the most common models.


# K-Means

K-means is a popular clustering algorithm used to group data into $K$ clusters. It works by iteratively assigning data points to the nearest cluster centroid and then updating each centroid to the mean of the points assigned to it. Below, I illustrate how K-means iteratively updates centroids and cluster assignments using 3 initial centroids.

<p align="center">
  <img src="../figures/kmeans_animation.gif" width="700">
</p>
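
To make the two alternating steps concrete, here is a minimal NumPy sketch of the assignment/update loop (often called Lloyd's algorithm). It is an illustration only, not the Scikit-Learn implementation: the function name `kmeans_lloyd`, the random initialization from data points, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch (Lloyd's algorithm); illustration only."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of points.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                   rng.normal(5.0, 0.3, size=(50, 2))])
toy_labels, toy_centroids = kmeans_lloyd(X_toy, k=2)
print(np.round(toy_centroids, 2))
```

The two printed centroids should sit close to the centers of the two toy groups. In the rest of the lab we rely on Scikit-Learn's `KMeans`, which adds a smarter initialization and other refinements.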

Let's load a dataset with multiple features that we aim to group into K clusters. Since we don't know what the data looks like or how many clusters it may contain, we start by exploring the dataset. We use `Pandas` to load and inspect the data before applying any clustering algorithm.

In [None]:
import pandas as pd

# Load the dataset into a DataFrame.
df = pd.read_csv("../datasets/clustering_data.csv", comment="#").iloc[:, :-1]  # Drop the last column (y, the true labels); it is not needed here.

Since the dataset is small, we can jump straight to visualizing it. For large datasets, this is not as straightforward or trivial. I intentionally set `alpha=0.5` to introduce transparency in the scatter plot, which helps you visually assess the density of overlapping data points. Using transparency is a simple yet powerful technique to enhance clarity and insight in dense visualizations.

In [None]:
def plot_scatter(X):
    assert X.shape[1] == 2, "X must have exactly two features for scatter plot."

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], marker="o",
        c='grey', s=50, edgecolors="k", linewidth=0.6, alpha=0.5)

    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    
    plt.grid(True)
    plt.show()

We visualize x1 vs. x3 features. Our dataset contains 4 features: x1, x2, x3, and x4.
Here, we only focus on x1 vs. x3. You can of course try different features later.

In [None]:
X = df[["x1", "x3"]].values  # Select two features for 2D scatter plot
plot_scatter(X)

***
### ✅ Check your understanding

- How many clusters do you see in the data?
***

## Outlier Detection

There are different methods to detect outliers. Earlier, you tried Box Plots. Here, we want to try another method, called the Z-score. 

The Z-score (also known as a standard score in statistics) measures how far a data point is from the mean of a dataset in terms of standard deviations. It helps determine whether a specific value is above or below the average and by how much. We calculate the Z-score of a data point X as $Z = (X - \mu)/\sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation.

A Z-score = 0 shows that the data point is exactly at the mean. A positive (negative) Z-score indicates that the data point is above (below) the mean. Z-scores are useful to identify outliers in a dataset. Typically, points with Z-scores beyond a threshold of ±3 are considered unusual and can be marked as outliers. 
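
As a small worked example of the formula, the sketch below computes Z-scores by hand and compares them with `scipy.stats.zscore`. The numbers in `values` are made up for illustration and are not part of the lab dataset.

```python
import numpy as np
from scipy import stats

# Toy sample (hypothetical values): 20 ordinary readings and one extreme value.
values = np.array([10.0] * 10 + [12.0] * 10 + [30.0])

mu = values.mean()
sigma = values.std()          # population std (ddof=0), the scipy default
z_manual = (values - mu) / sigma

# scipy.stats.zscore applies the same formula.
z_scipy = stats.zscore(values)

print(np.round(z_manual[-1], 2))          # Z-score of the extreme value
print(np.where(np.abs(z_manual) > 3)[0])  # indices flagged as outliers
```

Only the last value ends up beyond the ±3 threshold. Note that the extreme value itself inflates both $\mu$ and $\sigma$, which is one reason the Z-score method can miss outliers in very small samples.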

In the code below we apply this method to our dataset and identify any potential outliers.

In [None]:
# Z-score is part of the scipy library.
from scipy import stats

# Finding outliers using Z-score
def detect_outliers_zscore(data, threshold=3):
    z_scores = np.abs( stats.zscore(data))
    outliers = (z_scores > threshold)
    print(f"Number of outliers using |Z-score| > {threshold}:")
    print(pd.DataFrame(outliers, columns=data.columns).sum())
    
detect_outliers_zscore(df)

***
See the figure below. Can you locate the region where |Z-score| > 3, and can you explain why values in that region are so unlikely that we treat them as outliers?
<p align="center">
  <img src="../figures/Gaussian.png" width="600">
</p>

The code below applies K-Means clustering to our dataset `X` and groups the points into the number of clusters given by `n_clusters`. It uses the `fit_predict` method to both fit the model and assign each data point to the nearest cluster based on Euclidean distance. Once training is complete, the final cluster centroids and data labels (the cluster assignments) are used to analyze the clustering result.

In [None]:
from sklearn.cluster import KMeans

# plot KMeans clusters
def plot_kmeans_clusters(X, n_clusters=3):
    # Fit KMeans
    #kmeans = KMeans(n_clusters=k, init="k-means++", random_state=42)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X)  # Get the labels (cluster assignments) for each point

    centroids = kmeans.cluster_centers_  # Get the centroids of the clusters    
    print("Centroids:\n", np.round(centroids, 2))

    plt.figure()   
    plt.scatter(X[:, 0], X[:, 1], marker="o",
        c=[colors[i] for i in labels], s=50, edgecolors="k", linewidth=0.6, alpha=0.5)

    # plot centroids
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="X",
        c="red", s=120, edgecolors="k", linewidth=1.2, alpha=0.7, label="Centroids")

    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title(f"K-Means Clustering with k = {n_clusters}")
    plt.grid(True)
    plt.legend()
    plt.show()

First we try it with 2 clusters.

In [None]:
plot_kmeans_clusters(X, n_clusters=2)

Then, with 5 clusters.

In [None]:
plot_kmeans_clusters(X, n_clusters=5)

You can change the method for the initial assignment of the centroids in the KMeans algorithm using the `init` parameter in Scikit-Learn's KMeans class. The two supported methods are `'k-means++'` and `'random'`. You can do this by:
```python
    kmeans = KMeans(n_clusters=k, init="k-means++", random_state=42)
```
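
To see the effect of `init` in isolation, a rough sketch like the following can help. It uses synthetic data from `make_blobs` (an assumption, not the lab dataset) and sets `n_init=1` so that each model runs a single initialization instead of keeping the best of several.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration, not the lab dataset).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

inertias = {}
for init in ("k-means++", "random"):
    # n_init=1 runs a single initialization so the effect of `init` is visible.
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=42).fit(X_demo)
    inertias[init] = km.inertia_
    print(f"init={init!r:12}  inertia={km.inertia_:.1f}  iterations={km.n_iter_}")
```

Try a few different `random_state` values: with a single initialization, `'random'` is more likely to end in a poor local minimum than `'k-means++'`, although on easy data both often reach the same solution.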

## Choosing the number of clusters

A weakness of KMeans clustering is that running the model does not tell us how many clusters we need. We have to test a range of values and decide on the best value of K. A common choice is the Elbow method, where we aim to neither **overfit** with too many clusters nor **underfit** with too few.

Yes, the concepts of `overfitting` and `underfitting` also exist in unsupervised learning!

In the code below we calculate the inertia (Within-Cluster Sum of Squared distances, WCSS) for different values of K (from 1 to 9). We then plot the inertia against the number of clusters to visualize the "elbow" point, which guides us in selecting the optimal number of clusters.

In [None]:
# Elbow method to find optimal number of clusters
WCSS = []  # Within-Cluster Sum of Squares for each k. WCSS is also known as inertia.

k_vals = range(1, 10)  # Test k from 1 to 9

for k in k_vals:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    WCSS.append(kmeans.inertia_)

# Plot elbow
plt.figure()
plt.plot(k_vals, WCSS, "--", marker="o", color="RoyalBlue", linewidth=1.0, alpha=1.0)

plt.xlabel("Number of Clusters (k)")
plt.ylabel("WCSS (or Inertia)")
plt.grid(True)
plt.show()
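
To check what `inertia_` actually measures, the sketch below recomputes the WCSS by hand: the squared Euclidean distance from each point to its own centroid, summed over all points. The data come from `make_blobs` and are an assumption for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration, not the lab dataset).
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# WCSS: squared Euclidean distance of each point to its own centroid, summed.
own_centroids = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X_demo - own_centroids) ** 2)

print(f"manual WCSS = {wcss_manual:.3f}, kmeans.inertia_ = {km.inertia_:.3f}")
```

The two numbers should agree up to floating-point rounding.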

***
### ✅ Check your understanding

- Examine the inertia graph to determine the optimal value for K.
- To enhance the clarity of the results, modify the figure based on the lecture notes to improve the WCSS presentation.
***

## Evaluation Methods for KMeans Clustering

In the code section below, we use four main metrics to evaluate KMeans clustering: **WCSS**, **Silhouette Score**, **Davies-Bouldin Index**, and **Calinski-Harabasz Index**.

1. **WCSS (inertia)**: This metric, as we used it earlier, measures the total squared distance of the data points to their cluster centroids. It quantifies how tightly the data points in each cluster are packed, so lower WCSS values indicate more compact clusters. Keep in mind that WCSS generally keeps decreasing as K grows, so we look for the elbow rather than the minimum.

2. **Silhouette Score**: This score assesses how similar a data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a high score indicates that points are well clustered. A score close to 1 suggests that the points are far away from the neighboring clusters, and a score close to 0 indicates overlapping clusters (see lecture notes).

3. **Davies-Bouldin Index**: We did not discuss this index in the class. It evaluates the average similarity ratio of each cluster with its most similar one. A lower Davies-Bouldin index indicates better clustering, as it suggests that clusters are well-separated and distinct from each other.

4. **Calinski-Harabasz Index**: Also known as the "Variance Ratio Criterion", it measures the ratio of between-cluster dispersion to within-cluster dispersion. A higher score indicates better-defined clusters.

By analyzing these metrics across different values of k, you can determine the most appropriate number of clusters for your dataset.
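
As a worked example for the silhouette score, the sketch below computes it by hand for a single point, using $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in the same cluster and $b$ is the mean distance to the points of the nearest other cluster. The tiny one-dimensional dataset is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made dataset (hypothetical): two clusters on a line.
X_tiny = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels_tiny = np.array([0, 0, 0, 1, 1])

i = 0  # point under inspection (x = 0, cluster 0)
dist = np.abs(X_tiny[:, 0] - X_tiny[i, 0])

same = (labels_tiny == labels_tiny[i])
same[i] = False                      # exclude the point itself
a = dist[same].mean()                # mean intra-cluster distance
b = dist[labels_tiny == 1].mean()    # mean distance to the only other cluster
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_tiny, labels_tiny)[i]
print(round(s_manual, 3), round(s_sklearn, 3))
```

Here $a = 1.5$ and $b = 10.5$, so $s = 9 / 10.5 \approx 0.857$, which matches `silhouette_samples`. The value returned by `silhouette_score` is the mean of these per-point values.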

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Evaluate KMeans clustering using various metrics
def evaluate_kmeans_clustering(X, k_range=range(2, 15)):
    wcss        = []  # Saving Within-Cluster Sum of Squares (WCSS) for each k
    silhouettes = []  # Saving Silhouette score for each k
    db_scores   = []  # Saving Davies-Bouldin Index for each k
    ch_scores   = []  # Saving Calinski-Harabasz Index for each k

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X)

        wcss.append(kmeans.inertia_)
        silhouettes.append(silhouette_score(X, labels))
        db_scores.append(davies_bouldin_score(X, labels))
        ch_scores.append(calinski_harabasz_score(X, labels))

    plt.figure(figsize=(12, 8))
    ax = plt.subplot(221)

    ax.plot(k_range, wcss, "--", marker="o", color="RoyalBlue", linewidth=1.0, alpha=0.7)
    ax.set_xlabel("Number of Clusters (k)")
    ax.set_ylabel("WCSS")
    ax.set_title("Elbow Method")
    ax.grid(True)

    ax = plt.subplot(222)
    ax.plot(k_range, silhouettes, "--", marker="o", color="tomato", linewidth=1.0, alpha=0.7)
    ax.set_xlabel("Number of Clusters (k)")
    ax.set_ylabel("Silhouette score")
    ax.set_title("Silhouette Score")
    ax.grid(True)

    ax = plt.subplot(223)
    ax.plot(k_range, db_scores, "--", marker="o", color="forestgreen", linewidth=1.0, alpha=0.7)
    ax.set_xlabel("Number of Clusters (k)")
    ax.set_ylabel("DB Index (lower is Better)")
    ax.set_title("Davies–Bouldin Index")
    ax.grid(True)

    ax = plt.subplot(224)
    ax.plot(k_range, ch_scores, "--", marker="o", color="purple", linewidth=1.0, alpha=0.7)
    ax.set_xlabel("Number of Clusters (k)")
    ax.set_ylabel("CH Score (higher is Better)")
    ax.set_title("Calinski-Harabasz Index")
    ax.grid(True)

    plt.show()

evaluate_kmeans_clustering(X)

***
### ✅ Check your understanding

- Look at the scores and indexes you calculated. Which K do you choose and why?

- Do the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index agree with the Elbow method in determining the optimal number of clusters? Why or why not?

***

## Standardization of Features

Oops! Did you notice that the features were not standardized? :(

It is important to scale the input features before you run KMeans, or the clusters may be very stretched and k-means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally helps k-means <span style="color:cyan">[A. Géron]</span>.

**REMEMBER:** While standardizing is crucial, it should not be done blindly. Always consider the context of your data and the specific characteristics of the features. For example, if certain features represent categorical data encoded as numerical values, standardization may not be appropriate. Always assess the impact of standardization on your specific dataset and analysis.

- We need to standardize when the algorithm is sensitive to feature scale or when the features have different units.

- For tree-based models, and for binary features, we do not need to Standardize.
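
To see what standardization does numerically, here is a minimal sketch comparing `StandardScaler` with the manual formula $(x - \mu)/\sigma$ applied per column. The array `X_raw` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_raw = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 2000.0],
                  [4.0, 6000.0]])

X_std = StandardScaler().fit_transform(X_raw)

# Same operation by hand: subtract the column mean, divide by the column std.
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.round(X_std.mean(axis=0), 6))  # per-feature mean after scaling
print(np.round(X_std.std(axis=0), 6))   # per-feature std after scaling
```

After scaling, both columns have zero mean and unit standard deviation, so neither feature dominates the Euclidean distances that K-Means relies on.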

Now we standardize our data. Here, we only do it for x1 and x3, because earlier we set 
```python
X = df[["x1", "x3"]].values 
```

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Let's run our KMeans with 5 clusters after standardization. Compare the results with those you got earlier.

In [None]:
plot_kmeans_clusters(X_scaled, n_clusters=5)

In [None]:
# Evaluate KMeans clustering on scaled data
evaluate_kmeans_clustering(X_scaled, k_range=range(2, 21))

***
### ✅ Check your understanding

- Look at the scores and indexes you calculated from standardized data. Which K do you choose now? Compare it with your earlier choice.

- Test your choice. In the CSV file, there is an extra column called **"y"** that we did not load before. Look at your data file. This column contains the true cluster labels. Since this is an unsupervised learning task, we did not need the labels earlier, so we masked it. But now, you can load the **"y"** values to check how well your KMeans results match the real labels. Use this to validate your chosen K.

⚠️ **NOTE:** In unsupervised learning, we do not have access to the true labels. However, since this is a lab exercise, we can use the true labels to validate our clustering results.
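
When comparing predicted clusters with true labels, remember that cluster indices are arbitrary: cluster 0 from K-Means need not correspond to label 0 in the file. One option is a label-invariant score such as the adjusted Rand index. The sketch below uses synthetic data with known labels (an assumption, not the lab dataset) to show the idea.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with known labels (an assumption; not the lab dataset).
X_demo, y_true = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                            cluster_std=0.5, random_state=0)

y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# Cluster indices are arbitrary, so a plain accuracy comparison is misleading.
# The adjusted Rand index is invariant to relabelling (1.0 = identical partitions).
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

A value near 1 means the two partitions agree; a value near 0 means the agreement is no better than chance.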

***

## Silhouette analysis

Let's perform a silhouette analysis to evaluate the quality of clustering for our standardized dataset across a specified range of cluster numbers. The silhouette score is a metric that measures how similar each data point is to its own cluster compared to other clusters, with values ranging from -1 to 1; a higher score indicates better-defined clusters. Our code calculates the average silhouette score for each k, helping us to identify the optimal number of clusters. We highlight the average score on each subplot with a dashed red line.

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples

# Silhouette analysis to visualize the silhouette scores for different clusters
def silhouette_analysis(X, k_range=range(2, 11)):
    assert X.shape[1] == 2, "X must have exactly 2 columns."

    silhouette_avgs = [] # List to store average silhouette scores for each k

    # Subplot layout for 2 columns
    n_cols = 2
    n_plots = len(k_range)
    n_rows = int(np.ceil(n_plots / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 3 * n_rows), dpi=300)
    axes = axes.flatten()  # Flatten to 1D list for easy indexing

    for idx, k in enumerate(k_range):
        ax = axes[idx]

        kmeans = KMeans(n_clusters=k, random_state=42)
        cluster_labels = kmeans.fit_predict(X)

        silhouette_avg = silhouette_score(X, cluster_labels)
        silhouette_vals = silhouette_samples(X, cluster_labels)
        silhouette_avgs.append(silhouette_avg)

        ax.set_xlim([-0.2, 1])
        ax.set_ylim([0, len(X) + (k + 1) * 10])

        y_lower = 10
        for i in range(k):
            ith_vals = silhouette_vals[cluster_labels == i]
            ith_vals.sort()

            size_cluster_i = ith_vals.shape[0]
            y_upper = y_lower + size_cluster_i

            color = colors[i % len(colors)]
            ax.fill_betweenx(np.arange(y_lower, y_upper),
                             0, ith_vals,
                             facecolor=color, edgecolor='k', alpha=0.5)

            ax.text(0.0, y_lower + 0.5 * size_cluster_i, str(i))
            y_lower = y_upper + 10

        ax.text(-0.12, 0.5 * y_lower, "Clusters", rotation=90, va="center", weight="bold")
        ax.axvline(x=silhouette_avg, color="red", linestyle="--", linewidth=3)

        ax.set_xlabel("Silhouette score values")
        ax.set_ylabel("# of samples")
        ax.set_title(f"k = {k}")
        ax.grid(True)

    plt.show()


silhouette_analysis(X_scaled, k_range=range(2, 10))

***
### ✅ Check your understanding

- What does each subplot show? Make sure you fully understand them.

- What do the negative scores show (e.g. see subplots for k=3 or k=7)?

- For each value of k, compare the highlighted average score (red dashed line) with the one you got earlier from the `evaluate_kmeans_clustering` function (the previous code section).

***

# DBSCAN

Instead of KMeans, here we try DBSCAN (Density-Based Spatial Clustering of Applications with Noise) for clustering our scaled data, X.
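
Before running it on our data, a small sketch may help to see what the two main parameters do: `eps` is the neighborhood radius, and `min_samples` is the number of points (including the point itself) required within that radius for a point to count as a core point. Points that are not reachable from any core point are labeled `-1` (noise). The toy data below are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical toy data: two tight groups plus one isolated point.
X_toy = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(5.0, 0.1, size=(50, 2)),
    [[2.5, 2.5]],
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X_toy)
toy_labels = db.labels_

n_found = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))
print(f"clusters: {n_found}, noise points: {n_noise}")
```

Unlike K-Means, we never told DBSCAN how many clusters to look for; the number follows from the density of the data and the chosen `eps` and `min_samples`.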

In [None]:
from sklearn.cluster import DBSCAN

def plot_dbscan_clusters(X, epsilon=0.4, min_samples=10):
    # Standardize so that all features are on the same scale
    # (this has no effect if X has already been standardized)
    X_scaled = StandardScaler().fit_transform(X)

    # DBSCAN
    db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X_scaled)
    labels = db.labels_  # Get the labels (cluster assignments) for each point

    # Cluster stats
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) # Number of clusters excluding noise
    n_noise_ = list(labels).count(-1) # Number of noise points (outliers)

    print(f"Estimated number of clusters: {n_clusters_}")
    print(f"Estimated number of noise points: {n_noise_}")

    # Core point mask
    core_samples_mask = np.zeros_like(labels, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    # Plot
    plt.figure()
    unique_labels = set(labels)

    for k in unique_labels:
        col = 'black' if k == -1 else colors[k % len(colors)]
        class_mask = labels == k

        # Core points
        xy_core = X_scaled[class_mask & core_samples_mask]
        plt.scatter(xy_core[:, 0], xy_core[:, 1], marker="o",
            c=[col], s=80, edgecolors="k", linewidth=0.6, alpha=0.5,
            label=f"Cluster {k}" if k != -1 else "Noise")

        # Border points
        xy_border = X_scaled[class_mask & ~core_samples_mask]
        plt.scatter(xy_border[:, 0], xy_border[:, 1], marker="o",
            c=[col], s=30, edgecolors="k", linewidth=0.6, alpha=0.7)
        
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title(f"DBSCAN with {n_clusters_} clusters")
    plt.grid(True)
    plt.legend()
    plt.show()


plot_dbscan_clusters(X_scaled, epsilon=0.4, min_samples=10)

***
### 💡 Reflect and Run

- How many clusters do you see in the data, and why? Read about DBSCAN.

- Try modifying the DBSCAN clustering by setting the `epsilon` parameter to 0.2 and rerun the code. Observe how this change affects the number and shape of the clusters. 

- Write your own code to find the best `epsilon` value for your dataset. You can use metrics like the silhouette score to compare performance and determine which `epsilon` produces the "best" number of clusters.

***

# Gaussian Mixture Models (GMM)

Here we want to use GMM for clustering our scaled data, X. First, I define a function to plot the results; then we use `GaussianMixture` from Scikit-Learn to create the clusters.

In [None]:
def plot_gmm_clusters(X_scaled, labels, title):
    plt.figure()

    for k in set(labels):
        mask = labels == k
        plt.scatter(X_scaled[mask, 0], X_scaled[mask, 1], marker="o",
                    color=colors[k % len(colors)], s=50, edgecolors="k", linewidth=0.6, 
                    alpha=0.5, label=f"Cluster {k}")

    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title(title)
    plt.grid(True)
    plt.legend()
    plt.show()

In [None]:
from sklearn.mixture import GaussianMixture

# GMM
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)
gmm_n = gmm.n_components

plot_gmm_clusters(X_scaled, gmm_labels, f"GMM Clustering (k = {gmm_n})")

In [None]:
def plot_clusters_with_distributions(X, gmm, labels):
    plt.figure()

    # Plot clusters (use the function argument X, not the global X_scaled)
    for k in set(labels):
        mask = labels == k
        plt.scatter(X[mask, 0], X[mask, 1], marker="o",
                    color=colors[k % len(colors)], s=50, edgecolors="k", linewidth=0.5, 
                    alpha=0.5, label=f"Cluster {k}")

    # Grid over data bounds (with a small margin)
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    xs = np.linspace(x_min, x_max, 200)
    ys = np.linspace(y_min, y_max, 200)
    xx, yy = np.meshgrid(xs, ys)

    # Compute densities
    scores = gmm.score_samples(np.c_[xx.ravel(), yy.ravel()])
    scores = np.exp(scores)

    # Plot filled density contours at 8 evenly spaced levels between the min and max density
    contour_levels = np.linspace(np.min(scores), np.max(scores), 8)
    plt.contourf(xx, yy, scores.reshape(xx.shape), levels=contour_levels, cmap="gray_r", alpha=0.5)

    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.grid(True)
    plt.legend()
    plt.show()

# Plot with distributions
plot_clusters_with_distributions(X_scaled, gmm, gmm_labels)

***
### ✅ Check your understanding

- For GMM, we need to specify `n_components`. Change it to the best k value you obtained earlier, re-run the GMM, and compare the results with those obtained from KMeans and DBSCAN.

***
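
Besides reusing the k found with K-Means, one common way to choose `n_components` for a GMM is the Bayesian Information Criterion (BIC), which `GaussianMixture` exposes through its `bic` method. The sketch below is a rough illustration on synthetic data (an assumption, not the lab dataset); lower BIC indicates a better trade-off between fit and model complexity.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration, not the lab dataset).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

k_candidates = range(1, 7)
bics = []
for k in k_candidates:
    gmm_k = GaussianMixture(n_components=k, random_state=42).fit(X_demo)
    bics.append(gmm_k.bic(X_demo))  # lower BIC = better trade-off

best_k = k_candidates[int(np.argmin(bics))]
print("BIC per k:", np.round(bics, 1))
print("k with the lowest BIC:", best_k)
```

The same loop works with `gmm_k.aic(X_demo)` if you prefer the Akaike Information Criterion.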

Instead of GMM, we can use a Bayesian Gaussian Mixture Model (BGMM); the code section below does it for you.

In [None]:
from sklearn.mixture import BayesianGaussianMixture

# Bayesian GMM
bgmm = BayesianGaussianMixture(n_components=10, random_state=42)
bgmm_labels = bgmm.fit_predict(X_scaled)
bgmm_n = len(set(bgmm_labels)) - (1 if -1 in bgmm_labels else 0)

plot_gmm_clusters(X_scaled, bgmm_labels, f"Bayesian GMM Clustering (k = {bgmm_n})")

***
### ✅ Check your understanding

- How well does each model (GMM and BGMM) separate the data into distinct clusters?

- You may have noticed that `n_components` is set to 10 for `BayesianGaussianMixture`, while the actual number of resulting clusters (`len(set(bgmm_labels))`) is smaller; in our example it is 7. Why is that? What do we actually specify by setting `n_components` in `BayesianGaussianMixture`?

**NOTE:** GMM relies on a fixed number of clusters, while BGMM can automatically infer the number of effective clusters, making it more flexible, especially when the true number of clusters is unknown.

***

# Image Segmentation

Image segmentation is a technique in image processing that helps break an image into `"meaningful"` regions or segments. It plays an important role in understanding the content of an image. For example, separating different objects in a photo so we can tell whether there is a chair, a person, or any other item. This step usually comes before more advanced tasks like object recognition, feature extraction, or even image compression, which are beyond the scope of our course.

The main idea behind image segmentation is to group together pixels that share similar characteristics, such as color or brightness, so that each group corresponds to a different region or object in the image. One of the most popular and simple methods to do this is clustering, especially using the `K-Means` algorithm. Yes, you read it correctly! K-Means works very well for identifying clusters of similar pixels and assigning each pixel to the nearest cluster center. It is fast and efficient.

By using segmentation, we no longer need to analyze the whole image at once. Instead, we focus only on the important parts. This makes processing faster and more effective. 
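
The core trick is to treat every pixel as one sample with three features (R, G, B), cluster those samples, and then paint each pixel with the color of its cluster center. A minimal sketch on a tiny synthetic image (hypothetical, so that it runs without any image file) looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 4x4 RGB "image": left half red, right half blue, values in [0, 1].
toy_img = np.zeros((4, 4, 3))
toy_img[:, :2] = [1.0, 0.0, 0.0]
toy_img[:, 2:] = [0.0, 0.0, 1.0]

# Each pixel becomes one sample with three features (R, G, B).
pixels = toy_img.reshape(-1, 3)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(pixels)

# Replace every pixel by the color of its cluster center, then restore the shape.
segmented = km.cluster_centers_[km.labels_].reshape(toy_img.shape)
print(segmented.shape, np.unique(km.labels_))
```

Because the toy image contains exactly two colors, two clusters reproduce it exactly. On a real photo, each cluster center is an average color, so the segmented image looks like a poster with only `n_clusters` colors.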

In the next section, we briefly demonstrate how K-Means can successfully group similar pixels together and use this clustering to segment different components of an image.

In [None]:
import cv2
from skimage import io

# Original image filename and number of clusters
original_filename = "../figures/TF.jpg"  # You can change this to any image file
n_clusters = 3

# Load image
img = io.imread(original_filename)

# Convert from RGBA or grayscale to RGB if needed
if img.ndim == 2:
    # Grayscale to RGB
    img = np.stack([img] * 3, axis=-1)
elif img.shape[2] == 4:
    # RGBA to RGB
    img = img[:, :, :3]

# Resize image to half to speed up processing
height, width = img.shape[:2]
img = cv2.resize(img, (width // 2, height // 2))

# Get original shape and reshape image to a 2D array of pixels
# Normalize pixel values to [0, 1] for better clustering performance
original_shape = img.shape
all_pixels = img.reshape(-1, 3) / 255.0  # Normalize to [0, 1]

print(f"Original image shape: {original_shape},\t Total pixels: {all_pixels.shape[0]}")

# KMeans clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(all_pixels)

# Get cluster centers and labels
# NOTE: we use a new name here so that we do not overwrite the `colors`
# palette used by the plotting functions earlier in this notebook
cluster_colors = kmeans.cluster_centers_  # still in [0, 1]
labels = kmeans.labels_

# Build new image by replacing each pixel with the color of its cluster center
new_img = cluster_colors[labels]

# Rescale back to [0, 255] and reshape
new_img = (new_img*255).astype('uint8').reshape(original_shape)

# Plot segmented image
plt.figure()
plt.imshow(new_img)
plt.title("KMeans Segmentation")
plt.axis("off")
plt.show()

***
### ⚡ Mandatory submission

You can try a different figure and change the number of clusters. Check out images in the "figures" folder.

For example,
```python
original_filename = "../figures/gotland.jpg"
n_clusters = 5
```

- How many clusters do you see in the image, and why?

- Perform Silhouette analysis for your chosen number of clusters and report the average silhouette score.

***

Similar to our earlier code, the code below runs the KMeans on our image's pixel data for a range of cluster numbers (from 2 to 30). For each value of k, it fits a model and stores both the model itself and its WCSS value. These values are later used to identify the optimal number of clusters using the elbow method.

We take advantage of `tqdm`, which is a Python library that adds a progress bar to loops, making it easier to monitor the progress of long-running operations. This is especially helpful when training models, processing large datasets, or performing tasks that take noticeable time. Read about it and try using it in your codes that contain loops and iterations.
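
As a minimal sketch of the idea, wrapping any iterable in `tqdm(...)` is enough to get a progress bar; the loop body below is a placeholder for real work.

```python
from tqdm import tqdm

squares = []
# Wrapping any iterable in tqdm(...) prints a progress bar while looping.
for k in tqdm(range(2, 7), desc="Demo loop"):
    squares.append(k ** 2)

print(squares)
```

`tqdm.trange(n)` is a shortcut for `tqdm(range(n))`.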

In [None]:
# tqdm is a library for progress bars
import tqdm

# Run KMeans for multiple k
n_clusters = range(2, 31)

# Initialize lists to store WCSS and models
wcss   = []
models = []

# tqdm.tqdm wraps the iterable and shows a progress bar for the for loop
for k in tqdm.tqdm(n_clusters, desc="Clustering"):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(all_pixels)
    wcss.append(model.inertia_)
    models.append(model)

# Plot inertia and elbow
plt.figure()
plt.plot(n_clusters, wcss, "--", marker="o", color="RoyalBlue", linewidth=1.0, alpha=0.7)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("WCSS (or Inertia)")
plt.title("Elbow method to find optimal k")
plt.grid(True)
plt.show()


***
### ⛷️ Exercise

It is your turn now to work on a clustering task. I want you to work on a Customer Clustering task, using a dataset that you can either download from Kaggle (see how to do it below) or find as "customer_data.csv" in your "datasets" folder.

Your goal is to group similar customers together using unsupervised learning and reflect on what these clusters might represent in a business context.

* Load and inspect the dataset. Focus on understanding what each column might represent. Think about what kind of patterns or segments might exist among the customers. Handle any missing values appropriately. If necessary, normalize the data so features contribute equally.

* Use at least **three clustering algorithms**: one centroid-based (like KMeans), one density-based (like DBSCAN), and one probabilistic (like GMM). For each, decide how to set the number of clusters (or equivalent parameters) using visual or statistical techniques like the Elbow method, Silhouette scores, or BIC/AIC.

* Also try hierarchical clustering with different linkage criteria: single, complete, average, and ward. You may also try Hierarchical DBSCAN, known as **HDBSCAN**. Plot the dendrogram to find a reasonable cut for clusters. Read about hierarchical clustering in the documentation of `sklearn.cluster.AgglomerativeClustering`.

* Visualize the resulting clusters and discuss how they differ. Are there consistent patterns across models? Which model gives the most meaningful grouping? Can you describe the characteristics of each cluster? You are not expected to be precise — just try to interpret what the model is finding.

* Which clustering method did you find most useful, and why? What could be the potential use of these clusters for a company (e.g. marketing and targeting)?

**Important Notes:**

* You do not need to write perfect code. Focus on **reasoning and analysis**.
* Avoid over-engineering. Use **visual intuition** where possible.
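
As a possible starting point for the hierarchical part, here is a rough sketch using SciPy's `linkage`, `fcluster`, and `dendrogram` on hypothetical toy data. It is one option among several (Scikit-Learn's `AgglomerativeClustering` is another), and you will need to adapt it to the customer data yourself.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated groups of 20 points each.
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                   rng.normal(4.0, 0.3, size=(20, 2))])

# Build the merge tree; try "single", "complete", "average", or "ward".
Z = linkage(X_toy, method="ward")

# Cut the tree so that at most two clusters remain (labels start at 1).
hier_labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(hier_labels, return_counts=True))

# dendrogram(Z) draws the tree with matplotlib; no_plot=True only returns its layout.
tree = dendrogram(Z, no_plot=True)
```

In your own notebook, call `dendrogram(Z)` without `no_plot=True` to draw the tree, and look for a height where a horizontal cut crosses only a few long vertical branches.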

# Install Kaggle API

To install the Kaggle API, run the following commands in your terminal:

```bash
# Upgrade pip
pip install --upgrade pip

# Install Kaggle API
pip install kaggle
```

### Get Your Kaggle API Token

1. Go to your Kaggle Account Settings: https://www.kaggle.com/settings
2. Scroll down to the **API** section and click on **Create New API Token**. This will download a file named `kaggle.json`.

### Save the API Token

#### For Linux or macOS:
1. Move the `kaggle.json` file to the `.kaggle` directory:
   ```bash
   mkdir -p ~/.kaggle
   mv /path/to/downloaded/kaggle.json ~/.kaggle/
   ```
2. Set the permissions of the `kaggle.json` file:
   ```bash
   chmod 600 ~/.kaggle/kaggle.json
   ```

#### For Windows (I am not a Windows User, so this might not work):
1. Move the `kaggle.json` file to the `.kaggle` directory:
   - Create a directory named `.kaggle` in your user folder (e.g., `C:\Users\YourUsername\.kaggle`).
   - Move the `kaggle.json` file to this directory. You can do this using File Explorer or by running the following command in the Command Prompt:
     ```cmd
     mkdir C:\Users\YourUsername\.kaggle
     move C:\path\to\downloaded\kaggle.json C:\Users\YourUsername\.kaggle\
     ```
2. Ensure the permissions are set correctly:
   - In Windows, you typically don’t need to set permissions, but you can ensure that the file is not accessible by others by right-clicking the file, selecting **Properties**, going to the **Security** tab, and changing the permissions as necessary.


### Get/Download a Kaggle dataset
```python
    import os
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()

    # Example: to download CERN electron collision dataset from 
    #      https://www.kaggle.com/datasets/fedesoriano/cern-electron-collision-data
    #  you need to do the following.
    #  Only need part of the URL for the code below:
    api.dataset_download_files("fedesoriano/cern-electron-collision-data", 
        path='/local_path_to_store_data/', unzip=True)
```


***
END
***