# Level 2 
## Task 3: Clustering with K-Means

Welcome to **Level 2 – Task 3: Clustering**.  
In this notebook, we'll explore how to:

- Load cleaned datasets interactively
- Preprocess numerical features for clustering
- Choose a suitable number of clusters using the Elbow Method and the Silhouette Score
- Apply **K-Means Clustering**
- Visualize clusters via PCA
- Interpret cluster summaries

In [13]:
# -------------------- Imports --------------------
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output


In [14]:
# -------------------- Paths & Defaults --------------------
root_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
cleaned_dir = os.path.join(root_dir, "data", "cleaned")

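default_file = "iris_cleaned.csv"  # Can adjust to another default if needed
available_csvs = sorted(f for f in os.listdir(cleaned_dir) if f.endswith(".csv"))

# Fall back to the first available file if the default is missing;
# Dropdown raises an error when `value` is not one of `options`.
if default_file not in available_csvs:
    default_file = available_csvs[0] if available_csvs else None

# Dataset selector dropdown
dataset_selector = widgets.Dropdown(
    options=available_csvs,
    value=default_file,
    description="Dataset:"
)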

# Display dropdown
display(Markdown("### 📁 Select a Cleaned Dataset"))
display(dataset_selector)



In [None]:
output_area = widgets.Output()

def load_dataset(file_name):
    path = os.path.join(cleaned_dir, file_name)
    return pd.read_csv(path)

def run_kmeans(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

    if len(numeric_cols) < 2:
        display(Markdown("❌ This dataset does not have enough numeric columns for clustering."))
        return

    X = df[numeric_cols]
    scaler = StandardScaler()
    scaled = scaler.fit_transform(X)

    # -------------------- Elbow Plot --------------------
    distortions = []
    K_range = range(1, 11)
    for k in K_range:
        # Set n_init explicitly: its default differs across scikit-learn versions
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
        kmeans.fit(scaled)
        distortions.append(kmeans.inertia_)

    plt.figure(figsize=(5, 4))
    plt.plot(K_range, distortions, marker='o')
    # No emoji in plot titles: Matplotlib's default font lacks those glyphs
    plt.title("Elbow Method – Optimal K")
    plt.xlabel("Number of Clusters (k)")
    plt.ylabel("Inertia")
    plt.grid(True)
    plt.show()

    # -------------------- Choose k --------------------
    k_slider = widgets.IntSlider(value=3, min=2, max=10, step=1, description="Clusters (k):")
    # Dedicated output area so each slider change replaces the previous results
    results_area = widgets.Output()
    display(k_slider, results_area)

    def on_k_change(change):
        k = change['new']
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
        labels = kmeans.fit_predict(scaled)
        sil = silhouette_score(scaled, labels)

        # Attach labels
        df['Cluster'] = labels

        # PCA for 2D plot
        pca = PCA(n_components=2)
        reduced = pca.fit_transform(scaled)
        reduced_df = pd.DataFrame(reduced, columns=["PC1", "PC2"])
        reduced_df["Cluster"] = labels

        # Render inside the output widget so callback output is captured and refreshed
        results_area.clear_output(wait=True)
        with results_area:
            display(Markdown(f"### ✅ K-Means Results (k = {k})"))
            display(Markdown(f"**Silhouette Score:** {sil:.4f}"))

            # Cluster visualization (plain-text title: the default Matplotlib font has no emoji glyphs)
            plt.figure(figsize=(6, 4))
            sns.scatterplot(data=reduced_df, x="PC1", y="PC2", hue="Cluster", palette="Set2", s=60)
            plt.title("Clusters Visualized (PCA Projection)")
            plt.show()

            # Cluster summary
            display(Markdown("### 📌 Cluster Summary"))
            try:
                display(df.groupby("Cluster")[numeric_cols].mean())
            except Exception as e:
                display(Markdown(f"⚠️ Could not compute cluster summary: {e}"))

    k_slider.observe(on_k_change, names="value")
    on_k_change({'new': k_slider.value})  # Trigger default run

def preview_dataset(change=None):
    output_area.clear_output()
    with output_area:
        df = load_dataset(dataset_selector.value)
        display(Markdown("### Dataset Preview"))
        display(df.head())
        run_kmeans(df)

dataset_selector.observe(preview_dataset, names="value")
preview_dataset()  # Trigger initial load
display(output_area)



## Summary & Conclusion

In this task, we explored **K-Means Clustering**, a popular unsupervised learning technique used to group similar data points into clusters based on their features.

- We started by loading and preprocessing datasets, focusing on numeric features to ensure meaningful clustering.
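- The **Elbow Method** guided the choice of the number of clusters (`k`): we plotted the inertia (within-cluster sum of squares) for `k` from 1 to 10 and looked for the bend where adding clusters stops reducing inertia sharply.
- We applied K-Means to the standardized data and evaluated cluster quality with the **Silhouette Score**, which ranges from -1 to 1; higher values indicate tighter, better-separated clusters.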
- Finally, we visualized the resulting clusters in 2D space using **Principal Component Analysis (PCA)** for dimensionality reduction.
- The cluster summary statistics provided insights into the characteristics of each group.
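
The steps above can also be run without the widgets. The following is a minimal sketch, not the notebook's exact code: it assumes scikit-learn's bundled Iris data as a stand-in for `iris_cleaned.csv`, and the names `inertias`, `silhouettes`, and `explained` are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for iris_cleaned.csv.
X = load_iris().data
scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = model.inertia_
    if k >= 2:  # the silhouette score is undefined for a single cluster
        silhouettes[k] = silhouette_score(scaled, model.labels_)

# Share of the total variance retained by the 2D PCA projection used for plotting.
explained = PCA(n_components=2).fit(scaled).explained_variance_ratio_

for k in range(2, 11):
    print(f"k={k:2d}  inertia={inertias[k]:8.2f}  silhouette={silhouettes[k]:.4f}")
print(f"Variance explained by PC1 + PC2: {explained.sum():.1%}")
```

Reading the inertia and silhouette columns side by side is one way to cross-check the `k` suggested by the elbow plot; the printed variance share indicates how much of the data's structure the 2D scatter plot can show.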

### Key Takeaways:

- Proper data scaling is essential for effective clustering since K-Means relies on distance calculations.
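- The Elbow Method and Silhouette Score are heuristics that guide the choice of `k`; they can disagree, so it helps to check both.
- Dimensionality reduction with PCA enables intuitive visualization of high-dimensional clusters, although a 2D projection shows only part of the total variance.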
- Clustering can reveal hidden patterns and groupings in data without pre-labeled classes.
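
The first takeaway, on scaling, can be illustrated with a small experiment. This is a sketch on hypothetical synthetic data (not one of the notebook's datasets): one feature separates two groups but spans about one unit, while a pure-noise feature spans thousands. The helper `cluster_agreement` is made up for this example; it compares K-Means labels with the known groups using the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_group = np.repeat([0, 1], 100)

# Hypothetical data: a small-scale informative feature and a large-scale noise feature.
informative = true_group + rng.normal(0.0, 0.1, size=200)
noise = rng.normal(0.0, 1000.0, size=200)
X = np.column_stack([informative, noise])

def cluster_agreement(features):
    """Adjusted Rand index between K-Means labels (k=2) and the known groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(features)
    return adjusted_rand_score(true_group, labels)

ari_raw = cluster_agreement(X)
ari_scaled = cluster_agreement(StandardScaler().fit_transform(X))

print(f"Agreement with true groups, raw features:    {ari_raw:.3f}")
print(f"Agreement with true groups, scaled features: {ari_scaled:.3f}")
```

On the raw features, Euclidean distances are dominated by the large-scale noise column, so the clusters should track the noise rather than the real groups; standardizing puts both columns on a comparable footing.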

This task provided a hands-on understanding of clustering workflows and important considerations when applying K-Means to real-world datasets.
