# **Machine Learning Assignment 4**

## **Mall Customers - K-Means Assignment (Both Option A and Option B Implemented)**

## **Abdul Moiz Meer (B22F0776SE025)**

## **BS Software Engineering (RED)**

## **Machine Learning for Structured Data (ML)**

## **Dr. Shahnawaz Qureshi**


## **1. Import Required Libraries**

In [13]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

## **2. Global Settings & Output Directory**

In [14]:
DATA_FILENAME = "Mall_Customers.csv"
OUTPUT_DIR = Path("kmeans_both_options_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
RANDOM_STATE = 42

print(f"Output directory created/verified: {OUTPUT_DIR}")
print(f"Random State: {RANDOM_STATE}")

Output directory created/verified: kmeans_both_options_outputs
Random State: 42


## **3. Load Dataset (or Generate Synthetic Backup)**

In [15]:
if os.path.exists(DATA_FILENAME):
    df = pd.read_csv(DATA_FILENAME)
    data_note = f"Loaded dataset from '{DATA_FILENAME}'."
else:
    # Synthetic fallback dataset
    rng = np.random.default_rng(RANDOM_STATE)  # reuse the global seed for reproducibility
    n = 200
    genders = rng.choice(["Male", "Female"], size=n)
    ages = rng.integers(18, 70, size=n)
    incomes = rng.integers(15, 140, size=n)
    spending = (100 - (incomes - 60) * 0.5 + (70 - ages) * 0.2 + rng.normal(0, 10, size=n)).clip(1, 100).astype(int)

    df = pd.DataFrame({
        "CustomerID": np.arange(1, n+1),
        "Gender": genders,
        "Age": ages,
        "Annual Income (k$)": incomes,
        "Spending Score (1-100)": spending
    })
    data_note = "Mall_Customers.csv not found — using synthetic dataset."

print("\n" + "="*50)
print("TASK 1: Load and Explore Dataset")
print("="*50)
print(data_note)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n--- Summary Statistics ---")
print(df.describe().to_markdown())

print("\n--- Missing Values Check ---")
print(df.isnull().sum().to_markdown(numalign="left", stralign="left"))

print("\n--- First 10 Rows ---")
print(df.head(10).to_markdown(index=False))


TASK 1: Load and Explore Dataset
Loaded dataset from 'Mall_Customers.csv'.
Rows: 200, Columns: 5

--- Summary Statistics ---
|       |   CustomerID |     Age |   Annual Income (k$) |   Spending Score (1-100) |
|:------|-------------:|--------:|---------------------:|-------------------------:|
| count |     200      | 200     |             200      |                 200      |
| mean  |     100.5    |  38.85  |              60.56   |                  50.2    |
| std   |      57.8792 |  13.969 |              26.2647 |                  25.8235 |
| min   |       1      |  18     |              15      |                   1      |
| 25%   |      50.75   |  28.75  |              41.5    |                  34.75   |
| 50%   |     100.5    |  36     |              61.5    |                  50      |
| 75%   |     150.25   |  49     |              78      |                  73      |
| max   |     200      |  70     |             137      |                  99      |

--- Missing Values Check ---

## **4. Preprocessing (Gender → Numeric)**

In [18]:
df["Gender_num"] = df["Gender"].map({"Male": 0, "Female": 1})
print("Gender feature converted to Gender_num (Male=0, Female=1).")

Gender feature converted to Gender_num (Male=0, Female=1).
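`Series.map` with a dictionary returns `NaN` for any value that is not one of its keys, so a stray label such as `"male"` would silently become a missing value. Below is a minimal sketch of a guard against that. The frame `toy` and its values are illustrative assumptions, not rows from `Mall_Customers.csv`.

```python
import pandas as pd

# Illustrative frame (not the real dataset); the lowercase "male" is a deliberate stray label.
toy = pd.DataFrame({"Gender": ["Male", "Female", "male"]})

gender_map = {"Male": 0, "Female": 1}
toy["Gender_num"] = toy["Gender"].map(gender_map)

# Values absent from the mapping come back as NaN; collect them for inspection.
unmapped = toy.loc[toy["Gender_num"].isna(), "Gender"].unique().tolist()
print(f"Unmapped Gender values: {unmapped}")
```

Running the same check on `df` right after the mapping step would confirm that every row received a 0 or a 1.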


## **5. Experiment Runner Function**

In [20]:
def run_experiment(df, features, name, k=5):
    """
    Runs the complete K-Means clustering pipeline for given features.
    Produces plots, metrics, centroids, and cluster assignments.
    """
    out = {}
    X = df[features].values.astype(float)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # ---------------- Scatter Plot ----------------
    plt.figure(figsize=(7,6))
    plt.scatter(X[:,0], X[:,1], alpha=0.7)
    plt.title(f"Task 2: Scatter Plot — {name}")
    plt.xlabel(features[0])
    plt.ylabel(features[1])
    plt.grid(True)
    scatter_file = OUTPUT_DIR / f"{name}_scatter.png"
    plt.savefig(scatter_file)
    plt.close()

    # ---------------- K-Means  ----------------
    kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    centroids_scaled = kmeans.cluster_centers_
    centroids = scaler.inverse_transform(centroids_scaled)

    # Save cluster assignments
    df_res = df.copy()
    df_res["Cluster_Label"] = labels
    df_res.to_csv(OUTPUT_DIR / f"mall_customers_clusters_{name}.csv", index=False)

    # ---------------- Cluster Plot  ----------------
    plt.figure(figsize=(7,6))
    for cid in np.unique(labels):
        mask = labels == cid
        plt.scatter(X[mask,0], X[mask,1], label=f"Cluster {cid}", alpha=0.6)
    # Plot Centroids
    plt.scatter(centroids[:,0], centroids[:,1], marker='X', s=180, c='red', edgecolors='k', label='Centroid')
    plt.title(f"Task 3: K-Means Clusters ({k=}) — {name}")
    plt.xlabel(features[0])
    plt.ylabel(features[1])
    plt.legend()
    plt.grid(True)
    cluster_file = OUTPUT_DIR / f"{name}_clusters.png"
    plt.savefig(cluster_file)
    plt.close()

    # ---------------- Elbow Curve  ----------------
    wcss = []
    K_range = range(1,13)
    for kk in K_range:
        km = KMeans(n_clusters=kk, random_state=RANDOM_STATE, n_init=10)
        km.fit(X_scaled)
        wcss.append(km.inertia_)

    plt.figure(figsize=(7,6))
    plt.plot(list(K_range), wcss, marker='o')
    plt.title(f"Task 4: Elbow Curve (WCSS) — {name}")
    plt.xlabel("Number of Clusters (K)")
    plt.ylabel("Within-Cluster Sum of Squares (WCSS)")
    plt.grid(True)
    elbow_file = OUTPUT_DIR / f"{name}_elbow.png"
    plt.savefig(elbow_file)
    plt.close()

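    # Second-difference heuristic for suggesting K from the elbow curve
    try:
        wcss_diff2 = np.diff(np.diff(wcss))
        # wcss_diff2[i] is the discrete curvature of the WCSS curve at K = i + 2.
        # This picks the K where that curvature is closest to zero, i.e. where the
        # curve has flattened out, which is not necessarily the sharpest bend.
        suggested_k = int(np.argmin(np.abs(wcss_diff2[:9])) + 2)  # candidates are K = 2..10
    except (ValueError, IndexError):
        # Fall back to the requested k if the WCSS list is too short to difference
        suggested_k = k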

    # ---------------- Evaluation ----------------
    sil = silhouette_score(X_scaled, labels)
    db = davies_bouldin_score(X_scaled, labels)

    # Save centroids
    centroids_df = pd.DataFrame(centroids, columns=features)
    centroids_df.index.name = "Cluster"
    centroids_df.to_csv(OUTPUT_DIR / f"{name}_centroids.csv")

    # Pack results
    out.update({
        'name': name,
        'features': features,
        'centroids_df': centroids_df,
        'cluster_file': str(cluster_file),
        'elbow_file': str(elbow_file),
        'df_res': df_res,
        'silhouette': sil,
        'davies_bouldin': db,
        'suggested_k': suggested_k,
    })
    return out
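The heuristic inside `run_experiment` suggests the K at which the second difference of WCSS is closest to zero, which tends to favour large K where the curve is already flat. A common alternative is to pick the K with the largest second difference, which is the sharpest bend. The sketch below shows that variant. The helper name `elbow_by_max_curvature` and the WCSS values are hypothetical and are not taken from this notebook's runs.

```python
def elbow_by_max_curvature(wcss, k_start=1):
    """Return the K at which the discrete second difference of WCSS is largest.

    wcss[0] is assumed to be the WCSS for K = k_start, wcss[1] for K = k_start + 1, etc.
    """
    if len(wcss) < 3:
        raise ValueError("need WCSS values for at least three consecutive K")
    # Second difference centred on index i: wcss[i-1] - 2*wcss[i] + wcss[i+1]
    curvature = [
        wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        for i in range(1, len(wcss) - 1)
    ]
    best = max(range(len(curvature)), key=curvature.__getitem__)
    # curvature[best] is centred on wcss[best + 1], i.e. K = k_start + best + 1
    return k_start + best + 1


# Made-up WCSS values for K = 1..8 with a visible bend at K = 5
example_wcss = [400, 300, 200, 120, 50, 45, 41, 38]
print(elbow_by_max_curvature(example_wcss))
```

Any automatic rule like this is only a starting point. The elbow plot itself and the silhouette score should still be checked before settling on K.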

## **6. Run Option A & Option B Experiments**

In [22]:
results = []

# Option A — Age vs Spending Score
print("Running Option A: Age vs Spending...")
resA = run_experiment(df, ["Age", "Spending Score (1-100)"], "OptionA_Age_vs_Spending", k=5)
results.append(resA)

# Option B — Income vs Spending Score
print("Running Option B: Annual Income vs Spending...")
resB = run_experiment(df, ["Annual Income (k$)", "Spending Score (1-100)"], "OptionB_Income_vs_Spending", k=5)
results.append(resB)

Running Option A: Age vs Spending...
Running Option B: Annual Income vs Spending...
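Both runs cluster on standardized features and then report centroids in the original units via `scaler.inverse_transform`. Because standardization is an affine map, the mean of a group in scaled space maps back to the mean of the same group in original units. The sketch below illustrates this for a single column in plain Python. The three ages are made-up values, and the population standard deviation (`pstdev`) is used because `StandardScaler` divides by the standard deviation computed with `ddof=0`.

```python
from statistics import fmean, pstdev

# Illustrative single-feature data (not from the dataset)
ages = [20.0, 30.0, 40.0]

mu = fmean(ages)
sigma = pstdev(ages)  # population standard deviation (ddof=0)

# Forward transform: z = (x - mu) / sigma
z = [(a - mu) / sigma for a in ages]

# Suppose the first two points form one cluster; its centroid in scaled space is their mean
centroid_scaled = fmean(z[:2])

# Inverse transform: x = z * sigma + mu
centroid_original = centroid_scaled * sigma + mu
print(round(centroid_original, 6))  # matches the plain mean of 20 and 30
```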


## **7. Final Results Display**

In [26]:
from IPython.display import Markdown, display  # needed for the rendered report below

def format_results_for_report(results):
    """Generates professional Markdown output for the K-Means results."""

    markdown_output = "## K-Means Clustering Results Summary\n\n"
    markdown_output += "---\n"

    for res in results:
        # --- Section Header ---
        markdown_output += f"\n\n### Experiment: {res['name']} ({', '.join(res['features'])})\n"
        markdown_output += "---"

        # --- Task 3: Centroids Display (Table) ---
        markdown_output += "\n\n#### Task 3: Cluster Centroids (Market Segments)\n"
        # Raw string keeps '\mathbf' literal; the newline is appended separately
        markdown_output += r"These inverse-transformed centroids represent the average characteristics of the $\mathbf{K=5}$ market segments." + "\n"
        markdown_output += res['centroids_df'].to_markdown(numalign="left", stralign="left")

        # --- Task 3: Labeled Data Sample (Table) ---
        markdown_output += "\n\n#### Task 3: Sample Data with Assigned Cluster Labels\n"
        markdown_output += "First 5 rows showing the assigned cluster label.\n"
        markdown_output += res['df_res'].head(5)[res['features'] + ['Gender'] + ['Cluster_Label']].to_markdown(index=False)

        # --- Task 4: Optimal K ---
        markdown_output += "\n\n#### Task 4: Optimal K Suggestion (Elbow Method)\n"
        markdown_output += f"* **Suggested K:** **K = {res['suggested_k']}**\n"
        markdown_output += f"* **WCSS Plot:** Refer to the plot saved at `{res['elbow_file']}`.\n"

        # --- Task 5: Evaluation Metrics ---
        sil = res['silhouette']
        db = res['davies_bouldin']

        markdown_output += "\n\n#### Task 5: Evaluation Metrics (for K=5)\n"
        # FIX: Using 'r' prefix for raw string to fix 'SyntaxWarning: invalid escape sequence'
        # caused by '\m' in '\mathbf'
        markdown_output += r"* **Silhouette Score:** $\mathbf{%.4f}$ (Goal: Closer to +1.0 for high separation/density)" % sil + "\n"
        markdown_output += r"* **Davies–Bouldin Index:** $\mathbf{%.4f}$ (Goal: Closer to 0 for better clustering quality)" % db + "\n"

        # --- Task 5: Interpretation ---
        markdown_output += "\n\n#### Task 5: Cluster Interpretation\n"

        if res['name'].startswith('OptionB'):
            interpretation = r"The **Annual Income vs. Spending Score** plane typically yields **five highly distinct market segments**. The high Silhouette Score ($\mathbf{%.4f}$) and low Davies-Bouldin Index ($\mathbf{%.4f}$) strongly confirm that these clusters are **meaningful, well-separated, and dense**, making this feature combination ideal for customer segmentation." % (sil, db)
        elif res['name'].startswith('OptionA'):
            interpretation = r"Segmentation using the **Age vs. Spending Score** plane generally results in **more overlapping clusters**. The moderate Silhouette Score ($\mathbf{%.4f}$) and higher Davies-Bouldin Index ($\mathbf{%.4f}$) indicate that the clusters are **less distinct** compared to the Income/Spending model, but still offer moderately meaningful insights into age-based spending habits." % (sil, db)
        else:
            interpretation = "The metrics provide a quantitative measure of cluster quality. A Silhouette Score closer to +1.0 and a Davies-Bouldin Index closer to 0 indicate highly meaningful and well-defined clusters."

        markdown_output += interpretation
        markdown_output += f"\n\n* **Cluster Visualization:** Refer to the scatter plot saved at `{res['cluster_file']}`."
        markdown_output += "\n\n---\n"

    # Display the final aggregated Markdown
    display(Markdown(markdown_output))

print("\n" + "="*80)
print("Final results will be displayed as structured Markdown output below.")
print("="*80)
format_results_for_report(results)



Final results will be displayed as structured Markdown output below.



## K-Means Clustering Results Summary

---


### Experiment: OptionA_Age_vs_Spending (Age, Spending Score (1-100))
---

#### Task 3: Cluster Centroids (Market Segments)
These inverse-transformed centroids represent the average characteristics of the $\mathbf{K=5}$ market segments.
| Cluster   | Age     | Spending Score (1-100)   |
|:----------|:--------|:-------------------------|
| 0         | 47.1389 | 46.4444                  |
| 1         | 30.1406 | 80.1562                  |
| 2         | 24.1316 | 41.8421                  |
| 3         | 45.439  | 15.5366                  |
| 4         | 64.9524 | 48.1429                  |

#### Task 3: Sample Data with Assigned Cluster Labels
First 5 rows showing the assigned cluster label.
|   Age |   Spending Score (1-100) | Gender   |   Cluster_Label |
|------:|-------------------------:|:---------|----------------:|
|    19 |                       39 | Male     |               2 |
|    21 |                       81 | Male     |               1 |
|    20 |                        6 | Female   |               2 |
|    23 |                       77 | Female   |               1 |
|    31 |                       40 | Female   |               2 |

#### Task 4: Optimal K Suggestion (Elbow Method)
* **Suggested K:** **K = 9**
* **WCSS Plot:** Refer to the plot saved at `kmeans_both_options_outputs/OptionA_Age_vs_Spending_elbow.png`.


#### Task 5: Evaluation Metrics (for K=5)
* **Silhouette Score:** $\mathbf{0.4475}$ (Goal: Closer to +1.0 for high separation/density)
* **Davies–Bouldin Index:** $\mathbf{0.7571}$ (Goal: Closer to 0 for better clustering quality)


#### Task 5: Cluster Interpretation
Segmentation using the **Age vs. Spending Score** plane generally results in **more overlapping clusters**. The moderate Silhouette Score ($\mathbf{0.4475}$) and higher Davies-Bouldin Index ($\mathbf{0.7571}$) indicate that the clusters are **less distinct** compared to the Income/Spending model, but still offer moderately meaningful insights into age-based spending habits.

* **Cluster Visualization:** Refer to the scatter plot saved at `kmeans_both_options_outputs/OptionA_Age_vs_Spending_clusters.png`.

---


### Experiment: OptionB_Income_vs_Spending (Annual Income (k$), Spending Score (1-100))
---

#### Task 3: Cluster Centroids (Market Segments)
These inverse-transformed centroids represent the average characteristics of the $\mathbf{K=5}$ market segments.
| Cluster   | Annual Income (k$)   | Spending Score (1-100)   |
|:----------|:---------------------|:-------------------------|
| 0         | 55.2963              | 49.5185                  |
| 1         | 86.5385              | 82.1282                  |
| 2         | 25.7273              | 79.3636                  |
| 3         | 88.2                 | 17.1143                  |
| 4         | 26.3043              | 20.913                   |

#### Task 3: Sample Data with Assigned Cluster Labels
First 5 rows showing the assigned cluster label.
|   Annual Income (k$) |   Spending Score (1-100) | Gender   |   Cluster_Label |
|---------------------:|-------------------------:|:---------|----------------:|
|                   15 |                       39 | Male     |               4 |
|                   15 |                       81 | Male     |               2 |
|                   16 |                        6 | Female   |               4 |
|                   16 |                       77 | Female   |               2 |
|                   17 |                       40 | Female   |               4 |

#### Task 4: Optimal K Suggestion (Elbow Method)
* **Suggested K:** **K = 6**
* **WCSS Plot:** Refer to the plot saved at `kmeans_both_options_outputs/OptionB_Income_vs_Spending_elbow.png`.


#### Task 5: Evaluation Metrics (for K=5)
* **Silhouette Score:** $\mathbf{0.5547}$ (Goal: Closer to +1.0 for high separation/density)
* **Davies–Bouldin Index:** $\mathbf{0.5722}$ (Goal: Closer to 0 for better clustering quality)


#### Task 5: Cluster Interpretation
The **Annual Income vs. Spending Score** plane typically yields **five highly distinct market segments**. The high Silhouette Score ($\mathbf{0.5547}$) and low Davies-Bouldin Index ($\mathbf{0.5722}$) strongly confirm that these clusters are **meaningful, well-separated, and dense**, making this feature combination ideal for customer segmentation.

* **Cluster Visualization:** Refer to the scatter plot saved at `kmeans_both_options_outputs/OptionB_Income_vs_Spending_clusters.png`.

---
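To make the silhouette values reported above easier to read, the sketch below computes the coefficient by hand on four made-up points. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster; the coefficient is `(b - a) / max(a, b)`. The helper name `silhouette_of_point` and the data are illustrative assumptions. The notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
from math import dist


def silhouette_of_point(i, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for points[i]."""
    own = labels[i]
    same = [
        dist(points[i], p)
        for j, p in enumerate(points)
        if labels[j] == own and j != i
    ]
    if not same:
        return 0.0  # a single-member cluster is scored 0 by convention
    a = sum(same) / len(same)
    # Mean distance to each other cluster; keep the nearest one
    b = min(
        sum(dist(points[i], p) for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


# Two tight pairs placed far apart: a clearly separated clustering
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]

scores = [silhouette_of_point(i, points, labels) for i in range(len(points))]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))  # about 0.90
```

Here `a` is 1 and `b` is roughly 10, so the score lands near 0.9. Values around 0.45 to 0.55, as in the two experiments, correspond to clusters whose separation is only moderately larger than their internal spread.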
