# Clustering — Model Building, Interpretation & Evaluation
Following CRISP-DM Phases 4 (Modelling) and 5 (Evaluation)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples, davies_bouldin_score, adjusted_rand_score
from sklearn.decomposition import PCA
from scipy.stats import chi2_contingency
import scipy.cluster.hierarchy as sch
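import os
import warnings
warnings.filterwarnings('ignore')

# Figures are saved under figures/ later on; create the folder if it is missing
os.makedirs('figures', exist_ok=True)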

try:
    from kmodes.kmodes import KModes
    KMODES_AVAILABLE = True
except ImportError:
    KMODES_AVAILABLE = False

# Load pre-constructed clustering features
df = pd.read_csv('data/clustering_features_v2.csv')

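# Load target variable from raw dataset
df_raw = pd.read_csv('CDC Diabetes Dataset (4).csv')
target = df_raw['Diabetes_012'].values

# The target is attached to df by position later on, so both files must hold the same rows in the same order
assert len(df) == len(df_raw), "clustering features and raw dataset differ in length"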

features = ['lifestyle_risk_score', 'limited_access_to_care', 'age_group',
            'income_bracket', 'education_bracket', 'cardio_metabolic_risk']

print(f"Dataset: {df.shape[0]:,} records, {len(features)} features")
print(f"Target: {len(target):,} records")
print(f"\nFeature ranges:")
for col in features:
    print(f"  {col}: {df[col].min()}–{df[col].max()}")

## 4.1 Select Modelling Technique

### Modelling Technique

Four clustering algorithms are compared:

| Algorithm | Distance Metric | Best For |
|-----------|-----------------|----------|
| K-Means | Euclidean | Continuous data, interpretable centroids |
| Hierarchical (Ward) | Euclidean | Visualising cluster hierarchy |
| DBSCAN | Euclidean | Density-based clusters, automatic k selection |
| K-Modes | Hamming | Categorical/ordinal data |

**Primary choice: K-Means** — provides interpretable centroids describing the "average patient" in each segment, and a fixed number of clusters suitable for public health campaigns.

### Modelling Assumptions

| Assumption | K-Means Requires | Our Data | Mitigation |
|------------|------------------|----------|------------|
| Continuous features | Yes | Ordinal (0–4, 0–3, etc.) | Compare with K-Modes to validate |
| Spherical clusters | Yes | Unknown | Check silhouette scores |
| Similar scales | Yes | Different ranges | StandardScaler applied |
| No missing values | Yes | None present | N/A |

**Known limitation:** Features are ordinal, not truly continuous. K-Means treats the step from 1→2 as equal to the step from 3→4, which may not reflect true dissimilarity. We validate by comparing with K-Modes (designed for categorical data).
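To make the limitation concrete, the sketch below compares Euclidean and Hamming distance on three made-up ordinal profiles (the vectors are illustrative and not rows from the dataset). Euclidean distance grows with the size of the gap between levels, while Hamming distance only counts how many features differ.

```python
import numpy as np

# Hypothetical ordinal profiles (illustrative values, not taken from the dataset)
a = np.array([0, 1, 1])
b = np.array([4, 1, 1])  # differs from a by 4 levels on the first feature
c = np.array([1, 1, 1])  # differs from a by 1 level on the first feature

def euclidean(u, v):
    """Euclidean distance, the metric K-Means minimises (in squared form)."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

def hamming(u, v):
    """Number of features whose values differ."""
    return int(np.sum(u != v))

print("Euclidean a-b:", euclidean(a, b))  # 4.0
print("Euclidean a-c:", euclidean(a, c))  # 1.0
print("Hamming   a-b:", hamming(a, b))    # 1
print("Hamming   a-c:", hamming(a, c))    # 1
```

Under Euclidean distance `b` is four times further from `a` than `c` is; under Hamming distance both are equally far. Neither view is obviously right for ordinal data, which is why the two algorithms are compared rather than one being assumed correct.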

### Clustering Feature Set (6 features)

| Feature | Type | Range | Domain |
|---------|------|-------|--------|
| lifestyle_risk_score | Ordinal | 0–4 | Behavioural |
| limited_access_to_care | Binary | 0–1 | Access |
| age_group | Ordinal | 0–3 | Demographic |
| income_bracket | Ordinal | 0–2 | Socioeconomic |
| education_bracket | Ordinal | 0–2 | Socioeconomic |
| cardio_metabolic_risk | Ordinal | 0–2 | Clinical |

This feature set spans **five domains** (behavioural, access, demographic, socioeconomic, clinical), providing a multi-dimensional patient segmentation.

## 4.2 Generate Test Design

Since clustering is unsupervised (no ground truth labels), we use:

**Internal validation metrics:**
- **Silhouette Score** (−1 to 1): Measures cluster cohesion vs separation. Higher = better.
- **Davies-Bouldin Index** (≥ 0): For each cluster, takes the worst-case ratio of within-cluster scatter to between-centroid distance, then averages over clusters. Lower = better.
- **Inertia**: Sum of squared distances to centroids (elbow method).

**External validation:**
- Compare cluster assignments across algorithms (Adjusted Rand Index)
- Post-hoc validation against diabetes outcomes (not used for model selection — see Section 4.4.3)
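As a small illustration of why the Adjusted Rand Index suits cross-algorithm comparison, the sketch below uses toy label vectors (not notebook output). ARI compares partitions rather than label names, so two algorithms that number the same clusters differently still score 1.

```python
from sklearn.metrics import adjusted_rand_score

# Two labelings of the same six points: identical grouping, different label names
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 2, 2, 0, 0]

# A labeling that separates every pair that labels_a groups together
labels_c = [0, 1, 0, 1, 0, 1]

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(f"Same partition, renamed labels: {ari_same:.3f}")
print(f"Different partition:            {ari_diff:.3f}")
```

Values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would give.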

**Sampling strategy:**
- 10% sample (≈25,368 records) for k-selection and algorithm comparison
- Full dataset (253,680 records) for final model
- Sampling required due to O(n²) complexity of silhouette calculation
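The sampling idea can be sketched in two ways on synthetic data (`make_blobs` stands in for the scaled feature matrix; none of the values below come from the CDC data). The first draws the subsample by hand; the second uses the `sample_size` argument of `silhouette_score`, which subsamples internally. Here the model is fitted on all points and only the scoring is subsampled, which mirrors the final-model evaluation rather than the k-selection step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix (not the CDC data)
X_demo, _ = make_blobs(n_samples=5000, centers=3, n_features=6, random_state=42)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

n_sub = len(X_demo) // 10  # 10% subsample

# Option 1: draw the subsample manually with a seeded generator
rng = np.random.default_rng(42)
idx = rng.choice(len(X_demo), size=n_sub, replace=False)
sil_manual = silhouette_score(X_demo[idx], labels_demo[idx])

# Option 2: let scikit-learn subsample internally
sil_builtin = silhouette_score(X_demo, labels_demo, sample_size=n_sub, random_state=42)

print(f"Manual subsample:   {sil_manual:.3f}")
print(f"Built-in subsample: {sil_builtin:.3f}")
```

The two estimates use different random draws, so they will usually differ slightly; a large gap would suggest the subsample is too small to be stable.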

In [None]:
# Preprocessing and sampling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])  # scale only the six clustering features

np.random.seed(42)
sample_idx = np.random.choice(len(df), int(len(df) * 0.10), replace=False)
X_sample = X_scaled[sample_idx]
df_sample = df.iloc[sample_idx].copy()

print(f"Full dataset: {len(df):,} records")
print(f"Sample (10%): {len(X_sample):,} records")

## 4.3 Build Model

### 4.3.1 Parameter Settings — K Selection

K is selected using **internal validation metrics only**. Diabetes outcomes are deliberately excluded from k-selection to avoid target leakage — using the outcome variable to tune hyperparameters would compromise the independence of the subsequent external validation.

In [None]:
# Evaluate k=2 to 8 using internal metrics only
k_range = range(2, 9)
results = {'k': [], 'inertia': [], 'silhouette': [], 'davies_bouldin': []}

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_sample)
    results['k'].append(k)
    results['inertia'].append(km.inertia_)
    results['silhouette'].append(silhouette_score(X_sample, labels))
    results['davies_bouldin'].append(davies_bouldin_score(X_sample, labels))

results_df = pd.DataFrame(results)
print(results_df.round(3).to_string(index=False))

In [None]:
# Visualise k-selection
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(k_range, results['inertia'], 'bo-')
axes[0].set_xlabel('k'); axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')

axes[1].plot(k_range, results['silhouette'], 'go-')
axes[1].set_xlabel('k'); axes[1].set_ylabel('Silhouette')
axes[1].set_title('Silhouette (higher=better)')

axes[2].plot(k_range, results['davies_bouldin'], 'ro-')
axes[2].set_xlabel('k'); axes[2].set_ylabel('Davies-Bouldin')
axes[2].set_title('Davies-Bouldin (lower=better)')

plt.tight_layout()
plt.savefig('figures/k_selection.png', dpi=150)
plt.show()

In [None]:
# Dendrogram for hierarchical perspective
dendro_idx = np.random.choice(len(X_sample), 1000, replace=False)
linkage_matrix = sch.linkage(X_sample[dendro_idx], method='ward')

fig, ax = plt.subplots(figsize=(10, 4))
sch.dendrogram(linkage_matrix, ax=ax, no_labels=True, color_threshold=20)
ax.set_title('Dendrogram (Ward Linkage)')
ax.set_ylabel('Distance')
ax.axhline(y=20, color='r', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('figures/dendrogram.png', dpi=150)
plt.show()

### Parameter Settings Decision

**Selected: k=3**

| Criterion | Observation |
|-----------|-------------|
| Elbow | Visible bend around k=3 |
| Silhouette | k=3 offers good trade-off between cluster quality and interpretability |
| Davies-Bouldin | k=3 shows acceptable cluster separation |
| Dendrogram | Cutting at distance ≈20 suggests 3 clusters |
| Interpretability | 3 segments are manageable for public health campaigns |

**K-Means parameters:** `n_clusters=3`, `n_init=10`, `random_state=42`
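Reading an elbow off a plot is subjective. One hedged way to cross-check it is the discrete second difference of the inertia curve, sketched below on synthetic data with three well-separated groups (the data and the `elbow_k` name are illustrative, not part of the notebook). This is a heuristic, not a replacement for the combined criteria in the table above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the CDC data)
X_demo, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=1.0, random_state=42)

ks = list(range(2, 9))
inertias = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo).inertia_
    for k in ks
])

# Second difference of inertia: a large value marks the k where the curve
# flattens most sharply. It is defined only for interior k (here k = 3..7).
second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
elbow_k = ks[1:-1][int(np.argmax(second_diff))]

print(f"Elbow by second difference: k={elbow_k}")
```

Because the second difference needs a neighbour on each side, this check cannot flag the smallest or largest k in the range, and on real data with a smooth inertia curve the maximum may be weak.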

### 4.3.2 Final Model

In [None]:
CHOSEN_K = 3

final_model = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=CHOSEN_K, random_state=42, n_init=10))
])

# Fit on features only (not extra columns that may have been added)
df['cluster'] = final_model.fit_predict(df[features])

print(f"Final model fitted on {len(df):,} records")
print(f"\nCluster sizes:")
for c in range(CHOSEN_K):
    n = (df['cluster'] == c).sum()
    print(f"  Cluster {c}: {n:,} ({n/len(df)*100:.1f}%)")

In [None]:
# Centroids (inverse-transformed for interpretation)
kmeans_model = final_model.named_steps['kmeans']
scaler_model = final_model.named_steps['scaler']
centroids_scaled = kmeans_model.cluster_centers_
centroids = scaler_model.inverse_transform(centroids_scaled)

centroids_df = pd.DataFrame(centroids, columns=features)
centroids_df.index.name = 'Cluster'
print("Cluster centroids (mean feature values):")
print(centroids_df.round(2))

### Model Description

The K-Means model partitions 253,680 patients into 3 segments based on 6 features:
- Lifestyle risk score (0–4)
- Healthcare access barriers (0–1)
- Age group (0–3)
- Income bracket (0–2)
- Education bracket (0–2)
- Cardio-metabolic risk (0–2)

Centroids represent the "average patient" in each cluster — interpretable for designing targeted interventions.

## 4.4 Assess Model

### 4.4.1 Algorithm Comparison

In [None]:
# Compare algorithms on 10% sample
km_labels = KMeans(n_clusters=CHOSEN_K, random_state=42, n_init=10).fit_predict(X_sample)
hier_labels = AgglomerativeClustering(n_clusters=CHOSEN_K, linkage='ward').fit_predict(X_sample)

# DBSCAN - auto-determines clusters based on density
dbscan = DBSCAN(eps=2.0, min_samples=50)
dbscan_labels = dbscan.fit_predict(X_sample)
n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)

comparison = {
    'Algorithm': ['K-Means', 'Hierarchical', 'DBSCAN'],
    'Silhouette': [
        silhouette_score(X_sample, km_labels),
        silhouette_score(X_sample, hier_labels),
        silhouette_score(X_sample, dbscan_labels) if n_clusters_dbscan > 1 else 0.0
    ],
    'Davies-Bouldin': [
        davies_bouldin_score(X_sample, km_labels),
        davies_bouldin_score(X_sample, hier_labels),
        davies_bouldin_score(X_sample, dbscan_labels) if n_clusters_dbscan > 1 else np.nan
    ],
    'N_Clusters': [CHOSEN_K, CHOSEN_K, n_clusters_dbscan]

}

print(f"\nDBSCAN found {n_clusters_dbscan} clusters and {n_noise} noise points ({n_noise/len(X_sample)*100:.1f}%)")

# K-Modes is fitted on the unscaled ordinal features; its metrics are computed
# in the scaled space so that they are comparable with the other algorithms
if KMODES_AVAILABLE:
    kmodes_labels = KModes(n_clusters=CHOSEN_K, init='Huang', n_init=5, random_state=42).fit_predict(df_sample[features])
    comparison['Algorithm'].append('K-Modes')
    comparison['Silhouette'].append(silhouette_score(X_sample, kmodes_labels))
    comparison['Davies-Bouldin'].append(davies_bouldin_score(X_sample, kmodes_labels))
    comparison['N_Clusters'].append(CHOSEN_K)

# Print after the optional K-Modes row has been added
print(pd.DataFrame(comparison).round(4).to_string(index=False))

In [None]:
# Algorithm agreement (Adjusted Rand Index)
print("Algorithm Agreement (ARI):")
print(f"  K-Means vs Hierarchical: {adjusted_rand_score(km_labels, hier_labels):.3f}")
print(f"  K-Means vs DBSCAN:       {adjusted_rand_score(km_labels, dbscan_labels):.3f}")
if KMODES_AVAILABLE:
    print(f"  K-Means vs K-Modes:      {adjusted_rand_score(km_labels, kmodes_labels):.3f}")
    print("\nNote: Low K-Means/K-Modes agreement expected (different distance metrics)")

print(f"DBSCAN uses density-based clustering; low agreement is expected if natural structure differs from k={CHOSEN_K}")

**Algorithm comparison findings:**

- **K-Means** produces well-defined clusters with interpretable centroids
- **Hierarchical** clustering shows moderate agreement with K-Means, confirming the cluster structure is not purely an artefact of the algorithm
- **DBSCAN** automatically identifies clusters based on density — if it finds a very different number of clusters or many noise points, this suggests the k=3 structure may be imposed rather than natural
- **K-Modes** agreement is typically lower — it uses Hamming distance which treats all feature differences equally, regardless of magnitude
- **Conclusion:** K-Means is retained for its combination of cluster quality, centroid interpretability ("average patient" profiles), and suitability for creating a fixed number of actionable patient segments for public health campaigns
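One caution when reading DBSCAN's silhouette: DBSCAN labels noise points as `-1`, and `silhouette_score` treats that label as one more cluster. A possible alternative, sketched below on synthetic data, scores only the non-noise points. The helper `silhouette_without_noise` is a hypothetical name introduced here, not part of scikit-learn or the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels):
    """Silhouette over non-noise points only; NaN if fewer than two clusters remain."""
    labels = np.asarray(labels)
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return float("nan")
    return float(silhouette_score(X[mask], labels[mask]))

# Synthetic data (not the CDC data): two dense blobs plus scattered points
rng = np.random.default_rng(42)
X_blobs, _ = make_blobs(n_samples=400, centers=[[0, 0], [8, 8]],
                        cluster_std=0.5, random_state=42)
X_demo = np.vstack([X_blobs, rng.uniform(-4, 12, size=(20, 2))])

labels_demo = DBSCAN(eps=0.7, min_samples=10).fit_predict(X_demo)
n_clusters = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)
print(f"Clusters: {n_clusters}, noise points: {(labels_demo == -1).sum()}")

if len(set(labels_demo)) > 1:
    print(f"Silhouette, noise counted as a cluster: {silhouette_score(X_demo, labels_demo):.3f}")
print(f"Silhouette, noise excluded:             {silhouette_without_noise(X_demo, labels_demo):.3f}")
```

Excluding noise tends to flatter DBSCAN, so the noise share should be reported alongside the score, as the comparison cell above already does.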

### 4.4.2 Internal Validation

In [None]:
# Silhouette analysis on 10% sample
X_scaled_full = final_model.named_steps['scaler'].transform(df[features])
X_eval = X_scaled_full[sample_idx]
labels_eval = df['cluster'].values[sample_idx]

sil_avg = silhouette_score(X_eval, labels_eval)
sil_samples = silhouette_samples(X_eval, labels_eval)
db_score = davies_bouldin_score(X_eval, labels_eval)

print(f"Overall Metrics (10% sample, n={len(sample_idx):,}):")
print(f"  Silhouette Score: {sil_avg:.4f}")
print(f"  Davies-Bouldin Index: {db_score:.4f}")

print(f"\nPer-cluster Silhouette:")
for c in range(CHOSEN_K):
    cluster_sil = sil_samples[labels_eval == c].mean()
    n = (labels_eval == c).sum()
    print(f"  Cluster {c}: {cluster_sil:.4f} (n={n:,})")

In [None]:
# Silhouette plot
fig, ax = plt.subplots(figsize=(8, 6))
y_lower = 10
colors = plt.cm.Set2(np.linspace(0, 1, CHOSEN_K))

for i in range(CHOSEN_K):
    cluster_sil = sil_samples[labels_eval == i]
    cluster_sil.sort()
    y_upper = y_lower + len(cluster_sil)
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,
                     alpha=0.7, color=colors[i])
    ax.text(-0.05, y_lower + 0.5 * len(cluster_sil), str(i))
    y_lower = y_upper + 10

ax.axvline(x=sil_avg, color='red', linestyle='--', label=f'Average: {sil_avg:.3f}')
ax.set_xlabel('Silhouette Coefficient')
ax.set_ylabel('Cluster')
ax.set_title('Silhouette Plot (10% sample)')
ax.legend()
plt.tight_layout()
plt.savefig('figures/silhouette_plot.png', dpi=150)
plt.show()

**Internal validation interpretation:**

| Metric | Value | Interpretation |
|--------|-------|----------------|
| Silhouette | *See output above* | Within typical range (0.20–0.50) for real-world data with ordinal features |
| Davies-Bouldin | *See output above* | Lower is better; <1.5 is acceptable |

The silhouette plot shows:
- All clusters should have positive average silhouette (no systematically misassigned cluster)
- Width of each "blade" indicates cluster size
- Points extending past the red dashed line (average silhouette) are well-clustered; points near zero are borderline
- Any cluster with consistently low silhouette values indicates more overlap with other clusters

**Conclusion:** The clustering has acceptable internal validity — clusters are cohesive and reasonably well-separated given the ordinal nature of the features.

### 4.4.3 External Validation: Diabetes Outcomes

Since diabetes status was **not used** in clustering, different diabetes rates across clusters would suggest that our segmentation captures health-relevant patterns.

**Important caveat:** One of the clustering features, `cardio_metabolic_risk` (HighBP + HighChol), is a known correlate of diabetes (Spearman ρ ≈ 0.30). Therefore, clusters that differ in cardio-metabolic risk will *mechanistically* differ in diabetes rates. This external validation is best interpreted as a **consistency check** rather than independent evidence of predictive power. The clusters are not "discovering" a diabetes relationship — they are reflecting one that is partly built into the feature set.
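A correlation like the one quoted above can be computed with `scipy.stats.spearmanr`. The sketch below uses synthetic stand-ins for the feature and the outcome, so the printed value is not the notebook's ρ; it only shows the shape of the calculation.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# Synthetic stand-ins (not the CDC data): an ordinal 0-2 risk score and a
# binary outcome whose probability rises with the score
risk = rng.integers(0, 3, size=5000)
outcome = (rng.random(5000) < (0.05 + 0.10 * risk)).astype(int)

rho, p_value = spearmanr(risk, outcome)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.2e}")
```

In this notebook the equivalent call would presumably be `spearmanr(df['cardio_metabolic_risk'], target)`, assuming the rows of `df` and `target` are aligned.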

In [None]:
# Cluster profiles
profiles = df.groupby('cluster')[features].mean()
profiles['n'] = df.groupby('cluster').size()
profiles['%'] = (profiles['n'] / len(df) * 100).round(1)

print("Cluster Profiles (mean feature values):")
print(profiles.round(2))

In [None]:
# Heatmap visualisation
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(profiles[features].values, cmap='YlOrRd', aspect='auto')

ax.set_xticks(range(len(features)))
ax.set_yticks(range(len(profiles)))
ax.set_xticklabels(features, rotation=45, ha='right')
ax.set_yticklabels([f'Cluster {i}' for i in profiles.index])

for i in range(len(profiles)):
    for j in range(len(features)):
        val = profiles[features].iloc[i, j]
        ax.text(j, i, f'{val:.2f}', ha='center', va='center',
                color='white' if val > profiles[features].values.max()/2 else 'black')

plt.colorbar(im, label='Mean Value')
ax.set_title('Cluster Profiles')
plt.tight_layout()
plt.savefig('figures/cluster_profiles.png', dpi=150)
plt.show()

In [None]:
# Diabetes rates by cluster
df['Diabetes_012'] = target
diabetes_rates = pd.crosstab(df['cluster'], df['Diabetes_012'], normalize='index')
diabetes_rates.columns = ['No Diabetes', 'Prediabetes', 'Diabetes']

print("Diabetes prevalence by cluster:")
print(diabetes_rates.round(3))

print(f"\nDiabetes rate range: {diabetes_rates['Diabetes'].min()*100:.1f}% to {diabetes_rates['Diabetes'].max()*100:.1f}%")
print(f"Spread: {(diabetes_rates['Diabetes'].max() - diabetes_rates['Diabetes'].min())*100:.1f} percentage points")

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
diabetes_rates.plot(kind='bar', ax=ax, color=['#2ecc71', '#f39c12', '#e74c3c'])
ax.set_xlabel('Cluster')
ax.set_ylabel('Proportion')
ax.set_title('Diabetes Prevalence by Cluster')
ax.set_xticklabels([f'Cluster {i}' for i in diabetes_rates.index], rotation=0)
ax.legend(title='Status')
plt.tight_layout()
plt.savefig('figures/diabetes_by_cluster.png', dpi=150)
plt.show()

In [None]:
# Statistical test
contingency = pd.crosstab(df['cluster'], df['Diabetes_012'])
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(f"Chi-square test for independence:")
print(f"  Chi-sq = {chi2:.2f}")
print(f"  df = {dof}")
print(f"  p-value = {p_value:.2e}")

**External validation interpretation:**

The chi-square test is highly significant (p < 0.001), confirming that cluster membership and diabetes status are **not independent**.

However, this result must be interpreted with caution:

1. **`cardio_metabolic_risk` is a diabetes proxy.** HighBP and HighChol are components of metabolic syndrome, the same underlying condition that leads to type 2 diabetes. Clusters that differ in this feature will *mechanistically* show different diabetes rates.

2. **The validation is a consistency check, not a discovery.** It confirms that the clustering successfully captured the clinical risk dimension embedded in `cardio_metabolic_risk`. This is expected and reassuring, but not evidence of an unexpected pattern.

3. **The other features contribute genuine signal.** Lifestyle risk, age, SES, and healthcare access are associated with diabetes through different pathways (behavioural, demographic, structural). Their contribution to cluster separation is more genuinely "external."

**Public health implication:** Despite the caveat, the clusters remain useful for targeting interventions — they identify distinct patient subgroups with different combinations of clinical risk, lifestyle behaviours, and socioeconomic circumstances.
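With roughly a quarter of a million records, a chi-square test can be highly significant even when the association is weak, so an effect size is a useful companion to the p-value. The sketch below computes Cramér's V on two toy tables; it is an addition for illustration and not part of the original analysis, and `cramers_v` is a helper name introduced here.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for an r x c contingency table (0 = no association, 1 = perfect)."""
    table = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Toy tables (not notebook output)
perfect = [[50, 0], [0, 50]]        # cluster fully determines outcome
independent = [[25, 25], [25, 25]]  # cluster tells nothing about outcome

print(f"Perfect association: V = {cramers_v(perfect):.3f}")
print(f"Independence:        V = {cramers_v(independent):.3f}")
```

Applied to the notebook's `contingency` table, `cramers_v(contingency.values)` would put the chi-square statistic on a 0 to 1 scale that does not grow with sample size.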

### 4.4.4 Sensitivity Analysis: Clustering Without Cardio-Metabolic Risk

To quantify how much of the diabetes separation is driven by `cardio_metabolic_risk`, we re-run the same clustering pipeline using only the 5 behavioural/demographic features (excluding the clinical biomarker).

In [None]:
# Clustering without cardio_metabolic_risk (5 features only)
features_no_cm = [f for f in features if f != 'cardio_metabolic_risk']

model_no_cm = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=CHOSEN_K, random_state=42, n_init=10))
])
labels_no_cm = model_no_cm.fit_predict(df[features_no_cm])

# Internal metrics (on 10% sample)
X_no_cm_scaled = model_no_cm.named_steps['scaler'].transform(df[features_no_cm])
X_no_cm_eval = X_no_cm_scaled[sample_idx]
labels_no_cm_eval = labels_no_cm[sample_idx]

sil_no_cm = silhouette_score(X_no_cm_eval, labels_no_cm_eval)
db_no_cm = davies_bouldin_score(X_no_cm_eval, labels_no_cm_eval)

# Diabetes separation
rates_no_cm = pd.crosstab(
    pd.Series(labels_no_cm, name='cluster'),
    pd.Series(target, name='diabetes'),
    normalize='index'
)
spread_no_cm = (rates_no_cm[2].max() - rates_no_cm[2].min()) * 100

chi2_no_cm, _, _, _ = chi2_contingency(
    pd.crosstab(pd.Series(labels_no_cm), pd.Series(target))
)

# Side-by-side comparison
spread_full = (diabetes_rates['Diabetes'].max() - diabetes_rates['Diabetes'].min()) * 100

print(f"{'Metric':<25} {'6 features':>15} {'5 features':>15} {'Interpretation':>20}")
print("-" * 75)
print(f"{'Silhouette':<25} {sil_avg:>15.3f} {sil_no_cm:>15.3f} {'':>20}")
print(f"{'Davies-Bouldin':<25} {db_score:>15.3f} {db_no_cm:>15.3f} {'':>20}")
print(f"{'Diabetes spread (pp)':<25} {spread_full:>14.1f}pp {spread_no_cm:>14.1f}pp {'':>20}")
print(f"{'Chi-square':<25} {chi2:>15,.0f} {chi2_no_cm:>15,.0f} {'':>20}")

**Sensitivity analysis findings:**

Removing `cardio_metabolic_risk` reveals the trade-off between cluster quality and diabetes stratification. The comparison table above shows:

1. **The behavioural/demographic features form real, high-quality clusters** — silhouette and DB scores typically improve when removing the clinical feature, indicating the 5-feature model finds genuine patient segments with better internal structure

2. **But those segments stratify diabetes risk less effectively** — the diabetes spread and chi-square values drop, indicating weaker association with diabetes outcomes

3. **The clinical biomarker contributes meaningful diabetes stratification** — `cardio_metabolic_risk` (HighBP + HighChol) accounts for a substantial portion of the statistical association with diabetes

This reinforces the caveat stated above: the external validation against diabetes is a consistency check on the clinical feature, not an independent discovery. The 6-feature model is retained because it produces **clinically actionable** segments (combining behavioural, demographic, and clinical dimensions), but a substantial portion of its diabetes separation is mechanistically driven by including a diabetes-correlated clinical biomarker.

### 4.4.5 PCA Visualisation

In [None]:
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled_full)

print(f"Variance explained:")
print(f"  PC1: {pca.explained_variance_ratio_[0]:.1%}")
print(f"  PC2: {pca.explained_variance_ratio_[1]:.1%}")
print(f"  Total: {sum(pca.explained_variance_ratio_):.1%}")

In [None]:
# Sample for clean visualisation
np.random.seed(42)
plot_idx = np.random.choice(len(df), 2000, replace=False)
X_pca_plot = X_pca[plot_idx]
labels_plot = df['cluster'].values[plot_idx]
diabetes_plot = df['Diabetes_012'].values[plot_idx]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Clusters in PCA space
cmap = plt.cm.viridis(np.linspace(0, 0.95, CHOSEN_K))
for c in range(CHOSEN_K):
    mask = labels_plot == c
    axes[0].scatter(X_pca_plot[mask, 0], X_pca_plot[mask, 1],
                    s=30, alpha=0.7, color=cmap[c], label=f'Cluster {c}',
                    edgecolors='none')
axes[0].set_xlabel('PC1'); axes[0].set_ylabel('PC2')
axes[0].set_title('Clusters in PCA Space')
axes[0].legend(markerscale=1.5)

# Diabetes status in PCA space
colors_diab = {0: '#59a14f', 1: '#edc948', 2: '#e15759'}
names_diab = {0: 'No Diabetes', 1: 'Prediabetes', 2: 'Diabetes'}
for status in [0, 1, 2]:
    mask = diabetes_plot == status
    axes[1].scatter(X_pca_plot[mask, 0], X_pca_plot[mask, 1],
                    s=30, alpha=0.7, color=colors_diab[status],
                    label=names_diab[status], edgecolors='none')
axes[1].set_xlabel('PC1'); axes[1].set_ylabel('PC2')
axes[1].set_title('Diabetes Status in PCA Space')
axes[1].legend(markerscale=1.5)

plt.tight_layout()
plt.savefig('figures/pca_combined.png', dpi=150)
plt.show()

In [None]:
# Feature loadings
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=features)
print("PCA Feature Loadings:")
print(loadings.round(3))

**PCA interpretation:**

Two principal components capture a moderate proportion of total variance (see output above), which is typical for heterogeneous health data with 6 features. The remaining variance lies in higher dimensions and cannot be visualised in 2D.

**Reading the loadings:**

The feature loadings table above shows which features drive each principal component:

- **PC1** is typically driven by socioeconomic features (income, education) vs. health risks (lifestyle and cardio-metabolic risk)
  - Interpretation: High-SES vs. high-risk behavioural/clinical profile

- **PC2** typically contrasts age/clinical risk vs. access barriers
  - Interpretation: Older patients with clinical conditions vs. younger patients with access barriers

**Visual observations:**
- Left plot: Clusters show varying degrees of separation — some overlap is expected given the moderate silhouette scores
- Right plot: Diabetes cases (red) tend toward regions associated with higher health risk and lower SES
- The overlap in PCA space does not invalidate the clusters — it means the 6D separation does not fully project into 2D
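To see how much structure a 2D projection leaves out, one option is to fit PCA with all components and inspect the cumulative explained variance. The sketch below does this on synthetic six-feature data (not the CDC data), so the percentages it prints are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic six-feature data standing in for the scaled clustering features
X_demo, _ = make_blobs(n_samples=1000, centers=3, n_features=6, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit all components, then accumulate the variance shares
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for i, share in enumerate(cumulative, start=1):
    print(f"First {i} component(s): {share:.1%}")
```

In the notebook, the same two lines applied to `X_scaled_full` would show how many components are needed before most of the variance is covered.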

## 5. Evaluation

### 5.1 Cluster Interpretation

In [None]:
# Detailed cluster summary
print("=" * 70)
print("CLUSTER INTERPRETATION")
print("=" * 70)

for c in range(CHOSEN_K):
    p = profiles.loc[c]
    cluster_df = df[df['cluster'] == c]
    diabetes_rate = (cluster_df['Diabetes_012'] == 2).mean() * 100
    prediabetes_rate = (cluster_df['Diabetes_012'] == 1).mean() * 100

    risk_level = "High" if p['lifestyle_risk_score'] > 2.0 else "Moderate" if p['lifestyle_risk_score'] > 1.0 else "Low"
    access_issue = "Yes" if p['limited_access_to_care'] > 0.5 else "No"
    age_cat = ["Young", "Middle", "Older", "Elderly"][min(int(round(p['age_group'])), 3)]
    ses = "High" if (p['income_bracket'] + p['education_bracket']) > 3 else \
          "Low" if (p['income_bracket'] + p['education_bracket']) < 2 else "Medium"
    cm_risk = "High" if p['cardio_metabolic_risk'] > 1.0 else "Low"

    print(f"\nCLUSTER {c}")
    print(f"  Size: {p['n']:,.0f} ({p['%']:.1f}%)")
    print(f"  Lifestyle risk: {p['lifestyle_risk_score']:.2f}/4 ({risk_level})")
    print(f"  Cardio-metabolic risk: {p['cardio_metabolic_risk']:.2f}/2 ({cm_risk})")
    print(f"  Access barrier: {p['limited_access_to_care']:.1%} ({access_issue})")
    print(f"  Age: {p['age_group']:.1f} ({age_cat})")
    print(f"  SES: Income={p['income_bracket']:.1f}, Education={p['education_bracket']:.1f} ({ses})")
    print(f"  Diabetes: {diabetes_rate:.1f}%, Prediabetes: {prediabetes_rate:.1f}%")

### 5.2 Cluster Labels and Intervention Strategy

Based on the cluster profiles above, three distinct patient segments emerge. Review the cluster interpretation output and diabetes prevalence table to complete the characterisation:

| Cluster | Label | Key Characteristics | Diabetes Rate | Size | Priority | Intervention Strategy |
|---------|-------|---------------------|---------------|------|----------|----------------------|
| **0** | *[Label based on output]* | *[Review cluster interpretation]* | *[See output]* | *[%]* | *[Priority]* | *[Tailored intervention based on risk profile]* |
| **1** | *[Label based on output]* | *[Review cluster interpretation]* | *[See output]* | *[%]* | *[Priority]* | *[Tailored intervention based on risk profile]* |
| **2** | *[Label based on output]* | *[Review cluster interpretation]* | *[See output]* | *[%]* | *[Priority]* | *[Tailored intervention based on risk profile]* |

**Instructions:** After running all cells, fill in this table based on:
- Cluster interpretation output (Section 5.1)
- Diabetes prevalence by cluster
- Centroid values showing mean feature levels

**Key considerations for public health planning:**

- Identify the highest-risk cluster (highest diabetes rate) for priority intervention
- Design cluster-specific interventions addressing their unique combination of risk factors
- Distinguish between clinical risk, behavioural risk, and access barriers
- Consider both cluster size and risk level when allocating resources
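The size and risk columns of the table above can be drafted programmatically before labels are written by hand. The sketch below ranks clusters by diabetes rate on a ten-row toy frame (the values are made up; only the column names `cluster` and `Diabetes_012` follow the notebook).

```python
import pandas as pd

# Toy data (not notebook output): cluster assignment and diabetes status (2 = diabetes)
toy = pd.DataFrame({
    'cluster':      [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    'Diabetes_012': [0, 0, 2, 2, 2, 0, 0, 0, 0, 0],
})

# Size and diabetes rate per cluster, highest rate first
summary = (
    toy.groupby('cluster')
       .agg(n=('Diabetes_012', 'size'),
            diabetes_rate=('Diabetes_012', lambda s: (s == 2).mean()))
       .sort_values('diabetes_rate', ascending=False)
)
summary['share_of_population'] = summary['n'] / len(toy)

print(summary.round(3))
```

Ranking by rate alone ignores cluster size; multiplying `diabetes_rate` by `n` gives the expected number of cases per cluster, which may be the more relevant figure for resource allocation.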

### 5.3 Limitations and Ethical Considerations

**Methodological limitations:**
- Ordinal features treated as continuous (K-Means assumption violated; validated against K-Modes and DBSCAN)
- Cross-sectional data — cannot infer causal relationships between cluster membership and diabetes
- Self-reported data may have recall and social desirability bias
- `cardio_metabolic_risk` (HighBP + HighChol) is a known diabetes correlate, so the external validation against diabetes outcomes is partly circular — clusters separate diabetes risk partly *because* a diabetes-correlated feature was included
- Silhouette score is moderate, indicating some overlap between clusters — boundaries are not sharp

**Ethical considerations:**
- Risk of stigmatisation if clusters are labelled negatively
- Interventions should empower, not blame patients
- SES-based targeting could reinforce existing inequalities if not implemented carefully
- Need to ensure equitable resource allocation across all clusters

## Summary

| CRISP-DM Phase | Finding |
|----------------|--------|
| 4.1 Select Technique | K-Means (primary), Hierarchical, DBSCAN & K-Modes (validation) |
| 4.2 Test Design | Silhouette, DB, inertia (internal); cross-algorithm ARI and post-hoc diabetes prevalence (external) |
| 4.3 Build Model | k=3, n_init=10, StandardScaler, 6 features |
| 4.4 Assess Model | Acceptable internal metrics; significant diabetes separation (with caveats) |
| 5. Evaluation | 3 interpretable segments combining clinical, behavioural, and demographic dimensions |

**Key methodological decisions:**
- K selected using **internal metrics only** to avoid target leakage
- External validation against diabetes interpreted as a **consistency check**, not independent validation, because `cardio_metabolic_risk` is a known diabetes correlate
- K-Means chosen over alternatives (Hierarchical, DBSCAN, K-Modes) for centroid interpretability and fixed segment count despite ordinal feature limitation