# 05 — Clustering: Community Typology by Healthcare Access Profile

Identify groups of census tracts with similar healthcare access challenges
using unsupervised machine learning.

**Methods:**
1. Feature engineering and standardisation
2. K-means clustering with optimal-k selection
3. Hierarchical clustering (validation)
4. LISA (Local Indicators of Spatial Association) spatial clustering
5. Cluster characterisation and intervention mapping

In [None]:
import sys

# Put the parent directory on the import path so the `src` package resolves.
sys.path.insert(0, "..")

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

from src.config import DATA_OUTPUTS
from src.clustering import (
    prepare_features,
    kmeans_optimal_k,
    hierarchical_clustering,
    spatial_clustering,
    characterize_clusters,
)
from src.visualization import plot_elbow, cluster_map

## 5.1 Load Accessibility and Demographic Data

In [None]:
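cluster_gdf = gpd.read_file(DATA_OUTPUTS / "pa_accessibility_scores.gpkg")

# Print the shape separately so the preview below renders as a table,
# not as the text repr of a tuple.
print(cluster_gdf.shape)
cluster_gdf.head()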

## 5.2 Feature Engineering
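
`prepare_features` is defined in `src.clustering` and its body is not shown in this notebook. As a rough illustration of the standardisation step, the sketch below z-scores a handful of numeric columns so that no single variable dominates the distance calculations used by k-means. The helper name `standardise_features`, the median imputation, and the `pct_uninsured` column are assumptions made for the example, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def standardise_features(df, columns):
    """Hypothetical helper: z-score the given numeric columns."""
    features = df[columns].astype(float)
    # Assumption: fill gaps with the column median so every tract keeps a row.
    features = features.fillna(features.median())
    # Guard against constant columns, which would otherwise divide by zero.
    std = features.std(ddof=0).replace(0, 1.0)
    z = (features - features.mean()) / std
    return z.to_numpy(), list(columns)


# Toy data; "pct_uninsured" is an illustrative column name.
demo = pd.DataFrame({
    "accessibility_score": [0.2, 0.5, 0.9, np.nan],
    "pct_uninsured": [12.0, 8.0, 3.0, 7.0],
})
X_demo, names_demo = standardise_features(
    demo, ["accessibility_score", "pct_uninsured"]
)
```

After this step each column of `X_demo` has mean 0 and unit variance, which is the property the clustering steps rely on.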

In [None]:
X, feature_names = prepare_features(cluster_gdf)
X.shape, feature_names

## 5.3 K-Means — Optimal k Selection
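
The cell below relies on `kmeans_optimal_k` returning a dictionary with `labels`, `inertias`, `silhouette_scores`, and `optimal_k`. One plausible way to produce such a result with scikit-learn is sketched here: fit k-means for a range of k, record inertia and mean silhouette for each, and keep the k with the highest silhouette. The function name `sweep_kmeans`, the k range, and the selection rule are assumptions; the project's helper may choose k differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep_kmeans(X, k_min=2, k_max=6, seed=0):
    """Hypothetical sketch: fit k-means for each k and pick the best silhouette."""
    inertias, silhouettes, labels_by_k = [], [], {}
    for k in range(k_min, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        inertias.append(model.inertia_)  # within-cluster sum of squares
        silhouettes.append(silhouette_score(X, model.labels_))
        labels_by_k[k] = model.labels_
    # Assumed rule: the k with the highest mean silhouette wins.
    best_k = k_min + int(np.argmax(silhouettes))
    return {
        "optimal_k": best_k,
        "labels": labels_by_k[best_k],
        "inertias": inertias,
        "silhouette_scores": silhouettes,
    }


# Three well-separated synthetic blobs stand in for the tract features.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centres])
result_demo = sweep_kmeans(X_demo)
```

Inertia always tends to fall as k grows, so the elbow plot is read for where the decline flattens, while the silhouette curve gives a peak that is easier to pick automatically.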

In [None]:
kmeans_result = kmeans_optimal_k(X)
cluster_gdf["cluster"] = kmeans_result["labels"].astype(int)

# Rebuild the k axis for the plot; this assumes the sweep starts at k = 2
# and increases in steps of 1.
k_vals = range(2, 2 + len(kmeans_result["inertias"]))
elbow_fig = plot_elbow(kmeans_result["inertias"], kmeans_result["silhouette_scores"], k_vals)
elbow_fig

## 5.4 Hierarchical Clustering (Validation)
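
Cluster label numbers are arbitrary, so comparing the first few rows of two label columns by eye says little about agreement. A permutation-invariant score such as the adjusted Rand index (ARI) is one common way to quantify how closely two partitions match. The sketch below uses synthetic data and Ward linkage; both are assumptions for illustration, since the linkage used inside `hierarchical_clustering` is not shown here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the standardised feature matrix.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
# Assumption: Ward linkage, which minimises within-cluster variance like k-means.
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_demo)

# ARI is 1.0 for identical partitions and close to 0 for random agreement.
ari_demo = adjusted_rand_score(km_labels, hier_labels)
```

In this notebook the same call could be applied to `cluster_gdf["cluster"]` and `cluster_gdf["cluster_hier"]` once both columns exist.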

In [None]:
hier_result = hierarchical_clustering(X, n_clusters=kmeans_result["optimal_k"])
cluster_gdf["cluster_hier"] = hier_result["labels"].astype(int)
cluster_gdf[["cluster", "cluster_hier"]].head()

## 5.5 LISA Spatial Clustering
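
LISA assigns each tract a local Moran's I value that compares its own deviation from the mean with the average deviation of its neighbours. The NumPy sketch below shows that calculation and the usual quadrant labels (HH, LL, HL, LH) on a toy chain of six tracts. It is a simplified illustration only: it assumes every tract has at least one neighbour, and it omits the permutation-based significance test that a full LISA analysis normally applies. The function name `local_morans_i` is hypothetical.

```python
import numpy as np


def local_morans_i(values, weights):
    """Hypothetical sketch: local Moran's I with row-standardised weights."""
    x = np.asarray(values, dtype=float)
    W = np.asarray(weights, dtype=float)
    W = W / W.sum(axis=1, keepdims=True)  # each row of weights sums to 1
    z = x - x.mean()                      # deviation from the global mean
    m2 = (z ** 2).sum() / len(x)          # variance-like scaling term
    lag = W @ z                           # mean deviation of the neighbours
    local_i = z * lag / m2
    quadrant = np.where(
        z > 0,
        np.where(lag > 0, "HH", "HL"),
        np.where(lag > 0, "LH", "LL"),
    )
    return local_i, quadrant


# Six tracts in a row; each tract neighbours the tracts directly beside it.
scores_demo = [9, 8, 9, 1, 2, 1]
adjacency_demo = np.zeros((6, 6))
for i in range(5):
    adjacency_demo[i, i + 1] = adjacency_demo[i + 1, i] = 1

local_i_demo, quad_demo = local_morans_i(scores_demo, adjacency_demo)
```

Tracts in the middle of a high or low run get positive values (HH or LL), while the two tracts at the boundary between the runs get negative values (HL and LH), which marks them as spatial outliers.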

In [None]:
cluster_gdf = spatial_clustering(cluster_gdf, variable="accessibility_score")
cluster_gdf["spatial_cluster"].value_counts().head()

## 5.6 Cluster Characterisation
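
A cluster profile is typically a per-cluster summary of the input features plus a tract count. The pandas sketch below shows one minimal way to build such a table; the toy data, the `pct_uninsured` column, and the choice of the mean as the summary statistic are assumptions, and `characterize_clusters` may compute more than this.

```python
import pandas as pd

# Toy data; "pct_uninsured" is an illustrative column name.
demo = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "accessibility_score": [0.8, 0.6, 0.2, 0.3, 0.1],
    "pct_uninsured": [4.0, 6.0, 12.0, 15.0, 9.0],
})
feature_cols = ["accessibility_score", "pct_uninsured"]

# Mean of each feature within each cluster.
profiles_demo = demo.groupby("cluster")[feature_cols].mean()
# Number of tracts per cluster, aligned on the cluster index.
profiles_demo["n_tracts"] = demo.groupby("cluster").size()
# Move the cluster label from the index into an ordinary column.
profiles_demo = profiles_demo.reset_index()
```

Resetting the index keeps the cluster label as a regular column, which matters if the table is later written to CSV with `index=False`.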

In [None]:
cluster_profiles = characterize_clusters(cluster_gdf, label_col="cluster")
cluster_profiles

## 5.7 Cluster Map and Export

In [None]:
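# Build the interactive map, coloured by k-means cluster.
cluster_folium_map = cluster_map(cluster_gdf, cluster_col="cluster")

# Output paths for the labelled tracts, the profile table, and the map.
cluster_out = DATA_OUTPUTS / "pa_accessibility_clusters.gpkg"
profile_out = DATA_OUTPUTS / "pa_cluster_profiles.csv"
map_out = DATA_OUTPUTS / "pa_cluster_map.html"

# Persist all three artefacts for later use.
cluster_gdf.to_file(cluster_out, driver="GPKG")
cluster_profiles.to_csv(profile_out, index=False)
cluster_folium_map.save(str(map_out))

# Display the map inline as the cell output.
cluster_folium_map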