# k-Means

**Scenario (Property Ops)**: Cluster incoming documents so they can be auto-routed by type (leases, invoices, inspection reports, maintenance tickets). Start with k-means using k-means++ initialization and choose k from the inertia (elbow) and silhouette curves. Then profile the clusters against business features such as page count, OCR confidence, and table density.
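
To make the two selection criteria concrete, the sketch below recomputes inertia by hand on synthetic blobs (not the notebook's data). Inertia is the sum of squared distances from each point to its assigned centroid, while the silhouette score compares within-cluster distance to nearest-other-cluster distance and lies in [-1, 1]. The `make_blobs` data and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): 300 points around 4 centers.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

km_demo = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels_demo = km_demo.fit_predict(X_demo)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = km_demo.cluster_centers_[labels_demo]
manual_inertia = float(((X_demo - assigned_centroids) ** 2).sum())

# Silhouette: mean over all points, higher means better-separated clusters.
sil_demo = silhouette_score(X_demo, labels_demo)

print(f"inertia (sklearn): {km_demo.inertia_:.2f}")
print(f"inertia (manual):  {manual_inertia:.2f}")
print(f"silhouette:        {sil_demo:.3f}")
```

Inertia always decreases as k grows, which is why the elbow is read by eye, whereas silhouette can peak at an intermediate k and so supports an `argmax` rule.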

In [None]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt, warnings
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

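# Save the helper under an importable name: the remote file is called unsup_utils.py.py,
# which `import unsup_utils` would not find.
!wget -q -O unsup_utils.py https://raw.githubusercontent.com/Jihun-ust/ust-mail-557/main/Unsupervised/unsup_utils.py.py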
import unsup_utils as utils
csv_path = "https://raw.githubusercontent.com/Jihun-ust/ust-mail-557/main/Unsupervised/unsup.csv"
warnings.filterwarnings("ignore")

df = pd.read_csv(csv_path)
X, cols, sc = utils.feature_matrix(df, use_emb=True)

# Try different k and collect scores
inertias, sils, ks = [], [], list(range(2,9))
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias.append(km.inertia_)
    sils.append(silhouette_score(X, labels))

# Elbow plot: inertia vs. k
plt.figure(figsize=(8, 3.8))
plt.plot(ks, inertias, marker="o")
plt.title("Elbow (Inertia)")
plt.xlabel("k")
plt.ylabel("inertia")
plt.tight_layout()
plt.show()

# Silhouette plot: mean silhouette score vs. k
plt.figure(figsize=(8, 3.8))
plt.plot(ks, sils, marker="o")
plt.title("Silhouette")
plt.xlabel("k")
plt.ylabel("score")
plt.tight_layout()
plt.show()

best_k = ks[int(np.argmax(sils))]
print("Chosen k by silhouette:", best_k)

# Fit the final model with the chosen k (more restarts for a more stable solution)
km = KMeans(n_clusters=best_k, init="k-means++", n_init=20, random_state=42)
df["cluster_km"] = km.fit_predict(X)

# PCA for visualization
X2, p = utils.pca_2d(X)
utils.plot_xy(X2, title="PCA (colored by k-means clusters)", labels=df["cluster_km"].values)

# Cluster profiling
prof = df.groupby("cluster_km")[["page_count","ocr_conf","image_pct","table_density","token_count_k","amount_usd","signatures","layout_complexity"]].mean().round(2)
prof  # show every cluster; head() would hide rows when best_k > 5
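
Raw means are hard to compare across features on different scales (page counts versus dollar amounts). One option, sketched here on a small frame of made-up values, is to express each cluster mean as a z-score relative to the whole dataset, so large positive or negative entries flag the features that distinguish a cluster. The `toy` frame and its numbers are hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with made-up values (assumption); column names mirror the notebook.
toy = pd.DataFrame({
    "cluster_km": [0, 0, 0, 1, 1, 1],
    "page_count": [2, 3, 2, 20, 25, 22],
    "amount_usd": [1500.0, 1200.0, 1800.0, 0.0, 50.0, 0.0],
})
features = ["page_count", "amount_usd"]

# Standardize each feature over all documents, then average per cluster.
z = (toy[features] - toy[features].mean()) / toy[features].std(ddof=0)
z_prof = z.groupby(toy["cluster_km"]).mean().round(2)
print(z_prof)
```

In this toy example cluster 1 stands out for long documents and cluster 0 for high dollar amounts, which is the kind of reading the real profile table is meant to support.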

### Compare to known doc types (for quick review)
Cross-tab clusters with the known `doc_type` to see if clusters align with business categories.

In [None]:
pd.crosstab(df["cluster_km"], df["doc_type"])
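
The raw counts can be condensed into a few summary numbers. As a sketch on made-up labels (not the notebook's results), row-normalizing the cross-tab shows each cluster's dominant doc type, purity is the share of documents that fall in their cluster's majority type, and scikit-learn's `adjusted_rand_score` gives a chance-corrected agreement score. The `demo` frame is hypothetical.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up labels for illustration (assumption).
demo = pd.DataFrame({
    "cluster_km": [0, 0, 0, 1, 1, 1, 2, 2],
    "doc_type": ["lease", "lease", "invoice", "invoice",
                 "invoice", "invoice", "ticket", "ticket"],
})

# Counts, plus the same table as per-cluster shares (each row sums to 1).
ct = pd.crosstab(demo["cluster_km"], demo["doc_type"])
row_share = pd.crosstab(demo["cluster_km"], demo["doc_type"], normalize="index").round(2)

# Purity: fraction of documents that belong to their cluster's majority type.
purity = ct.max(axis=1).sum() / ct.values.sum()

# Adjusted Rand index: agreement between the two labelings, corrected for chance.
ari = adjusted_rand_score(demo["doc_type"], demo["cluster_km"])

print(row_share)
print(f"purity={purity:.3f}, ARI={ari:.3f}")
```

Purity tends to rise with k because smaller clusters are easier to keep homogeneous, so it is best read alongside the adjusted Rand index rather than on its own.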