# Agent 1 — Voter Clustering Pipeline (Code Review Notebook)

**Author:** Summer Xiong  
**Purpose:** Side-by-side *code + review* for the voter clustering (Agent 1) module in the Synthetic Voter Model.

This notebook is structured as an executable, well-documented walkthrough:
- Each step includes: **what it does**, **inputs/outputs**, and **methodological rationale**
- Outputs include: elbow curve, silhouette plot, (optional) radar charts, PCA loadings, PCA scatter, and exported CSVs

> **Tip:** Run cells from top to bottom. If an optional package is missing (e.g., Plotly), the notebook skips the corresponding visuals instead of failing.


## 0) Environment & Dependencies

This notebook uses:
- `pandas`, `numpy` — data loading/processing  
- `scikit-learn` — scaling, KMeans clustering, silhouette score, PCA  
- `matplotlib` — static plots  
- `plotly` (optional) — radar charts for cluster profiles  
- `dataframe_image` (optional) — export styled tables as PNG


In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

# Optional: for radar chart
try:
    import plotly.express as px
    plotly_available = True
except ImportError:
    plotly_available = False

# Optional: for exporting DataFrame tables as images
try:
    import dataframe_image as dfi
    dfi_available = True
except ImportError:
    dfi_available = False

plotly_available, dfi_available


## 1) Load Data

**What this step does**  
Loads the voter-level feature dataset where each row corresponds to one voter/wallet and columns are engineered behavioural features.

**Input / Output**  
- **Input:** `Framework/voter_base.csv`  
- **Output:** `df` (pandas DataFrame)

**Methodological rationale**  
Clustering is performed at the voter level to identify latent behavioural archetypes (e.g., whales vs. small voters; active vs. passive; aligned vs. contrarian).


In [None]:
from pathlib import Path

# Path to your voter feature dataset
csv_path = Path("Framework") / "voter_base.csv"

# Load
df = pd.read_csv(csv_path)

# Quick sanity checks
df.head(), df.shape


## 2) Select Features for Clustering

**What this step does**  
Selects numeric behavioural features to define the clustering space.

**Input / Output**  
- **Input:** `df`  
- **Output:** `X` (feature matrix)

**Methodological rationale**  
We use a behavioural feature set capturing:
- participation intensity (`total_votes`)
- voting power (`avg_voting_power`, `is_whale_ratio`)
- preference distribution (`%_for_votes`, `%_against_votes`, `%_abstain_votes`)
- conformity (`%_aligned_with_majority`)

**Note on missing values**  
This implementation uses `fillna(0)` for simplicity. If missingness has semantic meaning (e.g., “no history” vs. “zero behaviour”), consider alternative imputation or missingness flags.
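
One way to keep "no history" distinguishable from a genuine zero is to record missingness indicators before filling. The sketch below runs on a toy frame rather than `voter_base.csv`; the `_missing` suffix and the toy values are illustrative assumptions, not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "total_votes": [12, 0, 3],
    "%_aligned_with_majority": [0.75, np.nan, 1.0],
})

feature_cols = ["total_votes", "%_aligned_with_majority"]

# Record which cells were missing *before* imputing, so "no history"
# stays distinguishable from a genuine zero.
flags = toy[feature_cols].isna().astype(int).add_suffix("_missing")

# Keep only flags that actually fire; all-zero columns add nothing to KMeans.
flags = flags.loc[:, flags.sum() > 0]

X_toy = pd.concat([toy[feature_cols].fillna(0), flags], axis=1)
print(X_toy)
```

Whether the flag columns should enter the clustering space or only be kept for later inspection is a modelling choice; after standardisation a rare flag can carry a lot of weight in Euclidean distance.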


In [None]:
features = [
    "total_votes",
    "avg_voting_power",
    "%_for_votes",
    "%_against_votes",
    "%_abstain_votes",
    "%_aligned_with_majority",
    "is_whale_ratio",
]

# Feature matrix
X = df[features].fillna(0)

X.describe().T


## 3) Standardise Features

**What this step does**  
Applies z-score standardisation to each feature (mean=0, std=1).

**Input / Output**  
- **Input:** `X`  
- **Output:** `X_scaled` (numpy array)

**Methodological rationale**  
KMeans uses Euclidean distance. Without scaling, large-range features (e.g., `total_votes`) would dominate cluster assignment and distort the segmentation.
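
The dominance effect can be seen on a small toy matrix (values invented for illustration): one count-like column and one ratio column. The sketch compares how much each column contributes to the squared distance between two rows before and after scaling, and also checks that `StandardScaler` matches a manual z-score using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: column 0 mimics a count-like feature, column 1 a ratio in [0, 1].
X_toy = np.array([
    [1000.0, 0.10],
    [1010.0, 0.90],
    [5000.0, 0.15],
])

# Share of the squared Euclidean distance between rows 0 and 1
# contributed by each column, before scaling.
raw_diff_sq = (X_toy[0] - X_toy[1]) ** 2
raw_share = raw_diff_sq / raw_diff_sq.sum()

# Same computation after z-score standardisation.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_diff_sq = (X_toy_scaled[0] - X_toy_scaled[1]) ** 2
scaled_share = scaled_diff_sq / scaled_diff_sq.sum()

# StandardScaler uses the population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print("raw share per column:   ", raw_share.round(3))
print("scaled share per column:", scaled_share.round(3))
```

Rows 0 and 1 differ mostly in the ratio column, yet the raw distance is driven almost entirely by the count column; after scaling the ratio column accounts for the gap.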


In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled[:3]


## 4) Choose the Number of Clusters (K)

We use two common internal diagnostics:

### 4.1 Inertia (Elbow Method)
- Inertia = within-cluster sum of squared distances
- Decreases as K increases
- Choose K at the “elbow” where marginal improvement drops

### 4.2 Silhouette Score
- Measures how close each point is to its own cluster (cohesion) relative to the nearest other cluster (separation)
- Range [-1, 1]; higher is better
- Not defined for K=1, so computed for K≥2
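
As a rough illustration of both diagnostics, the sketch below fits KMeans on synthetic blobs (generated with `make_blobs`, not the voter data) and recomputes inertia by hand from the fitted centres, alongside the silhouette score for the same labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)

# Inertia recomputed by hand: squared distance from each point
# to the centre of the cluster it was assigned to.
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centres) ** 2).sum()

sil = silhouette_score(X_demo, km.labels_)
print(f"inertia (sklearn) = {km.inertia_:.3f}, inertia (manual) = {manual_inertia:.3f}")
print(f"silhouette = {sil:.3f}")
```

On large voter tables the silhouette computation can become slow because it relies on pairwise distances; `silhouette_score` accepts a `sample_size` argument to score a random subsample instead.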


In [None]:
inertia = []
silhouette_scores = []

K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    if k > 1:
        score = silhouette_score(X_scaled, kmeans.labels_)
        silhouette_scores.append((k, score))
        print(f"Silhouette score for k={k}: {score:.3f}")

silhouette_scores[:3], len(silhouette_scores)


### 4.3 Visualise Elbow Curve

In [None]:
plt.figure(figsize=(8, 4))
plt.plot(list(K_range), inertia, marker="o")
plt.title("Elbow Curve (Inertia vs. K)")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.tight_layout()
plt.show()


### 4.4 Visualise Silhouette Scores

In [None]:
if silhouette_scores:
    ks = [k for k, s in silhouette_scores]
    ss = [s for k, s in silhouette_scores]

    plt.figure(figsize=(8, 4))
    plt.plot(ks, ss, marker="o")
    plt.title("Silhouette Score vs. K")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Silhouette score")
    plt.tight_layout()
    plt.show()
else:
    print("No silhouette scores computed (need k>=2).")


## 5) Fit Final KMeans Model (Chosen K)

**What this step does**  
Fits KMeans using the chosen number of clusters and assigns a cluster label to each voter.

**Input / Output**  
- **Input:** `X_scaled`  
- **Output:** `df['cluster']` (integer label 0..K-1)

**Methodological rationale**  
`optimal_k` should be justified with:
- elbow curve behaviour
- silhouette score trend
- interpretability & downstream needs (representative agents)
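
A further check that could support the choice of K is label stability across random seeds. The sketch below is one possible approach on synthetic data: the helper `seed_stability` is hypothetical (not part of this pipeline) and reports the mean pairwise adjusted Rand index between KMeans runs that differ only in `random_state`.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data for illustration only.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)


def seed_stability(X, k, seeds=(0, 1, 2, 3, 4)):
    """Mean pairwise adjusted Rand index between KMeans runs with different seeds."""
    labelings = [
        KMeans(n_clusters=k, random_state=s, n_init=10).fit_predict(X)
        for s in seeds
    ]
    # ARI is invariant to label permutation, so runs can be compared directly.
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))


for k in (2, 3, 4):
    print(f"k={k}: mean pairwise ARI = {seed_stability(X_demo, k):.3f}")
```

Values close to 1 suggest the partition does not depend much on initialisation; a K whose labelings disagree between seeds is harder to defend as a set of representative agents.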


In [None]:
optimal_k = 3  # <-- adjust based on your elbow + silhouette diagnostics

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_scaled)

df["cluster"].value_counts().sort_index()


## 6) Cluster Interpretation (Behavioural Profiles)

**What this step does**  
Computes mean feature values per cluster to interpret what each voter archetype represents.

**Outputs**
- Printed cluster summary
- Exported CSV: `Framework/cluster_feature_means.csv`
- Optional PNG table if `dataframe_image` is available

**How to use**  
Use this table to name clusters (e.g., “Whales”, “Passive aligned”, “Active contrarian”), and to define representative agents for downstream modules.


In [None]:
cluster_summary = df.groupby("cluster")[features].mean()
cluster_summary


In [None]:
# Save as CSV
out_csv = Path("Framework") / "cluster_feature_means.csv"
out_csv.parent.mkdir(parents=True, exist_ok=True)
cluster_summary.to_csv(out_csv)

out_csv


### 6.1 Optional: Export Styled PNG Table (if `dataframe_image` installed)

In [None]:
if dfi_available:
    styled = cluster_summary.style.format(precision=3).set_caption("Cluster Feature Means")
    out_png = Path("Framework") / "cluster_feature_means_table.png"
    dfi.export(styled, str(out_png))  # pass a plain string path
    print(f"Saved PNG table to: {out_png}")
else:
    print("dataframe_image not installed: skipping PNG table export.")


### 6.2 Optional: Radar Charts per Cluster (if Plotly installed)

**Note:** Radar charts can be sensitive to feature scales; here we plot raw feature means.  
For more comparable shapes, consider min-max normalising `cluster_summary` before plotting.
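
A minimal sketch of that normalisation, using a toy stand-in for `cluster_summary` (the numbers are invented): each feature column is rescaled to [0, 1] across clusters, with a guard for constant columns.

```python
import pandas as pd

# Toy stand-in for `cluster_summary`; the numbers are illustrative only.
summary_toy = pd.DataFrame(
    {
        "total_votes": [3.0, 40.0, 12.0],
        "avg_voting_power": [50.0, 90000.0, 700.0],
        "%_for_votes": [0.80, 0.60, 0.70],
    },
    index=pd.Index([0, 1, 2], name="cluster"),
)

# Column-wise min-max: each feature is rescaled to [0, 1] across clusters.
col_range = summary_toy.max() - summary_toy.min()
# Guard against constant columns, which would otherwise divide by zero.
col_range = col_range.replace(0, 1)
summary_norm = (summary_toy - summary_toy.min()) / col_range

print(summary_norm.round(3))
```

The normalised table shows each cluster's position relative to the other clusters, not absolute magnitudes, so it suits the radar plot while the raw means remain the reference for naming clusters.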


In [None]:
if plotly_available:
    for c in cluster_summary.index:
        fig = px.line_polar(
            r=cluster_summary.loc[c].values,
            theta=features,
            line_close=True,
            title=f"Cluster {c} Profile",
        )
        fig.show()
else:
    print("Plotly not installed: skipping radar charts.")


## 7) PCA for Visual Sanity Check & Interpretability

**What this step does**
- Projects the scaled feature space down to 2D for plotting (PCA1, PCA2)
- Extracts the PCA component weights (`pca.components_`, referred to as loadings here) to interpret which features drive each component

**Outputs**
- `df['pca1']`, `df['pca2']`
- Bar plots of PCA component loadings
- Scatter plot of voters coloured by cluster

**Methodological rationale**  
PCA here is used as an *interpretability and visual sanity check*, not as a clustering method.
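
When reading a 2D projection it helps to know how much variance the two components retain. The sketch below, on synthetic data, prints `explained_variance_ratio_` and also shows one common convention for loadings, in which the unit-length rows of `components_` are scaled by the square root of each component's variance. Which convention to report is a choice; this notebook plots `components_` directly.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data for illustration only.
X_demo, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of total variance captured by each retained component.
evr = pca_demo.explained_variance_ratio_
print("explained variance ratio:", evr.round(3), "total:", evr.sum().round(3))

# Rows of components_ are unit-length directions; scaling by sqrt(variance)
# gives loadings under the alternative convention (shape: features x components).
scaled_loadings = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)
print(scaled_loadings.round(3))
```

If the two components together explain only a small share of the variance, overlapping clusters in the scatter plot do not necessarily mean the clusters overlap in the full feature space.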


In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df["pca1"] = X_pca[:, 0]
df["pca2"] = X_pca[:, 1]

pca_loadings = pca.components_
pca_loadings


### 7.1 PCA Loadings (Component 1)

In [None]:
plt.figure(figsize=(8, 5))
plt.bar(range(len(pca_loadings[0])), pca_loadings[0])
plt.title("PCA Component 1 Loadings")
plt.xlabel("Feature")
plt.ylabel("Loading")
plt.xticks(range(len(features)), features, rotation=30, ha="right")
plt.tight_layout()
plt.show()


### 7.2 PCA Loadings (Component 2)

In [None]:
plt.figure(figsize=(8, 5))
plt.bar(range(len(pca_loadings[1])), pca_loadings[1])
plt.title("PCA Component 2 Loadings")
plt.xlabel("Feature")
plt.ylabel("Loading")
plt.xticks(range(len(features)), features, rotation=30, ha="right")
plt.tight_layout()
plt.show()


### 7.3 PCA Scatter Plot (Clusters in 2D)

In [None]:
plt.figure(figsize=(8, 6))
for c in sorted(df["cluster"].unique()):  # iterate over the labels actually present
    subset = df[df["cluster"] == c]
    plt.scatter(subset["pca1"], subset["pca2"], label=f"Cluster {c}", alpha=0.6)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.title("PCA Scatter Plot of Clusters")
plt.legend()
plt.tight_layout()
plt.show()


## 8) Export Results

**What this step does**  
Exports the enriched voter dataset including:
- original features
- cluster label
- PCA coordinates

**Output**
- `Framework/voter_base_with_clusters.csv`

This file is typically used as an input to:
- Agent 2 (vote prediction)
- Agent 3 (mechanism simulation)


In [None]:
out_with_clusters = Path("Framework") / "voter_base_with_clusters.csv"
df.to_csv(out_with_clusters, index=False)

out_with_clusters


## Appendix: Notes & Suggested Improvements (Optional)

If you want to make this pipeline more publication-ready, common improvements include:
- Parameterisation via config file (paths, K, random seeds)
- Saving fitted objects (`scaler`, `kmeans`) for consistent reuse
- Robustness checks (multiple seeds; stability metrics; feature sensitivity)
- Proper missingness handling (semantics-aware imputation or flags)
- Logging + figure saving to a `figures/` directory
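
As a sketch of the second suggestion, fitted objects could be persisted with `joblib` (installed alongside scikit-learn) and reloaded to label new voters with the same scaling and centroids. The data, variable names, and the filename `agent1_models.joblib` are hypothetical; a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random data standing in for the voter feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))

scaler_demo = StandardScaler().fit(X_demo)
kmeans_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit(
    scaler_demo.transform(X_demo)
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "agent1_models.joblib"  # hypothetical filename
    # Store both objects together so they cannot drift apart.
    joblib.dump({"scaler": scaler_demo, "kmeans": kmeans_demo}, model_path)
    restored = joblib.load(model_path)

# Reuse the restored objects: scale first, then assign clusters.
new_labels = restored["kmeans"].predict(restored["scaler"].transform(X_demo))
print(np.bincount(new_labels))
```

Files written this way are pickle-based, so they are generally only safe to load from trusted sources and with a compatible scikit-learn version.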
