# Agent 2 — Module Walkthrough (Code + Review)
## Evaluation & Reporting Script (`eval_agent2.py`)

**Author:** Summer Xiong  
**Purpose:**  
Evaluate **Agent 2** on the validation split and export **paper‑ready artifacts**, including:
- overall and per‑cluster metrics
- confusion matrices (raw + normalised)
- calibration (ECE + reliability curves)
- LaTeX tables for direct inclusion in the dissertation / paper

This script is *post‑training* and assumes a trained Agent 2 model stored in `agent2_artifacts/`.

> **Key idea:**  
> Training performance alone is insufficient. This module focuses on **interpretability, calibration, and fairness across clusters**, which directly supports your dissertation claims.


## 1) Command‑Line Interface (CLI)

```bash
python eval_agent2.py \
  --data_dir path/to/data \
  --artifacts_dir path/to/agent2_artifacts
```

### Arguments
- `--code_dir`: where the Agent 2 modules live (`dataset.py`, `model.py`, etc.)
- `--data_dir`: directory containing `cluster_0/1/2_dataset.csv`
- `--artifacts_dir`: trained model + config (auto‑discovered if empty)
- `--output_dir`: where evaluation outputs are saved
- `--seed`: **must match training seed** (critical)
- `--batch_size`: evaluation batch size

⚠️ **Methodological note**  
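Evaluation reuses the training `seed` and the same `split_by_voter` logic, so the validation voters are exactly the ones held out during training. A different seed would move some training voters into the validation split and leak training data into the reported metrics.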
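
The argument list above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual parser: every default value (`"."`, `"eval_outputs"`, `42`, `32`) and the choice to make `--data_dir` required are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; all defaults below are assumptions."""
    p = argparse.ArgumentParser(description="Evaluate Agent 2 on the validation split")
    p.add_argument("--code_dir", default=".", help="where the Agent 2 modules live")
    p.add_argument("--data_dir", required=True, help="directory with the cluster CSVs")
    # An empty string is the signal to fall back to auto-discovery.
    p.add_argument("--artifacts_dir", default="", help="trained model + config")
    p.add_argument("--output_dir", default="eval_outputs", help="where outputs are saved")
    p.add_argument("--seed", type=int, default=42, help="must equal the training seed")
    p.add_argument("--batch_size", type=int, default=32, help="evaluation batch size")
    return p


# Parse an explicit list instead of sys.argv so the example runs anywhere.
args = build_parser().parse_args(["--data_dir", "path/to/data", "--seed", "7"])
print(args.seed, repr(args.artifacts_dir))
```

Since a mismatched seed silently changes the validation split, passing `--seed` explicitly on every run is safer than relying on any default.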


## 2) Artifact Auto‑Discovery

### Problem addressed
In practice (especially Colab), model artifacts may not live in a fixed path.

### Solution
`autodiscover_artifacts()`:
- checks common relative locations
- falls back to a *light recursive search* under `/content` and Drive
- verifies both `config.json` and `agent2_model.pt` exist

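Because both files must be present, a folder holding only one of them is never mistaken for a complete artifact set, and the script runs without a hard-coded path.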
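
The discovery logic described above might look like the sketch below. Only the function name and the two required file names come from this walkthrough; the signature, the `max_depth` limit, and the helper `has_artifacts` are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional

REQUIRED_FILES = ("config.json", "agent2_model.pt")


def has_artifacts(folder: Path) -> bool:
    """True only if both required files exist in `folder`."""
    return all((folder / name).is_file() for name in REQUIRED_FILES)


def autodiscover_artifacts(
    candidates: Iterable[str],
    search_roots: Iterable[str],
    max_depth: int = 4,  # assumed limit that keeps the search "light"
) -> Optional[Path]:
    # 1) Cheap check of the common relative locations.
    for candidate in candidates:
        folder = Path(candidate)
        if has_artifacts(folder):
            return folder
    # 2) Depth-limited recursive search under each root.
    for root in search_roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue
        for dirpath, dirnames, _files in os.walk(root_path):
            current = Path(dirpath)
            if has_artifacts(current):
                return current
            if len(current.relative_to(root_path).parts) >= max_depth:
                dirnames[:] = []  # prune: do not descend any further
    return None


# Demo on a throwaway directory tree instead of /content or Drive.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "MyDrive" / "agent2_artifacts"
    target.mkdir(parents=True)
    (target / "config.json").write_text("{}")
    (target / "agent2_model.pt").write_bytes(b"")
    found = autodiscover_artifacts(["agent2_artifacts"], [tmp])
    found_name = found.name if found else None
```

Pruning `dirnames` in place is what bounds the walk; without it a large Drive mount would be traversed in full.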


## 3) Data Reloading & Canonical Normalisation

### Steps
1. Reload raw CSVs
2. Apply `normalise_columns`
3. Filter valid labels `{0,1,2}`

### Why reload instead of reusing training tensors?
- guarantees **stateless reproducibility**
- allows independent verification of results
- aligns with best practices for empirical research
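
The three reload steps can be sketched with the standard library alone. The file pattern `cluster_{k}_dataset.csv`, the `label` column name, and the body of `normalise_columns` are assumptions; the real helper presumably does more than tidy header names, and the real script may well use pandas.

```python
import csv
from pathlib import Path

VALID_LABELS = {0, 1, 2}


def normalise_columns(row: dict) -> dict:
    """Stand-in for the real helper: strip and lower-case header names."""
    return {k.strip().lower(): v for k, v in row.items() if k is not None}


def reload_cluster_rows(data_dir: str, clusters=(0, 1, 2), label_col: str = "label") -> list:
    rows = []
    for k in clusters:
        path = Path(data_dir) / f"cluster_{k}_dataset.csv"
        with path.open(newline="", encoding="utf-8") as fh:
            for raw in csv.DictReader(fh):           # step 1: reload raw CSV
                row = normalise_columns(raw)         # step 2: canonical columns
                try:
                    value = float(row[label_col])
                except (KeyError, TypeError, ValueError):
                    continue                         # missing or non-numeric label
                if value in VALID_LABELS:            # step 3: keep labels 0, 1, 2
                    row[label_col] = int(value)
                    row["cluster"] = k
                    rows.append(row)
    return rows
```

Rows with a missing, non-numeric, or out-of-range label are dropped silently here; logging the number of dropped rows would make the filter auditable.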


## 4) Model & Tokenizer Reconstruction (Critical Section)

### Key steps
1. Load `config.json`
2. Assert **label mapping consistency**
3. Load tokenizer (with special tokens)
4. Instantiate model with correct `feat_dim`
5. **Resize token embeddings BEFORE loading weights**
6. Load model state dict

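```python
# Resize first so the embedding matrix matches the tokenizer used in training.
model.text_encoder.resize_token_embeddings(len(tokenizer))
# Only then load the trained weights.
model.load_state_dict(state)
```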

⚠️ **This ordering is critical**  
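The checkpoint stores an embedding matrix that already includes the added special tokens. If the embeddings are not resized first, `load_state_dict` fails with a size mismatch on the embedding weights.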

This section demonstrates strong engineering discipline.
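
Step 2, the label-mapping assertion, could be as small as the sketch below. The config key `label2id` and the particular name-to-index mapping are hypothetical; the walkthrough only states that the mapping stored in `config.json` is checked for consistency.

```python
import json


def assert_label_mapping(config: dict, expected: dict, key: str = "label2id") -> None:
    """Raise if the mapping saved at training time differs from the one used now."""
    stored = {str(name): int(idx) for name, idx in config[key].items()}
    if stored != expected:
        raise ValueError(
            f"label mapping mismatch: config has {stored}, evaluation expects {expected}"
        )


# Illustrative config content; the real file is read from the artifacts directory.
config = json.loads('{"label2id": {"For": 0, "Against": 1, "Abstain": 2}}')
assert_label_mapping(config, {"For": 0, "Against": 1, "Abstain": 2})
mapping_ok = True
```

Failing fast here matters because a permuted mapping would not crash inference; it would only produce confusion matrices with silently swapped classes.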


## 5) Rebuilding Validation Windows

### Why rebuild windows?
Agent 2 windows are:
- voter‑dependent
- time‑ordered
- seed‑dependent

To ensure comparability with training:
- `split_by_voter` is re‑run with the same seed
- windows are rebuilt using identical logic

This ensures **evaluation integrity**.
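
To make the seed dependence concrete, here is one way a voter-level split can be written so that the same seed always reproduces the same partition. Only the name `split_by_voter` comes from the script; the signature, the `val_fraction` parameter, and the shuffle-based strategy are assumptions.

```python
import random


def split_by_voter(voter_ids, seed: int, val_fraction: float = 0.2):
    """Assign whole voters to train or validation, deterministically per seed."""
    unique = sorted(set(voter_ids))   # sorting makes the result independent of row order
    rng = random.Random(seed)         # local RNG, leaves global random state untouched
    rng.shuffle(unique)
    n_val = max(1, round(len(unique) * val_fraction))
    val_voters = set(unique[:n_val])
    train_voters = set(unique[n_val:])
    return train_voters, val_voters


voters = [f"v{i}" for i in range(10)]
train_a, val_a = split_by_voter(voters, seed=123)
train_b, val_b = split_by_voter(list(reversed(voters)), seed=123)
```

Because voters rather than individual rows are split, no voter's history can appear on both sides, which is the leakage the walkthrough warns about.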


## 6) Inference Loop

### Outputs collected
- `y_true`: true labels
- `y_pred`: predicted labels
- `probs_all`: softmax probabilities
- `clusters`: cluster id per window

These are the foundation for:
- performance metrics
- confusion matrices
- calibration analysis
- cluster‑level fairness evaluation
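
The sketch below shows how the four outputs accumulate across batches. It replaces the model with precomputed logits and uses plain lists so that it runs without PyTorch; the actual script presumably runs the model on tensors with gradients disabled. The batch layout `(logits, labels, cluster_ids)` is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def collect_outputs(batches):
    """`batches` yields (logits, labels, cluster_ids), each a list per batch."""
    y_true, y_pred, probs_all, clusters = [], [], [], []
    for logits, labels, cluster_ids in batches:
        for row, label, cluster in zip(logits, labels, cluster_ids):
            p = softmax(row)
            probs_all.append(p)
            y_pred.append(max(range(len(p)), key=p.__getitem__))  # argmax
            y_true.append(label)
            clusters.append(cluster)
    return y_true, y_pred, probs_all, clusters


batches = [([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]], [0, 1], [0, 2])]
y_true, y_pred, probs_all, clusters = collect_outputs(batches)
```

Keeping `clusters` aligned index by index with the predictions is what later allows the per-cluster breakdown without re-running inference.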


## 7) Overall Performance Reports

### Generated artifacts
- `classification_report.csv`
- `confusion_matrix.csv`
- `confusion_matrix_normalised.csv`
- `confusion_matrix_blue.png`

### Design choice
- **Blue colour theme** → consistent, paper-friendly
- Normalised confusion matrix → per-class error rates stay comparable under class imbalance

These outputs are immediately usable in the Results section.
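
As a reference for what the two CSV variants contain, the sketch below builds a raw count matrix and a row-normalised one. Normalising by true class (rows) is an assumption about the script's convention; the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """cm[t][p] counts samples with true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def normalise_rows(cm):
    """Divide each row by its total so each row shows per-class rates."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


cm = confusion_matrix([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0])
cm_norm = normalise_rows(cm)
```

With row normalisation the diagonal is the per-class recall, so a rare class with few samples is read on the same 0 to 1 scale as a frequent one.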


## 8) Calibration: ECE & Reliability Curves

### Why calibration matters
A model can be accurate but **over‑confident or under‑confident**.
In governance settings, confidence miscalibration can distort decision‑making.

### Implemented
- Expected Calibration Error (ECE)
- Reliability curves per class:
  - For
  - Against
  - Abstain
- Probability histograms

### Outputs
- `prob_calibration_ece.csv`
- `reliability_*_blue.png`
- `prob_hist_*_blue.png`

This directly strengthens the *trustworthiness* argument of your agent.
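
For reference, ECE is commonly computed as the sample-weighted gap between accuracy and mean confidence over confidence bins, $\mathrm{ECE} = \sum_b \frac{n_b}{N}\,\lvert \mathrm{acc}_b - \mathrm{conf}_b \rvert$. The sketch below implements that top-label form with equal-width bins; the bin count of 10 and the top-label variant are assumptions, and the script may bin per class instead.

```python
def expected_calibration_error(probs, y_true, n_bins=10):
    """Top-label ECE with equal-width confidence bins over [0, 1]."""
    n = len(y_true)
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for p, t in zip(probs, y_true):
        conf = max(p)                 # confidence of the predicted class
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        conf_sum[b] += conf
        hits[b] += int(pred == t)
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            acc_b = hits[b] / count[b]
            conf_b = conf_sum[b] / count[b]
            ece += (count[b] / n) * abs(acc_b - conf_b)
    return ece


# Two predictions at 0.9 confidence, one right and one wrong.
ece = expected_calibration_error([[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]], [0, 1])
```

In this toy case the model claims 90% confidence but is right half the time, so the gap, and therefore the ECE, is 0.4.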


## 9) Per‑Cluster Evaluation (Fairness Analysis)

### Motivation
Clusters represent **behavioural voter types** (from Agent 1).

Evaluating only overall metrics can hide:
- systematic under‑performance on minority clusters
- governance fairness issues

### What is computed per cluster
- Macro‑Precision
- Macro‑Recall
- Macro‑F1
- Accuracy
- Confusion matrices (raw + normalised)

### Outputs
- `confusion_matrix_cluster_{k}.csv`
- `confusion_matrix_cluster_{k}_normalised.csv`
- `confusion_matrix_cluster_{k}_blue.png`

This is a **core contribution** of your dissertation.
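
The per-cluster metrics listed above can be sketched as follows. Macro-F1 is taken as the unweighted mean of per-class F1 and undefined ratios are set to 0.0; both conventions, and the function names, are assumptions about the script.

```python
def macro_metrics(y_true, y_pred, n_classes=3):
    """Unweighted mean of per-class precision, recall and F1, plus accuracy."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0
    return {
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
        "accuracy": accuracy,
    }


def metrics_by_cluster(y_true, y_pred, clusters):
    """Apply macro_metrics to the subset of windows belonging to each cluster."""
    out = {}
    for k in sorted(set(clusters)):
        idx = [i for i, c in enumerate(clusters) if c == k]
        out[k] = macro_metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return out


by_cluster = metrics_by_cluster([0, 1, 2, 0], [0, 1, 2, 1], [0, 0, 0, 1])
```

Here cluster 0 is classified perfectly while cluster 1 is entirely wrong, a gap that the pooled accuracy of 0.75 would hide.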


## 10) Macro‑Level Summary Table (Paper‑Ready)

### Generated files
- `macro_by_cluster.csv`
- `latex_macro_by_cluster.tex`

The LaTeX table uses:
- `booktabs`
- fixed precision
- consistent ordering

This table can be dropped **directly into Overleaf** with no edits, provided the preamble loads the `booktabs` package.

It summarises overall and per-cluster performance side by side, making cross-group comparison explicit.
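
A table with these properties can be produced by a few lines of string formatting. The sketch below is illustrative only: the column set, the row labels, and the function name `to_latex_table` are assumptions, and the numbers in the demo are placeholders rather than results.

```python
def to_latex_table(rows, decimals=3):
    """Render (name, metrics) pairs as a booktabs tabular with fixed precision."""
    keys = ("precision", "recall", "f1", "accuracy")
    lines = [
        r"\begin{tabular}{lrrrr}",
        r"\toprule",
        r"Group & Precision & Recall & F1 & Accuracy \\",
        r"\midrule",
    ]
    for name, metrics in rows:  # rows are emitted in the order given
        cells = [name] + [f"{metrics[k]:.{decimals}f}" for k in keys]
        lines.append(" & ".join(cells) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)


# Placeholder values, not measured results.
demo = {"precision": 0.5, "recall": 0.25, "f1": 1 / 3, "accuracy": 0.75}
table = to_latex_table([("Overall", demo), ("Cluster 0", demo)])
print(table)
```

Passing the rows as an ordered list, with the overall row first, is what keeps the ordering consistent between runs.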


## 11) Review Notes (Strengths & Improvements)

### ✅ Strengths
- Full separation of training vs evaluation
- Strong reproducibility guarantees
- Rich evaluation beyond accuracy
- Cluster‑aware fairness analysis
- Publication‑ready outputs (CSV + PNG + LaTeX)

### ⚠️ Possible Extensions
1. Add **bootstrap confidence intervals** for macro‑F1
2. Evaluate **temperature‑scaled logits** explicitly
3. Add **per‑cluster ECE**
4. Log runtime & GPU memory usage
5. Automate figure numbering to match dissertation sections

These are optional but would further strengthen a journal submission.


## 12) Summary

This evaluation module transforms Agent 2 from:
> *“a trained model”*  
into  
> **“a scientifically evaluated, interpretable, and fair governance agent.”**

It provides:
- rigorous validation
- fairness diagnostics
- calibration analysis
- paper‑ready artifacts

This is exactly the level expected for a **master’s dissertation and beyond**.
