# Agent 2 — Module Walkthrough (Code + Review)
## Metrics (`metrics.py`)

**Author:** Summer Xiong  
**Goal:** Explain the evaluation metric used in Agent 2, why it is chosen, and how it should be interpreted in the context of DAO vote prediction.

This module defines a single function:
- `macro_prf`: computes **macro-averaged precision, recall, and F1-score**

> **Key idea:** In multi-class vote prediction (For / Against / Abstain), class imbalance is common.  
> Macro-averaged metrics treat each class equally, making them suitable for fairness-oriented evaluation.


## 0) Imports

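- `typing.Dict`: return-type annotation for `macro_prf`
- `numpy`: builds the label arrays in the sanity check; `macro_prf` itself does not call it (the cast to `float` is plain Python)
- `sklearn.metrics.precision_recall_fscore_support`: standard multi-class PRF computation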


In [None]:
from typing import Dict
import numpy as np
from sklearn.metrics import precision_recall_fscore_support


## 1) `macro_prf(y_true, y_pred)`

```python
def macro_prf(y_true, y_pred) -> Dict[str, float]:
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"precision": float(p), "recall": float(r), "f1": float(f)}
```

### What this function does
Computes **macro-averaged**:
- Precision
- Recall
- F1-score

across all classes.

### Input / Output
- **Input**
  - `y_true`: ground-truth labels, shape `(N,)`
  - `y_pred`: predicted labels, shape `(N,)`
- **Output**
  - dictionary with keys: `precision`, `recall`, `f1`


In [None]:
def macro_prf(y_true, y_pred) -> Dict[str, float]:
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"precision": float(p), "recall": float(r), "f1": float(f)}


## 2) Why Macro-Averaged Metrics?

### Class imbalance in DAO voting
In many DAOs:
- **FOR** votes dominate
- **AGAINST** and **ABSTAIN** are relatively rare

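If you used **micro-averaged** or accuracy-based metrics (for single-label multi-class prediction, micro-averaged precision, recall, and F1 all equal accuracy):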
- a model predicting mostly FOR could look artificially strong
- minority-class performance would be hidden
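
To make the risk concrete, the sketch below scores a hypothetical baseline that always predicts FOR. The label encoding (0 = FOR, 1 = AGAINST, 2 = ABSTAIN) and the 90/7/3 vote split are assumptions chosen for illustration, not figures from the Agent 2 dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic labels for illustration only (assumed encoding):
# 0 = FOR, 1 = AGAINST, 2 = ABSTAIN
y_true = np.array([0] * 90 + [1] * 7 + [2] * 3)
y_pred = np.zeros_like(y_true)  # degenerate baseline: always predict FOR

accuracy = accuracy_score(y_true, y_pred)
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f}  macro_recall={macro_r:.3f}  macro_f1={macro_f1:.3f}")
```

On this toy split the baseline reaches 0.90 accuracy, while its macro recall is only 1/3 because two of the three classes are never recovered.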

### Macro averaging
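Macro-averaged PRF:
- computes precision/recall/F1 **per class**
- then averages them equally across classes
- reports macro F1 as the mean of the per-class F1 scores, not as the harmonic mean of macro precision and macro recall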
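
As a sketch of these steps, `average=None` asks sklearn for the per-class scores, and their unweighted mean reproduces the `average="macro"` result. The toy labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for this example
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 2, 1])

# Step 1: one precision/recall/F1 value per class
per_p, per_r, per_f, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# Step 2: sklearn's macro average
macro_p, macro_r, macro_f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# The unweighted mean of the per-class scores matches the macro scores
print(per_p.mean(), macro_p)
print(per_r.mean(), macro_r)
print(per_f.mean(), macro_f)

# Macro F1 is NOT the harmonic mean of macro precision and macro recall
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)
print(harmonic, macro_f)
```

Because each class contributes one equally weighted term, a class with two samples moves the score as much as a class with two thousand.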

This aligns well with:
- fairness considerations
- governance analysis
- your dissertation theme (efficiency *and* fairness)


## 3) `zero_division=0`: What it Means

### The problem
If a class is never predicted:
- precision for that class is undefined (division by zero)

If a class never appears in `y_true`:
- recall for that class is undefined for the same reason

### Your choice
```python
zero_division=0
```

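This means:
- undefined precision/recall is set to 0 silently; sklearn's default, `zero_division="warn"`, also uses 0 but emits an `UndefinedMetricWarning` each time (it does not raise an error)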

### Why this is reasonable
- Penalises models that completely ignore a class
- Keeps training/evaluation logs free of repeated warnings
- Avoids the metric inflation that `zero_division=1` would cause by scoring an ignored class as perfect
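
The effect of the setting can be seen on a small made-up example in which two classes are never predicted. This is a sketch; `macro_precision` is a hypothetical helper used only here.

```python
import warnings

import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([0, 0, 1, 2])
y_pred = np.array([0, 0, 0, 0])  # classes 1 and 2 are never predicted


def macro_precision(zero_division):
    """Hypothetical helper: macro precision under a given zero_division setting."""
    p, _, _, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=zero_division
    )
    return float(p)


# sklearn's default ("warn") scores the undefined terms as 0 and warns
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    p_default = macro_precision("warn")
warned = any(issubclass(w.category, UndefinedMetricWarning) for w in caught)

p_zero = macro_precision(0)  # same value as the default, without the warning
p_one = macro_precision(1)   # undefined terms count as perfect

print(p_default, warned, p_zero, p_one)
```

With `zero_division=1`, the two ignored classes each contribute a precision of 1, so the macro precision rises from 1/6 to 5/6 even though the predictions are unchanged.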


## 4) How This Metric Fits into Agent 2

### Typical usage pattern
During validation or testing:
1. Collect logits from the model
2. Convert logits → predicted class ids via `argmax`
3. Call `macro_prf(y_true, y_pred)`
4. Log or store the resulting dictionary
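
The four steps can be sketched as follows, assuming the logits are already available as a NumPy array of shape `(N, num_classes)`. How they are collected depends on the training code, which is not shown here; the logit values, the `history` list, and the epoch number are hypothetical.

```python
from typing import Dict

import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def macro_prf(y_true, y_pred) -> Dict[str, float]:
    # Same definition as in metrics.py, repeated so this cell runs on its own
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"precision": float(p), "recall": float(r), "f1": float(f)}


# 1. Collected logits (hypothetical values), one row per sample
logits = np.array([
    [2.1, 0.3, -1.0],
    [0.2, 1.5, 0.1],
    [1.2, 0.9, 0.4],
    [-0.5, 0.0, 1.8],
])
y_true = np.array([0, 1, 1, 2])

# 2. Logits -> predicted class ids
y_pred = logits.argmax(axis=1)

# 3. Compute the metric
val_metrics = macro_prf(y_true, y_pred)

# 4. Store it, here in a plain list of per-epoch records
history = []
history.append({"epoch": 1, **val_metrics})
print(history[-1])
```

Because `argmax` is unaffected by any positive rescaling of the logits, temperature scaling changes the predicted probabilities but not these scores.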

This metric should be reported:
- per epoch (validation)
- for final test evaluation
- optionally **per cluster** (using `cluster_id`) for fairness analysis
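
A minimal sketch of the per-cluster variant, assuming `cluster_id` is an integer array aligned with `y_true` and `y_pred`. The helper name `macro_prf_by_cluster` and the toy data are hypothetical. One caveat: sklearn averages over the labels that occur in each subset, so clusters containing different sets of classes are not strictly comparable. Passing `labels=` would fix the class set, at the cost of scoring absent classes as 0.

```python
from typing import Dict

import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def macro_prf_by_cluster(y_true, y_pred, cluster_id) -> Dict[int, Dict[str, float]]:
    """Hypothetical helper: macro PRF computed separately for each cluster."""
    results = {}
    for c in np.unique(cluster_id):
        mask = cluster_id == c  # samples belonging to this cluster
        p, r, f, _ = precision_recall_fscore_support(
            y_true[mask], y_pred[mask], average="macro", zero_division=0
        )
        results[int(c)] = {
            "precision": float(p),
            "recall": float(r),
            "f1": float(f),
            "n": int(mask.sum()),  # cluster size, to judge how reliable the score is
        }
    return results


# Toy data, invented for this example
y_true = np.array([0, 0, 1, 2, 0, 0, 0, 1])
y_pred = np.array([0, 0, 1, 2, 0, 0, 0, 0])
cluster_id = np.array([0, 0, 0, 0, 1, 1, 1, 1])

results = macro_prf_by_cluster(y_true, y_pred, cluster_id)
for c, scores in results.items():
    print(c, scores)
```

Reporting `n` next to each score helps avoid over-reading differences between very small clusters.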


## 5) Minimal Sanity Check Example

In [None]:
# Simple sanity check
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 2, 2])

macro_prf(y_true, y_pred)
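
For reference, the expected result can be derived by hand. Class 0 is predicted three times with two hits (precision 2/3, recall 1). Class 1 is predicted once, correctly, but one of its two samples is missed (precision 1, recall 1/2). Class 2 is perfect. The sketch below checks `macro_prf` against those hand-computed averages; the function is repeated only so the cell runs on its own.

```python
from typing import Dict

import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def macro_prf(y_true, y_pred) -> Dict[str, float]:
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"precision": float(p), "recall": float(r), "f1": float(f)}


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 2, 2])

# Hand-computed per-class values, averaged with equal weight
expected = {
    "precision": (2 / 3 + 1 + 1) / 3,  # 8/9
    "recall": (1 + 1 / 2 + 1) / 3,     # 5/6
    "f1": (4 / 5 + 2 / 3 + 1) / 3,     # 37/45
}

result = macro_prf(y_true, y_pred)
for key, value in expected.items():
    assert np.isclose(result[key], value), (key, result[key], value)
```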


## 6) Review Notes & Possible Extensions

### ✅ Strengths
- Minimal, clear, and robust
- Uses a well-understood sklearn implementation
- Macro averaging aligns with imbalance-aware evaluation

### ⚠️ Potential Extensions
1) **Per-class metrics**
   - return precision/recall/F1 per class for deeper diagnostics
2) **Confusion matrix**
   - useful for qualitative error analysis
3) **Calibration metrics**
   - Expected Calibration Error (ECE)
   - complements your temperature scaling in the model
4) **Cluster-conditional metrics**
   - compute macro PRF per `cluster_id` to study behavioural fairness

These extensions can live in the same `metrics.py` or a separate evaluation module.
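
As a starting point for the first two extensions, here is a minimal sketch. The helper name `per_class_report` and the class-name ordering (id 0 = FOR, 1 = AGAINST, 2 = ABSTAIN) are assumptions, not part of the current `metrics.py`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

CLASS_NAMES = ["FOR", "AGAINST", "ABSTAIN"]  # assumed order of class ids 0, 1, 2


def per_class_report(y_true, y_pred, class_names=CLASS_NAMES):
    """Hypothetical helper: precision/recall/F1/support for each named class."""
    labels = list(range(len(class_names)))
    p, r, f, support = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average=None, zero_division=0
    )
    return {
        name: {
            "precision": float(p[i]),
            "recall": float(r[i]),
            "f1": float(f[i]),
            "support": int(support[i]),
        }
        for i, name in enumerate(class_names)
    }


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 2, 2])

report = per_class_report(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(report["AGAINST"])
print(cm)
```

Fixing `labels` explicitly keeps the report and the matrix the same shape even when a class is missing from a particular evaluation split.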


## 7) Summary

This module defines the core evaluation metric for Agent 2.  
By using **macro-averaged precision, recall, and F1**, it ensures that:
- minority voting behaviours are not ignored
- reported performance reflects balanced predictive ability
- evaluation aligns with governance fairness objectives

It is a simple but methodologically sound choice.
