# Agent 2 — Module Walkthrough (Code + Review)
## Data Loading & Preprocessing (`data_loading.py`)

**Author:** Summer Xiong  
**Goal:** Explain the Agent 2 data ingestion pipeline: reading multiple CSVs, normalising schema, selecting features, and splitting train/validation.

This module defines:
- `load_and_merge`: load one or many CSV files and concatenate them  
- `normalise_columns`: standardise column names/types into a canonical schema  
- `select_numeric_columns`: choose safe per-step numeric inputs for the model  
- `split_by_voter`: create train/validation splits **by voter** (prevents leakage)

> **Key idea:** Agent 2 should learn to predict votes from *text + per-step numeric signals* without data leakage from the same voter appearing in both train and validation.


## 0) Imports, Constants, and Label Mapping

### VALID_LABELS
```python
VALID_LABELS = {"FOR":0, "AGAINST":1, "ABSTAIN":2}
```

This defines a canonical class mapping:
- **FOR → 0**
- **AGAINST → 1**
- **ABSTAIN → 2**

Why this matters:
- ensures consistent labels across files / platforms  
- makes the classification target stable for modelling and evaluation
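
A fixed mapping also means predicted class ids can be decoded back into label strings at evaluation time. A minimal sketch, assuming an inverse lookup named `ID_TO_LABEL` (a hypothetical name, not part of the module):

```python
VALID_LABELS = {"FOR": 0, "AGAINST": 1, "ABSTAIN": 2}

# Hypothetical helper: inverse lookup from class id back to label string.
ID_TO_LABEL = {idx: name for name, idx in VALID_LABELS.items()}

# The ids should be exactly 0..n-1 so they can index model outputs directly.
assert sorted(ID_TO_LABEL) == list(range(len(VALID_LABELS)))

predicted_ids = [2, 0, 1]  # illustrative values, not real predictions
decoded = [ID_TO_LABEL[i] for i in predicted_ids]
print(decoded)
```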


In [None]:
from typing import Tuple, List, Dict
from pathlib import Path

import pandas as pd
import numpy as np

VALID_LABELS = {"FOR": 0, "AGAINST": 1, "ABSTAIN": 2}
VALID_LABELS


## 1) `load_and_merge(csv_paths)`

### Purpose
Loads multiple CSV sources (e.g., exports from different DAOs or time ranges) and concatenates them into a single DataFrame.

### Inputs / Outputs
- **Input:** `csv_paths: List[Path]`
- **Output:** single DataFrame containing all rows

### Review notes
✅ Strength: simple and robust concatenation  
⚠️ Improvement: optionally validate that all frames share the same columns; `pd.concat` takes the union of columns, so a column missing from one file is silently filled with NaN for that file's rows
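
The column check suggested above could look roughly like this. It is a sketch only: `report_column_mismatches` and the file names are hypothetical, and the frames are built in memory rather than read from disk.

```python
from typing import Dict, Set

import pandas as pd


def report_column_mismatches(frames: Dict[str, pd.DataFrame]) -> Dict[str, Set[str]]:
    """Hypothetical helper: per source, list columns it lacks relative to the union."""
    all_cols: Set[str] = set()
    for df in frames.values():
        all_cols |= set(df.columns)
    report = {}
    for name, df in frames.items():
        missing = all_cols - set(df.columns)
        if missing:
            report[name] = missing
    return report


# Two toy exports; the second one has no "Voting Power" column.
frames = {
    "part1.csv": pd.DataFrame({"Voter": ["0xa"], "Vote Label": ["FOR"], "Voting Power": [10.0]}),
    "part2.csv": pd.DataFrame({"Voter": ["0xb"], "Vote Label": ["AGAINST"]}),
}
missing = report_column_mismatches(frames)
print(missing)
```

Whether a mismatch should raise or only be logged depends on how heterogeneous the exports are expected to be.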


In [None]:
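def load_and_merge(csv_paths: List[Path]) -> pd.DataFrame:
    """Read each CSV and stack the rows into one DataFrame with a fresh index."""
    if not csv_paths:
        # pd.concat fails on an empty list; raise a clearer error up front.
        raise ValueError("csv_paths is empty")
    frames = []
    for p in csv_paths:
        df = pd.read_csv(p)
        frames.append(df)
    # Columns are unioned: a column absent from one file becomes NaN for its rows.
    df = pd.concat(frames, ignore_index=True)
    return df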


## 2) `normalise_columns(df)`

### Purpose
This is the **core schema normaliser**. It converts different raw CSV formats into a canonical, model-ready schema.

### Canonical output columns created
- `voter` (string)  
- `vote_ts` (datetime, UTC; `NaT` where parsing fails)  
- `label_id` (nullable `Int64`: 0/1/2, `<NA>` for unrecognised labels)  
- `text` (string)  
- `vp` (float)  
- `vp_share` (float 0–1)  
- `is_whale` (bool)  
- `aligned_majority` (bool)  
- `cluster_id` (int)  

This design allows you to ingest data from slightly different exports where column names differ.

---

### Step-by-step logic
#### (A) Canonical voter column
- Accepts either `voter` or `Voter` and casts it to string  
- Raises `ValueError` if neither exists

#### (B) Timestamp
- Uses `Vote Timestamp` if present, otherwise falls back to `Created Time`  
- Parses to timezone-aware UTC with `pd.to_datetime(..., utc=True, errors="coerce")`, so unparseable values become `NaT`

#### (C) Label mapping
- Requires `Vote Label` (raises `ValueError` if absent)
- Normalises strings (upper + strip)
- Maps to a numeric class id using `VALID_LABELS`; labels outside the mapping become `<NA>`

#### (D) Text column
- Uses `Proposal Title` if present else `Choice_Text`
- Stores into canonical `text`

#### (E) Numeric features
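- Voting power (`vp`) from `Voting Power` if present, otherwise NaN  
- Voting power share (`vp_share`) from `VP Ratio (%)` if present (divided by 100 to convert percent to 0–1), otherwise NaN

#### (F) Boolean flags
- Converts string-like values to bool using `to_bool`: the strings `1`, `true`, `yes`, `y`, `t` (case-insensitive, whitespace-stripped) map to True, any other string to False
- Defaults the flag to False when the source column (`Is Whale`, `Aligned With Majority`) is absent

#### (G) Cluster id
- Uses `cluster` if available, with unparseable values filled as 0; otherwise defaults to 0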
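
To make steps (B) and (E) concrete, here is a small worked example on made-up rows. The input values are illustrative only; the column names and conversions follow the steps above.

```python
import pandas as pd

# Toy export with no "Vote Timestamp" column and one bad value per column.
raw = pd.DataFrame({
    "Created Time": ["2024-01-05 12:00:00", "not a date"],
    "VP Ratio (%)": ["12.5", "n/a"],
})

# (B) Fall back to "Created Time"; unparseable strings are coerced to NaT.
tcol = "Vote Timestamp" if "Vote Timestamp" in raw.columns else "Created Time"
vote_ts = pd.to_datetime(raw[tcol], utc=True, errors="coerce")

# (E) Percent to fraction; unparseable strings are coerced to NaN.
vp_share = pd.to_numeric(raw["VP Ratio (%)"], errors="coerce") / 100.0

print(vote_ts)
print(vp_share)
```

Because both conversions coerce rather than raise, it is worth counting `NaT`/`NaN` values after normalisation to see how much of a file failed to parse.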

---

### Review notes (very important)
✅ Strengths
- Handles messy real-world CSV schemas
- Produces a consistent DataFrame interface for downstream window building
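- Raises an explicit `ValueError` if the voter or label column is missing (the timestamp and text fallbacks are unguarded, so a file lacking both candidate columns fails with a `KeyError`)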

⚠️ Potential issues / improvements
1) **Timestamp ambiguity**: falling back to `Created Time` may represent proposal creation rather than vote time.
   - Consider logging which timestamp column was used per file.
2) **Label robustness**: `.map(VALID_LABELS)` will produce NA for unexpected labels (e.g. "YES/NO", typos).
   - Consider validating and dropping/flagging unknown labels before `.astype("Int64")`.
3) **Booleans default**: returning `False` when the column is absent may conflate “unknown” with “false”.
   - Consider pandas' nullable `boolean` dtype with `pd.NA`, or explicit missingness flags; plain `np.nan` would turn the column into floats or objects.
4) **Cluster fallback**: defaulting missing clusters to `0` can be misleading.
   - Consider using `-1` for “unknown cluster”.
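
Suggestions 2 to 4 could be sketched as follows. This is one possible reading, not the module's behaviour: `to_nullable_bool` and the `FALSE_STRINGS` set are assumptions introduced here, and the input rows are made up.

```python
import pandas as pd

VALID_LABELS = {"FOR": 0, "AGAINST": 1, "ABSTAIN": 2}
TRUE_STRINGS = {"1", "true", "yes", "y", "t"}
FALSE_STRINGS = {"0", "false", "no", "n", "f"}  # assumption: not defined in the module


def to_nullable_bool(x):
    """Hypothetical variant of to_bool: pd.NA for missing or unrecognised values."""
    if pd.isna(x):
        return pd.NA
    s = str(x).strip().lower()
    if s in TRUE_STRINGS:
        return True
    if s in FALSE_STRINGS:
        return False
    return pd.NA


raw = pd.DataFrame({
    "Vote Label": [" for ", "YES", "Abstain"],
    "Is Whale": ["true", None, "no"],
    "cluster": ["3", "x", None],
})

# (2) Flag labels outside the mapping before casting to Int64.
lab = raw["Vote Label"].astype(str).str.upper().str.strip()
unknown = ~lab.isin(list(VALID_LABELS))
label_id = lab.map(VALID_LABELS).astype("Int64")

# (3) Keep "unknown" distinct from False with the nullable boolean dtype.
is_whale = raw["Is Whale"].map(to_nullable_bool).astype("boolean")

# (4) Use -1 rather than 0 for an unknown cluster.
cluster_id = pd.to_numeric(raw["cluster"], errors="coerce").fillna(-1).astype(int)

print(unknown.tolist(), label_id.tolist(), is_whale.tolist(), cluster_id.tolist())
```

Downstream code would then need to decide how to treat `<NA>` flags and cluster `-1`, for example by dropping those rows or adding indicator features.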


In [None]:
def normalise_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Standardise key columns and types
    df = df.copy()

    # canonical voter
    if "voter" in df.columns:
        df["voter"] = df["voter"].astype(str)
    elif "Voter" in df.columns:
        df["voter"] = df["Voter"].astype(str)
    else:
        raise ValueError("Missing voter column")

    # time
    tcol = "Vote Timestamp" if "Vote Timestamp" in df.columns else "Created Time"
    df["vote_ts"] = pd.to_datetime(df[tcol], utc=True, errors="coerce")

    # label
    if "Vote Label" in df.columns:
        lab = df["Vote Label"].astype(str).str.upper().str.strip()
        df["label_id"] = lab.map(VALID_LABELS).astype("Int64")
    else:
        raise ValueError("Missing 'Vote Label' column")

    # text
    text_col = "Proposal Title" if "Proposal Title" in df.columns else "Choice_Text"
    df["text"] = df[text_col].astype(str)

    # numeric basics
    if "Voting Power" in df.columns:
        df["vp"] = pd.to_numeric(df["Voting Power"], errors="coerce")
    else:
        df["vp"] = np.nan

    if "VP Ratio (%)" in df.columns:
        df["vp_share"] = pd.to_numeric(df["VP Ratio (%)"], errors="coerce") / 100.0
    else:
        df["vp_share"] = np.nan

    # booleans
    def to_bool(x):
        # Missing cells (None/NaN) count as False; bool(float("nan")) would be True.
        if pd.isna(x):
            return False
        if isinstance(x, str):
            return x.strip().lower() in ("1", "true", "yes", "y", "t")
        return bool(x)

    # A missing source column defaults the whole flag to False.
    df["is_whale"] = df["Is Whale"].apply(to_bool) if "Is Whale" in df.columns else False
    df["aligned_majority"] = df["Aligned With Majority"].apply(to_bool) if "Aligned With Majority" in df.columns else False

    # cluster id provided
    if "cluster" in df.columns:
        df["cluster_id"] = pd.to_numeric(df["cluster"], errors="coerce").fillna(0).astype(int)
    else:
        df["cluster_id"] = 0

    return df


## 3) `select_numeric_columns(df)`

### Purpose
Selects a list of **safe per-step numeric features** to include in each time step/window.

Current behaviour:
- Includes `vp` and `vp_share` if present

### Review notes
✅ Good: avoids global/future aggregated features → reduces leakage risk  
⚠️ Consider adding booleans as numeric features (0/1) for consistency: `is_whale`, `aligned_majority`
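
One way to expose the flags as 0/1 features is sketched here. The function name and the `_num` suffix are hypothetical, and the helper adds columns to `df` in place, which the original function does not do.

```python
from typing import List

import pandas as pd


def select_numeric_columns_with_flags(df: pd.DataFrame) -> List[str]:
    """Hypothetical variant: also exposes boolean flags as 0/1 float columns."""
    cols = [name for name in ["vp", "vp_share"] if name in df.columns]
    for flag in ["is_whale", "aligned_majority"]:
        if flag in df.columns:
            num_name = f"{flag}_num"  # assumed naming convention
            df[num_name] = df[flag].astype(float)  # True -> 1.0, False -> 0.0
            cols.append(num_name)
    return cols


df = pd.DataFrame({"vp": [10.0, 2.5], "is_whale": [True, False]})
cols = select_numeric_columns_with_flags(df)
print(cols)
```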


In [None]:
def select_numeric_columns(df: pd.DataFrame) -> List[str]:
    cols = []
    for name in ["vp", "vp_share"]:
        if name in df.columns:
            cols.append(name)
    return cols


## 4) `split_by_voter(df, train_frac=0.8, seed=42)`

### Purpose
Splits into train/validation by **voter**, not by rows.

### Why this is crucial (anti-leakage)
A row-wise split can place votes from the same voter in both sets, so the model is validated on voters it has already seen and the metrics look over-optimistic.  
A voter-wise split is more defensible if your goal is generalisation to unseen voters.
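
The difference can be seen on toy data. The sketch below uses made-up voters (six voters, four votes each) and compares how many voters end up in both sets under each strategy; it does not call the module's own functions.

```python
import numpy as np
import pandas as pd

# Toy data: 6 voters with 4 votes each (24 rows).
df = pd.DataFrame({"voter": np.repeat(["a", "b", "c", "d", "e", "f"], 4)})
rng = np.random.default_rng(0)

# Row-wise split: shuffle row positions, keep 80% for training.
idx = rng.permutation(len(df))
n_train_rows = int(len(df) * 0.8)  # 19 training rows, 5 validation rows
row_train = df.iloc[idx[:n_train_rows]]
row_valid = df.iloc[idx[n_train_rows:]]
shared_rowwise = set(row_train["voter"]) & set(row_valid["voter"])

# Voter-wise split: shuffle voter ids, then assign all rows of a voter together.
voters = rng.permutation(np.array(sorted(set(df["voter"]))))
train_v = set(voters[: int(len(voters) * 0.8)])
grp_train = df[df["voter"].isin(train_v)]
grp_valid = df[~df["voter"].isin(train_v)]
shared_voterwise = set(grp_train["voter"]) & set(grp_valid["voter"])

print("row-wise shared voters:", len(shared_rowwise))
print("voter-wise shared voters:", len(shared_voterwise))
```

With 5 validation rows and only 4 votes per voter, the row-wise validation set must contain at least one voter who also has rows in training, so the first count is at least 1. The second count is 0 by construction.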

### Outputs
- `train_df`, `valid_df`


In [None]:
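def split_by_voter(
    df: pd.DataFrame, train_frac: float = 0.8, seed: int = 42
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows so that each voter lands entirely in train or entirely in valid."""
    # Materialise as a plain NumPy array so rng.shuffle works whatever the column dtype.
    voters = np.array(df["voter"].dropna().unique(), dtype=object)
    rng = np.random.default_rng(seed)  # seeded for reproducible splits
    rng.shuffle(voters)
    n_train = int(len(voters) * train_frac)  # rounds down; valid gets the remainder
    train_v = set(voters[:n_train])
    train_df = df[df["voter"].isin(train_v)].copy()
    # Everything not in train (including rows with a missing voter) goes to valid.
    valid_df = df[~df["voter"].isin(train_v)].copy()
    return train_df, valid_df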


## 5) Minimal Sanity Check (Template)

Once you have real CSV(s), this is the intended workflow:

1. `load_and_merge` → `df_raw`  
2. `normalise_columns` → `df_norm`  
3. `select_numeric_columns` → `num_cols`  
4. `split_by_voter` → train/valid  

Also verify there is **no voter overlap** across splits.
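
The overlap check can be wrapped in a small helper that also confirms no rows were lost in the split. This is a sketch: `check_split` is a hypothetical name, and the toy frames stand in for the output of `split_by_voter`.

```python
import pandas as pd


def check_split(df, train_df, valid_df, col="voter"):
    """Hypothetical helper: summarise a split and fail loudly on voter overlap."""
    overlap = set(train_df[col]) & set(valid_df[col])
    if overlap:
        raise AssertionError(f"{len(overlap)} voter(s) appear in both splits")
    return {
        "rows_unassigned": len(df) - len(train_df) - len(valid_df),
        "train_voters": train_df[col].nunique(),
        "valid_voters": valid_df[col].nunique(),
    }


# Toy stand-in for a normalised frame and its voter-wise split.
df = pd.DataFrame({"voter": ["a", "a", "b", "c"], "vp": [1.0, 2.0, 3.0, 4.0]})
train_df = df[df["voter"].isin({"a", "b"})]
valid_df = df[~df["voter"].isin({"a", "b"})]
stats = check_split(df, train_df, valid_df)
print(stats)
```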


In [None]:
# --- TEMPLATE ONLY (edit paths to your data) ---
# paths = [Path("data") / "votes_part1.csv", Path("data") / "votes_part2.csv"]
# df_raw = load_and_merge(paths)
# df_norm = normalise_columns(df_raw)
# num_cols = select_numeric_columns(df_norm)
# train_df, valid_df = split_by_voter(df_norm, train_frac=0.8, seed=42)
#
# print("Raw:", df_raw.shape)
# print("Norm:", df_norm.shape)
# print("Numeric columns:", num_cols)
# print("Train:", train_df.shape, "Valid:", valid_df.shape)
#
# overlap = set(train_df["voter"]).intersection(set(valid_df["voter"]))
# print("Voter overlap:", len(overlap))


## 6) Summary

This module standardises heterogeneous raw vote exports into a canonical schema and creates leakage-resistant splits.

**Most important design choice:** voter-wise splitting is a strong methodological safeguard for Agent 2 evaluation.
