# Agent 2 — Module Walkthrough (Code + Review)
## Sliding Window Builder (`sliding_window.py`)

**Author:** Summer Xiong  
**Goal:** Explain how Agent 2 constructs **per-voter sliding windows** of length `W` (window_size) in chronological order.

This module defines:
- `Window` dataclass: a *raw* window object (pre-tokenisation)  
- `_normalise_time_features`: converts a timestamp into four linearly normalised numeric features (hour, weekday, month, day-of-month)  
- `build_windows`: groups by voter → sorts by time → builds windows → creates step-aligned texts and numeric features

> **Key idea:** Each training example is a sequence of `W` steps:
> - steps `0..W-2` are **history** with a label prefix `[LABEL_i]`
> - step `W-1` is the **current proposal** with prefix `[PREDICT]`
>
> The target is the vote label at the current step.
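
The text layout described above can be sketched as a small standalone helper. This is an illustration only: the function name `format_window_texts` and the `use_label_prefix` flag are hypothetical and do not exist in the module.

```python
from typing import List

def format_window_texts(texts: List[str],
                        labels: List[int],
                        use_label_prefix: bool = True) -> List[str]:
    """Prefix history steps with [LABEL_i] and the final step with [PREDICT]."""
    *hist_texts, cur_text = texts
    out = []
    # zip stops at the shorter list, so the current step's label is never
    # written into the text.
    for text, lab in zip(hist_texts, labels):
        out.append(f"[LABEL_{lab}] {text}" if use_label_prefix else text)
    out.append(f"[PREDICT] {cur_text}")
    return out

print(format_window_texts(["p1", "p2", "p3"], [0, 1, 0]))
# ['[LABEL_0] p1', '[LABEL_1] p2', '[PREDICT] p3']
```

Setting `use_label_prefix=False` keeps the `[PREDICT]` marker but drops the history labels, which is one way to run a with/without-prefix comparison.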


## 0) Imports and `Window` Dataclass

### `Window`
A lightweight container (before tokenisation) holding:
- `window_texts`: list of `W` strings (history + current)  
- `window_features`: list of `W` numeric vectors (per-step, float32)  
- `target_label`: the label for the current step (0/1/2)  
- `voter_id`, `cluster_id`: metadata  
- `window_size`: stored for traceability


```python
from dataclasses import dataclass
from typing import List
import pandas as pd
import numpy as np

@dataclass
class Window:
    window_texts: List[str]            # list of W texts: [LABEL_*] for history, [PREDICT] for current
    window_features: List[np.ndarray]  # list of W numeric feature vectors (np.float32)
    target_label: int                  # int in {0:For, 1:Against, 2:Abstain}
    voter_id: str
    cluster_id: int
    window_size: int
```


## 1) Time Feature Normalisation: `_normalise_time_features(ts)`

```python
hour = ts.hour/23.0
wday = ts.weekday()/6.0
month = (ts.month-1)/11.0
day = (ts.day-1)/30.0
```

### What this does
Transforms a UTC timestamp into four features, each scaled linearly into `[0, 1]`:
- hour-of-day
- weekday
- month
- day-of-month (divided by 30, so months shorter than 31 days never reach 1.0)

### Why it exists
Agent 2 can learn behavioural seasonality patterns:
- some voters are more active on certain weekdays
- certain governance activity clusters around month boundaries

### Review note
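This is a *simple linear normalisation* (not cyclical encoding), so times that are adjacent across a boundary end up far apart: 23:00 maps to 1.0 and 00:00 maps to 0.0.  
If you want an encoding that respects periodicity, consider sin/cos features for periodic variables: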
- hour → (sin(2πh/24), cos(2πh/24))
- weekday → (sin(2πd/7), cos(2πd/7))
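
A minimal sketch of the sin/cos alternative, assuming the same timezone-aware `pd.Timestamp` input. The function name `cyclical_time_features` and the choice of periods (24, 7, 12, and the number of days in the timestamp's month) are illustrative assumptions, not part of the module.

```python
import numpy as np
import pandas as pd

def cyclical_time_features(ts: pd.Timestamp) -> np.ndarray:
    """Hypothetical sin/cos encoding of hour, weekday, month and day-of-month."""
    parts = [
        (ts.hour, 24),
        (ts.weekday(), 7),
        (ts.month - 1, 12),
        (ts.day - 1, ts.days_in_month),
    ]
    out = []
    for value, period in parts:
        angle = 2.0 * np.pi * value / period
        out.extend([np.sin(angle), np.cos(angle)])
    return np.array(out, dtype=np.float32)

late = cyclical_time_features(pd.Timestamp("2024-01-01 23:00", tz="UTC"))
early = cyclical_time_features(pd.Timestamp("2024-01-02 00:00", tz="UTC"))
# The (sin, cos) pair for the hour is now close for 23:00 and 00:00.
print(np.linalg.norm(late[:2] - early[:2]))
```

Note that this doubles the time block from 4 to 8 values per step, so any downstream code that assumes 4 time features would need to be adjusted.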


```python
def _normalise_time_features(ts: pd.Timestamp) -> np.ndarray:
    # ts is timezone-aware UTC
    hour = ts.hour / 23.0
    wday = ts.weekday() / 6.0
    month = (ts.month - 1) / 11.0
    day = (ts.day - 1) / 30.0
    return np.array([hour, wday, month, day], dtype=np.float32)
```


## 2) Core Function: `build_windows(...)`

### Purpose
Constructs training examples by:
1. filtering valid labels  
2. grouping by voter  
3. sorting votes by time  
4. sliding a window of length `W`  
5. building **step-aligned** inputs:
   - texts with special prefixes
   - numeric feature vectors per step
6. assigning the target label as the vote at the current step
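
The sliding step (step 4) is easiest to see on indices alone. The sketch below mirrors the index arithmetic used in `build_windows` for a single voter whose votes are already sorted by time; the helper name `window_indices` is hypothetical.

```python
from typing import List, Tuple

def window_indices(n_votes: int, window_size: int) -> List[Tuple[List[int], int]]:
    """Return (history_indices, current_index) pairs for one voter's sorted votes."""
    pairs = []
    for t in range(window_size - 1, n_votes):
        history = list(range(t - window_size + 1, t))
        pairs.append((history, t))
    return pairs

for history, current in window_indices(n_votes=5, window_size=3):
    print(history, "->", current)
# [0, 1] -> 2
# [1, 2] -> 3
# [2, 3] -> 4
```

A voter with `n` votes therefore contributes `n - W + 1` windows, and none at all when `n < W`.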

### Inputs (key arguments)
- `df`: DataFrame of vote events  
- `window_size (W)`: number of steps in each window  
- `text_col`, `label_col`, `voter_col`, `time_col`: schema column names  
- `cluster_col`: cluster id column name  
- `numeric_cols`: list of numeric feature columns to include per step

### Output
- `List[Window]` where each Window corresponds to one training example

---

### Why windowing is important (methodology)
This turns a set of independent vote events into a **sequence modelling problem**:
- history steps include label information as textual cues (`[LABEL_i]`)
- current step is marked as `[PREDICT]`

This design encourages the model to learn *behavioural conditioning*:
> “Given this voter’s recent behaviour (and proposal texts), what will they vote now?”


```python
def build_windows(df: pd.DataFrame,
                  window_size: int,
                  text_col: str,
                  label_col: str,
                  voter_col: str,
                  time_col: str,
                  cluster_col: str,
                  numeric_cols: List[str]) -> List[Window]:
    """
    Build windows by grouping per voter, sorting by time, then sliding windows of length W.
    Only rows with labels in {FOR, AGAINST, ABSTAIN} (mapped to {0,1,2}) are kept.
    """
    # Filter valid labels
    df = df.copy()
    df = df[df[label_col].isin([0, 1, 2])]

    out: List[Window] = []

    # group by voter
    for voter_id, g in df.groupby(voter_col):
        g = g.sort_values(time_col).reset_index(drop=True)
        if len(g) < window_size:
            continue

        # cluster id: take mode or last known in window
        cluster_vals = g[cluster_col].astype(int).tolist() if cluster_col in g.columns else [0] * len(g)

        # slide
        for t in range(window_size - 1, len(g)):
            # history indices [t-W+1 .. t-1], current t
            hist_idx = list(range(t - window_size + 1, t))
            cur_idx = t

            # texts: history with [LABEL_i] prefix, current with [PREDICT]
            texts = []
            for idx in hist_idx:
                lab = int(g.loc[idx, label_col])
                prefix = f"[LABEL_{lab}] "
                texts.append(prefix + str(g.loc[idx, text_col]))
            texts.append("[PREDICT] " + str(g.loc[cur_idx, text_col]))

            # numeric features per step (aligned to each step)
            feats = []
            for idx in hist_idx + [cur_idx]:
                row = g.loc[idx]
                vec = []
                # basic numeric cols (already numeric)
                for col in numeric_cols:
                    v = row.get(col, 0.0)
                    try:
                        v = float(v)
                    except Exception:
                        v = 0.0
                    vec.append(v)

                # boolean flags (Is Whale, Aligned With Majority) as 0/1 if present
                for bcol in ["Is Whale", "Aligned With Majority"]:
                    if bcol in g.columns:
                        val = row.get(bcol, False)
                        if isinstance(val, str):
                            val = val.strip().lower() in ("1", "true", "yes")
                        vec.append(1.0 if bool(val) else 0.0)

                # time features (normalised)
                ts = pd.to_datetime(row[time_col], utc=True, errors="coerce")
                if pd.isna(ts):
                    feats_time = np.zeros(4, dtype=np.float32)
                else:
                    feats_time = _normalise_time_features(ts)

                # one combined vector per step: numeric + boolean + time
                feats.append(vec + feats_time.tolist())

            # target label at step t
            y = int(g.loc[cur_idx, label_col])

            # cluster id for the window: majority in the window (fallback to last)
            win_clusters = [cluster_vals[idx] for idx in hist_idx + [cur_idx]]
            if len(win_clusters) > 0:
                vals, counts = np.unique(win_clusters, return_counts=True)
                cluster_id = int(vals[counts.argmax()])
            else:
                cluster_id = int(cluster_vals[cur_idx])

            # record
            feats = [np.asarray(f, dtype=np.float32) for f in feats]
            out.append(Window(
                window_texts=texts,
                window_features=feats,
                target_label=y,
                voter_id=str(voter_id),
                cluster_id=cluster_id,
                window_size=window_size
            ))

    return out
```
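
The inner per-step loop above hard-codes the boolean column names and falls back to 0.0 for numerics it cannot parse. A minimal sketch of a stricter variant follows, assuming boolean column names are passed in explicitly and missing numerics get an extra indicator. The helper `encode_step` and its `fill` parameter are hypothetical; the names `is_whale` and `aligned_majority` are the canonical names mentioned later in this walkthrough. Time features are left out for brevity.

```python
import math
from typing import List, Mapping, Sequence

def encode_step(row: Mapping[str, object],
                numeric_cols: Sequence[str],
                bool_cols: Sequence[str],
                fill: float = 0.0) -> List[float]:
    """Encode one step as [value, is_missing] per numeric column,
    followed by one 0/1 flag per boolean column."""
    absent = [c for c in bool_cols if c not in row]
    if absent:
        # Fail loudly instead of silently shrinking the feature vector.
        raise KeyError(f"boolean columns not found: {absent}")

    vec: List[float] = []
    for col in numeric_cols:
        try:
            v = float(row.get(col))
        except (TypeError, ValueError):
            v = math.nan
        missing = math.isnan(v)
        vec.extend([fill if missing else v, 1.0 if missing else 0.0])

    for col in bool_cols:
        val = row[col]
        if isinstance(val, str):
            val = val.strip().lower() in ("1", "true", "yes")
        vec.append(1.0 if bool(val) else 0.0)
    return vec

step = {"vp": 10, "vp_share": float("nan"), "is_whale": "yes", "aligned_majority": False}
print(encode_step(step, ["vp", "vp_share"], ["is_whale", "aligned_majority"]))
# [10.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

The same function accepts a `pd.Series` row, since both `in` and `.get` work on the Series index. The trade-off is a wider vector: each numeric column takes two slots instead of one.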


## 3) Detailed Interpretation of Key Design Choices

### 3.1 Label-prefixed history (`[LABEL_i]`)
History texts are prefixed with `[LABEL_0]`, `[LABEL_1]`, `[LABEL_2]`.

**What this accomplishes**
- Exposes the model to the voter’s historical decisions as part of the textual stream  
- Allows the model to learn voter-specific patterns, e.g. “voter tends to vote AGAINST on certain types of proposals”

**Trade-off / risk**
- This injects label information directly into the text channel, which can be powerful but can also:
  - cause the model to over-rely on the label token patterns rather than proposal semantics
  - reduce interpretability of what it learned

If you later want stricter separation, consider providing history labels as **numeric categorical inputs** instead.
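
One possible shape for that alternative is a one-hot label channel with an extra "unknown" slot for the current step. This is a sketch under assumptions: the function name `history_label_onehots` and the extra slot are illustrative choices, not something the module provides.

```python
from typing import List
import numpy as np

NUM_CLASSES = 3  # For / Against / Abstain, as in the module

def history_label_onehots(labels: List[int],
                          num_classes: int = NUM_CLASSES) -> np.ndarray:
    """One-hot encode the W-1 history labels of a window.

    `labels` holds all W labels in step order (W >= 1); the final (current)
    label is masked by setting a separate 'unknown' slot instead.
    """
    w = len(labels)
    out = np.zeros((w, num_classes + 1), dtype=np.float32)
    for step, lab in enumerate(labels[:-1]):
        out[step, lab] = 1.0
    out[-1, num_classes] = 1.0  # current step: label hidden from the model
    return out

print(history_label_onehots([0, 1, 2]))
```

With this channel the history texts could be passed without the `[LABEL_i]` prefix, which keeps label information out of the text stream.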

---

### 3.2 Per-step numeric features
For each step, the numeric vector is:

`[ numeric_cols..., boolean_flags..., time_features(4) ]`

So the final per-step dimension is:
- `len(numeric_cols)` + `(#boolean cols present)` + `4`

**Good property**
- Features are step-aligned; no future aggregation is used inside the window builder.
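
The dimension formula above can be checked before training. The helper below is a hypothetical illustration (`per_step_dim` is not part of the module); it reproduces the rule that numeric columns always contribute one slot while boolean columns only count when present in the frame.

```python
from typing import Iterable, Sequence

TIME_DIM = 4  # hour, weekday, month, day-of-month

def per_step_dim(numeric_cols: Sequence[str],
                 df_columns: Iterable[str],
                 bool_cols: Sequence[str] = ("Is Whale", "Aligned With Majority")) -> int:
    """Expected length of each per-step vector produced by build_windows."""
    present = set(df_columns)
    n_bool = sum(1 for c in bool_cols if c in present)
    # numeric_cols always contribute one slot each (absent columns fall back to 0.0)
    return len(numeric_cols) + n_bool + TIME_DIM

raw_cols = ["vp", "vp_share", "Is Whale", "Aligned With Majority"]
normalised_cols = ["vp", "vp_share", "is_whale", "aligned_majority"]
print(per_step_dim(["vp", "vp_share"], raw_cols))         # 8
print(per_step_dim(["vp", "vp_share"], normalised_cols))  # 6
```

The second call shows how the vector length changes without any error when the boolean columns have been renamed upstream.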

---

### 3.3 Window-level `cluster_id` via mode
You set the window’s cluster id to the **most frequent** cluster among its steps.

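This is reasonable if:
- cluster assignment is stable per voter
- occasional noisy cluster values exist

Note that missing cluster values are not actually tolerated: `g[cluster_col].astype(int)` raises on NaN, so gaps must be filled before calling `build_windows`. Ties are resolved towards the smallest cluster id, because `np.unique` returns sorted values and `argmax` picks the first maximum.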

**If clusters are constant per voter**, you could simplify:
- take the voter’s unique cluster id (e.g., first non-missing value).
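
A minimal sketch of that simplification, assuming clusters really are constant per voter. The helper name `voter_cluster_map` is hypothetical. It relies on pandas `GroupBy.first()`, which returns the first non-null entry in each group.

```python
import pandas as pd

def voter_cluster_map(df: pd.DataFrame, voter_col: str, cluster_col: str) -> dict:
    """Map each voter to its first non-missing cluster id.

    Voters whose cluster values are all missing are left out of the map.
    """
    first = df.groupby(voter_col)[cluster_col].first().dropna()
    return {voter: int(cid) for voter, cid in first.items()}

demo = pd.DataFrame({
    "voter": ["A", "A", "A", "B", "B", "C"],
    "cluster_id": [None, 2, 2, 1, 1, None],
})
print(voter_cluster_map(demo, "voter", "cluster_id"))
# {'A': 2, 'B': 1}
```

The window builder could then look up `cluster_id` once per voter instead of computing a mode per window.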


## 4) Review Notes (Strengths, Risks, and Improvements)

### ✅ Strengths
- Constructs a clean sequence modelling dataset from event-level votes
- Enforces chronological ordering per voter
- Avoids row-wise mixing by building windows within a voter group
- Includes time features and numeric signals alongside text
- Outputs are compatible with the `WindowDataset` (Module 1)

### ⚠️ Risks / Potential Issues
1) **Column name mismatch for booleans**
   - Earlier `normalise_columns` creates canonical columns `is_whale` and `aligned_majority`
   - Here you read raw columns `Is Whale` and `Aligned With Majority`
   - If you run this on normalised data, these columns may not exist and the booleans will silently be missing.

   **Fix:** use canonical names consistently (e.g., `is_whale`, `aligned_majority`) or pass them as parameters.

2) **Text leakage through label prefixes**
   - This design is intentional, but you should justify it in documentation.
   - Consider ablations: with vs. without label prefixes.

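3) **Imputation default**
   - A numeric column that is absent from the row, or a value that `float()` cannot parse, defaults to 0.0, which may be semantically incorrect.
   - A value that is already NaN passes through `float(v)` unchanged, so NaNs can still reach the feature vector.
   - Consider explicit missing flags or per-column imputers.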

4) **Time encoding**
   - Linear normalisation may underperform cyclic encoding.
   - Consider sin/cos encoding for periodicity.

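5) **Performance**
   - The per-voter groupby, the per-window loop, and the per-cell `g.loc[idx, col]` lookups all run at Python level.
   - Each step's feature vector is recomputed for every window that contains it (up to `W` times).
   - For large datasets this can be slow; precomputing per-vote features once, or vectorising, may be needed.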
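
For the performance point, one option is to compute each vote's feature vector once and then take windows as a strided view. The sketch below covers only the numeric feature part and assumes NumPy 1.20 or later, where `sliding_window_view` is available. The helper name `windowed_features` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windowed_features(feature_matrix: np.ndarray, window_size: int) -> np.ndarray:
    """Turn one voter's (n, d) feature matrix into (n - W + 1, W, d) windows."""
    n, d = feature_matrix.shape
    if n < window_size:
        return np.empty((0, window_size, d), dtype=feature_matrix.dtype)
    # sliding_window_view appends the window axis last: shape (n - W + 1, d, W)
    view = sliding_window_view(feature_matrix, window_size, axis=0)
    # Reorder to (windows, steps, features) and copy into a writable array.
    return np.ascontiguousarray(view.transpose(0, 2, 1))

X = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 votes, 2 features
feature_windows = windowed_features(X, window_size=3)
print(feature_windows.shape)  # (2, 3, 2)
```

Texts and labels would still need their own handling; this only removes the repeated per-step feature construction.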

---

### Recommended short-term improvements (publication-ready)
- Standardise boolean column names across the pipeline
- Add asserts:
  - `len(texts) == window_size`
  - `len(feats) == window_size`
  - numeric feature vector length is constant across steps
- Add option to include/exclude label prefixes for controlled experiments
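
The asserts listed above could be collected in one small validator. This is a sketch: `validate_window` is a hypothetical helper, and it accepts any object exposing the `Window` attributes, so it can be tried here with a stand-in object.

```python
from types import SimpleNamespace
import numpy as np

def validate_window(win) -> None:
    """Check the structural invariants of a Window-like object."""
    assert len(win.window_texts) == win.window_size, "text count != window_size"
    assert len(win.window_features) == win.window_size, "feature count != window_size"
    dims = {len(f) for f in win.window_features}
    assert len(dims) == 1, f"inconsistent feature dims: {sorted(dims)}"

# Stand-in for a Window with W = 2 and 8 features per step.
demo_window = SimpleNamespace(
    window_texts=["[LABEL_0] p1", "[PREDICT] p2"],
    window_features=[np.zeros(8, dtype=np.float32), np.zeros(8, dtype=np.float32)],
    window_size=2,
)
validate_window(demo_window)  # passes silently
```

Calling it on every element of the list returned by `build_windows` is cheap and catches a shrunken feature vector early.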


## 5) Minimal Sanity Check (Executable Example)

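This section builds a tiny synthetic DataFrame and checks:
- number of windows produced
- each window has length `W` for both texts and features
- feature vector lengths are consistent

With `W = 3`, voter A (4 votes) yields 2 windows and voter B (3 votes) yields 1, so 3 windows are expected. Each step vector has 2 numeric + 2 boolean + 4 time features = 8 entries.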


```python
# Create a small synthetic dataset (2 voters, chronological events)
df_demo = pd.DataFrame({
    "voter": ["A"] * 4 + ["B"] * 3,
    "vote_ts": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-02-01", "2024-02-02", "2024-02-03"
    ], utc=True),
    "label_id": [0, 1, 0, 2, 1, 1, 0],
    "text": ["p1", "p2", "p3", "p4", "q1", "q2", "q3"],
    "vp": [10, 12, 9, 11, 5, 6, 7],
    "vp_share": [0.10, 0.12, 0.09, 0.11, 0.05, 0.06, 0.07],
    "cluster_id": [2, 2, 2, 2, 1, 1, 1],
    # Demonstrate raw boolean columns to match your window builder:
    "Is Whale": [False, False, False, False, True, True, True],
    "Aligned With Majority": [True, False, True, True, False, False, True],
})

W = 3
wins = build_windows(
    df=df_demo,
    window_size=W,
    text_col="text",
    label_col="label_id",
    voter_col="voter",
    time_col="vote_ts",
    cluster_col="cluster_id",
    numeric_cols=["vp", "vp_share"],
)

print("Number of windows:", len(wins))
print("Example window texts:", wins[0].window_texts)
print("Feature vector length per step:", len(wins[0].window_features[0]))
print("Window size check:", len(wins[0].window_texts), len(wins[0].window_features))

# Turn the checks listed above into assertions
assert len(wins) == 3
assert all(len(w.window_texts) == W and len(w.window_features) == W for w in wins)
assert len({len(f) for w in wins for f in w.window_features}) == 1
```


## 6) Summary

This module converts event-level voting records into windowed training examples suitable for sequence models.  
Its main contribution is the **chronologically ordered, per-voter sliding window** construction with:
- label-conditioned history via `[LABEL_i]` prefixes
- step-aligned numeric + boolean + time features
- a single current-step target label

These `Window` objects are then encoded into tensors by `WindowDataset` (Module 1).
