# Agent 2 — Module Walkthrough (Code + Review)
## Dataset & Collate (`dataset.py`)

**Author:** Summer Xiong  
**Goal:** Explain *exactly* what this module does, step-by-step, with shapes, inputs/outputs, and review notes.

This module defines:
- `EncodedWindow` (dataclass): a single **encoded training example**  
- `WindowDataset` (PyTorch `Dataset`): converts raw window objects into tensors  
- `collate_fn`: batches samples into model-ready tensors

> **Key idea:** Each sample is a **window of W steps**. Each step has:
> - a **text** (tokenised into length `L`)
> - a **numeric feature vector** (size `F`)
>
> Each sample also carries a **single label** (the classification target).


## 0) Environment & Dependencies

The Agent 1/Agent 2 notebooks as a whole use:
- `pandas`, `numpy` — data loading/processing  
- `scikit-learn` — scaling, KMeans clustering, silhouette score, PCA  
- `matplotlib` — static plots  
- `plotly` (optional) — radar charts for cluster profiles  
- `dataframe_image` (optional) — export styled tables as PNG

> Note: This particular module (`dataset.py`) primarily depends on **PyTorch** + **NumPy**.  
> The extra libraries above are mentioned because they are used elsewhere in Agent 2/Agent 1 notebooks.


In [None]:
from typing import List, Dict, Any
from dataclasses import dataclass

import torch
from torch.utils.data import Dataset
import numpy as np

print("torch:", torch.__version__)


## 1) `EncodedWindow` Dataclass

```python
@dataclass
class EncodedWindow:
    input_ids: torch.Tensor        # (W, L)
    attention_mask: torch.Tensor   # (W, L)
    num_feats: torch.Tensor        # (W, F)
    label: int                     # scalar
    voter_id: str
    cluster_id: int
```

### What it represents
One training example after preprocessing/encoding.

### Shapes
- `W` = number of steps in the window (sequence length at window level)
- `L` = token length per step (e.g., 128)
- `F` = number of numeric features per step

### Why keep `voter_id` and `cluster_id`?
- `voter_id`: traceability / debugging / per-voter analysis  
- `cluster_id`: supports **cluster-conditional evaluation** or **cluster-specific heads** later
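
As a rough illustration of cluster-conditional evaluation, the sketch below computes accuracy separately for each cluster id. The helper name `per_cluster_accuracy` and the `preds` tensor are hypothetical and not part of `dataset.py`; the toy values are for illustration only.

```python
from typing import Dict

import torch


def per_cluster_accuracy(preds: torch.Tensor,
                         labels: torch.Tensor,
                         clusters: torch.Tensor) -> Dict[int, float]:
    """Accuracy computed separately for each cluster id present in `clusters`."""
    out: Dict[int, float] = {}
    for c in torch.unique(clusters).tolist():
        sel = clusters == c                            # boolean mask, shape (B,)
        correct = (preds[sel] == labels[sel]).float()  # 1.0 where prediction matches
        out[int(c)] = correct.mean().item()
    return out


# Toy values (hypothetical model predictions, not real results)
preds = torch.tensor([1, 0, 1, 1])
labels = torch.tensor([1, 1, 1, 0])
clusters = torch.tensor([0, 0, 2, 2])
acc = per_cluster_accuracy(preds, labels, clusters)
```

Keeping `cluster_id` on every sample is what makes this kind of breakdown possible without re-joining against the raw data.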


In [None]:
@dataclass
class EncodedWindow:
    input_ids: torch.Tensor        # (W, L)
    attention_mask: torch.Tensor   # (W, L)
    num_feats: torch.Tensor        # (W, F)
    label: int                     # scalar
    voter_id: str
    cluster_id: int


## 2) `WindowDataset`: turning raw windows into tensors

### Purpose
`WindowDataset` takes a list of raw window objects and a tokenizer, and returns one `EncodedWindow` per index.

### Expected interface of each raw window object `w`
Your code assumes each `w` has the following attributes:

- `w.window_texts`: `List[str]` of length `W`  
- `w.window_features`: list/array shaped `(W, F)`  
- `w.target_label`: int-like (class id)  
- `w.voter_id`: string id  
- `w.cluster_id`: int-like

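A missing attribute raises `AttributeError` inside `__getitem__`. A length mismatch between `window_texts` and `window_features` is not caught there: it surfaces later, if at all, as a shape error further down the pipeline.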
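
One way to fail early is to validate each raw window before it reaches the dataset. The following is a minimal sketch; `validate_window` and `REQUIRED_ATTRS` are hypothetical names introduced here, not part of the module.

```python
from types import SimpleNamespace
from typing import Any

import numpy as np

REQUIRED_ATTRS = ("window_texts", "window_features", "target_label",
                  "voter_id", "cluster_id")


def validate_window(w: Any) -> None:
    """Raise a descriptive error if `w` does not match the expected interface."""
    missing = [a for a in REQUIRED_ATTRS if not hasattr(w, a)]
    if missing:
        raise AttributeError(f"window is missing attributes: {missing}")
    feats = np.asarray(w.window_features, dtype=np.float32)
    if feats.ndim != 2:
        raise ValueError(f"window_features must be 2-D (W, F), got shape {feats.shape}")
    if len(w.window_texts) != feats.shape[0]:
        raise ValueError(
            f"{len(w.window_texts)} texts but {feats.shape[0]} feature rows"
        )


# A well-formed window passes silently
good = SimpleNamespace(window_texts=["a", "b"],
                       window_features=[[0.1, 1.0], [0.2, 0.9]],
                       target_label=1, voter_id="0xabc", cluster_id=2)
validate_window(good)

# Three texts but only two feature rows: rejected before training starts
bad = SimpleNamespace(window_texts=["a", "b", "c"],
                      window_features=[[0.1, 1.0], [0.2, 0.9]],
                      target_label=1, voter_id="0xabc", cluster_id=2)
try:
    validate_window(bad)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

Running such a check once over the whole list of windows costs little and turns a late shape error into an immediate, readable message.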


In [None]:
class WindowDataset(Dataset):
    def __init__(self, windows: List[Any], tokenizer, max_length: int = 128):
        self.windows = windows
        self.tok = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.windows)

    def __getitem__(self, idx: int) -> EncodedWindow:
        w = self.windows[idx]
        # Encode each step text separately
        ids, masks = [], []
        for t in w.window_texts:
            enc = self.tok(
                t,
                truncation=True,
                max_length=self.max_length,
                padding="max_length",
                return_tensors="pt"
            )
            ids.append(enc["input_ids"][0])          # (L,)
            masks.append(enc["attention_mask"][0])   # (L,)
        input_ids = torch.stack(ids, dim=0)          # (W, L)
        attention_mask = torch.stack(masks, dim=0)   # (W, L)
        # Numeric features
        num_feats = torch.tensor(np.stack(w.window_features, axis=0), dtype=torch.float32)  # (W, F)
        return EncodedWindow(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_feats=num_feats,
            label=int(w.target_label),
            voter_id=w.voter_id,
            cluster_id=int(w.cluster_id)
        )


### 2.1 Step-by-step inside `__getitem__`

#### (A) Tokenise each step text independently
For each step text `t` in `w.window_texts`, the tokenizer returns:
- `input_ids`: shape `(1, L)`
- `attention_mask`: shape `(1, L)`

Then `[0]` removes the batch dimension and keeps `(L,)`.

#### (B) Stack over window steps
`torch.stack(ids, dim=0)` gives:
- `input_ids`: `(W, L)`
- `attention_mask`: `(W, L)`

#### (C) Numeric features
`np.stack(w.window_features, axis=0)` forms a `(W, F)` array, which is then converted to a `torch.float32` tensor.

#### (D) Return EncodedWindow
A single encoded sample with tensors + metadata.
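
Steps (A) to (C) can be traced with plain tensors, without a tokenizer. This is a sketch with illustrative sizes; the zero tensors stand in for real tokenizer output.

```python
import numpy as np
import torch

W, L, F = 3, 8, 2   # illustrative sizes only

# (A) each tokenizer call returns tensors with a leading batch dimension of 1
per_step = [torch.zeros(1, L, dtype=torch.long) for _ in range(W)]
rows = [enc[0] for enc in per_step]       # [0] drops the batch dim: each is (L,)

# (B) stack the W rows into one tensor per window
input_ids = torch.stack(rows, dim=0)      # (W, L)

# (C) numeric features: list of W rows -> (W, F) float32 tensor
features = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8]]
num_feats = torch.tensor(np.stack(features, axis=0), dtype=torch.float32)  # (W, F)
```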

---

### Review notes (important)
✅ Strengths
- Clear and consistent tensor shapes
- Keeps metadata (`voter_id`, `cluster_id`) for later analysis
- Clean separation of text and numeric features

⚠️ Potential issues / improvements
1) **Padding strategy**: `padding="max_length"` pads every step to `L` (simple but wasteful).
   - Consider **dynamic padding** at batch time for better efficiency.
2) **Tokenisation cost**: tokenising in `__getitem__` repeats work each epoch.
   - Consider caching or pre-tokenising if the dataset is large.
3) **Type/shape checks**: if `len(window_texts) != len(window_features)`, `input_ids` and `num_feats` end up with different `W`, and `__getitem__` raises no error.
   - Add assertions to catch errors early.
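
For the tokenisation cost in point 2, one option is to encode every sample once up front and serve the results from memory. The sketch below assumes the encoded dataset fits in RAM; `PreEncodedDataset` is a hypothetical wrapper, and `CountingDataset` is only a stand-in for `WindowDataset` used to show that each item is encoded once.

```python
from typing import Any, List

from torch.utils.data import Dataset


class PreEncodedDataset(Dataset):
    """Hypothetical wrapper: encode every sample once, then serve from memory."""

    def __init__(self, base: Dataset):
        # Calls base.__getitem__ (and therefore the tokenizer) once per sample
        self.items: List[Any] = [base[i] for i in range(len(base))]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> Any:
        return self.items[idx]


class CountingDataset(Dataset):
    """Stand-in for WindowDataset that counts how often items are built."""

    def __init__(self) -> None:
        self.calls = 0

    def __len__(self) -> int:
        return 3

    def __getitem__(self, idx: int) -> int:
        self.calls += 1
        return idx * 10


base = CountingDataset()
cached = PreEncodedDataset(base)
first, second = cached[1], cached[1]   # repeated access does not touch `base`
```

The trade-off is memory: with `padding="max_length"`, every cached sample stores the full `(W, L)` tensors even when most positions are padding.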


## 3) `collate_fn`: batching samples for the model

### Purpose
PyTorch `DataLoader` uses `collate_fn` to combine a list of samples into a batch.

### Input
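A list of `B` `EncodedWindow` objects (`B` = batch size). Because `torch.stack` requires equal shapes, every sample in the batch must share the same `W`, `L`, and `F`.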

### Output dictionary (model-ready)
- `input_ids`: `(B, W, L)`
- `attention_mask`: `(B, W, L)`
- `num_feats`: `(B, W, F)`
- `labels`: `(B,)`
- `clusters`: `(B,)`


In [None]:
def collate_fn(batch: List[EncodedWindow]) -> Dict[str, torch.Tensor]:
    input_ids = torch.stack([b.input_ids for b in batch], dim=0)            # (B, W, L)
    attention_mask = torch.stack([b.attention_mask for b in batch], dim=0)  # (B, W, L)
    num_feats = torch.stack([b.num_feats for b in batch], dim=0)            # (B, W, F)
    labels = torch.tensor([b.label for b in batch], dtype=torch.long)       # (B,)
    clusters = torch.tensor([b.cluster_id for b in batch], dtype=torch.long)  # (B,)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "num_feats": num_feats,
        "labels": labels,
        "clusters": clusters
    }


### Review notes (collate_fn)

✅ Strengths
- Straightforward stacking and clear shapes
- Returns a dict (fits most training loops)
- Keeps `clusters` for segmentation-aware metrics

⚠️ Improvements
- If you adopt **dynamic padding**, implement it here.
- If you want per-voter error analysis, you may also return `voter_id` here (currently dropped).
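
Both suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not the module's implementation: `collate_trimmed` and `_fake_sample` are hypothetical names, padding is assumed to sit on the right, and `attention_mask` is assumed to be 1 for real tokens. It trims tensors that were already padded to `max_length`, which is a cheaper approximation of dynamic padding rather than tokenizer-level dynamic padding.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

import torch


def collate_trimmed(batch: List[Any]) -> Dict[str, Any]:
    """Variant of collate_fn: drops all-padding columns and keeps voter ids."""
    input_ids = torch.stack([b.input_ids for b in batch], dim=0)            # (B, W, L)
    attention_mask = torch.stack([b.attention_mask for b in batch], dim=0)  # (B, W, L)
    # Longest real (non-padding) length in the batch; keep at least one column
    keep = max(int(attention_mask.sum(dim=-1).max().item()), 1)
    return {
        "input_ids": input_ids[..., :keep],             # (B, W, keep)
        "attention_mask": attention_mask[..., :keep],   # (B, W, keep)
        "num_feats": torch.stack([b.num_feats for b in batch], dim=0),
        "labels": torch.tensor([b.label for b in batch], dtype=torch.long),
        "clusters": torch.tensor([b.cluster_id for b in batch], dtype=torch.long),
        "voter_ids": [b.voter_id for b in batch],       # list of str, not a tensor
    }


def _fake_sample(n_real: int, voter: str) -> SimpleNamespace:
    """Stand-in for an EncodedWindow with W=3, L=8, F=2 and n_real real tokens."""
    mask = torch.zeros(3, 8, dtype=torch.long)
    mask[:, :n_real] = 1
    return SimpleNamespace(input_ids=mask.clone(), attention_mask=mask,
                           num_feats=torch.zeros(3, 2), label=1,
                           voter_id=voter, cluster_id=0)


out = collate_trimmed([_fake_sample(4, "0xabc"), _fake_sample(6, "0xdef")])
```

Because `voter_ids` is a plain list, the return type is no longer `Dict[str, torch.Tensor]`, and any code that moves the whole batch to a device needs to skip that key.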


## 4) Minimal Sanity Check (No Transformers Required)

We create:
- a dummy window object with the required fields
- a dummy tokenizer that mimics the `transformers` interface

This lets you validate output shapes without installing Hugging Face `transformers`.


In [None]:
from types import SimpleNamespace

class DummyTokenizer:
    def __call__(self, text, truncation=True, max_length=8, padding="max_length", return_tensors="pt"):
        # Simple fake tokeniser: map each character to an int, then truncate/pad
        ids = [min(ord(c), 255) for c in text][:max_length]
        ids = ids + [0] * (max_length - len(ids))
        attn = [1 if i != 0 else 0 for i in ids]
        return {
            "input_ids": torch.tensor([ids], dtype=torch.long),
            "attention_mask": torch.tensor([attn], dtype=torch.long),
        }

dummy_window = SimpleNamespace(
    window_texts=["step one", "step two", "step three"],     # W=3
    window_features=[[0.1, 1.0], [0.2, 0.9], [0.3, 0.8]],    # (W, F) where F=2
    target_label=1,
    voter_id="0xabc",
    cluster_id=2,
)

ds = WindowDataset([dummy_window], tokenizer=DummyTokenizer(), max_length=8)
sample = ds[0]

print("input_ids:", sample.input_ids.shape)            # (W, L) -> torch.Size([3, 8])
print("attention_mask:", sample.attention_mask.shape)  # (W, L) -> torch.Size([3, 8])
print("num_feats:", sample.num_feats.shape)            # (W, F) -> torch.Size([3, 2])
print("label:", sample.label, "cluster:", sample.cluster_id)  # label: 1 cluster: 2
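
The same kind of check can be extended to a full batch through a `DataLoader`. The sketch below is self-contained so that it runs on its own: `make_sample` builds stand-ins with the same field names as `EncodedWindow`, and `stack_collate` repeats the stacking logic of `collate_fn`. Both names and all sizes are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

import torch
from torch.utils.data import DataLoader

B, W, L, F = 4, 3, 8, 2   # illustrative sizes only


def make_sample(i: int) -> SimpleNamespace:
    """Stand-in for an EncodedWindow with the same field names."""
    return SimpleNamespace(
        input_ids=torch.zeros(W, L, dtype=torch.long),
        attention_mask=torch.ones(W, L, dtype=torch.long),
        num_feats=torch.zeros(W, F),
        label=i % 2,
        voter_id=f"voter-{i}",
        cluster_id=i % 3,
    )


def stack_collate(batch: List[Any]) -> Dict[str, torch.Tensor]:
    """Same stacking logic as collate_fn, repeated so this cell runs on its own."""
    return {
        "input_ids": torch.stack([b.input_ids for b in batch], dim=0),
        "attention_mask": torch.stack([b.attention_mask for b in batch], dim=0),
        "num_feats": torch.stack([b.num_feats for b in batch], dim=0),
        "labels": torch.tensor([b.label for b in batch], dtype=torch.long),
        "clusters": torch.tensor([b.cluster_id for b in batch], dtype=torch.long),
    }


# A plain list works as a map-style dataset (it has __len__ and __getitem__)
loader = DataLoader([make_sample(i) for i in range(B)],
                    batch_size=B, collate_fn=stack_collate)
batch = next(iter(loader))
shapes = {k: tuple(v.shape) for k, v in batch.items()}
```

In the real pipeline, `WindowDataset` and `collate_fn` would take the place of the stand-ins, and `shapes` should match the `(B, W, L)`, `(B, W, F)`, and `(B,)` layout described in section 3.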


## 5) Summary

This module ensures Agent 2 training receives:
- windowed token ids & masks for text sequences
- aligned numeric feature sequences
- class labels for supervision
- cluster ids for segmentation-aware evaluation or modelling

It is a clean, minimal foundation for sequence-based vote prediction models.
