## Probe Generalisation MVP

### Goals:
- [x] Choose a layer to train linear probes at
  - For now we use layer 5, the earliest layer that reached perfect accuracy in the initial probe exploration
- [ ] For each category in the 19th Feb dataset, train a linear probe
- [ ] Generate a heatmap where the (x, y)-th entry is the accuracy of the probe trained on category x and evaluated on category y
- [ ] Understand GPU capacity: can we run inference with a 70B model?
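
As a rough starting point for the GPU capacity question, the sketch below estimates the memory needed just to hold the weights of a 70B-parameter model at a few precisions. The helper name `weight_memory_gb` and the nominal count of 70e9 parameters are assumptions for illustration; activations, the KV cache and framework overhead come on top and are not modelled here.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter count (assumption); the exact figure depends on the checkpoint.
N_PARAMS_70B = 70e9

# Bytes per parameter for common storage precisions.
PRECISIONS = [("float32", 4), ("bfloat16", 2), ("int8", 1), ("int4", 0.5)]

estimates = {name: weight_memory_gb(N_PARAMS_70B, nbytes) for name, nbytes in PRECISIONS}

for name, gb in estimates.items():
    print(f"{name:>8}: {gb:.0f} GB for weights alone")
```

Comparing these figures against the total memory across the available GPUs gives a first yes/no answer before attempting to load anything.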

### Timeline:
- 19/02/25 and 20/02/25

In [4]:
# Imports
from models_under_pressure.probes import (
    create_activations,
)

from transformers import AutoTokenizer, AutoModelForCausalLM
import os
import pandas as pd
from pathlib import Path

project_root = Path("..").resolve()

In [None]:
# Loading dataset
df = pd.read_csv(project_root / "temp_data/dataset_19_feb.csv")

# Split data by top category
categories = {}
for category in df["top_category"].unique():
    category_df = df[df["top_category"] == category]
    categories[category] = {
        "X": category_df["prompt_text"].tolist(),
        "y": category_df["high_stakes"].tolist(),
    }


In [None]:
# Loading model

os.environ["TOKENIZERS_PARALLELISM"] = "false"
model_name = "meta-llama/Llama-3.2-1B-Instruct"

# device = 'cuda:1'
device = "cpu"

# Load the Llama-3.2-1B-Instruct model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Run the model on each category's data, recording the activations

for category in categories:
    categories[category]["acts"] = create_activations(
        model=model, tokenizer=tokenizer, text=categories[category]["X"], device=device
    )

Layer: 0, Activation Shape: torch.Size([100, 33, 2048])
Layer: 1, Activation Shape: torch.Size([100, 33, 2048])
Layer: 2, Activation Shape: torch.Size([100, 33, 2048])
Layer: 3, Activation Shape: torch.Size([100, 33, 2048])
Layer: 4, Activation Shape: torch.Size([100, 33, 2048])
Layer: 5, Activation Shape: torch.Size([100, 33, 2048])
Layer: 6, Activation Shape: torch.Size([100, 33, 2048])
Layer: 7, Activation Shape: torch.Size([100, 33, 2048])
Layer: 8, Activation Shape: torch.Size([100, 33, 2048])
Layer: 9, Activation Shape: torch.Size([100, 33, 2048])
Layer: 10, Activation Shape: torch.Size([100, 33, 2048])
Layer: 11, Activation Shape: torch.Size([100, 33, 2048])
Layer: 12, Activation Shape: torch.Size([100, 33, 2048])
Layer: 13, Activation Shape: torch.Size([100, 33, 2048])
Layer: 14, Activation Shape: torch.Size([100, 33, 2048])
Layer: 15, Activation Shape: torch.Size([100, 33, 2048])
All activations shape: torch.Size([16, 100, 33, 2048])
Layer: 0, Activation Shape: torch.Size([100
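
The stacked output above has shape (layers, samples, tokens, hidden), so a probe at a single layer needs one feature vector per sample. How `create_activations` is meant to be reduced is not shown here, so the following is only a sketch of one common option: select a layer and mean-pool over non-padding tokens. The function `pool_layer` and the `attention_mask` argument are hypothetical, and NumPy arrays with small dimensions stand in for the real tensors.

```python
import numpy as np


def pool_layer(acts: np.ndarray, attention_mask: np.ndarray, layer: int) -> np.ndarray:
    """Mean-pool one layer's activations over real (non-padding) tokens.

    acts:           (n_layers, n_samples, seq_len, hidden)
    attention_mask: (n_samples, seq_len), 1 for real tokens, 0 for padding
    returns:        (n_samples, hidden)
    """
    layer_acts = acts[layer]                                   # (n_samples, seq_len, hidden)
    mask = attention_mask[:, :, None].astype(layer_acts.dtype)  # broadcast over hidden
    summed = (layer_acts * mask).sum(axis=1)                   # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1, None)                # avoid division by zero
    return summed / counts


# Toy stand-in for the real activations (16 layers, 4 samples, 6 tokens, 8 dims).
rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 4, 6, 8))
attention_mask = np.ones((4, 6), dtype=int)
attention_mask[0, 3:] = 0  # pretend the first sample has 3 padding tokens

features = pool_layer(acts, attention_mask, layer=5)
print(features.shape)
```

Mean pooling is only one choice; taking the last non-padding token is another, and the two can give noticeably different probe accuracies.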

In [None]:
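# Probe hyperparameters
model_params = {"C": 1, "random_state": 42, "fit_intercept": False}

# The per-layer training below is left commented out: Parallel, delayed,
# train_single_layer and compute_accuracy are not imported in this notebook,
# and train_acts / test_acts are not defined yet.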
# layer_probes: list[LinearProbe] = Parallel(n_jobs=16)(
#     delayed(train_single_layer)(acts, train_labels, model_params) for acts in train_acts
# )  # type: ignore

# accuracies = [
#     compute_accuracy(probe, test_acts[i], test_labels)
#     for i, probe in enumerate(layer_probes)
# ]

# accuracies
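
The cross-category goal (train on category x, evaluate on category y) can be sketched as below. This is not the project's `LinearProbe`: scikit-learn's `LogisticRegression` is swapped in as an assumption, chosen because the keys in `model_params` match its constructor arguments. The function `cross_category_accuracy` is hypothetical, and the `features` and `labels` dictionaries hold synthetic data standing in for per-category layer-5 features and `high_stakes` labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def cross_category_accuracy(features, labels, model_params):
    """Return (names, acc) where acc[i, j] is the accuracy of the probe
    trained on category names[i] and evaluated on category names[j]."""
    names = sorted(features)
    probes = {
        name: LogisticRegression(**model_params).fit(features[name], labels[name])
        for name in names
    }
    acc = np.zeros((len(names), len(names)))
    for i, train_name in enumerate(names):
        for j, test_name in enumerate(names):
            acc[i, j] = probes[train_name].score(features[test_name], labels[test_name])
    return names, acc


# Synthetic stand-in data: two categories sharing one label direction.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
features, labels = {}, {}
for name in ["cat_a", "cat_b"]:
    X = rng.normal(size=(200, 8))
    features[name] = X
    labels[name] = (X @ w > 0).astype(int)

model_params = {"C": 1, "random_state": 42, "fit_intercept": False}
names, acc = cross_category_accuracy(features, labels, model_params)
print(names)
print(acc.round(2))
```

Rows index the training category and columns the evaluation category, so `acc` can be passed directly to a heatmap plotting function. The diagonal here is in-sample accuracy and will look optimistic unless each category is split into train and held-out parts first.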