## Initial Probe Exploration

### Goals
- [x] Train a simple linear logistic regression probe on Llama-3-7b
    - Update (Alex, 19/02/25): since I'm running this on my MacBook, I'll just use the 1B model
- Understand GPU capacity - can we do inference with 70B?
- Look at the probe activations / test set classifications

### Todos

- [ ] Add some logging to training

### Lessons
- Mean-pooling the activations over sequence positions (`seq_pos`) seems to work (much) better than using only the last position
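
The two pooling strategies compared in the lesson above can be sketched as follows. This is a minimal illustration on toy data, not the code inside `create_activations`: the helper name `pool_activations`, the right-padding assumption, and the 0/1 attention mask are all assumptions made for this sketch.

```python
import numpy as np


def pool_activations(acts, attention_mask, method="mean"):
    """Collapse (batch, seq_len, embed_dim) activations to (batch, embed_dim).

    Hypothetical helper. Assumes right-padded sequences and a 0/1
    attention mask of shape (batch, seq_len).
    """
    mask = attention_mask.astype(acts.dtype)
    if method == "mean":
        # Sum only over real tokens, then divide by the number of real tokens.
        summed = (acts * mask[:, :, None]).sum(axis=1)
        counts = np.clip(mask.sum(axis=1, keepdims=True), 1, None)
        return summed / counts
    if method == "last":
        # Index of the last real (non-padding) token in each sequence.
        last_idx = attention_mask.sum(axis=1).astype(int) - 1
        return acts[np.arange(acts.shape[0]), last_idx]
    raise ValueError(f"unknown pooling method: {method}")


# Toy input: 2 sequences, 3 positions, 2 features.
# The second sequence ends with one padding token.
acts = np.array(
    [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
     [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]]
)
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])

mean_pooled = pool_activations(acts, attention_mask, "mean")
last_pooled = pool_activations(acts, attention_mask, "last")
print(mean_pooled)
print(last_pooled)
```

Masking matters for both variants: without it, the mean is diluted by padding positions and "last" may pick up a padding token rather than the final real token.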

### Timeline
- 18/02/25
- 19/02/25

In [1]:
# Imports
from models_under_pressure.probes import (
    train_single_layer,
    compute_accuracy,
    create_activations,
    LinearProbe,
)
import numpy as np

from transformers import AutoTokenizer, AutoModelForCausalLM
from joblib import Parallel, delayed
import os
import pandas as pd
from pathlib import Path

project_root = Path(".").resolve().parent
cache_dir = '/scratch/ucabwjn/.cache'



## Dataset

The dataset was generated using GPT-4o. It consists of 20 examples with red things and 20 examples with green things. We hope to learn a classifier / probe for green or red objects.

`Command: Generate 20 sentences about red things. Generate 20 sentences about green things. Put them in a JSON array of strings.`

In [2]:
df = pd.read_csv(project_root / "temp_data/dataset_19_feb.csv")

X = df["prompt_text"].tolist()
y = df["high_stakes"].tolist()

# Set random seed for reproducibility
np.random.seed(42)

# Get total number of samples
n_samples = len(X)
print(n_samples)

# Generate random indices for train/test split (80/20)
indices = np.random.permutation(n_samples)
train_size = int(0.8 * n_samples)

# Split data into train and test sets
train_text = [X[i] for i in indices[:train_size]]
test_text = [X[i] for i in indices[train_size:]]
train_labels = [y[i] for i in indices[:train_size]]
test_labels = [y[i] for i in indices[train_size:]]


400


## Generate the Feature Inputs for the Probe

In [4]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model_name = "meta-llama/Llama-3.3-70B-Instruct"

# device = 'cuda:1'
device = "cuda:3"

# Load the model and tokenizer named in `model_name`
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Run the model on the train and test data, recording the activations

train_acts = create_activations(
    model=model, tokenizer=tokenizer, text=train_text, device=device
)

test_acts = create_activations(
    model=model, tokenizer=tokenizer, text=test_text, device=device
)


RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 33554432 bytes. Error code 12 (Cannot allocate memory)
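
The failure above is raised by the CPU allocator, which would be consistent with the weights being materialized in host RAM before `.to(device)` moves them to the GPU. A back-of-the-envelope estimate of the weight footprint gives a sense of scale. The parameter counts below are the nominal ones implied by the model names, not measured values, and the estimate ignores activations and framework overhead.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(n_params, dtype="float32"):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts (assumed from the model names, not measured).
estimates = {
    (name, dtype): weight_memory_gb(n_params, dtype)
    for name, n_params in [("1B", 1e9), ("70B", 70e9)]
    for dtype in ("float32", "bfloat16")
}

for (name, dtype), gb in estimates.items():
    print(f"{name} in {dtype}: ~{gb:.0f} GB of weights")
```

If the 70B weights are held in float32 they need roughly 280 GB, and roughly 140 GB in bfloat16, so a single device is unlikely to be enough either way. Passing `torch_dtype=torch.bfloat16` and `device_map="auto"` to `from_pretrained` (the latter requires the `accelerate` package) is one common way to reduce the footprint and shard across GPUs. It has not been tested in this notebook.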

## Training Code for the Probe

Use the `sklearn` logistic regression classifier to fit a linear classifier on the model's activations. We do the following:

1. Create the y labels (1 for red and 0 for green)
2. Restructure X into the shape sklearn expects, `(batch_size, embed_dim)`: one matrix per layer, taken at the final sequence position **TODO: iterate in future**  
3. Run Logistic Regression
4. Test on 5 test data points
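
The steps above can be sketched with plain `sklearn` on synthetic data. This is not the implementation of `train_single_layer` or `LinearProbe`, whose internals are not shown here. The fake activations, the `_demo` variable names, and the assumption that `model_params` maps directly onto `LogisticRegression` keyword arguments are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-in for model activations: (n_layers, batch, embed_dim).
n_layers, n_train, n_test, embed_dim = 3, 80, 20, 16
labels_train = rng.integers(0, 2, size=n_train)
labels_test = rng.integers(0, 2, size=n_test)


def fake_activations(labels):
    # Gaussian noise, with the first feature shifted by +/-4 according to the
    # label so that the two classes are easy to separate linearly.
    acts = rng.normal(size=(n_layers, len(labels), embed_dim))
    acts[:, :, 0] += 4.0 * (2 * labels - 1)
    return acts


train_acts_demo = fake_activations(labels_train)
test_acts_demo = fake_activations(labels_test)

# Fit one independent probe per layer and score it on the held-out set.
demo_accuracies = []
for layer in range(n_layers):
    clf = LogisticRegression(C=1, random_state=42, fit_intercept=False)
    clf.fit(train_acts_demo[layer], labels_train)
    demo_accuracies.append(clf.score(test_acts_demo[layer], labels_test))

print(demo_accuracies)
```

Each layer gets its own classifier, so the per-layer fits are independent and can be run in parallel, as the `joblib` call in the next cell does.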

In [None]:
model_params = {"C": 1, "random_state": 42, "fit_intercept": False}

layer_probes: list[LinearProbe] = Parallel(n_jobs=16)(
    delayed(train_single_layer)(acts, train_labels, model_params) for acts in train_acts
)  # type: ignore

accuracies = [
    compute_accuracy(probe, test_acts[i], test_labels)
    for i, probe in enumerate(layer_probes)
]

accuracies

[0.95,
 0.95,
 1.0,
 0.95,
 1.0,
 0.95,
 0.95,
 0.9,
 0.95,
 0.9,
 0.9,
 0.95,
 0.95,
 1.0,
 1.0,
 1.0]

In [None]:
from sklearn.metrics import roc_auc_score

# Calculate AUC-ROC for each layer
auc_scores = []
for i, probe in enumerate(layer_probes):
    # Get probability predictions for positive class
    X = probe._preprocess_activations(test_acts[i])
    y_proba = probe._model.predict_proba(X)[:, 1]

    # Print debugging information
    print(f"\nLayer {i}:")
    print("True labels:", test_labels)
    print("Predicted probabilities:", y_proba.round(3))

    # Verify we have variation in both labels and predictions
    print(f"Unique true labels: {np.unique(test_labels)}")
    print(f"Prediction range: [{y_proba.min():.3f}, {y_proba.max():.8f}]")

    # Calculate AUC-ROC
    try:
        auc = roc_auc_score(test_labels, y_proba)
        auc_scores.append(auc)
        print(f"AUC-ROC: {auc:.3f}")
    except Exception as e:
        print(f"Error calculating AUC-ROC: {e}")
        auc_scores.append(None)



Layer 0:
True labels: [0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0]
Predicted probabilities: [0.972 1.    0.478 0.998 0.001 1.    0.993 0.988 0.992 0.002 0.    0.
 0.993 0.001 0.085 1.    0.987 0.998 0.999 0.03 ]
Unique true labels: [0 1]
Prediction range: [0.000, 0.99997967]
AUC-ROC: 1.000

Layer 1:
True labels: [0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0]
Predicted probabilities: [0.925 1.    0.157 0.999 0.    1.    0.999 0.996 0.999 0.004 0.001 0.003
 0.999 0.    0.037 1.    0.999 0.999 1.    0.014]
Unique true labels: [0 1]
Prediction range: [0.000, 0.99999683]
AUC-ROC: 1.000

Layer 2:
True labels: [0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0]
Predicted probabilities: [0.272 1.    0.012 0.896 0.    1.    0.982 0.979 0.965 0.    0.001 0.
 0.976 0.    0.018 1.    0.998 0.993 0.999 0.   ]
Unique true labels: [0 1]
Prediction range: [0.000, 0.99996137]
AUC-ROC: 1.000

Layer 3:
True labels: [0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1

In [None]:
# Get predictions for layer 5
layer_5_probe = layer_probes[5]
X = layer_5_probe._preprocess_activations(test_acts[5])
y_proba = layer_5_probe._model.predict_proba(X)[:, 1]


results_df = pd.DataFrame(
    {"True Label": test_labels, "Predicted Probability": y_proba}
).round(3)

# Sort by predicted probability for easier analysis
results_df = results_df.sort_values("Predicted Probability", ascending=False)

# Display the results
print("\nLayer 5 Predictions vs True Labels:")
print(results_df)



Layer 5 Predictions vs True Labels:
    True Label  Predicted Probability
1            1                  1.000
16           1                  1.000
18           1                  1.000
5            1                  1.000
15           1                  1.000
17           1                  0.999
3            1                  0.998
7            1                  0.998
6            1                  0.997
12           1                  0.995
8            1                  0.956
2            0                  0.515
0            0                  0.329
14           0                  0.003
19           0                  0.003
11           0                  0.001
10           0                  0.000
9            0                  0.000
4            0                  0.000
13           0                  0.000
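
The table above also illustrates why accuracy and AUC-ROC can disagree. The sketch below recomputes both from the rounded values in the table, using only the standard library. Because the probabilities are rounded to three decimals, the result is an approximation of what `roc_auc_score` would report on the raw predictions.

```python
# (true label, predicted probability) pairs copied from the layer 5 table above.
rows = [
    (1, 1.000), (1, 1.000), (1, 1.000), (1, 1.000), (1, 1.000),
    (1, 0.999), (1, 0.998), (1, 0.998), (1, 0.997), (1, 0.995),
    (1, 0.956), (0, 0.515), (0, 0.329), (0, 0.003), (0, 0.003),
    (0, 0.001), (0, 0.000), (0, 0.000), (0, 0.000), (0, 0.000),
]

# Accuracy at the 0.5 decision threshold.
accuracy = sum((p > 0.5) == bool(y) for y, p in rows) / len(rows)

# AUC-ROC as the fraction of (positive, negative) pairs that are ranked
# correctly, counting ties as half a correct pair.
pos = [p for y, p in rows if y == 1]
neg = [p for y, p in rows if y == 0]
auc = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg) / (
    len(pos) * len(neg)
)

print(f"accuracy={accuracy:.2f}, auc={auc:.3f}")
```

The single error at the 0.5 threshold is the negative example with probability 0.515, which gives an accuracy of 0.95, consistent with the entry for layer 5 in `accuracies`. Every positive still ranks above every negative, so the AUC computed from these values is 1.0: the probe separates the classes, and only the threshold is slightly off.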
