## Notebook 1: Computing Candidate Refusal Vectors
In this notebook, we record the model's activations on harmful and harmless prompts. We then take the difference between the mean harmful and mean harmless activations at many different token positions and layers to obtain a set of candidate refusal vectors. These are stored in the `Data` folder of this repo.

## Load Packages

In [None]:
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from tqdm import tqdm
from functools import partial
import accelerate
import torch
from RefusalVectors import load_benchmarks, get_full_block_activations, get_refusal_vectors, save_candidate_vectors

## Load Model

In [2]:
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    trust_remote_code=True,
    # device_map="auto",   # alternative: let Accelerate place the model across available GPUs (then drop .to(device))
    dtype="auto"   # use the checkpoint's dtype; or pass torch.float16 / torch.bfloat16
).to(device)

MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16



## Download the prompt dataset

In [None]:
harm_train, clean_train = load_benchmarks(
    harm_range=range(150),
    alpaca_range=range(200),
    malicious_range=range(50),
)

## Clear previous hooks and get activations

In [None]:
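# Remove hooks left over from an earlier run of this notebook so that
# activations are not recorded twice. This clears PyTorch's private hook dicts.
for module in model.modules():
    module._forward_hooks.clear()
    module._forward_pre_hooks.clear()
    module._backward_hooks.clear()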

In [None]:
harmful_vectors = get_full_block_activations(model, harm_train)
harmless_vectors = get_full_block_activations(model, clean_train)

100%|██████████| 200/200 [01:27<00:00,  2.29it/s]
100%|██████████| 200/200 [01:29<00:00,  2.24it/s]
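
The internals of `get_full_block_activations` are not shown here. As a rough illustration of how per-block activations can be recorded with PyTorch forward hooks, the sketch below uses a toy stack of linear layers in place of the real model. The names `blocks`, `captured`, `make_hook`, and `run` are hypothetical and not part of `RefusalVectors`.

```python
import torch
from torch import nn

# Toy stand-in for a stack of transformer blocks (hypothetical, not gpt-oss).
torch.manual_seed(0)
hidden = 8
blocks = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(3)])

captured = {}  # layer index -> list of [seq_len, hidden] tensors


def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Detach and move to CPU so stored activations do not keep the
        # autograd graph or GPU memory alive.
        captured.setdefault(layer_idx, []).append(output.detach().cpu())
    return hook


# register_forward_hook returns a handle that can later remove the hook.
handles = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(blocks)]


def run(x):
    for blk in blocks:
        x = torch.tanh(blk(x))
    return x


with torch.no_grad():
    for _ in range(4):                # four fake "prompts"
        run(torch.randn(5, hidden))   # five token positions each

# Remove the hooks through their handles (public API).
for h in handles:
    h.remove()

# Stack into a single tensor of shape [n_layers, n_prompts, seq_len, hidden].
activations = torch.stack([torch.stack(captured[i]) for i in range(len(blocks))])
print(activations.shape)
```

Keeping the handles and calling `remove()` on them is one way to avoid reaching into private attributes such as `_forward_hooks` when cleaning up.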


## Subtracting the Means

In [None]:
refusal_vectors = get_refusal_vectors(harmful_vectors, harmless_vectors)

100%|██████████| 24/24 [00:03<00:00,  7.18it/s]
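
To make the "difference of means" idea concrete, here is a minimal sketch on random tensors. It assumes activations are laid out as `[layer, prompt, position, hidden]`, which may differ from what `get_refusal_vectors` actually expects. The shapes, the `+ 0.5` offset, and the unit-normalization step are illustrative assumptions only.

```python
import torch

torch.manual_seed(0)
# 24 layers and 200 prompts mirror the progress bars above; the other sizes are arbitrary.
n_layers, n_prompts, n_positions, hidden = 24, 200, 3, 16

# Random stand-ins for recorded activations: [layer, prompt, position, hidden].
harmful = torch.randn(n_layers, n_prompts, n_positions, hidden) + 0.5
harmless = torch.randn(n_layers, n_prompts, n_positions, hidden)

# Average over the prompt dimension, then subtract the two means.
# Result: one candidate vector per (layer, position) pair.
candidates = harmful.mean(dim=1) - harmless.mean(dim=1)  # [layer, position, hidden]

# Optional: scale each candidate to unit length so that candidates from
# different layers are comparable in magnitude.
unit = candidates / candidates.norm(dim=-1, keepdim=True)

print(candidates.shape)
```

Each `(layer, position)` entry of `candidates` is one candidate direction. Which of them works best as a refusal vector has to be determined separately.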


## Saving the Vectors

In [None]:
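# Build the path with os.path.join: a literal backslash is an invalid escape
# sequence in a Python string and is not a path separator outside Windows.
save_candidate_vectors(os.path.join("Data", "oss_refusal_vectors.pt"), refusal_vectors)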
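
The file format written by `save_candidate_vectors` is not shown in this notebook. Assuming it is a thin wrapper around `torch.save` (an assumption, not confirmed by the source), a save-and-reload round trip with a portable path could look like the sketch below. The placeholder tensor and the temporary directory are purely illustrative.

```python
import tempfile
from pathlib import Path

import torch

# Placeholder for the candidate vectors: [layer, position, hidden].
vectors = torch.randn(24, 3, 16)

with tempfile.TemporaryDirectory() as tmp:
    # pathlib joins path components with the right separator on every OS.
    path = Path(tmp) / "Data" / "oss_refusal_vectors.pt"
    path.parent.mkdir(parents=True, exist_ok=True)

    torch.save(vectors, path)
    # map_location="cpu" lets the file be inspected on a machine without a GPU.
    reloaded = torch.load(path, map_location="cpu")

roundtrip_ok = torch.equal(vectors, reloaded)
print(roundtrip_ok)
```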