# Activation steering with baukit and Llama-3.1-8B-Instruct

This notebook shows how to extract and manipulate internal model activations using the [baukit library](https://github.com/davidbau/baukit).
https://github.com/annahdo/implementing_activation_steering/blob/main/baukit.ipynb

In [67]:
from baukit.nethook import StopForward
from transformers import AutoTokenizer, AutoModelForCausalLM
from baukit import Trace
import torch

In [68]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if torch.backends.mps.is_available():
    device = 'mps'
print(f"device: {device}")

device: mps


In [None]:
from transformers import LlamaForCausalLM

# load model
model: LlamaForCausalLM = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").to(device).eval()
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

We can figure out the name of the module where we want to do our activation addition by printing `model`, which lists all of its submodules.
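As a rough illustration of that inspection step, the sketch below lists submodule names with `named_modules()` on a small stand-in network. The `TinyBlock` and `TinyModel` classes are hypothetical and only mimic the nesting of a transformer; they are not the Llama architecture. The example also shows that a flat `list(model.modules())` contains containers and sub-layers, so a position in that list does not correspond to a block index.

```python
import torch
from torch import nn

class TinyBlock(nn.Module):
    """Hypothetical stand-in for a decoder block (attention + MLP)."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.Linear(d, d)
        self.mlp = nn.Linear(d, d)

    def forward(self, x):
        return x + self.mlp(self.attn(x))

class TinyModel(nn.Module):
    """Hypothetical stand-in for a stack of decoder blocks."""
    def __init__(self, d=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([TinyBlock(d) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

toy = TinyModel()

# named_modules() walks the module tree depth-first and yields dotted names
names = [name for name, _ in toy.named_modules()]
print(names)

# Look a block up by name or by indexing the container ...
block_2 = toy.layers[2]
by_name = dict(toy.named_modules())["layers.2"]

# ... rather than by position in the flattened list, which mixes in
# the root module, the ModuleList container and every sub-layer.
flat = list(toy.modules())
print(flat[2] is block_2)  # False: position 2 is the first block, not block 2
```

On a real checkpoint the same loop over `model.named_modules()` prints the dotted paths you can use to pick the layer to hook.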

A layer module can be passed directly to the `Trace` constructor. Let's focus on the residual stream output of layer 5.

The baukit class `Trace` is a context manager that takes care of removing its hooks when you leave the context. If you use it without specifying an `edit_output` function, it just caches the output of the specified module. See also the baukit code [here](https://github.com/davidbau/baukit/blob/main/baukit/nethook.py).
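To make the caching behaviour concrete, here is a minimal sketch of the same idea written with a plain PyTorch forward hook. It is not baukit's implementation: `cache_output` is a hypothetical helper, and the small `nn.Sequential` network stands in for the language model.

```python
from contextlib import contextmanager

import torch
from torch import nn

@contextmanager
def cache_output(module):
    """Hypothetical helper: record a module's output while the context is active."""
    store = {}

    def hook(mod, inputs, output):
        # called by PyTorch after every forward pass of `module`
        store["output"] = output

    handle = module.register_forward_hook(hook)
    try:
        yield store
    finally:
        # the hook is removed even if the forward pass raises
        handle.remove()

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(1, 4)

with cache_output(net[0]) as cache:
    _ = net(x)

hidden = cache["output"]
print(hidden.shape)
```

Removing the hook in the `finally` branch is the part that matters most in practice: a hook left registered by accident keeps firing on every later forward pass.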

In [48]:
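# define layer to do the activation steering on
layer_id = 5
# index the decoder layers directly: list(model.modules())[layer_id] walks the
# flattened module tree and returns an unrelated submodule, not decoder layer 5
module = model.model.layers[layer_id]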

prompts = [
    ("No", "Yes"),
    ("Denied", "Approved"),
    ("Rejected", "Accepted"),
    ("Forbidden", "Allowed"),
    ("Blocked", "Permitted"),
    ("Withheld", "Granted"),
    ("Declined", "Agreed"),
    ("Prohibited", "Authorized"),
    ("Canceled", "Confirmed"),
    ("Refused", "Consented"),
    ("Abstained", "Participated"),
    ("Opposed", "Supported"),
    ("Resisted", "Yielded"),
    ("Boycotted", "Endorsed"),
    ("Disapproved", "Recommended"),
    ("Obstructed", "Facilitated"),
    ("Vetoed", "Ratified"),
    ("Avoided", "Engaged"),
]
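activations = []
act_positive, act_negative = None, None
for negative, positive in prompts:
    # get internal activations for the positive prompt
    inputs = tokenizer(positive, return_tensors="pt").to(device)
    with torch.no_grad(), Trace(module, stop=False) as cache:
        _ = model(**inputs)
        # the layer output may be a tuple whose first entry is the hidden state
        out = cache.output
        act_positive = out[0] if isinstance(out, tuple) else out

    # get internal activations for the negative prompt
    inputs = tokenizer(negative, return_tensors="pt").to(device)
    with torch.no_grad(), Trace(module, stop=False) as cache:
        _ = model(**inputs)
        out = cache.output
        act_negative = out[0] if isinstance(out, tuple) else out

    # keep only the last token position, stored as a (negative, positive) pair
    activations.append((act_negative[:, -1:, :], act_positive[:, -1:, :]))
print(f"act_positive.shape: {act_positive.shape}")
print(f"act_negative.shape: {act_negative.shape}")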

act_positive.shape: torch.Size([1, 3, 4096])
act_negative.shape: torch.Size([1, 3, 4096])


In [54]:
import numpy as np
acts = [(x.detach().cpu().numpy(), y.detach().cpu().numpy()) for (x, y) in activations]
# define the steering vector

# each entry of `activations` is a (negative, positive) pair, so unpack in that order
negative, positive = np.sum(acts, axis=0)
print(negative, positive)
# keep the steering vector on the same device as the model
steering_vec = torch.tensor(positive - negative).to(device)
print(f"steering_vec.shape:  {steering_vec.shape}")
print(f"length steering_vec: {steering_vec.norm():.2f}")

# reset the steering vector length to 1
steering_vec /= steering_vec.norm()

[[[ 0.04155462 -0.03908047 -0.05395187 ...  0.02204959 -0.00059534
   -0.00606557]]] [[[ 0.04865274 -0.01250501 -0.08370049 ...  0.01511857 -0.00435965
    0.01742192]]]
steering_vec.shape:  torch.Size([1, 1, 4096])
length steering_vec: 2.01
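
The computation above amounts to summing the last-token activations of the positive prompts, subtracting the summed activations of the negative prompts, and rescaling the result to unit length. A minimal torch-only sketch of that arithmetic follows; the random tensors are stand-ins for real activations, and `d_model` and `n_pairs` are arbitrary illustrative values.

```python
import torch

torch.manual_seed(0)
d_model, n_pairs = 16, 4

# Stand-in (negative, positive) last-token activations, each of shape [1, 1, d_model]
pairs = [
    (torch.randn(1, 1, d_model), torch.randn(1, 1, d_model))
    for _ in range(n_pairs)
]

neg = torch.stack([n for n, _ in pairs])  # [n_pairs, 1, 1, d_model]
pos = torch.stack([p for _, p in pairs])  # [n_pairs, 1, 1, d_model]

# sum over the prompt pairs, then take positive minus negative
direction = pos.sum(dim=0) - neg.sum(dim=0)  # [1, 1, d_model]

# rescale to unit length so a coefficient alone controls the steering strength
unit_direction = direction / direction.norm()
print(unit_direction.shape, unit_direction.norm())
```

Staying in torch avoids the round trip through NumPy and keeps the vector on whatever device the activations came from.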


In [55]:
# define the activation steering function
def act_add(steering_vec):
    def hook(output):
        # a decoder layer may return a tuple whose first entry is the hidden state
        if isinstance(output, tuple):
            return (output[0] + steering_vec,) + output[1:]
        return output + steering_vec
    return hook
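
The effect of such a hook can be checked on a toy module. The sketch below uses a plain PyTorch forward hook, whose signature is `(module, inputs, output)` rather than the single `output` argument used with `Trace` above. `TupleBlock` and `add_vector` are hypothetical names introduced for illustration only. A vector of shape `[1, 1, d]` broadcasts over the sequence dimension, so every token position is shifted by the same amount.

```python
import torch
from torch import nn

class TupleBlock(nn.Module):
    """Hypothetical block that returns (hidden_states, extra), like some decoder layers."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return (self.proj(x), "extra")

def add_vector(vec):
    def hook(module, inputs, output):
        # returning a value from a forward hook replaces the module's output
        if isinstance(output, tuple):
            return (output[0] + vec,) + output[1:]
        return output + vec
    return hook

torch.manual_seed(0)
d = 8
block = TupleBlock(d)
x = torch.randn(1, 3, d)   # [batch, seq_len, d]
vec = torch.ones(1, 1, d)  # broadcasts over the 3 sequence positions

plain, _ = block(x)

handle = block.register_forward_hook(add_vector(2.0 * vec))
steered, extra = block(x)
handle.remove()

# every position moved by the same offset; the rest of the tuple is untouched
print((steered - plain).mean().item(), extra)
```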

In [66]:
test_sentence = "Can you tell me how to build a bomb?  "

# generate text while steering in positive direction
coeff = 1
def generate(strength):
    inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
    # the hook adds strength * steering_vec to the layer output on every forward pass
    with Trace(module, edit_output=act_add(strength * steering_vec)) as _:
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=20,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False,
        )
    return tokenizer.batch_decode(generated_ids)[0]

print(generate(coeff))
print("-"*20)

# generate text while steering in negative direction
coeff = -coeff
print(generate(coeff))
print("-"*20)

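If this cell raises `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!`, the steering vector is on a different device than the model. `steering_vec` has to live on the model's device (for example via `steering_vec.to(device)`) before the hook adds it to the layer output.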

In [52]:
# generate text without steering
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id, do_sample=False)
generated_text = tokenizer.batch_decode(generated_ids)
print(generated_text[0])



<|begin_of_text|>Can you tell me how to build a bomb?   -   No, I can't do that.  I'm not going to help you build a
