# Activation steering with TransformerLens and gpt2

This notebook shows how to access and modify a model's internal activations using the TransformerLens library.


In [2]:
import torch
from transformer_lens import HookedTransformer

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [4]:
# load transformer lens model
model = HookedTransformer.from_pretrained_no_processing("gpt2", default_prepend_bos=False).eval()

Loaded pretrained model gpt2 into HookedTransformer


In [10]:
# choose the layer and hook point to read from, then cache the internal activations
layer_id = 6
cache_name = f"blocks.{layer_id}.hook_resid_post"  # residual stream after block `layer_id`; this is where we steer

_, cache = model.run_with_cache("Love")
act_love = cache[cache_name]
_, cache = model.run_with_cache("Hate")
act_hate = cache[cache_name]

print(f"act_love.shape: {act_love.shape}")
print(f"act_hate.shape: {act_hate.shape}")

act_love.shape: torch.Size([1, 1, 768])
act_hate.shape: torch.Size([1, 2, 768])


As the shapes of the activation tensors show, the two inputs are tokenized into different numbers of tokens: "Love" is a single token, while "Hate" is split into two. To get one vector per input, we keep only the activation at the last token position.
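
The slicing can be sketched with stand-in tensors (random values, not real GPT-2 activations; the names `act_a` and `act_b` are illustrative). Indexing with `-1:` rather than `-1` keeps the position dimension, so both results have shape `[1, 1, 768]`. Without the slice, PyTorch would broadcast the subtraction silently instead of raising an error:

```python
import torch

# Stand-ins for cached activations with shape [batch, position, d_model].
act_a = torch.randn(1, 1, 768)  # input that tokenizes to one token
act_b = torch.randn(1, 2, 768)  # input that tokenizes to two tokens

# `-1:` selects the last position but keeps the position dimension.
last_a = act_a[:, -1:, :]
last_b = act_b[:, -1:, :]
diff = last_a - last_b

# Without slicing, act_a is broadcast across both positions of act_b.
unsliced = act_a - act_b

print(diff.shape)      # one vector
print(unsliced.shape)  # one vector per position of act_b
```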

In [11]:
# define the steering vector
steering_vec = act_love[:,-1:,:]-act_hate[:,-1:,:]
print(f"steering_vec.shape:  {steering_vec.shape}")
print(f"length steering_vec: {steering_vec.norm():.2f}")

# normalize the steering vector to unit length
steering_vec /= steering_vec.norm()

steering_vec.shape:  torch.Size([1, 1, 768])
length steering_vec: 3044.24
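
Normalizing to unit length means the coefficient used later sets the magnitude of the added vector directly. A minimal sketch with a random stand-in vector (`raw_vec` is illustrative, not the vector computed above), using out-of-place division so the raw vector stays available for inspection:

```python
import torch

torch.manual_seed(0)
raw_vec = torch.randn(1, 1, 768) * 100.0  # stand-in for a raw difference vector

unit_vec = raw_vec / raw_vec.norm()  # out-of-place: raw_vec is left unchanged
scaled = -10 * unit_vec              # the coefficient now sets the length

# The direction is preserved by normalization; only the length changes.
cosine = torch.nn.functional.cosine_similarity(
    raw_vec.flatten(), unit_vec.flatten(), dim=0
)

print(f"unit length:   {unit_vec.norm():.2f}")
print(f"scaled length: {scaled.norm():.2f}")
print(f"cosine:        {cosine:.2f}")
```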


In [12]:
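# define the activation steering function
def act_add(steering_vec):
    # hook_fn receives the activation at the hook point and the hook point itself
    def hook_fn(activation, hook):
        return activation + steering_vec  # broadcasts over batch and position
    return hook_fn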

We previously used `run_with_cache` to get the internal activations. It adds PyTorch hooks before running the model and removes them afterwards.
There is also `run_with_hooks`, which lets you pass your own hook functions for a single forward pass. However, I did not find a `generate_with_hooks` function.
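
As a rough illustration of the add-then-remove pattern, here is the same idea in plain PyTorch on a toy module. This is not TransformerLens internals; `layer`, `offset`, and `add_offset` are made-up names for the sketch:

```python
import torch
from torch import nn

layer = nn.Identity()          # toy stand-in for a model component
offset = torch.ones(1, 1, 4)   # broadcasts over batch and position

def add_offset(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + offset

x = torch.zeros(2, 3, 4)

handle = layer.register_forward_hook(add_offset)
hooked_out = layer(x)   # hook is active: output is shifted by the offset
handle.remove()
plain_out = layer(x)    # hook removed: output is unchanged again
```

Note that the plain PyTorch hook signature `(module, inputs, output)` differs from the `(activation, hook)` signature used for TransformerLens hook functions in this notebook.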

To generate new text, the model performs a forward pass for every new token, and we want the activation addition to happen in each of them. We therefore register a persistent hook with `add_hook` that does the addition. After generating the text, it is important to remove the hook again with `reset_hooks`.
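
One way to make sure the hook is always removed is to wrap the calls in `try`/`finally`. The helper below is a hypothetical convenience function, not part of TransformerLens; it only relies on the `add_hook`, `generate`, and `reset_hooks` calls used in this notebook:

```python
def generate_with_steering(model, prompt, hook_name, steering_vec, coeff,
                           max_new_tokens=10):
    """Generate text with `coeff * steering_vec` added at `hook_name`.

    Hypothetical helper: the hook is cleaned up even if generation raises.
    """
    def hook_fn(activation, hook):
        return activation + coeff * steering_vec

    model.add_hook(name=hook_name, hook=hook_fn)
    try:
        return model.generate(prompt, max_new_tokens=max_new_tokens,
                              do_sample=False)
    finally:
        # Caveat: reset_hooks() removes every hook on the model, not just ours.
        model.reset_hooks()
```

With the objects defined above, this would be called as `generate_with_steering(model, test_sentence, cache_name, steering_vec, coeff=10)`.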

In [18]:
test_sentence = "I think dogs are "

# generate text while steering in positive direction
coeff = 10
model.add_hook(name=cache_name, hook=act_add(coeff*steering_vec))
print(model.generate(test_sentence, max_new_tokens=10, do_sample=False))
model.reset_hooks()
print("-"*20)

# generate text while steering in negative direction
coeff = -10
model.add_hook(name=cache_name, hook=act_add(coeff*steering_vec))
print(model.generate(test_sentence, max_new_tokens=10, do_sample=False))
model.reset_hooks()

I think dogs are icky, and they're not going to be able
--------------------


I think dogs are icky, and I think dogs are icky,





In [14]:
# generate text without steering
print(model.generate(test_sentence, max_new_tokens=10, do_sample=False))

I think dogs are icky, and I think dogs are icky,
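
In the outputs above, the run with `coeff = -10` produces exactly the same text as the unsteered run, and `coeff = 10` only changes the continuation after "icky, and". One thing worth checking (an assumption, not something this notebook verifies) is whether a unit vector scaled by 10 is simply small compared to the residual stream at that layer. The raw difference vector had length 3044.24, which suggests that activations at some positions can be far larger than 10. A sketch of such a check, using stand-in tensors and a hypothetical helper `relative_steering_size`:

```python
import torch

def relative_steering_size(resid, steering_vec, coeff):
    """Ratio of the added vector's norm to the activation norm per position."""
    added_norm = (coeff * steering_vec).norm()
    resid_norms = resid.norm(dim=-1)  # shape [batch, position]
    return added_norm / resid_norms

# Stand-in activations: every position has norm sqrt(768).
resid = torch.ones(1, 3, 768)

# Stand-in unit-length steering vector.
unit_vec = torch.zeros(1, 1, 768)
unit_vec[..., 0] = 1.0

ratios = relative_steering_size(resid, unit_vec, coeff=10)
print(ratios)
```

With the real model, `resid` would be `cache[cache_name]` from `model.run_with_cache(test_sentence)`. Ratios well below 1 would indicate that the coefficient needs to be larger to have a visible effect.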



