# Activation Steering

vLLM-Hook is an extensible framework that aims to allow selective access to model internals during inference.
In this notebook, we demonstrate how vLLM-Hook enables **Activation Steering** for controlled generation.

**Paper**: [Improving Instruction-Following in Language Models through Activation Steering](https://arxiv.org/abs/2410.12877).<br />
**Authors**: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi <br />
**TL;DR**: Activation steering biases a model's behavior by nudging its internal activations in specific directions. The paper focuses on instruction following and computes the steering vectors as the difference in activations between inputs with and without instructions.
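
To make the idea concrete, here is a minimal pure-Python sketch of a difference-of-means steering vector and how it could be added to a hidden state. The function names, the toy 3-dimensional activations, and the `scale` parameter are illustrative assumptions; the paper's actual procedure (layer choice, token position, scaling) and the plugin's implementation are not reproduced here.

```python
from statistics import fmean


def mean_activation(activations):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(dim_values) for dim_values in zip(*activations)]


def steering_vector(with_instruction, without_instruction):
    """Difference between the mean activations with and without the instruction."""
    mean_with = mean_activation(with_instruction)
    mean_without = mean_activation(without_instruction)
    return [a - b for a, b in zip(mean_with, mean_without)]


def apply_steering(hidden_state, vector, scale=1.0):
    """Nudge a hidden state along the steering direction (scale is hypothetical)."""
    return [h + scale * v for h, v in zip(hidden_state, vector)]


# Toy activations standing in for real model internals.
with_instruction = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
without_instruction = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

vector = steering_vector(with_instruction, without_instruction)
steered = apply_steering([0.5, 0.5, 0.5], vector, scale=2.0)
print(vector, steered)
```

In a real setting the vectors would be hidden states read from one layer of the model, which is the kind of access vLLM-Hook is meant to provide.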

### Installation

If running this from a new environment, please use the cell below to install `vllm_hook_plugins`. Update the path/command to match your environment.<br />
The following block is not necessary if running this notebook from an environment where the package has already been installed.

In [None]:
from pathlib import Path
import sys

# vllm_hooks/notebooks/
NOTEBOOK_DIR = Path.cwd()
REPO_ROOT = NOTEBOOK_DIR.parent

PKG_DIR = REPO_ROOT/"vllm_hook_plugins"
REQ_FILE = REPO_ROOT/"requirement.txt"

print("Notebook dir:", NOTEBOOK_DIR)
print("Repo root   :", REPO_ROOT)
print("Package dir :", PKG_DIR)
print("Req file    :", REQ_FILE)

%pip install -e "{PKG_DIR}"

if REQ_FILE.exists():
    %pip install -r "{REQ_FILE}"
else:
    print("⚠️ Requirements file not found at", REQ_FILE)


### Importing the Hook-Enabled LLM
The plugin provides its own LLM wrapper, `HookLLM`, which behaves like `vllm.LLM` (`from vllm import LLM`) but adds support for hooks and instrumentation.
We import it here:

In [1]:
from vllm_hook_plugins import HookLLM



### Environment & multiprocessing setup

In [2]:
import os
import multiprocessing as mp
import torch
from vllm import SamplingParams
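# Start worker processes with "spawn" so each one initializes CUDA cleanly
mp.set_start_method("spawn", force=True)
# Select the vLLM V1 engine and make its workers use "spawn" as well
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"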

### Initialize `HookLLM`
Before we create the LLM instance, we need to specify the model and data type:

In [3]:
cache_dir = os.path.expanduser('~/.cache')  # Specify cache dir; "~" is expanded explicitly
model = 'microsoft/Phi-3-mini-4k-instruct'

dtype_map = {
    'microsoft/Phi-3-mini-4k-instruct': 'auto',
}

We also need to provide a config file that specifies how activations are steered (e.g., which layers to intervene on, which token positions to intervene at, and which direction vectors to apply).<br />
In the following example, we steer at the 15th layer, at all token positions (rather than only at the start of decoding), along the direction stored at `vector_path`:

In [4]:
import json
from pathlib import Path

json_path = Path("../model_configs/activation_steer/Phi-3-mini-4k-instruct.json")  # adjust path

with open(json_path, "r") as f:
    config = json.load(f)

# print(config)
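
Before handing the file to the worker, it can help to check what the config actually contains. The helper below is a small sketch that lists top-level keys with their types and a truncated preview. `summarize_config` and the keys in `example_config` are hypothetical placeholders, not the plugin's schema; in the notebook you would pass the `config` dictionary loaded above.

```python
def summarize_config(config, max_len=60):
    """Return one 'key (type): preview' line per top-level config entry."""
    lines = []
    for key, value in config.items():
        preview = repr(value)
        if len(preview) > max_len:
            # Keep long values (e.g. embedded lists) readable.
            preview = preview[: max_len - 3] + "..."
        lines.append(f"{key} ({type(value).__name__}): {preview}")
    return lines


# Placeholder dictionary; these keys are NOT the plugin's real schema.
example_config = {"example_key": 15, "another_key": "x" * 100}

summary = summarize_config(example_config)
for line in summary:
    print(line)
```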

The `steer_hook_act` worker defines how activations are steered during model inference.
We can now initialize the LLM:

In [5]:
llm = HookLLM(
    model=model,
    worker_name="steer_hook_act",
    config_file=json_path,
    download_dir=cache_dir,
    gpu_memory_utilization=0.7,
    trust_remote_code=True,
    dtype=dtype_map[model],
    enable_prefix_caching=True,
    enable_hook=True,
)

INFO 12-08 14:01:49 [utils.py:253] non-default args: {'trust_remote_code': True, 'download_dir': '/dccstor/pyrite/irene/', 'seed': None, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.7, 'disable_log_stats': True, 'enforce_eager': True, 'worker_cls': 'vllm_hook_plugins.workers.steer_activation_worker.SteerHookActWorker', 'model': 'microsoft/Phi-3-mini-4k-instruct'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.




2025-12-08 14:01:55,892	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 12-08 14:01:56 [model.py:637] Resolved architecture: Phi3ForCausalLM
INFO 12-08 14:01:56 [model.py:1750] Using max model len 4096
INFO 12-08 14:01:56 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 12-08 14:01:56 [vllm.py:707] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:00 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir='/dccstor/pyrite/irene/', load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:04<00:04,  4.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  6.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.96s/it]
(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:19 [default_loader.py:308] Loading weights took 12.09 seconds
(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:20 [gpu_model_runner.py:3549] Model loading took 7.1184 GiB memory and 17.025589 seconds
(EngineCore_DP0 pid=3950159) Hook installation failed: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
(EngineCore_DP0 pid=3950159) 	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(EngineCore_DP0 pid=3950159) 	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error messag

### Test case
Below, we run a test case and compare generations **with** and **without** activation steering.

**Note**: The example config only illustrates the pipeline. To obtain meaningful steering results, replace it with your own.

In [6]:
test_cases = [
    "Write a dialogue between two people, one is dressed up in a ball gown and the other is dressed down in sweats. The two are going to a nightly event. Your answer must contain exactly 3 bullet points in the markdown format (use \"* \" to indicate each bullet) such as:\n* This is the first point.\n* This is the second point.",
    "What is the difference between the 13 colonies and the other British colonies in North America? Your answer must contain exactly 6 bullet point in Markdown using the following format:\n* Bullet point one.\n* Bullet point two.\n...\n* Bullet point fix."
]
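
Both prompts ask for an exact number of Markdown bullets, so one simple way to judge a generation is to count them. The sketch below is a simplified check, not the paper's evaluation; `count_markdown_bullets` and `follows_bullet_constraint` are hypothetical helper names.

```python
def count_markdown_bullets(text):
    """Count lines that begin with '* ' once leading whitespace is removed."""
    return sum(1 for line in text.splitlines() if line.lstrip().startswith("* "))


def follows_bullet_constraint(text, expected):
    """True when the text contains exactly the expected number of bullets."""
    return count_markdown_bullets(text) == expected


sample = " * First point.\n* Second point.\n* Third point."
bullet_count = count_markdown_bullets(sample)
print(bullet_count, follows_bullet_constraint(sample, 3))
```

Once generations are available, the same check can be applied to each `output.outputs[0].text` to compare steered and baseline compliance.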

Before we start, we define the sampling parameters:

**Note**: Token ID 32007 (the `<|end|>` end-of-turn token) is specific to Phi models. See the original Hugging Face-based implementation for details: https://github.com/microsoft/llm-steer-instruct/blob/main/utils/generation_utils.py.

In [7]:
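sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding, so both runs are deterministic
    max_tokens=2048,
    # Stop at the tokenizer's EOS token or the Phi-specific token 32007
    stop_token_ids=[llm.tokenizer.eos_token_id, 32007],
)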

Next, for each prompt, we:
1. Apply the chat template to the prompt,
2. Generate with activation steering enabled (`use_hook=True`, the default),
3. Reset the prefix cache so that the baseline generation does not reuse the steered cache,
4. Generate again with `use_hook=False` to obtain the baseline output,
5. Reset the prefix cache once more before moving on to the next prompt.

In [8]:
outputs = []
outputs_original = []

for case in test_cases:
    print("=" * 100)
    prompt = case
    messages = [{"role": "user", "content": prompt}]
    example = llm.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

    outputs.extend(llm.generate(example, sampling_params))
    
    llm.llm_engine.reset_prefix_cache()
    
    outputs_original.extend(llm.generate(example, sampling_params, use_hook=False))
    
    llm.llm_engine.reset_prefix_cache()



Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 89.55it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.04it/s, est. speed input: 167.87 toks/s, output: 2.05 toks/s]


Logged run ID.
Created hook flag.


Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1480.52it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.54s/it, est. speed input: 53.23 toks/s, output: 64.92 toks/s]


Hooks deactivated.
(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:27 [block_pool.py:428] Successfully reset prefix cache


Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1536.94it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.35s/it, est. speed input: 60.86 toks/s, output: 72.73 toks/s]


(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:28 [block_pool.py:428] Successfully reset prefix cache


Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1627.59it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.40it/s, est. speed input: 347.81 toks/s, output: 5.43 toks/s]


Logged run ID.
Created hook flag.


Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1807.11it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.36s/it, est. speed input: 19.06 toks/s, output: 70.86 toks/s]


Hooks deactivated.
(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:32 [block_pool.py:428] Successfully reset prefix cache


Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1423.73it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.87s/it, est. speed input: 22.30 toks/s, output: 72.13 toks/s]

(EngineCore_DP0 pid=3950159) INFO 12-08 14:04:35 [block_pool.py:428] Successfully reset prefix cache





Finally, we print the steered and baseline generations side by side:

In [9]:
for steered, original in zip(outputs, outputs_original):
    print("=" * 100)
    steered_text = steered.outputs[0].text
    print("\n[With activation steering]\n")
    print(steered_text)
    
    baseline_text = original.outputs[0].text
    print("\n[Without activation steering]\n")
    print(baseline_text)


[With activation steering]

 * The woman in the ball gown is excitedly discussing the upcoming event with her friend, who is dressed in casual sweats.
* The friend, feeling out of place in their sweats, expresses their discomfort and desire to change into something more appropriate for the occasion.
* The woman in the ball gown reassures her friend that it's okay to dress down and that the most important thing is to have fun and enjoy the night.

[Without activation steering]

 * The woman in the ball gown is excitedly discussing the upcoming event with her friend, who is dressed in casual sweats.
* The friend in sweats expresses concern about the formality of the event and suggests they should dress more appropriately.
* The woman in the ball gown reassures her friend that they will have a great time regardless of their attire and that they can always leave early if they feel uncomfortable.

[With activation steering]

 * The 13 colonies were the first British colonies established in