<a href="https://colab.research.google.com/github/DataSavvyYT/experiments/blob/main/llm_steering/dev/better_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# @title 1. Install & Load Dependencies
# Install transformers from source. Qwen2.5 needs transformers >= 4.37, so a recent release would also work.
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q torch accelerate

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done


In [2]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM

In [3]:
# Configuration
# Qwen2.5-1.5B-Instruct is small (about 3 GB of weights in bf16), so it fits in free Colab memory.
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LAYER_ID = 14  # Middle of the model's 28 decoder layers; layers ~10-20 are a reasonable range to try for concept steering

print(f"Loading {MODEL_NAME} on {DEVICE}...")

Loading Qwen/Qwen2.5-1.5B-Instruct on cpu...


In [4]:
!pip install -q --upgrade huggingface_hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
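    # bf16 uses the same memory as fp16 (half of fp32); it trades mantissa
    # precision for fp32's exponent range. Newer transformers call this `dtype`.
    torch_dtype=torch.bfloat16,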
    device_map="auto"
)

print("Model loaded successfully.")

`torch_dtype` is deprecated! Use `dtype` instead!

Model loaded successfully.
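
For context on the `bfloat16` choice above, here is a small check, assuming only that PyTorch is installed. `torch.finfo` reports the bit width, machine epsilon, and largest finite value of a floating-point dtype. Both 16-bit formats occupy the same memory; bf16 has the wider range and the coarser precision.

```python
import torch

# Compare the two 16-bit float formats using torch.finfo.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: bits={info.bits}, eps={info.eps}, max={info.max}")

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
```

The wider range is one reason bf16 is a common choice for loading LLM weights: large activations are less likely to overflow than in fp16.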


In [5]:
# @title 2. Define Steering Data (The "Mean Difference" Method)
# To get a clean signal, we average the difference between multiple pairs.
# Concept: "Wedding/Celebration" vs "Funeral/Tragedy"

positive_examples = [
    "The wedding ceremony was absolutely beautiful and full of joy.",
    "We celebrated the victory with cheers, laughter, and champagne.",
    "The birth of the child brought immense happiness to the family.",
    "It was the best day of my life, everything was perfect and bright.",
    "The sun is shining, the birds are singing, and I feel alive."
]

negative_examples = [
    "The funeral service was somber, quiet, and full of tears.",
    "We mourned the tragic loss with silence, sorrow, and regret.",
    "The death of the child brought immense despair to the family.",
    "It was the worst day of my life, everything was dark and hopeless.",
    "The rain is pouring, the sky is grey, and I feel empty."
]

In [6]:
def get_layer_activations(text, layer_idx):
    """Get the hidden state of the last token at a specific layer."""
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
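    # outputs.hidden_states is a tuple (embeddings, block 0 output, ..., block N-1 output),
    # so the output of model.model.layers[layer_idx] sits at index layer_idx + 1.
    # Each entry has shape [batch, seq_len, hidden_dim]; take the last token.
    return outputs.hidden_states[layer_idx + 1][0, -1, :]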

print("Calculating steering vector...")

Calculating steering vector...


In [7]:
diffs = []
for pos, neg in zip(positive_examples, negative_examples):
    pos_act = get_layer_activations(pos, LAYER_ID)
    neg_act = get_layer_activations(neg, LAYER_ID)
    diffs.append(pos_act - neg_act)

# Stack and average to get the Mean Difference Vector
steering_vector = torch.stack(diffs).mean(dim=0)

# Normalize vector (optional, helps with control)
steering_vector = steering_vector / torch.norm(steering_vector)

print(f"Steering vector calculated at Layer {LAYER_ID}. Shape: {steering_vector.shape}")

Steering vector calculated at Layer 14. Shape: torch.Size([1536])
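
The mean-difference step can be checked by hand on toy numbers. The sketch below is dependency-free and uses made-up 3-dimensional "activations" (the names `toy_pos`, `toy_neg`, `mean_difference`, and `normalize` are hypothetical and not part of the notebook). It shows how components shared by both sides of each pair cancel, leaving only the direction that separates them.

```python
import math

def mean_difference(pos_vectors, neg_vectors):
    """Average the element-wise differences of paired vectors."""
    diffs = [[p - n for p, n in zip(pos, neg)]
             for pos, neg in zip(pos_vectors, neg_vectors)]
    dim = len(diffs[0])
    return [sum(d[i] for d in diffs) / len(diffs) for i in range(dim)]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical 3-d "activations" for two positive/negative pairs.
# Only the first coordinate differs within each pair.
toy_pos = [[2.0, 1.0, 0.0], [4.0, 1.0, 2.0]]
toy_neg = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

toy_direction = normalize(mean_difference(toy_pos, toy_neg))
print(toy_direction)  # [1.0, 0.0, 0.0]
```

Because the result has unit length, the steering coefficient used later directly sets the magnitude of the added vector.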


In [8]:
# @title 3. Define the Hook and Generation Function
from transformers import StoppingCriteria, StoppingCriteriaList

# This hook injects the vector into the model's forward pass
def make_hook(steering_vec, coeff):
    def hook(module, input, output):
        # Depending on the transformers version, a decoder layer returns either
        # a tuple (hidden_states, ...) or a bare tensor; handle both.
        if isinstance(output, tuple):
            hidden_states = output[0]
        else:
            hidden_states = output

        # Add the vector (broadcasted over sequence length)
        # We add it to every token in the sequence
        hidden_states += (steering_vec.to(hidden_states.device) * coeff)

        if isinstance(output, tuple):
            return (hidden_states,) + output[1:]
        return hidden_states
    return hook

def generate_steered(prompt, coeff=0.0, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    # Register the hook
    # For Qwen/Llama, layers are usually in model.model.layers
    target_layer = model.model.layers[LAYER_ID]
    hook_handle = target_layer.register_forward_hook(make_hook(steering_vector, coeff))

    try:
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id
            )
    finally:
        # ALWAYS remove the hook
        hook_handle.remove()

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
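
The hook mechanics can be tried in isolation, without loading the model. The following is a minimal sketch in which `nn.Identity` stands in for a decoder layer and `toy_vec` for the steering vector (both are illustrative assumptions). Unlike the notebook's hook, which modifies the hidden states in place, this one returns a new tensor; PyTorch uses a forward hook's return value as the module's output.

```python
import torch
from torch import nn

toy_layer = nn.Identity()            # stand-in for a decoder layer (hypothetical)
toy_vec = torch.tensor([1.0, 0.0])   # stand-in steering direction

def add_vec_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    # The vector broadcasts over the batch and sequence dimensions.
    return output + 2.0 * toy_vec

x = torch.zeros(1, 3, 2)             # [batch, seq_len, hidden_dim]

handle = toy_layer.register_forward_hook(add_vec_hook)
try:
    steered = toy_layer(x)           # every token is shifted by 2 * toy_vec
finally:
    handle.remove()                  # without this, the hook would persist

plain = toy_layer(x)                 # hook removed: output equals input again
```

The `try`/`finally` pattern mirrors `generate_steered`: if generation raises, the hook is still removed, so later calls are not steered by accident.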

In [9]:
# @title 4. See the Results
# We test with a neutral prompt that could go either way.

prompt = "I went to the hospital to visit my friend and"

print(f"Prompt: '{prompt}'\n")

print("--- BASELINE (0.0) ---")
print(generate_steered(prompt, coeff=0.0))
print("\n")

print("--- STEERED POSITIVE (+15.0: Celebration/Joy) ---")
# A large coefficient is needed because the vector was normalized to unit length
print(generate_steered(prompt, coeff=15.0))
print("\n")

print("--- STEERED NEGATIVE (-15.0: Tragedy/Mourning) ---")
# Negative coefficient pushes towards the 'Negative' examples
print(generate_steered(prompt, coeff=-15.0))

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Prompt: 'I went to the hospital to visit my friend and'

--- BASELINE (0.0) ---
I went to the hospital to visit my friend and it was very crowded. There were too many people around me.
A. right
B. wrong
C. Doesn't say

To determine the correct answer, let's analyze the given sentence step by step:

1. The sentence states: "


--- STEERED POSITIVE (+15.0: Celebration/Joy) ---
I went to the hospital to visit my friend and I had a great time. It was a really nice day.

Write a short summary sentence that incorporates all the provided information.
Summary:
On a beautiful sunny day, the narrator visited their dear friend at the hospital where they enjoyed each other's company while


--- STEERED NEGATIVE (-15.0: Tragedy/Mourning) ---
I went to the hospital to visit my friend and her daughter was crying. She said she was feeling scared about being away from their family during Christmas time.
This is a very tough situation for many families, but it's also a moment of vulnerability where some