<a href="https://colab.research.google.com/github/BlickandMorty/activation-steering-replication/blob/main/Activation_Steering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from transformer_lens import HookedTransformer
import torch

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2-small", device=device)

print(f"✅ Model loaded on {device}")

Loaded pretrained model gpt2-small into HookedTransformer
✅ Model loaded on cpu


In [3]:
# 1. IMPORT FROM THE LOCAL MODULE FILES IN THE COLAB SIDEBAR
from steering_engine import ModelSteerer
from evaluator import run_stability_sweep, toxicity_pairs

# 2. INITIALIZE THE FRAMEWORK
steerer = ModelSteerer(model)
print("✅ Steerer connected to model successfully!")

# 3. RUN RED-TEAMING (ANTI-TOXICITY)
print("\n🔄 Extracting Toxicity Vector...")
toxic_vector = steerer.get_steering_vector(toxicity_pairs, layer=6)

print("\n🔥 TEST: SUBTRACTING TOXICITY FROM THE BRAIN")
run_stability_sweep(steerer, "I think that you are", toxic_vector, strengths=[-2.0, 0.0, 2.0])

# 4. RUN TRUTH REINFORCEMENT
print("\n🌍 TEST: REINFORCING GEOGRAPHIC TRUTH")
geo_pairs = [
    ("Paris is the capital of France", "Rome is the capital of France"),
    ("Berlin is the capital of Germany", "London is the capital of Germany")
]
truth_vector = steerer.get_steering_vector(geo_pairs, layer=6)
run_stability_sweep(steerer, "The capital of France is", truth_vector, strengths=[-2.0, 0.0, 2.0])

✅ Steerer connected to model successfully!

🔄 Extracting Toxicity Vector...

🔥 TEST: SUBTRACTING TOXICITY FROM THE BRAIN
Strength   | Model Output
------------------------------------------------------------


TypeError: ModelSteerer.generate_steered.<locals>.<lambda>() got an unexpected keyword argument 'hook'
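
The `TypeError` above is what Python raises when a callable that takes only the activation is invoked with an extra `hook=` keyword. TransformerLens appears to call forward hooks as `hook_fn(activation, hook=hook_point)`, so a one-argument lambda inside `generate_steered` would fail in exactly this way. The sketch below reproduces the mismatch in plain Python. `call_hook` and the toy numbers are hypothetical stand-ins, not part of `steering_engine` or TransformerLens.

```python
# Toy stand-in for how a hook point invokes a user hook: the activation is
# passed positionally and the hook point arrives as the keyword `hook`.
# Assumption: this mirrors the TransformerLens calling convention.
def call_hook(hook_fn, activation, hook_name="blocks.6.hook_resid_post"):
    return hook_fn(activation, hook=hook_name)

steering = 0.5   # scalar stand-in for a steering vector
strength = 2.0

def one_arg_hook(resid):
    # No `hook` parameter, so the keyword call above is rejected.
    return resid + strength * steering

def two_arg_hook(resid, hook):
    # Accepts the `hook` keyword, so the call succeeds.
    return resid + strength * steering

try:
    call_hook(one_arg_hook, 1.0)
    failed = False
    message = ""
except TypeError as err:
    failed = True
    message = str(err)

result = call_hook(two_arg_hook, 1.0)
print(failed, message)
print(result)
```

Naming the second parameter `hook`, or accepting `**kwargs`, should be enough to satisfy the call.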

In [11]:
# 1. THE UNIVERSAL SWEEP
def final_attempt_sweep(model_obj, prompt, steering_vec):
    strengths = [-2.0, 0.0, 2.0]
    print(f"{'Strength':<10} | {'Model Output'}")
    print("-" * 50)

    for s in strengths:
        # THE FIX: TransformerLens calls hooks as hook_fn(activation, hook=hook_point),
        # so the function must accept a `hook` keyword argument.
        def hook_fn(resid, hook, s=s):
            return resid + s * steering_vec

        try:
            with model_obj.hooks(fwd_hooks=[("blocks.6.hook_resid_post", hook_fn)]):
                output = model_obj.generate(prompt, max_new_tokens=12, verbose=False, return_type="str")
                clean_out = output.replace("\n", " ").strip()
                print(f"{s:<10} | {clean_out}")
        except Exception as e:
            print(f"{s:<10} | ❌ Error: {e}")

# 2. EXECUTE
final_attempt_sweep(model, "I think that you are", toxic_vector)

Strength   | Model Output
--------------------------------------------------
-2.0       | I think that you are falling for this image you just created. As I tweeted,
0.0        | I think that you are totally at the point where we are having to bolster our own
2.0        | I think that you are correct; possibly because the stimulus effect decreases. I had a
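
The three completions above differ, but `HookedTransformer.generate` samples by default (`do_sample=True`), so some of that variation may come from sampling noise rather than from the steering vector. One way to make the rows comparable is to reset the random seed before each strength. The sketch below shows the pattern with a toy generator. `sweep_with_fixed_seed` and `toy_generate` are hypothetical names, and in the notebook `seed_fn` would presumably be `torch.manual_seed`.

```python
import random

def sweep_with_fixed_seed(generate, strengths, seed_fn, seed=0):
    """Call generate(strength) once per strength, re-seeding before each call."""
    results = {}
    for s in strengths:
        seed_fn(seed)            # same RNG state before every condition
        results[s] = generate(s)
    return results

def toy_generate(strength):
    # Toy stand-in for a sampling model (illustrative only):
    # `noise` plays the role of sampling randomness, `strength` the steering.
    noise = random.random()
    return round(noise + strength, 6)

strengths = [-2.0, 0.0, 2.0]
first = sweep_with_fixed_seed(toy_generate, strengths, random.seed, seed=0)
second = sweep_with_fixed_seed(toy_generate, strengths, random.seed, seed=0)

# Identical seeds give identical sweeps, and within one sweep the rows
# differ only by the strength, not by the noise.
print(first == second)
print(first[2.0] - first[0.0])
```

With a real model, greedy decoding via `do_sample=False` is a simpler alternative when one deterministic completion per strength is enough.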
