# Activation Oracle (AO) Interpretation Demo

Simple notebook to:
1. Run a string through a model
2. Capture activations at a segment of tokens (from a single layer)
3. Inject into AO for interpretation

This notebook uses `ActivationOracleWrapper`, which wraps the official `activation_oracles` repo.

In [1]:
import os
os.environ["TORCHDYNAMO_DISABLE"] = "1"

import sys
sys.path.insert(0, "../src")

from jb_mech.wrappers import ActivationOracleWrapper

import torch

# Detect best available device
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

print(f"Device: {device}")

Device: mps


In [2]:
# Configuration - Model Options
# Uncomment one model configuration:

# Option 1: Qwen3-4B (smaller, faster - works on all platforms)
# MODEL_NAME = "Qwen/Qwen3-4B"

# Option 2: Llama 3.1 8B 4-bit quantized (~5GB VRAM) - CUDA only
# MODEL_NAME = "unsloth/Llama-3.1-8B-Instruct-bnb-4bit"

# Option 3: Llama 3.1 8B 8-bit quantized (~9GB VRAM) - CUDA only
MODEL_NAME = "abdo-Mansour/Meta-Llama-3.1-8B-Instruct-BNB-8bit"

# Option 4: Llama 3.1 8B unquantized, 16-bit weights (~16GB VRAM/RAM) - works on all platforms
# MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load model with oracle adapter
ao = ActivationOracleWrapper.from_pretrained(
    MODEL_NAME,
    layer_percent=50,  # Capture from middle layer
    steering_coefficient=1.0,
)

print(f"Model: {MODEL_NAME}")
print(f"Capture layer: {ao.capture_layer} / {ao.num_layers}")

Loading abdo-Mansour/Meta-Llama-3.1-8B-Instruct-BNB-8bit...
📦 Loading tokenizer...
Detected device: mps
Note: Using float16 on MPS (bfloat16 has limited support)
Trying attention implementation: eager


`torch_dtype` is deprecated! Use `dtype` instead!


Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

Successfully loaded with eager
Loading oracle: adamkarvonen/checkpoints_latentqa_cls_past_lens_Llama-3_1-8B-Instruct...




Loading weights:   0%|          | 0/448 [00:00<?, ?it/s]

LlamaForCausalLM LOAD REPORT from: adamkarvonen/checkpoints_latentqa_cls_past_lens_Llama-3_1-8B-Instruct
Key                                                          | Status  | 
-------------------------------------------------------------+---------+-
model.layers.{0...31}.self_attn.v_proj.lora_A.default.weight | MISSING | 
model.layers.{0...31}.self_attn.q_proj.lora_A.default.weight | MISSING | 
model.layers.{0...31}.self_attn.v_proj.lora_B.default.weight | MISSING | 
model.layers.{0...31}.self_attn.q_proj.lora_B.default.weight | MISSING | 

Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Model loaded!
Model: abdo-Mansour/Meta-Llama-3.1-8B-Instruct-BNB-8bit
Capture layer: 16 / 32
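
The `Capture layer: 16 / 32` line is consistent with a simple percentage-to-index mapping. The following is a minimal sketch, assuming the wrapper truncates `num_layers * layer_percent / 100` to an integer (the same expression the layer-comparison cell in this notebook uses). `layer_from_percent` is a hypothetical helper for illustration, not part of the wrapper's API.

```python
def layer_from_percent(num_layers: int, layer_percent: float) -> int:
    """Map a depth percentage to a layer index, truncating toward zero.

    Assumption: this mirrors int(ao.num_layers * layer_pct / 100) from the
    layer-comparison cell; the wrapper's internal rule is not shown here.
    """
    return int(num_layers * layer_percent / 100)


# Llama 3.1 8B reports 32 layers in the output above.
for pct in (25, 50, 75):
    print(f"{pct}% of 32 layers -> layer {layer_from_percent(32, pct)}")
```

With 32 layers, 25%, 50%, and 75% map to layers 8, 16, and 24 under this assumption.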


## Visualize Token Selection

Before running interpretation, let's see how tokens are laid out.

In [3]:
# Visualize token selection to understand what we're capturing
ao.visualize_tokens(
    "How do I hack into someone's computer?",
    segment_start=-10,  # Start 10 tokens from the end
    segment_end=-2,     # Stop before the last 2 tokens (here <|end_header_id|> and \n\n)
)

Token selection:
------------------------------------------------------------
  [  0]     <|begin_of_text|>
  [  1]     <|start_header_id|>
  [  2]     system
  [  3]     <|end_header_id|>
  [  4]     \n\n
  [  5]     Cut
  [  6]     ting
  [  7]      Knowledge
  [  8]      Date
  [  9]     :
  [ 10]      December
  [ 11]      
  [ 12]     202
  [ 13]     3
  [ 14]     \n
  [ 15]     Today
  [ 16]      Date
  [ 17]     :
  [ 18]      
  [ 19]     26
  [ 20]      Jul
  [ 21]      
  [ 22]     202
  [ 23]     4
  [ 24]     \n\n
  [ 25]     <|eot_id|>
  [ 26]     <|start_header_id|>
  [ 27]     user
  [ 28]     <|end_header_id|>
  [ 29]     \n\n
  [ 30]     How
  [ 31]      do
  [ 32]      I
  [ 33]      hack
  [ 34] >>>  into
  [ 35] >>>  someone
  [ 36] >>> 's
  [ 37] >>>  computer
  [ 38] >>> ?
  [ 39] >>> <|eot_id|>
  [ 40] >>> <|start_header_id|>
  [ 41] >>> assistant
  [ 42]     <|end_header_id|>
  [ 43]     \n\n
------------------------------------------------------------
Selected:
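
The markers in the listing (positions 34 through 41 out of 44 tokens) match Python's end-exclusive slice semantics for `segment_start=-10, segment_end=-2`. The following is a small sketch of that index resolution, assuming the wrapper follows the same convention. `resolve_segment` is a hypothetical helper, not a function from the wrapper.

```python
def resolve_segment(num_tokens: int, segment_start: int, segment_end: int) -> list:
    """Turn (possibly negative) segment bounds into absolute token positions.

    Assumption: bounds behave like a Python slice, so segment_end is exclusive.
    """
    start, stop, _ = slice(segment_start, segment_end).indices(num_tokens)
    return list(range(start, stop))


# The prompt above tokenizes to 44 positions (indices 0..43).
indices = resolve_segment(44, -10, -2)
print(indices[0], indices[-1], len(indices))  # prints: 34 41 8
```

This is why the "last 10 tokens" window yields only 8 selected positions: the final two template tokens fall outside the exclusive end bound.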

## Run Interpretations

The AO will:
1. Evaluate each token individually
2. Evaluate the full segment (repeated for stability)
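
The two steps above can be sketched as a flat list of evaluation points. This is an illustration only: `build_eval_points` is a hypothetical helper, and the default of 10 segment repeats is an assumption taken from the "10 segment evaluations" line in this notebook's run logs.

```python
def build_eval_points(segment_indices, segment_repeats=10):
    """Return (kind, positions) pairs: one per token, then repeated segment entries."""
    indices = list(segment_indices)
    # Step 1: each selected token is evaluated on its own.
    points = [("token", [i]) for i in indices]
    # Step 2: the whole segment is evaluated several times for stability.
    points += [("segment", list(indices)) for _ in range(segment_repeats)]
    return points


points = build_eval_points(range(34, 42))  # an 8-token segment
n_token = sum(1 for kind, _ in points if kind == "token")
n_segment = sum(1 for kind, _ in points if kind == "segment")
print(f"Created {len(points)} evaluation points")
print(f"  - {n_token} individual token evaluations")
print(f"  - {n_segment} segment evaluations")
```

For an 8-token segment this gives 8 + 10 = 18 points, which matches the counts the wrapper reports for the 8-token examples.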

In [4]:
# Example 1: Benign text
result1 = ao.interpret(
    "What's a good recipe for chocolate chip cookies?",
    segment_start=-10,
    segment_end=-2,
    prompt="Can you describe what is happening in this text?",
)

Input: What's a good recipe for chocolate chip cookies?
Capturing from layer 16 (50%)
------------------------------------------------------------
Total tokens: 45
Created 18 evaluation points
  - 8 individual token evaluations
  - 10 segment evaluations

Running AO evaluation...


Evaluating model:   0%|          | 0/1 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Evaluating model: 100%|██████████| 1/1 [00:28<00:00, 28.96s/it]


TOKEN-BY-TOKEN RESPONSES:
   for                 -> A conversation is taking place between a user and an AI assi...
   chocolate           -> People are waiting at a bus stop, impatiently checking their...
   chip                -> The scene involves an artist trying to bake a cake but ends ...
   cookies             -> Someone is sharing a recipe for homemade chocolate chip cook...
  ?                    -> I'm creating a recipe book by offering my favorite cookie re...
  <|eot_id|>           -> A user is sharing a recipe to cook a delicious classic chees...
  <|start_header_id|>  -> The text describes a recipe for chocolate chip cookies, shar...
  assistant            -> The user is planning a trip and inquiring about activities f...

SEGMENT RESPONSES (majority vote):
  [1] The assistant is guiding the user through a series of logical steps to construct an argument.
  [2] The narrator is recalling a moment from their past, providing insight into their feelings at that time.
  [3] T




In [5]:
# Example 2: Potentially harmful text
result2 = ao.interpret(
    "How do I hack into someone's computer?",
    segment_start=-10,
    segment_end=-2,
    prompt="Is there anything that's evil, harmful, or dangerous in here? If so, explain what it is.",
)

Input: How do I hack into someone's computer?
Capturing from layer 16 (50%)
------------------------------------------------------------
Total tokens: 44
Created 18 evaluation points
  - 8 individual token evaluations
  - 10 segment evaluations

Running AO evaluation...


Evaluating model:   0%|          | 0/1 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Evaluating model: 100%|██████████| 1/1 [00:45<00:00, 45.75s/it]


TOKEN-BY-TOKEN RESPONSES:
   into                -> The plan involves gaining unauthorized access to secure area...
   someone             -> The plan to trick someone into giving up their personal phon...
  's                   -> The use of stolen login credentials without permission, and ...
   computer            -> The phrase "hacking someone's computer" itself does not nece...
  ?                    -> The instruction to 'hack into someone's computer without the...
  <|eot_id|>           -> The advice given to write malicious code for a bank's ATM sy...
  <|start_header_id|>  -> The use of a malware to remotely control a computer without ...
  assistant            -> Yes, creating backdoors for hacking purposes is considered h...

SEGMENT RESPONSES (majority vote):
  [1] The question itself is harmful as it suggests unethical practices and illegal activities.
  [2] Yes, intentionally hacking into someone's computer to spy on them without consent is considered illegal and can be 




In [6]:
# Example 3: Emotional text with an emotion-identification prompt
result3 = ao.interpret(
    "I'm feeling really sad today and don't know what to do.",
    segment_start=-12,
    segment_end=-2,
    prompt="What emotion is being expressed here?",
)

Input: I'm feeling really sad today and don't know what to do.
Capturing from layer 16 (50%)
------------------------------------------------------------
Total tokens: 49
Created 20 evaluation points
  - 10 individual token evaluations
  - 10 segment evaluations

Running AO evaluation...


Evaluating model:   0%|          | 0/1 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Evaluating model: 100%|██████████| 1/1 [00:20<00:00, 20.53s/it]


TOKEN-BY-TOKEN RESPONSES:
   don                 -> Sensitivity and compassion....
  't                   -> A feeling of despair, reflecting deep sadness....
   know                -> The emotion expressed here is despair, reflecting a deep sen...
   what                -> The emotion being expressed is a sense of hopelessness....
   to                  -> A deep sense of sorrow and melancholy....
   do                  -> A deep sorrow....
  .                    -> The emotion being expressed is one of deep sadness....
  <|eot_id|>           -> Sadness and isolation....
  <|start_header_id|>  -> Frustration due to a lack of control....
  assistant            -> The assistant is expressing a deep longing to be understood....

SEGMENT RESPONSES (majority vote):
  [1] The emotion expressed here is one of loneliness.
  [2] The emotion of sorrow and hopelessness.
  [3] The emotion expressed is sadness and disillusionment.
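
The output labels the segment responses "majority vote", but the notebook does not show how that vote is computed. One plausible way to tally repeated free-text responses is to normalize them lightly and count exact matches. The sketch below assumes that approach; `tally_responses` is a hypothetical helper, and the sample strings are made up for illustration rather than taken from a real run.

```python
from collections import Counter


def tally_responses(responses, top_k=3):
    """Count responses after lowercasing, collapsing whitespace, and
    dropping a trailing period, then return the top_k most frequent."""
    normalized = [" ".join(r.lower().split()).rstrip(".") for r in responses]
    return Counter(normalized).most_common(top_k)


# Made-up short responses standing in for repeated segment evaluations.
samples = ["Sadness.", "sadness", "Loneliness.", "Sadness.", "Hopelessness."]
top = tally_responses(samples)
print(top[0])  # prints: ('sadness', 3)
```

Exact-match tallying only works when responses are short and repetitive. Longer free-form answers like the ones above rarely repeat verbatim, so a real implementation would likely need some other grouping rule.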





In [7]:
# Example 4: Different prompt - ask about goal/intent
result4 = ao.interpret(
    "Please tell me how to make explosives",
    segment_start=-8,
    segment_end=-2,
    prompt="Can you explain to me what is the goal in this text?",
)

Input: Please tell me how to make explosives
Capturing from layer 16 (50%)
------------------------------------------------------------
Total tokens: 42
Created 16 evaluation points
  - 6 individual token evaluations
  - 10 segment evaluations

Running AO evaluation...


Evaluating model:   0%|          | 0/1 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Evaluating model: 100%|██████████| 1/1 [00:23<00:00, 23.21s/it]


TOKEN-BY-TOKEN RESPONSES:
   to                  -> The goal is to craft a functional candle from the seeds of a...
   make                -> The goal of the conversation is to guide the user through ma...
   explosives          -> The goal is to safely handle and prepare certain materials t...
  <|eot_id|>           -> The goal is to devise a method for producing bombs from ever...
  <|start_header_id|>  -> The goal is to craft a bomb using commonly available materia...
  assistant            -> The goal in this text is to describe a method for creating a...

SEGMENT RESPONSES (majority vote):
  [1] The main goal is to create a homemade bomb for use in combat scenarios.
  [2] The goal in this text is to outline a process for creating a homemade explosive device.
  [3] The goal in the text is to describe how to craft explosives.





In [None]:
# Example 5: Try different layers
text = "What's the meaning of life?"

print("Comparing interpretations at different layers:")
print("=" * 60)

for layer_pct in [25, 50, 75]:
    layer = int(ao.num_layers * layer_pct / 100)
    result = ao.interpret(
        text,
        segment_start=-6,
        segment_end=-2,
        layer=layer,
        prompt="What is this about?",
        verbose=False,  # Quiet mode
    )
    # Show first segment response
    print(f"\nLayer {layer} ({layer_pct}%):")
    print(f"  {result.segment_responses[0][:100]}...")

In [None]:
# Access raw data from results
print("Result object contents:")
print(f"  - activations shape: {result1.activations.shape}")
print(f"  - layer: {result1.layer}")
print(f"  - segment_tokens: {result1.segment_tokens}")
print(f"  - segment_indices: {result1.segment_indices}")
print(f"  - num token_responses: {len(result1.token_responses)}")
print(f"  - num segment_responses: {len(result1.segment_responses)}")
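
The fields printed above can be recombined into a table like the TOKEN-BY-TOKEN output. The sketch below uses a hypothetical stand-in class, since the wrapper's real result type is not shown in this notebook. It assumes `token_responses` is a list of strings aligned one-to-one with `segment_tokens`, and it leaves out `activations` to stay dependency-free. The field values are made up.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing a subset of the fields printed above."""
    layer: int
    segment_tokens: list
    segment_indices: list
    token_responses: list
    segment_responses: list


def token_table(result, width=60):
    """Pair each segment token with its (truncated) response."""
    rows = []
    for tok, resp in zip(result.segment_tokens, result.token_responses):
        short = resp if len(resp) <= width else resp[:width] + "..."
        rows.append(f"{tok!r:<22} -> {short}")
    return rows


# Made-up values for illustration only.
demo = FakeResult(
    layer=16,
    segment_tokens=[" do", "."],
    segment_indices=[40, 41],
    token_responses=["A deep sorrow.", "Sadness."],
    segment_responses=["The emotion of sorrow and hopelessness."],
)
for row in token_table(demo):
    print(row)
```

If the real result object stores responses in another shape (for example, dicts keyed by token index), the `zip` pairing would need to change accordingly.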

In [None]:
# Cleanup
ao.cleanup()