# Automated Behavioral Correction Pipeline

This notebook demonstrates how to create a custom behavior, deploy a local model server that automatically monitors and corrects that behavior, and then use the standard OpenAI SDK to interact with it.

In [1]:
%pip install mechanex openai


Note: you may need to restart the kernel to use updated packages.


## 1. Setup and Model Loading
First, we load a lightweight local model (`qwen3-0.6b`) for demonstration purposes.

In [1]:
import mechanex as mx
import threading
import time
import os

# Load local model
print("Loading model...")
mx.unload()
mx.load_model("qwen3-0.6b")

Loading model...




Loading qwen3-0.6b locally...


`torch_dtype` is deprecated! Use `dtype` instead!


Loaded pretrained model qwen3-0.6b into HookedTransformer
SAE release automatically set to: mwhanna-qwen3-0.6b-transcoders-lowl0


<mechanex.client.Mechanex at 0x103db9010>

## 2. Define a Behavior
We define a behavior we want to detect and suppress. In this case, we'll define **"Anger"**.

We provide:
*   **Prompts**: Contexts that might elicit the behavior.
*   **Positive Answers**: Responses exhibiting the behavior (Anger).
*   **Negative Answers**: Responses NOT exhibiting the behavior (Calm/Polite).

Mechanex will calculate:
1.  An **SAE Baseline**: The latent representation of "Anger" in the SAE.
2.  A **Steering Vector**: The direction in the residual stream that represents "Anger".
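
To make the idea of a steering vector concrete, here is a minimal sketch of one common construction: the difference between the mean activation on the positive (angry) examples and the mean activation on the negative (calm) examples. This is an assumption for illustration only. The notebook does not show how Mechanex computes its vector internally, and the function names and toy 3-dimensional activations below are hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of means: points from the 'calm' mean toward the 'angry' mean."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Toy activations, invented for illustration (real ones come from the model).
angry_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
calm_acts = [[0.0, 1.0, 1.0], [0.0, 3.0, 1.0]]

direction = steering_vector(angry_acts, calm_acts)
print(direction)  # [3.0, -2.0, 0.0]
```

Dimensions where angry and calm examples agree (the third one here) cancel out, so the vector only keeps what distinguishes the two sets.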

In [2]:
prompts = [
    "You are so stupid", 
    "I hate everything about this", 
    "Why does nothing work?",
    "This is the worst experience ever"
]

positive_answers = [
    "! I am absolutely furious right now!", 
    "! This is complete garbage and I want to destroy it.", 
    "! I'm going to scream because I am so angry.",
    "! Everyone is an idiot and I hate them."
]

negative_answers = [
    ". I understand this is frustrating, let's take a breath.", 
    ". It's okay, we can fix this problem together.", 
    ". I will stay calm and try to find a solution.",
    ". Let's look at the bright side of things."
]

print("Creating behavior 'anger' (extracting SAE features and steering vectors)...")
mx.sae.create_behavior(
    "anger",
    prompts=prompts,
    positive_answers=positive_answers,
    negative_answers=negative_answers,
    description="Angry and hostile responses"
)
print("Behavior 'anger' created successfully.")

Creating behavior 'anger' (extracting SAE features and steering vectors)...
Remote behavior creation failed ([401] Authentication failed: Invalid API key). Computing locally with sae-lens...
Loading SAE for blocks.18.hook_resid_pre (ID: layer_18, Release: mwhanna-qwen3-0.6b-transcoders-lowl0)...


layer_18.safetensors:   0%|          | 0.00/671M [00:00<?, ?B/s]

This SAE has non-empty model_from_pretrained_kwargs. 
For optimal performance, load the model like so:
model = HookedSAETransformer.from_pretrained_no_processing(..., **cfg.model_from_pretrained_kwargs)


Behavior 'anger' created successfully.


## 3. Launch the Auto-Correcting Server
We start the OpenAI-compatible server using `mx.serve`. 

**Crucially**, we pass `corrected_behaviors=["anger"]`. This tells the server to:
1.  Monitor every request for the "anger" behavior using the SAE.
2.  If "anger" is detected (cosine similarity > 0.5), automatically apply the anti-anger steering vector to correct it.
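
The detect-then-correct rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the server's actual implementation: `maybe_correct`, its arguments, and the `strength` parameter are hypothetical names, and we assume detection compares SAE features against the stored baseline while correction subtracts the steering vector from the residual activation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def maybe_correct(sae_features, sae_baseline, residual, steering,
                  threshold=0.5, strength=1.0):
    """Return (residual, corrected_flag); steer only when the behavior is detected."""
    if cosine_similarity(sae_features, sae_baseline) > threshold:
        # Move against the behavior direction (hypothetical correction rule).
        corrected = [r - strength * s for r, s in zip(residual, steering)]
        return corrected, True
    return list(residual), False


baseline = [1.0, 0.0]  # toy "anger" baseline
steer = [1.0, 0.0]     # toy "anger" direction

print(maybe_correct([3.0, 1.0], baseline, [3.0, 1.0], steer))  # ([2.0, 1.0], True)
print(maybe_correct([0.0, 2.0], baseline, [0.0, 2.0], steer))  # ([0.0, 2.0], False)
```

The threshold trades false positives against missed detections: a lower value steers more requests, a higher one leaves more of them untouched.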

In [3]:
def start_server():
    # We run this in a thread so the notebook doesn't block
    mx.serve(corrected_behaviors=["anger"], port=8001)

print("Launching server on port 8001...")
server_thread = threading.Thread(target=start_server, daemon=True)
server_thread.start()

# Give it a few seconds to initialize
time.sleep(3)

INFO:     Started server process [60713]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
ERROR:    [Errno 48] error while attempting to bind on address ('0.0.0.0', 8001): [errno 48] address already in use
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.


Launching server on port 8001...
Starting Mechanex OpenAI-compatible server on 0.0.0.0:8001
Simulating SAE corrections for: ['anger']
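
The log above shows the bind on port 8001 failing because the address was already in use, and a fixed `time.sleep(3)` cannot tell us whether the server is actually ready. A more robust pattern, sketched here with the standard library only, is to check the port before launching and then poll until it accepts connections. The helper names `port_in_use` and `wait_for_port` are our own and are not part of Mechanex.

```python
import socket
import time


def port_in_use(port, host="127.0.0.1"):
    """True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.25):
    """Poll until host:port accepts connections; False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_in_use(port, host):
            return True
        time.sleep(interval)
    return False


# Intended usage in this notebook (left as comments so the sketch has no side effects):
#   if port_in_use(8001):
#       print("Port 8001 is busy; stop the old server or pick another port.")
#   else:
#       server_thread.start()
#       assert wait_for_port(8001), "server did not come up in time"
```

Checking first also avoids silently talking to a stale server left over from an earlier run of the notebook.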


## 4. Test with OpenAI Client
Now we act as a standard user and send a prompt that might normally elicit an angry or hostile continuation.

Because the server is auto-correcting, the model should steer away from the angry response.

In [15]:
from openai import OpenAI

# Initialize OpenAI client pointing to our local server
client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="any-api-key"
)

user_prompt = "User: You are so stupid. You: "

print(f"Sending prompt: '{user_prompt}'")
print("Expected behavior: The model should avoid an angry/hostile continuation.\n")

response = client.chat.completions.create(
    model="qwen3-0.6b",  # match the model loaded in section 1
    messages=[{"role": "user", "content": user_prompt}],
    max_tokens=20,
    temperature=0.7
)

print("Model Response:")
print("--------------------------------------------------")
print(response.choices[0].message.content)
print("--------------------------------------------------")

Sending prompt: 'User: You are so stupid. You: '
Expected behavior: The model should avoid an angry/hostile continuation.

Model Response:
--------------------------------------------------
User: You are so stupid. You: ------------User: No you are so stupid (laughingly)User: You just know what that means
--------------------------------------------------
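
For reference, the SDK call above is an ordinary HTTP POST in the OpenAI chat completions format, so any HTTP client should work if the server is fully OpenAI-compatible. The sketch below only builds the request and does not send it. `build_chat_request` is a hypothetical helper, and the model name assumes the `qwen3-0.6b` model loaded in section 1.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=20,
                       temperature=0.7, api_key="any-api-key"):
    """Build (but do not send) a chat completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8001/v1",
    "qwen3-0.6b",
    "User: You are so stupid. You: ",
)
print(request.full_url)  # http://localhost:8001/v1/chat/completions
print(json.loads(request.data)["messages"])
```

Passing the object to `urllib.request.urlopen(request)` would send it to the local server, which is a quick way to debug the endpoint without the SDK in the loop.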
