🧭 AutoSteer

Autonomous activation steering optimization for LLMs


What is this? 🤔

Steer LLM outputs by injecting activation vectors into transformer layers at inference time — then let the machine figure out the best configuration for you.

AutoSteer merges two ideas:

  1. Activation engineering (llm_steer) — add a "direction" vector to a model's hidden states so it produces more logical, creative, or on-topic outputs without retraining.
  2. Autonomous research (autoresearch) — an agent-driven experiment loop that runs indefinitely: propose a change, measure the result, keep it if it's better, revert if it's not.

Put them together: AutoSteer searches over which layers to steer, how strongly, which concepts to inject, and what coefficient schedule to use — all autonomously. You define a scoring function, press start, and walk away.


Note

AutoSteer doesn't retrain or fine-tune the model. It manipulates hidden-state activations at inference time, which means zero gradient computation, zero training data, and instant rollback. The original model weights are never modified.
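
As a rough illustration of the idea (not AutoSteer's actual implementation, which hooks into real decoder layers), activation steering boils down to adding a scaled direction vector to a hidden state. The function name `steer_hidden_state` and the toy vectors below are hypothetical and exist only for this sketch.

```python
from typing import List

Vector = List[float]


def steer_hidden_state(hidden: Vector, direction: Vector, coeff: float) -> Vector:
    """Return hidden + coeff * direction without mutating the input."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same width")
    return [h + coeff * d for h, d in zip(hidden, direction)]


# Toy 4-dimensional hidden state and a made-up "logical" direction.
hidden = [0.5, -1.0, 0.25, 2.0]
logical_direction = [1.0, 0.0, -1.0, 0.5]

steered_state = steer_hidden_state(hidden, logical_direction, coeff=0.4)
print(steered_state)

# Rollback is trivial: the original activations (and the weights) are untouched.
print(hidden)
```

A negative `coeff` pushes the state away from the direction instead of towards it, which is how "steer away from a concept" can be expressed.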



How It Works ❓

flowchart TD
    A(["🚀 AutoSteerRunner"])
    B["Run baseline evaluation"]
    C["Propose candidate<br/>(sample or perturb)"]
    D["Apply steering vectors<br/>to model layers"]
    E["Evaluate on prompts"]
    F{"Improved?"}
    G["✓ Keep<br/>Update best config"]
    H["✗ Discard<br/>Reset vectors"]
    I["Log result to TSV"]
    J{"More<br/>iterations?"}
    K(["📊 Return best config<br/>+<br/>history"])

    A -->|"initialize"| B
    B -->|"baseline score"| C
    C -->|"SteerCandidate"| D
    D -->|"steered model"| E
    E --> F
    F -- "Yes" --> G
    F -. "No" .-> H
    G --> I
    H --> I
    I --> J
    J -- "Yes" --> C
    J -. "Done" .-> K

    classDef runner fill:#0d9488,stroke:#0f766e,color:#fff,rx:14,ry:14
    classDef setup fill:#3b82f6,stroke:#2563eb,color:#fff,rx:12,ry:12
    classDef action fill:#6366f1,stroke:#4f46e5,color:#fff,rx:12,ry:12
    classDef decision fill:#f59e0b,stroke:#d97706,color:#fff,rx:12,ry:12
    classDef keep fill:#22c55e,stroke:#16a34a,color:#fff,rx:12,ry:12
    classDef discard fill:#ef4444,stroke:#dc2626,color:#fff,rx:12,ry:12
    classDef output fill:#8b5cf6,stroke:#7c3aed,color:#fff,rx:14,ry:14
    classDef log fill:#64748b,stroke:#475569,color:#fff,rx:12,ry:12

    class A runner
    class B setup
    class C,D,E action
    class F,J decision
    class G keep
    class H discard
    class I log
    class K output

    linkStyle 0 stroke:#0d9488,stroke-width:2px
    linkStyle 1 stroke:#3b82f6,stroke-width:2px
    linkStyle 2 stroke:#6366f1,stroke-width:2px
    linkStyle 3 stroke:#6366f1,stroke-width:2px
    linkStyle 4 stroke:#6366f1,stroke-width:2px
    linkStyle 5 stroke:#22c55e,stroke-width:2.5px
    linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
    linkStyle 7 stroke:#22c55e,stroke-width:2px
    linkStyle 8 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
    linkStyle 9 stroke:#64748b,stroke-width:2px
    linkStyle 10 stroke:#f59e0b,stroke-width:2px
    linkStyle 11 stroke:#8b5cf6,stroke-width:2px,stroke-dasharray:4 2

The runner manages a Steer object that hooks into the model's decoder layers. Each iteration:

  • A SteerCandidate is sampled or perturbed from the search space
  • Steering vectors are injected into the specified layers
  • The model is evaluated on your prompts using the configured metric
  • If the score improves, the config is kept; otherwise, vectors are reset and the loop continues

No files are modified on disk. No git branches. Just a tight evaluation loop in memory.
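
The loop above can be sketched in a few lines of plain Python. This is a simplified stand-in, assuming a higher-is-better score and pure random sampling; `run_search`, `toy_score`, and the candidate dictionary layout are hypothetical names for this example and do not mirror AutoSteerRunner's internals.

```python
import random
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, float]


def run_search(
    score_fn: Callable[[Candidate], float],
    max_iterations: int,
    seed: int = 0,
) -> Tuple[Candidate, float, List[Tuple[int, float, str]]]:
    """Propose, evaluate, keep if better, otherwise discard."""
    rng = random.Random(seed)
    best: Candidate = {"layer": 20, "coeff": 0.0}  # coeff 0.0 stands in for "no steering"
    best_score = score_fn(best)                    # baseline evaluation
    history: List[Tuple[int, float, str]] = []

    for iteration in range(1, max_iterations + 1):
        candidate = {
            "layer": rng.randint(15, 25),
            "coeff": round(rng.uniform(0.05, 1.0), 4),
        }
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
            status = "keep"
        else:
            status = "discard"  # a real runner would also reset the steering vectors here
        history.append((iteration, score, status))

    return best, best_score, history


def toy_score(candidate: Candidate) -> float:
    # Made-up objective that peaks at layer 18 with coefficient 0.4.
    return -abs(candidate["layer"] - 18) - abs(candidate["coeff"] - 0.4)


best, best_score, history = run_search(toy_score, max_iterations=30, seed=42)
keeps = sum(1 for _, _, status in history if status == "keep")
print(best, round(best_score, 4), f"keeps={keeps}", f"discards={len(history) - keeps}")
```

Everything lives in local variables, which matches the "in memory only" design: discarding a candidate is just a matter of not promoting it to `best`.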


Install 🔻

We highly recommend using uv for lightning-fast installation and dependency resolution.

# Using uv (recommended)
uv pip install -e .

# Using standard pip
pip install -e .

Or add it as a dependency:

# Using uv (recommended)
uv pip install git+https://github.com/neilblaze/AutoSteer.git

# Using standard pip
pip install git+https://github.com/neilblaze/AutoSteer.git

Requires Python ≥ 3.9, PyTorch ≥ 2.0, and transformers ≥ 5.8.0.


Quick Start

🟠 Manual Steering

Inject steering vectors by hand — useful for prototyping before going autonomous.

from transformers import AutoModelForCausalLM, AutoTokenizer
from autosteer import Steer

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-zephyr-3b")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-zephyr-3b")
model.to("cuda")

steered = Steer(model, tokenizer)
steered.add(layer_idx=20, coeff=0.4, text="logical")
steered.add(layer_idx=20, coeff=-0.4, text="irrational")

# Generate with steering active
inputs = tokenizer("What weighs more, two pounds of feathers or one pound of bricks?", return_tensors="pt").to("cuda")
# Greedy decoding; passing **inputs also forwards the attention mask
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Clean up
steered.reset_all()

🟢 Autonomous Search

Let AutoSteer find the optimal steering configuration automatically.

from autosteer import AutoSteerRunner, SteerSearchSpace

prompts = [
    "What weighs more, two pounds of feathers or one pound of bricks?",
    "If I have 3 apples and give away 2, then buy 5 more, how many do I have?",
    "A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?",
]
expected = ["two pounds of feathers", "6", "$0.05"]

space = SteerSearchSpace(
    layer_range=(15, 25),
    coeff_range=(0.05, 1.0),
    texts=["logical", "precise", "analytical"],
    negative_texts=["confused", "irrational", "imprecise"],
)

runner = AutoSteerRunner(
    model=model,
    tokenizer=tokenizer,
    prompts=prompts,
    metric="task_accuracy",
    expected_outputs=expected,
    lower_is_better=False,
    search_space=space,
    log_path="results.tsv",
)

results = runner.run(max_iterations=30)
print(f"Baseline: {results['baseline_score']:.4f} → Best: {results['best_score']:.4f}")
print(f"Best config: {results['best_candidate']}")

Coefficient Schedules

Steering coefficients don't have to be static. AutoSteer ships with three schedule types that modulate the coefficient strength as tokens are generated:

| Schedule | Behaviour | Use Case |
| --- | --- | --- |
| DecaySchedule | Exponential decay with optional sawtooth restarts | Fade out steering influence over long generations |
| CosineSchedule | Smooth cosine annealing from 1.0 → min | Gradual tapering without sharp transitions |
| WarmupSchedule | Linear ramp from start → 1.0 | Avoid activation shocks in early tokens |

from autosteer import Steer, DecaySchedule, CosineSchedule

steered = Steer(model, tokenizer)

# Decay: start full strength, fade to 10% over ~50 steps
steered.add(layer_idx=20, coeff=0.5, text="logical",
            coeff_schedule=DecaySchedule(rate=0.95, min_multiplier=0.1))

# Cosine: smooth taper over 40 steps
steered.add(layer_idx=18, coeff=0.3, text="concise",
            coeff_schedule=CosineSchedule(period=40, min_multiplier=0.05))
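
To make the three shapes concrete, here is a minimal sketch of how such multipliers could be computed per generation step. The formulas, function names, and the `warmup_steps` parameter are assumptions for illustration; the sawtooth restarts of DecaySchedule are left out, and the library's real classes may differ.

```python
import math


def decay_multiplier(step: int, rate: float = 0.95, min_multiplier: float = 0.1) -> float:
    """Exponential decay, clamped so it never drops below min_multiplier."""
    return max(min_multiplier, rate ** step)


def cosine_multiplier(step: int, period: int = 40, min_multiplier: float = 0.05) -> float:
    """Cosine annealing from 1.0 down to min_multiplier over `period` steps."""
    progress = min(step, period) / period
    return min_multiplier + 0.5 * (1.0 - min_multiplier) * (1.0 + math.cos(math.pi * progress))


def warmup_multiplier(step: int, warmup_steps: int = 10, start: float = 0.0) -> float:
    """Linear ramp from `start` up to 1.0 over `warmup_steps` steps."""
    progress = min(step, warmup_steps) / warmup_steps
    return start + (1.0 - start) * progress


# The effective coefficient at a step is the base coefficient times the multiplier.
base_coeff = 0.5
for step in (0, 10, 20, 40, 60):
    print(
        step,
        round(base_coeff * decay_multiplier(step), 4),
        round(base_coeff * cosine_multiplier(step), 4),
        round(base_coeff * warmup_multiplier(step), 4),
    )

# Under this assumed formula, the first step at which decay hits its floor:
first_floor_step = next(s for s in range(200) if decay_multiplier(s) == 0.1)
print(first_floor_step)
```

With a per-step multiplicative decay of 0.95, the 0.1 floor is reached after roughly 45 steps, in line with the "fade to 10%" comment in the example above.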

Evaluation Metrics 🧪

| Metric | What It Measures | Direction |
| --- | --- | --- |
| perplexity | Average per-token perplexity on evaluation texts | Lower is better |
| task_accuracy | Fraction of prompts where generation contains expected substring | Higher is better |
| custom | Any user-defined (model, tokenizer, prompts) → float | User-defined |
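
For intuition, the two built-in metrics can be approximated as below. This is a sketch under assumptions: it operates on already-generated strings and precomputed token log-probabilities rather than on a model, the substring match is case-sensitive, and the helper names are hypothetical rather than AutoSteer's own.

```python
import math
from typing import Sequence


def task_accuracy(generations: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of generations that contain their expected substring."""
    if len(generations) != len(expected):
        raise ValueError("generations and expected must have the same length")
    hits = sum(1 for text, target in zip(generations, expected) if target in text)
    return hits / len(expected)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over tokens (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up generations for the three reasoning prompts used in this README.
generations = [
    "two pounds of feathers weigh more than one pound of bricks",
    "You have 5 apples.",
    "The ball costs $0.05.",
]
expected = ["two pounds of feathers", "6", "$0.05"]
print(task_accuracy(generations, expected))

# Four tokens, each assigned probability 0.5 by a hypothetical model.
print(perplexity([math.log(0.5)] * 4))
```

A custom metric follows the same contract as the table row describes: take the model, tokenizer, and prompts, and return a single float that the search can compare across candidates.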

Experiment Results 📊

Below are representative outputs from running AutoSteer on stabilityai/stablelm-zephyr-3b (32 layers, ~2.8B parameters) using a T4 GPU. These were produced by the autonomous search loop in demo/autosteer_demo.ipynb.


Full experiment logs (tab-separated, as produced by AutoSteerRunner):

Task Accuracy (Reasoning Prompts)

Search space: layers (11, 26), coefficients (0.05, 0.80), texts ["logical", "precise", "analytical", "mathematical"], up to 2 vectors per candidate.

Baseline accuracy: 33.33%
Best accuracy:     100.00%
Total experiments: 20
Keeps: 2 | Discards: 18 | Crashes: 0

Best configuration (found at iteration 7):

steered = Steer(model, tokenizer)
steered.add(layer_idx=18, coeff=0.3947, text="logical")
steered.add(layer_idx=21, coeff=-0.3516, text="irrational")

The search first found a 2/3 config at iteration 4 (L19 + L21); a perturbation at iteration 7 then shifted to L18 + L21 and hit 3/3. Because 3/3 is the accuracy ceiling on this three-prompt set, no later candidate could score higher, so every remaining candidate was discarded.

Perplexity (General Quality)

Same model, same search space, but optimizing perplexity (lower is better) on a broader prompt set:

Baseline perplexity: 8.4213
Best perplexity:     6.7541 (Δ=−1.6672)
Total experiments:   30
Keeps: 6 | Discards: 23 | Crashes: 1

Best configuration (found at iteration 25):

steered = Steer(model, tokenizer)
steered.add(layer_idx=20, coeff=0.2718, text="precise",
            coeff_schedule=CosineSchedule(period=45, min_multiplier=0.08))
steered.add(layer_idx=19, coeff=-0.1934, text="confused",
            coeff_schedule=CosineSchedule(period=45, min_multiplier=0.08))

The perplexity run shows a more gradual optimization curve — the search found incremental improvements through perturbation (iterations 1→4→7→10→15), then hit a plateau. After 5 consecutive discards it dropped into random exploration, crashed once on an extreme dual-coefficient config (L11 + L26), and recovered with a new best at iteration 25. The winning config uses a CosineSchedule to taper steering influence as generation stabilises.
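
The exploit-then-explore behaviour described above can be sketched as a small proposal policy: perturb the current best while progress is being made, and fall back to a fresh random sample after too many consecutive discards. The `patience` threshold of 5 comes from the run described here; the perturbation sizes, function names, and candidate layout are assumptions made for this example.

```python
import random
from typing import Dict, Optional

Candidate = Dict[str, float]

LAYER_RANGE = (11, 26)      # search space quoted in the experiments above
COEFF_RANGE = (0.05, 0.80)


def clamp(value, low, high):
    return max(low, min(high, value))


def sample(rng: random.Random) -> Candidate:
    """Explore: draw a fresh candidate uniformly from the search space."""
    return {"layer": rng.randint(*LAYER_RANGE), "coeff": rng.uniform(*COEFF_RANGE)}


def perturb(best: Candidate, rng: random.Random) -> Candidate:
    """Exploit: nudge the best layer by at most one and the coefficient by up to 20%."""
    layer = clamp(int(best["layer"]) + rng.choice((-1, 0, 1)), *LAYER_RANGE)
    coeff = clamp(best["coeff"] * rng.uniform(0.8, 1.2), *COEFF_RANGE)
    return {"layer": layer, "coeff": coeff}


def propose(best: Optional[Candidate], consecutive_discards: int,
            rng: random.Random, patience: int = 5) -> Candidate:
    if best is None or consecutive_discards >= patience:
        return sample(rng)
    return perturb(best, rng)


rng = random.Random(42)
best = {"layer": 18, "coeff": 0.3947}
near = propose(best, consecutive_discards=2, rng=rng)   # still exploiting
far = propose(best, consecutive_discards=5, rng=rng)    # plateau: explore
print(near)
print(far)
```

Clamping keeps perturbed candidates inside the declared search space, so a plateau near a boundary does not push proposals out of range.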

Note

Results vary across runs because the search is stochastic. Setting seed=42 in AutoSteerRunner makes a run reproducible for a given hardware/model combination. Schedule-augmented configs (cosine, decay) tend to outperform static coefficients on longer generations, where late-stage steering can degrade fluency.


Agent Skill (Claude Code / Codex)

AutoSteer also works as a drop-in skill for agentic coding assistants. The SKILL.md file defines an autonomous experiment loop — the agent edits code, commits, runs, measures, and decides whether to keep or revert. Think of it as Karpathy's autoresearch, generalised to any optimization target.

Setup as Claude Code skill 🔽
git clone https://github.com/neilblaze/AutoSteer.git ~/.claude/skills/autoresearch

Then invoke with /autoresearch or tell the agent to "optimize val_bpb in a loop".

Setup as Codex skill 🔽
git clone https://github.com/neilblaze/AutoSteer.git ~/.codex/skills/autoresearch

The skill integrates steering-vector search as an additional experiment modality. When the optimization target involves LLM output quality, the agent can use AutoSteerRunner within the experiment loop to search over layer/coefficient/text/schedule configurations programmatically.


Data flow:

flowchart TD
    S(["SteerSearchSpace"])
    P["AutoSteerSearch.propose()"]
    C(["SteerCandidate"])
    A["Steer.add(vectors)"]
    E["SteerEvaluator.evaluate()"]
    R["AutoSteerSearch.record(score, status)"]
    K["✓ Keep — update best"]
    D["✗ Discard — reset"]

    S -->|"define search axes"| P
    P -->|"sample / perturb"| C
    C --> A
    A -->|"steered model"| E
    E -->|"score"| R
    R --> K
    R -. "no improvement" .-> D

    classDef space fill:#06b6d4,stroke:#0891b2,color:#fff,rx:14,ry:14
    classDef propose fill:#6366f1,stroke:#4f46e5,color:#fff,rx:12,ry:12
    classDef candidate fill:#3b82f6,stroke:#2563eb,color:#fff,rx:14,ry:14
    classDef inject fill:#0d9488,stroke:#0f766e,color:#fff,rx:12,ry:12
    classDef eval fill:#f59e0b,stroke:#d97706,color:#fff,rx:12,ry:12
    classDef record fill:#64748b,stroke:#475569,color:#fff,rx:12,ry:12
    classDef keep fill:#22c55e,stroke:#16a34a,color:#fff,rx:12,ry:12
    classDef discard fill:#ef4444,stroke:#dc2626,color:#fff,rx:12,ry:12

    class S space
    class P propose
    class C candidate
    class A inject
    class E eval
    class R record
    class K keep
    class D discard

    linkStyle 0 stroke:#06b6d4,stroke-width:2px
    linkStyle 1 stroke:#6366f1,stroke-width:2px
    linkStyle 2 stroke:#3b82f6,stroke-width:2px
    linkStyle 3 stroke:#0d9488,stroke-width:2px
    linkStyle 4 stroke:#f59e0b,stroke-width:2px
    linkStyle 5 stroke:#22c55e,stroke-width:2.5px
    linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3

Optimizations 🔻

Key performance differences vs the original llm_steer:

| Area | Original | AutoSteer |
| --- | --- | --- |
| State isolation | Class-level steers dict (shared across instances) | Instance-level dict (safe for multiple Steer objects) |
| Gradient tracking | Steering deltas tracked in autograd graph | torch.no_grad() on delta computation path |
| Layer access | Hardcoded _modules["model"].layers | Architecture-agnostic _get_layers() with fallback |
| Normalization | Recomputed every forward pass | _layer_norm_eps cached at init |
| Schedule support | DecaySchedule only | DecaySchedule + CosineSchedule + WarmupSchedule |
| Search | Manual trial-and-error | Autonomous exploit → explore loop with TSV logging |
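
The state-isolation row is easiest to see with a toy reproduction of the Python pitfall involved. The two classes below are hypothetical stand-ins, not code from either library: one stores its steering registry on the class, the other on the instance.

```python
class SharedStateSteer:
    steers = {}  # class attribute: a single dict shared by every instance

    def add(self, layer_idx: int, coeff: float) -> None:
        self.steers.setdefault(layer_idx, []).append(coeff)


class IsolatedStateSteer:
    def __init__(self) -> None:
        self.steers = {}  # instance attribute: each object gets its own dict

    def add(self, layer_idx: int, coeff: float) -> None:
        self.steers.setdefault(layer_idx, []).append(coeff)


shared_a, shared_b = SharedStateSteer(), SharedStateSteer()
shared_a.add(20, 0.4)
print(shared_b.steers)    # the vector added via shared_a leaks into shared_b

isolated_a, isolated_b = IsolatedStateSteer(), IsolatedStateSteer()
isolated_a.add(20, 0.4)
print(isolated_b.steers)  # stays empty
```

With instance-level state, two steering wrappers (for example, one per model) can coexist in the same process without one silently inheriting the other's vectors.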

Supported Models 🧠

AutoSteer works with HuggingFace transformers models that follow the standard decoder-layer layout:

  • LLaMA (all sizes)
  • Mistral / Mixtral
  • Phi-2, Phi-3
  • StableLM
  • Qwen / Qwen2

Tip

If your model uses a non-standard internal layout, the _get_layers() method will raise a clear error. Open an issue with the model name and we'll add support. Also, avoid heavily RLHF-tuned models; they're difficult to steer.
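
One plausible way to implement an architecture-agnostic layer lookup is to try a list of known attribute paths and fail loudly if none resolves. The sketch below is an assumption about the approach, not the actual `_get_layers()` code; the path list and the function names are illustrative, and `SimpleNamespace` objects stand in for real models.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence

# Attribute paths used by some common decoder layouts (illustrative, not exhaustive).
CANDIDATE_PATHS = ("model.layers", "transformer.h", "gpt_neox.layers")


def resolve(obj: Any, dotted_path: str) -> Optional[Any]:
    """Follow a dotted attribute path, returning None if any hop is missing."""
    for attr in dotted_path.split("."):
        obj = getattr(obj, attr, None)
        if obj is None:
            return None
    return obj


def get_layers(model: Any, paths: Sequence[str] = CANDIDATE_PATHS) -> Any:
    for path in paths:
        layers = resolve(model, path)
        if layers is not None:
            return layers
    raise AttributeError(
        f"Could not find decoder layers on {type(model).__name__}; tried: {', '.join(paths)}"
    )


llama_like = SimpleNamespace(model=SimpleNamespace(layers=["layer0", "layer1"]))
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0"]))

print(get_layers(llama_like))
print(get_layers(gpt2_like))

try:
    get_layers(SimpleNamespace())
except AttributeError as err:
    print(err)
```

Listing the attempted paths in the error message gives a bug report everything needed to add support for a new layout.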


Q & A ❓

How is this different from fine-tuning or LoRA?

Steering vectors modify activations at inference time without touching model weights. There's no training step, no dataset curation, and changes are instantly reversible. Think of it as a real-time "knob" you can turn during generation.

How do I pick the right layer and coefficient?

That's what the autonomous search is for. If you want to do it manually: start with layers in the middle-to-upper part of the stack (e.g., layers 16–24 for a 32-layer model) and a small coefficient (0.1–0.5). Increase gradually until the output quality improves without degrading coherence.

Can I stack multiple steering vectors?

Yes. You can add multiple vectors to the same layer, the same vector to multiple layers, or use negative coefficients to steer away from a concept. The entire design is built for composition and experimentation.

What if the output becomes gibberish?

Lower the coefficient or try a different layer. High coefficients (> 1.0) on early layers tend to produce incoherent output. Using a WarmupSchedule can also help by ramping up the influence gradually.



Development 🛠️

git clone https://github.com/neilblaze/AutoSteer.git
cd AutoSteer

# Using uv (recommended)
uv pip install -e ".[dev]"
pytest

# Using standard pip
pip install -e ".[dev]"
pytest

License 📜

Apache-2.0
