Steer LLM outputs by injecting activation vectors into transformer layers at inference time — then let the machine figure out the best configuration for you.
AutoSteer merges two ideas:
- Activation engineering (llm_steer) — add a "direction" vector to a model's hidden states so it produces more logical, creative, or on-topic outputs without retraining.
- Autonomous research (autoresearch) — an agent-driven experiment loop that runs indefinitely: propose a change, measure the result, keep it if it's better, revert if it's not.
Put them together: AutoSteer searches over which layers to steer, how strongly, which concepts to inject, and what coefficient schedule to use — all autonomously. You define a scoring function, press start, and walk away.
Note
AutoSteer doesn't retrain or fine-tune the model. It manipulates hidden-state activations at inference time, which means zero gradient computation, zero training data, and instant rollback. The original model weights are never modified.
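The pure-Python toy below illustrates the idea in the note above. It is not AutoSteer's implementation. ToyLayer, ToySteer and the hand-written two-element vectors are hypothetical stand-ins; in AutoSteer the vector is derived from a concept text such as "logical". The sketch shows three things: steering only adds coeff × vector to a layer's output, several steers stack additively, and reset_all() restores the original behaviour without touching any weights.

```python
from typing import Dict, List, Tuple

Vector = List[float]


class ToyLayer:
    """Stand-in for a decoder layer: a fixed elementwise scaling (its 'weights')."""

    def __init__(self, weights: Vector) -> None:
        self.weights = weights

    def __call__(self, hidden: Vector) -> Vector:
        return [w * h for w, h in zip(self.weights, hidden)]


class ToySteer:
    """Hypothetical mini steering wrapper: per-layer lists of (coeff, vector)."""

    def __init__(self, layers: List[ToyLayer]) -> None:
        self.layers = layers
        self.steers: Dict[int, List[Tuple[float, Vector]]] = {}

    def add(self, layer_idx: int, coeff: float, vector: Vector) -> None:
        self.steers.setdefault(layer_idx, []).append((coeff, vector))

    def reset_all(self) -> None:
        self.steers.clear()  # only the steering state is dropped; weights are untouched

    def forward(self, hidden: Vector) -> Vector:
        for idx, layer in enumerate(self.layers):
            hidden = layer(hidden)  # unmodified layer computation
            for coeff, vec in self.steers.get(idx, []):
                # activation addition on the layer output: h <- h + coeff * v
                hidden = [h + coeff * v for h, v in zip(hidden, vec)]
        return hidden


layers = [ToyLayer([1.0, 1.0]), ToyLayer([2.0, 0.5])]
steer = ToySteer(layers)
x = [1.0, 2.0]

baseline = steer.forward(x)
steer.add(layer_idx=1, coeff=0.4, vector=[1.0, 0.0])   # push toward concept A
steer.add(layer_idx=1, coeff=-0.4, vector=[0.0, 1.0])  # push away from concept B
steered = steer.forward(x)
steer.reset_all()
restored = steer.forward(x)
print(baseline, steered, restored)
```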
```mermaid
flowchart TD
A(["🚀 AutoSteerRunner"])
B["Run baseline evaluation"]
C["Propose candidate<br/>(sample or perturb)"]
D["Apply steering vectors<br/>to model layers"]
E["Evaluate on prompts"]
F{"Improved?"}
G["✓ Keep<br/>Update best config"]
H["✗ Discard<br/>Reset vectors"]
I["Log result to TSV"]
J{"More<br/>iterations?"}
K(["📊 Return best config<br/>+<br/>history"])
A -->|"initialize"| B
B -->|"baseline score"| C
C -->|"SteerCandidate"| D
D -->|"steered model"| E
E --> F
F -- "Yes" --> G
F -. "No" .-> H
G --> I
H --> I
I --> J
J -- "Yes" --> C
J -. "Done" .-> K
classDef runner fill:#0d9488,stroke:#0f766e,color:#fff,rx:14,ry:14
classDef setup fill:#3b82f6,stroke:#2563eb,color:#fff,rx:12,ry:12
classDef action fill:#6366f1,stroke:#4f46e5,color:#fff,rx:12,ry:12
classDef decision fill:#f59e0b,stroke:#d97706,color:#fff,rx:12,ry:12
classDef keep fill:#22c55e,stroke:#16a34a,color:#fff,rx:12,ry:12
classDef discard fill:#ef4444,stroke:#dc2626,color:#fff,rx:12,ry:12
classDef output fill:#8b5cf6,stroke:#7c3aed,color:#fff,rx:14,ry:14
classDef log fill:#64748b,stroke:#475569,color:#fff,rx:12,ry:12
class A runner
class B setup
class C,D,E action
class F,J decision
class G keep
class H discard
class I log
class K output
linkStyle 0 stroke:#0d9488,stroke-width:2px
linkStyle 1 stroke:#3b82f6,stroke-width:2px
linkStyle 2 stroke:#6366f1,stroke-width:2px
linkStyle 3 stroke:#6366f1,stroke-width:2px
linkStyle 4 stroke:#6366f1,stroke-width:2px
linkStyle 5 stroke:#22c55e,stroke-width:2.5px
linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
linkStyle 7 stroke:#22c55e,stroke-width:2px
linkStyle 8 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
linkStyle 9 stroke:#64748b,stroke-width:2px
linkStyle 10 stroke:#f59e0b,stroke-width:2px
linkStyle 11 stroke:#8b5cf6,stroke-width:2px,stroke-dasharray:4 2
```
The runner manages a Steer object that hooks into the model's decoder layers. Each iteration:
- A SteerCandidate is sampled or perturbed from the search space
- Steering vectors are injected into the specified layers
- The model is evaluated on your prompts using the configured metric
- If the score improves, the config is kept; otherwise, the vectors are reset and the loop continues
Apart from the TSV results log, nothing is written to disk: no files are modified and no git branches are created. It's just a tight evaluation loop in memory.
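The loop described above is essentially greedy hill-climbing over steering configurations. The pure-Python sketch below reproduces that control flow under stated assumptions. A candidate is a hypothetical (layer_idx, coeff) pair, toy_score stands in for a real evaluation on prompts, and the TSV column names are made up. It illustrates the keep/discard/crash bookkeeping and is not AutoSteerRunner's code.

```python
import csv
import io
import random
from typing import Callable, Dict, Optional, Tuple

Candidate = Tuple[int, float]  # hypothetical candidate: (layer_idx, coeff)


def run_search(
    score_fn: Callable[[Optional[Candidate]], float],
    layer_range: Tuple[int, int],
    coeff_range: Tuple[float, float],
    max_iterations: int,
    seed: int = 0,
) -> Dict[str, object]:
    """Greedy keep/discard search; higher scores are better in this sketch."""
    rng = random.Random(seed)
    log = io.StringIO()
    writer = csv.writer(log, delimiter="\t")
    writer.writerow(["iteration", "layer", "coeff", "score", "status"])  # assumed columns

    baseline = score_fn(None)  # None means "no steering applied"
    best_score: float = baseline
    best: Optional[Candidate] = None
    for it in range(1, max_iterations + 1):
        if best is None or rng.random() < 0.3:
            # explore: sample a fresh candidate uniformly from the search space
            cand = (rng.randint(*layer_range), rng.uniform(*coeff_range))
        else:
            # exploit: perturb the current best, clamped to the search space
            layer = min(max(best[0] + rng.choice((-1, 0, 1)), layer_range[0]), layer_range[1])
            coeff = min(max(best[1] + rng.gauss(0.0, 0.05), coeff_range[0]), coeff_range[1])
            cand = (layer, coeff)
        try:
            score = score_fn(cand)  # a real runner would inject vectors, evaluate, then reset
            status = "keep" if score > best_score else "discard"
        except Exception:
            score, status = float("nan"), "crash"
        if status == "keep":
            best_score, best = score, cand
        writer.writerow([it, cand[0], f"{cand[1]:.4f}", f"{score:.4f}", status])
    return {"baseline_score": baseline, "best_score": best_score,
            "best_candidate": best, "tsv": log.getvalue()}


def toy_score(cand: Optional[Candidate]) -> float:
    """Hypothetical objective that peaks at layer 18, coeff 0.4."""
    if cand is None:
        return 0.2
    layer, coeff = cand
    return 1.0 - 0.05 * abs(layer - 18) - abs(coeff - 0.4)


results = run_search(toy_score, layer_range=(15, 25), coeff_range=(0.05, 1.0),
                     max_iterations=30, seed=42)
print(f"Baseline: {results['baseline_score']:.4f} -> Best: {results['best_score']:.4f}")
print("Best candidate:", results["best_candidate"])
```

Because a candidate is kept only on strict improvement, the best score can never fall below the baseline, and a crash simply becomes one more logged row.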
We highly recommend using uv for lightning-fast installation and dependency resolution.
```bash
# Using uv (recommended)
uv pip install -e .

# Using standard pip
pip install -e .
```

Or add it as a dependency:

```bash
# Using uv (recommended)
uv pip install git+https://github.com/neilblaze/AutoSteer.git

# Using standard pip
pip install git+https://github.com/neilblaze/AutoSteer.git
```

Requires Python ≥ 3.9, PyTorch ≥ 2.0, and transformers ≥ 5.8.0.
Inject steering vectors by hand — useful for prototyping before going autonomous.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from autosteer import Steer

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-zephyr-3b")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-zephyr-3b")
model.to("cuda")

steered = Steer(model, tokenizer)
steered.add(layer_idx=20, coeff=0.4, text="logical")      # steer toward "logical"
steered.add(layer_idx=20, coeff=-0.4, text="irrational")  # steer away from "irrational"

# Generate with steering active (greedy decoding for deterministic output)
inputs = tokenizer("What weighs more, two pounds of feathers or one pound of bricks?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Clean up: remove all steering vectors
steered.reset_all()
```

Let AutoSteer find the optimal steering configuration automatically.
```python
from autosteer import AutoSteerRunner, SteerSearchSpace

# model and tokenizer as loaded in the manual-steering example
prompts = [
    "What weighs more, two pounds of feathers or one pound of bricks?",
    "If I have 3 apples and give away 2, then buy 5 more, how many do I have?",
    "A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?",
]
expected = ["two pounds of feathers", "6", "$0.05"]

space = SteerSearchSpace(
    layer_range=(15, 25),
    coeff_range=(0.05, 1.0),
    texts=["logical", "precise", "analytical"],
    negative_texts=["confused", "irrational", "imprecise"],
)

runner = AutoSteerRunner(
    model=model,
    tokenizer=tokenizer,
    prompts=prompts,
    metric="task_accuracy",
    expected_outputs=expected,
    lower_is_better=False,
    search_space=space,
    log_path="results.tsv",
)

results = runner.run(max_iterations=30)
print(f"Baseline: {results['baseline_score']:.4f} → Best: {results['best_score']:.4f}")
print(f"Best config: {results['best_candidate']}")
```

Steering coefficients don't have to be static. AutoSteer ships with three schedule types that modulate the coefficient strength as tokens are generated:
| Schedule | Behaviour | Use Case |
|---|---|---|
| DecaySchedule | Exponential decay with optional sawtooth restarts | Fade out steering influence over long generations |
| CosineSchedule | Smooth cosine annealing from 1.0 → min | Gradual tapering without sharp transitions |
| WarmupSchedule | Linear ramp from start → 1.0 | Avoid activation shocks in early tokens |
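The exact formulas behind these classes aren't spelled out here, so the functions below are hypothetical reconstructions that only follow the table's descriptions. Each one returns the multiplier applied to the base coeff at generation step t. The parameter names rate, min_multiplier and period mirror the examples that follow. The names restart_every, start and warmup_steps are assumptions.

```python
import math
from typing import Optional


def decay_multiplier(t: int, rate: float = 0.95, min_multiplier: float = 0.1,
                     restart_every: Optional[int] = None) -> float:
    """Exponential decay rate**t, floored at min_multiplier; optional sawtooth restarts."""
    if restart_every:
        t = t % restart_every  # jump back to full strength every restart_every steps
    return max(min_multiplier, rate ** t)


def cosine_multiplier(t: int, period: int = 40, min_multiplier: float = 0.05) -> float:
    """Cosine annealing from 1.0 at t=0 down to min_multiplier at t=period, then held."""
    progress = min(t, period) / period
    return min_multiplier + (1.0 - min_multiplier) * 0.5 * (1.0 + math.cos(math.pi * progress))


def warmup_multiplier(t: int, start: float = 0.1, warmup_steps: int = 10) -> float:
    """Linear ramp from start to 1.0 over warmup_steps, then constant."""
    return start + (1.0 - start) * min(t, warmup_steps) / warmup_steps


base_coeff = 0.5
for t in (0, 20, 44, 45, 60):
    print(t,
          round(base_coeff * decay_multiplier(t), 4),
          round(base_coeff * cosine_multiplier(t), 4),
          round(base_coeff * warmup_multiplier(t), 4))
```

Under this reading, DecaySchedule(rate=0.95, min_multiplier=0.1) reaches its 10% floor at step 45, because 0.95**44 ≈ 0.105 and 0.95**45 ≈ 0.099.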
```python
from autosteer import Steer, DecaySchedule, CosineSchedule

steered = Steer(model, tokenizer)

# Decay: start full strength, fade to 10% over ~50 steps
steered.add(layer_idx=20, coeff=0.5, text="logical",
            coeff_schedule=DecaySchedule(rate=0.95, min_multiplier=0.1))

# Cosine: smooth taper over 40 steps
steered.add(layer_idx=18, coeff=0.3, text="concise",
            coeff_schedule=CosineSchedule(period=40, min_multiplier=0.05))
```

| Metric | What It Measures | Direction |
|---|---|---|
| perplexity | Average per-token perplexity on evaluation texts | Lower is better |
| task_accuracy | Fraction of prompts where generation contains expected substring | Higher is better |
| custom | Any user-defined (model, tokenizer, prompts) → float | User-defined |
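For intuition, here is how the two built-in metrics are conventionally computed. Both are written as standalone hypothetical helpers rather than AutoSteer's own code. The first is substring accuracy over generated texts. The second is perplexity, computed as the exponential of the mean per-token negative log-likelihood. A custom metric would wrap logic like this behind the (model, tokenizer, prompts) → float signature. Obtaining the generations and log-probabilities from the model is deliberately left out.

```python
import math
from typing import Sequence


def substring_accuracy(generations: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of generations that contain their expected substring (higher is better)."""
    if len(generations) != len(expected):
        raise ValueError("need one expected string per generation")
    hits = sum(target in gen for gen, target in zip(generations, expected))
    return hits / len(generations)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood) over one text's tokens (lower is better)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


gens = [
    "I'd say two pounds of feathers weigh more.",
    "You end up with 6 apples.",
    "The ball costs $0.10.",
]
expected = ["two pounds of feathers", "6", "$0.05"]
print(substring_accuracy(gens, expected))     # two of the three targets are found
print(perplexity([math.log(0.5)] * 4))       # every token at p=0.5 gives perplexity 2
```

Plain substring matching makes short targets like "6" loose: any generation that contains that digit anywhere, such as "16", counts as a hit.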
Below are representative outputs from running AutoSteer on stabilityai/stablelm-zephyr-3b (32 layers, ~2.8B parameters) using a T4 GPU. These were produced by the autonomous search loop in demo/autosteer_demo.ipynb.
Full experiment logs (tab-separated, as produced by AutoSteerRunner):
- demo/results/task_accuracy_stablelm3b.tsv: 20 iterations, task_accuracy metric
- demo/results/perplexity_stablelm3b.tsv: 30 iterations, perplexity metric
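The column layout of these files isn't documented here. The snippet below assumes hypothetical iteration, score and status columns, with status being keep, discard or crash. It shows how summary lines like the ones that follow could be recomputed from such a log using only the standard library.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def summarize_log(handle: TextIO, lower_is_better: bool = False) -> Dict[str, object]:
    """Count statuses and find the best kept row in a TSV log (assumed columns)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    counts = Counter(row["status"] for row in rows)
    kept = [row for row in rows if row["status"] == "keep"]
    pick = min if lower_is_better else max
    best = pick(kept, key=lambda r: float(r["score"])) if kept else None
    return {
        "total": len(rows),
        "keeps": counts["keep"],
        "discards": counts["discard"],
        "crashes": counts["crash"],
        "best_iteration": int(best["iteration"]) if best else None,
        "best_score": float(best["score"]) if best else None,
    }


# Toy log in the assumed format; for a real file use open(path, newline="").
toy_log = io.StringIO(
    "iteration\tscore\tstatus\n"
    "1\t0.3333\tdiscard\n"
    "2\t0.6667\tkeep\n"
    "3\tnan\tcrash\n"
    "4\t1.0000\tkeep\n"
)
print(summarize_log(toy_log))
```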
Search space: layers (11, 26), coefficients (0.05, 0.80), texts ["logical", "precise", "analytical", "mathematical"], up to 2 vectors per candidate.
```text
Baseline accuracy: 33.33%
Best accuracy: 100.00%
Total experiments: 20
Keeps: 2 | Discards: 18 | Crashes: 0
```
Best configuration (found at iteration 7):
```python
from autosteer import Steer

steered = Steer(model, tokenizer)
steered.add(layer_idx=18, coeff=0.3947, text="logical")
steered.add(layer_idx=21, coeff=-0.3516, text="irrational")
```

The search first found a 2/3 config at iteration 4 (layers 19 and 21). A perturbation at iteration 7 then shifted it to layers 18 and 21 and reached 3/3. Because 3/3 is the maximum this metric can report, no later candidate could improve on it, so every remaining candidate was discarded.
Same model, same search space, but optimizing perplexity (lower is better) on a broader prompt set:
```text
Baseline perplexity: 8.4213
Best perplexity: 6.7541 (Δ=−1.6672)
Total experiments: 30
Keeps: 6 | Discards: 23 | Crashes: 1
```
Best configuration (found at iteration 25):
```python
from autosteer import Steer, CosineSchedule

steered = Steer(model, tokenizer)
steered.add(layer_idx=20, coeff=0.2718, text="precise",
            coeff_schedule=CosineSchedule(period=45, min_multiplier=0.08))
steered.add(layer_idx=19, coeff=-0.1934, text="confused",
            coeff_schedule=CosineSchedule(period=45, min_multiplier=0.08))
```

The perplexity run shows a more gradual optimization curve. Perturbation found incremental improvements at iterations 1, 4, 7, 10 and 15, and then the search hit a plateau. After 5 consecutive discards it switched to random exploration. It crashed once on an extreme two-vector config at layers 11 and 26, then recovered with a new best at iteration 25. The winning config uses a CosineSchedule to taper steering strength over later tokens.
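The behaviour described above can be sketched as a small proposal policy: perturb around the current best, then fall back to random sampling after a run of discards. The patience of 5 and the ranges match this run. Several details are assumptions for illustration and do not come from AutoSteerSearch's actual implementation: the simplified Candidate class, the function name propose, and the Gaussian perturbation width.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Candidate:  # hypothetical, simplified stand-in for SteerCandidate
    layer_idx: int
    coeff: float


def propose(rng: random.Random, best: Optional[Candidate], consecutive_discards: int,
            layer_range=(11, 26), coeff_range=(0.05, 0.80), patience: int = 5) -> Candidate:
    """Exploit (perturb best) until `patience` discards in a row, then explore (sample)."""
    if best is None or consecutive_discards >= patience:
        return Candidate(rng.randint(*layer_range), rng.uniform(*coeff_range))
    layer = min(max(best.layer_idx + rng.choice((-1, 0, 1)), layer_range[0]), layer_range[1])
    coeff = min(max(best.coeff + rng.gauss(0.0, 0.05), coeff_range[0]), coeff_range[1])
    return replace(best, layer_idx=layer, coeff=coeff)


rng = random.Random(42)
best = Candidate(layer_idx=20, coeff=0.27)
print(propose(rng, best, consecutive_discards=2))  # exploit: near layer 20, coeff 0.27
print(propose(rng, best, consecutive_discards=5))  # explore: anywhere in the space
```

Resetting consecutive_discards to zero on every keep (done by the caller) returns the policy to exploitation around the new best, which would produce the pattern of a late new best seen at iteration 25.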
Note
Results vary across runs due to the stochastic search. Passing a fixed seed (e.g. seed=42) to AutoSteerRunner makes runs reproducible for a given hardware/model combination. Schedule-augmented configs (cosine, decay) tend to outperform static coefficients on longer generations, where late-stage steering can degrade fluency.
AutoSteer also works as a drop-in skill for agentic coding assistants. The SKILL.md file defines an autonomous experiment loop — the agent edits code, commits, runs, measures, and decides whether to keep or revert. Think of it as Karpathy's autoresearch, generalised to any optimization target.
Setup as Claude Code skill:

```bash
git clone https://github.com/neilblaze/AutoSteer.git ~/.claude/skills/autoresearch
```

Then invoke with /autoresearch or tell the agent to "optimize val_bpb in a loop".
Setup as Codex skill:

```bash
git clone https://github.com/neilblaze/AutoSteer.git ~/.codex/skills/autoresearch
```

The skill integrates steering-vector search as an additional experiment modality. When the optimization target involves LLM output quality, the agent can use AutoSteerRunner within the experiment loop to search over layer/coefficient/text/schedule configurations programmatically.
Data flow:
```mermaid
flowchart TD
S(["SteerSearchSpace"])
P["AutoSteerSearch.propose()"]
C(["SteerCandidate"])
A["Steer.add(vectors)"]
E["SteerEvaluator.evaluate()"]
R["AutoSteerSearch.record(score, status)"]
K["✓ Keep — update best"]
D["✗ Discard — reset"]
S -->|"define search axes"| P
P -->|"sample / perturb"| C
C --> A
A -->|"steered model"| E
E -->|"score"| R
R --> K
R -. "no improvement" .-> D
classDef space fill:#06b6d4,stroke:#0891b2,color:#fff,rx:14,ry:14
classDef propose fill:#6366f1,stroke:#4f46e5,color:#fff,rx:12,ry:12
classDef candidate fill:#3b82f6,stroke:#2563eb,color:#fff,rx:14,ry:14
classDef inject fill:#0d9488,stroke:#0f766e,color:#fff,rx:12,ry:12
classDef eval fill:#f59e0b,stroke:#d97706,color:#fff,rx:12,ry:12
classDef record fill:#64748b,stroke:#475569,color:#fff,rx:12,ry:12
classDef keep fill:#22c55e,stroke:#16a34a,color:#fff,rx:12,ry:12
classDef discard fill:#ef4444,stroke:#dc2626,color:#fff,rx:12,ry:12
class S space
class P propose
class C candidate
class A inject
class E eval
class R record
class K keep
class D discard
linkStyle 0 stroke:#06b6d4,stroke-width:2px
linkStyle 1 stroke:#6366f1,stroke-width:2px
linkStyle 2 stroke:#3b82f6,stroke-width:2px
linkStyle 3 stroke:#0d9488,stroke-width:2px
linkStyle 4 stroke:#f59e0b,stroke-width:2px
linkStyle 5 stroke:#22c55e,stroke-width:2.5px
linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
```
Key differences from the original llm_steer:
| Area | Original | AutoSteer |
|---|---|---|
| State isolation | Class-level steers dict (shared across instances) | Instance-level dict (safe for multiple Steer objects) |
| Gradient tracking | Steering deltas tracked in autograd graph | torch.no_grad() on delta computation path |
| Layer access | Hardcoded _modules["model"].layers | Architecture-agnostic _get_layers() with fallback |
| Normalization | Recomputed every forward pass | _layer_norm_eps cached at init |
| Schedule support | DecaySchedule only | DecaySchedule + CosineSchedule + WarmupSchedule |
| Search | Manual trial-and-error | Autonomous exploit → explore loop with TSV logging |
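The state-isolation row refers to a classic Python pitfall: a mutable dict declared as a class attribute is shared by every instance. The two toy classes below use hypothetical names and are not the real Steer code. They show how two Steer-like objects could interfere under the class-level design, and why moving the dict into __init__ fixes it.

```python
from typing import Dict, List


class SharedSteer:
    steers: Dict[int, List[float]] = {}  # class attribute: one dict for ALL instances

    def add(self, layer_idx: int, coeff: float) -> None:
        self.steers.setdefault(layer_idx, []).append(coeff)


class IsolatedSteer:
    def __init__(self) -> None:
        self.steers: Dict[int, List[float]] = {}  # instance attribute: one dict per object

    def add(self, layer_idx: int, coeff: float) -> None:
        self.steers.setdefault(layer_idx, []).append(coeff)


a, b = SharedSteer(), SharedSteer()
a.add(20, 0.4)
print(b.steers)  # {20: [0.4]}: b sees a's steering entry

c, d = IsolatedSteer(), IsolatedSteer()
c.add(20, 0.4)
print(d.steers)  # {}: d is unaffected
```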
AutoSteer works with HuggingFace transformers models that follow the standard decoder-layer layout:
- LLaMA (all sizes)
- Mistral / Mixtral
- Phi-2, Phi-3
- StableLM
- Qwen / Qwen2
Tip
If your model uses a non-standard internal layout, the _get_layers() method will raise a clear error. Open an issue with the model name and we'll add support. Also, avoid heavily RLHF-tuned models; they tend to be difficult to steer.
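An architecture-agnostic lookup of this kind can be as simple as trying a few well-known attribute paths in order. The function below is a hypothetical sketch, not AutoSteer's _get_layers(). In the Hugging Face LLaMA, Mistral, Qwen2, StableLM and Phi implementations, the decoder layers live at model.model.layers. The transformer.h (GPT-2 style) and gpt_neox.layers paths are extra fallbacks added here for illustration.

```python
from types import SimpleNamespace
from typing import Any, Sequence

# Attribute paths tried in order (an assumed list; extend it for other layouts).
_LAYER_PATHS = ("model.layers", "transformer.h", "gpt_neox.layers")


def find_decoder_layers(model: Any) -> Sequence[Any]:
    """Return the decoder-layer list of a causal LM, or raise a descriptive error."""
    for path in _LAYER_PATHS:
        obj = model
        for attr in path.split("."):
            obj = getattr(obj, attr, None)
            if obj is None:
                break
        if obj is not None:
            return obj
    raise ValueError(
        f"Could not locate decoder layers on {type(model).__name__}; "
        f"tried: {', '.join(_LAYER_PATHS)}"
    )


# Toy objects standing in for real models (SimpleNamespace mimics attribute access).
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["l0", "l1"]))
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["h0"]))
print(find_decoder_layers(llama_like), find_decoder_layers(gpt2_like))
```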
How is this different from fine-tuning or LoRA?
Steering vectors modify activations at inference time without touching model weights. There's no training step, no dataset curation, and changes are instantly reversible. Think of it as a real-time "knob" you can turn during generation.
How do I pick the right layer and coefficient?
That's what the autonomous search is for. If you want to do it manually, start with layers around the middle of the model (e.g., layers 16–24 for a 32-layer model) and a small coefficient (0.1–0.5). Increase gradually until the output quality improves without degrading coherence.
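This rule of thumb translates directly into a starting grid for a manual sweep. The helpers below are a hypothetical convenience. They assume that "middle" means the band from 50% to 75% of the model's depth, which reproduces the 16–24 example for a 32-layer model.

```python
from typing import List, Tuple


def layer_band(num_layers: int, lo_frac: float = 0.5, hi_frac: float = 0.75) -> Tuple[int, int]:
    """Inclusive (first, last) layer indices of the assumed middle band."""
    return int(num_layers * lo_frac), int(num_layers * hi_frac)


def manual_grid(num_layers: int, coeffs=(0.1, 0.2, 0.3, 0.4, 0.5)) -> List[Tuple[int, float]]:
    """(layer_idx, coeff) pairs to try, all layers at the smallest coefficient first."""
    first, last = layer_band(num_layers)
    return [(layer, c) for c in coeffs for layer in range(first, last + 1)]


print(layer_band(32))        # (16, 24)
print(len(manual_grid(32)))  # 45 = 9 layers x 5 coefficients
```

The coefficient loop is the outer one, so every layer is tried at low strength before any is pushed harder, in line with the advice to increase gradually.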
Can I stack multiple steering vectors?
Yes. You can add multiple vectors to the same layer, the same vector to multiple layers, or use negative coefficients to steer away from a concept. The entire design is built for composition and experimentation.
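Standard activation addition assumes that each steer contributes linearly. Under that assumption, stacking amounts to summing the scaled contributions at each steered layer $\ell$. Here $S_\ell$ is the set of steers on that layer, $c_i$ is a coefficient (negative to steer away), $v_i$ is the concept's steering vector, and $m_i(t)$ is the schedule multiplier at step $t$ (equal to 1 for a static coefficient):

```math
h'_{\ell} = h_{\ell} + \sum_{i \in S_{\ell}} c_i \, m_i(t) \, v_i
```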
What if the output becomes gibberish?
Lower the coefficient or try a different layer. High coefficients (> 1.0) on early layers tend to cause decoherence. Using a WarmupSchedule can also help by ramping up the influence gradually.
```bash
git clone https://github.com/neilblaze/AutoSteer.git
cd AutoSteer

# Using uv (recommended)
uv pip install -e ".[dev]"
pytest

# Using standard pip
pip install -e ".[dev]"
pytest
```

- Mihaiii/llm_steer: the original activation-steering library that AutoSteer builds upon.
- Karpathy/autoresearch: the autonomous experiment loop concept adapted for steering-vector search.
- Steering GPT-2-XL by Adding an Activation Vector: foundational research on activation engineering.


