🧭 AutoSteer

Autonomous activation steering optimization for LLMs

What is this? 🤔

Steer LLM outputs by injecting activation vectors into transformer layers at inference time — then let the machine figure out the best configuration for you.

AutoSteer merges two ideas:

Activation engineering (llm_steer) — add a "direction" vector to a model's hidden states so it produces more logical, creative, or on-topic outputs without retraining.
Autonomous research (autoresearch) — an agent-driven experiment loop that runs indefinitely: propose a change, measure the result, keep it if it's better, revert if it's not.

Put them together: AutoSteer searches over which layers to steer, how strongly, which concepts to inject, and what coefficient schedule to use — all autonomously. You define a scoring function, press start, and walk away.

Note

AutoSteer doesn't retrain or fine-tune the model. It manipulates hidden-state activations at inference time, which means zero gradient computation, zero training data, and instant rollback. The original model weights are never modified.
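
As a rough illustration of the note above, the sketch below adds scaled direction vectors to a toy hidden state and then drops them again. Everything in it is an assumption made for illustration: `apply_steering` is a hypothetical helper, plain Python lists stand in for tensors, and the direction vectors are made up. It is not AutoSteer's API; it only shows why rollback is immediate when the weights are never touched.

```python
from typing import List, Sequence, Tuple

Vector = List[float]

def apply_steering(hidden: Sequence[float],
                   steers: Sequence[Tuple[float, Sequence[float]]]) -> Vector:
    """Return hidden + sum(coeff * direction) without mutating the input."""
    out = list(hidden)
    for coeff, direction in steers:
        if len(direction) != len(out):
            raise ValueError("direction must match the hidden size")
        for i, d in enumerate(direction):
            out[i] += coeff * d
    return out

hidden = [0.5, -1.0, 2.0]        # toy hidden state for a single token
logical = [1.0, 0.0, 0.0]        # hypothetical "logical" direction
irrational = [0.0, 1.0, 0.0]     # hypothetical "irrational" direction

# Steer towards one concept and away from another (negative coefficient).
steered_hidden = apply_steering(hidden, [(0.4, logical), (-0.4, irrational)])

# "Rollback" is just running without any active vectors.
restored = apply_steering(hidden, [])

print(steered_hidden)
print(restored == hidden)
```

Because the steering term is added on the fly and the input is never mutated, removing the vectors restores the original activations exactly.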

How It Works ❓

flowchart TD
    A(["🚀 AutoSteerRunner"])
    B["Run baseline evaluation"]
    C["Propose candidate<br/>(sample or perturb)"]
    D["Apply steering vectors<br/>to model layers"]
    E["Evaluate on prompts"]
    F{"Improved?"}
    G["✓ Keep<br/>Update best config"]
    H["✗ Discard<br/>Reset vectors"]
    I["Log result to TSV"]
    J{"More<br/>iterations?"}
    K(["📊 Return best config<br/>+<br/>history"])

    A -->|"initialize"| B
    B -->|"baseline score"| C
    C -->|"SteerCandidate"| D
    D -->|"steered model"| E
    E --> F
    F -- "Yes" --> G
    F -. "No" .-> H
    G --> I
    H --> I
    I --> J
    J -- "Yes" --> C
    J -. "Done" .-> K

    classDef runner fill:#0d9488,stroke:#0f766e,color:#fff,rx:14,ry:14
    classDef setup fill:#3b82f6,stroke:#2563eb,color:#fff,rx:12,ry:12
    classDef action fill:#6366f1,stroke:#4f46e5,color:#fff,rx:12,ry:12
    classDef decision fill:#f59e0b,stroke:#d97706,color:#fff,rx:12,ry:12
    classDef keep fill:#22c55e,stroke:#16a34a,color:#fff,rx:12,ry:12
    classDef discard fill:#ef4444,stroke:#dc2626,color:#fff,rx:12,ry:12
    classDef output fill:#8b5cf6,stroke:#7c3aed,color:#fff,rx:14,ry:14
    classDef log fill:#64748b,stroke:#475569,color:#fff,rx:12,ry:12

    class A runner
    class B setup
    class C,D,E action
    class F,J decision
    class G keep
    class H discard
    class I log
    class K output

    linkStyle 0 stroke:#0d9488,stroke-width:2px
    linkStyle 1 stroke:#3b82f6,stroke-width:2px
    linkStyle 2 stroke:#6366f1,stroke-width:2px
    linkStyle 3 stroke:#6366f1,stroke-width:2px
    linkStyle 4 stroke:#6366f1,stroke-width:2px
    linkStyle 5 stroke:#22c55e,stroke-width:2.5px
    linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
    linkStyle 7 stroke:#22c55e,stroke-width:2px
    linkStyle 8 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3
    linkStyle 9 stroke:#64748b,stroke-width:2px
    linkStyle 10 stroke:#f59e0b,stroke-width:2px
    linkStyle 11 stroke:#8b5cf6,stroke-width:2px,stroke-dasharray:4 2

The runner manages a Steer object that hooks into the model's decoder layers. Each iteration:

A SteerCandidate is sampled or perturbed from the search space
Steering vectors are injected into the specified layers
The model is evaluated on your prompts using the configured metric
If the score improves, the config is kept; otherwise, the vectors are reset and the loop continues

No files are modified on disk. No git branches. Just a tight evaluation loop in memory.
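
The loop described above can be sketched in a few lines. This is a simplified stand-in, not the actual `AutoSteerRunner`: `run_search`, the `(layer_idx, coeff)` tuple, and `toy_score` are hypothetical, and candidates are only sampled at random here (no perturbation). The result keys mirror the ones used in the Quick Start example.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Tuple[int, float]  # (layer_idx, coeff): simplified, hypothetical shape

def run_search(score_fn: Callable[[Candidate], float],
               baseline: float,
               layer_range: Tuple[int, int],
               coeff_range: Tuple[float, float],
               max_iterations: int,
               lower_is_better: bool = False,
               seed: int = 0) -> Dict[str, object]:
    """Propose, evaluate, keep if better, otherwise discard; all in memory."""
    rng = random.Random(seed)
    best_score = baseline
    best_candidate: Optional[Candidate] = None
    history: List[Tuple[int, Candidate, float, str]] = []
    for iteration in range(1, max_iterations + 1):
        candidate = (rng.randint(*layer_range), rng.uniform(*coeff_range))
        score = score_fn(candidate)
        improved = score < best_score if lower_is_better else score > best_score
        if improved:
            best_score, best_candidate = score, candidate
        history.append((iteration, candidate, score,
                        "keep" if improved else "discard"))
    return {"baseline_score": baseline, "best_score": best_score,
            "best_candidate": best_candidate, "history": history}

def toy_score(candidate: Candidate) -> float:
    """Made-up objective that peaks at layer 18, coeff 0.4 (higher is better)."""
    layer_idx, coeff = candidate
    return -abs(layer_idx - 18) - abs(coeff - 0.4)

results = run_search(toy_score, baseline=-10.0, layer_range=(15, 25),
                     coeff_range=(0.05, 1.0), max_iterations=30)
print(results["best_candidate"], results["best_score"])
```

In a real run the score function would apply the vectors, generate on the prompts, and reset the vectors afterwards; the toy objective replaces all of that so the control flow stays visible.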

Install 🔻

We highly recommend using uv for lightning-fast installation and dependency resolution.

# Using uv (recommended)
uv pip install -e .

# Using standard pip
pip install -e .

Or add it as a dependency:

# Using uv (recommended)
uv pip install git+https://github.com/neilblaze/AutoSteer.git

# Using standard pip
pip install git+https://github.com/neilblaze/AutoSteer.git

Requires Python ≥ 3.9, PyTorch ≥ 2.0, and transformers ≥ 5.8.0.

Quick Start

🟠 Manual Steering

Inject steering vectors by hand — useful for prototyping before going autonomous.

from transformers import AutoModelForCausalLM, AutoTokenizer
from autosteer import Steer

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-zephyr-3b")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-zephyr-3b")
model.to("cuda")

steered = Steer(model, tokenizer)
steered.add(layer_idx=20, coeff=0.4, text="logical")
steered.add(layer_idx=20, coeff=-0.4, text="irrational")

# Generate with steering active
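inputs = tokenizer("What weighs more, two pounds of feathers or one pound of bricks?", return_tensors="pt").to("cuda")
# Greedy decoding (temperature is ignored unless sampling is enabled); pass the attention mask explicitly
output = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=128, do_sample=False)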
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Clean up
steered.reset_all()

🟢 Autonomous Search

Let AutoSteer find the optimal steering configuration automatically.

from autosteer import AutoSteerRunner, SteerSearchSpace

prompts = [
    "What weighs more, two pounds of feathers or one pound of bricks?",
    "If I have 3 apples and give away 2, then buy 5 more, how many do I have?",
    "A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?",
]
expected = ["two pounds of feathers", "6", "$0.05"]

space = SteerSearchSpace(
    layer_range=(15, 25),
    coeff_range=(0.05, 1.0),
    texts=["logical", "precise", "analytical"],
    negative_texts=["confused", "irrational", "imprecise"],
)

runner = AutoSteerRunner(
    model=model,
    tokenizer=tokenizer,
    prompts=prompts,
    metric="task_accuracy",
    expected_outputs=expected,
    lower_is_better=False,
    search_space=space,
    log_path="results.tsv",
)

results = runner.run(max_iterations=30)
print(f"Baseline: {results['baseline_score']:.4f} → Best: {results['best_score']:.4f}")
print(f"Best config: {results['best_candidate']}")

Coefficient Schedules

Steering coefficients don't have to be static. AutoSteer ships with three schedule types that modulate the coefficient strength as tokens are generated:

Schedule	Behaviour	Use Case
`DecaySchedule`	Exponential decay with optional sawtooth restarts	Fade out steering influence over long generations
`CosineSchedule`	Smooth cosine annealing from 1.0 → min	Gradual tapering without sharp transitions
`WarmupSchedule`	Linear ramp from start → 1.0	Avoid activation shocks in early tokens

from autosteer import Steer, DecaySchedule, CosineSchedule

steered = Steer(model, tokenizer)

# Decay: start full strength, fade to 10% over ~50 steps
steered.add(layer_idx=20, coeff=0.5, text="logical",
            coeff_schedule=DecaySchedule(rate=0.95, min_multiplier=0.1))

# Cosine: smooth taper over 40 steps
steered.add(layer_idx=18, coeff=0.3, text="concise",
            coeff_schedule=CosineSchedule(period=40, min_multiplier=0.05))
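
To make the schedule shapes concrete, here is a sketch of the per-step multiplier each schedule type might apply. The formulas are assumptions inferred from the table above, not AutoSteer's actual implementation: the function names, the `warmup_steps` and `start` parameters, and the omission of sawtooth restarts are all hypothetical.

```python
import math

def decay_multiplier(step: int, rate: float, min_multiplier: float) -> float:
    """Exponential decay rate**step, clamped at min_multiplier (no restarts)."""
    return max(min_multiplier, rate ** step)

def cosine_multiplier(step: int, period: int, min_multiplier: float) -> float:
    """Cosine annealing from 1.0 at step 0 down to min_multiplier at step >= period."""
    progress = min(step, period) / period
    return min_multiplier + (1.0 - min_multiplier) * 0.5 * (1.0 + math.cos(math.pi * progress))

def warmup_multiplier(step: int, warmup_steps: int, start: float) -> float:
    """Linear ramp from `start` at step 0 up to 1.0 at step >= warmup_steps."""
    progress = min(step, warmup_steps) / warmup_steps
    return start + (1.0 - start) * progress

# First step at which a 0.95 decay rate hits the 0.1 floor under this assumed form.
first_floor_step = next(s for s in range(1000)
                        if decay_multiplier(s, 0.95, 0.1) == 0.1)
print(first_floor_step)

# Effective coefficient at a given step = base coeff * multiplier.
print(0.3 * cosine_multiplier(20, period=40, min_multiplier=0.05))
```

Under this assumed decay form, `rate=0.95` reaches the 0.1 floor at step 45, which is in the same ballpark as the "~50 steps" mentioned in the comment above.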

Evaluation Metrics 🧪

Metric	What It Measures	Direction
`perplexity`	Average per-token perplexity on evaluation texts	Lower is better
`task_accuracy`	Fraction of prompts where generation contains expected substring	Higher is better
`custom`	Any user-defined `(model, tokenizer, prompts) → float`	User-defined
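
The two built-in metrics can be approximated as follows. This is a minimal sketch based on the table's descriptions, operating on already-generated strings and per-token log-probabilities rather than on a model; the function names match the metric names only for readability, and details such as case sensitivity are assumptions.

```python
import math
from typing import Sequence

def task_accuracy(generations: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of generations containing their expected substring (case-sensitive)."""
    if not expected or len(generations) != len(expected):
        raise ValueError("need one expected substring per generation")
    hits = sum(want in text for text, want in zip(generations, expected))
    return hits / len(expected)

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

generations = [
    "I think two pounds of feathers weigh more.",
    "You have 6 apples.",
    "The ball costs $0.10.",          # the classic wrong answer
]
expected = ["two pounds of feathers", "6", "$0.05"]

accuracy = task_accuracy(generations, expected)
ppl = perplexity([math.log(0.5)] * 4)   # every token predicted with p = 0.5
print(accuracy, ppl)
```

A `custom` metric would wrap the same kind of computation behind the `(model, tokenizer, prompts) → float` signature listed in the table.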

Experiment Results 📊

Below are representative outputs from running AutoSteer on stabilityai/stablelm-zephyr-3b (32 layers, ~2.8B parameters) using a T4 GPU. These were produced by the autonomous search loop in demo/autosteer_demo.ipynb.

Full experiment logs (tab-separated, as produced by AutoSteerRunner):

demo/results/task_accuracy_stablelm3b.tsv — 20 iterations, task_accuracy metric
demo/results/perplexity_stablelm3b.tsv — 30 iterations, perplexity metric

Task Accuracy (Reasoning Prompts)

Search space: layers (11, 26), coefficients (0.05, 0.80), texts ["logical", "precise", "analytical", "mathematical"], up to 2 vectors per candidate.

Baseline accuracy: 33.33%
Best accuracy:     100.00%
Total experiments: 20
Keeps: 2 | Discards: 18 | Crashes: 0

Best configuration (found at iteration 7):

steered = Steer(model, tokenizer)
steered.add(layer_idx=18, coeff=0.3947, text="logical")
steered.add(layer_idx=21, coeff=-0.3516, text="irrational")

The search first found a 2/3 config at iteration 4 (L19 + L21), then a perturbation at iteration 7 shifted to L18 + L21 and hit 3/3. Because 3/3 is the ceiling for this metric, no later candidate could score higher, so every remaining candidate was discarded.

Perplexity (General Quality)

Same model, same search space, but optimizing perplexity (lower is better) on a broader prompt set:

Baseline perplexity: 8.4213
Best perplexity:     6.7541 (Δ=−1.6672)
Total experiments:   30
Keeps: 6 | Discards: 23 | Crashes: 1

Best configuration (found at iteration 25):

steered = Steer(model, tokenizer)
steered.add(layer_idx=20, coeff=0.2718, text="precise",
            coeff_schedule=CosineSchedule(period=45, min_multiplier=0.08))
steered.add(layer_idx=19, coeff=-0.1934, text="confused",
            coeff_schedule=CosineSchedule(period=45, min_multiplier=0.08))

The perplexity run shows a more gradual optimization curve — the search found incremental improvements through perturbation (iterations 1→4→7→10→15), then hit a plateau. After 5 consecutive discards it dropped into random exploration, crashed once on an extreme dual-coefficient config (L11 + L26), and recovered with a new best at iteration 25. The winning config uses a CosineSchedule to taper steering influence as generation stabilises.
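
The plateau-then-explore behaviour described above can be sketched as a small proposer that perturbs the current best until a discard streak reaches a patience threshold, then falls back to random sampling. `ToyProposer`, its `patience` parameter, and the perturbation sizes are hypothetical; only the threshold of five consecutive discards comes from the run described here.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Candidate:
    layer_idx: int
    coeff: float

class ToyProposer:
    """Exploit (perturb best) until `patience` discards in a row, then explore."""

    def __init__(self, layer_range: Tuple[int, int],
                 coeff_range: Tuple[float, float],
                 patience: int = 5, seed: int = 0) -> None:
        self.layer_range = layer_range
        self.coeff_range = coeff_range
        self.patience = patience
        self.rng = random.Random(seed)
        self.best: Optional[Candidate] = None
        self.discard_streak = 0

    @property
    def exploring(self) -> bool:
        return self.best is None or self.discard_streak >= self.patience

    def propose(self) -> Candidate:
        lo_l, hi_l = self.layer_range
        lo_c, hi_c = self.coeff_range
        if self.exploring:
            # Explore: draw a fresh candidate from the whole search space.
            return Candidate(self.rng.randint(lo_l, hi_l),
                             self.rng.uniform(lo_c, hi_c))
        # Exploit: nudge the best layer by at most one and the coeff by up to 20%.
        layer = self.best.layer_idx + self.rng.choice((-1, 0, 1))
        coeff = self.best.coeff * self.rng.uniform(0.8, 1.2)
        return Candidate(min(hi_l, max(lo_l, layer)), min(hi_c, max(lo_c, coeff)))

    def record(self, candidate: Candidate, improved: bool) -> None:
        if improved:
            self.best, self.discard_streak = candidate, 0
        else:
            self.discard_streak += 1

proposer = ToyProposer(layer_range=(11, 26), coeff_range=(0.05, 0.80))
first = proposer.propose()              # no best yet, so this is a random sample
proposer.record(first, improved=True)
neighbour = proposer.propose()          # exploit: small perturbation of `first`
for _ in range(5):                      # five discards in a row
    proposer.record(proposer.propose(), improved=False)
print(proposer.exploring)
```

Any later improvement resets the streak, so the sketch drops back into perturbation mode around the new best.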

Note

Results vary across runs due to the stochastic search. Setting seed=42 in AutoSteerRunner ensures reproducibility for a given hardware/model combination. Schedule-augmented configs (cosine, decay) tend to outperform static coefficients on longer generations where late-stage steering can degrade fluency.

Agent Skill (Claude Code / Codex)

AutoSteer also works as a drop-in skill for agentic coding assistants. The SKILL.md file defines an autonomous experiment loop — the agent edits code, commits, runs, measures, and decides whether to keep or revert. Think of it as Karpathy's autoresearch, generalised to any optimization target.

Setup as Claude Code skill 🔽

git clone https://github.com/neilblaze/AutoSteer.git ~/.claude/skills/autoresearch

Then invoke with /autoresearch or tell the agent to "optimize val_bpb in a loop".

Setup as Codex skill 🔽

git clone https://github.com/neilblaze/AutoSteer.git ~/.codex/skills/autoresearch

The skill integrates steering-vector search as an additional experiment modality. When the optimization target involves LLM output quality, the agent can use AutoSteerRunner within the experiment loop to search over layer/coefficient/text/schedule configurations programmatically.

Data flow:

flowchart TD
    S(["SteerSearchSpace"])
    P["AutoSteerSearch.propose()"]
    C(["SteerCandidate"])
    A["Steer.add(vectors)"]
    E["SteerEvaluator.evaluate()"]
    R["AutoSteerSearch.record(score, status)"]
    K["✓ Keep — update best"]
    D["✗ Discard — reset"]

    S -->|"define search axes"| P
    P -->|"sample / perturb"| C
    C --> A
    A -->|"steered model"| E
    E -->|"score"| R
    R --> K
    R -. "no improvement" .-> D

    classDef space fill:#06b6d4,stroke:#0891b2,color:#fff,rx:14,ry:14
    classDef propose fill:#6366f1,stroke:#4f46e5,color:#fff,rx:12,ry:12
    classDef candidate fill:#3b82f6,stroke:#2563eb,color:#fff,rx:14,ry:14
    classDef inject fill:#0d9488,stroke:#0f766e,color:#fff,rx:12,ry:12
    classDef eval fill:#f59e0b,stroke:#d97706,color:#fff,rx:12,ry:12
    classDef record fill:#64748b,stroke:#475569,color:#fff,rx:12,ry:12
    classDef keep fill:#22c55e,stroke:#16a34a,color:#fff,rx:12,ry:12
    classDef discard fill:#ef4444,stroke:#dc2626,color:#fff,rx:12,ry:12

    class S space
    class P propose
    class C candidate
    class A inject
    class E eval
    class R record
    class K keep
    class D discard

    linkStyle 0 stroke:#06b6d4,stroke-width:2px
    linkStyle 1 stroke:#6366f1,stroke-width:2px
    linkStyle 2 stroke:#3b82f6,stroke-width:2px
    linkStyle 3 stroke:#0d9488,stroke-width:2px
    linkStyle 4 stroke:#f59e0b,stroke-width:2px
    linkStyle 5 stroke:#22c55e,stroke-width:2.5px
    linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 3

Optimizations 🔻

Key implementation differences from the original llm_steer:

Area	Original	AutoSteer
State isolation	Class-level `steers` dict (shared across instances)	Instance-level dict (safe for multiple Steer objects)
Gradient tracking	Steering deltas tracked in autograd graph	`torch.no_grad()` on delta computation path
Layer access	Hardcoded `_modules["model"].layers`	Architecture-agnostic `_get_layers()` with fallback
Normalization	Recomputed every forward pass	`_layer_norm_eps` cached at init
Schedule support	`DecaySchedule` only	`DecaySchedule` + `CosineSchedule` + `WarmupSchedule`
Search	Manual trial-and-error	Autonomous `exploit → explore` loop with TSV logging
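
The state-isolation row refers to a common Python pitfall: a mutable class attribute is shared by every instance. The two toy classes below are hypothetical and only reproduce the pattern; they are not the code of either library.

```python
class SharedSteer:
    steers = {}                       # class attribute: one dict for all instances

    def add(self, layer_idx, coeff):
        self.steers.setdefault(layer_idx, []).append(coeff)

class IsolatedSteer:
    def __init__(self):
        self.steers = {}              # instance attribute: one dict per object

    def add(self, layer_idx, coeff):
        self.steers.setdefault(layer_idx, []).append(coeff)

a, b = SharedSteer(), SharedSteer()
a.add(20, 0.4)
print(b.steers)    # {20: [0.4]}  (b sees the vector registered on a)

c, d = IsolatedSteer(), IsolatedSteer()
c.add(20, 0.4)
print(d.steers)    # {}  (d is unaffected)
```

With per-instance state, two steering objects wrapping different models (or the same model with different configs) cannot leak vectors into each other.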

Supported Models 🧠

AutoSteer works with HuggingFace transformers models that follow the standard decoder-layer layout:

LLaMA (all sizes)
Mistral / Mixtral
Phi-2, Phi-3
StableLM
Qwen / Qwen2

Tip

If your model uses a non-standard internal layout, the _get_layers() method will raise a clear error. Open an issue with the model name and we'll add support. Also, avoid heavily RLHF-tuned models; they tend to be difficult to steer.
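
One plausible way to implement an architecture-agnostic layer lookup is to try a list of known attribute paths and fail loudly if none matches. The sketch below is an assumption about the approach, not the actual `_get_layers()` code: `find_decoder_layers` and `CANDIDATE_PATHS` are hypothetical names, and `SimpleNamespace` objects stand in for real models.

```python
from types import SimpleNamespace
from typing import Any, Sequence

# Attribute paths used by some common HuggingFace causal-LM classes.
CANDIDATE_PATHS = ("model.layers", "transformer.h", "gpt_neox.layers")

def find_decoder_layers(model: Any, paths: Sequence[str] = CANDIDATE_PATHS) -> Any:
    """Return the first attribute path that resolves on `model`, else raise."""
    for path in paths:
        obj = model
        for attr in path.split("."):
            obj = getattr(obj, attr, None)
            if obj is None:
                break                 # this path does not exist; try the next one
        else:
            return obj                # every attribute in the path resolved
    raise AttributeError(
        f"Could not locate decoder layers on {type(model).__name__}; "
        f"tried: {', '.join(paths)}"
    )

llama_like = SimpleNamespace(model=SimpleNamespace(layers=["layer0", "layer1"]))
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0"]))

print(find_decoder_layers(llama_like))
print(find_decoder_layers(gpt2_like))
```

Listing the attempted paths in the error message makes it easier to report an unsupported layout in an issue.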

Q & A ❓

How is this different from fine-tuning or LoRA?

Steering vectors modify activations at inference time without touching model weights. There's no training step, no dataset curation, and changes are instantly reversible. Think of it as a real-time "knob" you can turn during generation.

How do I pick the right layer and coefficient?

That's what the autonomous search is for. If you want to do it manually, start with layers around the middle of the model (e.g., layers 16–24 for a 32-layer model) and a small coefficient (0.1–0.5). Increase the coefficient gradually until the output quality improves without degrading coherence.

Can I stack multiple steering vectors?

Yes. You can add multiple vectors to the same layer, the same vector to multiple layers, or use negative coefficients to steer away from a concept. The entire design is built for composition and experimentation.

What if the output becomes gibberish?

Lower the coefficient or try a different layer. High coefficients (> 1.0) on early layers tend to produce incoherent output. Using a WarmupSchedule can also help by ramping up the influence gradually.

Development 🛠️

git clone https://github.com/neilblaze/AutoSteer.git
cd AutoSteer

# Using uv (recommended)
uv pip install -e ".[dev]"
pytest

# Using standard pip
pip install -e ".[dev]"
pytest

Credits

Mihaiii/llm_steer — the original activation-steering library that AutoSteer builds upon.
Karpathy/autoresearch — the autonomous experiment loop concept adapted for steering-vector search.
Steering GPT-2-XL by Adding an Activation Vector — foundational research on representation engineering.

License 📜

Apache-2.0
