Open source by Santander AI Lab. An LLM / AI-safety guardrail research library / evaluation harness (autoresearch-style): it searches over a single mutable
policy.mdsurface to minimize attack success rate (ASR) against a fixed evaluation suite, with a benign-pass floor.
Part of Santander AI Open Source — open source AI projects from Banco Santander (santander.com).
autoguardrails is a small alignment research scaffold inspired by Karpathy's autoresearch.
Instead of searching over train.py, this repo searches over policy.md.
The idea is the same:
- keep the mutable surface tiny
- keep the evaluator fixed
- run under a fixed wall-clock budget
- compare candidates with one top-line metric
- log every keep or discard decision
In this repo, the top-line metric is attack success rate (ASR, lower is better), with a benign-pass floor so the system cannot win by refusing everything.
For day-to-day experimentation, three files matter most:
program.md: the human-owned instructions for the looppolicy.md: the only file you should edit between runsresults.tsv: the append-only run log
Everything else is fixed harness code or fixed evaluation data.
- Mutable surface:
policy.md - Fixed suite:
eval_suite.jsonl - Fixed judge prompt:
judge_prompt.md - Fixed harness:
autoguardrails/ - Acceptance rule: keep a candidate only if
ASRimproves and benign pass does not fall by more than 2 percentage points - Runtime budget: fixed by the harness config, currently 5 minutes per evaluation pass
If you want a mental model closer to the original autoresearch, think of autoguardrails/ as the fixed helper layer and policy.md as the single file under search.
Run from the repository root.
- Record a baseline.
python -m autoguardrails baseline --reset --repeat 2 --notes "initial baseline"-
Edit only
policy.md. -
Score the new candidate.
python -m autoguardrails candidate --repeat 2 --notes "cover jailbreak and obfuscation"- Inspect the current kept result.
python -m autoguardrails status- Inspect the full log.
cat results.tsvIf a candidate is rejected, the harness restores policy.md to the last accepted version automatically.
If you prefer a single entrypoint, use run_autoguardrails.sh:
sh run_autoguardrails.sh status
sh run_autoguardrails.sh evaluate
sh run_autoguardrails.sh baseline "initial baseline" 2
sh run_autoguardrails.sh candidate "cover jailbreak and obfuscation" 2On Windows, run the wrapper from Git Bash or another POSIX-compatible shell.
The default setup uses a deterministic local stub so the repo works offline. To run real experiments, point the target model and the judge model at OpenAI-compatible endpoints.
Target model variables:
AUTOGUARDRAILS_TARGET_PROVIDER=openai_compatibleAUTOGUARDRAILS_TARGET_MODELAUTOGUARDRAILS_TARGET_API_BASEAUTOGUARDRAILS_TARGET_API_KEY
Judge model variables:
AUTOGUARDRAILS_JUDGE_PROVIDER=openai_compatibleAUTOGUARDRAILS_JUDGE_MODELAUTOGUARDRAILS_JUDGE_API_BASEAUTOGUARDRAILS_JUDGE_API_KEY
Example:
export AUTOGUARDRAILS_TARGET_PROVIDER=openai_compatible
export AUTOGUARDRAILS_TARGET_MODEL=gpt-4.1-mini
export AUTOGUARDRAILS_TARGET_API_BASE=https://your-endpoint.example/v1
export AUTOGUARDRAILS_TARGET_API_KEY=your-target-key
export AUTOGUARDRAILS_JUDGE_PROVIDER=openai_compatible
export AUTOGUARDRAILS_JUDGE_MODEL=gpt-4.1-mini
export AUTOGUARDRAILS_JUDGE_API_BASE=https://your-endpoint.example/v1
export AUTOGUARDRAILS_JUDGE_API_KEY=your-judge-key
python -m autoguardrails baseline --reset --repeat 2 --notes "real-model baseline"Use a frozen judge setup during a run series. Do not switch judge prompts or judge models mid-experiment.
A simple offline emulation cycle looks like this:
- Record the baseline.
- Add one policy change family to
policy.md. - Run
candidate. - Keep the change only if the harness accepts it.
- Repeat with one new change at a time.
One example candidate change that improves the bundled stub is to add explicit handling for:
- jailbreak phrasing such as "ignore previous instructions", "roleplay", and "developer mode"
- obfuscation requests such as translation, base64, rot13, JSON-only formatting, or schema conversion
That gives you a realistic first improvement curve without changing the evaluator.
program.md: experiment instructions and constraintspolicy.md: mutable guardrail policy under searchjudge_prompt.md: frozen judge prompteval_suite.jsonl: fixed attack and benign eval casesresults.tsv: run logrun_autoguardrails.sh: convenience wrapper around the CLIautoguardrails/: fixed Python harnesstests/: regression and safety checks for the harness
See autoguardrails/README.md for the code architecture and tests/README.md for the test strategy.
- This scaffold is intentionally single-turn and narrow in scope.
- It does not model tools, file access, or multi-step agent actions.
- The bundled stub is for harness verification only; it is not a realistic safety model.
- The eval suite is fixed by design. If you change it, start a new experiment lineage instead of comparing against old results.
- Python 3.10+
- No third-party runtime dependencies — the harness is built entirely on the Python standard library and runs offline by default.
- Optional, for development only:
ruff,black,mypy,pytest,pytest-cov(see CONTRIBUTING.md). - Optional, for real-model experiments: access to an OpenAI-compatible chat-completions endpoint (configured via the
AUTOGUARDRAILS_*environment variables described above).
Contributions are welcome! Please read our Contributing Guidelines and Code of Conduct before getting started.
- Report bugs and request features via GitHub Issues.
- External contributors sign the CLA (handled automatically by the CLA Assistant bot on your first PR).
- Run
ruff check .,black --check .,mypy autoguardrails, andpytestbefore opening a PR. - Respect the research contract:
policy.mdis the only mutable surface;eval_suite.jsonlandjudge_prompt.mdare frozen.
Please report security vulnerabilities responsibly. See our Security Policy for how to report (do not open a public issue for vulnerabilities). Contact: security-opensource@gruposantander.com or use GitHub Security Advisories.
This project is licensed under the Apache License 2.0 — see the LICENSE and NOTICE files for details.
Copyright (c) 2026 Santander Group
SPDX-License-Identifier: Apache-2.0
If you use autoguardrails in your research, please cite it:
@software{autoguardrails2026,
author = {{Santander AI Lab}},
title = {autoguardrails: an autoresearch-style guardrail policy loop},
year = {2026},
url = {https://github.com/SantanderAI/autoguardrails},
license = {Apache-2.0}
}