A framework for running AI agent experiments with structured reasoning extraction, logging, and automated judging of deceptive behaviour.
- Runs AI agents with tool-calling capabilities (file system access)
- Extracts and displays reasoning/thought processes from models
- Logs conversations with metadata (token counts, finish reasons, turn counts)
- Judges experiment outputs for deception via regex, blackbox, and glassbox checks
- Supports multiple model providers via OpenAI-compatible APIs
- Configurable via YAML
# Install dependencies
uv sync
# Run all experiments from config.yaml
uv run src/main.py
# Judge experiment logs
uv run python src/judge_runner.py --logs-dir logs/ --output output/results.csvThree independent checks evaluate experiment outputs for deceptive behaviour:
graph LR
A[Experiment Logs] --> B[Judge Runner CLI]
B --> C[RegexChecker<br/>local patterns]
B --> D[BlackboxChecker<br/>output vs baseline]
B --> E[GlassboxChecker<br/>reasoning + intent]
D & E --> F[Anthropic Batch API<br/>50% cost]
C & D & E --> G[CSV Results]
C & D & E --> H[Judge Logs<br/>full CoT reasoning]
| Check | Input | Output |
|---|---|---|
| Regex | Model output + scenario patterns | PASS / FAIL |
| Blackbox | Model output vs baseline (no reasoning) | ACCURATE / MISLEADING / OMITTED / FABRICATED |
| Glassbox | Reasoning trace + system prompt + output | Intent category + sophistication tier |
Usage:
# Batch mode (Anthropic Batch API — half cost):
uv run python src/judge_runner.py --logs-dir logs/ --output output/results.csv
# Single file (synchronous):
uv run python src/judge_runner.py --log-file logs/.../file.json --mode singleResults are saved to CSV and full judge CoT logs (including judge model) are saved to the configured judge.log_dir.
providers:
openai:
api_key_env: OPENAI_API_KEY
base_url: https://api.openai.com/v1
google:
api_key_env: GOOGLE_API_KEY
base_url: https://generativelanguage.googleapis.com/v1beta/openai/
openrouter:
api_key_env: OPENROUTER_API_KEY
base_url: https://openrouter.ai/api/v1
models:
- id: gemini-3-flash-preview
provider: google
temperature: 1
extra_body:
extra_body:
google:
thinking_config:
include_thoughts: True
scenarios:
- path: scenarios/corporate_sabotage_v2
runs: 1
oversight_levels:
- high
output:
dir: logs
judge:
model: claude-sonnet-4-20250514
temperature: 0
log_dir: judge_logs
logging:
level: 3 # 1-4 (1=CRITICAL, 2=WARNING, 3=INFO, 4=DEBUG)
format: "[{level}] {message}"
output: both # console, file, or both
file: logs/experiment.logThe framework uses Python's logging module with configurable verbosity levels:
| Level | Shows |
|---|---|
| 1 | CRITICAL only (major failures) |
| 2 | WARNING + CRITICAL |
| 3 | INFO + WARNING + CRITICAL (default) |
| 4 | DEBUG + INFO + WARNING + CRITICAL (includes reasoning, VFS, tools) |
At level 4 (DEBUG), additional information is logged:
- Full reasoning traces (truncated at INFO level)
- VFS path and file listing
- Available tools list
Level 4 automatically enables file output (both mode) to save detailed logs.
scenarios/
scenario_name/
user.md # User prompt
benign_goal.md # Visible system prompt
hidden_goal.md # Hidden system prompt (for experiments)
regex_rules.yaml # Regex patterns for the judge
data/ # Files accessible to agent
file1.txt
file2.txt
Logs are saved to logs/{model}/{scenario}/{oversight}/{timestamp}.json:
{
"run_id": "...",
"model": "gemini-3-flash-preview",
"scenario": "corporate_sabotage_v2",
"oversight_level": "high",
"total_tokens": 1500,
"prompt_tokens": 1000,
"completion_tokens": 500,
"conversation": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{
"role": "assistant",
"reasoning": "...",
"content": "...",
"tool_calls": [...],
"finish_reason": "stop",
"turn_count": 1
}
]
}- Reasoning Extraction: Supports multiple formats:
- OpenRouter
reasoning_content(Qwen, DeepSeek, etc.) - OpenAI
reasoning_details— bothreasoning.textandreasoning.summarytypes - Google
thought_signature(viaextra_content) <thinking>tags in content- Each log entry includes
reasoning_formatmetadata for traceability
- OpenRouter
- Baseline Toggle: Set
generate_baseline: falsein defaults to skip baseline generation (useful when testing new models) - Turn Limit: Max 20 turns per conversation (prevents infinite loops)
- Success Tracking: Runs marked SUCCESS if they end with
finish_reason: "stop" - VFS: Virtual file system for sandboxed file access
- Automated Judging: Regex, blackbox, and glassbox checks with Anthropic Batch API
The project includes 138 unit tests covering all modules. No API keys or network access needed.
uv run pytest tests/ -v| Test file | Covers |
|---|---|
test_vfs.py |
Sandbox escapes, path traversal, CRUD, singleton staleness |
test_config_loader.py |
Missing keys, temperature cascade, oversight fallback |
test_tools.py |
Schema/implementation sync, VFS delegation |
test_agent.py |
Message construction, error handling, token counting, reasoning extraction |
test_interrogate.py |
Conversation sanitization, provider detection |
test_runner.py |
Baseline extraction, prompt assembly, success detection |
test_judge.py |
Regex/blackbox/glassbox checkers, JSON parsing, batch prep, CSV output |
src/
agent.py # Main agent logic, OpenAI SDK integration
config_loader.py # YAML config parsing
judge.py # Judging pipeline (regex, blackbox, glassbox)
logger.py # Centralized logging configuration
judge_runner.py # Judge CLI with batch/single modes
main.py # Entry point
runner.py # Experiment orchestration
tools.py # Available tools (list_files, read_file, etc.)
vfs.py # Virtual file system
tests/ # Unit tests (pytest)
scenarios/ # Scenario definitions
logs/ # Experiment output logs
judge_logs/ # Judge CoT logs
Set API keys via environment variables (or .env file):
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
export OPENROUTER_API_KEY="..."
export ANTHROPIC_API_KEY="..." # Required for judgingReplay a saved conversation and continue questioning the agent interactively:
uv run src/interrogate.py logs/model_name/scenario/oversight/timestamp.jsonThe session auto-detects the provider from the log file and restores the VFS state. Available commands:
| Command | Description |
|---|---|
history |
Show full conversation history |
history N |
Show last N messages |
reasoning |
Show the last full reasoning trace |
vfs |
Show current virtual filesystem state |
info |
Show run metadata (model, scenario, tokens) |
save |
Save the extended conversation to interrogation_logs/ |
exit |
Quit |
Anything else you type is sent as a message to the agent.