Inspect AI Sandbox Evaluation
New colocated Docker sandbox mode where Factorio server and FLE Python environment run in a single container managed by Inspect's native sandbox() API. No external cluster management needed.
# Build the sandbox image (auto-builds on first use)
fle sandbox build
# Run evaluation in sandbox mode
fle inspect-eval --sandbox --env-id iron_ore_throughput --model openai/gpt-4o
fle inspect-eval --sandbox --env-id open_play_production --model openai/gpt-4o --scenario default_lab_scenario

Architecture: A persistent HTTP bridge daemon inside the container maintains FactorioInstance state and serves requests over a Unix domain socket. The host-side solver communicates via sandbox().exec() calls to a thin CLI client.
New files: Dockerfile, compose.yaml, supervisord.conf, bridge_service.py, bridge_client.py, sandbox_solver.py, sandbox_eval_set.py
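
To illustrate the bridge architecture described above, here is a minimal, standard-library-only sketch of an HTTP daemon that keeps state across requests and is reached over a Unix domain socket. It is not the code in bridge_service.py or bridge_client.py: `BridgeState`, `UnixHTTPServer`, `UnixHTTPConnection`, `call_bridge`, the `/execute` path, and the JSON payload shape are all hypothetical names chosen for the example. The `sandbox().exec()` hop from the host is omitted, and `call_bridge` only stands in for what a thin CLI client might do inside the container.

```python
import http.client
import json
import os
import socket
import socketserver
import tempfile
import threading
from http.server import BaseHTTPRequestHandler


class BridgeState:
    """Hypothetical stand-in for the long-lived FactorioInstance state."""

    def __init__(self):
        self.steps = 0

    def execute(self, code):
        self.steps += 1  # survives across requests because the daemon persists
        return {"step": self.steps, "echo": code}


class BridgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(self.server.state.execute(payload.get("code", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # a Unix socket peer has no host/port worth logging


class UnixHTTPServer(socketserver.UnixStreamServer):
    """Serves requests one at a time, which serialises access to the state."""

    def __init__(self, socket_path):
        super().__init__(socket_path, BridgeHandler)
        self.state = BridgeState()


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a socket file instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def call_bridge(socket_path, code):
    """Roughly what a thin client could do on each invocation."""
    conn = UnixHTTPConnection(socket_path)
    try:
        body = json.dumps({"code": code}).encode()
        conn.request("POST", "/execute", body=body,
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def demo():
    socket_path = os.path.join(tempfile.mkdtemp(), "bridge.sock")
    server = UnixHTTPServer(socket_path)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        first = call_bridge(socket_path, "move_to(1, 2)")
        second = call_bridge(socket_path, "get_entities()")
    finally:
        server.shutdown()
        server.server_close()
        os.unlink(socket_path)
    return first, second


if __name__ == "__main__":
    print(demo())
```

Each client call opens a fresh connection, yet the step counter keeps increasing, which is the property a persistent daemon provides over re-creating the environment on every exec call.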
Agent Namespace Protection
Agent-defined functions and variables are now checked against the FLE namespace before being created. Attempting to shadow a built-in tool (e.g., def move_to(...) or get_entities = []) raises a clear NameError instead of silently breaking the evaluation.
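
One way such a check could work is to scan the agent's code for definitions and assignments before executing it. The sketch below is an assumption about the approach, not FLE's implementation: `RESERVED` is a hypothetical two-name subset of the tool namespace, and `check_no_shadowing` is an illustrative helper.

```python
import ast

# Hypothetical subset of the protected FLE tool names.
RESERVED = frozenset({"move_to", "get_entities"})


def check_no_shadowing(source, reserved=RESERVED):
    """Raise NameError if `source` defines or assigns a reserved name.

    This walk is conservative: it also flags local variables inside
    functions that reuse a reserved name.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            name = node.name
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name in reserved:
            raise NameError(
                f"cannot redefine built-in tool '{name}' (line {node.lineno})"
            )


if __name__ == "__main__":
    check_no_shadowing("position = move_to(1, 2)")  # calling a tool is fine
    for bad in ("def move_to(x):\n    pass", "get_entities = []"):
        try:
            check_no_shadowing(bad)
        except NameError as exc:
            print("rejected:", exc)
```

Calling a tool only reads the name, so it passes the check; only `def`, `class`, and assignment targets are rejected.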
Fix print() Capture Inside Agent-Defined Functions
print() calls inside agent-defined functions are now correctly captured and returned as STDOUT. Previously, prints inside functions (especially nested helpers, try/except blocks, and wrapper patterns like def safe(fn, ...)) were silently lost.
Root causes fixed:
- print() is now routed to namespace.log via globals injection in SerializableFunction.reconstruct()
- Removed AST print→log rewriting inside function bodies (caused infinite recursion with agent-defined log helpers)
- Fixed function redefinition bug where redefining a function used stale bytecode from the previous definition
12 new unit tests covering prints in simple functions, nested functions, try/except blocks, safe wrappers, and cross-eval persistence.
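
The globals-injection idea can be sketched as follows. This is an illustration under assumptions, not the body of SerializableFunction.reconstruct(): the `Namespace` class and the standalone `reconstruct` function are hypothetical. The sketch relies on a documented Python behaviour: a function looks up `print` in its globals before falling back to builtins, and nested functions inherit the globals of the function that defines them.

```python
import types


class Namespace:
    """Hypothetical stand-in for the FLE namespace that collects output."""

    def __init__(self):
        self.captured = []

    def log(self, *args, sep=" ", **_ignored):
        self.captured.append(sep.join(str(arg) for arg in args))


def reconstruct(func, namespace):
    """Rebuild `func` with a globals dict in which print is namespace.log."""
    injected = dict(func.__globals__)
    injected["print"] = namespace.log  # found before the builtin print
    rebuilt = types.FunctionType(
        func.__code__, injected, func.__name__,
        func.__defaults__, func.__closure__,
    )
    rebuilt.__kwdefaults__ = func.__kwdefaults__
    return rebuilt


def agent_code(items):
    def helper(item):  # nested helper shares the injected globals
        print("item", item)

    for item in items:
        try:
            helper(item)
            if item < 0:
                raise ValueError(item)
        except ValueError as exc:
            print("bad:", exc)


if __name__ == "__main__":
    ns = Namespace()
    reconstruct(agent_code, ns)([1, -2])
    print(ns.captured)
```

Because no source rewriting is involved, an agent that defines its own `log` helper is unaffected. Note that in this sketch, other module-level functions called by `agent_code` would still use their original globals.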
Package Reorganisation
- fle/eval/inspect_integration/ → fle/eval/inspect/integration/
- New fle/eval/inspect/sandbox/ for sandbox-specific files
- Shared fle/eval/inspect/eval_set.py with DRY task factory functions used by both integration and sandbox eval sets
Other Changes
- Path resolution env vars: FLE_MODS_DIR / FLE_TOOLS_DIR environment variable overrides for containerised deployments
- --scenario flag: Configure which Factorio scenario to load via CLI
- Prompt improvements: Policy-writing tips moved from task target to system prompt
- Observation formatting: 2-space indentation instead of tabs in tree formatter for consistent rendering
- STDOUT capture: Increased output limit from 64 to 512 lines; fixed truncation to keep first lines
- Auto-start cluster: Integration mode auto-starts Factorio cluster when no servers are reachable
- Vision fallback: Graceful degradation when sprites aren't installed
- Absolute paths: fle inspect-eval subprocess commands now use absolute paths (works from any directory)
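
As an illustration of the path-resolution overrides listed above, the sketch below shows one common pattern for letting an environment variable take precedence over a default directory. Only the variable names FLE_MODS_DIR and FLE_TOOLS_DIR come from these notes; `resolve_dir` and the default paths are hypothetical, and FLE's actual lookup order may differ.

```python
import os
from pathlib import Path


def resolve_dir(env_var, default, environ=None):
    """Return the directory from `env_var` if set and non-empty, else `default`.

    The result is made absolute so it does not depend on the working directory.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(env_var, "").strip()
    chosen = Path(override).expanduser() if override else Path(default)
    return chosen.resolve()


if __name__ == "__main__":
    # Hypothetical defaults, for illustration only.
    print(resolve_dir("FLE_MODS_DIR", "mods"))
    print(resolve_dir("FLE_TOOLS_DIR", "tools"))
```

In a container, setting the variable in the image or compose file would then redirect the lookup without code changes.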