v0.4.3 - Inspect AI Sandbox Evaluation

@JackHopkins released this 06 Apr 18:50

Inspect AI Sandbox Evaluation

Adds a colocated Docker sandbox mode in which the Factorio server and the FLE Python environment run in a single container managed by Inspect's native sandbox() API. No external cluster management is needed.

# Build the sandbox image (auto-builds on first use)
fle sandbox build

# Run evaluation in sandbox mode
fle inspect-eval --sandbox --env-id iron_ore_throughput --model openai/gpt-4o
fle inspect-eval --sandbox --env-id open_play_production --model openai/gpt-4o --scenario default_lab_scenario

Architecture: A persistent HTTP bridge daemon inside the container maintains FactorioInstance state and serves requests over a Unix domain socket. The host-side solver communicates via sandbox().exec() calls to a thin CLI client.
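
The bridge pattern described above can be illustrated with a minimal sketch using only the Python standard library. This is not the code in bridge_service.py or bridge_client.py: the names BridgeServer, BridgeHandler, UnixHTTPConnection and call_bridge are hypothetical, and a plain dict stands in for the persistent FactorioInstance state. It only shows how an HTTP server can keep state across requests while listening on a Unix domain socket, and how a thin client can reach it.

```python
import http.client
import json
import os
import socket
import socketserver
import tempfile
import threading
from http.server import BaseHTTPRequestHandler


class BridgeHandler(BaseHTTPRequestHandler):
    """Answers JSON POST requests using state held on the server object."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # The state outlives individual requests, like a long-lived game instance.
        self.server.state["calls"] += 1
        body = json.dumps(
            {"echo": payload, "calls": self.server.state["calls"]}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Unix socket peers have no (host, port) address, so skip default logging.
        pass


class BridgeServer(socketserver.UnixStreamServer):
    """HTTP-over-Unix-socket server; `state` is a hypothetical stand-in."""

    def __init__(self, socket_path):
        super().__init__(socket_path, BridgeHandler)
        self.state = {"calls": 0}


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def call_bridge(socket_path, payload):
    """Thin client: send one JSON request and return the decoded reply."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(
            "POST",
            "/",
            body=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def start_bridge(socket_path):
    """Start the bridge in a background thread and return the server."""
    server = BridgeServer(socket_path)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In this sketch each call_bridge invocation opens a fresh connection, which mirrors a short-lived CLI client talking to a long-lived daemon; the state survives because it lives in the server process rather than in the client.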

New files: Dockerfile, compose.yaml, supervisord.conf, bridge_service.py, bridge_client.py, sandbox_solver.py, sandbox_eval_set.py

Agent Namespace Protection

Agent-defined functions and variables are now checked against the FLE namespace before being created. Attempting to shadow a built-in tool (e.g., def move_to(...) or get_entities = []) raises a clear NameError instead of silently breaking the evaluation.
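
One way such a check could work is to scan the agent's code with the ast module before executing it. The sketch below is an illustration under assumptions, not the FLE implementation: PROTECTED is a hypothetical two-name subset of the namespace, and find_shadowed_names and check_agent_code are made-up helper names. For simplicity it flags any binding of a protected name anywhere in the code, including inside function bodies.

```python
import ast

# Hypothetical subset of protected tool names, taken from the examples above.
PROTECTED = frozenset({"move_to", "get_entities"})


def find_shadowed_names(source, protected=PROTECTED):
    """Return the sorted protected names that `source` would define or assign."""
    shadowed = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # def move_to(...): / class move_to: ...
            if node.name in protected:
                shadowed.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # get_entities = [] (also covers loop targets and augmented assignment)
            if node.id in protected:
                shadowed.add(node.id)
    return sorted(shadowed)


def check_agent_code(source, protected=PROTECTED):
    """Raise NameError if the code would shadow a protected name."""
    shadowed = find_shadowed_names(source, protected)
    if shadowed:
        raise NameError(
            "Cannot redefine built-in tool(s): " + ", ".join(shadowed)
        )
```

Calling a tool, as in x = move_to(1), only reads the name and passes the check; only definitions and assignments are rejected.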

Fix print() Capture Inside Agent-Defined Functions

print() calls inside agent-defined functions are now correctly captured and returned as STDOUT. Previously, prints inside functions (especially nested helpers, try/except blocks, and wrapper patterns like def safe(fn, ...)) were silently lost.

Root causes fixed:

  • print() is now routed to namespace.log via globals injection in SerializableFunction.reconstruct()
  • Removed AST print → log rewriting inside function bodies (caused infinite recursion with agent-defined log helpers)
  • Fixed function redefinition bug where redefining a function used stale bytecode from the previous definition
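
The globals-injection idea in the first bullet can be sketched as follows. This is a simplified illustration, not the body of SerializableFunction.reconstruct(): the Namespace class and rebind_with_log helper are hypothetical. It relies on the fact that Python resolves print in a function's globals before falling back to builtins, so rebuilding the function with a globals dict that contains a print entry redirects every call, including calls made from nested helpers.

```python
import types


class Namespace:
    """Hypothetical stand-in for the FLE namespace; collects logged lines."""

    def __init__(self):
        self.lines = []

    def log(self, *args, sep=" ", **_ignored):
        # Accept print-style keyword arguments such as end= and file=, and ignore them.
        self.lines.append(sep.join(str(a) for a in args))


def rebind_with_log(fn, namespace):
    """Rebuild `fn` so that `print` inside it resolves to `namespace.log`."""
    injected = dict(fn.__globals__)  # snapshot copy; the original globals stay untouched
    injected["print"] = namespace.log
    rebound = types.FunctionType(
        fn.__code__, injected, fn.__name__, fn.__defaults__, fn.__closure__
    )
    rebound.__kwdefaults__ = fn.__kwdefaults__
    return rebound
```

Functions defined inside the rebound function are created with the same injected globals at call time, which is why prints in nested helpers and try/except blocks are captured without any AST rewriting.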

12 new unit tests covering prints in simple functions, nested functions, try/except blocks, safe wrappers, and cross-eval persistence.

Package Reorganisation

  • fle/eval/inspect_integration/ → fle/eval/inspect/integration/
  • New fle/eval/inspect/sandbox/ for sandbox-specific files
  • Shared fle/eval/inspect/eval_set.py with DRY task factory functions used by both integration and sandbox eval sets

Other Changes

  • Path resolution env vars: FLE_MODS_DIR / FLE_TOOLS_DIR environment variable overrides for containerised deployments
  • --scenario flag: Configure which Factorio scenario to load via CLI
  • Prompt improvements: Policy-writing tips moved from task target to system prompt
  • Observation formatting: 2-space indentation instead of tabs in tree formatter for consistent rendering
  • STDOUT capture: Increased output limit from 64 to 512 lines; truncation now keeps the first lines
  • Auto-start cluster: Integration mode auto-starts Factorio cluster when no servers are reachable
  • Vision fallback: Graceful degradation when sprites aren't installed
  • Absolute paths: fle inspect-eval subprocess commands now use absolute paths (works from any directory)
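
As an illustration of the STDOUT capture change listed above, a keep-first truncation could look like the sketch below. The function name truncate_stdout, the MAX_STDOUT_LINES constant name and the wording of the trailing marker are assumptions for the example; only the 512-line limit and the keep-first behaviour come from the notes.

```python
MAX_STDOUT_LINES = 512  # limit stated in the release notes


def truncate_stdout(text, max_lines=MAX_STDOUT_LINES):
    """Keep the first `max_lines` lines and append a marker for the rest."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    dropped = len(lines) - max_lines
    kept = lines[:max_lines]
    # Hypothetical marker format so the reader knows output was cut.
    kept.append(f"... [{dropped} more lines truncated]")
    return "\n".join(kept)
```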