Release v0.4.3 - Inspect AI Sandbox Evaluation · JackHopkins/factorio-learning-environment

Inspect AI Sandbox Evaluation

New colocated Docker sandbox mode in which the Factorio server and the FLE Python environment run in a single container managed by Inspect's native sandbox() API. No external cluster management is needed.

# Build the sandbox image (auto-builds on first use)
fle sandbox build

# Run evaluation in sandbox mode
fle inspect-eval --sandbox --env-id iron_ore_throughput --model openai/gpt-4o
fle inspect-eval --sandbox --env-id open_play_production --model openai/gpt-4o --scenario default_lab_scenario

Architecture: A persistent HTTP bridge daemon inside the container maintains FactorioInstance state and serves requests over a Unix domain socket. The host-side solver communicates via sandbox().exec() calls to a thin CLI client.
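The bridge pattern described above can be sketched with the Python standard library alone. This is a minimal illustration, not the code in bridge_service.py or bridge_client.py: the handler, the `/execute` endpoint, the call counter standing in for FactorioInstance state, and the helper names are all assumptions made for the example. The host-side sandbox().exec() call that would invoke the client is not shown.

```python
import http.client
import json
import socket
import socketserver
import threading
from http.server import BaseHTTPRequestHandler


class BridgeHandler(BaseHTTPRequestHandler):
    # Hypothetical long-lived state, standing in for a FactorioInstance.
    state = {"calls": 0}

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        BridgeHandler.state["calls"] += 1
        body = json.dumps(
            {"echo": payload, "calls": BridgeHandler.state["calls"]}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Unix-socket peers have no (host, port) address, so skip the
        # default access log, which expects one.
        pass


def start_bridge(socket_path):
    """Serve BridgeHandler on a Unix domain socket in a background thread."""
    server = socketserver.UnixStreamServer(socket_path, BridgeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a socket path instead of host:port."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def call_bridge(socket_path, payload):
    """Thin client: POST one JSON request and return the decoded reply."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(
            "POST",
            "/execute",  # hypothetical endpoint name
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

Because the daemon outlives each client call, state such as the counter here persists between requests, which is the property the architecture relies on: each short-lived CLI invocation talks to the same long-lived process.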

New files: Dockerfile, compose.yaml, supervisord.conf, bridge_service.py, bridge_client.py, sandbox_solver.py, sandbox_eval_set.py

Agent Namespace Protection

Agent-defined functions and variables are now checked against the FLE namespace before being created. Attempting to shadow a built-in tool (e.g., def move_to(...) or get_entities = []) raises a clear NameError instead of silently breaking the evaluation.
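One way such a check could work is to scan the agent's code for definitions and assignments before executing it. The sketch below is an assumption about the general approach, not FLE's implementation: `PROTECTED_NAMES`, `find_shadowed_names`, and `check_agent_code` are hypothetical names, and the protected set contains only the two tools mentioned above.

```python
import ast

# Hypothetical set of protected tool names; the real FLE namespace is larger.
PROTECTED_NAMES = {"move_to", "get_entities"}


def find_shadowed_names(source, protected=PROTECTED_NAMES):
    """Return protected names that the source would define or assign."""
    shadowed = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            candidate = node.name
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            candidate = node.id
        else:
            continue
        if candidate in protected and candidate not in shadowed:
            shadowed.append(candidate)
    return shadowed


def check_agent_code(source):
    """Raise NameError if the source would shadow a protected tool."""
    shadowed = find_shadowed_names(source)
    if shadowed:
        names = ", ".join(sorted(shadowed))
        raise NameError(f"Cannot redefine built-in tool(s): {names}")
```

Calling `check_agent_code("get_entities = []")` raises NameError, while code that only reads or calls a tool passes. Note that this simplified walk also flags assignments inside function bodies, which is stricter than a check limited to the top-level namespace.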

Fix print() Capture Inside Agent-Defined Functions

print() calls inside agent-defined functions are now correctly captured and returned as STDOUT. Previously, prints inside functions (especially nested helpers, try/except blocks, and wrapper patterns like def safe(fn, ...)) were silently lost.

Root causes fixed:

print() is now routed to namespace.log via globals injection in SerializableFunction.reconstruct()
Removed AST print→log rewriting inside function bodies (caused infinite recursion with agent-defined log helpers)
Fixed function redefinition bug where redefining a function used stale bytecode from the previous definition

12 new unit tests covering prints in simple functions, nested functions, try/except blocks, safe wrappers, and cross-eval persistence.
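The globals-injection idea can be illustrated in isolation. This is a hedged sketch rather than the body of SerializableFunction.reconstruct(): `rebind_print` and `log` are hypothetical stand-ins, with `log` playing the role of namespace.log. The function is rebuilt from its existing code object with a globals dictionary in which `print` points at the logger, so no AST rewriting is involved.

```python
import types


def rebind_print(func, log):
    """Return a copy of func whose global `print` resolves to `log`."""
    injected_globals = dict(func.__globals__)
    injected_globals["print"] = log
    return types.FunctionType(
        func.__code__,
        injected_globals,
        func.__name__,
        func.__defaults__,
        func.__closure__,
    )


captured = []


def log(*args):
    # Stand-in for namespace.log: record the message instead of printing it.
    captured.append(" ".join(str(a) for a in args))


def safe(fn, *args):
    # Agent-style wrapper with a nested helper and a try/except block.
    def report(message):
        print("safe:", message)

    try:
        return fn(*args)
    except Exception as exc:
        report(exc)


def broken():
    raise ValueError("boom")


rebind_print(safe, log)(broken)
# captured is now ["safe: boom"]
```

The nested helper `report` is created while the rebuilt `safe` runs, so it inherits the injected globals and its print() call reaches `log` as well. That is why a single injection point can cover nested helpers and wrapper patterns.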

Package Reorganisation

fle/eval/inspect_integration/ → fle/eval/inspect/integration/
New fle/eval/inspect/sandbox/ for sandbox-specific files
Shared fle/eval/inspect/eval_set.py with DRY task factory functions used by both integration and sandbox eval sets

Other Changes

Path resolution env vars: FLE_MODS_DIR and FLE_TOOLS_DIR environment variables override the mods and tools paths for containerised deployments
--scenario flag: Configure which Factorio scenario to load via CLI
Prompt improvements: Policy-writing tips moved from task target to system prompt
Observation formatting: 2-space indentation instead of tabs in tree formatter for consistent rendering
STDOUT capture: Increased output limit from 64 to 512 lines; fixed truncation so that the first lines are kept
Auto-start cluster: Integration mode auto-starts Factorio cluster when no servers are reachable
Vision fallback: Graceful degradation when sprites aren't installed
Absolute paths: fle inspect-eval subprocess commands now use absolute paths (works from any directory)
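As an illustration of the path override pattern listed above, a resolver can prefer the environment variable and fall back to a packaged default. This is only a sketch: `resolve_dir` and the default path in the usage note are hypothetical, and only the variable names FLE_MODS_DIR and FLE_TOOLS_DIR come from the release notes.

```python
import os
from pathlib import Path


def resolve_dir(env_var, default):
    """Prefer the directory named by env_var, else fall back to default."""
    override = os.environ.get(env_var)
    # An unset or empty variable means "use the default location".
    return Path(override).expanduser() if override else Path(default)
```

For example, `resolve_dir("FLE_MODS_DIR", "/opt/fle/mods")` returns the container-specific path when the variable is set and the assumed default otherwise.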
