feat: automate full VM lifecycle in correction flywheel script by abrichr · Pull Request #186 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-23T00:53:24Z

Summary

Integrate all manual infrastructure steps (VM start, SSH wait, container start, iptables fix, tunnel setup, WAA probe, VM deallocate) into scripts/run_correction_flywheel.py so the full correction flywheel runs end-to-end with a single command
Add --manage-vm and --setup-tunnels flags for opt-in VM lifecycle and tunnel management (existing behavior preserved without these flags)
Add --baseline-model / --guided-model flags to use different planner models for attempt vs retry phases (e.g., weaker model for baseline to increase chance of failure, stronger model with guidance)
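As a rough illustration of how these opt-in flags could be declared, here is a minimal argparse sketch. The flag names come from this PR; the defaults, the `resolve_models` helper, and the fallback to a shared default model are assumptions for illustration and may not match the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in the summary (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Correction flywheel (sketch)")
    parser.add_argument("--task-config", required=True)
    parser.add_argument("--demo-dir", default=None)
    parser.add_argument("--output", default=None)
    parser.add_argument("--mock", action="store_true")
    # Opt-in infrastructure management: both default to off, so existing
    # invocations keep their previous behavior.
    parser.add_argument("--manage-vm", action="store_true")
    parser.add_argument("--setup-tunnels", action="store_true")
    # Assumption: when unset, each phase falls back to a shared default model.
    parser.add_argument("--baseline-model", default=None)
    parser.add_argument("--guided-model", default=None)
    return parser


def resolve_models(args: argparse.Namespace, default_model: str) -> tuple:
    """Return (baseline, guided) planner models; the fallback rule is hypothetical."""
    return (
        args.baseline_model or default_model,
        args.guided_model or default_model,
    )
```

Using `store_true` flags that default to off is one way to keep the existing behavior unchanged when neither flag is passed.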

Changes

New infrastructure functions (inline, matching azure_vm.py / ssh_tunnel.py patterns):

start_vm() / get_vm_ip() / get_vm_state() / wait_for_ssh() / deallocate_vm() -- VM lifecycle via az CLI
start_container() -- idempotent: checks state, starts existing or runs new container with correct flags (--cap-add NET_ADMIN, storage mount, etc.)
apply_iptables_fix() -- exempts port 5050 from DNAT (idempotent, uses iptables -C check)
setup_tunnels() -- kills stale tunnels on target ports, creates fresh SSH tunnels (5001->5000, 5050->5051, 8006->8006)
setup_eval_proxy() -- socat bridge for evaluate server (systemd or manual)
wait_for_waa() -- polls /probe through tunnel with configurable timeout
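The check-then-apply pattern behind apply_iptables_fix() can be sketched as follows. This is not the PR's implementation: the rule contents in `RULE`, the `ensure_nat_rule` name, and running the command locally (rather than over SSH on the VM) are assumptions. Only the `-C` (check) and `-I` (insert) iptables options are taken as given.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical rule: the PR does not show the exact chain or match it uses.
RULE = ["PREROUTING", "-p", "tcp", "--dport", "5050", "-j", "RETURN"]


def run_returncode(argv: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(argv), capture_output=True).returncode


def ensure_nat_rule(
    rule: Sequence[str],
    run: Callable[[Sequence[str]], int] = run_returncode,
) -> bool:
    """Insert a nat-table rule only if it is missing; return True if inserted."""
    base = ["iptables", "-t", "nat"]
    # `iptables -C` exits 0 when a matching rule already exists.
    if run(base + ["-C", *rule]) == 0:
        return False
    if run(base + ["-I", *rule]) != 0:
        raise RuntimeError("failed to insert iptables rule")
    return True
```

Because the check runs before the insert, calling the function twice leaves a single rule in place, which is what makes a re-run safe. Passing the runner as a parameter also lets the logic be exercised without root or a real iptables binary.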

Design decisions:

VM deallocate runs in try/finally -- always executes even on error
Individual phase errors are caught and logged; Phase 4 report always generates with partial results
All operations are idempotent (safe to re-run)
--mock mode unchanged (no VM management needed)
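The first two decisions above can be sketched roughly as below. The `run_flywheel` function, its parameters, and the shape of the results dictionary are hypothetical; the sketch only shows the control flow of per-phase error capture wrapped in a try/finally that deallocates the VM.

```python
from typing import Callable, Dict, List, Tuple


def run_flywheel(
    phases: List[Tuple[str, Callable[[], object]]],
    manage_vm: bool,
    start_vm: Callable[[], None],
    deallocate_vm: Callable[[], None],
) -> Dict[str, dict]:
    """Run phases in order, recording failures instead of aborting."""
    results: Dict[str, dict] = {}
    try:
        if manage_vm:
            start_vm()
        for name, phase in phases:
            try:
                results[name] = {"ok": True, "value": phase()}
            except Exception as exc:
                # Record the failure and continue so a report can still be
                # built from whatever phases did complete.
                results[name] = {"ok": False, "error": str(exc)}
    finally:
        # Runs on normal exit, on exceptions, and on KeyboardInterrupt.
        if manage_vm:
            deallocate_vm()
    return results
```

Note that `finally` covers an interrupted run (Ctrl-C) but not a process killed with SIGKILL, so a hard kill could still leave the VM running.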

Full automated command:

python scripts/run_correction_flywheel.py \
    --task-config example_tasks/clear-browsing-data-chrome.yaml \
    --demo-dir ./demos \
    --manage-vm \
    --setup-tunnels \
    --output flywheel_results/

Test plan

python scripts/run_correction_flywheel.py --task-config example_tasks/notepad-hello.yaml --mock --output /tmp/flywheel_mock -- mock mode still works
Verify --manage-vm starts and deallocates VM correctly
Verify --setup-tunnels kills existing tunnels and creates new ones
Verify --baseline-model gpt-4o-mini uses the weaker model for Phase 1
Verify VM deallocate runs on error (kill script mid-Phase-1, check VM state)
Verify partial report generation when a phase fails

🤖 Generated with Claude Code

Integrate all manual infrastructure steps so the flywheel runs end-to-end deterministically with a single command:

python scripts/run_correction_flywheel.py \
    --task-config example_tasks/clear-browsing-data-chrome.yaml \
    --demo-dir ./demos --manage-vm --setup-tunnels

New infrastructure functions (inline, matching azure_vm.py patterns):
- start_vm / get_vm_ip / get_vm_state / wait_for_ssh / deallocate_vm
- start_container (docker start or docker run with correct flags)
- apply_iptables_fix (exempt port 5050 from DNAT, idempotent)
- setup_tunnels (kill stale, create SSH tunnels for 5001/5050/8006)
- setup_eval_proxy (socat bridge for evaluate server)
- wait_for_waa (poll /probe through tunnel)

Design decisions:
- --manage-vm flag: opt-in VM start/deallocate lifecycle
- --setup-tunnels flag: opt-in tunnel setup with port cleanup
- --baseline-model / --guided-model: use different planner models for Phase 1 vs Phase 3 (e.g., gpt-4o-mini baseline to ensure failure)
- VM deallocate in try/finally (always runs, even on error)
- Phase errors are caught individually; report always generated with partial results
- All operations are idempotent (safe to re-run)
- --mock mode unchanged (no VM management needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abrichr merged commit 748534b into main Mar 23, 2026
1 check passed

abrichr merged 1 commit into main from feat/flywheel-full-automation
