feat: automate full VM lifecycle in correction flywheel script #186

Merged
abrichr merged 1 commit into main from feat/flywheel-full-automation on Mar 23, 2026

@abrichr abrichr commented Mar 23, 2026

Summary

  • Integrate all manual infrastructure steps (VM start, SSH wait, container start, iptables fix, tunnel setup, WAA probe, VM deallocate) into scripts/run_correction_flywheel.py so the full correction flywheel runs end-to-end with a single command
  • Add --manage-vm and --setup-tunnels flags for opt-in VM lifecycle and tunnel management (existing behavior preserved without these flags)
  • Add --baseline-model / --guided-model flags so the baseline attempt (Phase 1) and the guided retry (Phase 3) can use different planner models (e.g., a weaker model for the baseline to increase the chance of failure, and a stronger model for the guided retry)
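
The opt-in flags could be wired up roughly as follows. This is a minimal sketch, not the script's actual parser: the flag names come from this PR, while the `build_parser` helper, the defaults, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in this PR."""
    parser = argparse.ArgumentParser(description="Correction flywheel (illustrative sketch)")
    parser.add_argument("--task-config", required=True, help="Path to the task YAML")
    parser.add_argument("--demo-dir", default=None, help="Directory for demos")
    parser.add_argument("--output", default=None, help="Directory for results")
    parser.add_argument("--mock", action="store_true", help="Run without a VM")
    # Opt-in infrastructure management: both default to off, so existing
    # invocations that omit them behave as before.
    parser.add_argument("--manage-vm", action="store_true", help="Start and deallocate the VM")
    parser.add_argument("--setup-tunnels", action="store_true", help="Recreate SSH tunnels")
    # Optional per-phase planner models; None is assumed to mean
    # "fall back to the script's default planner".
    parser.add_argument("--baseline-model", default=None, help="Planner for the baseline attempt")
    parser.add_argument("--guided-model", default=None, help="Planner for the guided retry")
    return parser


if __name__ == "__main__":
    example = build_parser().parse_args(
        ["--task-config", "example_tasks/notepad-hello.yaml", "--manage-vm"]
    )
    print(example.manage_vm, example.setup_tunnels, example.baseline_model)
```

Using `store_true` for the two lifecycle flags is what keeps the change backward compatible in this sketch: omitting them leaves both values `False`.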

Changes

New infrastructure functions (inline, matching azure_vm.py / ssh_tunnel.py patterns):

  • start_vm() / get_vm_ip() / get_vm_state() / wait_for_ssh() / deallocate_vm() -- VM lifecycle via az CLI
  • start_container() -- idempotent: checks state, starts existing or runs new container with correct flags (--cap-add NET_ADMIN, storage mount, etc.)
  • apply_iptables_fix() -- exempts port 5050 from DNAT (idempotent, uses iptables -C check)
  • setup_tunnels() -- kills stale tunnels on target ports, creates fresh SSH tunnels (5001->5000, 5050->5051, 8006->8006)
  • setup_eval_proxy() -- socat bridge for evaluate server (systemd or manual)
  • wait_for_waa() -- polls /probe through tunnel with configurable timeout
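
As an illustration of the polling pattern behind wait_for_ssh() and wait_for_waa(), the sketch below retries a readiness check until it succeeds or a deadline passes. It is not the script's code: `wait_until_ready`, its parameters, and the default timeout and interval are hypothetical, and the probe is passed in as a callable so that the sketch makes no claim about how /probe is actually queried.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. Connection-level
    failures (OSError and its subclasses) count as "not ready yet".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Typical while the tunnel or server is still coming up.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    attempts = {"count": 0}

    def fake_probe() -> bool:
        # Stand-in for an HTTP GET against the probe endpoint.
        attempts["count"] += 1
        return attempts["count"] >= 3

    print(wait_until_ready(fake_probe, timeout_s=10, interval_s=0.01))
```

Measuring the deadline with a monotonic clock keeps the timeout unaffected by wall-clock adjustments, and injecting `sleep` and `clock` makes the loop easy to test without real delays.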

Design decisions:

  • VM deallocate runs in try/finally -- always executes even on error
  • Individual phase errors are caught and logged; Phase 4 report always generates with partial results
  • All operations are idempotent (safe to re-run)
  • --mock mode unchanged (no VM management needed)
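
The first two decisions can be sketched together. The code below is an assumption-laden illustration rather than the merged implementation: `run_flywheel`, the `(name, callable)` phase shape, and the result dictionary format are all hypothetical, and `start_vm` / `deallocate_vm` are passed in as plain callables.

```python
from typing import Any, Callable, Dict, List, Tuple

Phase = Tuple[str, Callable[[], Any]]


def run_flywheel(
    phases: List[Phase],
    start_vm: Callable[[], None],
    deallocate_vm: Callable[[], None],
    manage_vm: bool = False,
) -> Dict[str, Dict[str, Any]]:
    """Run each phase, recording failures instead of aborting the run."""
    results: Dict[str, Dict[str, Any]] = {}
    try:
        if manage_vm:
            start_vm()
        for name, phase in phases:
            try:
                results[name] = {"status": "ok", "value": phase()}
            except Exception as exc:
                # A failed phase is logged into the results so a report
                # can still be built from whatever completed.
                results[name] = {"status": "error", "error": str(exc)}
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl-C.
        if manage_vm:
            deallocate_vm()
    return results


if __name__ == "__main__":
    def failing_phase() -> None:
        raise RuntimeError("agent timed out")

    report = run_flywheel(
        phases=[("phase1", failing_phase), ("phase2", lambda: "demo recorded")],
        start_vm=lambda: print("start VM"),
        deallocate_vm=lambda: print("deallocate VM"),
        manage_vm=True,
    )
    print(report)
```

One caveat worth keeping in mind when testing: a `finally` block runs for exceptions and `KeyboardInterrupt`, but not when the process is terminated with SIGKILL, so the "kill mid-phase" check depends on how the script is stopped.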

Full automated command:

python scripts/run_correction_flywheel.py \
    --task-config example_tasks/clear-browsing-data-chrome.yaml \
    --demo-dir ./demos \
    --manage-vm \
    --setup-tunnels \
    --output flywheel_results/

Test plan

  • Run python scripts/run_correction_flywheel.py --task-config example_tasks/notepad-hello.yaml --mock --output /tmp/flywheel_mock and confirm mock mode still works
  • Verify --manage-vm starts and deallocates VM correctly
  • Verify --setup-tunnels kills existing tunnels and creates new ones
  • Verify --baseline-model gpt-4o-mini uses the weaker model for Phase 1
  • Verify VM deallocate runs on error (kill script mid-Phase-1, check VM state)
  • Verify partial report generation when a phase fails

🤖 Generated with Claude Code

Integrate all manual infrastructure steps so the flywheel runs
end-to-end deterministically with a single command:

  python scripts/run_correction_flywheel.py \
      --task-config example_tasks/clear-browsing-data-chrome.yaml \
      --demo-dir ./demos --manage-vm --setup-tunnels

New infrastructure functions (inline, matching azure_vm.py patterns):
- start_vm / get_vm_ip / get_vm_state / wait_for_ssh / deallocate_vm
- start_container (docker start or docker run with correct flags)
- apply_iptables_fix (exempt port 5050 from DNAT, idempotent)
- setup_tunnels (kill stale, create SSH tunnels for 5001/5050/8006)
- setup_eval_proxy (socat bridge for evaluate server)
- wait_for_waa (poll /probe through tunnel)

Design decisions:
- --manage-vm flag: opt-in VM start/deallocate lifecycle
- --setup-tunnels flag: opt-in tunnel setup with port cleanup
- --baseline-model / --guided-model: use different planner models
  for Phase 1 vs Phase 3 (e.g., gpt-4o-mini baseline to ensure failure)
- VM deallocate in try/finally (always runs, even on error)
- Phase errors are caught individually; report always generated
  with partial results
- All operations are idempotent (safe to re-run)
- --mock mode unchanged (no VM management needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr abrichr merged commit 748534b into main Mar 23, 2026
1 check passed