feat: automate full VM lifecycle in correction flywheel script #186
Merged
Integrate all manual infrastructure steps so the flywheel runs
end-to-end deterministically with a single command:
python scripts/run_correction_flywheel.py \
--task-config example_tasks/clear-browsing-data-chrome.yaml \
--demo-dir ./demos --manage-vm --setup-tunnels
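A minimal sketch of how the flags in this command might be declared with argparse. The flag names come from this PR; the defaults, help strings, and the `build_parser` name are assumptions for illustration, not the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in this PR."""
    parser = argparse.ArgumentParser(
        description="Run the correction flywheel end-to-end."
    )
    parser.add_argument("--task-config", required=True,
                        help="Path to the task YAML file.")
    parser.add_argument("--demo-dir", default="./demos",
                        help="Directory holding demos (default is an assumption).")
    # Both infrastructure flags are opt-in, so omitting them keeps
    # the pre-existing behavior.
    parser.add_argument("--manage-vm", action="store_true",
                        help="Start the VM before the run and deallocate it after.")
    parser.add_argument("--setup-tunnels", action="store_true",
                        help="Kill stale tunnels and create fresh SSH tunnels.")
    parser.add_argument("--baseline-model", default=None,
                        help="Planner model for the baseline phase.")
    parser.add_argument("--guided-model", default=None,
                        help="Planner model for the guided phase.")
    parser.add_argument("--mock", action="store_true",
                        help="Run without any VM management.")
    parser.add_argument("--output", default=None,
                        help="Directory for the generated report.")
    return parser
```

Using `store_true` for `--manage-vm` and `--setup-tunnels` means a run that passes neither flag sees both as `False`, which is one way to keep the opt-in behavior described here.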
New infrastructure functions (inline, matching azure_vm.py patterns):
- start_vm / get_vm_ip / get_vm_state / wait_for_ssh / deallocate_vm
- start_container (docker start or docker run with correct flags)
- apply_iptables_fix (exempt port 5050 from DNAT, idempotent)
- setup_tunnels (kill stale, create SSH tunnels for 5001/5050/8006)
- setup_eval_proxy (socat bridge for evaluate server)
- wait_for_waa (poll /probe through tunnel)
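The two wait helpers above (wait_for_ssh, wait_for_waa) are both poll-until-ready loops. The sketch below shows one way such a loop could look. `wait_until` and `probe_ok` are hypothetical names, and the timeout and interval defaults are assumptions; the real functions in the script may differ.

```python
import time
import urllib.request
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except OSError:
            # Connection refused, reset, or timed out: the service is
            # not up yet, so keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


def probe_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200.

    urllib raises URLError (a subclass of OSError) on connection
    failures and on 4xx/5xx responses, which wait_until treats as
    "not ready yet".
    """
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status == 200
```

A call such as `wait_until(lambda: probe_ok("http://localhost:5001/probe"))` would then block until the endpoint answers. The URL is an assumption based on the tunnel ports listed above. Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.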
Design decisions:
- --manage-vm flag: opt-in VM start/deallocate lifecycle
- --setup-tunnels flag: opt-in tunnel setup with port cleanup
- --baseline-model / --guided-model: use different planner models
for Phase 1 vs Phase 3 (e.g., gpt-4o-mini baseline to ensure failure)
- VM deallocate in try/finally (always runs, even on error)
- Phase errors are caught individually; report always generated
with partial results
- All operations are idempotent (safe to re-run)
- --mock mode unchanged (no VM management needed)
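The try/finally deallocation and per-phase error handling described above could be structured roughly as follows. This is a sketch under assumptions: `run_flywheel`, the phase tuple shape, and the report layout are hypothetical, and whether later phases still run after an earlier failure is a choice made here, not something the PR states.

```python
from typing import Any, Callable, Dict, List, Tuple

Phase = Tuple[str, Callable[[], Any]]


def run_flywheel(
    phases: List[Phase],
    manage_vm: bool,
    start_vm: Callable[[], None],
    deallocate_vm: Callable[[], None],
) -> Dict[str, Dict[str, Any]]:
    """Run each phase, record failures, and always deallocate the VM."""
    report: Dict[str, Dict[str, Any]] = {}
    try:
        if manage_vm:
            start_vm()
        for name, phase in phases:
            # Each phase is guarded on its own, so one failure still
            # leaves a partial report for the others.
            try:
                report[name] = {"status": "ok", "result": phase()}
            except Exception as exc:
                report[name] = {"status": "error", "error": str(exc)}
    finally:
        # Runs on success, on phase errors, and if start_vm() itself
        # raised. This relies on deallocation being idempotent.
        if manage_vm:
            deallocate_vm()
    return report
```

Because the VM calls are passed in as callables, the same control flow can be exercised in a mock run with `manage_vm=False`, in which case neither VM function is invoked.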
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Integrates the manual infrastructure steps into scripts/run_correction_flywheel.py so the full correction flywheel runs end-to-end with a single command
- Adds --manage-vm and --setup-tunnels flags for opt-in VM lifecycle and tunnel management (existing behavior preserved without these flags)
- Adds --baseline-model / --guided-model flags to use different planner models for attempt vs retry phases (e.g., weaker model for baseline to increase chance of failure, stronger model with guidance)
Changes
New infrastructure functions (inline, matching azure_vm.py / ssh_tunnel.py patterns):
- start_vm() / get_vm_ip() / get_vm_state() / wait_for_ssh() / deallocate_vm() -- VM lifecycle via az CLI
- start_container() -- idempotent: checks state, starts existing or runs new container with correct flags (--cap-add NET_ADMIN, storage mount, etc.)
- apply_iptables_fix() -- exempts port 5050 from DNAT (idempotent, uses iptables -C check)
- setup_tunnels() -- kills stale tunnels on target ports, creates fresh SSH tunnels (5001->5000, 5050->5051, 8006->8006)
- setup_eval_proxy() -- socat bridge for evaluate server (systemd or manual)
- wait_for_waa() -- polls /probe through tunnel with configurable timeout
Design decisions:
- VM deallocate in try/finally -- always executes even on error
- --mock mode unchanged (no VM management needed)
Full automated command:
python scripts/run_correction_flywheel.py \
--task-config example_tasks/clear-browsing-data-chrome.yaml \
--demo-dir ./demos \
--manage-vm \
--setup-tunnels \
--output flywheel_results/
Test plan
- python scripts/run_correction_flywheel.py --task-config example_tasks/notepad-hello.yaml --mock --output /tmp/flywheel_mock -- mock mode still works
- --manage-vm starts and deallocates VM correctly
- --setup-tunnels kills existing tunnels and creates new ones
- --baseline-model gpt-4o-mini uses the weaker model for Phase 1
🤖 Generated with Claude Code