feat: add interactive recording workflow with auto-infrastructure and VM IP detection#57

Merged
abrichr merged 12 commits into main from feat/vm-ip-autodetect-screen-stability
Mar 2, 2026

@abrichr abrichr commented Mar 1, 2026

Summary

Adds VM IP auto-detection, screen stability detection, interactive step correction during recording, and automatic infrastructure deployment (--auto flag) for the WAA demo recording workflow.

Commits

  1. 5373c19 fix: replace LibreOffice screenshot with full desktop view

    • Swaps the README screenshot from a release-hosted crop to a local screenshots/waa_libreoffice_desktop.png showing the full macOS → Chrome → noVNC → Windows 11 → LibreOffice stack
  2. 6d0a3fb feat: add VM IP auto-detection and screen stability detection

    • New openadapt_evals/infrastructure/vm_ip.py with resolve_vm_ip(): layered fallback (explicit → pool registry → Azure CLI)
    • New _compare_screenshots() and _wait_for_stable_screen() for pixel-level screen stability detection (99.5% threshold, 3 consecutive checks)
    • run_dc_eval.py and record_waa_demos.py use resolve_vm_ip() instead of hardcoded IPs
    • 24 new tests across test_vm_ip.py and test_screen_stability.py
  3. 44db6e6 fix: regenerate suggested steps after task restart

    • After a QEMU hard reset mid-recording, takes a fresh screenshot and regenerates VLM-suggested steps instead of reusing stale ones
  4. e577823 feat: add interactive step correction during recording

    • Users can type feedback at any step to refine remaining steps via VLM
    • New functions: _refine_steps(), _refine_remaining_steps(), _interactive_step_review(), _interactive_remaining_review()
    • Step parsing/formatting utilities: _parse_step_list(), _format_step_list(), _display_steps(), _display_current_step()
    • Interactive commands: [Enter] advance, [d] done, [r] redo, [R] restart, [s] refresh, [x] retry, [u] undo, or type feedback
  5. 73473df fix: validate task args before VM IP resolution

    • Guards against Fire passing True for --tasks when used without a value, before attempting VM IP resolution that would fail confusingly
  6. 26f3766 refactor: extract screen stability into module and recording loop into function

    • Moves compare_screenshots and wait_for_stable_screen into openadapt_evals/infrastructure/screen_stability.py
    • Removes fragile importlib hack from test_screen_stability.py — tests now import directly
    • Extracts per-task recording loop into _record_single_task() for readability
    • Fixes pre-existing bug: len(steps) → len(steps_meta) in completion message
  7. 05b261c feat: add --auto flag for automatic infrastructure deployment

    • New --auto flag (and granular --auto-vm, --auto-tunnel, --auto-container) for record-waa
    • Auto-recovery: starts VM, establishes SSH tunnels (prefers autossh), starts Docker container + socat proxy, waits for WAA readiness
    • VM cleanup on exit: atexit + signal handlers offer to deallocate if script started the VM
    • Checkpoint/resume system: saves recording state after each step, offers to resume on next run
    • Pre-fetches task configs before QEMU reset to avoid stale socat bridge issues
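The checkpoint/resume behaviour from commit 7 could look roughly like the sketch below. The function names, the JSON layout, and the atomic-rename strategy are assumptions made for illustration; the PR does not show how record_waa_demos.py actually stores its state.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def save_checkpoint(path: str, state: Dict[str, Any]) -> None:
    """Write recording state to `path` atomically (hypothetical helper)."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target,
    # so an interrupt mid-write never leaves a half-written checkpoint behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: str) -> Optional[Dict[str, Any]]:
    """Return the saved state, or None if no usable checkpoint exists."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

Calling save_checkpoint() after each recorded step and load_checkpoint() at startup gives the "offer to resume on next run" flow: a None result means start fresh, anything else is a candidate state to resume from.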

Files changed (11 files, +1706/-217)

| File | Change |
| --- | --- |
| openadapt_evals/infrastructure/vm_ip.py | New: VM IP auto-detection module |
| openadapt_evals/infrastructure/screen_stability.py | New: screen comparison + stability detection |
| tests/test_vm_ip.py | New: 14 tests for VM IP resolution |
| tests/test_screen_stability.py | New: 10 tests for screen stability |
| screenshots/waa_libreoffice_desktop.png | New: full desktop screenshot |
| scripts/record_waa_demos.py | Major (+1259/-217 lines): auto-infra, step correction, checkpoint/resume |
| scripts/run_dc_eval.py | Minor: use resolve_vm_ip() |
| openadapt_evals/infrastructure/__init__.py | Exports new modules |
| openadapt_evals/infrastructure/qemu_reset.py | Docstring: remove hardcoded IP |
| README.md | Screenshot reference updated |
| .beads/issues.jsonl | Bead tracking |

Test plan

  • pytest tests/test_vm_ip.py tests/test_screen_stability.py -v — 24/24 pass
  • Manual: python scripts/record_waa_demos.py record-waa --auto --tasks=04d9aeaf — verify auto-recovery flow
  • Manual: verify checkpoint resume (interrupt mid-recording, re-run)
  • Manual: verify step correction (type feedback during recording)

🤖 Generated with Claude Code

abrichr and others added 10 commits March 1, 2026 13:49
The previous screenshot showed only the Calc window. The new one shows
the full context: macOS Chrome browser with noVNC tab, Windows 11
desktop inside QEMU, LibreOffice Calc welcome dialog, and Windows
taskbar. This better demonstrates the VM evaluation infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resolve_vm_ip() with layered resolution: explicit arg → pool
  registry (fast, local) → Azure CLI query (always accurate, ~3s)
- Remove hardcoded 172.173.66.131 defaults from record_waa_demos.py
  and run_dc_eval.py; --vm-ip is now auto-detected if omitted
- Add _wait_for_stable_screen() that polls QEMU framebuffer (free)
  until 3 consecutive screenshots match (99.5% similarity threshold),
  replacing the fixed time.sleep(3) that caused stale screenshots
- Add _compare_screenshots() with numpy-vectorized pixel comparison
- 24 new tests (14 for VM IP, 10 for screen stability)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
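The layered resolution described in this commit can be sketched as follows. This is not the code in vm_ip.py: the function name, its parameters, and the injected lookup callables are hypothetical stand-ins for the pool registry read and the Azure CLI query, used here so the fallback order is easy to see and test.

```python
from typing import Callable, Optional


def resolve_vm_ip_sketch(
    explicit_ip: Optional[str] = None,
    registry_lookup: Callable[[], Optional[str]] = lambda: None,
    azure_lookup: Callable[[], Optional[str]] = lambda: None,
) -> str:
    """Return the first IP found: explicit arg, then pool registry, then Azure CLI."""
    # 1. An explicit --vm-ip always wins.
    if explicit_ip:
        return explicit_ip
    # 2. Try the fast local source first, then the slower authoritative one.
    for lookup in (registry_lookup, azure_lookup):
        try:
            ip = lookup()
        except Exception:
            ip = None  # a failing layer falls through to the next one
        if ip:
            return ip
    # 3. Nothing worked: fail loudly, as the later commit message implies.
    raise RuntimeError("Could not resolve VM IP; pass --vm-ip explicitly")
```

Passing the lookups in as callables is a choice made for this sketch only; it lets tests exercise each layer without a registry file or Azure credentials.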
When the user presses 'R' to restart a task, the QEMU hard reset
produces a new stable screenshot, but the suggested steps were not
regenerated. The stale steps from the previous screenshot were
displayed. Now _generate_steps() is called again with the fresh
screenshot after every restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After generating suggested steps from the screenshot, the user can now
type corrections (e.g., "step 9 formula should reference Sheet1.B2")
and the VLM will regenerate with the feedback. Loop continues until
the user presses Enter to accept.

Also refactors _generate_steps into smaller functions:
- _build_setup_desc(): extracts setup description from task config
- _vlm_call(): shared OpenAI API call helper
- _refine_steps(): sends feedback + screenshot for revised steps
- _display_steps(): pretty-prints step box
- _interactive_step_review(): correction loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
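A minimal sketch of the step parsing and the correction loop follows. The numbered-line format ("1. ...", "2) ...", "Step 3: ..."), the function names, and the `refine` callback signature are assumptions; the real _parse_step_list(), _refine_steps(), and _interactive_step_review() may differ.

```python
import re
from typing import Callable, List

# Accepts "1. text", "2) text", or "Step 3: text" (assumed VLM output shapes).
_STEP_RE = re.compile(r"^\s*(?:step\s*)?\d+\s*[.):]\s*(.+?)\s*$", re.IGNORECASE)


def parse_step_list(text: str) -> List[str]:
    """Extract step descriptions from numbered lines, ignoring other text."""
    steps = []
    for line in text.splitlines():
        match = _STEP_RE.match(line)
        if match:
            steps.append(match.group(1))
    return steps


def format_step_list(steps: List[str]) -> str:
    """Render steps back into a canonical '1. ...' numbered list."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))


def interactive_review(
    steps: List[str],
    refine: Callable[[List[str], str], List[str]],
    ask: Callable[[str], str] = input,
) -> List[str]:
    """Loop until the user presses Enter; any other input is sent to `refine`."""
    while True:
        feedback = ask("Feedback (Enter to accept): ").strip()
        if not feedback:
            return steps
        # In the real script this would be the VLM call with the feedback.
        steps = refine(steps, feedback)
```

Taking `ask` as a parameter keeps the loop testable without a terminal; in interactive use the default `input` is enough.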
Move the tasks-type guard above resolve_vm_ip() call so that
input validation happens before any real work. Fixes CI failure
where resolve_vm_ip raises RuntimeError in environments without
Azure access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
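The guard described here might look like the sketch below. The function name, the error message wording, and the comma-splitting behaviour are assumptions; only the "Fire passes True for a bare --tasks" case comes from the PR.

```python
from typing import Any, List


def validate_tasks_arg(tasks: Any) -> List[str]:
    """Normalise a --tasks value, rejecting the bare-flag case (hypothetical helper)."""
    # A flag given without a value arrives as the boolean True, not a string.
    if tasks is None or isinstance(tasks, bool):
        raise ValueError("--tasks requires a value, e.g. --tasks=04d9aeaf")
    # Already a sequence of task IDs.
    if isinstance(tasks, (list, tuple)):
        return [str(task) for task in tasks]
    # A single string, possibly comma-separated.
    return [part.strip() for part in str(tasks).split(",") if part.strip()]
```

Running this check before any VM IP resolution means a malformed command line fails immediately with a usable message, rather than after a network or Azure CLI call.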
…o function

- Move _compare_screenshots and _wait_for_stable_screen from
  scripts/record_waa_demos.py into openadapt_evals/infrastructure/screen_stability.py
  as public functions (compare_screenshots, wait_for_stable_screen)
- Script wrappers delegate to the new module, preserving all call sites
- Update tests/test_screen_stability.py to import from the module directly,
  removing the fragile importlib.util.spec_from_file_location hack
- Extract per-task recording loop from cmd_record_waa() into _record_single_task()
  for readability and testability
- Fix pre-existing bug: len(steps) -> len(steps_meta) in completion message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
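For readers unfamiliar with the technique, here is a rough sketch of pixel-level stability detection using the 99.5% threshold and 3 consecutive checks mentioned in the PR. The function names, signatures, polling interval, and timeout are assumptions, and the actual compare_screenshots() and wait_for_stable_screen() in screen_stability.py may measure similarity differently.

```python
import time
from typing import Callable

import numpy as np


def compare_screenshots_sketch(a: np.ndarray, b: np.ndarray) -> float:
    """Return the fraction of pixels that are identical (0.0 to 1.0)."""
    if a.shape != b.shape:
        return 0.0
    # For H x W x C images a pixel matches only if every channel matches.
    same = np.all(a == b, axis=-1) if a.ndim == 3 else (a == b)
    return float(np.mean(same))


def wait_for_stable_screen_sketch(
    capture: Callable[[], np.ndarray],
    threshold: float = 0.995,
    required: int = 3,
    interval: float = 0.5,
    timeout: float = 30.0,
) -> np.ndarray:
    """Poll `capture` until `required` consecutive comparisons meet `threshold`."""
    deadline = time.monotonic() + timeout
    previous = capture()
    streak = 0
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = capture()
        if compare_screenshots_sketch(previous, current) >= threshold:
            streak += 1
            if streak >= required:
                return current
        else:
            streak = 0  # the screen changed, so start counting again
        previous = current
    # Timed out: return the latest frame rather than blocking forever.
    return previous
```

Compared with a fixed time.sleep(3), this returns as soon as the screen settles and keeps waiting when it has not, which is the stale-screenshot problem the earlier commit set out to fix.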
…oyment

When the WAA server is not reachable, the script now:
- With --auto: starts VM, establishes SSH tunnels, starts Docker container
  and socat proxy, then waits for WAA to boot. Confirms with user before
  starting VM (cost warning). Auto-deallocates VM on exit/signal.
- Without --auto: prints actionable help message showing --auto and
  granular flags (--auto-vm, --auto-tunnel, --auto-container).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
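The exit-time cleanup can be sketched with the standard atexit and signal modules. The class name, the `confirm` callback, and the `started_by_us` flag are hypothetical; the sketch only illustrates the pattern of deallocating once, and only when the script itself started the VM.

```python
import atexit
import signal
import sys
from typing import Callable


class VmCleanup:
    """Deallocate the VM at most once, and only if this process started it."""

    def __init__(self, deallocate: Callable[[], None],
                 confirm: Callable[[], bool] = lambda: True) -> None:
        self.deallocate = deallocate
        self.confirm = confirm
        self.started_by_us = False  # set to True after the script boots the VM
        self._done = False

    def run(self) -> bool:
        """Return True if the VM was deallocated by this call."""
        if self._done or not self.started_by_us:
            return False
        self._done = True  # guard against atexit firing after a signal handler
        if self.confirm():
            self.deallocate()
            return True
        return False

    def _on_signal(self, signum, frame) -> None:
        self.run()
        sys.exit(128 + signum)  # conventional exit status for a signal

    def install(self) -> None:
        """Register for normal exit plus SIGINT/SIGTERM (main thread only)."""
        atexit.register(self.run)
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._on_signal)
```

The `_done` flag matters because sys.exit() inside the signal handler still triggers atexit callbacks, so without it the deallocation prompt would appear twice.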
New script converts WAA recordings (meta.json + screenshots) to demo
text files for eval-suite, with two modes:
- text: instant, free, uses step descriptions from meta.json
- vlm: richer, sends screenshots to VLM for Observation/Intent/Result

Generated both text-only and VLM-enriched demos for task 04d9aeaf
(LibreOffice Calc annual changes). No VM or openadapt-ml needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 15: VLM described after-state instead of before-state, and
referenced C3 instead of C2.
Step 17: VLM hallucinated "CLICK cell D3" — should be D2 (first
data row for OA changes formula).
Step 18: Cascading fix from step 17.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abrichr and others added 2 commits March 1, 2026 23:35
- Remove unused _compare_screenshots wrapper in record_waa_demos.py
- Use f.get('path', '?') instead of f['path'] in _build_setup_desc
- Ensure demo .txt files end with trailing newline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The VLM (gpt-4.1-mini) was hallucinating cell references and other details
that contradicted the recorded actions from meta.json (e.g., "D3" instead
of "D2"). Three improvements to the converter pipeline:

1. Strengthen the VLM prompt to label the recorded action as "GROUND-TRUTH"
   and explicitly instruct the model not to substitute different cell refs,
   values, or formulas based on visual interpretation.

2. Add post-hoc validation that extracts cell references, formulas, and
   quoted text from both the ground-truth step and the VLM's Action field.
   On mismatch, the Action field is replaced with the ground-truth
   description while preserving the VLM's Observation/Intent/Result.

3. Upgrade default model from gpt-4.1-mini to gpt-4.1 and lower
   temperature from 0.1 to 0.0 for more deterministic output. The --model
   flag allows overriding back to gpt-4.1-mini if cost is a concern.

Regenerated demo for 04d9aeaf with the fixed pipeline — previously
hallucinated cell references (steps 15, 17, 18) are now correct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
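The post-hoc validation in point 2 can be approximated as below. The regular expressions, the function names, and the "token sets must be equal" mismatch rule are assumptions for illustration; the converter's actual extraction and comparison logic is not shown in this PR.

```python
import re
from typing import Dict, Set

CELL_RE = re.compile(r"\b[A-Z]{1,3}[0-9]{1,5}\b")  # e.g. D2, AB12
QUOTED_RE = re.compile(r'"([^"]+)"')                # text inside double quotes
FORMULA_RE = re.compile(r"=[^\s\"']+")              # e.g. =B2-A2


def extract_tokens(text: str) -> Set[str]:
    """Collect cell references, quoted strings, and formulas from a description."""
    return (set(CELL_RE.findall(text))
            | set(QUOTED_RE.findall(text))
            | set(FORMULA_RE.findall(text)))


def enforce_ground_truth(step: Dict[str, str], ground_truth: str) -> Dict[str, str]:
    """Replace the Action field when its tokens disagree with the recorded action."""
    fixed = dict(step)  # keep Observation/Intent/Result untouched
    if extract_tokens(fixed.get("Action", "")) != extract_tokens(ground_truth):
        fixed["Action"] = ground_truth
    return fixed
```

For the step 17 case, a VLM Action of "CLICK cell D3" against a recorded "CLICK cell D2" yields differing token sets, so the Action is overwritten while the other three fields are preserved.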
@abrichr abrichr merged commit 3463f9d into main Mar 2, 2026
1 check passed
@abrichr abrichr changed the title feat: add VM IP auto-detection and screen stability detection feat: add interactive recording workflow with auto-infrastructure and VM IP detection Mar 2, 2026