Merged
4 changes: 2 additions & 2 deletions .beads/issues.jsonl
@@ -11,7 +11,7 @@
{"id":"openadapt-evals-dke","title":"SYSTEM: Create knowledge persistence workflow using Beads","description":"Every fix/approach must be logged as a Beads issue with:\n1. Problem description\n2. Attempted solution\n3. Result (worked/failed/partial)\n4. Root cause if known\n5. Files changed\n\nBefore any fix attempt, agent MUST:\n1. Run 'bd list --labels=fix,approach' to see prior attempts\n2. Review what was tried before\n3. Document new attempt BEFORE implementing\n\nAfter context compaction, first action:\n1. Run 'bd ready' for current tasks\n2. Run 'bd list --labels=recurring' for known recurring issues\n3. Check docs/RECURRING_ISSUES.md for patterns","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T19:00:18.155796-05:00","created_by":"Richard Abrich","updated_at":"2026-02-23T16:21:13.18811-05:00","closed_at":"2026-02-14T12:22:52.357373-05:00"}
{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-02-23T16:21:13.188539-05:00","closed_at":"2026-02-08T13:23:34.84444-05:00","labels":["testing","waa"],"comments":[{"id":3,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Session Recovery 2026-01-22 17:58: Previous agents killed during compaction. VM state: Docker/containerd unhealthy, disk /mnt only 32GB (need 47GB+ for vanilla WAA). Git-lfs failing. User feedback: 1) use beads, 2) larger disk, 3) clean up CLI, 4) vanilla WAA config.","created_at":"2026-01-22T18:05:45Z"},{"id":4,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Launched 3 parallel agents: ae159fc (VM disk upgrade), aabad47 (CLI cleanup), aee4e8a (fix containerd). 
Check /private/tmp/claude/-Users-abrichr-oa-src-openadapt-ml/tasks/*.output for results.","created_at":"2026-01-22T18:06:18Z"},{"id":5,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"WORKFLOW DOCUMENTED: VM config changes = delete VM -\u003e update code -\u003e relaunch. Added to CLAUDE.md. Default VM size now D8ds_v5 (300GB). Launching fresh VM now.","created_at":"2026-01-22T18:09:12Z"},{"id":6,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:20: VM resources cleaned up, launched agent a9be1f8 to add auto-cleanup to CLI, WAA setup retrying in background (b04fcbe). Workflow documented in CLAUDE.md and STATUS.md.","created_at":"2026-01-22T18:11:56Z"},{"id":7,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:30: VM created with D8s_v3 fallback (D8ds_v5 quota 0), IP 20.120.37.97. Restored waa_deploy symlink. Docker image building. W\u0026B integration agent a21c3ef running.","created_at":"2026-01-22T18:25:29Z"},{"id":8,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 19:05: WAA Docker image built successfully! Container running. Windows booting. VM: 20.120.37.97, VNC: http://20.120.37.97:8006","created_at":"2026-01-22T18:47:03Z"}]}
{"id":"openadapt-evals-hvm","title":"VL model fix PR #18 ready to merge","notes":"2026-02-08: openadapt-ml PR #18 was already merged on 2026-01-29. VL model fix is done.","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.491938-05:00","created_by":"Richard Abrich","updated_at":"2026-02-08T12:55:19.233249-05:00","closed_at":"2026-02-08T12:55:19.233249-05:00","close_reason":"PR #18 already merged 2026-01-29"}
{"id":"openadapt-evals-mx8","title":"Analyze evaluation results and publish findings","description":"After demo-conditioned evaluation completes, analyze results: success rates, failure modes, demo impact. Create data-driven roadmap for improvements.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:06.328838-05:00","created_by":"Richard Abrich","updated_at":"2026-02-14T12:23:06.328838-05:00"}
{"id":"openadapt-evals-mx8","title":"Analyze evaluation results and publish findings","description":"After demo-conditioned evaluation completes, analyze results: success rates, failure modes, demo impact. Create data-driven roadmap for improvements.","notes":"wright repo (OpenAdaptAI/wright) scaffolding underway. Herald + consilium repos transferred to OpenAdaptAI org. Wright will be the orchestration layer for eval pipeline.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:06.328838-05:00","created_by":"Richard Abrich","updated_at":"2026-03-02T00:08:08.422633-05:00"}
{"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"2026-03-01: GPU grant applications reviewed and rewritten (11 files). Writing done, blocked on eval results (DC signal on harder tasks). Detailed status tracked in openadapt-internal (private repo).","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-03-01T23:35:11.042286-05:00"}
{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"wright repo created (OpenAdaptAI/wright), scaffolding in progress. Herald + consilium transferred to OpenAdaptAI org.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-03-02T00:08:08.423284-05:00"}
{"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}
4 changes: 4 additions & 0 deletions .gitignore
@@ -29,3 +29,7 @@ benchmark_live.json

# Cost reports (generated during evaluation runs)
cost_report.json

# WAA recordings (PNGs + meta.json from record-waa sessions)
# Gitignored to protect from `git stash -u` and `git clean -f`
waa_recordings/
149 changes: 114 additions & 35 deletions scripts/record_waa_demos.py
@@ -31,6 +31,8 @@
from __future__ import annotations

import json
import os
import shutil
import socket
import subprocess
import sys
@@ -1324,28 +1326,39 @@ def _auto_start_container(vm_ip: str) -> bool:


def _auto_start_socat(vm_ip: str) -> bool:
"""Start socat proxy on the VM for port 5050 forwarding. Returns True on success."""
"""Start socat proxy on the VM for port 5050 forwarding.

Tries the socat-waa-evaluate systemd service first (preferred: auto-restarts
on failure). Falls back to the legacy nohup approach for older VMs that
don't have the service installed.
"""
print(f" Starting socat proxy on {vm_ip} (VM:5051 -> container:5050)...")
# The socat command runs in the background on the VM
socat_cmd = (
'nohup socat TCP-LISTEN:5051,fork,reuseaddr '
'EXEC:"docker exec -i winarena socat - TCP\\:localhost\\:5050" '
'&>/dev/null &'
script = (
"if systemctl list-unit-files socat-waa-evaluate.service "
"| grep -q socat-waa-evaluate; then "
" sudo systemctl restart socat-waa-evaluate.service; "
"else "
" killall socat 2>/dev/null || true; sleep 1; "
" which socat >/dev/null 2>&1 "
" || sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -qq socat; "
" nohup socat TCP-LISTEN:5051,fork,reuseaddr "
" 'EXEC:docker exec -i winarena socat - TCP\\:127.0.0.1\\:5050' "
" </dev/null >/dev/null 2>&1 & "
"fi"
)
result = subprocess.run(
["ssh",
"-o", "ConnectTimeout=10",
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
f"{_AUTO_SSH_USER}@{vm_ip}",
socat_cmd],
script],
capture_output=True, text=True, timeout=30,
)
if result.returncode != 0:
print(f" WARNING: socat setup returned non-zero: {result.stderr.strip()}")
# Not fatal — socat may already be running
else:
print(" Socat proxy started.")
print(f" ERROR: socat proxy setup failed: {result.stderr.strip()}")
return False
print(" Socat proxy established (VM:5051 -> container:5050).")
return True
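
Reviewer sketch, not part of this PR: the function now returns False when the SSH command exits non-zero, but in the nohup fallback the backgrounded socat can still die after the remote shell has already exited with status 0. A caller could confirm that the listener is really accepting connections. The helper name `port_is_open` is hypothetical, and the sketch assumes the port is reachable from wherever the check runs (for example through an SSH tunnel).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

A call such as `port_is_open(vm_ip, 5051)` after `_auto_start_socat(vm_ip)` would only show that something accepts TCP connections on that port; it does not prove that the WAA server behind the proxy is healthy.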


@@ -1459,6 +1472,64 @@ def _attempt_auto_recovery(
return False


# Default external backup root (outside the git repo)
_BACKUP_ROOT = Path.home() / "oa" / "recordings"


def _backup_file(src: Path, task_id: str) -> None:
"""Hardlink *src* to the external backup directory.

Creates ``~/oa/recordings/{task_id}/{src.name}``. Uses a hardlink (zero
extra disk space) and falls back to ``shutil.copy2`` for cross-device
scenarios. Failures are silently ignored — recording must never be
interrupted by a backup error.
"""
try:
backup_dir = _BACKUP_ROOT / task_id
backup_dir.mkdir(parents=True, exist_ok=True)
dest = backup_dir / src.name
if dest.exists():
dest.unlink()
try:
os.link(src, dest)
except OSError:
shutil.copy2(src, dest)
except Exception:
pass # Silent — never interrupt recording
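
Reviewer sketch, not part of this PR: the hardlink-then-copy pattern used by `_backup_file` can be exercised in isolation to see what it protects against. The helper `link_or_copy` and the temporary paths are hypothetical; unlike `_backup_file`, this version lets errors propagate and reports which method was used.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(src: Path, dest: Path) -> str:
    """Hardlink src to dest, falling back to a copy. Returns the method used."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        dest.unlink()  # os.link refuses to overwrite an existing path
    try:
        os.link(src, dest)
        return "hardlink"
    except OSError:
        # For example a cross-device link or a filesystem without hardlinks.
        shutil.copy2(src, dest)
        return "copy"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "repo" / "step_00_before.png"
    src.parent.mkdir()
    src.write_bytes(b"fake png bytes")
    dest = Path(tmp) / "backup" / "task" / src.name

    method = link_or_copy(src, dest)
    same_inode = os.path.samefile(src, dest)

    src.unlink()  # simulate the repo copy being deleted
    survived = dest.read_bytes() == b"fake png bytes"
```

A hardlink keeps the data alive when the original path is deleted, which is the `git stash -u` / `git clean -f` case named in the `.gitignore` comment. It does not protect against in-place overwrites, because both names point at the same inode until one of them is unlinked.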


def _save_incremental_meta(
task_dir: Path,
task_id: str,
instruction: str,
steps_meta: list[dict],
step_plans: list[dict],
server: str,
is_final: bool = False,
) -> None:
"""Write ``meta.json`` atomically after each step.

Writes to a ``.tmp`` file first and renames, preventing corrupt partial
writes on crash. Includes a ``recording_complete`` boolean so downstream
scripts can detect partial recordings.
"""
meta = {
"task_id": task_id,
"instruction": instruction,
"num_steps": len(steps_meta),
"steps": steps_meta,
"step_plans": step_plans,
"server_url": server,
"recorded_at": datetime.now(timezone.utc).isoformat(),
"recording_complete": is_final,
}
tmp = task_dir / "meta.json.tmp"
final = task_dir / "meta.json"
tmp.write_text(json.dumps(meta, indent=2), encoding="utf-8")
tmp.replace(final)
_backup_file(final, task_id)
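
Reviewer sketch, not part of this PR: a downstream script could use the new `recording_complete` flag roughly as follows. The function name `is_recording_complete` is hypothetical. Treating a missing flag as complete is an assumption, based on the removed code in this diff writing `meta.json` only once, at the end of a task.

```python
import json
import tempfile
from pathlib import Path


def is_recording_complete(task_dir: Path) -> bool:
    """Return True if task_dir holds a meta.json for a finished recording."""
    meta_path = task_dir / "meta.json"
    if not meta_path.exists():
        return False
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Assumption: files written before this change lack the flag and were
    # only ever written at the end of a task, so they count as complete.
    return bool(meta.get("recording_complete", True))


with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp)
    meta_path = task_dir / "meta.json"

    meta_path.write_text(
        json.dumps({"num_steps": 2, "recording_complete": False}), encoding="utf-8"
    )
    partial = is_recording_complete(task_dir)

    meta_path.write_text(
        json.dumps({"num_steps": 3, "recording_complete": True}), encoding="utf-8"
    )
    finished = is_recording_complete(task_dir)

    meta_path.write_text(json.dumps({"num_steps": 3}), encoding="utf-8")
    legacy = is_recording_complete(task_dir)

    meta_path.unlink()
    missing = is_recording_complete(task_dir)
```

Because the writer renames `meta.json.tmp` over `meta.json`, a reader like this should see either the previous file or the new one, not a half-written file, on filesystems where the rename is atomic.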


def cmd_record_waa(
tasks: str = ",".join(HARDER_TASK_IDS),
server: str = "http://localhost:5001",
@@ -1859,9 +1930,9 @@ def _hard_reset_task_env() -> bytes:

while remaining_steps:
# Save before screenshot
(task_dir / f"step_{step_idx:02d}_before.png").write_bytes(
before_png
)
before_path = task_dir / f"step_{step_idx:02d}_before.png"
before_path.write_bytes(before_png)
_backup_file(before_path, task_id)

# Display current step
total = len(completed_steps) + len(remaining_steps)
@@ -1879,24 +1950,28 @@ def _hard_reset_task_env() -> bytes:
# RETRY: discard this attempt, take fresh before screenshot
print(" Retrying step (taking fresh screenshot)...")
before_png = _take_screenshot(server)
(task_dir / f"step_{step_idx:02d}_before.png").write_bytes(
before_png
)
retry_path = task_dir / f"step_{step_idx:02d}_before.png"
retry_path.write_bytes(before_png)
_backup_file(retry_path, task_id)
continue

elif user_input == "":
# ADVANCE: action done, move to next step
after_png = _take_screenshot(server)
(task_dir / f"step_{step_idx:02d}_after.png").write_bytes(
after_png
)
after_path = task_dir / f"step_{step_idx:02d}_after.png"
after_path.write_bytes(after_png)
_backup_file(after_path, task_id)
done_step = remaining_steps.pop(0)
completed_steps.append(done_step)
steps_meta.append({
"action_hint": None,
"suggested_step": done_step,
"step_was_refined": step_idx in refined_indices,
})
_save_incremental_meta(
task_dir, task_id, instruction, steps_meta,
step_plans, server,
)
before_png = after_png
step_idx += 1
print(f" Step {step_num} recorded.")
@@ -1915,14 +1990,18 @@ def _hard_reset_task_env() -> bytes:
elif user_input.lower() == "d":
# DONE: task finished (possibly before all steps)
after_png = _take_screenshot(server)
(task_dir / f"step_{step_idx:02d}_after.png").write_bytes(
after_png
)
after_path = task_dir / f"step_{step_idx:02d}_after.png"
after_path.write_bytes(after_png)
_backup_file(after_path, task_id)
steps_meta.append({
"action_hint": "d",
"suggested_step": remaining_steps[0],
"step_was_refined": step_idx in refined_indices,
})
_save_incremental_meta(
task_dir, task_id, instruction, steps_meta,
step_plans, server,
)
step_idx += 1
total = len(completed_steps) + len(remaining_steps)
print(f"\n Task marked done at step {step_num} of {total}. Finishing recording.")
@@ -1960,6 +2039,10 @@ def _hard_reset_task_env() -> bytes:
print(" Restarting task (soft reset — closing apps, re-running setup)...")
for f in task_dir.glob("step_*.png"):
f.unlink()
# Clean external backup for this task too
backup_dir = _BACKUP_ROOT / task_id
if backup_dir.exists():
shutil.rmtree(backup_dir, ignore_errors=True)
before_png = _soft_reset_task_env()
print(f"\n VNC: {vnc_url}")
print(f" Task: {instruction}\n")
@@ -1996,6 +2079,10 @@ def _hard_reset_task_env() -> bytes:
print(" Restarting task (hard reset — QEMU reboot)...")
for f in task_dir.glob("step_*.png"):
f.unlink()
# Clean external backup for this task too
backup_dir = _BACKUP_ROOT / task_id
if backup_dir.exists():
shutil.rmtree(backup_dir, ignore_errors=True)
before_png = _hard_reset_task_env()
print(f"\n VNC: {vnc_url}")
print(f" Task: {instruction}\n")
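
Reviewer sketch, not part of this PR: the soft-reset and hard-reset branches repeat the same cleanup, and the hunks shown here remove `step_*.png` and the backup directory but not the `meta.json` written incrementally by the previous attempt. Whether that file is cleared elsewhere is not visible in this diff. One possible shared helper is sketched below; `clear_task_artifacts` is a hypothetical name, and removing `meta.json` is a suggestion rather than existing behaviour.

```python
import shutil
import tempfile
from pathlib import Path


def clear_task_artifacts(task_dir: Path, backup_root: Path, task_id: str) -> None:
    """Remove step screenshots, incremental metadata, and the external backup."""
    for png in task_dir.glob("step_*.png"):
        png.unlink()
    # Suggested addition: drop metadata left over from the abandoned attempt.
    for name in ("meta.json", "meta.json.tmp"):
        (task_dir / name).unlink(missing_ok=True)  # missing_ok needs Python 3.8+
    # ignore_errors=True also covers the case where no backup exists yet.
    shutil.rmtree(backup_root / task_id, ignore_errors=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    task_dir = root / "waa_recordings" / "task-1"
    backup_root = root / "backup"
    task_dir.mkdir(parents=True)
    (backup_root / "task-1").mkdir(parents=True)

    (task_dir / "step_00_before.png").write_bytes(b"x")
    (task_dir / "meta.json").write_text("{}", encoding="utf-8")
    (backup_root / "task-1" / "step_00_before.png").write_bytes(b"x")

    clear_task_artifacts(task_dir, backup_root, "task-1")
    leftover = sorted(p.name for p in task_dir.iterdir())
    backup_gone = not (backup_root / "task-1").exists()
```

Both restart branches could then call the helper with `_BACKUP_ROOT` before resetting the environment, which would keep the two code paths from drifting apart.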
@@ -2081,18 +2168,10 @@ def _hard_reset_task_env() -> bytes:
)
# No action taken — loop re-displays the (possibly new) current step

# Save metadata
meta = {
"task_id": task_id,
"instruction": instruction,
"num_steps": len(steps_meta),
"steps": steps_meta,
"step_plans": step_plans,
"server_url": server,
"recorded_at": datetime.now(timezone.utc).isoformat(),
}
(task_dir / "meta.json").write_text(
json.dumps(meta, indent=2), encoding="utf-8"
# Save final metadata (marks recording as complete)
_save_incremental_meta(
task_dir, task_id, instruction, steps_meta,
step_plans, server, is_final=True,
)

# Task completed successfully — remove checkpoint