feat: consolidate benchmark infrastructure and add results section #22
Merged
Conversation
Phase 1 of repo consolidation:

Adapters restructuring:
- Move adapters/waa.py → adapters/waa/mock.py
- Move adapters/waa_live.py → adapters/waa/live.py
- Create adapters/waa/__init__.py for clean imports

New infrastructure/ directory:
- Copy vm_monitor.py from openadapt-ml
- Copy azure_ops_tracker.py from openadapt-ml
- Copy ssh_tunnel.py from openadapt-ml

New waa_deploy/ directory:
- Copy Dockerfile for WAA Docker image
- Copy api_agent.py for in-container agent
- Copy start_waa_server.bat

New namespaced CLI (oa evals):
- Create cli/main.py with 'oa' entry point
- Create cli/vm.py with VM management commands
- Commands: oa evals vm, oa evals run, oa evals mock, etc.

Delete dead code (verified unused):
- benchmarks/agent.py, base.py, waa.py, waa_live.py (deprecated shims)
- benchmarks/auto_screenshot.py, dashboard_server.py
- benchmarks/generate_synthetic_demos.py, live_api.py
- benchmarks/validate_demos.py, validate_screenshots.py

Dependencies:
- Add requests and httpx to core dependencies
- Register 'oa' CLI entry point in pyproject.toml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
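For context, a minimal sketch of what the new adapters/waa/__init__.py re-export layer could look like. WAAMockAdapter appears elsewhere in this PR; WAALiveAdapter is an assumed name for the live adapter, and the structure is illustrative rather than the actual file.

```python
# adapters/waa/__init__.py -- illustrative sketch, not the actual file.
# Re-export the adapters so callers can import from adapters.waa directly,
# e.g. `from openadapt_evals.adapters.waa import WAAMockAdapter`,
# regardless of which submodule (mock.py or live.py) defines the class.
from .mock import WAAMockAdapter   # class name appears elsewhere in this PR
from .live import WAALiveAdapter   # class name is an assumption

__all__ = ["WAAMockAdapter", "WAALiveAdapter"]
```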
- Fix classify_task_complexity to check medium before simple
  - Added "multitasking" to complex indicators
  - Added "file_explorer" to simple indicators and domains
  - Reordered checks: complex > medium > simple
- Update test_cost_optimization.py to match simplified estimate_cost API
  - Remove tests for unimplemented optimization params
  - Add test_estimate_cost_basic and test_estimate_cost_single_worker
  - Update test_target_cost_with_optimizations to use calculate_potential_savings
- Update test_evaluate_endpoint.py to match current adapter behavior
  - Adapter returns 0 score when evaluation unavailable (no fallback scoring)
  - Update assertions to check for "unavailable" or "evaluator" in reason

All 188 tests now pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
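A rough sketch of the reordered check described in this commit. Only "multitasking" and "file_explorer" come from the commit message; the other indicator values, the function body, and the fallback bucket are assumptions, not the actual implementation.

```python
# Illustrative sketch of the check order fix: complex > medium > simple.
# Checking complex indicators first prevents a task that also contains
# medium or simple keywords from being misclassified.
COMPLEX_INDICATORS = {"multitasking", "multiple applications"}  # "multitasking" from this commit
MEDIUM_INDICATORS = {"settings", "spreadsheet"}                 # illustrative only
SIMPLE_INDICATORS = {"file_explorer", "notepad"}                # "file_explorer" from this commit


def classify_task_complexity(task_description: str) -> str:
    text = task_description.lower()
    if any(keyword in text for keyword in COMPLEX_INDICATORS):
        return "complex"
    if any(keyword in text for keyword in MEDIUM_INDICATORS):
        return "medium"
    if any(keyword in text for keyword in SIMPLE_INDICATORS):
        return "simple"
    return "medium"  # fallback bucket (assumption)
```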
Add benchmark results section to track:
- Baseline reproduction (GPT-4o vs paper reported ~19.5%)
- Model comparison (GPT-4o, Claude Sonnet 4.5)
- Domain breakdown by Windows application

Placeholders will be replaced with actual results once the full WAA evaluation completes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
abrichr added a commit to OpenAdaptAI/openadapt-ml that referenced this pull request on Jan 28, 2026:
WAA benchmark results belong in openadapt-evals (the benchmark infrastructure package) rather than openadapt-ml (the training package).

See: OpenAdaptAI/openadapt-evals#22

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Related PR (benchmark consolidation): OpenAdaptAI/openadapt-ml#17
Remove local beads state changes that don't belong in this PR. The issues.jsonl changes were just comment ID renumbering, not substantive changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from 9c140bc to 0130cc8.
Delete deprecated stubs and unused tools from benchmarks/:

Deprecated stubs (re-exported from canonical locations):
- agent.py - was re-exporting from openadapt_evals.agents
- base.py - was re-exporting from openadapt_evals.adapters.base
- waa.py - was re-exporting from openadapt_evals.adapters.waa
- waa_live.py - was re-exporting from openadapt_evals.adapters.waa_live

Unused standalone tools:
- auto_screenshot.py - Playwright screenshot tool, only self-referenced
- dashboard_server.py - Flask dashboard, only self-referenced
- generate_synthetic_demos.py - LLM demo generator, never imported
- live_api.py - simple Flask API, never imported
- validate_demos.py - demo validator, never imported
- validate_screenshots.py - screenshot validator, never imported

Also fixes imports in:
- azure.py: WAAAdapter now imported from adapters.waa
- adapters/waa/live.py: docstring example updated

All 188 tests pass after deletion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
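For reference, the deprecated shims deleted here followed the usual re-export-with-warning pattern. A minimal sketch under that assumption; the warning text and shim module path are not taken from the actual files, though the re-export target (openadapt_evals.adapters.waa, WAAAdapter) is named above.

```python
# benchmarks/waa.py (deleted) -- sketch of the deprecation-shim pattern.
import warnings

warnings.warn(
    "benchmarks.waa is deprecated; import from "
    "openadapt_evals.adapters.waa instead",
    DeprecationWarning,
    stacklevel=2,
)

# Re-export from the canonical location so old imports keep working.
from openadapt_evals.adapters.waa import WAAAdapter  # noqa: E402,F401
```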
Changes since 0.1.0:
- Task ID format: mock_{domain}_{number:03d} (e.g., mock_browser_001)
- Restructured adapters to waa/ subdirectory
- Added infrastructure/ directory
- Dead code cleanup
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
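A quick illustration of the new task ID format; make_task_id is a hypothetical helper used only to show the format string from the changelog above.

```python
# mock_{domain}_{number:03d}, e.g. mock_browser_001.
def make_task_id(domain: str, number: int) -> str:
    return f"mock_{domain}_{number:03d}"


assert make_task_id("browser", 1) == "mock_browser_001"
assert make_task_id("office", 42) == "mock_office_042"
```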
abrichr added a commit to OpenAdaptAI/openadapt-ml that referenced this pull request on Jan 29, 2026:
* docs: add verified repo consolidation plan
  - Two-package architecture: openadapt-evals (foundation) + openadapt-ml (ML)
  - Verified audit findings: 10 dead files confirmed, 3 previously marked dead but used
  - CLI namespacing: oa evals <cmd>, oa ml <cmd>
  - Dependency direction: openadapt-ml depends on openadapt-evals (not circular)
  - Agents with ML deps (PolicyAgent, BaselineAgent) move to openadapt-ml
  - adapters/waa/ subdirectory pattern for benchmark organization

* feat: add openadapt-evals as optional dependency
  - Add [benchmarks] optional dependency for benchmark evaluation: pip install openadapt-ml[benchmarks]
  - Part of the repo consolidation to establish openadapt-evals as the foundation for benchmarks + infrastructure, with openadapt-ml handling ML training (and depending on evals for benchmarks)

* docs(cli): clarify serve vs dashboard command naming
  - oa ml serve: serve trained models for inference (model inference endpoint)
  - oa ml dashboard: training dashboard for monitoring (training progress UI)

* refactor(benchmarks): consolidate to re-export from openadapt-evals
  - Migrate benchmark infrastructure to the two-package architecture: openadapt-evals is the foundation package with all adapters, agents, and the runner; openadapt-ml keeps only ML-specific agents that wrap openadapt-ml internals
  - Convert base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py to deprecation stubs that re-export from openadapt-evals
  - Keep only ML-specific agents in agent.py: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent
  - Update __init__.py to import from openadapt-evals with a deprecation warning
  - Update tests to import from correct locations; remove test_waa_live.py (tests belong in openadapt-evals)
  - Net: -3540 lines of duplicate code removed

* refactor(benchmarks): delete deprecation stubs, import from openadapt-evals
  - Remove the deprecation stubs since there are no external users; tests now import directly from openadapt-evals (canonical location)
  - Deleted: base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py
  - Kept: agent.py (ML-specific agents: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent) and __init__.py (simplified to only export ML-specific agents)

* docs(readme): add WAA benchmark results section with placeholders
  - Add section 15 for Windows Agent Arena benchmark results with clearly marked placeholders; results will be filled in when the full evaluation completes
  - Warning banner indicates the PR should not merge until placeholders are replaced
  - Sections added: 15.1 Benchmark Overview, 15.2 Baseline Reproduction (paper vs our run), 15.3 Model Comparison (GPT-4o, Claude, Qwen variants), 15.4 Domain Breakdown

* docs(readme): move WAA benchmark results to openadapt-evals
  - WAA benchmark results belong in openadapt-evals (the benchmark infrastructure package) rather than openadapt-ml (the training package). See: OpenAdaptAI/openadapt-evals#22

* feat(cli): add VNC auto-launch and --fast VM option
  - Add setup_vnc_tunnel_and_browser() helper for automatic VNC access
  - Add VM_SIZE_FAST constants with D8 series sizes and VM_SIZE_FAST_FALLBACKS for automatic region/size retry
  - Add --fast flag to the create command for faster installations and to the start command for more QEMU resources (6 cores, 16GB)
  - Opens browser automatically after the container starts

* docs: add WAA speedup options documentation
  - Document --fast VM flag usage, explain parallelization options, and detail the golden image approach for future optimization

* docs(readme): add benchmark execution logs section
  - Add section 13.5 with log viewing commands and benchmark run command examples; renumber the screenshot capture tool section to 13.6

* docs(readme): clarify --run flag for benchmark execution logs
  - logs --run for viewing task progress, logs --run -f for live streaming, logs --run --tail N for the last N lines

* docs(readme): add example output for logs commands
  - Example output for `logs` (container status) and `logs --run -f` (benchmark execution)

* feat(cli): add --progress flag for benchmark ETA
  - Add _show_benchmark_progress(): parse run logs for the completed task count, calculate elapsed time and estimated remaining time, and show a progress percentage
  - Example usage: uv run python -m openadapt_ml.benchmarks.cli logs --progress

* docs(research): add cua.ai vs openadapt-ml WAA comparison
  - Comprehensive analysis of the Cua (YC X25) computer-use agent platform: architecture comparison (composite agents, sandbox-first), benchmark framework differences (cua-bench vs openadapt-evals), training data generation (trajectory replotting)
  - Key findings: Cua's parallelization uses multiple sandboxes (like our multi-VM plan); the composite agent pattern could reduce API costs; HTML capture enables training data diversity
  - Recommendation: adopt patterns, not full migration

* feat(cli): add parallelization support with --worker-id and --num-workers
  - WAA natively supports parallel execution by distributing tasks across workers
  - Single VM (default): run --num-tasks 154
  - Parallel on multiple VMs: run --num-tasks 154 --worker-id 0 --num-workers 3 on VM1, --worker-id 1 on VM2, --worker-id 2 on VM3
  - Tasks auto-distribute: worker 0 gets tasks 0-51, worker 1 gets 52-103, etc.

* docs(research): add market positioning and strategic differentiation
  - Expand cua_waa_comparison.md with success rate gap analysis (38.1% vs 19.5%), market positioning comparison (TAM, buyers, value props), where the sandbox approach fails (Citrix, licensed SW, compliance), and shell applications convergence opportunities
  - Bottom line: Windows enterprise automation is hard, which validates the OpenAdapt approach

* docs(waa): add parallelization and scalable benchmark design docs
  - WAA_PARALLELIZATION_DESIGN.md: the official WAA approach (Azure ML Compute), our dedicated VM approach (dev/debug), and when to use each
  - WAA_UNATTENDED_SCALABLE.md: goal of unattended, scalable, programmatic WAA; synthesized approach using the official run_azure.py; implementation plan and cost estimates
  - Update Dockerfile comments to clarify: API agents (api-claude, api-openai) run externally, the openadapt-evals CLI connects via SSH tunnel, and no internal run.py patching is needed

* style: fix ruff formatting

* fix(imports): update internal code to import from openadapt-evals
  - Replace imports from deleted benchmark files with direct imports from openadapt-evals: azure.py (BenchmarkResult, BenchmarkTask, WAAAdapter) and waa_demo/runner.py (BenchmarkAction, WAAMockAdapter, etc.)
  - This completes the migration to the two-package architecture where openadapt-evals is the canonical source for benchmark infrastructure

* fix(imports): add missing EvaluationConfig import
  - Update azure.py to import BenchmarkAgent from openadapt_evals; add EvaluationConfig to runner.py imports
  - Fixes CI failure: F821 Undefined name `EvaluationConfig`

* fix(deps): require openadapt-evals>=0.1.1
  - v0.1.0 uses the task ID format "browser_1" but tests expect "mock_browser_001", which was added in v0.1.1

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
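The worker split described in the parallelization commit (for 154 tasks across 3 workers, worker 0 gets tasks 0-51 and worker 1 gets 52-103) corresponds to a contiguous chunked slice. A sketch under that assumption, purely illustrative; the real --worker-id/--num-workers logic in the CLI may slice differently.

```python
import math


def tasks_for_worker(num_tasks: int, worker_id: int, num_workers: int) -> range:
    # Contiguous chunking: each worker takes ceil(num_tasks / num_workers)
    # consecutive task indices, with the last worker picking up the remainder.
    chunk = math.ceil(num_tasks / num_workers)
    start = worker_id * chunk
    return range(start, min(start + chunk, num_tasks))


# Matches the example above: 0-51, 52-103, 104-153 for 154 tasks / 3 workers.
assert (tasks_for_worker(154, 0, 3)[0], tasks_for_worker(154, 0, 3)[-1]) == (0, 51)
assert (tasks_for_worker(154, 1, 3)[0], tasks_for_worker(154, 1, 3)[-1]) == (52, 103)
assert (tasks_for_worker(154, 2, 3)[0], tasks_for_worker(154, 2, 3)[-1]) == (104, 153)
```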
Summary
Phase 1 of benchmark infrastructure consolidation for openadapt-evals.
Related PR: OpenAdaptAI/openadapt-ml#17
Changes
1. Adapters restructuring:
   - adapters/waa.py → adapters/waa/mock.py
   - adapters/waa_live.py → adapters/waa/live.py
   - adapters/waa/__init__.py for clean imports
2. New infrastructure/ directory:
   - vm_monitor.py from openadapt-ml
   - azure_ops_tracker.py from openadapt-ml
   - ssh_tunnel.py from openadapt-ml
3. New waa_deploy/ directory:
   - Dockerfile for WAA Docker image
   - api_agent.py for in-container agent
   - start_waa_server.bat
4. New namespaced CLI (oa evals; see the sketch after this list):
   - cli/main.py with 'oa' entry point
   - cli/vm.py with VM management commands
   - oa evals vm, oa evals run, oa evals mock, etc.
5. Delete dead code (verified unused):
   - benchmarks/agent.py, base.py, waa.py, waa_live.py (deprecated shims)
   - benchmarks/auto_screenshot.py, dashboard_server.py
   - benchmarks/generate_synthetic_demos.py, live_api.py
   - benchmarks/validate_demos.py, validate_screenshots.py
6. Test fixes:
   - classify_task_complexity to check medium before simple
   - test_cost_optimization.py to match simplified estimate_cost API
   - test_evaluate_endpoint.py to match current adapter behavior
7. WAA benchmark results section:
   - Placeholder tables for baseline reproduction, model comparison, and domain breakdown
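Rough sketch of how cli/main.py could wire up the namespaced `oa evals` commands. The subcommand names (vm, run, mock) come from this PR; the argparse structure and dispatch are assumptions, and the real CLI may be built on a different framework.

```python
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(prog="oa")
    namespaces = parser.add_subparsers(dest="namespace", required=True)

    evals = namespaces.add_parser("evals", help="benchmark evaluation commands")
    evals_cmds = evals.add_subparsers(dest="command", required=True)
    for name in ("vm", "run", "mock"):
        evals_cmds.add_parser(name)

    args = parser.parse_args()
    # In the real package this would dispatch to cli/vm.py etc.
    print(f"oa {args.namespace} {args.command}")


if __name__ == "__main__":
    main()
```

On the packaging side, the 'oa' entry point mentioned above would be registered under [project.scripts] in pyproject.toml, pointing at something like openadapt_evals.cli.main:main (the exact module path is an assumption).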
Test plan
- oa evals vm, etc.

🤖 Generated with Claude Code