refactor(cli): minimal WAA CLI with vanilla image support #14
Merged
fcee717 to 47a4d85
- Refactor CLI from 6800 to ~1300 lines with flat command structure
- Add analyze command to parse and summarize benchmark results
- Add --num-tasks flag to limit number of tasks to run
- Fix Python 3.9 compatibility by copying Python from vanilla WAA image (fixes transformers 4.46.2 compatibility with GroundingDINO)
- Add coverage and analysis artifacts to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
5c51626 to 070225b
abrichr added a commit that referenced this pull request Feb 5, 2026
- Fix broken build badge (publish.yml → release.yml)
- Add prominent "Parallel WAA Benchmark Evaluation" section near top
- Add detailed "WAA Benchmark Workflow" section (#14) with:
  - Single VM and parallel pool workflows
  - VNC access instructions
  - Architecture diagram
  - Cost estimates
- Update section numbering (Limitations → 15, Roadmap → 16)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Feb 5, 2026
* docs(readme): add parallel WAA evaluation section, fix build badge

  - Fix broken build badge (publish.yml → release.yml)
  - Add prominent "Parallel WAA Benchmark Evaluation" section near top
  - Add detailed "WAA Benchmark Workflow" section (#14) with:
    - Single VM and parallel pool workflows
    - VNC access instructions
    - Architecture diagram
    - Cost estimates
  - Update section numbering (Limitations → 15, Roadmap → 16)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(readme): address self-review feedback

  - Fix anchor placement (move before heading for proper navigation)
  - Correct pool-delete → pool-cleanup (actual command name)
  - Add pool-status example for getting worker IPs
  - Add "prices vary by region" caveat

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Complete CLI refactor for WAA benchmark automation. Replaces the 6800-line CLI with a minimal 1300-line implementation that uses the vanilla Microsoft WAA image.
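As a rough illustration of what a flat command layout can look like, the sketch below registers create, run, probe, and analyze directly under one argparse parser and adds a --num-tasks option to run. It is not taken from the PR's cli.py: the program name waa-cli, the help strings, and the helpers non_negative_int and select_tasks are hypothetical names chosen for this example.

```python
import argparse
from typing import List, Optional, Sequence


def non_negative_int(text: str) -> int:
    """Parse an integer and reject negative values (hypothetical helper)."""
    value = int(text)
    if value < 0:
        raise argparse.ArgumentTypeError("--num-tasks must be >= 0")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a flat CLI: every command sits directly under the top-level parser."""
    parser = argparse.ArgumentParser(prog="waa-cli")  # assumed program name
    commands = parser.add_subparsers(dest="command", required=True)

    # Help strings are illustrative, not copied from the real CLI.
    commands.add_parser("create", help="create resources for a benchmark run")
    commands.add_parser("probe", help="check WAA server status")
    commands.add_parser("analyze", help="summarize benchmark results")

    run = commands.add_parser("run", help="run benchmark tasks")
    run.add_argument(
        "--num-tasks",
        type=non_negative_int,
        default=None,
        help="limit the number of tasks to run (default: run all)",
    )
    return parser


def select_tasks(tasks: Sequence[str], num_tasks: Optional[int]) -> List[str]:
    """Return the first num_tasks tasks, or all of them when no limit is set."""
    if num_tasks is None:
        return list(tasks)
    return list(tasks)[:num_tasks]


if __name__ == "__main__":
    args = build_parser().parse_args(["run", "--num-tasks", "2"])
    print(args.command, select_tasks(["task-a", "task-b", "task-c"], args.num_tasks))
```

With a flat layout, each command is one add_parser call, so adding or removing a command does not require touching a nested subcommand tree.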
Key Changes
CLI Refactor (-5500 lines)
- Replace the vm subcommand structure with flat commands: create, run, probe, analyze, etc.
- Add analyze command to parse and summarize benchmark results
- Add --num-tasks to limit the number of tasks to run

Vanilla WAA Image Support
- Uses the vanilla windowsarena/winarena:latest Docker image
- … (172.30.0.2)

Python 3.9 Compatibility Fix
- Fixes AttributeError: 'BertModel' has no attribute 'get_head_mask'

Results Analysis
- analyze command parses downloaded benchmark logs

Commands

Files Changed
- openadapt_ml/benchmarks/cli.py - Complete refactor (6800 → 1300 lines)
- openadapt_ml/benchmarks/waa_deploy/Dockerfile - Python 3.9 compatibility
- docs/CLI_V2_DESIGN.md - Design documentation
- .gitignore - Coverage and analysis artifacts

Test Plan
- probe - Correctly detects WAA server status
- run --num-tasks 2 - Limits tasks correctly
- analyze - Parses benchmark logs and shows results by domain
- logs - Shows container logs

🤖 Generated with Claude Code
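The per-domain summary that analyze is described as producing might be sketched as follows. The log format here is an assumption made purely for illustration: one JSON object per line with a "domain" string and a boolean "success" field. The actual WAA log layout and the PR's parsing code are not shown on this page, and summarize_by_domain is a hypothetical name.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def summarize_by_domain(lines: Iterable[str]) -> Dict[str, dict]:
    """Count tasks and passes per domain from JSON-lines records (assumed format)."""
    totals = defaultdict(lambda: {"tasks": 0, "passed": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip log lines that are not JSON records
        if not isinstance(record, dict) or "domain" not in record:
            continue  # skip JSON values that are not task records
        bucket = totals[record["domain"]]
        bucket["tasks"] += 1
        if record.get("success"):
            bucket["passed"] += 1
    # Sort by domain name so the report order is stable between runs.
    return {
        domain: {**counts, "success_rate": counts["passed"] / counts["tasks"]}
        for domain, counts in sorted(totals.items())
    }


if __name__ == "__main__":
    sample = [
        '{"domain": "notepad", "success": true}',
        "container started",  # non-JSON noise is ignored
        '{"domain": "notepad", "success": false}',
        '{"domain": "chrome", "success": true}',
    ]
    for domain, stats in summarize_by_domain(sample).items():
        print(f"{domain}: {stats['passed']}/{stats['tasks']} passed")
```

Skipping unparseable lines rather than failing keeps the summary usable when task records are interleaved with ordinary container output.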