
feat: add YAML-based custom task evaluation without forking WAA#125

Merged
abrichr merged 4 commits into main from feat/custom-task-evaluation
Mar 18, 2026
Conversation


@abrichr abrichr commented Mar 18, 2026

Summary

  • Simple YAML format for defining custom tasks with real end-to-end evaluation
  • No WAA fork or Docker modification required — evaluator configs sent client-side
  • Four check types: command (PowerShell/Python output), file (exists/contains), screenshot (VLM judge), python (arbitrary code)
  • Milestone support for dense partial rewards (reward = milestones_passed / total)
  • 5 example tasks: notepad hello, desktop folder, calc formula, clear browsing data (Chrome + Edge)
  • 22 tests passing

How it works

Users write YAML; our code translates it to the WAA evaluator format at runtime:

name: "Open Notepad and type Hello World"
setup:
  - launch: "notepad.exe"
evaluate:
  - check: command
    run: "powershell -c 'Get-Process notepad | Measure | Select -Expand Count'"
    expect: "1"
  - check: screenshot
    description: "Notepad shows Hello World"
milestones:
  - name: "Notepad open"
    check: command
    run: "powershell -c 'Get-Process notepad | Measure | Select -Expand Count'"
    expect: "1"
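For illustration, a minimal sketch of the runtime translation, assuming the YAML has already been parsed into a dict. The function name `to_waa_evaluator` and the exact WAA config field names are assumptions for this sketch, not the module's actual API:

```python
# Hypothetical sketch: map parsed task checks onto a WAA-style
# evaluator config. Field names are illustrative assumptions.
def to_waa_evaluator(task: dict) -> dict:
    evaluators = []
    for check in task.get("evaluate", []):
        kind = check["check"]
        if kind == "command":
            # Run a command on the VM and compare its output.
            evaluators.append({
                "type": "command",
                "command": check["run"],
                "expected": check["expect"],
            })
        elif kind == "screenshot":
            # Hand a one-sentence description to the VLM judge.
            evaluators.append({
                "type": "vlm_judge",
                "description": check["description"],
            })
        # file and python checks would follow the same pattern
    return {"name": task["name"], "evaluators": evaluators}

task = {
    "name": "Open Notepad and type Hello World",
    "evaluate": [
        {"check": "command",
         "run": "powershell -c 'Get-Process notepad | Measure | Select -Expand Count'",
         "expect": "1"},
        {"check": "screenshot",
         "description": "Notepad shows Hello World"},
    ],
}
print(len(to_waa_evaluator(task)["evaluators"]))  # 2
```

The resulting dict is what gets sent client-side to WAA's POST /evaluate, so no fork or Docker change is needed.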

Test plan

  • 22 unit tests passing (YAML loading, WAA translation, VLM judge, milestones, examples)
  • End-to-end test against live WAA VM with example tasks

🤖 Generated with Claude Code

abrichr and others added 4 commits March 17, 2026 21:40
Users can define tasks with setup commands and evaluation checks in
simple YAML files. The WAA server already accepts evaluator configs in
POST /evaluate — this module translates YAML into that format.

Four check types:
- command: run PowerShell/Python on VM, check output
- file: check file exists or contains expected content
- screenshot: VLM judges screenshot (one-sentence description)
- python: run arbitrary Python on VM

Includes milestone support for dense partial rewards, VLM-based
screenshot evaluation, 5 example tasks (notepad, folder, calc,
clear-browsing-data for Chrome and Edge), and 22 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RLEnvironment.evaluate_dense() uses TaskConfig milestones to compute
partial credit (milestones_passed / total). This gives GRPO gradient
signal even when no task fully completes — an agent passing 3/5
milestones gets reward 0.6 vs 0.0 for binary evaluation.

- evaluate_dense(): milestone-based evaluation, falls back to binary
- load_task_config(): convenience method to set TaskConfig
- collect_rollout() uses dense rewards when milestones are defined
- reset() uses TaskConfig for task loading (bypasses server lookup)
- Trajectory info includes milestone_score, binary_score, counts

9 new tests, all passing. No changes to existing evaluate() behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
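The partial-credit arithmetic above can be sketched as follows; this assumes a hypothetical list of per-milestone pass/fail booleans, whereas the real evaluate_dense() runs each milestone's check against the VM:

```python
# Illustrative sketch of milestone-based dense reward
# (reward = milestones_passed / total).
def dense_reward(milestone_results: list[bool]) -> float:
    """Fraction of milestones passed; 0.0 when none are defined."""
    if not milestone_results:
        return 0.0
    return sum(milestone_results) / len(milestone_results)

# An agent passing 3 of 5 milestones gets 0.6 instead of binary 0.0.
print(dense_reward([True, True, True, False, False]))  # 0.6
```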
Setup commands sent to the execute endpoint were being double-wrapped
in python -c. They are now passed through as-is to WAA's execute handler.

Validated against live WAA VM: milestones correctly evaluate
(VLM screenshot check + command check both work).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validates full chain: TaskConfig YAML → RLEnvironment → collect_rollout
→ dense rewards → TRL rollout_func output shape.

Key test: multiple_rollouts_produce_reward_variance proves that
milestone-based rewards produce [1.0, 0.67, 0.33, 0.0] across
4 rollouts — GRPO can compute meaningful advantages from this,
even when binary task completion is 0%.

5 tests, no VM or GPU required (uses mock adapter).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
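To see why that reward spread matters, here is a hedged sketch of group-relative advantages as GRPO computes them, simplified to mean-centering (std normalization omitted). With the milestone rewards the advantages are nonzero; with all-zero binary rewards every advantage is zero and no gradient flows:

```python
# Simplified group-relative advantage: reward minus the group mean
# (GRPO additionally normalizes by the group std; omitted here).
def group_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    return [round(r - mean, 2) for r in rewards]

print(group_advantages([1.0, 0.67, 0.33, 0.0]))  # [0.5, 0.17, -0.17, -0.5]
print(group_advantages([0.0, 0.0, 0.0, 0.0]))    # [0.0, 0.0, 0.0, 0.0]
```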
@abrichr abrichr merged commit e62377b into main Mar 18, 2026
1 check passed
