feat: add YAML-based custom task evaluation without forking WAA#125
Merged
Conversation
Users can define tasks with setup commands and evaluation checks in simple YAML files. The WAA server already accepts evaluator configs in POST /evaluate; this module translates YAML into that format.

Four check types:
- command: run PowerShell/Python on the VM and check the output
- file: check that a file exists or contains expected content
- screenshot: a VLM judges the screenshot against a one-sentence description
- python: run arbitrary Python on the VM

Includes milestone support for dense partial rewards, VLM-based screenshot evaluation, 5 example tasks (notepad, folder, calc, clear-browsing-data for Chrome and Edge), and 22 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
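A hypothetical task file illustrating the four check types. The field names below are illustrative only, not the exact schema this PR defines:

```yaml
# Hypothetical task definition — keys are illustrative, not the real schema
id: notepad-save-file
setup:
  - "New-Item -ItemType Directory -Force C:\\Users\\user\\demo"
checks:
  - type: command          # run on the VM, compare output
    command: "Get-Content C:\\Users\\user\\demo\\notes.txt"
    expected_output: "hello"
  - type: file             # existence / content check
    path: "C:\\Users\\user\\demo\\notes.txt"
    contains: "hello"
  - type: screenshot       # VLM judges against a one-sentence description
    description: "Notepad is open with the text 'hello' visible"
  - type: python           # arbitrary Python executed on the VM
    code: "import os; assert os.path.exists(r'C:\\Users\\user\\demo\\notes.txt')"
```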
RLEnvironment.evaluate_dense() uses TaskConfig milestones to compute partial credit (milestones_passed / total). This gives GRPO a gradient signal even when no task fully completes: an agent passing 3/5 milestones gets reward 0.6 instead of 0.0 under binary evaluation.

- evaluate_dense(): milestone-based evaluation, falls back to binary
- load_task_config(): convenience method to set TaskConfig
- collect_rollout() uses dense rewards when milestones are defined
- reset() uses TaskConfig for task loading (bypasses server lookup)
- Trajectory info includes milestone_score, binary_score, counts

9 new tests, all passing. No changes to existing evaluate() behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
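The partial-credit computation reduces to milestones passed over total. A minimal sketch (the function name and signature are assumptions, not the actual RLEnvironment API):

```python
def dense_reward(milestone_results: list[bool]) -> float:
    """Fraction of milestones passed. Returns 0.0 for an empty list;
    the caller is assumed to fall back to binary evaluation when no
    milestones are defined."""
    if not milestone_results:
        return 0.0
    return sum(milestone_results) / len(milestone_results)

# An agent passing 3 of 5 milestones earns 0.6 instead of a flat 0.0.
print(dense_reward([True, True, True, False, False]))  # 0.6
```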
Setup commands were being double-wrapped in python -c by the execute path. They are now passed through as-is to WAA's execute handler.

Validated against a live WAA VM: milestones evaluate correctly (both the VLM screenshot check and the command check work).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
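The essence of the bug, as a simplified sketch (helper names are hypothetical; the real code paths differ):

```python
def wrap_buggy(cmd: str) -> str:
    # Old behavior: unconditionally wrap the setup command in python -c,
    # which breaks commands that are already complete PowerShell/Python
    # invocations.
    return f'python -c "{cmd}"'

def wrap_fixed(cmd: str) -> str:
    # New behavior: pass the command through unchanged and let WAA's
    # execute handler decide how to run it.
    return cmd

cmd = "New-Item -ItemType File C:\\demo.txt"
print(wrap_buggy(cmd))  # wrapped: no longer a valid PowerShell invocation
print(wrap_fixed(cmd))  # unchanged
```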
Validates the full chain: TaskConfig YAML → RLEnvironment → collect_rollout → dense rewards → TRL rollout_func output shape.

Key test: multiple_rollouts_produce_reward_variance proves that milestone-based rewards produce [1.0, 0.67, 0.33, 0.0] across 4 rollouts. GRPO can compute meaningful advantages from this even when binary task completion is 0%.

5 tests, no VM or GPU required (uses a mock adapter).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
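Why reward variance matters for GRPO: advantages are rewards relative to the group, so a group of identical rewards yields zero learning signal. A sketch under the simplest group-relative baseline (mean-centering only; the actual trainer may also normalize by the group standard deviation):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages for one rollout group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Milestone-based rewards from the key test: four rollouts, varied credit.
dense = group_advantages([1.0, 0.67, 0.33, 0.0])
binary = group_advantages([0.0, 0.0, 0.0, 0.0])

print([round(a, 2) for a in dense])  # [0.5, 0.17, -0.17, -0.5]
print(binary)                        # [0.0, 0.0, 0.0, 0.0] — no signal
```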
Summary
- command: PowerShell/Python output
- file: exists/contains
- screenshot: VLM judge
- python: arbitrary code

How it works
Users write YAML, our code translates to WAA evaluator format at runtime:
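A minimal sketch of that translation, assuming a hypothetical evaluator schema on both sides (the real WAA evaluator format and the YAML keys may differ):

```python
def to_waa_evaluator(check: dict) -> dict:
    """Translate one YAML check dict into a hypothetical WAA evaluator
    entry. All keys here are illustrative, not the real schema."""
    kind = check["type"]
    if kind == "command":
        return {"func": "run_command",
                "command": check["command"],
                "expected": check.get("expected_output")}
    if kind == "file":
        return {"func": "check_file",
                "path": check["path"],
                "contains": check.get("contains")}
    if kind == "screenshot":
        return {"func": "vlm_judge",
                "description": check["description"]}
    if kind == "python":
        return {"func": "run_python",
                "code": check["code"]}
    raise ValueError(f"unknown check type: {kind}")

print(to_waa_evaluator({"type": "file",
                        "path": "C:\\notes.txt",
                        "contains": "hello"}))
```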
Test plan
🤖 Generated with Claude Code