feat: add YAML-based custom task evaluation without forking WAA#125
Merged
Conversation
Users can define tasks with setup commands and evaluation checks in simple YAML files. The WAA server already accepts evaluator configs in POST /evaluate; this module translates YAML into that format.

Four check types:
- command: run PowerShell/Python on the VM and check the output
- file: check that a file exists or contains expected content
- screenshot: a VLM judges the screenshot against a one-sentence description
- python: run arbitrary Python on the VM

Includes milestone support for dense partial rewards, VLM-based screenshot evaluation, 5 example tasks (notepad, folder, calc, clear-browsing-data for Chrome and Edge), and 22 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
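A hypothetical task file illustrating the four check types. The field names below are illustrative only, not the exact schema this PR defines:

```yaml
# Hypothetical task definition — keys are illustrative, not the real schema
id: notepad-save-file
setup:
  - "New-Item -ItemType Directory -Force C:\\Users\\user\\demo"
checks:
  - type: command          # run on the VM, compare output
    command: "Get-Content C:\\Users\\user\\demo\\notes.txt"
    expected_output: "hello"
  - type: file             # existence / content check
    path: "C:\\Users\\user\\demo\\notes.txt"
    contains: "hello"
  - type: screenshot       # VLM judges against a one-sentence description
    description: "Notepad is open with the text 'hello' visible"
  - type: python           # arbitrary Python executed on the VM
    code: "import os; assert os.path.exists(r'C:\\Users\\user\\demo\\notes.txt')"
```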
RLEnvironment.evaluate_dense() uses TaskConfig milestones to compute partial credit (milestones_passed / total). This gives GRPO a gradient signal even when no task fully completes: an agent passing 3/5 milestones gets reward 0.6 instead of 0.0 under binary evaluation.

- evaluate_dense(): milestone-based evaluation, falls back to binary
- load_task_config(): convenience method to set TaskConfig
- collect_rollout() uses dense rewards when milestones are defined
- reset() uses TaskConfig for task loading (bypasses server lookup)
- Trajectory info includes milestone_score, binary_score, counts

9 new tests, all passing. No changes to existing evaluate() behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
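The partial-credit computation reduces to milestones passed over total. A minimal sketch (the function name and signature are assumptions, not the actual RLEnvironment API):

```python
def dense_reward(milestone_results: list[bool]) -> float:
    """Fraction of milestones passed. Returns 0.0 for an empty list;
    the caller is assumed to fall back to binary evaluation when no
    milestones are defined."""
    if not milestone_results:
        return 0.0
    return sum(milestone_results) / len(milestone_results)

# An agent passing 3 of 5 milestones earns 0.6 instead of a flat 0.0.
print(dense_reward([True, True, True, False, False]))  # 0.6
```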
Setup commands were being double-wrapped in python -c by the execute path. They are now passed through as-is to WAA's execute handler.

Validated against a live WAA VM: milestones evaluate correctly (both the VLM screenshot check and the command check work).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
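The essence of the bug, as a simplified sketch (helper names are hypothetical; the real code paths differ):

```python
def wrap_buggy(cmd: str) -> str:
    # Old behavior: unconditionally wrap the setup command in python -c,
    # which breaks commands that are already complete PowerShell/Python
    # invocations.
    return f'python -c "{cmd}"'

def wrap_fixed(cmd: str) -> str:
    # New behavior: pass the command through unchanged and let WAA's
    # execute handler decide how to run it.
    return cmd

cmd = "New-Item -ItemType File C:\\demo.txt"
print(wrap_buggy(cmd))  # wrapped: no longer a valid PowerShell invocation
print(wrap_fixed(cmd))  # unchanged
```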
Validates the full chain: TaskConfig YAML → RLEnvironment → collect_rollout → dense rewards → TRL rollout_func output shape.

Key test: multiple_rollouts_produce_reward_variance proves that milestone-based rewards produce [1.0, 0.67, 0.33, 0.0] across 4 rollouts. GRPO can compute meaningful advantages from this even when binary task completion is 0%.

5 tests, no VM or GPU required (uses a mock adapter).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
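Why reward variance matters for GRPO: advantages are rewards relative to the group, so a group of identical rewards yields zero learning signal. A sketch under the simplest group-relative baseline (mean-centering only; the actual trainer may also normalize by the group standard deviation):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages for one rollout group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Milestone-based rewards from the key test: four rollouts, varied credit.
dense = group_advantages([1.0, 0.67, 0.33, 0.0])
binary = group_advantages([0.0, 0.0, 0.0, 0.0])

print([round(a, 2) for a in dense])  # [0.5, 0.17, -0.17, -0.5]
print(binary)                        # [0.0, 0.0, 0.0, 0.0] — no signal
```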
Summary
- command: PowerShell/Python output
- file: exists/contains
- screenshot: VLM judge
- python: arbitrary code

How it works
Users write YAML, our code translates to WAA evaluator format at runtime:
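A minimal sketch of that translation, assuming a hypothetical evaluator schema on both sides (the real WAA evaluator format and the YAML keys may differ):

```python
def to_waa_evaluator(check: dict) -> dict:
    """Translate one YAML check dict into a hypothetical WAA evaluator
    entry. All keys here are illustrative, not the real schema."""
    kind = check["type"]
    if kind == "command":
        return {"func": "run_command",
                "command": check["command"],
                "expected": check.get("expected_output")}
    if kind == "file":
        return {"func": "check_file",
                "path": check["path"],
                "contains": check.get("contains")}
    if kind == "screenshot":
        return {"func": "vlm_judge",
                "description": check["description"]}
    if kind == "python":
        return {"func": "run_python",
                "code": check["code"]}
    raise ValueError(f"unknown check type: {kind}")

print(to_waa_evaluator({"type": "file",
                        "path": "C:\\notes.txt",
                        "contains": "hello"}))
```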
Test plan
🤖 Generated with Claude Code