feat: add standalone GRPO trainer with WAADirect (no openadapt-ml dependency)#191

Merged
abrichr merged 1 commit into main from feat/standalone-grpo on Mar 23, 2026

abrichr commented on Mar 23, 2026

Summary

  • Standalone GRPO training package (openadapt_evals/training/standalone/) with zero openadapt-ml imports
  • WAADirect HTTP client replaces WAALiveAdapter + RLEnvironment stack, eliminating version coupling and adapter indirection
  • 695 LOC total across 7 modules + CLI wrapper script
  • Designed for later migration to openadapt-ml once training API stabilizes
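
As a rough illustration of the direct-HTTP approach, the sketch below builds the POST request that a client like WAADirect might send. Only the `/execute_windows` path comes from this PR; the helper name `build_execute_request` and the `{"command": ...}` payload shape are assumptions, and the standard library's `urllib` stands in for `requests` so the snippet has no third-party dependency.

```python
import json
from urllib.request import Request


def build_execute_request(server_url: str, command: str) -> Request:
    """Build (but do not send) a POST to the WAA server's /execute_windows.

    The JSON body shape {"command": ...} is an assumption for illustration;
    the real server may expect different fields.
    """
    payload = json.dumps({"command": command}).encode("utf-8")
    return Request(
        server_url.rstrip("/") + "/execute_windows",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_execute_request("http://localhost:5001/", "echo hello")
print(req.get_method(), req.full_url)
```

Sending it would be a single `urllib.request.urlopen(req, timeout=30)` call against a running server, with no adapter layer in between.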

Package structure

| File | LOC | Purpose |
| --- | --- | --- |
| config.py | 30 | TrainingConfig dataclass |
| waa_direct.py | 125 | Direct HTTP: screenshot/click/type/key |
| prompt.py | 145 | SYSTEM_PROMPT + message builder + multi-format parser |
| reward.py | 35 | Group-relative advantages + VLM milestone eval |
| model_loader.py | 51 | HuggingFace + PEFT model loading |
| trainer.py | 281 | GRPOTrainer: rollout collection + training loop |
| \_\_init\_\_.py | 9 | Package exports |
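
For readers unfamiliar with the "group-relative advantages" in reward.py, here is a minimal sketch of the usual GRPO formulation: each rollout's reward is standardized against the mean and standard deviation of its own group. The function name matches the one listed in the commit message, but the signature, the `eps` guard, and the zero-advantage fallback are assumptions, not the code in this PR.

```python
from statistics import mean, pstdev


def compute_group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same task.

    Assumed behaviour: if every rollout scored the same, there is no
    learning signal, so all advantages are zero.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    if sigma < eps:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Four rollouts of one task: two reached the milestone, two did not.
print(compute_group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Successful rollouts get a positive advantage and failed ones a negative advantage of the same magnitude, so the update pushes probability toward the better half of the group.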

Key design decisions

  1. ZERO openadapt-ml imports -- self-contained, portable
  2. WAADirect uses direct HTTP -- requests.post(server_url + "/execute_windows", ...), no adapter abstraction
  3. Prompt format matches SFT -- copies SYSTEM_PROMPT from next_action.py, supports Thought:/Action: format
  4. Multi-format parser -- handles Thought/Action, bare DSL (CLICK(...)), and JSON {"action_type":"click"}
  5. max_new_tokens=2048 default -- 100 was catastrophically low (truncates reasoning)
  6. Screenshot-only milestone evaluation -- VLM judge via OpenAI API, no PowerShell
  7. Fresh screenshot for evaluation -- taken at eval time, not cached
  8. Per-step backward -- avoids OOM on long trajectories
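
To make decision 4 concrete, the sketch below shows one way a parser could accept all three output formats. It is not the `parse_vlm_output_to_action` implementation from prompt.py: the function name `parse_action`, the `(action_type, payload)` return shape, and the `TYPE`/`KEY` verb names are assumptions for illustration.

```python
import json
import re

# Bare DSL call such as CLICK(0.5, 0.3). Verbs other than CLICK are assumed.
_DSL = re.compile(r"\b(CLICK|TYPE|KEY)\((.*?)\)", re.DOTALL)


def parse_action(text):
    """Return (action_type, payload) or None if nothing parses."""
    # Format 1: "Thought: ... Action: ..." -> keep what follows the last Action:
    if "Action:" in text:
        text = text.rsplit("Action:", 1)[1]
    text = text.strip()

    # Format 2: JSON object with an "action_type" key
    if text.startswith("{"):
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "action_type" in obj:
            args = {k: v for k, v in obj.items() if k != "action_type"}
            return str(obj["action_type"]).lower(), args

    # Format 3: bare DSL; the argument string is returned unparsed
    match = _DSL.search(text)
    if match:
        return match.group(1).lower(), match.group(2).strip()
    return None


print(parse_action("Thought: open the menu\nAction: CLICK(0.5, 0.3)"))
print(parse_action('{"action_type":"click"}'))
```

The non-greedy regex stops at the first closing parenthesis, so typed text containing `)` would need a real tokenizer; that is acceptable for a sketch but not for production parsing.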

Usage

python -m openadapt_evals.training.standalone.trainer \
    --task-dir example_tasks \
    --server-url http://localhost:5001 \
    --model Qwen/Qwen3.5-9B \
    --num-steps 10 \
    --output checkpoints/
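
A hedged sketch of how the flags above could map onto a `TrainingConfig` dataclass follows. The flag names and the `max_new_tokens=2048` default are taken from this PR; the field names, which flags are required, and the other default values (copied from the example command) are assumptions.

```python
import argparse
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    task_dir: str
    server_url: str = "http://localhost:5001"  # assumed default
    model: str = "Qwen/Qwen3.5-9B"             # assumed default
    num_steps: int = 10                        # assumed default
    output: str = "checkpoints/"               # assumed default
    max_new_tokens: int = 2048                 # default stated in this PR


def parse_args(argv=None) -> TrainingConfig:
    parser = argparse.ArgumentParser(description="Standalone GRPO trainer (sketch)")
    parser.add_argument("--task-dir", required=True)
    parser.add_argument("--server-url", default=TrainingConfig.server_url)
    parser.add_argument("--model", default=TrainingConfig.model)
    parser.add_argument("--num-steps", type=int, default=TrainingConfig.num_steps)
    parser.add_argument("--output", default=TrainingConfig.output)
    # argparse turns "--task-dir" into "task_dir", matching the field names.
    return TrainingConfig(**vars(parser.parse_args(argv)))


cfg = parse_args(["--task-dir", "example_tasks", "--num-steps", "3"])
print(cfg)
```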

Test plan

  • Syntax check passes (verified)
  • Zero openadapt-ml imports (verified via grep)
  • Line count under 700 (695 actual)
  • CLI --help works
  • End-to-end test on WAA VM (Phase 3 validation)

🤖 Generated with Claude Code

…endency)

Self-contained GRPO training package that eliminates the openadapt-ml
dependency for RL training. Uses direct HTTP calls to WAA Flask server
(WAADirect) instead of the WAALiveAdapter + RLEnvironment stack,
removing version coupling and adapter indirection.

Package structure (695 LOC total):
- config.py: TrainingConfig dataclass
- waa_direct.py: WAADirect HTTP client (screenshot/click/type/key)
- prompt.py: SYSTEM_PROMPT + build_agent_messages + parse_vlm_output_to_action
- reward.py: compute_group_advantages + evaluate_milestones_screenshot
- model_loader.py: load_model_and_processor (HF + PEFT)
- trainer.py: GRPOTrainer with rollout collection + training loop

Key design decisions:
- ZERO openadapt-ml imports (self-contained, will migrate later)
- max_new_tokens=2048 default (100 was catastrophically low)
- Multi-format parser (Thought/Action, bare DSL, JSON)
- Fresh screenshot for evaluation (not cached)
- Per-step backward to avoid OOM on long trajectories
- VLM judge via OpenAI API for milestone evaluation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr abrichr merged commit ba049f7 into main Mar 23, 2026
1 check passed