
feat: add GPU training automation for verl-agent E2E workflow#87

Merged
abrichr merged 15 commits into main from feat/gpu-training-automation
Mar 4, 2026

Conversation


@abrichr abrichr commented Mar 3, 2026

Summary

GPU training automation for verl-agent E2E workflow.

Infrastructure:

  • GPU VM sizes: Azure NC48ads_A100_v4 (2xA100), AWS p3.8xlarge (4xV100)
  • find_available_size_and_region(gpu=True) on both providers
  • scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
  • scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
  • oa-vm gpu-setup and gpu-train CLI commands
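The `find_available_size_and_region(gpu=True)` behavior described above can be sketched as a fallback loop over candidate instance types and regions. This is an illustrative sketch, not the actual provider code: the availability-probe callback and the CPU fallback list are assumptions, while the GPU fallback values come from the commit messages below.

```python
# Hypothetical sketch of the gpu=True fallback selection described above.
# The real implementation lives in aws_vm.py / azure_vm.py; the signature
# and the is_available probe are illustrative assumptions.
GPU_INSTANCE_TYPE_FALLBACKS = ["p3.8xlarge", "g5.12xlarge", "p3.2xlarge"]
CPU_INSTANCE_TYPE_FALLBACKS = ["t3.xlarge", "t3.large"]  # assumed values

def find_available_size_and_region(is_available, regions, gpu=False):
    """Return the first (size, region) pair the availability probe accepts."""
    candidates = GPU_INSTANCE_TYPE_FALLBACKS if gpu else CPU_INSTANCE_TYPE_FALLBACKS
    tried = []
    for size in candidates:
        for region in regions:
            if is_available(size, region):
                return size, region
            tried.append((size, region))
    # Log every (size, region) pair attempted so quota failures are debuggable
    raise RuntimeError(f"No capacity found; tried {tried}")
```

A caller would pass a provider-specific probe (e.g. a quota/capacity API check) as `is_available`.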

Config validation (done):

  • 13/17 Hydra paths verified correct against ppo_trainer.yaml
  • Fixed: env.env_name (use short name, not Python import path)
  • Fixed: env.env_kwargs (doesn't exist, use env.waa.* sub-keys)
  • Added: data.train_files/val_files (parquet required)
  • Added: prepare_training_data() and patch_env_manager() steps

Integration gap (documented):

  • verl-agent uses hardcoded env dispatch in make_envs(), not dynamic imports
  • Need EnvironmentManagerBase adapter wrapping our async WAADesktopEnv
  • Automated patch_env_manager() adds elif branch, but full adapter is TODO

GPU quota status:

  • AWS: p3.8xlarge READY (32 vCPU P-instance quota)
  • Azure: zero modern GPU quota, request needed (1-3 business days)

Networking:

  • AWS recommended: shared VPC, direct private IP between GPU+CPU VMs
  • Azure: requires SSH tunnel (separate VNets per VM)

Test plan

  • GPU size constants import correctly
  • CLI argument parsing works
  • Hydra config paths validated against verl-agent schema
  • No regressions (708 tests pass)
  • GPU VM provisioning (requires p3.8xlarge on AWS)
  • setup_gpu_training.sh on real GPU VM
  • EnvironmentManagerBase adapter for verl-agent
  • E2E training run

@abrichr abrichr force-pushed the spike/verl-agent-integration branch from 0631d62 to b8b426d on March 3, 2026 04:10
@abrichr abrichr force-pushed the feat/gpu-training-automation branch 2 times, most recently from c295d19 to 39e5d7e on March 3, 2026 20:57
abrichr and others added 5 commits March 3, 2026 16:30
- Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
- Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
- Update find_available_size_and_region(gpu=True) on both providers + protocol
- Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
- Add scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
- Add oa-vm gpu-setup and gpu-train CLI commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validated all 17 Hydra config paths against verl-agent's actual schema
(ppo_trainer.yaml + make_envs()). Key fixes:

- env.env_name: use 'waa_desktop' short name, not Python import path
  (verl-agent uses hardcoded dispatch, not dynamic imports)
- Remove env.env_kwargs (doesn't exist), use env.waa.* sub-keys
- Add data.train_files/val_files (required parquet, generated via
  data_preprocess.prepare --mode visual)
- Add missing overrides: algorithm.gamma, gpu_memory_utilization,
  ppo_mini_batch_size, filter_overlong_prompts, test_freq
- Add prepare_training_data() and patch_env_manager() steps
- Document the EnvironmentManagerBase integration gap in decision doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
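The config fixes in this commit amount to a corrected Hydra override list. A sketch of how those overrides might be assembled follows; the leaf names (`env_name`, `gamma`, `gpu_memory_utilization`, `ppo_mini_batch_size`, `filter_overlong_prompts`, `test_freq`) come from the commit text, but the full path prefixes, the concrete values, and the entrypoint module are assumptions for illustration.

```python
# Illustrative Hydra override list reflecting the validated config paths.
# Path prefixes and values are assumptions; the leaf keys are from the commit.
overrides = [
    "env.env_name=waa_desktop",             # short name, not a Python import path
    "data.train_files=data/train.parquet",  # parquet required (assumed location)
    "data.val_files=data/test.parquet",
    "algorithm.gamma=0.99",                 # assumed value
    "actor_rollout_ref.rollout.gpu_memory_utilization=0.6",
    "actor_rollout_ref.actor.ppo_mini_batch_size=8",
    "data.filter_overlong_prompts=True",
    "trainer.test_freq=10",
]
# env.env_kwargs does not exist in the schema; env.waa.* sub-keys are used instead.
cmd = ["python", "-m", "verl.trainer.main_ppo", *overrides]  # assumed entrypoint
```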
replace EnvironmentManagerBase with VAGEN registry-based env integration

The previous implementation incorrectly assumed verl-agent uses an
EnvironmentManagerBase ABC with a hardcoded make_envs() dispatch.
Research reveals VAGEN actually uses:
- GymImageEnv protocol (which WAADesktopEnv already implements)
- YAML-based env registry (vagen/configs/env_registry.yaml)
- GymAgentLoop for training-time rollout orchestration

Changes:
- Replace patch_env_manager() with register_waa_env() (YAML registry)
- Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
- Update launch_training() to generate proper VAGEN training config
- Fix Integration Gap section in decision doc (no EnvironmentManagerBase)
- Update training config YAML with architecture diagram
- Add 5 new tests for registration helpers (40 total, all passing)
- Export new helpers from adapters/__init__.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
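The YAML-registry registration this commit introduces could look roughly like the sketch below. The registry path matches the one named in the commit (vagen/configs/env_registry.yaml); the entry's field names and the module path are illustrative assumptions, not VAGEN's actual schema.

```python
from pathlib import Path

def register_waa_env(registry_path: Path, env_name: str = "waa_desktop") -> str:
    """Append a minimal entry to VAGEN's YAML env registry (sketch).

    Field names below are illustrative assumptions, not VAGEN's real schema.
    """
    entry = (
        f"{env_name}:\n"
        f"  module: openadapt_evals.adapters.verl_env\n"  # assumed module path
        f"  class: WAADesktopEnv\n"
    )
    existing = registry_path.read_text() if registry_path.exists() else ""
    # Idempotent: skip the write if the env is already registered
    if f"{env_name}:" not in existing:
        registry_path.write_text(existing + entry)
    return entry
```

Writing plain YAML text keeps the sketch dependency-free; the real helper may use a YAML library.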
… DRY violation

Review fixes for the GPU training automation branch:

- Fix is_action_valid: was inverted (DONE()→invalid, garbage→valid), now uses
  regex match on original action string
- Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
- Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
- Fix stale branch ref: setup_gpu_training.sh referenced merged spike branch, now uses main
- Fix stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in setup script
- Add --recurse-submodules to git clone (verl is a VAGEN submodule)
- Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
- Deduplicate training command: vm_cli.py now delegates to launch_training()
- Update test count in docs: 21 → 40+
- Add 3 new tests for is_action_valid behavior
- Add scroll_direction assertion to existing scroll test

All 43 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove undefined `use_fast` guard — always log tried sizes on failure
- Remove unused PoolManager import in vm_cli.py
- Remove extraneous f-string prefixes
- Remove unused boto3 and SSH_OPTS imports in aws_vm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr force-pushed the feat/gpu-training-automation branch from 39e5d7e to 308cade on March 3, 2026 21:45
abrichr and others added 6 commits March 3, 2026 20:47
WAADesktopEnv now correctly separates:
- server_url (port 5000): Windows VM Flask API (/screenshot, /execute_windows)
- evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)

Previously, the single server_url default pointed at 5001 (evaluate server only),
which caused 404s for screenshots and action execution.

Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G)
with UNIX socket bridge proxy chain to Azure WAA VM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
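The two-endpoint split this commit describes can be modeled as a small config object. The attribute names and default ports come from the commit text; the class itself and its helper methods are an illustrative sketch, not the actual WAADesktopEnv API.

```python
from dataclasses import dataclass

# Sketch of the server_url / evaluate_url separation described above.
# Attribute names and ports are from the commit; the class is illustrative.
@dataclass
class WAAEndpoints:
    server_url: str = "http://localhost:5000"    # Windows VM Flask API: /screenshot, /execute_windows
    evaluate_url: str = "http://localhost:5001"  # evaluate_server.py: /setup, /evaluate, /probe

    def screenshot_endpoint(self) -> str:
        return f"{self.server_url}/screenshot"

    def setup_endpoint(self) -> str:
        return f"{self.evaluate_url}/setup"
```

Keeping the two base URLs separate is what prevents the 404s described above: a single URL pointed at 5001 cannot serve the 5000-only screenshot and action routes.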
- Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
- Add gpu param to create_vm() to select DL AMI vs standard Ubuntu
- Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3
  (Volta/V100) since OSS NVIDIA driver requires GSP (Turing+)
- Make OPENADAPT_EVALS_BRANCH configurable via env var in setup script
- Add conda TOS acceptance step (required since Miniconda 2025)

Validated on AWS g5.xlarge with NVIDIA A10G 24GB GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the successful end-to-end validation of the verl-agent/VAGEN
training pipeline on AWS g5.xlarge (A10G 24GB) connecting to Azure WAA VM.
Includes architecture diagrams, proxy chain details, raw test output,
version listings, and issues discovered during validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on docs

- Standardize evaluate_url port to 5051 (socat bridge) across all docs
- Add Artifact Stage column to validation results table mapping tests to raw output
- Add docs commit (c2555ef) to PR #87 commit list
- Clarify 5050 vs 5051 port mapping in architecture diagrams and data flow
- Expand e2e_test_output.txt Stage 7/8 with sub-steps matching README table
- Add SSH tunnel tip about socat bridge still being required

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add note to gpu_vm_stack_versions.txt explaining that the full pip list
  is from Stage 5 (vLLM install) and uvicorn was later downgraded by VAGEN
- Add b7efb4f to the commit list in README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…data

- Check GPU compute capability before installing flash-attn; V100s (sm_70)
  don't support Flash Attention 2 (requires sm_80+) and would fail at build
  or runtime
- Add post-preparation validation to prepare_training_data() ensuring the
  expected parquet files exist and are non-empty, rather than silently
  proceeding with missing data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
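The compute-capability guard described above boils down to parsing the string that `nvidia-smi --query-gpu=compute_cap` prints and comparing it against the sm_80 floor. A minimal sketch of that check (the helper name is an assumption; the actual guard lives in setup_gpu_training.sh):

```python
def supports_flash_attn_2(compute_cap: str) -> bool:
    """Flash Attention 2 requires sm_80+ (Ampere); V100 is sm_70.

    `compute_cap` is the value printed by
    `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, e.g. "8.0".
    """
    major, _, minor = compute_cap.strip().partition(".")
    # Tuple comparison handles e.g. "7.5" (Turing) vs the (8, 0) floor
    return (int(major), int(minor or 0)) >= (8, 0)
```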

@abrichr abrichr left a comment


Review: Addressed 3 feedback items (commit 59811d0)

1. EnvironmentManagerBase is a TODO — Already resolved (no change needed)

EnvironmentManagerBase does not exist in the codebase. It was replaced in commit dc4f088 ("replace EnvironmentManagerBase with VAGEN registry-based env integration"). The current architecture uses:

  • VMProvider(Protocol) in openadapt_evals/infrastructure/vm_provider.py for cloud VM management
  • VAGEN's GymImageEnv ABC (vendored in adapters/_vendored/) for RL environment protocol

The verl_agent_decision.md doc (line 259) explicitly documents this: "Earlier analysis referenced an EnvironmentManagerBase ABC... These do not exist in the current VAGEN codebase."

No action required.

2. prepare_training_data() — Fixed: added output validation

The function delegates to VAGEN's examples.data_preprocess.prepare module, which is a real implementation (not a stub). However, it only checked the exit code of the preprocessing command without verifying the output files were actually created correctly.

Fix: Added a post-preparation validation step that checks both train.parquet and test.parquet exist and are non-empty (-s test). On failure, raises RuntimeError with a clear message indicating which files are missing.
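A Python equivalent of that exists-and-non-empty check (the shell version uses `test -s`) might look like this; the helper name and directory layout are assumptions for illustration:

```python
from pathlib import Path

def validate_training_data(data_dir: Path) -> None:
    """Fail loudly if preprocessing did not produce non-empty parquet files.

    Mirrors the `-s`-style check described above; name and layout are assumed.
    """
    missing = [
        name for name in ("train.parquet", "test.parquet")
        if not (data_dir / name).exists() or (data_dir / name).stat().st_size == 0
    ]
    if missing:
        raise RuntimeError(f"Data preparation produced missing/empty files: {missing}")
```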

3. flash-attn installed unconditionally on V100s — Fixed: GPU arch check

setup_gpu_training.sh now detects GPU compute capability via nvidia-smi --query-gpu=compute_cap before installing flash-attn. The install is only attempted on GPUs with compute capability >= 8.0 (Ampere: A10G, A100, etc.). V100s (sm_70) and older will see a log message and skip the install.

Note: The aws_vm.py GPU fallback list already prefers Ampere instances (g5) over Volta (p3), but the setup script can be run on any GPU VM, so the guard is still necessary.

@abrichr abrichr changed the base branch from spike/verl-agent-integration to main March 4, 2026 03:19
@abrichr abrichr marked this pull request as ready for review March 4, 2026 03:20
abrichr and others added 3 commits March 3, 2026 22:23
The generate_env_spec() default server_url is http://localhost:5000
(WAA Flask API port), not 5001. The test expectation was stale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The two-port WAA architecture uses separate endpoints:
- server_url (port 5000): WAA Flask API for screenshots and actions
- evaluate_url (port 5001): evaluate_server for setup and evaluate

Previously --waa-server defaulted to port 5001 and was assigned to
server_url, conflating the two endpoints. This fixes:
- train_verl_e2e.py: --waa-server default 5000, add --evaluate-server
- vm_cli.py gpu-train: same CLI arg fixes, pass evaluate_url through
- train_waa_vagen.yaml: correct server_url to 5000, add evaluate_url
- Fix nested single quotes in register_waa_env (heredoc instead)
- Replace fragile sys.path.insert with importlib.util

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
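The `importlib.util` replacement for the fragile `sys.path.insert` mentioned above is the standard load-module-from-path pattern; the call site in the branch may differ in details, but the mechanics are:

```python
import importlib.util
from pathlib import Path

def load_module_from_path(name: str, path: Path):
    """Load a module by file path without mutating sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Unlike `sys.path.insert`, this cannot accidentally shadow other imports or leave the path mutated after the load.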
- verl_env.py docstring: server_url example 5001 -> 5000, add evaluate_url
- train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (socat bridge, not
  broken Docker port)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr merged commit da17355 into main Mar 4, 2026
1 check passed