
feat: add GPU training automation for verl-agent E2E workflow#87

Merged
abrichr merged 15 commits into main from feat/gpu-training-automation
Mar 4, 2026

Conversation


@abrichr abrichr commented Mar 3, 2026

Summary

GPU training automation for verl-agent E2E workflow.

Infrastructure:

  • GPU VM sizes: Azure NC48ads_A100_v4 (2xA100), AWS p3.8xlarge (4xV100)
  • find_available_size_and_region(gpu=True) on both providers
  • scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
  • scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
  • oa-vm gpu-setup and gpu-train CLI commands
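The `find_available_size_and_region(gpu=True)` behavior described above can be sketched as a fallback loop over candidate instance types and regions. This is an illustrative sketch, not the actual provider code: the availability-probe callback and the CPU fallback list are assumptions, while the GPU fallback values come from the commit messages below.

```python
# Hypothetical sketch of the gpu=True fallback selection described above.
# The real implementation lives in aws_vm.py / azure_vm.py; the signature
# and the is_available probe are illustrative assumptions.
GPU_INSTANCE_TYPE_FALLBACKS = ["p3.8xlarge", "g5.12xlarge", "p3.2xlarge"]
CPU_INSTANCE_TYPE_FALLBACKS = ["t3.xlarge", "t3.large"]  # assumed values

def find_available_size_and_region(is_available, regions, gpu=False):
    """Return the first (size, region) pair the availability probe accepts."""
    candidates = GPU_INSTANCE_TYPE_FALLBACKS if gpu else CPU_INSTANCE_TYPE_FALLBACKS
    tried = []
    for size in candidates:
        for region in regions:
            if is_available(size, region):
                return size, region
            tried.append((size, region))
    # Log every (size, region) pair attempted so quota failures are debuggable
    raise RuntimeError(f"No capacity found; tried {tried}")
```

A caller would pass a provider-specific probe (e.g. a quota/capacity API check) as `is_available`.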

Config validation (done):

  • 13/17 Hydra paths verified correct against ppo_trainer.yaml
  • Fixed: env.env_name (use short name, not Python import path)
  • Fixed: env.env_kwargs (doesn't exist, use env.waa.* sub-keys)
  • Added: data.train_files/val_files (parquet required)
  • Added: prepare_training_data() and patch_env_manager() steps

Integration gap (documented):

  • verl-agent uses hardcoded env dispatch in make_envs(), not dynamic imports
  • Need EnvironmentManagerBase adapter wrapping our async WAADesktopEnv
  • Automated patch_env_manager() adds elif branch, but full adapter is TODO

GPU quota status:

  • AWS: p3.8xlarge READY (32 vCPU P-instance quota)
  • Azure: zero modern GPU quota, request needed (1-3 business days)

Networking:

  • AWS recommended: shared VPC, direct private IP between GPU+CPU VMs
  • Azure: requires SSH tunnel (separate VNets per VM)

Test plan

  • GPU size constants import correctly
  • CLI argument parsing works
  • Hydra config paths validated against verl-agent schema
  • No regressions (708 tests pass)
  • GPU VM provisioning (requires p3.8xlarge on AWS)
  • setup_gpu_training.sh on real GPU VM
  • EnvironmentManagerBase adapter for verl-agent
  • E2E training run

@abrichr abrichr force-pushed the spike/verl-agent-integration branch from 0631d62 to b8b426d on March 3, 2026 04:10
@abrichr abrichr force-pushed the feat/gpu-training-automation branch 2 times, most recently from c295d19 to 39e5d7e on March 3, 2026 20:57
abrichr and others added 5 commits March 3, 2026 16:30
- Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
- Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
- Update find_available_size_and_region(gpu=True) on both providers + protocol
- Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
- Add scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
- Add oa-vm gpu-setup and gpu-train CLI commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validated all 17 Hydra config paths against verl-agent's actual schema
(ppo_trainer.yaml + make_envs()). Key fixes:

- env.env_name: use 'waa_desktop' short name, not Python import path
  (verl-agent uses hardcoded dispatch, not dynamic imports)
- Remove env.env_kwargs (doesn't exist), use env.waa.* sub-keys
- Add data.train_files/val_files (required parquet, generated via
  data_preprocess.prepare --mode visual)
- Add missing overrides: algorithm.gamma, gpu_memory_utilization,
  ppo_mini_batch_size, filter_overlong_prompts, test_freq
- Add prepare_training_data() and patch_env_manager() steps
- Document the EnvironmentManagerBase integration gap in decision doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
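The config fixes in this commit amount to a corrected Hydra override list. A sketch of how those overrides might be assembled follows; the leaf names (`env_name`, `gamma`, `gpu_memory_utilization`, `ppo_mini_batch_size`, `filter_overlong_prompts`, `test_freq`) come from the commit text, but the full path prefixes, the concrete values, and the entrypoint module are assumptions for illustration.

```python
# Illustrative Hydra override list reflecting the validated config paths.
# Path prefixes and values are assumptions; the leaf keys are from the commit.
overrides = [
    "env.env_name=waa_desktop",             # short name, not a Python import path
    "data.train_files=data/train.parquet",  # parquet required (assumed location)
    "data.val_files=data/test.parquet",
    "algorithm.gamma=0.99",                 # assumed value
    "actor_rollout_ref.rollout.gpu_memory_utilization=0.6",
    "actor_rollout_ref.actor.ppo_mini_batch_size=8",
    "data.filter_overlong_prompts=True",
    "trainer.test_freq=10",
]
# env.env_kwargs does not exist in the schema; env.waa.* sub-keys are used instead.
cmd = ["python", "-m", "verl.trainer.main_ppo", *overrides]  # assumed entrypoint
```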
replace EnvironmentManagerBase with VAGEN registry-based env integration

The previous implementation incorrectly assumed verl-agent uses an
EnvironmentManagerBase ABC with a hardcoded make_envs() dispatch.
Research reveals VAGEN actually uses:
- GymImageEnv protocol (which WAADesktopEnv already implements)
- YAML-based env registry (vagen/configs/env_registry.yaml)
- GymAgentLoop for training-time rollout orchestration

Changes:
- Replace patch_env_manager() with register_waa_env() (YAML registry)
- Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
- Update launch_training() to generate proper VAGEN training config
- Fix Integration Gap section in decision doc (no EnvironmentManagerBase)
- Update training config YAML with architecture diagram
- Add 5 new tests for registration helpers (40 total, all passing)
- Export new helpers from adapters/__init__.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
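The YAML-registry registration this commit introduces could look roughly like the sketch below. The registry path matches the one named in the commit (vagen/configs/env_registry.yaml); the entry's field names and the module path are illustrative assumptions, not VAGEN's actual schema.

```python
from pathlib import Path

def register_waa_env(registry_path: Path, env_name: str = "waa_desktop") -> str:
    """Append a minimal entry to VAGEN's YAML env registry (sketch).

    Field names below are illustrative assumptions, not VAGEN's real schema.
    """
    entry = (
        f"{env_name}:\n"
        f"  module: openadapt_evals.adapters.verl_env\n"  # assumed module path
        f"  class: WAADesktopEnv\n"
    )
    existing = registry_path.read_text() if registry_path.exists() else ""
    # Idempotent: skip the write if the env is already registered
    if f"{env_name}:" not in existing:
        registry_path.write_text(existing + entry)
    return entry
```

Writing plain YAML text keeps the sketch dependency-free; the real helper may use a YAML library.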
… DRY violation

Review fixes for the GPU training automation branch:

- Fix is_action_valid: was inverted (DONE()→invalid, garbage→valid), now uses
  regex match on original action string
- Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
- Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
- Fix stale branch ref: setup_gpu_training.sh referenced merged spike branch, now uses main
- Fix stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in setup script
- Add --recurse-submodules to git clone (verl is a VAGEN submodule)
- Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
- Deduplicate training command: vm_cli.py now delegates to launch_training()
- Update test count in docs: 21 → 40+
- Add 3 new tests for is_action_valid behavior
- Add scroll_direction assertion to existing scroll test

All 43 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove undefined `use_fast` guard — always log tried sizes on failure
- Remove unused PoolManager import in vm_cli.py
- Remove extraneous f-string prefixes
- Remove unused boto3 and SSH_OPTS imports in aws_vm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr force-pushed the feat/gpu-training-automation branch from 39e5d7e to 308cade on March 3, 2026 21:45
abrichr and others added 6 commits March 3, 2026 20:47
WAADesktopEnv now correctly separates:
- server_url (port 5000): Windows VM Flask API (/screenshot, /execute_windows)
- evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)

Previously, the single server_url default pointed at 5001 (evaluate server only),
which caused 404s for screenshots and action execution.

Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G)
with UNIX socket bridge proxy chain to Azure WAA VM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
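The two-endpoint split this commit describes can be modeled as a small config object. The attribute names and default ports come from the commit text; the class itself and its helper methods are an illustrative sketch, not the actual WAADesktopEnv API.

```python
from dataclasses import dataclass

# Sketch of the server_url / evaluate_url separation described above.
# Attribute names and ports are from the commit; the class is illustrative.
@dataclass
class WAAEndpoints:
    server_url: str = "http://localhost:5000"    # Windows VM Flask API: /screenshot, /execute_windows
    evaluate_url: str = "http://localhost:5001"  # evaluate_server.py: /setup, /evaluate, /probe

    def screenshot_endpoint(self) -> str:
        return f"{self.server_url}/screenshot"

    def setup_endpoint(self) -> str:
        return f"{self.evaluate_url}/setup"
```

Keeping the two base URLs separate is what prevents the 404s described above: a single URL pointed at 5001 cannot serve the 5000-only screenshot and action routes.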
- Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
- Add gpu param to create_vm() to select DL AMI vs standard Ubuntu
- Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3
  (Volta/V100) since OSS NVIDIA driver requires GSP (Turing+)
- Make OPENADAPT_EVALS_BRANCH configurable via env var in setup script
- Add conda TOS acceptance step (required since Miniconda 2025)

Validated on AWS g5.xlarge with NVIDIA A10G 24GB GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the successful end-to-end validation of the verl-agent/VAGEN
training pipeline on AWS g5.xlarge (A10G 24GB) connecting to Azure WAA VM.
Includes architecture diagrams, proxy chain details, raw test output,
version listings, and issues discovered during validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on docs

- Standardize evaluate_url port to 5051 (socat bridge) across all docs
- Add Artifact Stage column to validation results table mapping tests to raw output
- Add docs commit (c2555ef) to PR #87 commit list
- Clarify 5050 vs 5051 port mapping in architecture diagrams and data flow
- Expand e2e_test_output.txt Stage 7/8 with sub-steps matching README table
- Add SSH tunnel tip about socat bridge still being required

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add note to gpu_vm_stack_versions.txt explaining that the full pip list
  is from Stage 5 (vLLM install) and uvicorn was later downgraded by VAGEN
- Add b7efb4f to the commit list in README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…data

- Check GPU compute capability before installing flash-attn; V100s (sm_70)
  don't support Flash Attention 2 (requires sm_80+) and would fail at build
  or runtime
- Add post-preparation validation to prepare_training_data() ensuring the
  expected parquet files exist and are non-empty, rather than silently
  proceeding with missing data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
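The compute-capability guard described above boils down to parsing the string that `nvidia-smi --query-gpu=compute_cap` prints and comparing it against the sm_80 floor. A minimal sketch of that check (the helper name is an assumption; the actual guard lives in setup_gpu_training.sh):

```python
def supports_flash_attn_2(compute_cap: str) -> bool:
    """Flash Attention 2 requires sm_80+ (Ampere); V100 is sm_70.

    `compute_cap` is the value printed by
    `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, e.g. "8.0".
    """
    major, _, minor = compute_cap.strip().partition(".")
    # Tuple comparison handles e.g. "7.5" (Turing) vs the (8, 0) floor
    return (int(major), int(minor or 0)) >= (8, 0)
```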

@abrichr abrichr left a comment


Review: Addressed 3 feedback items (commit 59811d0)

1. EnvironmentManagerBase is a TODO — Already resolved (no change needed)

EnvironmentManagerBase does not exist in the codebase. It was replaced in commit dc4f088 ("replace EnvironmentManagerBase with VAGEN registry-based env integration"). The current architecture uses:

  • VMProvider(Protocol) in openadapt_evals/infrastructure/vm_provider.py for cloud VM management
  • VAGEN's GymImageEnv ABC (vendored in adapters/_vendored/) for RL environment protocol

The verl_agent_decision.md doc (line 259) explicitly documents this: "Earlier analysis referenced an EnvironmentManagerBase ABC... These do not exist in the current VAGEN codebase."

No action required.

2. prepare_training_data() — Fixed: added output validation

The function delegates to VAGEN's examples.data_preprocess.prepare module, which is a real implementation (not a stub). However, it only checked the exit code of the preprocessing command without verifying the output files were actually created correctly.

Fix: Added a post-preparation validation step that checks both train.parquet and test.parquet exist and are non-empty (-s test). On failure, raises RuntimeError with a clear message indicating which files are missing.
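A Python equivalent of that exists-and-non-empty check (the shell version uses `test -s`) might look like this; the helper name and directory layout are assumptions for illustration:

```python
from pathlib import Path

def validate_training_data(data_dir: Path) -> None:
    """Fail loudly if preprocessing did not produce non-empty parquet files.

    Mirrors the `-s`-style check described above; name and layout are assumed.
    """
    missing = [
        name for name in ("train.parquet", "test.parquet")
        if not (data_dir / name).exists() or (data_dir / name).stat().st_size == 0
    ]
    if missing:
        raise RuntimeError(f"Data preparation produced missing/empty files: {missing}")
```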

3. flash-attn installed unconditionally on V100s — Fixed: GPU arch check

setup_gpu_training.sh now detects GPU compute capability via nvidia-smi --query-gpu=compute_cap before installing flash-attn. The install is only attempted on GPUs with compute capability >= 8.0 (Ampere: A10G, A100, etc.). V100s (sm_70) and older will see a log message and skip the install.

Note: The aws_vm.py GPU fallback list already prefers Ampere instances (g5) over Volta (p3), but the setup script can be run on any GPU VM, so the guard is still necessary.

@abrichr abrichr changed the base branch from spike/verl-agent-integration to main March 4, 2026 03:19
@abrichr abrichr marked this pull request as ready for review March 4, 2026 03:20
abrichr and others added 3 commits March 3, 2026 22:23
The generate_env_spec() default server_url is http://localhost:5000
(WAA Flask API port), not 5001. The test expectation was stale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The two-port WAA architecture uses separate endpoints:
- server_url (port 5000): WAA Flask API for screenshots and actions
- evaluate_url (port 5001): evaluate_server for setup and evaluate

Previously --waa-server defaulted to port 5001 and was assigned to
server_url, conflating the two endpoints. This fixes:
- train_verl_e2e.py: --waa-server default 5000, add --evaluate-server
- vm_cli.py gpu-train: same CLI arg fixes, pass evaluate_url through
- train_waa_vagen.yaml: correct server_url to 5000, add evaluate_url
- Fix nested single quotes in register_waa_env (heredoc instead)
- Replace fragile sys.path.insert with importlib.util

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
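The `importlib.util` replacement for the fragile `sys.path.insert` mentioned above is the standard load-module-from-path pattern; the call site in the branch may differ in details, but the mechanics are:

```python
import importlib.util
from pathlib import Path

def load_module_from_path(name: str, path: Path):
    """Load a module by file path without mutating sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Unlike `sys.path.insert`, this cannot accidentally shadow other imports or leave the path mutated after the load.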
- verl_env.py docstring: server_url example 5001 -> 5000, add evaluate_url
- train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (socat bridge, not
  broken Docker port)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr merged commit da17355 into main Mar 4, 2026
1 check passed