feat: add GPU training automation for verl-agent E2E workflow #87
Conversation
Force-pushed 0631d62 to b8b426d
Force-pushed c295d19 to 39e5d7e
- Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
- Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
- Update find_available_size_and_region(gpu=True) on both providers + protocol
- Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
- Add scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
- Add oa-vm gpu-setup and gpu-train CLI commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
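The fallback lists above drive a try-in-order provisioning strategy. A minimal sketch of how `find_available_size_and_region` might iterate over them — the function name and fallback list are from the commit, but the body, the `check_availability` callback, and the region list are illustrative assumptions:

```python
# Hypothetical sketch of the GPU fallback strategy; the real implementation
# lives in aws_vm.py / azure_vm.py and may differ in structure.
GPU_INSTANCE_TYPE_FALLBACKS = ["p3.8xlarge", "g5.12xlarge", "p3.2xlarge"]

def find_available_size_and_region(check_availability, regions, gpu=True):
    """Return the first (instance_type, region) pair the provider will accept.

    check_availability: callable(instance_type, region) -> bool (assumed helper).
    Non-GPU fallbacks omitted for brevity.
    """
    tried = []
    for instance_type in GPU_INSTANCE_TYPE_FALLBACKS:
        for region in regions:
            if check_availability(instance_type, region):
                return instance_type, region
            tried.append((instance_type, region))
    # Always log the tried sizes on failure (see the cleanup commit below)
    raise RuntimeError(f"No GPU capacity found; tried: {tried}")
```

The same shape applies to the Azure `GPU_VM_SIZE_FALLBACKS` list, with VM sizes in place of instance types.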
Validated all 17 Hydra config paths against verl-agent's actual schema (ppo_trainer.yaml + make_envs()). Key fixes:

- env.env_name: use 'waa_desktop' short name, not a Python import path (verl-agent uses hardcoded dispatch, not dynamic imports)
- Remove env.env_kwargs (doesn't exist); use env.waa.* sub-keys
- Add data.train_files/val_files (required parquet, generated via data_preprocess.prepare --mode visual)
- Add missing overrides: algorithm.gamma, gpu_memory_utilization, ppo_mini_batch_size, filter_overlong_prompts, test_freq
- Add prepare_training_data() and patch_env_manager() steps
- Document the EnvironmentManagerBase integration gap in the decision doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
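The overrides named above could be assembled like this. Only the keys named in the commit are grounded; the full dotted paths (`actor_rollout_ref.*`, `trainer.*`) and every concrete value are assumptions for illustration, not the repo's actual settings:

```python
# Illustrative assembly of Hydra CLI overrides for a verl-agent run.
# Key paths and values below are guesses except where the commit names them.
def build_hydra_overrides(train_parquet: str, val_parquet: str) -> list[str]:
    return [
        "env.env_name=waa_desktop",           # short name, not an import path
        f"data.train_files={train_parquet}",  # required parquet inputs
        f"data.val_files={val_parquet}",
        "data.filter_overlong_prompts=True",
        "algorithm.gamma=1.0",                                    # placeholder value
        "actor_rollout_ref.rollout.gpu_memory_utilization=0.6",   # assumed path
        "actor_rollout_ref.actor.ppo_mini_batch_size=8",          # assumed path
        "trainer.test_freq=10",                                   # assumed path
    ]
```

These would be appended to the training entrypoint's command line, Hydra-style (`python -m ... key=value`).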
…egration

The previous implementation incorrectly assumed verl-agent uses an EnvironmentManagerBase ABC with a hardcoded make_envs() dispatch. Research reveals VAGEN actually uses:

- GymImageEnv protocol (which WAADesktopEnv already implements)
- YAML-based env registry (vagen/configs/env_registry.yaml)
- GymAgentLoop for training-time rollout orchestration

Changes:

- Replace patch_env_manager() with register_waa_env() (YAML registry)
- Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
- Update launch_training() to generate a proper VAGEN training config
- Fix the Integration Gap section in the decision doc (no EnvironmentManagerBase)
- Update the training config YAML with an architecture diagram
- Add 5 new tests for the registration helpers (40 total, all passing)
- Export the new helpers from adapters/__init__.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
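Registry-based registration amounts to appending an entry to VAGEN's YAML file. A sketch under stated assumptions — the helper name `register_waa_env` is from the commit, but the registry schema (key name, `entry_point`, `kwargs`) and the module path are guesses at what VAGEN expects:

```python
from pathlib import Path

def register_waa_env(registry_path: str, server_url: str = "http://localhost:5000") -> None:
    """Append a waa_desktop entry to an env_registry.yaml-style file.

    Idempotent: skips the write if the entry is already present.
    The YAML schema here is an assumption, not VAGEN's documented format.
    """
    entry = (
        "waa_desktop:\n"
        "  entry_point: openadapt_evals.adapters.verl_env:WAADesktopEnv\n"
        "  kwargs:\n"
        f"    server_url: {server_url}\n"
    )
    path = Path(registry_path)
    existing = path.read_text() if path.exists() else ""
    if "waa_desktop:" not in existing:
        path.write_text(existing + entry)
```

Registering via the YAML registry (rather than monkey-patching a dispatch table) is what lets WAADesktopEnv plug in without modifying VAGEN source.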
… DRY violation

Review fixes for the GPU training automation branch:

- Fix is_action_valid: logic was inverted (DONE() → invalid, garbage → valid); now uses a regex match on the original action string
- Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
- Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
- Fix stale branch ref: setup_gpu_training.sh referenced a merged spike branch, now uses main
- Fix stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in the setup script
- Add --recurse-submodules to git clone (verl is a VAGEN submodule)
- Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
- Deduplicate the training command: vm_cli.py now delegates to launch_training()
- Update test count in docs: 21 → 40+
- Add 3 new tests for is_action_valid behavior
- Add a scroll_direction assertion to the existing scroll test

All 43 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove undefined `use_fast` guard — always log tried sizes on failure
- Remove unused PoolManager import in vm_cli.py
- Remove extraneous f-string prefixes
- Remove unused boto3 and SSH_OPTS imports in aws_vm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 39e5d7e to 308cade
WAADesktopEnv now correctly separates:

- server_url (port 5000): Windows VM Flask API (/screenshot, /execute_windows)
- evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)

Previously, the single server_url default pointed at 5001 (the evaluate server only), which caused 404s for screenshots and action execution.

Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G) with a UNIX socket bridge proxy chain to the Azure WAA VM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
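The two-endpoint split can be sketched as a small routing helper. The class name and `url_for` method are hypothetical; the ports and route names are the ones listed above:

```python
from dataclasses import dataclass

# Hypothetical sketch of the two-port split; WAADesktopEnv's real fields
# are server_url and evaluate_url, but this routing helper is illustrative.
_EVALUATE_ROUTES = {"/setup", "/evaluate", "/probe"}

@dataclass
class WAAEndpoints:
    server_url: str = "http://localhost:5000"    # WAA Flask API
    evaluate_url: str = "http://localhost:5001"  # evaluate_server.py

    def url_for(self, route: str) -> str:
        # Setup/eval traffic goes to the evaluate server; screenshots
        # and action execution go to the Flask API.
        base = self.evaluate_url if route in _EVALUATE_ROUTES else self.server_url
        return base + route
```

With a single conflated URL, `/screenshot` requests hit the evaluate server and 404'd — exactly the failure mode this commit fixes.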
- Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
- Add gpu param to create_vm() to select the DL AMI vs standard Ubuntu
- Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3 (Volta/V100), since the OSS NVIDIA driver requires GSP (Turing+)
- Make OPENADAPT_EVALS_BRANCH configurable via env var in the setup script
- Add conda TOS acceptance step (required since Miniconda 2025)

Validated on AWS g5.xlarge with an NVIDIA A10G 24GB GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the successful end-to-end validation of the verl-agent/VAGEN training pipeline on AWS g5.xlarge (A10G 24GB) connecting to the Azure WAA VM. Includes architecture diagrams, proxy chain details, raw test output, version listings, and issues discovered during validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on docs

- Standardize evaluate_url port to 5051 (socat bridge) across all docs
- Add an Artifact Stage column to the validation results table mapping tests to raw output
- Add the docs commit (c2555ef) to the PR #87 commit list
- Clarify the 5050 vs 5051 port mapping in the architecture diagrams and data flow
- Expand e2e_test_output.txt Stage 7/8 with sub-steps matching the README table
- Add an SSH tunnel tip noting the socat bridge is still required

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add a note to gpu_vm_stack_versions.txt explaining that the full pip list is from Stage 5 (vLLM install) and uvicorn was later downgraded by VAGEN
- Add b7efb4f to the commit list in README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…data

- Check GPU compute capability before installing flash-attn; V100s (sm_70) don't support Flash Attention 2 (requires sm_80+) and would fail at build or runtime
- Add post-preparation validation to prepare_training_data(), ensuring the expected parquet files exist and are non-empty rather than silently proceeding with missing data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
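The guard reduces to a comparison on the value that `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` prints (e.g. `8.6` on an A10G). A pure-Python version of the shell check, with the function name chosen for illustration:

```python
# flash-attn 2 requires compute capability >= 8.0 (Ampere: A10G, A100, ...).
# V100 reports 7.0 (sm_70) and must skip the install.
def should_install_flash_attn(compute_cap: str) -> bool:
    try:
        major = int(compute_cap.strip().split(".")[0])
    except (ValueError, IndexError):
        return False  # unparseable output: skip rather than fail the build
    return major >= 8
```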
abrichr left a comment
Review: Addressed 3 feedback items (commit 59811d0)
1. EnvironmentManagerBase is a TODO — Already resolved (no change needed)
EnvironmentManagerBase does not exist in the codebase. It was replaced in commit dc4f088 ("replace EnvironmentManagerBase with VAGEN registry-based env integration"). The current architecture uses:
- VMProvider (Protocol) in openadapt_evals/infrastructure/vm_provider.py for cloud VM management
- VAGEN's GymImageEnv ABC (vendored in adapters/_vendored/) for the RL environment protocol
The verl_agent_decision.md doc (line 259) explicitly documents this: "Earlier analysis referenced an EnvironmentManagerBase ABC... These do not exist in the current VAGEN codebase."
No action required.
2. prepare_training_data() — Fixed: added output validation
The function delegates to VAGEN's examples.data_preprocess.prepare module, which is a real implementation (not a stub). However, it only checked the exit code of the preprocessing command without verifying the output files were actually created correctly.
Fix: Added a post-preparation validation step that checks both train.parquet and test.parquet exist and are non-empty (-s test). On failure, raises RuntimeError with a clear message indicating which files are missing.
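The validation step described above can be sketched as follows. The file names and the RuntimeError behavior are from the review reply; the function name and directory parameter are illustrative:

```python
from pathlib import Path

def validate_training_data(data_dir: str) -> None:
    """Raise RuntimeError unless train.parquet and test.parquet exist
    and are non-empty (the Python analogue of shell `test -s`)."""
    missing = [
        name for name in ("train.parquet", "test.parquet")
        if not (p := Path(data_dir) / name).is_file() or p.stat().st_size == 0
    ]
    if missing:
        raise RuntimeError(
            f"Data preparation produced missing or empty files: {missing}"
        )
```

Running this after the preprocessing subprocess returns catches the case where the command exits 0 but writes nothing.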
3. flash-attn installed unconditionally on V100s — Fixed: GPU arch check
setup_gpu_training.sh now detects GPU compute capability via nvidia-smi --query-gpu=compute_cap before installing flash-attn. The install is only attempted on GPUs with compute capability >= 8.0 (Ampere: A10G, A100, etc.). V100s (sm_70) and older will see a log message and skip the install.
Note: The aws_vm.py GPU fallback list already prefers Ampere instances (g5) over Volta (p3), but the setup script can be run on any GPU VM, so the guard is still necessary.
The generate_env_spec() default server_url is http://localhost:5000 (the WAA Flask API port), not 5001. The test expectation was stale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The two-port WAA architecture uses separate endpoints:

- server_url (port 5000): WAA Flask API for screenshots and actions
- evaluate_url (port 5001): evaluate_server for setup and evaluate

Previously --waa-server defaulted to port 5001 and was assigned to server_url, conflating the two endpoints. This fixes:

- train_verl_e2e.py: --waa-server defaults to 5000, add --evaluate-server
- vm_cli.py gpu-train: same CLI arg fixes, pass evaluate_url through
- train_waa_vagen.yaml: correct server_url to 5000, add evaluate_url
- Fix nested single quotes in register_waa_env (heredoc instead)
- Replace fragile sys.path.insert with importlib.util

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
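The corrected CLI surface for train_verl_e2e.py might look like this. The flag names and defaults are the ones stated in the commit; the help strings and parser structure are illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the fixed arguments: --waa-server now defaults to the Flask
    # API port (5000), and a separate --evaluate-server covers the evaluate
    # endpoint instead of conflating the two.
    p = argparse.ArgumentParser(prog="train_verl_e2e")
    p.add_argument("--waa-server", default="http://localhost:5000",
                   help="WAA Flask API (screenshots, action execution)")
    p.add_argument("--evaluate-server", default="http://localhost:5001",
                   help="evaluate_server (setup, evaluate, probe)")
    return p
```

`vm_cli.py gpu-train` would expose the same pair and pass `evaluate_url` through to `launch_training()`.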
- verl_env.py docstring: server_url example 5001 -> 5000, add evaluate_url
- train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (socat bridge, not the broken Docker port)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
GPU training automation for verl-agent E2E workflow.
Infrastructure:
Config validation (done):
Integration gap (documented):
GPU quota status:
Networking:
Test plan