No single source of truth for "agent-driven" run; --single-sbatch conflates scheduling with search strategy #944

Description

@rutayan-nv

Summary

env_params are only resolved in the agent loop (CloudAIGymEnv.step), so their validity depends on one fact: will this run be agent-driven? Today that fact has no single source of truth. It is scattered across tr.is_dse_job (config), agent.samples_env_params (config), and args.single_sbatch (CLI), and nothing reconciles the three.

Symptom

A config with env_params + an env-aware (RL) agent passes validate_dse_env_params, because the run is a DSE job (is_dse_job) and the agent samples env_params. But when it is run with --single-sbatch, dispatch routes to the grid-unroll path, which calls apply_params_set(combination) with no env_params. The env_params are silently dropped, and the field is left as an unresolved list heading into command generation.

# src/cloudai/cli/handlers.py  (mode decision)
has_dse = any(tr.is_dse_job for tr in test_scenario.test_runs)
if args.single_sbatch or not has_dse:   # <-- single_sbatch forces grid unroll
    handle_non_dse_job(runner, args)
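
The routing can be reduced to a small truth table. The following is a minimal sketch that mirrors only the condition quoted above; FakeTestRun, choose_path, and the "agent_loop" label are hypothetical stand-ins for illustration, not CloudAI names, and the branch not quoted in this issue is assumed to be the agent loop.

```python
from dataclasses import dataclass


@dataclass
class FakeTestRun:
    """Hypothetical stand-in for a test run; carries only the flag the condition reads."""
    is_dse_job: bool


def choose_path(test_runs, single_sbatch: bool) -> str:
    """Mirror the quoted condition and name the path it selects."""
    has_dse = any(tr.is_dse_job for tr in test_runs)
    if single_sbatch or not has_dse:
        # Corresponds to handle_non_dse_job: grid unroll, env_params never resolved.
        return "non_dse"
    # Assumed label for the other branch, which is not quoted in this issue.
    return "agent_loop"


runs = [FakeTestRun(is_dse_job=True)]
print(choose_path(runs, single_sbatch=False))  # agent_loop
print(choose_path(runs, single_sbatch=True))   # non_dse
```

The same DSE config lands on two different paths depending only on the CLI flag, which is the mismatch that validation does not see.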

Root cause (two coupled flaws)

  1. Missing abstraction: there is no single is_agent_driven concept. env_params validity (and dispatch) should gate on it, computed once.
  2. --single-sbatch overloads scheduling with search strategy. It is a scheduling/packaging concern (cram cases into one sbatch) but currently forces the grid strategy, overriding whatever agent the config declared. Scheduling should be orthogonal to env / action-space / search strategy. In the future, --single-sbatch should support agent-driven runs too (e.g. a genetic algorithm launching multiple evaluations in parallel), where env_params are perfectly valid.

Why not a quick guard

Rejecting env_params when args.single_sbatch (the obvious patch) bakes in the exact coupling we want to remove: the day --single-sbatch supports agent-driven runs, env_params would be valid there, yet the guard would still reject them. The fix belongs in the model, not the run handler.

Direction

  • Introduce a single source of truth for agent-driven execution (config/agent capability); gate both dispatch and env_params validation on it.
  • Decouple --single-sbatch from search strategy so it only affects scheduling/packaging and composes with both grid and agent-driven runs.
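
One possible shape for the two points above, as a rough sketch of the target design rather than current behavior. All names here (RunSpec, is_agent_driven, validate_env_params, plan) are hypothetical, and the rule used to decide "agent-driven" is a placeholder assumption; the real rule would come from the config/agent capability.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical view of one test run's config; not a CloudAI class."""
    is_dse_job: bool
    agent_samples_env_params: bool
    env_params: dict = field(default_factory=dict)


def is_agent_driven(run: RunSpec) -> bool:
    """Single source of truth, derived from config only (placeholder rule).

    Note that the CLI scheduling flag is deliberately not an input.
    """
    return run.is_dse_job and run.agent_samples_env_params


def validate_env_params(run: RunSpec) -> None:
    """Gate env_params validity on the same predicate that dispatch uses."""
    if run.env_params and not is_agent_driven(run):
        raise ValueError("env_params require an agent-driven run")


def plan(run: RunSpec, single_sbatch: bool) -> tuple[str, str]:
    """Pick search strategy and packaging independently of each other."""
    validate_env_params(run)
    strategy = "agent" if is_agent_driven(run) else "grid"
    packaging = "single_sbatch" if single_sbatch else "per_case"
    return strategy, packaging


agent_run = RunSpec(is_dse_job=True, agent_samples_env_params=True,
                    env_params={"lr": [0.1, 0.01]})
print(plan(agent_run, single_sbatch=True))   # ('agent', 'single_sbatch')
print(plan(agent_run, single_sbatch=False))  # ('agent', 'per_case')
```

Because the flag only selects packaging, env_params stay valid whenever the run is agent-driven, and a grid run that declares env_params fails at validation instead of dropping them silently.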

Pointers

  • src/cloudai/cli/handlers.py: handle_dry_run_and_run mode decision.
  • src/cloudai/configurator/env_params.py: validate_dse_env_params.
  • src/cloudai/systems/slurm/single_sbatch_runner.py: grid unroll calling apply_params_set without env_params.

Surfaced by the env_params work in #901; the underlying scheduling/strategy coupling predates it.
