Skip to content

DSE num_nodes sweep writes a phantom NUM_NODES field into cmd_args #943

Description

@rutayan-nv

Summary

During a num_nodes DSE sweep, TestRun.apply_params_set writes the NUM_NODES action key into cmd_args as an undeclared field. Execution is unaffected, but the per-run record is corrupted.

Mechanism

  • param_space adds "NUM_NODES" to the action space when num_nodes is a list, so each action dict contains {"NUM_NODES": <n>, ...}.
  • _apply(key, value) treats every key as a field path into cmd_args, so "NUM_NODES" becomes setattr(cmd_args, "NUM_NODES", n).
  • CmdArgs is configured extra="allow", so instead of raising on the unknown field it silently stores it.

Impact

  • Functionally correct: the workload runs fine and TestRun.num_nodes is still set correctly (via the explicit if "NUM_NODES" in action assignment).
  • The phantom field shows up in cmd_args.model_dump() and therefore in the per-run TestRunDetails dump — recording a config with a NUM_NODES arg the workload never accepts. Anything reading that dump back (reproduction, reporting) sees a fabricated field.
  • It does not leak into the generated shell command (strategies read named fields, not the extras bag).

Root cause

Scope mismatch: action keys address two different objects — cmd_args/extra_env_vars on the TestDefinition, vs num_nodes on the TestRun. _apply only knows how to write into the TestDefinition. NUM_NODES is the single boundary-crossing key, and extra="allow" hides the misroute instead of surfacing it.

Severity

Low — record/repro integrity, not execution. Pre-existing on main; surfaced (not introduced) by the apply_params_set refactor in #901.

Repro

```python
from cloudai.models.workload import CmdArgs
c = CmdArgs()
setattr(c, "NUM_NODES", 2) # what _apply("NUM_NODES", 2) effectively does
print(c.model_dump()) # -> {'NUM_NODES': 2}
```

Suggested fix

Make _apply the single dispatch point: route NUM_NODES to TestRun.num_nodes, everything else into the TestDefinition. Share a NUM_NODES_KEY constant between param_space (producer) and apply_params_set (consumer) so the one scope-crossing key is defined in one place.

Code: src/cloudai/_core/test_scenario.pyTestRun.param_space / TestRun.apply_params_set.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions