
Do not run CommandShell check during object creation#841

Closed
amaslenn wants to merge 38 commits into main from am/no-runtime-shell-check

Conversation

@amaslenn
Contributor

Summary

Do not run the CommandShell check during object creation. This addresses an issue with building documentation on Windows.
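The change can be sketched as follows. This is a minimal illustration, not the actual src/cloudai/util/command_shell.py; the class shape and the Popen call are simplified assumptions.

```python
import subprocess
from pathlib import Path


class CommandShell:
    """Sketch: the executable check is deferred from __init__ to execute()."""

    def __init__(self, shell: str = "/bin/bash") -> None:
        # No filesystem check here, so the class can be imported and
        # instantiated on Windows (e.g. while building documentation).
        self.shell = shell

    def execute(self, command: str) -> subprocess.Popen:
        # The check now runs only when a command is actually executed.
        if not Path(self.shell).exists():
            raise FileNotFoundError(f"{self.shell} does not exist")
        return subprocess.Popen([self.shell, "-c", command])
```

Construction never touches the filesystem, so documentation tooling can import and instantiate the class on a system without /bin/bash; only a real execute() call fails there.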

Test Plan

  1. CI
  2. Docs build on Windows.

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Added per-test exclude_nodes support across Slurm path (models, parsing, allocation, command generation), switched SlurmSystem.cmd_shell to use a default_factory, added git submodule init support in installers, adjusted Megatron-Bridge launcher/mount and container env handling, and updated tests and config TOMLs accordingly.

Changes

Cohort / File(s) Summary
Slurm core & allocation
src/cloudai/systems/slurm/slurm_system.py, src/cloudai/systems/slurm/slurm_command_gen_strategy.py, src/cloudai/systems/slurm/slurm_installer.py
Added exclude_nodes plumbing: parsing, grouping, allocation, and error behavior; added SlurmNodeState.RESERVED grouping; switched cmd_shell field to default_factory=CommandShell; updated command-gen to emit --exclude and include exclude_nodes in cache keys; refactored installer to support submodule init.
Models / TestRun / Parser
src/cloudai/models/scenario.py, src/cloudai/_core/test_scenario.py, src/cloudai/test_scenario_parser.py
Added exclude_nodes field to TestRun/TestRunModel and threading through parser to preserve per-test exclusions.
Git repo installables & installers
src/cloudai/_core/installables.py, src/cloudai/systems/kubernetes/kubernetes_installer.py, src/cloudai/systems/slurm/slurm_installer.py
Added init_submodules: bool to GitRepo; refactored clone/setup flow and introduced _init_submodules in installers; ensure repo path is set only after successful setup and remove partial clones on failure.
Megatron-Bridge launcher & mounts
src/cloudai/workloads/megatron_bridge/megatron_bridge.py, src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py, tests/ref_data/megatron-bridge.sbatch
Stopped implicit mount_as normalization; added container-runtime env var exports into wrapper script; parse extra srun args into Slurm params; forward gpus_per_node/gres and nodelist/exclude into --additional_slurm_params; adjust mounting logic and tests/ref data to match.
CommandShell & util
src/cloudai/util/command_shell.py
Condensed docstrings and moved the executable FileNotFoundError check to occur immediately in execute(); no public API changes.
Tests — Slurm, installer, command-gen
tests/systems/slurm/..., tests/test_git_repo_installer.py, tests/conftest.py, tests/test_docker_image_cache_manager.py, tests/...
Added/updated tests for exclude_nodes behavior, node allocation, command generation, caching, submodule init success/failure, and fixture tweaks (supports_gpu_directives_cache toggle); updated test helper wiring for Megatron-Bridge repo mount handling.
Configs / TOMLs
conf/experimental/megatron_bridge/test/*, conf/experimental/megatron_bridge/test_scenario/*
Removed mount_as from several TOMLs; added a new test_scenario TOML and minor header year updates/commented micro-batch hint.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through nodes both near and far,
Marking which to skip — exclude by name or star.
I plant fresh shells for each new test to run,
Pull submodules, mount the bridge, then watch commands.
A tidy nibble, then a joyful bound — code gardens humming, neat and sound. 🥕

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Do not run CommandShell check during object creation' accurately describes the main change: deferring the CommandShell executable check from initialization to execution time.
Description check ✅ Passed The description is related to the changeset, explaining that the CommandShell check has been moved from object creation to runtime to address Windows documentation build issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR bundles the stated Windows/docs-build fix (CommandShell no longer checks for /bin/bash at import time) with a set of broader Megatron-Bridge and Slurm enhancements: optional git submodule initialisation for GitRepo, an exclude_nodes field that flows from TOML → model → TestRun → SBATCH header and additional_slurm_params, conditional bind-mounting of the Megatron-Bridge repo via mount_as, container-runtime env-var pre-exports, srun-args forwarding to the inner sbatch, and support for SlurmNodeState.RESERVED nodes.

Key changes:

  • CommandShell: FileNotFoundError guard moved from __init__ to execute(); SlurmSystem switched to default_factory=CommandShell to avoid the check at Pydantic model-class creation time.
  • GitRepo.init_submodules: new opt-in boolean; both Slurm and Kubernetes installers gain _init_submodules + _clone_and_setup_repo helpers with proper cleanup on failure.
  • exclude_nodes: end-to-end plumbing from scenario TOML through TestRunModel/TestRun to #SBATCH --exclude and get_nodes_by_spec; Slurm range syntax is expanded before node filtering.
  • get_nodes_by_spec behavior change: non-empty node specs that resolve to an empty list now raise ValueError instead of silently falling back to unconstrained allocation — a deliberate but breaking change.
  • Megatron-Bridge: the forced mount_as=/opt/Megatron-Bridge default is removed; mounting is now conditional on the user setting the field; GPU directives, node lists, and srun args are forwarded via --additional_slurm_params; container-runtime variables (MELLANOX_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES) are exported in the wrapper script before the launcher call.
  • SlurmNodeState.RESERVED: added to the grouped-nodes dict and to max_avail allocation; also participates in fixed-count allocation via the generic for state in grouped_nodes loop.
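The default_factory switch matters because a plain default value is evaluated when the class body executes, i.e. at import time. The PR uses Pydantic's Field(default_factory=CommandShell); the same eager-versus-deferred distinction can be shown with a plain dataclass (names below are hypothetical):

```python
from dataclasses import dataclass, field

CONSTRUCTED: list[str] = []


class Shell:
    def __init__(self) -> None:
        CONSTRUCTED.append("shell")  # stand-in for an eager /bin/bash check


@dataclass
class SystemModel:
    # An eager default like `cmd_shell: Shell = Shell()` would run Shell()
    # while the class body executes, even if no instance is ever created.
    # default_factory defers construction to instance creation:
    cmd_shell: Shell = field(default_factory=Shell)


assert CONSTRUCTED == []         # defining the class built no Shell
system = SystemModel()
assert CONSTRUCTED == ["shell"]  # the factory ran on instantiation
```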

Confidence Score: 3/5

  • Mostly safe, but two concerns warrant attention before merging to main: a silent short-form flag drop in _parse_srun_args_as_slurm_params and the breaking ValueError in get_nodes_by_spec.
  • The core Windows/docs fix is clean and correct. The exclude_nodes plumbing and init_submodules additions are well-tested and logically sound. The Megatron-Bridge enhancements are substantial and well-covered by new tests. Two issues lower the score: (1) _parse_srun_args_as_slurm_params silently drops single-dash srun flags, which could cause silent misconfiguration; (2) get_nodes_by_spec now raises instead of falling back when explicitly-specified nodes are unavailable — a meaningful behavior change that could break existing scenarios if nodes are temporarily down.
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (short-form flag handling, exclude range-syntax in additional_slurm_params) and src/cloudai/systems/slurm/slurm_system.py (breaking ValueError in get_nodes_by_spec).

Important Files Changed

Filename Overview
src/cloudai/util/command_shell.py Core fix: moves FileNotFoundError guard from __init__ to execute() so CommandShell can be imported/instantiated on Windows without /bin/bash existing; also switches SlurmSystem to default_factory=CommandShell to avoid the check at class-definition time.
src/cloudai/systems/slurm/slurm_system.py Adds exclude_nodes propagation through parse_nodes → get_available_nodes_from_group → group_nodes_by_state; adds RESERVED node state support; introduces a breaking ValueError in get_nodes_by_spec when explicit specs resolve to empty — previously the function silently fell back to unconstrained allocation.
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py Significant feature additions: conditional repo bind-mounting via mount_as, container-runtime env-var pre-exports, additional_slurm_params forwarding (GPU directives, nodelist/exclude, srun args); new _parse_srun_args_as_slurm_params parser silently drops single-dash short-form flags and raw Slurm range syntax in exclude is passed unverified to the launcher.
src/cloudai/_core/installables.py Adds init_submodules: bool = False field to GitRepo; straightforward additive change.
src/cloudai/systems/slurm/slurm_installer.py Refactors _install_one_git_repo to call new _clone_and_setup_repo helper and adds _init_submodules; properly cleans up the repo directory on submodule init failure.
src/cloudai/systems/kubernetes/kubernetes_installer.py Mirrors Slurm installer changes: adds _init_submodules and _clone_and_setup_repo; consistent implementation with cleanup on failure.
src/cloudai/_core/test_scenario.py Adds exclude_nodes: List[str] field to TestRun dataclass; clean additive change with proper default factory.
src/cloudai/models/scenario.py Adds exclude_nodes field to TestRunModel with helpful Slurm range-syntax description; clean additive model change.
src/cloudai/systems/slurm/slurm_command_gen_strategy.py Adds #SBATCH --exclude directive when exclude_nodes is set and no explicit nodelist is present; includes exclude_nodes in the node-spec cache key to prevent stale hits across different test runs.
src/cloudai/workloads/megatron_bridge/megatron_bridge.py Removes forced mount_as assignment in _select_megatron_bridge_repo; the repo is now returned as-is so the caller decides whether to bind-mount it.

Sequence Diagram

sequenceDiagram
    participant TOML as scenario.toml
    participant TSP as TestScenarioParser
    participant TR as TestRun
    participant SCGS as SlurmCommandGenStrategy
    participant SS as SlurmSystem
    participant MBS as MegatronBridgeSlurmStrategy

    TOML->>TSP: exclude_nodes, git_repos[mount_as, init_submodules]
    TSP->>TR: TestRun(exclude_nodes=[...])
    TR->>SCGS: _append_nodes_related_directives()
    SCGS->>SCGS: emit #SBATCH --exclude (no nodelist case)
    SCGS->>SS: get_nodes_by_spec(nnodes, nodes, exclude_nodes)
    SS->>SS: parse_nodes() → expand ranges → filter excluded
    SS->>SS: get_available_nodes_from_group() → group_nodes_by_state()
    SS-->>SCGS: (num_nodes, node_list) or ValueError

    TR->>MBS: gen_exec_command()
    MBS->>MBS: _container_runtime_env_exports() → export MELLANOX/NVIDIA_VISIBLE_DEVICES
    MBS->>MBS: _build_launcher_parts() → conditional -cm mount_as
    MBS->>MBS: additional_slurm_params: gpus-per-node, nodelist/exclude, srun-args
    MBS->>MBS: _wrap_launcher_for_job_id → wrapper.sh

Comments Outside Diff (2)

  1. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, lines 638-651

    P1 Short-form srun flags are silently dropped

    _parse_srun_args_as_slurm_params only processes tokens starting with --. Any single-dash short-form flags (e.g., -p partition, -N 4, -C gpu) in extra_srun_args are silently ignored — the i += 1 at the bottom of the loop still advances past them, but no params.append is reached. This could cause subtle misconfiguration if a user or system uses short-form arguments.

    For example, extra_srun_args = "-p debug --exclusive" would produce ["exclusive"] and silently drop -p debug.

    Consider either documenting the limitation clearly or extending the parser to handle single-dash flags:

    elif tok.startswith("-") and not tok.startswith("--"):
        # Short-form flag: skip its value token too (e.g. "-p partition")
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            i += 1
        # Not forwarded as a Slurm param; warn instead of dropping silently
        logging.warning("Ignoring unsupported short-form srun flag: %s", tok)
  2. src/cloudai/systems/slurm/slurm_system.py, lines 477-486

    P2 Breaking behavior change: ValueError replaces silent fallback

    Previously, when non-empty nodes specs (e.g. ["main:group1:3"]) resolved to an empty list (all nodes DRAIN/DOWN or all excluded), get_nodes_by_spec silently returned (num_nodes, []), letting Slurm pick any available nodes in the partition.

    The new code raises a ValueError ("Cannot fall back to unconstrained allocation"). This is a more defensive and correct design, but it's a breaking change for users who may have existing scenarios where nodes are temporarily unavailable and relied on CloudAI transparently falling back. It would be worth adding a note in the release/changelog and ensuring callers handle this exception gracefully.

    The inconsistency with the empty-nodes case is also notable: when nodes=[] (empty list) + exclude_nodes set, the function still returns (num_nodes, []) without raising, which could feel surprising given the tighter contract now enforced for non-empty nodes.
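The two code paths described above can be summarized in a small sketch. This is a simplification with hypothetical names, not the real get_nodes_by_spec signature:

```python
def resolve_nodes(
    num_nodes: int, nodes: list[str], exclude_nodes: list[str]
) -> tuple[int, list[str]]:
    excluded = set(exclude_nodes)
    resolved = [n for n in nodes if n not in excluded]
    if nodes and not resolved:
        # New behavior: an explicit spec that resolves to nothing fails loudly.
        raise ValueError(
            "Node spec resolved to an empty list; "
            "cannot fall back to unconstrained allocation"
        )
    if not nodes:
        # Empty spec (even with exclude_nodes set) still lets Slurm pick nodes.
        return num_nodes, []
    return len(resolved), resolved
```

The asymmetry is visible in the sketch: only the non-empty-spec path can raise, while the empty-spec path keeps the old fallback behavior.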

Last reviewed commit: "Update copyright yea..."


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_docker_image_cache_manager.py`:
- Around line 77-78: The function definition for test_cache_docker_image exceeds
the 120-char line limit; split the parameter list across multiple lines to keep
each line <=120 chars while preserving the existing parameter order and
decorator behavior (ensure mock_supports_gpu remains first so bottom-up
decorators still apply). Update the signature for test_cache_docker_image(...)
to place parameters like mock_supports_gpu, mock_check_prerequisites, mock_run,
mock_access, mock_exists, mock_is_file, and slurm_system each on their own line
or grouped so the wrapped lines are under 120 chars and the test still consumes
the mocks as intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 825ca023-36b0-4b33-80f8-5d01bd8c972e

📥 Commits

Reviewing files that changed from the base of the PR and between 216bf6a and 0fb93cd.

📒 Files selected for processing (1)
  • tests/test_docker_image_cache_manager.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_docker_image_cache_manager.py (1)

2-2: ⚠️ Potential issue | 🔴 Critical

Update header year to 2026 to unblock CI.

Line 2 still says 2024-2025, but CI expects 2024-2026, and this is currently failing the pipeline.

🔧 Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_docker_image_cache_manager.py` at line 2, Update the file header
string that currently reads "Copyright (c) 2024-2025 NVIDIA CORPORATION &
AFFILIATES. All rights reserved." to "Copyright (c) 2024-2026 NVIDIA CORPORATION
& AFFILIATES. All rights reserved." so the year range matches CI expectations;
locate the header line in tests/test_docker_image_cache_manager.py (the
copyright comment at the top) and change the year range accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/test_docker_image_cache_manager.py`:
- Line 2: Update the file header string that currently reads "Copyright (c)
2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." to "Copyright
(c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved." so the year
range matches CI expectations; locate the header line in
tests/test_docker_image_cache_manager.py (the copyright comment at the top) and
change the year range accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8cca6567-7836-493c-a3c7-18c7de1627b1

📥 Commits

Reviewing files that changed from the base of the PR and between 0fb93cd and e75edc7.

📒 Files selected for processing (1)
  • tests/test_docker_image_cache_manager.py

@amaslenn amaslenn requested a review from podkidyshev March 20, 2026 08:25
@amaslenn amaslenn force-pushed the am/no-runtime-shell-check branch from 75f6276 to 9ab843a on March 20, 2026 15:14
Comment on lines +466 to +467
elif self.test_run.exclude_nodes:
additional_slurm_params.append(f"exclude={','.join(self.test_run.exclude_nodes)}")

P2 Raw Slurm range syntax in --additional_slurm_params exclude field

self.test_run.exclude_nodes is passed verbatim to the Megatron-Bridge launcher's --additional_slurm_params as exclude=node-[101-104],.... Whether the launcher correctly forwards Slurm range expressions to the inner sbatch call depends on its parsing. In the standard SBATCH --exclude directive (written to the script header on line 229), Slurm handles range expansion natively. But for the --additional_slurm_params path the launcher may or may not perform the same expansion.

It would be safer to expand the ranges before building the exclude parameter here, similar to what parse_nodes does via parse_node_list, or to add a test that validates range syntax is preserved end-to-end through the launcher path.
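Expanding ranges before building the parameter could look like this. A simplified sketch: real Slurm hostlists also allow comma-separated ranges inside the brackets, which this regex does not handle.

```python
import re


def expand_slurm_range(spec: str) -> list[str]:
    """Expand a single range expression such as 'node-[101-104]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if not match:
        return [spec]  # plain hostname, no range to expand
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero-padding, e.g. node-[001-003]
    return [
        f"{prefix}{i:0{width}d}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]
```

The exclude parameter could then be built as `"exclude=" + ",".join(name for spec in exclude_nodes for name in expand_slurm_range(spec))`, so the launcher receives only plain hostnames.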


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/_core/installables.py (1)

92-107: ⚠️ Potential issue | 🟠 Major

Include init_submodules in the installable identity.

init_submodules changes the checkout contents, but GitRepo.__eq__ / __hash__ still key only on url and commit. If installables are deduped, GitRepo(..., init_submodules=True) can collapse into a plain clone and skip the recursive submodule init step. Keep any git_repo-derived identity checks in sync as well.

Suggested change
 class GitRepo(Installable, BaseModel):
@@
     def __eq__(self, other: object) -> bool:
         """Check if two installable objects are equal."""
-        return isinstance(other, GitRepo) and other.url == self.url and other.commit == self.commit
+        return (
+            isinstance(other, GitRepo)
+            and other.url == self.url
+            and other.commit == self.commit
+            and other.init_submodules == self.init_submodules
+        )
 
     def __hash__(self) -> int:
         """Hash the installable object."""
-        return hash((self.url, self.commit))
+        return hash((self.url, self.commit, self.init_submodules))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/_core/installables.py` around lines 92 - 107, GitRepo's identity
currently ignores init_submodules causing different checkouts to be treated
equal; update GitRepo.__eq__ and GitRepo.__hash__ to include the init_submodules
field (i.e., compare self.init_submodules in __eq__ and include it in the tuple
passed to hash), and mirror the same change in any git_repo-derived identity
methods so deduping respects recursive submodule initialization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py`:
- Around line 427-428: The generated SBATCH use of exclude_nodes isn't
persisted: add an exclude_nodes attribute to the TestRunDetails/data model and
include self.test_run.exclude_nodes when building the stored run spec in
store_test_run() in slurm_command_gen_strategy.py so the saved metadata can
reconstruct allocation requests; update any serialization/deserialization logic
that creates or consumes TestRunDetails (and any callers of store_test_run()) to
read/write this new field and ensure tests or storage schema migrations are
adjusted accordingly.

In `@src/cloudai/systems/slurm/slurm_installer.py`:
- Around line 212-226: The status checks for Git repos currently only verify
existence and commit but not submodule initialization, so update
_is_git_repo_installed() and the GitRepo branch of is_installed_one() to also
validate submodules when item.init_submodules is true: call the same
submodule-check logic used in _install_one_git_repo()/_clone_and_setup_repo()
(or factor that check into a helper) to ensure submodules are initialized and at
the correct state before returning success; if the submodule check fails, return
a failing InstallStatusResult with an explanatory message.

In `@src/cloudai/systems/slurm/slurm_system.py`:
- Around line 758-771: The code collapses the “too few nodes returned” case into
the “empty node list” error; update the logic after parse_nodes so that if
parsed_nodes is non-empty but shorter than the originally requested count you
raise a specific shortfall error instead of the empty-list message. Determine
the requested node count from the incoming nodes specification (using the same
logic you use elsewhere—e.g., the routine that resolves group requests or
get_available_nodes_from_group) and compare it to len(parsed_nodes); if
len(parsed_nodes) < requested_count raise a ValueError indicating the requested
count and the number of nodes actually available (referencing parsed_nodes,
nodes and exclude_nodes), otherwise keep the existing empty-list error path when
parsed_nodes is truly empty.

In `@src/cloudai/workloads/megatron_bridge/megatron_bridge.py`:
- Around line 186-196: The helper _select_megatron_bridge_repo currently returns
the repo unchanged which makes an absent mount_as become a no-op; instead when a
Megatron-Bridge repo is found (match on "Megatron-Bridge" in repo.url or
explicit mount_as path) ensure it will be mounted into the container so the
pinned commit controls runtime by returning a repo object that has mount_as set
to "/opt/Megatron-Bridge" if repo.mount_as is falsy; to avoid mutating callers,
return a shallow copy (or new GitRepo instance) with
mount_as="/opt/Megatron-Bridge" when needed (leave any existing mount_as alone)
so configs that omit mount_as still override the image bundle.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 107-124: The file has two issues: _container_runtime_env_exports
currently shlex.quotes values (turning shell-expansions like
NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES into literals) and
_parse_srun_args_as_slurm_params silently drops unsupported tokens (short flags
like -x or non-flag tokens). Fix _container_runtime_env_exports to preserve
shell expansion for keys in CONTAINER_RUNTIME_ENV_VARS by not quoting values
that look like shell expressions (e.g., start with $ or contain parameter
expansion) or by validating and raising an error if a value contains shell
metacharacters—update the logic in _container_runtime_env_exports and reference
CONTAINER_RUNTIME_ENV_VARS and final_env_vars. Fix
_parse_srun_args_as_slurm_params to stop silently discarding tokens from
extra_srun_args: enhance its parsing to accept short flags (e.g., -x),
positional/non-flag tokens, and return them unchanged if they don’t match
--key[=value]/--key value/--flag patterns, or at minimum emit a clear
warning/error when tokens would be dropped so the caller can handle them—update
the parsing code in _parse_srun_args_as_slurm_params and how extra_srun_args is
consumed.

In `@tests/systems/slurm/test_command_gen_strategy.py`:
- Around line 382-383: Replace the weak absence check that uses `assert "#SBATCH
--exclude=" not in content` with a prefix-based check that ensures no line in
the generated `content` starts with the `#SBATCH --exclude` directive; locate
the assertion in the test (in tests/systems/slurm/test_command_gen_strategy.py
where the `content` variable is inspected) and change it to assert that none of
the `content.splitlines()` entries start with the string `#SBATCH --exclude`.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 355-357: The test function
test_mount_as_adds_repo_to_container_mounts currently declares an unused
tmp_path fixture parameter; remove tmp_path from the function signature so the
test becomes def test_mount_as_adds_repo_to_container_mounts(self,
configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]) ->
None to eliminate the unused fixture and silence the ARG002 warning. Ensure no
other code in that test references tmp_path before committing.

---

Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Around line 92-107: GitRepo's identity currently ignores init_submodules
causing different checkouts to be treated equal; update GitRepo.__eq__ and
GitRepo.__hash__ to include the init_submodules field (i.e., compare
self.init_submodules in __eq__ and include it in the tuple passed to hash), and
mirror the same change in any git_repo-derived identity methods so deduping
respects recursive submodule initialization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 24836278-8892-45f8-824f-77b1547f8686

📥 Commits

Reviewing files that changed from the base of the PR and between 1567b71 and 9ab843a.

📒 Files selected for processing (24)
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test_scenario/megatron_bridge_r0.3.0_qwen_30b.toml
  • src/cloudai/_core/installables.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/models/scenario.py
  • src/cloudai/systems/kubernetes/kubernetes_installer.py
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • src/cloudai/systems/slurm/slurm_installer.py
  • src/cloudai/systems/slurm/slurm_system.py
  • src/cloudai/test_scenario_parser.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/conftest.py
  • tests/ref_data/megatron-bridge.sbatch
  • tests/systems/slurm/test_allocation.py
  • tests/systems/slurm/test_command_gen_strategy.py
  • tests/systems/slurm/test_system.py
  • tests/test_docker_image_cache_manager.py
  • tests/test_git_repo_installer.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
💤 Files with no reviewable changes (3)
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml

Comment on lines +427 to +428
if self.test_run.exclude_nodes:
content.append(f"#SBATCH --exclude={','.join(self.test_run.exclude_nodes)}")

⚠️ Potential issue | 🟡 Minor

Persist exclude_nodes in the stored run spec.

#SBATCH --exclude= only exists in the generated batch script. store_test_run() still dumps the raw srun command, and TestRunDetails has no exclude_nodes field, so the saved run metadata can no longer reconstruct the actual allocation request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py` around lines 427 -
428, The generated SBATCH use of exclude_nodes isn't persisted: add an
exclude_nodes attribute to the TestRunDetails/data model and include
self.test_run.exclude_nodes when building the stored run spec in
store_test_run() in slurm_command_gen_strategy.py so the saved metadata can
reconstruct allocation requests; update any serialization/deserialization logic
that creates or consumes TestRunDetails (and any callers of store_test_run()) to
read/write this new field and ensure tests or storage schema migrations are
adjusted accordingly.

Comment on lines +212 to +226
if item.init_submodules:
res = self._init_submodules(repo_path)
if not res.success:
return res
item.installed_path = repo_path
msg = f"Git repository already exists at {repo_path}."
logging.debug(msg)
return InstallStatusResult(True, msg)

res = self._clone_and_setup_repo(item, repo_path)
if not res.success:
return res

item.installed_path = repo_path
return InstallStatusResult(True)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Base installer orchestration =="
BASE_INSTALLER_FILE="$(fd 'base_installer\.py$' src | head -n 1 || true)"
if [ -n "${BASE_INSTALLER_FILE}" ]; then
  sed -n '1,260p' "${BASE_INSTALLER_FILE}"
fi

echo
echo "== Git repo installed checks =="
rg -n -C3 'def (_is_git_repo_installed|is_installed_one)|is_installed_one\(' \
  src/cloudai/systems/slurm/slurm_installer.py \
  src/cloudai/systems/kubernetes/kubernetes_installer.py

Repository: NVIDIA/cloudai


🏁 Script executed:

# Get full implementation of is_installed_one and _is_git_repo_installed in slurm_installer.py
echo "=== slurm_installer.py: is_installed_one() ===" 
sed -n '139,170p' src/cloudai/systems/slurm/slurm_installer.py

echo
echo "=== slurm_installer.py: _is_git_repo_installed() ===" 
sed -n '446,485p' src/cloudai/systems/slurm/slurm_installer.py

echo
echo "=== kubernetes_installer.py: is_installed_one() ===" 
sed -n '110,145p' src/cloudai/systems/kubernetes/kubernetes_installer.py

echo
echo "=== Check if init_submodules is validated in status checks ==="
rg -n 'init_submodules' src/cloudai/systems/slurm/slurm_installer.py src/cloudai/systems/kubernetes/kubernetes_installer.py



🏁 Script executed:

# Verify if _verify_commit checks submodules state
echo "=== _verify_commit implementation ===" 
rg -n -A10 'def _verify_commit' src/cloudai/systems/slurm/slurm_installer.py | head -20

Repository: NVIDIA/cloudai

Length of output: 700


Add init_submodules validation to status check methods.

The _is_git_repo_installed() method (and is_installed_one() in Kubernetes) validates only repository existence and commit, but does not check if submodules are initialized when item.init_submodules is true. While the current orchestration doesn't pre-check status before installing, the incomplete status API creates a correctness issue: external calls to is_installed() or future optimizations that check status first will incorrectly report a repository as installed if it exists at the correct commit but lacks initialized submodules.

Update _is_git_repo_installed() and the Kubernetes is_installed_one() GitRepo branch to verify submodules state when applicable, similar to the checks already present in _install_one_git_repo() and _clone_and_setup_repo().

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 218-218: debug() call on root logger

(LOG015)


[warning] 219-219: Boolean positional value in function call

(FBT003)


[warning] 226-226: Boolean positional value in function call

(FBT003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/slurm/slurm_installer.py` around lines 212 - 226, The
status checks for Git repos currently only verify existence and commit but not
submodule initialization, so update _is_git_repo_installed() and the GitRepo
branch of is_installed_one() to also validate submodules when
item.init_submodules is true: call the same submodule-check logic used in
_install_one_git_repo()/_clone_and_setup_repo() (or factor that check into a
helper) to ensure submodules are initialized and at the correct state before
returning success; if the submodule check fails, return a failing
InstallStatusResult with an explanatory message.

Comment on lines +758 to +771
parsed_nodes = self.parse_nodes(nodes, exclude_nodes=exclude_nodes)
if parsed_nodes:
num_nodes = len(parsed_nodes)
node_list = parsed_nodes
elif nodes:
reason = (
f"after excluding nodes {exclude_nodes}"
if exclude_nodes
else "— no nodes are available (all may be DRAIN/DOWN)"
)
raise ValueError(
f"Node specifications {nodes} resolved to an empty node list {reason}. "
"Cannot fall back to unconstrained allocation."
)

⚠️ Potential issue | 🟡 Minor

Don't collapse “too few nodes” into “empty node list.”

Lines 762-771 treat any empty parsed_nodes result as if nothing matched, but get_available_nodes_from_group() also returns [] when only part of the request can be satisfied. A request like 4 nodes from a 3-node remainder now raises “resolved to an empty node list” instead of reporting the actual shortfall.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 768-771: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/slurm/slurm_system.py` around lines 758 - 771, The code
collapses the “too few nodes returned” case into the “empty node list” error;
update the logic after parse_nodes so that if parsed_nodes is non-empty but
shorter than the originally requested count you raise a specific shortfall error
instead of the empty-list message. Determine the requested node count from the
incoming nodes specification (using the same logic you use elsewhere—e.g., the
routine that resolves group requests or get_available_nodes_from_group) and
compare it to len(parsed_nodes); if len(parsed_nodes) < requested_count raise a
ValueError indicating the requested count and the number of nodes actually
available (referencing parsed_nodes, nodes and exclude_nodes), otherwise keep
the existing empty-list error path when parsed_nodes is truly empty.
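One way to separate the two failure modes is to build the error message from both the requested count and what actually survived filtering. This is an illustrative sketch; the function and argument names do not match the cloudai codebase:

```python
def describe_allocation_failure(requested: int, available: list[str], excluded: list[str]) -> str:
    """Distinguish 'nothing matched' from 'matched, but too few nodes'."""
    if not available:
        why = f"after excluding {excluded}" if excluded else "no nodes available"
        return f"Resolved to an empty node list ({why})."
    if len(available) < requested:
        # The shortfall case: report how many were found, not an empty list.
        return f"Only {len(available)} of {requested} requested nodes available: {available}."
    return "ok"
```

With this split, a request for 4 nodes against a 3-node remainder reports the shortfall instead of claiming the list was empty.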

Comment on lines 186 to +196
def _select_megatron_bridge_repo(git_repos: list[GitRepo]) -> GitRepo | None:
"""Return the Megatron-Bridge repo from `git_repos` (normalized to mount_as=/opt/Megatron-Bridge)."""
"""
Return the Megatron-Bridge repo from `git_repos`.

When the user sets ``mount_as`` (e.g. ``/opt/Megatron-Bridge``), the installed clone will be bind-mounted
into the container at that path, overriding whatever the container image ships. When ``mount_as`` is *not*
set the container's built-in ``/opt/Megatron-Bridge`` is used.
"""
for repo in git_repos:
if "Megatron-Bridge" in repo.url or (repo.mount_as or "").rstrip("/") == "/opt/Megatron-Bridge":
return repo if repo.mount_as else repo.model_copy(update={"mount_as": "/opt/Megatron-Bridge"})
return repo

⚠️ Potential issue | 🟠 Major

Don't make the repo pin a no-op.

Returning the repo unchanged here means configs that omit mount_as no longer override /opt/Megatron-Bridge inside the container. The job now uses the image-bundled checkout, while [[git_repos]].commit is still required and cloned locally, so the declared pin no longer controls the runtime version. Either keep the implicit mount or stop requiring/installing the repo when mount_as is unset.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/megatron_bridge/megatron_bridge.py` around lines 186 -
196, The helper _select_megatron_bridge_repo currently returns the repo
unchanged which makes an absent mount_as become a no-op; instead when a
Megatron-Bridge repo is found (match on "Megatron-Bridge" in repo.url or
explicit mount_as path) ensure it will be mounted into the container so the
pinned commit controls runtime by returning a repo object that has mount_as set
to "/opt/Megatron-Bridge" if repo.mount_as is falsy; to avoid mutating callers,
return a shallow copy (or new GitRepo instance) with
mount_as="/opt/Megatron-Bridge" when needed (leave any existing mount_as alone)
so configs that omit mount_as still override the image bundle.
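The non-mutating copy the prompt describes can be sketched like this. `RepoSpec` is a simplified stand-in for the real `GitRepo` Pydantic model (which would use `model_copy(update=...)` instead of `dataclasses.replace`):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RepoSpec:
    """Simplified stand-in for GitRepo, for illustration only."""

    url: str
    mount_as: Optional[str] = None


def normalize_bridge_repo(repo: RepoSpec) -> RepoSpec:
    """Ensure the pinned clone overrides the image's /opt/Megatron-Bridge."""
    if repo.mount_as:
        return repo  # explicit mount_as is left alone
    # Return a copy so the caller's object is not mutated.
    return replace(repo, mount_as="/opt/Megatron-Bridge")
```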

Comment on lines +107 to +124
def _container_runtime_env_exports(self) -> list[str]:
"""
Build ``export`` lines for container-runtime env vars.

Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
are consumed by the NVIDIA container toolkit / enroot at container-creation
time to decide which devices to mount. They must be present in the process
environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
Slurm inherits them into the job and ``srun`` passes them to the container
runtime. Exporting them in the wrapper script (which runs on the submit
node) achieves this. The same variables are still passed via ``-cb`` as
well, so they are also set inside the container for any runtime readers.
"""
lines: list[str] = []
for key, value in sorted(self.final_env_vars.items()):
if key in self.CONTAINER_RUNTIME_ENV_VARS:
lines.append(f"export {key}={shlex.quote(str(value))}")
return lines

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import shlex
import subprocess

value = "$CUDA_VISIBLE_DEVICES"
rendered = f"export NVIDIA_VISIBLE_DEVICES={shlex.quote(value)}; printf '%s\\n' \"$NVIDIA_VISIBLE_DEVICES\""
print("Generated wrapper line:", rendered)

literal = subprocess.run(
    ["bash", "-lc", f"CUDA_VISIBLE_DEVICES=0,1; {rendered}"],
    check=True,
    capture_output=True,
    text=True,
)
expanded = subprocess.run(
    [
        "bash",
        "-lc",
        'CUDA_VISIBLE_DEVICES=0,1; export NVIDIA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES"; printf "%s\\n" "$NVIDIA_VISIBLE_DEVICES"',
    ],
    check=True,
    capture_output=True,
    text=True,
)

print("Current result :", literal.stdout.strip())
print("Expanded result:", expanded.stdout.strip())
PY

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

Generated wrapper line: export NVIDIA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'; printf '%s\n' "$NVIDIA_VISIBLE_DEVICES"
Current result : $CUDA_VISIBLE_DEVICES
Expanded result: 0,1

Script executed:

#!/bin/bash

# First, find and read the exact file to verify the implementation
if [ -f "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" ]; then
  echo "=== Lines 107-124 (container_runtime_env_exports) ==="
  sed -n '107,124p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"
  
  echo -e "\n=== Checking for _build_custom_bash_env_exports context ==="
  # Find where _build_custom_bash_env_exports is defined and show relevant lines
  rg -n "_build_custom_bash_env_exports" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -A 20 | head -40
  
  echo -e "\n=== Checking for _parse_srun_args_as_slurm_params ==="
  rg -n "_parse_srun_args_as_slurm_params" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -A 15 | head -40
  
  echo -e "\n=== Checking for CONTAINER_RUNTIME_ENV_VARS definition ==="
  rg -n "CONTAINER_RUNTIME_ENV_VARS" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -B 2 -A 5
fi

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

=== Lines 107-124 (container_runtime_env_exports) ===
    def _container_runtime_env_exports(self) -> list[str]:
        """
        Build ``export`` lines for container-runtime env vars.

        Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
        are consumed by the NVIDIA container toolkit / enroot at container-creation
        time to decide which devices to mount.  They must be present in the process
        environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
        Slurm inherits them into the job and ``srun`` passes them to the container
        runtime.  Exporting them in the wrapper script (which runs on the submit
        node) achieves this.  The same variables are still passed via ``-cb`` as
        well, so they are also set inside the container for any runtime readers.
        """
        lines: list[str] = []
        for key, value in sorted(self.final_env_vars.items()):
            if key in self.CONTAINER_RUNTIME_ENV_VARS:
                lines.append(f"export {key}={shlex.quote(str(value))}")
        return lines

=== Checking for _build_custom_bash_env_exports context ===
95:    def _build_custom_bash_env_exports(self) -> list[str]:
96-        """
97-        Build repeated -cb entries that export env vars inside the launched Slurm job shell.
98-
99-        We quote each full `export KEY=value` command so `$SLURM_*` and commas survive
100-        argument parsing on the submit node and are expanded/interpreted in the job shell.
101-        """
102-        exports: list[str] = []
103-        for key, value in sorted(self.final_env_vars.items()):
104-            exports.extend(["-cb", shlex.quote(f"export {key}={value}")])
105-        return exports
106-
107-    def _container_runtime_env_exports(self) -> list[str]:
108-        """
109-        Build ``export`` lines for container-runtime env vars.
110-
111-        Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
112-        are consumed by the NVIDIA container toolkit / enroot at container-creation
113-        time to decide which devices to mount.  They must be present in the process
114-        environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
115-        Slurm inherits them into the job and ``srun`` passes them to the container
--
353:            parts.extend(self._build_custom_bash_env_exports())
354-
355-        # Model flags (Megatron-Bridge main-branch API)
356-        add_field("domain", "--domain", args.domain)
357-        if args.use_recipes and "use_recipes" in fields_set:
358-            parts.append("--use_recipes")
359-        if "enable_vboost" in fields_set:
360-            add_field("enable_vboost", "-vb", bool(args.enable_vboost))
361-        if not args.model_family_name:
362-            raise RuntimeError("Missing required cmd_args.model_family_name (maps to -m/--model_family_name).")
363-        if not args.model_recipe_name:
364-            raise RuntimeError("Missing required cmd_args.model_recipe_name (maps to -mr/--model_recipe_name).")
365-        add_field("model_family_name", "-m", args.model_family_name)
366-        add_field("model_recipe_name", "-mr", args.model_recipe_name)
367-        add_field("hidden_size", "--hidden_size", args.hidden_size)
368-        add_field("num_layers", "--num_layers", args.num_layers)
369-        add_field(
370-            "pipeline_model_parallel_layout", "--pipeline_model_parallel_layout", args.pipeline_model_parallel_layout

=== Checking for _parse_srun_args_as_slurm_params ===
139:    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
140-        """
141-        Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
142-
143-        Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
144-        key names without a ``=value`` suffix.
145-        """
146-        params: list[str] = []
147-        tokens = shlex.split(srun_args)
148-        i = 0
149-        while i < len(tokens):
150-            tok = tokens[i]
151-            if tok.startswith("--") and "=" in tok:
152-                key, val = tok[2:].split("=", 1)
153-                params.append(f"{key}={val}")
154-            elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
--
471:                additional_slurm_params.extend(self._parse_srun_args_as_slurm_params(source))
472-
473-        if additional_slurm_params:
474-            parts.extend(["--additional_slurm_params", shlex.quote(";".join(additional_slurm_params))])
475-
476-        # Config variant
477-        add_field("config_variant", "-cv", args.config_variant)
478-        if args.list_config_variants and "list_config_variants" in fields_set:
479-            parts.append("--list_config_variants")
480-
481-        # Extra user args (dict -> string)
482-        if tdef.extra_cmd_args:
483-            parts.append(tdef.extra_args_str)
484-
485-        return parts

=== Checking for CONTAINER_RUNTIME_ENV_VARS definition ===
38-    """
39-
40:    CONTAINER_RUNTIME_ENV_VARS: frozenset[str] = frozenset(
41-        {
42-            "MELLANOX_VISIBLE_DEVICES",
43-            "NVIDIA_VISIBLE_DEVICES",
44-            "NVIDIA_DRIVER_CAPABILITIES",
45-        }
--
120-        lines: list[str] = []
121-        for key, value in sorted(self.final_env_vars.items()):
122:            if key in self.CONTAINER_RUNTIME_ENV_VARS:
123-                lines.append(f"export {key}={shlex.quote(str(value))}")
124-        return lines
125-
126-    def _normalize_recompute_modules(self, val: Any) -> str:
127-        if isinstance(val, list):

Script executed:

#!/bin/bash

# Check the file size and structure
echo "=== File structure ==="
wc -l "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

# Get a broader view of the class to understand context
echo -e "\n=== Class structure and method definitions ==="
rg -n "^\s*(def |class )" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

=== File structure ===
485 src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

=== Class structure and method definitions ===
33:class MegatronBridgeSlurmCommandGenStrategy(SlurmCommandGenStrategy):
48:    def _container_mounts(self) -> list[str]:
52:    def gen_exec_command(self) -> str:
83:    def store_test_run(self) -> None:
89:    def _write_command_to_file(self, command: str, output_path: Path) -> None:
95:    def _build_custom_bash_env_exports(self) -> list[str]:
107:    def _container_runtime_env_exports(self) -> list[str]:
126:    def _normalize_recompute_modules(self, val: Any) -> str:
139:    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
162:    def _normalize_cuda_graph_scope_arg(self, val: Any) -> str:
169:    def _wrap_launcher_for_job_id_and_quiet_output(self, launcher_cmd: str, launcher_python: str) -> str:
240:    def _list_or_comma_str(self, val: str | list[str] | None) -> Optional[str]:
249:    def _build_launcher_parts(  # noqa: C901
265:            def _installed_container_path() -> str:
298:        def add(flag: str, value: Any) -> None:
317:        def add_field(field: str, flag: str, value: Any) -> None:

Script executed:

#!/bin/bash

# Get the full _parse_srun_args_as_slurm_params method
sed -n '139,160p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
        """
        Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.

        Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
        key names without a ``=value`` suffix.
        """
        params: list[str] = []
        tokens = shlex.split(srun_args)
        i = 0
        while i < len(tokens):
            tok = tokens[i]
            if tok.startswith("--") and "=" in tok:
                key, val = tok[2:].split("=", 1)
                params.append(f"{key}={val}")
            elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                params.append(f"{tok[2:]}={tokens[i + 1]}")
                i += 1
            elif tok.startswith("--"):
                params.append(tok[2:])
            i += 1
        return params

Don't turn container-runtime env references into literals.

_build_custom_bash_env_exports() keeps env values shell-interpretable by quoting the entire export statement, but _container_runtime_env_exports() wraps the value itself with shlex.quote(). A config like NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES will be exported as a literal dollar string before sbatch, and the later -cb export cannot fix that because the container runtime consumes it at container-creation time. Preserve shell expansion here, or explicitly reject shell expressions for these keys.

Additionally, _parse_srun_args_as_slurm_params() silently drops any extra_srun_args tokens that don't match its three supported patterns (--key=value, --key value, --flag). Short-form flags like -x and non-flag tokens disappear without warning, risking loss of scheduler-affecting options.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 107 - 124, The file has two issues: _container_runtime_env_exports
currently shlex.quotes values (turning shell-expansions like
NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES into literals) and
_parse_srun_args_as_slurm_params silently drops unsupported tokens (short flags
like -x or non-flag tokens). Fix _container_runtime_env_exports to preserve
shell expansion for keys in CONTAINER_RUNTIME_ENV_VARS by not quoting values
that look like shell expressions (e.g., start with $ or contain parameter
expansion) or by validating and raising an error if a value contains shell
metacharacters—update the logic in _container_runtime_env_exports and reference
CONTAINER_RUNTIME_ENV_VARS and final_env_vars. Fix
_parse_srun_args_as_slurm_params to stop silently discarding tokens from
extra_srun_args: enhance its parsing to accept short flags (e.g., -x),
positional/non-flag tokens, and return them unchanged if they don’t match
--key[=value]/--key value/--flag patterns, or at minimum emit a clear
warning/error when tokens would be dropped so the caller can handle them—update
the parsing code in _parse_srun_args_as_slurm_params and how extra_srun_args is
consumed.
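The quoting asymmetry the scripts demonstrate can be avoided by quoting only values that contain no shell expansion. A minimal sketch, assuming values with `$` are intentional shell expressions (a stricter variant could reject them instead):

```python
import shlex


def export_line(key: str, value: str) -> str:
    """Build an `export` line that preserves shell expansion for values like
    $CUDA_VISIBLE_DEVICES, while still safely quoting plain literals."""
    if "$" in value:
        # Double quotes let the shell expand the expression on the submit node.
        return f'export {key}="{value}"'
    return f"export {key}={shlex.quote(value)}"
```

With this, `NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES` expands before `sbatch` runs, matching the behavior of the existing `-cb` path.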

Comment on lines +138 to +160
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.

Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
i += 1
return params

⚠️ Potential issue | 🟠 Major

Reject unsupported extra_srun_args forms instead of dropping them.

This parser only recognizes long-form --... tokens. Valid Slurm inputs like -A acct or -w node01 — or even a typo missing the second dash — fall through the loop and disappear from --additional_slurm_params with no signal. That can change scheduler placement silently. Either parse short options explicitly or raise on token shapes you do not support.

🛡️ Minimal fail-fast fix
         while i < len(tokens):
             tok = tokens[i]
             if tok.startswith("--") and "=" in tok:
                 key, val = tok[2:].split("=", 1)
                 params.append(f"{key}={val}")
             elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                 params.append(f"{tok[2:]}={tokens[i + 1]}")
                 i += 1
             elif tok.startswith("--"):
                 params.append(tok[2:])
+            else:
+                raise ValueError(
+                    f"Unsupported extra_srun_args token {tok!r}; "
+                    "use --key=value, --key value, or bare --flag forms."
+                )
             i += 1
         return params
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
i += 1
return params
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
else:
raise ValueError(
f"Unsupported extra_srun_args token {tok!r}; "
"use --key=value, --key value, or bare --flag forms."
)
i += 1
return params

Comment on lines +382 to +383
assert "#SBATCH --nodelist=node3,node4" in content
assert "#SBATCH --exclude=" not in content

⚠️ Potential issue | 🟡 Minor

Use a prefix check for the missing --exclude assertion.

Line 383 is testing exact list membership, not absence of an exclude directive. If content contains #SBATCH --exclude=node01,node02, this assertion still passes.

🔧 Proposed fix
     assert "#SBATCH --nodelist=node3,node4" in content
-    assert "#SBATCH --exclude=" not in content
+    assert not any(line.startswith("#SBATCH --exclude=") for line in content)
📝 Committable suggestion


Suggested change
assert "#SBATCH --nodelist=node3,node4" in content
assert "#SBATCH --exclude=" not in content
assert "#SBATCH --nodelist=node3,node4" in content
assert not any(line.startswith("#SBATCH --exclude=") for line in content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/systems/slurm/test_command_gen_strategy.py` around lines 382 - 383,
Replace the weak absence check that uses `assert "#SBATCH --exclude=" not in
content` with a prefix-based check that ensures no line in the generated
`content` starts with the `#SBATCH --exclude` directive; locate the assertion in
the test (in tests/systems/slurm/test_command_gen_strategy.py where the
`content` variable is inspected) and change it to assert that none of the
`content.splitlines()` entries start with the string `#SBATCH --exclude`.

Comment on lines +355 to +357
def test_mount_as_adds_repo_to_container_mounts(
self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
) -> None:

🧹 Nitpick | 🔵 Trivial

Remove the unused tmp_path fixture from this test signature.

tmp_path is never read here, so the fixture just adds setup noise and keeps Ruff's ARG002 warning alive.

♻️ Proposed cleanup
-    def test_mount_as_adds_repo_to_container_mounts(
-        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
-    ) -> None:
+    def test_mount_as_adds_repo_to_container_mounts(
+        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
+    ) -> None:
📝 Committable suggestion


Suggested change
def test_mount_as_adds_repo_to_container_mounts(
self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
) -> None:
def test_mount_as_adds_repo_to_container_mounts(
self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
) -> None:
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 356-356: Unused method argument: tmp_path

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 355 - 357, The test function test_mount_as_adds_repo_to_container_mounts
currently declares an unused tmp_path fixture parameter; remove tmp_path from
the function signature so the test becomes def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None to eliminate the
unused fixture and silence the ARG002 warning. Ensure no other code in that test
references tmp_path before committing.

@amaslenn amaslenn closed this Mar 20, 2026
@amaslenn amaslenn deleted the am/no-runtime-shell-check branch March 20, 2026 15:49
