
Do not run CommandShell check during object creation#841

Closed
amaslenn wants to merge 38 commits into main from am/no-runtime-shell-check

Conversation

@amaslenn
Contributor

Summary

Do not run the CommandShell check during object creation. This addresses an issue with building documentation on Windows.
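The change can be sketched as follows. This is a minimal illustration, not the actual src/cloudai/util/command_shell.py; the class shape and the Popen call are simplified assumptions.

```python
import subprocess
from pathlib import Path


class CommandShell:
    """Sketch: the executable check is deferred from __init__ to execute()."""

    def __init__(self, shell: str = "/bin/bash") -> None:
        # No filesystem check here, so the class can be imported and
        # instantiated on Windows (e.g. while building documentation).
        self.shell = shell

    def execute(self, command: str) -> subprocess.Popen:
        # The check now runs only when a command is actually executed.
        if not Path(self.shell).exists():
            raise FileNotFoundError(f"{self.shell} does not exist")
        return subprocess.Popen([self.shell, "-c", command])
```

Construction never touches the filesystem, so documentation tooling can import and instantiate the class on a system without /bin/bash; only a real execute() call fails there.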

Test Plan

  1. CI
  2. Docs build on Windows.

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Added per-test exclude_nodes support across Slurm path (models, parsing, allocation, command generation), switched SlurmSystem.cmd_shell to use a default_factory, added git submodule init support in installers, adjusted Megatron-Bridge launcher/mount and container env handling, and updated tests and config TOMLs accordingly.

Changes

Cohort / File(s) Summary
Slurm core & allocation
src/cloudai/systems/slurm/slurm_system.py, src/cloudai/systems/slurm/slurm_command_gen_strategy.py, src/cloudai/systems/slurm/slurm_installer.py
Added exclude_nodes plumbing: parsing, grouping, allocation, and error behavior; added SlurmNodeState.RESERVED grouping; switched cmd_shell field to default_factory=CommandShell; updated command-gen to emit --exclude and include exclude_nodes in cache keys; refactored installer to support submodule init.
Models / TestRun / Parser
src/cloudai/models/scenario.py, src/cloudai/_core/test_scenario.py, src/cloudai/test_scenario_parser.py
Added exclude_nodes field to TestRun/TestRunModel and threading through parser to preserve per-test exclusions.
Git repo installables & installers
src/cloudai/_core/installables.py, src/cloudai/systems/kubernetes/kubernetes_installer.py, src/cloudai/systems/slurm/slurm_installer.py
Added init_submodules: bool to GitRepo; refactored clone/setup flow and introduced _init_submodules in installers; ensure repo path is set only after successful setup and remove partial clones on failure.
Megatron-Bridge launcher & mounts
src/cloudai/workloads/megatron_bridge/megatron_bridge.py, src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py, tests/ref_data/megatron-bridge.sbatch
Stopped implicit mount_as normalization; added container-runtime env var exports into wrapper script; parse extra srun args into Slurm params; forward gpus_per_node/gres and nodelist/exclude into --additional_slurm_params; adjust mounting logic and tests/ref data to match.
CommandShell & util
src/cloudai/util/command_shell.py
Condensed docstrings and moved the executable FileNotFoundError check to occur immediately in execute(); no public API changes.
Tests — Slurm, installer, command-gen
tests/systems/slurm/..., tests/test_git_repo_installer.py, tests/conftest.py, tests/test_docker_image_cache_manager.py, tests/...
Added/updated tests for exclude_nodes behavior, node allocation, command generation, caching, submodule init success/failure, and fixture tweaks (supports_gpu_directives_cache toggle); updated test helper wiring for Megatron-Bridge repo mount handling.
Configs / TOMLs
conf/experimental/megatron_bridge/test/*, conf/experimental/megatron_bridge/test_scenario/*
Removed mount_as from several TOMLs; added a new test_scenario TOML and minor header year updates/commented micro-batch hint.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through nodes both near and far,
Marking which to skip — exclude by name or star.
I plant fresh shells for each new test to run,
Pull submodules, mount the bridge, then watch commands.
A tidy nibble, then a joyful bound — code gardens humming, neat and sound. 🥕

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Do not run CommandShell check during object creation' accurately describes the main change: deferring the CommandShell executable check from initialization to execution time.
Description check ✅ Passed The description is related to the changeset, explaining that the CommandShell check has been moved from object creation to runtime to address Windows documentation build issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR bundles the stated Windows/docs-build fix (CommandShell no longer checks for /bin/bash at import time) with a set of broader Megatron-Bridge and Slurm enhancements: optional git submodule initialisation for GitRepo, an exclude_nodes field that flows from TOML → model → TestRun → SBATCH header and additional_slurm_params, conditional bind-mounting of the Megatron-Bridge repo via mount_as, container-runtime env-var pre-exports, srun-args forwarding to the inner sbatch, and support for SlurmNodeState.RESERVED nodes.

Key changes:

  • CommandShell: FileNotFoundError guard moved from __init__ to execute(); SlurmSystem switched to default_factory=CommandShell to avoid the check at Pydantic model-class creation time.
  • GitRepo.init_submodules: new opt-in boolean; both Slurm and Kubernetes installers gain _init_submodules + _clone_and_setup_repo helpers with proper cleanup on failure.
  • exclude_nodes: end-to-end plumbing from scenario TOML through TestRunModel/TestRun to #SBATCH --exclude and get_nodes_by_spec; Slurm range syntax is expanded before node filtering.
  • get_nodes_by_spec behavior change: non-empty node specs that resolve to an empty list now raise ValueError instead of silently falling back to unconstrained allocation — a deliberate but breaking change.
  • Megatron-Bridge: the forced mount_as=/opt/Megatron-Bridge default is removed; mounting is now conditional on the user setting the field; GPU directives, node lists, and srun args are forwarded via --additional_slurm_params; container-runtime variables (MELLANOX_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES) are exported in the wrapper script before the launcher call.
  • SlurmNodeState.RESERVED: added to the grouped-nodes dict and to max_avail allocation; also participates in fixed-count allocation via the generic for state in grouped_nodes loop.
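The default_factory switch matters because a plain default value is evaluated when the class body executes, i.e. at import time. The PR uses Pydantic's Field(default_factory=CommandShell); the same eager-versus-deferred distinction can be shown with a plain dataclass (names below are hypothetical):

```python
from dataclasses import dataclass, field

CONSTRUCTED: list[str] = []


class Shell:
    def __init__(self) -> None:
        CONSTRUCTED.append("shell")  # stand-in for an eager /bin/bash check


@dataclass
class SystemModel:
    # An eager default like `cmd_shell: Shell = Shell()` would run Shell()
    # while the class body executes, even if no instance is ever created.
    # default_factory defers construction to instance creation:
    cmd_shell: Shell = field(default_factory=Shell)


assert CONSTRUCTED == []         # defining the class built no Shell
system = SystemModel()
assert CONSTRUCTED == ["shell"]  # the factory ran on instantiation
```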

Confidence Score: 3/5

  • Mostly safe, but two concerns warrant attention before merging to main: a silent short-form flag drop in _parse_srun_args_as_slurm_params and the breaking ValueError in get_nodes_by_spec.
  • The core Windows/docs fix is clean and correct. The exclude_nodes plumbing and init_submodules additions are well-tested and logically sound. The Megatron-Bridge enhancements are substantial and well-covered by new tests. Two issues lower the score: (1) _parse_srun_args_as_slurm_params silently drops single-dash srun flags, which could cause silent misconfiguration; (2) get_nodes_by_spec now raises instead of falling back when explicitly-specified nodes are unavailable — a meaningful behavior change that could break existing scenarios if nodes are temporarily down.
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (short-form flag handling, exclude range-syntax in additional_slurm_params) and src/cloudai/systems/slurm/slurm_system.py (breaking ValueError in get_nodes_by_spec).

Important Files Changed

Filename Overview
src/cloudai/util/command_shell.py Core fix: moves FileNotFoundError guard from __init__ to execute() so CommandShell can be imported/instantiated on Windows without /bin/bash existing; also switches SlurmSystem to default_factory=CommandShell to avoid the check at class-definition time.
src/cloudai/systems/slurm/slurm_system.py Adds exclude_nodes propagation through parse_nodes → get_available_nodes_from_group → group_nodes_by_state; adds RESERVED node state support; introduces a breaking ValueError in get_nodes_by_spec when explicit specs resolve to empty — previously the function silently fell back to unconstrained allocation.
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py Significant feature additions: conditional repo bind-mounting via mount_as, container-runtime env-var pre-exports, additional_slurm_params forwarding (GPU directives, nodelist/exclude, srun args); new _parse_srun_args_as_slurm_params parser silently drops single-dash short-form flags and raw Slurm range syntax in exclude is passed unverified to the launcher.
src/cloudai/_core/installables.py Adds init_submodules: bool = False field to GitRepo; straightforward additive change.
src/cloudai/systems/slurm/slurm_installer.py Refactors _install_one_git_repo to call new _clone_and_setup_repo helper and adds _init_submodules; properly cleans up the repo directory on submodule init failure.
src/cloudai/systems/kubernetes/kubernetes_installer.py Mirrors Slurm installer changes: adds _init_submodules and _clone_and_setup_repo; consistent implementation with cleanup on failure.
src/cloudai/_core/test_scenario.py Adds exclude_nodes: List[str] field to TestRun dataclass; clean additive change with proper default factory.
src/cloudai/models/scenario.py Adds exclude_nodes field to TestRunModel with helpful Slurm range-syntax description; clean additive model change.
src/cloudai/systems/slurm/slurm_command_gen_strategy.py Adds #SBATCH --exclude directive when exclude_nodes is set and no explicit nodelist is present; includes exclude_nodes in the node-spec cache key to prevent stale hits across different test runs.
src/cloudai/workloads/megatron_bridge/megatron_bridge.py Removes forced mount_as assignment in _select_megatron_bridge_repo; the repo is now returned as-is so the caller decides whether to bind-mount it.

Sequence Diagram

sequenceDiagram
    participant TOML as scenario.toml
    participant TSP as TestScenarioParser
    participant TR as TestRun
    participant SCGS as SlurmCommandGenStrategy
    participant SS as SlurmSystem
    participant MBS as MegatronBridgeSlurmStrategy

    TOML->>TSP: exclude_nodes, git_repos[mount_as, init_submodules]
    TSP->>TR: TestRun(exclude_nodes=[...])
    TR->>SCGS: _append_nodes_related_directives()
    SCGS->>SCGS: emit #SBATCH --exclude (no nodelist case)
    SCGS->>SS: get_nodes_by_spec(nnodes, nodes, exclude_nodes)
    SS->>SS: parse_nodes() → expand ranges → filter excluded
    SS->>SS: get_available_nodes_from_group() → group_nodes_by_state()
    SS-->>SCGS: (num_nodes, node_list) or ValueError

    TR->>MBS: gen_exec_command()
    MBS->>MBS: _container_runtime_env_exports() → export MELLANOX/NVIDIA_VISIBLE_DEVICES
    MBS->>MBS: _build_launcher_parts() → conditional -cm mount_as
    MBS->>MBS: additional_slurm_params: gpus-per-node, nodelist/exclude, srun-args
    MBS->>MBS: _wrap_launcher_for_job_id → wrapper.sh

Comments Outside Diff (2)

  1. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, lines 638-651

    P1 Short-form srun flags are silently dropped

    _parse_srun_args_as_slurm_params only processes tokens starting with --. Any single-dash short-form flags (e.g., -p partition, -N 4, -C gpu) in extra_srun_args are silently ignored — the i += 1 at the bottom of the loop still advances past them, but no params.append is reached. This could cause subtle misconfiguration if a user or system uses short-form arguments.

    For example, extra_srun_args = "-p debug --exclusive" would produce ["exclusive"] and silently drop -p debug.

    Consider either documenting the limitation clearly or extending the parser to handle single-dash flags:

    elif tok.startswith("-") and not tok.startswith("--"):
        # Short-form flag: skip its value token too (e.g. "-p partition")
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            i += 1
        # Not forwarded as a Slurm param; warn instead of dropping silently
        logging.warning("Ignoring unsupported short-form srun flag: %s", tok)
  2. src/cloudai/systems/slurm/slurm_system.py, lines 477-486

    P2 Breaking behavior change: ValueError replaces silent fallback

    Previously, when non-empty nodes specs (e.g. ["main:group1:3"]) resolved to an empty list (all nodes DRAIN/DOWN or all excluded), get_nodes_by_spec silently returned (num_nodes, []), letting Slurm pick any available nodes in the partition.

    The new code raises a ValueError ("Cannot fall back to unconstrained allocation"). This is a more defensive and correct design, but it's a breaking change for users who may have existing scenarios where nodes are temporarily unavailable and relied on CloudAI transparently falling back. It would be worth adding a note in the release/changelog and ensuring callers handle this exception gracefully.

    The inconsistency with the empty-nodes case is also notable: when nodes=[] (empty list) + exclude_nodes set, the function still returns (num_nodes, []) without raising, which could feel surprising given the tighter contract now enforced for non-empty nodes.
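The two code paths described above can be summarized in a small sketch. This is a simplification with hypothetical names, not the real get_nodes_by_spec signature:

```python
def resolve_nodes(
    num_nodes: int, nodes: list[str], exclude_nodes: list[str]
) -> tuple[int, list[str]]:
    excluded = set(exclude_nodes)
    resolved = [n for n in nodes if n not in excluded]
    if nodes and not resolved:
        # New behavior: an explicit spec that resolves to nothing fails loudly.
        raise ValueError(
            "Node spec resolved to an empty list; "
            "cannot fall back to unconstrained allocation"
        )
    if not nodes:
        # Empty spec (even with exclude_nodes set) still lets Slurm pick nodes.
        return num_nodes, []
    return len(resolved), resolved
```

The asymmetry is visible in the sketch: only the non-empty-spec path can raise, while the empty-spec path keeps the old fallback behavior.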

Last reviewed commit: "Update copyright yea..."


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_docker_image_cache_manager.py`:
- Around line 77-78: The function definition for test_cache_docker_image exceeds
the 120-char line limit; split the parameter list across multiple lines to keep
each line <=120 chars while preserving the existing parameter order and
decorator behavior (ensure mock_supports_gpu remains first so bottom-up
decorators still apply). Update the signature for test_cache_docker_image(...)
to place parameters like mock_supports_gpu, mock_check_prerequisites, mock_run,
mock_access, mock_exists, mock_is_file, and slurm_system each on their own line
or grouped so the wrapped lines are under 120 chars and the test still consumes
the mocks as intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 825ca023-36b0-4b33-80f8-5d01bd8c972e

📥 Commits

Reviewing files that changed from the base of the PR and between 216bf6a and 0fb93cd.

📒 Files selected for processing (1)
  • tests/test_docker_image_cache_manager.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_docker_image_cache_manager.py (1)

2-2: ⚠️ Potential issue | 🔴 Critical

Update header year to 2026 to unblock CI.

Line 2 still says 2024-2025, but CI expects 2024-2026, and this is currently failing the pipeline.

🔧 Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_docker_image_cache_manager.py` at line 2, Update the file header
string that currently reads "Copyright (c) 2024-2025 NVIDIA CORPORATION &
AFFILIATES. All rights reserved." to "Copyright (c) 2024-2026 NVIDIA CORPORATION
& AFFILIATES. All rights reserved." so the year range matches CI expectations;
locate the header line in tests/test_docker_image_cache_manager.py (the
copyright comment at the top) and change the year range accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/test_docker_image_cache_manager.py`:
- Line 2: Update the file header string that currently reads "Copyright (c)
2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." to "Copyright
(c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved." so the year
range matches CI expectations; locate the header line in
tests/test_docker_image_cache_manager.py (the copyright comment at the top) and
change the year range accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8cca6567-7836-493c-a3c7-18c7de1627b1

📥 Commits

Reviewing files that changed from the base of the PR and between 0fb93cd and e75edc7.

📒 Files selected for processing (1)
  • tests/test_docker_image_cache_manager.py

@amaslenn amaslenn requested a review from podkidyshev March 20, 2026 08:25
@amaslenn amaslenn force-pushed the am/no-runtime-shell-check branch from 75f6276 to 9ab843a on March 20, 2026 15:14
Comment on lines +466 to +467
elif self.test_run.exclude_nodes:
additional_slurm_params.append(f"exclude={','.join(self.test_run.exclude_nodes)}")

P2 Raw Slurm range syntax in --additional_slurm_params exclude field

self.test_run.exclude_nodes is passed verbatim to the Megatron-Bridge launcher's --additional_slurm_params as exclude=node-[101-104],.... Whether the launcher correctly forwards Slurm range expressions to the inner sbatch call depends on its parsing. In the standard SBATCH --exclude directive (written to the script header on line 229), Slurm handles range expansion natively. But for the --additional_slurm_params path the launcher may or may not perform the same expansion.

It would be safer to expand the ranges before building the exclude parameter here, similar to what parse_nodes does via parse_node_list, or to add a test that validates range syntax is preserved end-to-end through the launcher path.
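Expanding ranges before building the parameter could look like this. A simplified sketch: real Slurm hostlists also allow comma-separated ranges inside the brackets, which this regex does not handle.

```python
import re


def expand_slurm_range(spec: str) -> list[str]:
    """Expand a single range expression such as 'node-[101-104]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if not match:
        return [spec]  # plain hostname, no range to expand
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero-padding, e.g. node-[001-003]
    return [
        f"{prefix}{i:0{width}d}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]
```

The exclude parameter could then be built as `"exclude=" + ",".join(name for spec in exclude_nodes for name in expand_slurm_range(spec))`, so the launcher receives only plain hostnames.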


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/_core/installables.py (1)

92-107: ⚠️ Potential issue | 🟠 Major

Include init_submodules in the installable identity.

init_submodules changes the checkout contents, but GitRepo.__eq__ / __hash__ still key only on url and commit. If installables are deduped, GitRepo(..., init_submodules=True) can collapse into a plain clone and skip the recursive submodule init step. Keep any git_repo-derived identity checks in sync as well.

Suggested change
 class GitRepo(Installable, BaseModel):
@@
     def __eq__(self, other: object) -> bool:
         """Check if two installable objects are equal."""
-        return isinstance(other, GitRepo) and other.url == self.url and other.commit == self.commit
+        return (
+            isinstance(other, GitRepo)
+            and other.url == self.url
+            and other.commit == self.commit
+            and other.init_submodules == self.init_submodules
+        )
 
     def __hash__(self) -> int:
         """Hash the installable object."""
-        return hash((self.url, self.commit))
+        return hash((self.url, self.commit, self.init_submodules))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/_core/installables.py` around lines 92 - 107, GitRepo's identity
currently ignores init_submodules causing different checkouts to be treated
equal; update GitRepo.__eq__ and GitRepo.__hash__ to include the init_submodules
field (i.e., compare self.init_submodules in __eq__ and include it in the tuple
passed to hash), and mirror the same change in any git_repo-derived identity
methods so deduping respects recursive submodule initialization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py`:
- Around line 427-428: The generated SBATCH use of exclude_nodes isn't
persisted: add an exclude_nodes attribute to the TestRunDetails/data model and
include self.test_run.exclude_nodes when building the stored run spec in
store_test_run() in slurm_command_gen_strategy.py so the saved metadata can
reconstruct allocation requests; update any serialization/deserialization logic
that creates or consumes TestRunDetails (and any callers of store_test_run()) to
read/write this new field and ensure tests or storage schema migrations are
adjusted accordingly.

In `@src/cloudai/systems/slurm/slurm_installer.py`:
- Around line 212-226: The status checks for Git repos currently only verify
existence and commit but not submodule initialization, so update
_is_git_repo_installed() and the GitRepo branch of is_installed_one() to also
validate submodules when item.init_submodules is true: call the same
submodule-check logic used in _install_one_git_repo()/_clone_and_setup_repo()
(or factor that check into a helper) to ensure submodules are initialized and at
the correct state before returning success; if the submodule check fails, return
a failing InstallStatusResult with an explanatory message.

In `@src/cloudai/systems/slurm/slurm_system.py`:
- Around line 758-771: The code collapses the “too few nodes returned” case into
the “empty node list” error; update the logic after parse_nodes so that if
parsed_nodes is non-empty but shorter than the originally requested count you
raise a specific shortfall error instead of the empty-list message. Determine
the requested node count from the incoming nodes specification (using the same
logic you use elsewhere—e.g., the routine that resolves group requests or
get_available_nodes_from_group) and compare it to len(parsed_nodes); if
len(parsed_nodes) < requested_count raise a ValueError indicating the requested
count and the number of nodes actually available (referencing parsed_nodes,
nodes and exclude_nodes), otherwise keep the existing empty-list error path when
parsed_nodes is truly empty.

In `@src/cloudai/workloads/megatron_bridge/megatron_bridge.py`:
- Around line 186-196: The helper _select_megatron_bridge_repo currently returns
the repo unchanged which makes an absent mount_as become a no-op; instead when a
Megatron-Bridge repo is found (match on "Megatron-Bridge" in repo.url or
explicit mount_as path) ensure it will be mounted into the container so the
pinned commit controls runtime by returning a repo object that has mount_as set
to "/opt/Megatron-Bridge" if repo.mount_as is falsy; to avoid mutating callers,
return a shallow copy (or new GitRepo instance) with
mount_as="/opt/Megatron-Bridge" when needed (leave any existing mount_as alone)
so configs that omit mount_as still override the image bundle.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 107-124: The file has two issues: _container_runtime_env_exports
currently shlex.quotes values (turning shell-expansions like
NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES into literals) and
_parse_srun_args_as_slurm_params silently drops unsupported tokens (short flags
like -x or non-flag tokens). Fix _container_runtime_env_exports to preserve
shell expansion for keys in CONTAINER_RUNTIME_ENV_VARS by not quoting values
that look like shell expressions (e.g., start with $ or contain parameter
expansion) or by validating and raising an error if a value contains shell
metacharacters—update the logic in _container_runtime_env_exports and reference
CONTAINER_RUNTIME_ENV_VARS and final_env_vars. Fix
_parse_srun_args_as_slurm_params to stop silently discarding tokens from
extra_srun_args: enhance its parsing to accept short flags (e.g., -x),
positional/non-flag tokens, and return them unchanged if they don’t match
--key[=value]/--key value/--flag patterns, or at minimum emit a clear
warning/error when tokens would be dropped so the caller can handle them—update
the parsing code in _parse_srun_args_as_slurm_params and how extra_srun_args is
consumed.

In `@tests/systems/slurm/test_command_gen_strategy.py`:
- Around line 382-383: Replace the weak absence check that uses `assert "#SBATCH
--exclude=" not in content` with a prefix-based check that ensures no line in
the generated `content` starts with the `#SBATCH --exclude` directive; locate
the assertion in the test (in tests/systems/slurm/test_command_gen_strategy.py
where the `content` variable is inspected) and change it to assert that none of
the `content.splitlines()` entries start with the string `#SBATCH --exclude`.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 355-357: The test function
test_mount_as_adds_repo_to_container_mounts currently declares an unused
tmp_path fixture parameter; remove tmp_path from the function signature so the
test becomes def test_mount_as_adds_repo_to_container_mounts(self,
configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]) ->
None to eliminate the unused fixture and silence the ARG002 warning. Ensure no
other code in that test references tmp_path before committing.

---

Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Around line 92-107: GitRepo's identity currently ignores init_submodules
causing different checkouts to be treated equal; update GitRepo.__eq__ and
GitRepo.__hash__ to include the init_submodules field (i.e., compare
self.init_submodules in __eq__ and include it in the tuple passed to hash), and
mirror the same change in any git_repo-derived identity methods so deduping
respects recursive submodule initialization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 24836278-8892-45f8-824f-77b1547f8686

📥 Commits

Reviewing files that changed from the base of the PR and between 1567b71 and 9ab843a.

📒 Files selected for processing (24)
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test_scenario/megatron_bridge_r0.3.0_qwen_30b.toml
  • src/cloudai/_core/installables.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/models/scenario.py
  • src/cloudai/systems/kubernetes/kubernetes_installer.py
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • src/cloudai/systems/slurm/slurm_installer.py
  • src/cloudai/systems/slurm/slurm_system.py
  • src/cloudai/test_scenario_parser.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/conftest.py
  • tests/ref_data/megatron-bridge.sbatch
  • tests/systems/slurm/test_allocation.py
  • tests/systems/slurm/test_command_gen_strategy.py
  • tests/systems/slurm/test_system.py
  • tests/test_docker_image_cache_manager.py
  • tests/test_git_repo_installer.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
💤 Files with no reviewable changes (3)
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml

Comment on lines +427 to +428
if self.test_run.exclude_nodes:
content.append(f"#SBATCH --exclude={','.join(self.test_run.exclude_nodes)}")

⚠️ Potential issue | 🟡 Minor

Persist exclude_nodes in the stored run spec.

#SBATCH --exclude= only exists in the generated batch script. store_test_run() still dumps the raw srun command, and TestRunDetails has no exclude_nodes field, so the saved run metadata can no longer reconstruct the actual allocation request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py` around lines 427 -
428, The generated SBATCH use of exclude_nodes isn't persisted: add an
exclude_nodes attribute to the TestRunDetails/data model and include
self.test_run.exclude_nodes when building the stored run spec in
store_test_run() in slurm_command_gen_strategy.py so the saved metadata can
reconstruct allocation requests; update any serialization/deserialization logic
that creates or consumes TestRunDetails (and any callers of store_test_run()) to
read/write this new field and ensure tests or storage schema migrations are
adjusted accordingly.

Comment on lines +212 to +226
if item.init_submodules:
res = self._init_submodules(repo_path)
if not res.success:
return res
item.installed_path = repo_path
msg = f"Git repository already exists at {repo_path}."
logging.debug(msg)
return InstallStatusResult(True, msg)

res = self._clone_and_setup_repo(item, repo_path)
if not res.success:
return res

item.installed_path = repo_path
return InstallStatusResult(True)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Base installer orchestration =="
BASE_INSTALLER_FILE="$(fd 'base_installer\.py$' src | head -n 1 || true)"
if [ -n "${BASE_INSTALLER_FILE}" ]; then
  sed -n '1,260p' "${BASE_INSTALLER_FILE}"
fi

echo
echo "== Git repo installed checks =="
rg -n -C3 'def (_is_git_repo_installed|is_installed_one)|is_installed_one\(' \
  src/cloudai/systems/slurm/slurm_installer.py \
  src/cloudai/systems/kubernetes/kubernetes_installer.py

Repository: NVIDIA/cloudai


🏁 Script executed:

# Get full implementation of is_installed_one and _is_git_repo_installed in slurm_installer.py
echo "=== slurm_installer.py: is_installed_one() ===" 
sed -n '139,170p' src/cloudai/systems/slurm/slurm_installer.py

echo
echo "=== slurm_installer.py: _is_git_repo_installed() ===" 
sed -n '446,485p' src/cloudai/systems/slurm/slurm_installer.py

echo
echo "=== kubernetes_installer.py: is_installed_one() ===" 
sed -n '110,145p' src/cloudai/systems/kubernetes/kubernetes_installer.py

echo
echo "=== Check if init_submodules is validated in status checks ==="
rg -n 'init_submodules' src/cloudai/systems/slurm/slurm_installer.py src/cloudai/systems/kubernetes/kubernetes_installer.py



🏁 Script executed:

# Verify if _verify_commit checks submodules state
echo "=== _verify_commit implementation ===" 
rg -n -A10 'def _verify_commit' src/cloudai/systems/slurm/slurm_installer.py | head -20

Repository: NVIDIA/cloudai

Length of output: 700


Add init_submodules validation to status check methods.

The _is_git_repo_installed() method (and is_installed_one() in Kubernetes) validates only repository existence and commit, but does not check if submodules are initialized when item.init_submodules is true. While the current orchestration doesn't pre-check status before installing, the incomplete status API creates a correctness issue: external calls to is_installed() or future optimizations that check status first will incorrectly report a repository as installed if it exists at the correct commit but lacks initialized submodules.

Update _is_git_repo_installed() and the Kubernetes is_installed_one() GitRepo branch to verify submodules state when applicable, similar to the checks already present in _install_one_git_repo() and _clone_and_setup_repo().

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 218-218: debug() call on root logger

(LOG015)


[warning] 219-219: Boolean positional value in function call

(FBT003)


[warning] 226-226: Boolean positional value in function call

(FBT003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/slurm/slurm_installer.py` around lines 212 - 226, The
status checks for Git repos currently only verify existence and commit but not
submodule initialization, so update _is_git_repo_installed() and the GitRepo
branch of is_installed_one() to also validate submodules when
item.init_submodules is true: call the same submodule-check logic used in
_install_one_git_repo()/_clone_and_setup_repo() (or factor that check into a
helper) to ensure submodules are initialized and at the correct state before
returning success; if the submodule check fails, return a failing
InstallStatusResult with an explanatory message.

Comment on lines +758 to +771
parsed_nodes = self.parse_nodes(nodes, exclude_nodes=exclude_nodes)
if parsed_nodes:
num_nodes = len(parsed_nodes)
node_list = parsed_nodes
elif nodes:
reason = (
f"after excluding nodes {exclude_nodes}"
if exclude_nodes
else "— no nodes are available (all may be DRAIN/DOWN)"
)
raise ValueError(
f"Node specifications {nodes} resolved to an empty node list {reason}. "
"Cannot fall back to unconstrained allocation."
)

⚠️ Potential issue | 🟡 Minor

Don't collapse “too few nodes” into “empty node list.”

Lines 762-771 treat any empty parsed_nodes result as if nothing matched, but get_available_nodes_from_group() also returns [] when only part of the request can be satisfied. A request like 4 nodes from a 3-node remainder now raises “resolved to an empty node list” instead of reporting the actual shortfall.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 768-771: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/slurm/slurm_system.py` around lines 758 - 771, The code
collapses the “too few nodes returned” case into the “empty node list” error;
update the logic after parse_nodes so that if parsed_nodes is non-empty but
shorter than the originally requested count you raise a specific shortfall error
instead of the empty-list message. Determine the requested node count from the
incoming nodes specification (using the same logic you use elsewhere—e.g., the
routine that resolves group requests or get_available_nodes_from_group) and
compare it to len(parsed_nodes); if len(parsed_nodes) < requested_count raise a
ValueError indicating the requested count and the number of nodes actually
available (referencing parsed_nodes, nodes and exclude_nodes), otherwise keep
the existing empty-list error path when parsed_nodes is truly empty.
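One way to separate the two failure modes is to build the error message from both the requested count and what actually survived filtering. This is an illustrative sketch; the function and argument names do not match the cloudai codebase:

```python
def describe_allocation_failure(requested: int, available: list[str], excluded: list[str]) -> str:
    """Distinguish 'nothing matched' from 'matched, but too few nodes'."""
    if not available:
        why = f"after excluding {excluded}" if excluded else "no nodes available"
        return f"Resolved to an empty node list ({why})."
    if len(available) < requested:
        # The shortfall case: report how many were found, not an empty list.
        return f"Only {len(available)} of {requested} requested nodes available: {available}."
    return "ok"
```

With this split, a request for 4 nodes against a 3-node remainder reports the shortfall instead of claiming the list was empty.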

Comment on lines 186 to +196
def _select_megatron_bridge_repo(git_repos: list[GitRepo]) -> GitRepo | None:
"""Return the Megatron-Bridge repo from `git_repos` (normalized to mount_as=/opt/Megatron-Bridge)."""
"""
Return the Megatron-Bridge repo from `git_repos`.

When the user sets ``mount_as`` (e.g. ``/opt/Megatron-Bridge``), the installed clone will be bind-mounted
into the container at that path, overriding whatever the container image ships. When ``mount_as`` is *not*
set the container's built-in ``/opt/Megatron-Bridge`` is used.
"""
for repo in git_repos:
if "Megatron-Bridge" in repo.url or (repo.mount_as or "").rstrip("/") == "/opt/Megatron-Bridge":
return repo if repo.mount_as else repo.model_copy(update={"mount_as": "/opt/Megatron-Bridge"})
return repo

⚠️ Potential issue | 🟠 Major

Don't make the repo pin a no-op.

Returning the repo unchanged here means configs that omit mount_as no longer override /opt/Megatron-Bridge inside the container. The job now uses the image-bundled checkout, while [[git_repos]].commit is still required and cloned locally, so the declared pin no longer controls the runtime version. Either keep the implicit mount or stop requiring/installing the repo when mount_as is unset.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/megatron_bridge/megatron_bridge.py` around lines 186 -
196, The helper _select_megatron_bridge_repo currently returns the repo
unchanged which makes an absent mount_as become a no-op; instead when a
Megatron-Bridge repo is found (match on "Megatron-Bridge" in repo.url or
explicit mount_as path) ensure it will be mounted into the container so the
pinned commit controls runtime by returning a repo object that has mount_as set
to "/opt/Megatron-Bridge" if repo.mount_as is falsy; to avoid mutating callers,
return a shallow copy (or new GitRepo instance) with
mount_as="/opt/Megatron-Bridge" when needed (leave any existing mount_as alone)
so configs that omit mount_as still override the image bundle.
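The non-mutating copy the prompt describes can be sketched like this. `RepoSpec` is a simplified stand-in for the real `GitRepo` Pydantic model (which would use `model_copy(update=...)` instead of `dataclasses.replace`):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RepoSpec:
    """Simplified stand-in for GitRepo, for illustration only."""

    url: str
    mount_as: Optional[str] = None


def normalize_bridge_repo(repo: RepoSpec) -> RepoSpec:
    """Ensure the pinned clone overrides the image's /opt/Megatron-Bridge."""
    if repo.mount_as:
        return repo  # explicit mount_as is left alone
    # Return a copy so the caller's object is not mutated.
    return replace(repo, mount_as="/opt/Megatron-Bridge")
```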

Comment on lines +107 to +124
def _container_runtime_env_exports(self) -> list[str]:
"""
Build ``export`` lines for container-runtime env vars.

Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
are consumed by the NVIDIA container toolkit / enroot at container-creation
time to decide which devices to mount. They must be present in the process
environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
Slurm inherits them into the job and ``srun`` passes them to the container
runtime. Exporting them in the wrapper script (which runs on the submit
node) achieves this. The same variables are still passed via ``-cb`` as
well, so they are also set inside the container for any runtime readers.
"""
lines: list[str] = []
for key, value in sorted(self.final_env_vars.items()):
if key in self.CONTAINER_RUNTIME_ENV_VARS:
lines.append(f"export {key}={shlex.quote(str(value))}")
return lines

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import shlex
import subprocess

value = "$CUDA_VISIBLE_DEVICES"
rendered = f"export NVIDIA_VISIBLE_DEVICES={shlex.quote(value)}; printf '%s\\n' \"$NVIDIA_VISIBLE_DEVICES\""
print("Generated wrapper line:", rendered)

literal = subprocess.run(
    ["bash", "-lc", f"CUDA_VISIBLE_DEVICES=0,1; {rendered}"],
    check=True,
    capture_output=True,
    text=True,
)
expanded = subprocess.run(
    [
        "bash",
        "-lc",
        'CUDA_VISIBLE_DEVICES=0,1; export NVIDIA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES"; printf "%s\\n" "$NVIDIA_VISIBLE_DEVICES"',
    ],
    check=True,
    capture_output=True,
    text=True,
)

print("Current result :", literal.stdout.strip())
print("Expanded result:", expanded.stdout.strip())
PY

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

Generated wrapper line: export NVIDIA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'; printf '%s\n' "$NVIDIA_VISIBLE_DEVICES"
Current result : $CUDA_VISIBLE_DEVICES
Expanded result: 0,1

Script executed:

#!/bin/bash

# First, find and read the exact file to verify the implementation
if [ -f "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" ]; then
  echo "=== Lines 107-124 (container_runtime_env_exports) ==="
  sed -n '107,124p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"
  
  echo -e "\n=== Checking for _build_custom_bash_env_exports context ==="
  # Find where _build_custom_bash_env_exports is defined and show relevant lines
  rg -n "_build_custom_bash_env_exports" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -A 20 | head -40
  
  echo -e "\n=== Checking for _parse_srun_args_as_slurm_params ==="
  rg -n "_parse_srun_args_as_slurm_params" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -A 15 | head -40
  
  echo -e "\n=== Checking for CONTAINER_RUNTIME_ENV_VARS definition ==="
  rg -n "CONTAINER_RUNTIME_ENV_VARS" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -B 2 -A 5
fi

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

=== Lines 107-124 (container_runtime_env_exports) ===
    def _container_runtime_env_exports(self) -> list[str]:
        """
        Build ``export`` lines for container-runtime env vars.

        Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
        are consumed by the NVIDIA container toolkit / enroot at container-creation
        time to decide which devices to mount.  They must be present in the process
        environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
        Slurm inherits them into the job and ``srun`` passes them to the container
        runtime.  Exporting them in the wrapper script (which runs on the submit
        node) achieves this.  The same variables are still passed via ``-cb`` as
        well, so they are also set inside the container for any runtime readers.
        """
        lines: list[str] = []
        for key, value in sorted(self.final_env_vars.items()):
            if key in self.CONTAINER_RUNTIME_ENV_VARS:
                lines.append(f"export {key}={shlex.quote(str(value))}")
        return lines

=== Checking for _build_custom_bash_env_exports context ===
95:    def _build_custom_bash_env_exports(self) -> list[str]:
96-        """
97-        Build repeated -cb entries that export env vars inside the launched Slurm job shell.
98-
99-        We quote each full `export KEY=value` command so `$SLURM_*` and commas survive
100-        argument parsing on the submit node and are expanded/interpreted in the job shell.
101-        """
102-        exports: list[str] = []
103-        for key, value in sorted(self.final_env_vars.items()):
104-            exports.extend(["-cb", shlex.quote(f"export {key}={value}")])
105-        return exports
106-
107-    def _container_runtime_env_exports(self) -> list[str]:
108-        """
109-        Build ``export`` lines for container-runtime env vars.
110-
111-        Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
112-        are consumed by the NVIDIA container toolkit / enroot at container-creation
113-        time to decide which devices to mount.  They must be present in the process
114-        environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
115-        Slurm inherits them into the job and ``srun`` passes them to the container
--
353:            parts.extend(self._build_custom_bash_env_exports())
354-
355-        # Model flags (Megatron-Bridge main-branch API)
356-        add_field("domain", "--domain", args.domain)
357-        if args.use_recipes and "use_recipes" in fields_set:
358-            parts.append("--use_recipes")
359-        if "enable_vboost" in fields_set:
360-            add_field("enable_vboost", "-vb", bool(args.enable_vboost))
361-        if not args.model_family_name:
362-            raise RuntimeError("Missing required cmd_args.model_family_name (maps to -m/--model_family_name).")
363-        if not args.model_recipe_name:
364-            raise RuntimeError("Missing required cmd_args.model_recipe_name (maps to -mr/--model_recipe_name).")
365-        add_field("model_family_name", "-m", args.model_family_name)
366-        add_field("model_recipe_name", "-mr", args.model_recipe_name)
367-        add_field("hidden_size", "--hidden_size", args.hidden_size)
368-        add_field("num_layers", "--num_layers", args.num_layers)
369-        add_field(
370-            "pipeline_model_parallel_layout", "--pipeline_model_parallel_layout", args.pipeline_model_parallel_layout

=== Checking for _parse_srun_args_as_slurm_params ===
139:    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
140-        """
141-        Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
142-
143-        Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
144-        key names without a ``=value`` suffix.
145-        """
146-        params: list[str] = []
147-        tokens = shlex.split(srun_args)
148-        i = 0
149-        while i < len(tokens):
150-            tok = tokens[i]
151-            if tok.startswith("--") and "=" in tok:
152-                key, val = tok[2:].split("=", 1)
153-                params.append(f"{key}={val}")
154-            elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
--
471:                additional_slurm_params.extend(self._parse_srun_args_as_slurm_params(source))
472-
473-        if additional_slurm_params:
474-            parts.extend(["--additional_slurm_params", shlex.quote(";".join(additional_slurm_params))])
475-
476-        # Config variant
477-        add_field("config_variant", "-cv", args.config_variant)
478-        if args.list_config_variants and "list_config_variants" in fields_set:
479-            parts.append("--list_config_variants")
480-
481-        # Extra user args (dict -> string)
482-        if tdef.extra_cmd_args:
483-            parts.append(tdef.extra_args_str)
484-
485-        return parts

=== Checking for CONTAINER_RUNTIME_ENV_VARS definition ===
38-    """
39-
40:    CONTAINER_RUNTIME_ENV_VARS: frozenset[str] = frozenset(
41-        {
42-            "MELLANOX_VISIBLE_DEVICES",
43-            "NVIDIA_VISIBLE_DEVICES",
44-            "NVIDIA_DRIVER_CAPABILITIES",
45-        }
--
120-        lines: list[str] = []
121-        for key, value in sorted(self.final_env_vars.items()):
122:            if key in self.CONTAINER_RUNTIME_ENV_VARS:
123-                lines.append(f"export {key}={shlex.quote(str(value))}")
124-        return lines
125-
126-    def _normalize_recompute_modules(self, val: Any) -> str:
127-        if isinstance(val, list):

Script executed:

#!/bin/bash

# Check the file size and structure
echo "=== File structure ==="
wc -l "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

# Get a broader view of the class to understand context
echo -e "\n=== Class structure and method definitions ==="
rg -n "^\s*(def |class )" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

=== File structure ===
485 src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

=== Class structure and method definitions ===
33:class MegatronBridgeSlurmCommandGenStrategy(SlurmCommandGenStrategy):
48:    def _container_mounts(self) -> list[str]:
52:    def gen_exec_command(self) -> str:
83:    def store_test_run(self) -> None:
89:    def _write_command_to_file(self, command: str, output_path: Path) -> None:
95:    def _build_custom_bash_env_exports(self) -> list[str]:
107:    def _container_runtime_env_exports(self) -> list[str]:
126:    def _normalize_recompute_modules(self, val: Any) -> str:
139:    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
162:    def _normalize_cuda_graph_scope_arg(self, val: Any) -> str:
169:    def _wrap_launcher_for_job_id_and_quiet_output(self, launcher_cmd: str, launcher_python: str) -> str:
240:    def _list_or_comma_str(self, val: str | list[str] | None) -> Optional[str]:
249:    def _build_launcher_parts(  # noqa: C901
265:            def _installed_container_path() -> str:
298:        def add(flag: str, value: Any) -> None:
317:        def add_field(field: str, flag: str, value: Any) -> None:

Script executed:

#!/bin/bash

# Get the full _parse_srun_args_as_slurm_params method
sed -n '139,160p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai


Repository: NVIDIA/cloudai
Exit code: 0

stdout:

    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
        """
        Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.

        Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
        key names without a ``=value`` suffix.
        """
        params: list[str] = []
        tokens = shlex.split(srun_args)
        i = 0
        while i < len(tokens):
            tok = tokens[i]
            if tok.startswith("--") and "=" in tok:
                key, val = tok[2:].split("=", 1)
                params.append(f"{key}={val}")
            elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                params.append(f"{tok[2:]}={tokens[i + 1]}")
                i += 1
            elif tok.startswith("--"):
                params.append(tok[2:])
            i += 1
        return params

Don't turn container-runtime env references into literals.

_build_custom_bash_env_exports() keeps env values shell-interpretable by quoting the entire export statement, but _container_runtime_env_exports() wraps the value itself with shlex.quote(). A config like NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES will be exported as a literal dollar string before sbatch, and the later -cb export cannot fix that because the container runtime consumes it at container-creation time. Preserve shell expansion here, or explicitly reject shell expressions for these keys.

Additionally, _parse_srun_args_as_slurm_params() silently drops any extra_srun_args tokens that don't match its three supported patterns (--key=value, --key value, --flag). Short-form flags like -x and non-flag tokens disappear without warning, risking loss of scheduler-affecting options.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 107 - 124, The file has two issues: _container_runtime_env_exports
currently shlex.quotes values (turning shell-expansions like
NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES into literals) and
_parse_srun_args_as_slurm_params silently drops unsupported tokens (short flags
like -x or non-flag tokens). Fix _container_runtime_env_exports to preserve
shell expansion for keys in CONTAINER_RUNTIME_ENV_VARS by not quoting values
that look like shell expressions (e.g., start with $ or contain parameter
expansion) or by validating and raising an error if a value contains shell
metacharacters—update the logic in _container_runtime_env_exports and reference
CONTAINER_RUNTIME_ENV_VARS and final_env_vars. Fix
_parse_srun_args_as_slurm_params to stop silently discarding tokens from
extra_srun_args: enhance its parsing to accept short flags (e.g., -x),
positional/non-flag tokens, and return them unchanged if they don’t match
--key[=value]/--key value/--flag patterns, or at minimum emit a clear
warning/error when tokens would be dropped so the caller can handle them—update
the parsing code in _parse_srun_args_as_slurm_params and how extra_srun_args is
consumed.
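The quoting asymmetry the scripts demonstrate can be avoided by quoting only values that contain no shell expansion. A minimal sketch, assuming values with `$` are intentional shell expressions (a stricter variant could reject them instead):

```python
import shlex


def export_line(key: str, value: str) -> str:
    """Build an `export` line that preserves shell expansion for values like
    $CUDA_VISIBLE_DEVICES, while still safely quoting plain literals."""
    if "$" in value:
        # Double quotes let the shell expand the expression on the submit node.
        return f'export {key}="{value}"'
    return f"export {key}={shlex.quote(value)}"
```

With this, `NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES` expands before `sbatch` runs, matching the behavior of the existing `-cb` path.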

Comment on lines +138 to +160
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.

Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
i += 1
return params

⚠️ Potential issue | 🟠 Major

Reject unsupported extra_srun_args forms instead of dropping them.

This parser only recognizes long-form --... tokens. Valid Slurm inputs like -A acct or -w node01 — or even a typo missing the second dash — fall through the loop and disappear from --additional_slurm_params with no signal. That can change scheduler placement silently. Either parse short options explicitly or raise on token shapes you do not support.

🛡️ Minimal fail-fast fix
         while i < len(tokens):
             tok = tokens[i]
             if tok.startswith("--") and "=" in tok:
                 key, val = tok[2:].split("=", 1)
                 params.append(f"{key}={val}")
             elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                 params.append(f"{tok[2:]}={tokens[i + 1]}")
                 i += 1
             elif tok.startswith("--"):
                 params.append(tok[2:])
+            else:
+                raise ValueError(
+                    f"Unsupported extra_srun_args token {tok!r}; "
+                    "use --key=value, --key value, or bare --flag forms."
+                )
             i += 1
         return params
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
i += 1
return params
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
else:
raise ValueError(
f"Unsupported extra_srun_args token {tok!r}; "
"use --key=value, --key value, or bare --flag forms."
)
i += 1
return params

Comment on lines +382 to +383
assert "#SBATCH --nodelist=node3,node4" in content
assert "#SBATCH --exclude=" not in content

⚠️ Potential issue | 🟡 Minor

Use a prefix check for the missing --exclude assertion.

Line 383 is testing exact list membership, not absence of an exclude directive. If content contains #SBATCH --exclude=node01,node02, this assertion still passes.

🔧 Proposed fix
     assert "#SBATCH --nodelist=node3,node4" in content
-    assert "#SBATCH --exclude=" not in content
+    assert not any(line.startswith("#SBATCH --exclude=") for line in content)
📝 Committable suggestion


Suggested change
assert "#SBATCH --nodelist=node3,node4" in content
assert "#SBATCH --exclude=" not in content
assert "#SBATCH --nodelist=node3,node4" in content
assert not any(line.startswith("#SBATCH --exclude=") for line in content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/systems/slurm/test_command_gen_strategy.py` around lines 382 - 383,
Replace the weak absence check that uses `assert "#SBATCH --exclude=" not in
content` with a prefix-based check that ensures no line in the generated
`content` starts with the `#SBATCH --exclude` directive; locate the assertion in
the test (in tests/systems/slurm/test_command_gen_strategy.py where the
`content` variable is inspected) and change it to assert that none of the
`content.splitlines()` entries start with the string `#SBATCH --exclude`.

Comment on lines +355 to +357
def test_mount_as_adds_repo_to_container_mounts(
self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
) -> None:

🧹 Nitpick | 🔵 Trivial

Remove the unused tmp_path fixture from this test signature.

tmp_path is never read here, so the fixture just adds setup noise and keeps Ruff's ARG002 warning alive.

♻️ Proposed cleanup
-    def test_mount_as_adds_repo_to_container_mounts(
-        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
-    ) -> None:
+    def test_mount_as_adds_repo_to_container_mounts(
+        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
+    ) -> None:
📝 Committable suggestion


Suggested change
def test_mount_as_adds_repo_to_container_mounts(
self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
) -> None:
def test_mount_as_adds_repo_to_container_mounts(
self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
) -> None:
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 356-356: Unused method argument: tmp_path

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 355 - 357, The test function test_mount_as_adds_repo_to_container_mounts
currently declares an unused tmp_path fixture parameter; remove tmp_path from
the function signature so the test becomes def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None to eliminate the
unused fixture and silence the ARG002 warning. Ensure no other code in that test
references tmp_path before committing.

@amaslenn amaslenn closed this Mar 20, 2026
@amaslenn amaslenn deleted the am/no-runtime-shell-check branch March 20, 2026 15:49
