Do not run CommandShell check during object creation#841
Conversation
Note: Reviews paused. This branch appears to be under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Added per-test exclude_nodes support across the Slurm path (models, parsing, allocation, command generation), switched SlurmSystem.cmd_shell to use a default_factory, added git submodule init support in installers, adjusted the Megatron-Bridge launcher/mount and container env handling, and updated tests and config TOMLs accordingly.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed
Greptile Summary
This PR bundles the stated Windows/docs-build fix with additional changes. Key changes:
Confidence Score: 3/5
Important Files Changed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_docker_image_cache_manager.py`:
- Around line 77-78: The function definition for test_cache_docker_image exceeds
the 120-char line limit; split the parameter list across multiple lines to keep
each line <=120 chars while preserving the existing parameter order and
decorator behavior (ensure mock_supports_gpu remains first so bottom-up
decorators still apply). Update the signature for test_cache_docker_image(...)
to place parameters like mock_supports_gpu, mock_check_prerequisites, mock_run,
mock_access, mock_exists, mock_is_file, and slurm_system each on their own line
or grouped so the wrapped lines are under 120 chars and the test still consumes
the mocks as intended.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 825ca023-36b0-4b33-80f8-5d01bd8c972e
📒 Files selected for processing (1)
tests/test_docker_image_cache_manager.py
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_docker_image_cache_manager.py (1)
2-2: ⚠️ Potential issue | 🔴 Critical — Update header year to 2026 to unblock CI.
Line 2 still says 2024-2025, but CI expects 2024-2026, and this is currently failing the pipeline.
🔧 Proposed fix:
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_docker_image_cache_manager.py` at line 2, Update the file header string that currently reads "Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." to "Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved." so the year range matches CI expectations; locate the header line in tests/test_docker_image_cache_manager.py (the copyright comment at the top) and change the year range accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/test_docker_image_cache_manager.py`:
- Line 2: Update the file header string that currently reads "Copyright (c)
2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved." to "Copyright
(c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved." so the year
range matches CI expectations; locate the header line in
tests/test_docker_image_cache_manager.py (the copyright comment at the top) and
change the year range accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8cca6567-7836-493c-a3c7-18c7de1627b1
📒 Files selected for processing (1)
tests/test_docker_image_cache_manager.py
Force-pushed 75f6276 to 9ab843a
elif self.test_run.exclude_nodes:
    additional_slurm_params.append(f"exclude={','.join(self.test_run.exclude_nodes)}")
Raw Slurm range syntax in --additional_slurm_params exclude field
self.test_run.exclude_nodes is passed verbatim to the Megatron-Bridge launcher's --additional_slurm_params as exclude=node-[101-104],.... Whether the launcher correctly forwards Slurm range expressions to the inner sbatch call depends on its parsing. In the standard SBATCH --exclude directive (written to the script header on line 229), Slurm handles range expansion natively. But for the --additional_slurm_params path the launcher may or may not perform the same expansion.
It would be safer to expand the ranges before building the exclude parameter here, similar to what parse_nodes does via parse_node_list, or to add a test that validates range syntax is preserved end-to-end through the launcher path.
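The suggested pre-expansion could be sketched roughly as below. This is illustrative only; CloudAI's actual parse_node_list likely handles more grammar (comma-separated lists inside brackets, multiple bracket groups), so the helper name and coverage here are assumptions:

```python
import re

def expand_slurm_ranges(spec: str) -> list[str]:
    """Expand a single Slurm-style hostlist like 'node-[101-104]' into hostnames.

    Minimal sketch: handles one bracketed numeric range with zero padding;
    a plain hostname is returned as-is.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if not match:
        return [spec]  # no range to expand
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as node-[001-004]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]
```

Expanding before joining into `exclude=...` removes any dependence on the launcher's own hostlist parsing.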
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/_core/installables.py (1)
92-107: ⚠️ Potential issue | 🟠 Major — Include init_submodules in the installable identity.
init_submodules changes the checkout contents, but GitRepo.__eq__/__hash__ still key only on url and commit. If installables are deduped, GitRepo(..., init_submodules=True) can collapse into a plain clone and skip the recursive submodule init step. Keep any git_repo-derived identity checks in sync as well.
Suggested change:
 class GitRepo(Installable, BaseModel):
 @@
     def __eq__(self, other: object) -> bool:
         """Check if two installable objects are equal."""
-        return isinstance(other, GitRepo) and other.url == self.url and other.commit == self.commit
+        return (
+            isinstance(other, GitRepo)
+            and other.url == self.url
+            and other.commit == self.commit
+            and other.init_submodules == self.init_submodules
+        )

     def __hash__(self) -> int:
         """Hash the installable object."""
-        return hash((self.url, self.commit))
+        return hash((self.url, self.commit, self.init_submodules))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/_core/installables.py` around lines 92 - 107, GitRepo's identity currently ignores init_submodules causing different checkouts to be treated equal; update GitRepo.__eq__ and GitRepo.__hash__ to include the init_submodules field (i.e., compare self.init_submodules in __eq__ and include it in the tuple passed to hash), and mirror the same change in any git_repo-derived identity methods so deduping respects recursive submodule initialization.
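The dedup hazard can be shown with a standalone toy class (not CloudAI's actual GitRepo, which is a pydantic model) that reproduces the buggy identity:

```python
# Illustration of the finding: when __eq__/__hash__ ignore init_submodules,
# set-based dedup collapses two installables that need different checkouts,
# so the submodule-initialized variant can silently be dropped.
class Repo:
    def __init__(self, url: str, commit: str, init_submodules: bool = False):
        self.url, self.commit, self.init_submodules = url, commit, init_submodules

    def __eq__(self, other):
        # Buggy identity: init_submodules is not compared.
        return isinstance(other, Repo) and (other.url, other.commit) == (self.url, self.commit)

    def __hash__(self):
        return hash((self.url, self.commit))

plain = Repo("https://example.com/r.git", "abc123")
with_subs = Repo("https://example.com/r.git", "abc123", init_submodules=True)
deduped = {plain, with_subs}  # collapses to a single entry
```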
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py`:
- Around line 427-428: The generated SBATCH use of exclude_nodes isn't
persisted: add an exclude_nodes attribute to the TestRunDetails/data model and
include self.test_run.exclude_nodes when building the stored run spec in
store_test_run() in slurm_command_gen_strategy.py so the saved metadata can
reconstruct allocation requests; update any serialization/deserialization logic
that creates or consumes TestRunDetails (and any callers of store_test_run()) to
read/write this new field and ensure tests or storage schema migrations are
adjusted accordingly.
In `@src/cloudai/systems/slurm/slurm_installer.py`:
- Around line 212-226: The status checks for Git repos currently only verify
existence and commit but not submodule initialization, so update
_is_git_repo_installed() and the GitRepo branch of is_installed_one() to also
validate submodules when item.init_submodules is true: call the same
submodule-check logic used in _install_one_git_repo()/_clone_and_setup_repo()
(or factor that check into a helper) to ensure submodules are initialized and at
the correct state before returning success; if the submodule check fails, return
a failing InstallStatusResult with an explanatory message.
In `@src/cloudai/systems/slurm/slurm_system.py`:
- Around line 758-771: The code collapses the “too few nodes returned” case into
the “empty node list” error; update the logic after parse_nodes so that if
parsed_nodes is non-empty but shorter than the originally requested count you
raise a specific shortfall error instead of the empty-list message. Determine
the requested node count from the incoming nodes specification (using the same
logic you use elsewhere—e.g., the routine that resolves group requests or
get_available_nodes_from_group) and compare it to len(parsed_nodes); if
len(parsed_nodes) < requested_count raise a ValueError indicating the requested
count and the number of nodes actually available (referencing parsed_nodes,
nodes and exclude_nodes), otherwise keep the existing empty-list error path when
parsed_nodes is truly empty.
In `@src/cloudai/workloads/megatron_bridge/megatron_bridge.py`:
- Around line 186-196: The helper _select_megatron_bridge_repo currently returns
the repo unchanged which makes an absent mount_as become a no-op; instead when a
Megatron-Bridge repo is found (match on "Megatron-Bridge" in repo.url or
explicit mount_as path) ensure it will be mounted into the container so the
pinned commit controls runtime by returning a repo object that has mount_as set
to "/opt/Megatron-Bridge" if repo.mount_as is falsy; to avoid mutating callers,
return a shallow copy (or new GitRepo instance) with
mount_as="/opt/Megatron-Bridge" when needed (leave any existing mount_as alone)
so configs that omit mount_as still override the image bundle.
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 107-124: The file has two issues: _container_runtime_env_exports
currently shlex.quotes values (turning shell-expansions like
NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES into literals) and
_parse_srun_args_as_slurm_params silently drops unsupported tokens (short flags
like -x or non-flag tokens). Fix _container_runtime_env_exports to preserve
shell expansion for keys in CONTAINER_RUNTIME_ENV_VARS by not quoting values
that look like shell expressions (e.g., start with $ or contain parameter
expansion) or by validating and raising an error if a value contains shell
metacharacters—update the logic in _container_runtime_env_exports and reference
CONTAINER_RUNTIME_ENV_VARS and final_env_vars. Fix
_parse_srun_args_as_slurm_params to stop silently discarding tokens from
extra_srun_args: enhance its parsing to accept short flags (e.g., -x),
positional/non-flag tokens, and return them unchanged if they don’t match
--key[=value]/--key value/--flag patterns, or at minimum emit a clear
warning/error when tokens would be dropped so the caller can handle them—update
the parsing code in _parse_srun_args_as_slurm_params and how extra_srun_args is
consumed.
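For the quoting half of this finding, a minimal sketch of the "preserve expansion" option (the validate-and-reject alternative is equally valid); the function name is illustrative, not the file's actual helper:

```python
import shlex

def render_export(key: str, value: str) -> str:
    """Quote literals but keep shell parameter expansion working.

    Values containing '$' are wrapped in double quotes so references like
    $CUDA_VISIBLE_DEVICES still expand in the job shell while word splitting
    and globbing stay suppressed; everything else is shlex-quoted as before.
    """
    if "$" in value:
        return f'export {key}="{value}"'
    return f"export {key}={shlex.quote(value)}"
```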
In `@tests/systems/slurm/test_command_gen_strategy.py`:
- Around line 382-383: Replace the weak absence check that uses `assert "#SBATCH
--exclude=" not in content` with a prefix-based check that ensures no line in
the generated `content` starts with the `#SBATCH --exclude` directive; locate
the assertion in the test (in tests/systems/slurm/test_command_gen_strategy.py
where the `content` variable is inspected) and change it to assert that none of
the `content.splitlines()` entries start with the string `#SBATCH --exclude`.
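The suggested prefix-based assertion, shown against a tiny inline stand-in for the generated script (the real test inspects the batch file content it generates):

```python
# Sample sbatch content without an exclude directive; the check must pass
# even if the string "#SBATCH --exclude=" appears nowhere, and must not be
# fooled by the substring occurring mid-line.
content = "#!/bin/bash\n#SBATCH --partition=main\nsrun hostname\n"

assert not any(
    line.lstrip().startswith("#SBATCH --exclude")
    for line in content.splitlines()
)
```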
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 355-357: The test function
test_mount_as_adds_repo_to_container_mounts currently declares an unused
tmp_path fixture parameter; remove tmp_path from the function signature so the
test becomes def test_mount_as_adds_repo_to_container_mounts(self,
configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]) ->
None to eliminate the unused fixture and silence the ARG002 warning. Ensure no
other code in that test references tmp_path before committing.
---
Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Around line 92-107: GitRepo's identity currently ignores init_submodules
causing different checkouts to be treated equal; update GitRepo.__eq__ and
GitRepo.__hash__ to include the init_submodules field (i.e., compare
self.init_submodules in __eq__ and include it in the tuple passed to hash), and
mirror the same change in any git_repo-derived identity methods so deduping
respects recursive submodule initialization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 24836278-8892-45f8-824f-77b1547f8686
📒 Files selected for processing (24)
- conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test_scenario/megatron_bridge_r0.3.0_qwen_30b.toml
- src/cloudai/_core/installables.py
- src/cloudai/_core/test_scenario.py
- src/cloudai/models/scenario.py
- src/cloudai/systems/kubernetes/kubernetes_installer.py
- src/cloudai/systems/slurm/slurm_command_gen_strategy.py
- src/cloudai/systems/slurm/slurm_installer.py
- src/cloudai/systems/slurm/slurm_system.py
- src/cloudai/test_scenario_parser.py
- src/cloudai/workloads/megatron_bridge/megatron_bridge.py
- src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
- tests/conftest.py
- tests/ref_data/megatron-bridge.sbatch
- tests/systems/slurm/test_allocation.py
- tests/systems/slurm/test_command_gen_strategy.py
- tests/systems/slurm/test_system.py
- tests/test_docker_image_cache_manager.py
- tests/test_git_repo_installer.py
- tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
💤 Files with no reviewable changes (3)
- conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
- conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
if self.test_run.exclude_nodes:
    content.append(f"#SBATCH --exclude={','.join(self.test_run.exclude_nodes)}")
Persist exclude_nodes in the stored run spec.
#SBATCH --exclude= only exists in the generated batch script. store_test_run() still dumps the raw srun command, and TestRunDetails has no exclude_nodes field, so the saved run metadata can no longer reconstruct the actual allocation request.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py` around lines 427 -
428, The generated SBATCH use of exclude_nodes isn't persisted: add an
exclude_nodes attribute to the TestRunDetails/data model and include
self.test_run.exclude_nodes when building the stored run spec in
store_test_run() in slurm_command_gen_strategy.py so the saved metadata can
reconstruct allocation requests; update any serialization/deserialization logic
that creates or consumes TestRunDetails (and any callers of store_test_run()) to
read/write this new field and ensure tests or storage schema migrations are
adjusted accordingly.
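The requested persistence amounts to a serializable field that round-trips with the run spec. A stdlib-only sketch (the real TestRunDetails is richer and its serialization layer may differ):

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical minimal model showing exclude_nodes surviving a
# serialize/deserialize round trip of the stored run spec.
@dataclass
class TestRunDetails:
    name: str
    exclude_nodes: list[str] = field(default_factory=list)

details = TestRunDetails(name="demo", exclude_nodes=["node-[101-104]"])
restored = TestRunDetails(**json.loads(json.dumps(asdict(details))))
```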
if item.init_submodules:
    res = self._init_submodules(repo_path)
    if not res.success:
        return res
item.installed_path = repo_path
msg = f"Git repository already exists at {repo_path}."
logging.debug(msg)
return InstallStatusResult(True, msg)

res = self._clone_and_setup_repo(item, repo_path)
if not res.success:
    return res

item.installed_path = repo_path
return InstallStatusResult(True)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Base installer orchestration =="
BASE_INSTALLER_FILE="$(fd 'base_installer\.py$' src | head -n 1 || true)"
if [ -n "${BASE_INSTALLER_FILE}" ]; then
sed -n '1,260p' "${BASE_INSTALLER_FILE}"
fi
echo
echo "== Git repo installed checks =="
rg -n -C3 'def (_is_git_repo_installed|is_installed_one)|is_installed_one\(' \
src/cloudai/systems/slurm/slurm_installer.py \
  src/cloudai/systems/kubernetes/kubernetes_installer.py
Repository: NVIDIA/cloudai
Length of output: 13058
🏁 Script executed:
# Get full implementation of is_installed_one and _is_git_repo_installed in slurm_installer.py
echo "=== slurm_installer.py: is_installed_one() ==="
sed -n '139,170p' src/cloudai/systems/slurm/slurm_installer.py
echo
echo "=== slurm_installer.py: _is_git_repo_installed() ==="
sed -n '446,485p' src/cloudai/systems/slurm/slurm_installer.py
echo
echo "=== kubernetes_installer.py: is_installed_one() ==="
sed -n '110,145p' src/cloudai/systems/kubernetes/kubernetes_installer.py
echo
echo "=== Check if init_submodules is validated in status checks ==="
rg -n 'init_submodules' src/cloudai/systems/slurm/slurm_installer.py src/cloudai/systems/kubernetes/kubernetes_installer.py
Repository: NVIDIA/cloudai
Length of output: 6552
🏁 Script executed:
# Verify if _verify_commit checks submodules state
echo "=== _verify_commit implementation ==="
rg -n -A10 'def _verify_commit' src/cloudai/systems/slurm/slurm_installer.py | head -20
Repository: NVIDIA/cloudai
Length of output: 700
Add init_submodules validation to status check methods.
The _is_git_repo_installed() method (and is_installed_one() in Kubernetes) validates only repository existence and commit, but does not check if submodules are initialized when item.init_submodules is true. While the current orchestration doesn't pre-check status before installing, the incomplete status API creates a correctness issue: external calls to is_installed() or future optimizations that check status first will incorrectly report a repository as installed if it exists at the correct commit but lacks initialized submodules.
Update _is_git_repo_installed() and the Kubernetes is_installed_one() GitRepo branch to verify submodules state when applicable, similar to the checks already present in _install_one_git_repo() and _clone_and_setup_repo().
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 218-218: debug() call on root logger
(LOG015)
[warning] 219-219: Boolean positional value in function call
(FBT003)
[warning] 226-226: Boolean positional value in function call
(FBT003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/systems/slurm/slurm_installer.py` around lines 212 - 226, The
status checks for Git repos currently only verify existence and commit but not
submodule initialization, so update _is_git_repo_installed() and the GitRepo
branch of is_installed_one() to also validate submodules when
item.init_submodules is true: call the same submodule-check logic used in
_install_one_git_repo()/_clone_and_setup_repo() (or factor that check into a
helper) to ensure submodules are initialized and at the correct state before
returning success; if the submodule check fails, return a failing
InstallStatusResult with an explanatory message.
parsed_nodes = self.parse_nodes(nodes, exclude_nodes=exclude_nodes)
if parsed_nodes:
    num_nodes = len(parsed_nodes)
    node_list = parsed_nodes
elif nodes:
    reason = (
        f"after excluding nodes {exclude_nodes}"
        if exclude_nodes
        else "— no nodes are available (all may be DRAIN/DOWN)"
    )
    raise ValueError(
        f"Node specifications {nodes} resolved to an empty node list {reason}. "
        "Cannot fall back to unconstrained allocation."
    )
Don't collapse “too few nodes” into “empty node list.”
Lines 762-771 treat any empty parsed_nodes result as if nothing matched, but get_available_nodes_from_group() also returns [] when only part of the request can be satisfied. A request like 4 nodes from a 3-node remainder now raises “resolved to an empty node list” instead of reporting the actual shortfall.
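A sketch of keeping the two failure modes distinct; requested_count resolution and the error wording are illustrative, not CloudAI's actual API:

```python
def check_allocation(parsed_nodes: list[str], requested_count: int, spec: str) -> None:
    """Distinguish 'nothing matched' from 'matched, but too few nodes'."""
    if not parsed_nodes:
        raise ValueError(f"Node specification {spec!r} resolved to an empty node list.")
    if len(parsed_nodes) < requested_count:
        # Report the actual shortfall instead of collapsing into the empty case.
        raise ValueError(
            f"Requested {requested_count} nodes via {spec!r} but only "
            f"{len(parsed_nodes)} are available: {parsed_nodes}"
        )
```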
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 768-771: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/systems/slurm/slurm_system.py` around lines 758 - 771, The code
collapses the “too few nodes returned” case into the “empty node list” error;
update the logic after parse_nodes so that if parsed_nodes is non-empty but
shorter than the originally requested count you raise a specific shortfall error
instead of the empty-list message. Determine the requested node count from the
incoming nodes specification (using the same logic you use elsewhere—e.g., the
routine that resolves group requests or get_available_nodes_from_group) and
compare it to len(parsed_nodes); if len(parsed_nodes) < requested_count raise a
ValueError indicating the requested count and the number of nodes actually
available (referencing parsed_nodes, nodes and exclude_nodes), otherwise keep
the existing empty-list error path when parsed_nodes is truly empty.
def _select_megatron_bridge_repo(git_repos: list[GitRepo]) -> GitRepo | None:
-    """Return the Megatron-Bridge repo from `git_repos` (normalized to mount_as=/opt/Megatron-Bridge)."""
+    """
+    Return the Megatron-Bridge repo from `git_repos`.
+
+    When the user sets ``mount_as`` (e.g. ``/opt/Megatron-Bridge``), the installed clone will be bind-mounted
+    into the container at that path, overriding whatever the container image ships. When ``mount_as`` is *not*
+    set the container's built-in ``/opt/Megatron-Bridge`` is used.
+    """
    for repo in git_repos:
        if "Megatron-Bridge" in repo.url or (repo.mount_as or "").rstrip("/") == "/opt/Megatron-Bridge":
-            return repo if repo.mount_as else repo.model_copy(update={"mount_as": "/opt/Megatron-Bridge"})
+            return repo
Don't make the repo pin a no-op.
Returning the repo unchanged here means configs that omit mount_as no longer override /opt/Megatron-Bridge inside the container. The job now uses the image-bundled checkout, while [[git_repos]].commit is still required and cloned locally, so the declared pin no longer controls the runtime version. Either keep the implicit mount or stop requiring/installing the repo when mount_as is unset.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/megatron_bridge/megatron_bridge.py` around lines 186 -
196, The helper _select_megatron_bridge_repo currently returns the repo
unchanged which makes an absent mount_as become a no-op; instead when a
Megatron-Bridge repo is found (match on "Megatron-Bridge" in repo.url or
explicit mount_as path) ensure it will be mounted into the container so the
pinned commit controls runtime by returning a repo object that has mount_as set
to "/opt/Megatron-Bridge" if repo.mount_as is falsy; to avoid mutating callers,
return a shallow copy (or new GitRepo instance) with
mount_as="/opt/Megatron-Bridge" when needed (leave any existing mount_as alone)
so configs that omit mount_as still override the image bundle.
def _container_runtime_env_exports(self) -> list[str]:
    """
    Build ``export`` lines for container-runtime env vars.

    Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
    are consumed by the NVIDIA container toolkit / enroot at container-creation
    time to decide which devices to mount. They must be present in the process
    environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
    Slurm inherits them into the job and ``srun`` passes them to the container
    runtime. Exporting them in the wrapper script (which runs on the submit
    node) achieves this. The same variables are still passed via ``-cb`` as
    well, so they are also set inside the container for any runtime readers.
    """
    lines: list[str] = []
    for key, value in sorted(self.final_env_vars.items()):
        if key in self.CONTAINER_RUNTIME_ENV_VARS:
            lines.append(f"export {key}={shlex.quote(str(value))}")
    return lines
❓ Verification inconclusive
Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
import shlex
import subprocess
value = "$CUDA_VISIBLE_DEVICES"
rendered = f"export NVIDIA_VISIBLE_DEVICES={shlex.quote(value)}; printf '%s\\n' \"$NVIDIA_VISIBLE_DEVICES\""
print("Generated wrapper line:", rendered)
literal = subprocess.run(
["bash", "-lc", f"CUDA_VISIBLE_DEVICES=0,1; {rendered}"],
check=True,
capture_output=True,
text=True,
)
expanded = subprocess.run(
[
"bash",
"-lc",
'CUDA_VISIBLE_DEVICES=0,1; export NVIDIA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES"; printf "%s\\n" "$NVIDIA_VISIBLE_DEVICES"',
],
check=True,
capture_output=True,
text=True,
)
print("Current result :", literal.stdout.strip())
print("Expanded result:", expanded.stdout.strip())
PY
Repository: NVIDIA/cloudai
Exit code: 0
stdout:
Generated wrapper line: export NVIDIA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'; printf '%s\n' "$NVIDIA_VISIBLE_DEVICES"
Current result : $CUDA_VISIBLE_DEVICES
Expanded result: 0,1
Script executed:
#!/bin/bash
# First, find and read the exact file to verify the implementation
if [ -f "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" ]; then
echo "=== Lines 107-124 (container_runtime_env_exports) ==="
sed -n '107,124p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"
echo -e "\n=== Checking for _build_custom_bash_env_exports context ==="
# Find where _build_custom_bash_env_exports is defined and show relevant lines
rg -n "_build_custom_bash_env_exports" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -A 20 | head -40
echo -e "\n=== Checking for _parse_srun_args_as_slurm_params ==="
rg -n "_parse_srun_args_as_slurm_params" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -A 15 | head -40
echo -e "\n=== Checking for CONTAINER_RUNTIME_ENV_VARS definition ==="
rg -n "CONTAINER_RUNTIME_ENV_VARS" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" -B 2 -A 5
fi
Repository: NVIDIA/cloudai
Exit code: 0
stdout:
=== Lines 107-124 (container_runtime_env_exports) ===
def _container_runtime_env_exports(self) -> list[str]:
"""
Build ``export`` lines for container-runtime env vars.
Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
are consumed by the NVIDIA container toolkit / enroot at container-creation
time to decide which devices to mount. They must be present in the process
environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
Slurm inherits them into the job and ``srun`` passes them to the container
runtime. Exporting them in the wrapper script (which runs on the submit
node) achieves this. The same variables are still passed via ``-cb`` as
well, so they are also set inside the container for any runtime readers.
"""
lines: list[str] = []
for key, value in sorted(self.final_env_vars.items()):
if key in self.CONTAINER_RUNTIME_ENV_VARS:
lines.append(f"export {key}={shlex.quote(str(value))}")
return lines
=== Checking for _build_custom_bash_env_exports context ===
95: def _build_custom_bash_env_exports(self) -> list[str]:
96- """
97- Build repeated -cb entries that export env vars inside the launched Slurm job shell.
98-
99- We quote each full `export KEY=value` command so `$SLURM_*` and commas survive
100- argument parsing on the submit node and are expanded/interpreted in the job shell.
101- """
102- exports: list[str] = []
103- for key, value in sorted(self.final_env_vars.items()):
104- exports.extend(["-cb", shlex.quote(f"export {key}={value}")])
105- return exports
106-
107- def _container_runtime_env_exports(self) -> list[str]:
108- """
109- Build ``export`` lines for container-runtime env vars.
110-
111- Variables like ``MELLANOX_VISIBLE_DEVICES`` and ``NVIDIA_VISIBLE_DEVICES``
112- are consumed by the NVIDIA container toolkit / enroot at container-creation
113- time to decide which devices to mount. They must be present in the process
114- environment **before** the Megatron-Bridge launcher calls ``sbatch`` so that
115- Slurm inherits them into the job and ``srun`` passes them to the container
--
353: parts.extend(self._build_custom_bash_env_exports())
354-
355- # Model flags (Megatron-Bridge main-branch API)
356- add_field("domain", "--domain", args.domain)
357- if args.use_recipes and "use_recipes" in fields_set:
358- parts.append("--use_recipes")
359- if "enable_vboost" in fields_set:
360- add_field("enable_vboost", "-vb", bool(args.enable_vboost))
361- if not args.model_family_name:
362- raise RuntimeError("Missing required cmd_args.model_family_name (maps to -m/--model_family_name).")
363- if not args.model_recipe_name:
364- raise RuntimeError("Missing required cmd_args.model_recipe_name (maps to -mr/--model_recipe_name).")
365- add_field("model_family_name", "-m", args.model_family_name)
366- add_field("model_recipe_name", "-mr", args.model_recipe_name)
367- add_field("hidden_size", "--hidden_size", args.hidden_size)
368- add_field("num_layers", "--num_layers", args.num_layers)
369- add_field(
370- "pipeline_model_parallel_layout", "--pipeline_model_parallel_layout", args.pipeline_model_parallel_layout
=== Checking for _parse_srun_args_as_slurm_params ===
139: def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
140- """
141- Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
142-
143- Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
144- key names without a ``=value`` suffix.
145- """
146- params: list[str] = []
147- tokens = shlex.split(srun_args)
148- i = 0
149- while i < len(tokens):
150- tok = tokens[i]
151- if tok.startswith("--") and "=" in tok:
152- key, val = tok[2:].split("=", 1)
153- params.append(f"{key}={val}")
154- elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
--
471: additional_slurm_params.extend(self._parse_srun_args_as_slurm_params(source))
472-
473- if additional_slurm_params:
474- parts.extend(["--additional_slurm_params", shlex.quote(";".join(additional_slurm_params))])
475-
476- # Config variant
477- add_field("config_variant", "-cv", args.config_variant)
478- if args.list_config_variants and "list_config_variants" in fields_set:
479- parts.append("--list_config_variants")
480-
481- # Extra user args (dict -> string)
482- if tdef.extra_cmd_args:
483- parts.append(tdef.extra_args_str)
484-
485- return parts
=== Checking for CONTAINER_RUNTIME_ENV_VARS definition ===
38- """
39-
40: CONTAINER_RUNTIME_ENV_VARS: frozenset[str] = frozenset(
41- {
42- "MELLANOX_VISIBLE_DEVICES",
43- "NVIDIA_VISIBLE_DEVICES",
44- "NVIDIA_DRIVER_CAPABILITIES",
45- }
--
120- lines: list[str] = []
121- for key, value in sorted(self.final_env_vars.items()):
122: if key in self.CONTAINER_RUNTIME_ENV_VARS:
123- lines.append(f"export {key}={shlex.quote(str(value))}")
124- return lines
125-
126- def _normalize_recompute_modules(self, val: Any) -> str:
127- if isinstance(val, list):
Script executed:
#!/bin/bash
# Check the file size and structure
echo "=== File structure ==="
wc -l "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"
# Get a broader view of the class to understand context
echo -e "\n=== Class structure and method definitions ==="
rg -n "^\s*(def |class )" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai
Exit code: 0
stdout:
=== File structure ===
485 src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
=== Class structure and method definitions ===
33:class MegatronBridgeSlurmCommandGenStrategy(SlurmCommandGenStrategy):
48: def _container_mounts(self) -> list[str]:
52: def gen_exec_command(self) -> str:
83: def store_test_run(self) -> None:
89: def _write_command_to_file(self, command: str, output_path: Path) -> None:
95: def _build_custom_bash_env_exports(self) -> list[str]:
107: def _container_runtime_env_exports(self) -> list[str]:
126: def _normalize_recompute_modules(self, val: Any) -> str:
139: def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
162: def _normalize_cuda_graph_scope_arg(self, val: Any) -> str:
169: def _wrap_launcher_for_job_id_and_quiet_output(self, launcher_cmd: str, launcher_python: str) -> str:
240: def _list_or_comma_str(self, val: str | list[str] | None) -> Optional[str]:
249: def _build_launcher_parts( # noqa: C901
265: def _installed_container_path() -> str:
298: def add(flag: str, value: Any) -> None:
317: def add_field(field: str, flag: str, value: Any) -> None:
Script executed:
#!/bin/bash
# Get the full _parse_srun_args_as_slurm_params method
sed -n '139,160p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai
Exit code: 0
stdout:
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
"""
Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
key names without a ``=value`` suffix.
"""
params: list[str] = []
tokens = shlex.split(srun_args)
i = 0
while i < len(tokens):
tok = tokens[i]
if tok.startswith("--") and "=" in tok:
key, val = tok[2:].split("=", 1)
params.append(f"{key}={val}")
elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
params.append(f"{tok[2:]}={tokens[i + 1]}")
i += 1
elif tok.startswith("--"):
params.append(tok[2:])
i += 1
return params
Don't turn container-runtime env references into literals.
_build_custom_bash_env_exports() keeps env values shell-interpretable by quoting the entire export statement, but _container_runtime_env_exports() wraps the value itself with shlex.quote(). A config like NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES will be exported as a literal dollar string before sbatch, and the later -cb export cannot fix that because the container runtime consumes it at container-creation time. Preserve shell expansion here, or explicitly reject shell expressions for these keys.
Additionally, _parse_srun_args_as_slurm_params() silently drops any extra_srun_args tokens that don't match its three supported patterns (--key=value, --key value, --flag). Short-form flags like -x and non-flag tokens disappear without warning, risking loss of scheduler-affecting options.
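The quoting mismatch described above can be reproduced in isolation. This sketch assumes a config value of `$CUDA_VISIBLE_DEVICES` (hypothetical here) and contrasts quoting the value with leaving it shell-interpretable:

```python
import shlex

# Hypothetical env value from a user config; the "$CUDA_VISIBLE_DEVICES"
# reference is meant to be expanded by the shell at sbatch time.
value = "$CUDA_VISIBLE_DEVICES"

# Quoting the value itself (as _container_runtime_env_exports() does) wraps it
# in single quotes, so the shell exports a literal dollar string:
quoted_export = f"export NVIDIA_VISIBLE_DEVICES={shlex.quote(value)}"
print(quoted_export)  # export NVIDIA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'

# Leaving the value unquoted preserves the reference for expansion when the
# generated script runs:
unquoted_export = f"export NVIDIA_VISIBLE_DEVICES={value}"
print(unquoted_export)  # export NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
```

Since the container runtime reads these variables at container-creation time, the literal form cannot be recovered by a later export inside the job.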
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 107 - 124, The file has two issues: _container_runtime_env_exports
currently shlex.quotes values (turning shell-expansions like
NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES into literals) and
_parse_srun_args_as_slurm_params silently drops unsupported tokens (short flags
like -x or non-flag tokens). Fix _container_runtime_env_exports to preserve
shell expansion for keys in CONTAINER_RUNTIME_ENV_VARS by not quoting values
that look like shell expressions (e.g., start with $ or contain parameter
expansion) or by validating and raising an error if a value contains shell
metacharacters—update the logic in _container_runtime_env_exports and reference
CONTAINER_RUNTIME_ENV_VARS and final_env_vars. Fix
_parse_srun_args_as_slurm_params to stop silently discarding tokens from
extra_srun_args: enhance its parsing to accept short flags (e.g., -x),
positional/non-flag tokens, and return them unchanged if they don’t match
--key[=value]/--key value/--flag patterns, or at minimum emit a clear
warning/error when tokens would be dropped so the caller can handle them—update
the parsing code in _parse_srun_args_as_slurm_params and how extra_srun_args is
consumed.
```python
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
    """
    Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.

    Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
    key names without a ``=value`` suffix.
    """
    params: list[str] = []
    tokens = shlex.split(srun_args)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)
            params.append(f"{key}={val}")
        elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            params.append(f"{tok[2:]}={tokens[i + 1]}")
            i += 1
        elif tok.startswith("--"):
            params.append(tok[2:])
        i += 1
    return params
```
Reject unsupported extra_srun_args forms instead of dropping them.
This parser only recognizes long-form --... tokens. Valid Slurm inputs like -A acct or -w node01 — or even a typo missing the second dash — fall through the loop and disappear from --additional_slurm_params with no signal. That can change scheduler placement silently. Either parse short options explicitly or raise on token shapes you do not support.
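The silent drop is easy to demonstrate. The sketch below copies the parser logic from the method under review into a standalone function so the behavior can be observed directly:

```python
import shlex


def parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
    # Same logic as the method under review: only long-form "--" tokens are
    # recognized; anything else falls through the loop without a warning.
    params: list[str] = []
    tokens = shlex.split(srun_args)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)
            params.append(f"{key}={val}")
        elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            params.append(f"{tok[2:]}={tokens[i + 1]}")
            i += 1
        elif tok.startswith("--"):
            params.append(tok[2:])
        i += 1
    return params


# Short-form options vanish without any signal:
print(parse_srun_args_as_slurm_params("-A acct --exclusive"))  # ['exclusive']

# Long forms are kept in both spellings:
print(parse_srun_args_as_slurm_params("--mpi pmix --gres=gpu:8"))  # ['mpi=pmix', 'gres=gpu:8']
```

Here `-A acct` is discarded entirely, which is exactly the kind of scheduler-affecting loss the review flags.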
🛡️ Minimal fail-fast fix

```diff
     while i < len(tokens):
         tok = tokens[i]
         if tok.startswith("--") and "=" in tok:
             key, val = tok[2:].split("=", 1)
             params.append(f"{key}={val}")
         elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
             params.append(f"{tok[2:]}={tokens[i + 1]}")
             i += 1
         elif tok.startswith("--"):
             params.append(tok[2:])
+        else:
+            raise ValueError(
+                f"Unsupported extra_srun_args token {tok!r}; "
+                "use --key=value, --key value, or bare --flag forms."
+            )
         i += 1
     return params
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
@staticmethod
def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
    """
    Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.

    Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
    key names without a ``=value`` suffix.
    """
    params: list[str] = []
    tokens = shlex.split(srun_args)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)
            params.append(f"{key}={val}")
        elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            params.append(f"{tok[2:]}={tokens[i + 1]}")
            i += 1
        elif tok.startswith("--"):
            params.append(tok[2:])
        else:
            raise ValueError(
                f"Unsupported extra_srun_args token {tok!r}; "
                "use --key=value, --key value, or bare --flag forms."
            )
        i += 1
    return params
```
```python
assert "#SBATCH --nodelist=node3,node4" in content
assert "#SBATCH --exclude=" not in content
```
Use a prefix check for the missing --exclude assertion.
Line 383 is testing exact list membership, not absence of an exclude directive. If content contains #SBATCH --exclude=node01,node02, this assertion still passes.
🔧 Proposed fix

```diff
 assert "#SBATCH --nodelist=node3,node4" in content
-assert "#SBATCH --exclude=" not in content
+assert not any(line.startswith("#SBATCH --exclude=") for line in content)
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
assert "#SBATCH --nodelist=node3,node4" in content
assert not any(line.startswith("#SBATCH --exclude=") for line in content)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/systems/slurm/test_command_gen_strategy.py` around lines 382 - 383,
Replace the weak absence check that uses `assert "#SBATCH --exclude=" not in
content` with a prefix-based check that ensures no line in the generated
`content` starts with the `#SBATCH --exclude` directive; locate the assertion in
the test (in tests/systems/slurm/test_command_gen_strategy.py where the
`content` variable is inspected) and change it to assert that none of the
`content.splitlines()` entries start with the string `#SBATCH --exclude`.
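The false negative the comment describes comes from `in` performing whole-line list membership rather than a substring or prefix match. A minimal sketch, with a hypothetical exclude line in `content`:

```python
# content mirrors how the test reads the generated sbatch script: a list of
# lines. The exclude directive here is hypothetical.
content = [
    "#SBATCH --nodelist=node3,node4",
    "#SBATCH --exclude=node01,node02",
]

# List membership compares whole lines, so the "absence" check passes even
# though an exclude directive is present:
print("#SBATCH --exclude=" not in content)  # True (false negative)

# A prefix check over each line catches it:
print(any(line.startswith("#SBATCH --exclude=") for line in content))  # True
```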
```python
def test_mount_as_adds_repo_to_container_mounts(
    self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
) -> None:
```
🧹 Nitpick | 🔵 Trivial
Remove the unused tmp_path fixture from this test signature.
tmp_path is never read here, so the fixture just adds setup noise and keeps Ruff's ARG002 warning alive.
♻️ Proposed cleanup

```diff
-    def test_mount_as_adds_repo_to_container_mounts(
-        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
-    ) -> None:
+    def test_mount_as_adds_repo_to_container_mounts(
+        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
+    ) -> None:
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def test_mount_as_adds_repo_to_container_mounts(
    self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
) -> None:
```
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 356-356: Unused method argument: tmp_path
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 355 - 357, The test function test_mount_as_adds_repo_to_container_mounts
currently declares an unused tmp_path fixture parameter; remove tmp_path from
the function signature so the test becomes def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None to eliminate the
unused fixture and silence the ARG002 warning. Ensure no other code in that test
references tmp_path before committing.
Summary
Do not run the CommandShell check during object creation. This addresses an issue with building documentation on Windows.
Test Plan
Additional Notes
–