Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- **`MAD_GUEST_OS` in container env**: `container_runner` now exports the run's `guest_os` as `MAD_GUEST_OS` so in-container pre-scripts (notably `run_rocenv_tool.sh`) can select the correct package manager without re-detecting from `/etc/os-release`.

- **K8s `storage_class` field**: New generic `storage_class` key in the K8s preset defaults (`src/madengine/deployment/presets/k8s/defaults.json`). It is the broadest fallback for both the data PVC and the single-node results PVC, behind the more specific `data_storage_class` / `nfs_storage_class` and `single_node_results_storage_class` / `local_path_storage_class` keys. The legacy `local_path_storage_class` key continues to be honoured for backward compatibility. **Default change**: the bundled preset now sets `storage_class: "nfs-banff"` in place of `local_path_storage_class: "local-path"`, so out-of-the-box single-node results PVCs land on the NFS class instead of `local-path`. Clusters that still want local-path should set `"local_path_storage_class": "local-path"` (or `"single_node_results_storage_class": "local-path"`) in `--additional-context`. See [K8s storage classes](examples/k8s-configs/README.md).

### Changed

- **Profiling**: `rocm_trace_lite` now sets `RTL_MODE=lite` explicitly; added tool `rocm_trace_lite_default` with `RTL_MODE=default` for A/B overhead comparison. `rtl_trace_wrapper.sh` passes `rtl trace --mode …` when `RTL_MODE` is set.
Expand All @@ -21,6 +23,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- **`run_rocenv_tool.sh` argument signature**: now accepts `<output_basename> <rocenv_mode> <guest_os>` (was just `<output_basename>`). `rocenv_mode` and `guest_os` default to `lite` and `UBUNTU` respectively when omitted, so existing direct callers remain functional. `container_runner` and the K8s scripts mixin pass all three.

- **Pytest configuration consolidation**: pytest settings now live solely in `[tool.pytest.ini_options]` in `pyproject.toml`; the redundant `pytest.ini` was removed. `tests/conftest.py` lost its `sys.path` hack and duplicate marker registration (markers are declared in `pyproject.toml`). The `minversion` value was also corrected from `"3.8"` (which was the Python version) to `"7.0"` — the actual pytest floor required for the `pythonpath` option used by the config.

### Fixed

- **K8s collector pod name mismatch**: The cleanup code in `kubernetes.py` used the full job name (`collector-{job_name}`) while the creation code in `k8s_results.py` truncated it (`collector-{deployment_id[:15]}`). For any job name longer than 15 characters (i.e. virtually all real jobs), cleanup would fail to delete the collector pod, leaving it running and potentially blocking PVC deletion on the next deploy. Extracted a shared `collector_pod_name()` helper so both sites use the same truncated name.
Expand All @@ -31,12 +35,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- **RPD pre-script: failed as root with no `sudo`**: the install path used `sudo apt`/`sudo yum` unconditionally, which is missing in many CI containers running as `root`. `trace.sh` now branches on `id -u` — direct `apt-get`/`yum` when root, `sudo` otherwise — and adds the build deps the upstream Makefile expects (`git`, `build-essential`, `pkg-config` on Ubuntu; `gcc`, `gcc-c++`, `make`, `git` on CentOS).

- **`TypeError` on restricted ROCm < 6.4.1 systems**: `Context` assumed every `/dev/dri/renderD*` entry exposed a non-`None` `kfd_renderDs` value. On restricted hosts (ROCm < 6.4.1, certain VFIO/passthrough setups) this returned `None` and crashed downstream consumers. `core/context.py` now guards the iteration so missing/`None` entries are skipped instead of raising.

- **Deployment monitor infinite loop on cancelled jobs**: `BaseDeployment._monitor_job` treated only `COMPLETED`/`FAILED` as terminal, so a `CANCELLED` job (manual `scancel`, K8s job deletion, etc.) would loop forever waiting for a state that never arrived. `CANCELLED` is now in the terminal-state set in `deployment/base.py`.

### Security

- **Shell injection hardening (extended)**: `shlex.quote()` is now applied to every shell interpolation of a user-controlled value across `core/docker.py`, `execution/container_runner.py`, `execution/docker_builder.py`, and `orchestration/run_orchestrator.py` (image names, paths, container names, build-args). A follow-up pass closed the last remaining sites in `docker_builder.py` (`grep`, `docker manifest inspect`, `docker tag`, `docker push`, `head`). This is a defence-in-depth extension of the v2.0.2 build-arg quoting work — values that flow through `--additional-context`, model configs, or registry credentials can no longer break out of the shell command they are embedded in.

### Tests

- **Dummy `dummy_rocenv_full` fixture**: new Dockerfile installs `lshw`, `dmidecode`, `kmod`, and `util-linux` so e2e tests can exercise rocenv full mode end-to-end inside the container.

- **RCCL profiling e2e stabilization**: `tests/fixtures/dummy/scripts/dummy/run_nccl_trace.sh` now pins `HIP_VISIBLE_DEVICES`/`NCCL_IB_DISABLE`/`NCCL_SOCKET_IFNAME` defaults to avoid topology-detection hangs in CI. The `rccl_trace` log assertion in `tests/e2e/test_profiling_workflows.py` was relaxed for minor NCCL log-format drift.

- **New `test_shell_quoting.py`**: 11-test suite covering the shell-quoting behaviour described above end-to-end across `docker.py`, `container_runner.py`, `docker_builder.py`, and `run_orchestrator.py`. Includes regression coverage for spaces, `$`, backticks, command-substitution, and quote characters in interpolated values.

- **Test isolation fix**: `tests/unit/test_error_handling.py` was leaking the global error-handler state across tests, so test order could mask or fabricate failures. The handler is now reset around the affected tests.

### Known Issues

- **K8s multi-node: node reported as FAILED due to log collection error**: In multi-node Kubernetes jobs, a node may be reported as `FAILED` in the results table even though the pod completed successfully (`Status: Succeeded`). This happens when the kubelet on the node becomes unreachable (502 Bad Gateway) between job completion and log collection — madengine cannot retrieve stdout logs and therefore cannot parse performance metrics for that node. The PVC artifacts are still collected. Check `kubectl describe pod <pod>` to confirm the pod actually succeeded; the issue is infrastructure-level (kubelet/API server), not a workload failure.
Expand Down
15 changes: 9 additions & 6 deletions examples/k8s-configs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -405,10 +405,12 @@ madengine separates **per-job results** from **long-lived shared data**:

| Volume | Typical use | Single-node (`nnodes: 1`) | Multi-node (`nnodes > 1`) |
|--------|-------------|----------------------------|----------------------------|
| **`{job}-results`** | Benchmark artifacts (`/results`) | **RWO** — `local_path_storage_class` or `single_node_results_storage_class` (e.g. `local-path`) | **RWX** — `nfs_storage_class` or `multi_node_results_storage_class` (e.g. `nfs-banff`) |
| **`{job}-results`** | Benchmark artifacts (`/results`) | **RWO** — `single_node_results_storage_class` → `local_path_storage_class` → `storage_class` (e.g. `local-path` or `nfs-banff`) | **RWX** — `multi_node_results_storage_class` → `nfs_storage_class` → `storage_class` (e.g. `nfs-banff`) |
| **`madengine-shared-data`** | Dataset cache (`/data`) | **RWX** — always `ReadWriteMany` + NFS class | Same PVC |

**Built-in defaults (Banff-oriented)** are in `presets/k8s/defaults.json`: `nfs_storage_class` / `data_storage_class` → `nfs-banff`, `local_path_storage_class` → `local-path`, `recreate_shared_data_pvc` → `false`. You do not need to set these unless you use another cluster — then override in additional context.
**Built-in defaults (Banff-oriented)** are in `presets/k8s/defaults.json`: `nfs_storage_class` / `data_storage_class` → `nfs-banff`, generic `storage_class` → `nfs-banff` (broad fallback for both data and single-node results PVCs), `recreate_shared_data_pvc` → `false`. The legacy `local_path_storage_class` key is still honoured as a single-node-results fallback for backward compatibility but is no longer set in the preset. You do not need to set any of these unless you use another cluster — then override in additional context.

> **2.0.3 default change:** Before 2.0.3 the preset set `local_path_storage_class: "local-path"`, so single-node results PVCs landed on `local-path` by default. The preset now sets `storage_class: "nfs-banff"` instead, so both the data PVC and the single-node results PVC default to `nfs-banff` unless you override. If you actually want `local-path` for single-node results, set `"local_path_storage_class": "local-path"` (or `"single_node_results_storage_class": "local-path"`) in your `--additional-context`.

Example override for a different cluster:

Expand All @@ -423,7 +425,8 @@ Example override for a different cluster:
```

- **`nfs_storage_class`**: RWX class (e.g. `nfs-banff`) — used for shared-data (with `data_storage_class`) and multi-node results unless overridden.
- **`local_path_storage_class`**: RWO class for **single-node only** results PVC.
- **`local_path_storage_class`**: RWO class for **single-node only** results PVC. Still accepted for backward compatibility; new configs should prefer `single_node_results_storage_class` or rely on `storage_class`.
- **`storage_class`**: Generic broad fallback used for **both** the data PVC and single-node results PVC when no more-specific key is set. Added in 2.0.3.
- **`data_storage_class`**: Optional override for `madengine-shared-data` only (defaults to `nfs_storage_class` then `storage_class`).
- **`single_node_results_storage_class`** / **`multi_node_results_storage_class`**: Optional fine-grained overrides for results PVCs.
- **`recreate_shared_data_pvc`**: If `true`, deletes existing `madengine-shared-data` before create (**destroys data** — backup first). Use when migrating from RWO `local-path` to RWX NFS.
Expand Down Expand Up @@ -563,11 +566,11 @@ To use an existing PVC instead of auto-creation:
|-------|------|---------|-------------|
| `data_pvc` | string | `null` | Data PVC name (auto-created if using data provider) |
| `results_pvc` | string | `null` | Results PVC name (auto-created by default) |
| `storage_class` | string | `null` | Optional fallback if the keys below are unset |
| `storage_class` | string | **`nfs-banff`** (preset, since 2.0.3) | Generic broad fallback for both the data PVC and the single-node results PVC when no more-specific key is set |
| `nfs_storage_class` | string | **`nfs-banff`** (preset) | RWX class for shared-data / multi-node results |
| `local_path_storage_class` | string | **`local-path`** (preset) | RWO class for single-node `{job}-results` |
| `local_path_storage_class` | string | `null` (not in preset since 2.0.3; was **`local-path`** in ≤ 2.0.2) | Optional RWO class for single-node `{job}-results`. Still honoured for backward compatibility |
| `data_storage_class` | string | **`nfs-banff`** (preset) | Overrides SC for shared-data only |
| `single_node_results_storage_class` | string | `null` | Overrides single-node results SC (`local_path_storage_class` if unset) |
| `single_node_results_storage_class` | string | `null` | Overrides single-node results SC (falls back to `local_path_storage_class`, then `storage_class`) |
| `multi_node_results_storage_class` | string | `null` | Overrides multi-node results SC (`nfs_storage_class` if unset) |
| `recreate_shared_data_pvc` | boolean | **`false`** (preset) | If `true`, delete `madengine-shared-data` before create (data loss) |

Expand Down
23 changes: 18 additions & 5 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -149,12 +149,25 @@ module = [
ignore_missing_imports = true

[tool.pytest.ini_options]
testpaths = ["tests"]
python_paths = ["src"]
addopts = "-v --tb=short"
testpaths = ["tests/unit", "tests/integration", "tests/e2e"]
pythonpath = ["src"]
addopts = "-v --tb=short -ra --strict-markers -W default"
minversion = "7.0"
filterwarnings = [
"ignore::DeprecationWarning",
"ignore::PendingDeprecationWarning",
]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"integration: marks tests as integration tests",
"unit: Fast unit tests (no external dependencies)",
"integration: Integration tests (may be slower, test multiple components)",
"e2e: End-to-end tests (require full environment, Docker, may be very slow)",
"slow: Slow tests (can be skipped with -m \"not slow\")",
"gpu: Tests that require GPU hardware",
"amd: Tests specific to AMD GPUs",
"nvidia: Tests specific to NVIDIA GPUs",
"cpu: Tests for CPU-only execution",
"requires_docker: Tests that require Docker daemon",
"requires_models: Tests that require model fixtures",
]

[tool.coverage.run]
Expand Down
85 changes: 0 additions & 85 deletions pytest.ini

This file was deleted.

6 changes: 6 additions & 0 deletions src/madengine/core/context.py
Original file line number Diff line number Diff line change
Expand Up @@ -758,6 +758,12 @@ def get_gpu_renderD_nodes(self) -> typing.Optional[typing.List[int]]:
print(f"Warning: Failed to parse unique_id from line '{item}': {e}")
continue

if kfd_renderDs is None:
raise RuntimeError(
"KFD topology not accessible and required for ROCm < 6.4.1 GPU mapping. "
"Check permissions on /sys/devices/virtual/kfd/kfd/topology/nodes"
)

if len(kfd_unique_ids) != len(kfd_renderDs):
raise RuntimeError(
f"Mismatch between unique_ids count ({len(kfd_unique_ids)}) "
Expand Down
9 changes: 5 additions & 4 deletions src/madengine/core/docker.py
Original file line number Diff line number Diff line change
Expand Up @@ -91,10 +91,11 @@ def __init__(
# add mounts
if mounts is not None:
for mount in mounts:
command += "-v " + mount + ":" + mount + " "
quoted_mount = shlex.quote(mount)
command += "-v " + quoted_mount + ":" + quoted_mount + " "

# add current working directory
command += "-v " + cwd + ":/myworkspace/ "
command += "-v " + shlex.quote(cwd) + ":/myworkspace/ "

# add envVars
_env_key_re = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
Expand All @@ -105,8 +106,8 @@ def __init__(
command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "

command += "--workdir /myworkspace/ "
command += "--name " + container_name + " "
command += image + " "
command += "--name " + shlex.quote(container_name) + " "
command += shlex.quote(image) + " "

# Use 'cat' to keep container alive (blocks waiting for stdin)
# Works reliably across all deployment types (local, k8s, slurm)
Expand Down
2 changes: 1 addition & 1 deletion src/madengine/deployment/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -239,7 +239,7 @@ def _monitor_until_complete(self, deployment_id: str) -> DeploymentResult:
while True:
status = self.monitor(deployment_id)

if status.status in [DeploymentStatus.SUCCESS, DeploymentStatus.FAILED, DeploymentStatus.UNKNOWN]:
if status.status in [DeploymentStatus.SUCCESS, DeploymentStatus.FAILED, DeploymentStatus.UNKNOWN, DeploymentStatus.CANCELLED]:
return status

# Still running, wait and check again
Expand Down
2 changes: 1 addition & 1 deletion src/madengine/deployment/presets/k8s/defaults.json
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
"ttl_seconds_after_finished": null,
"allow_privileged_profiling": null,
"nfs_storage_class": "nfs-banff",
"local_path_storage_class": "local-path",
"storage_class": "nfs-banff",
Comment thread
coketaste marked this conversation as resolved.
"data_storage_class": "nfs-banff",
"recreate_shared_data_pvc": false,
"secrets": {
Expand Down
10 changes: 5 additions & 5 deletions src/madengine/execution/container_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -558,16 +558,16 @@ def pull_image(
print(f"🔄 Using fresh pull policy for SLURM compute node (prevents cached layer corruption)")
# Remove any existing cached image to force fresh pull
try:
self.console.sh(f"docker rmi -f {registry_image} 2>/dev/null || true")
self.console.sh(f"docker rmi -f {shlex.quote(registry_image)} 2>/dev/null || true")
print(f"✓ Removed cached image layers")
except Exception:
pass # It's okay if image doesn't exist

try:
self.console.sh(f"docker pull {registry_image}")
self.console.sh(f"docker pull {shlex.quote(registry_image)}")

if local_name:
self.console.sh(f"docker tag {registry_image} {local_name}")
self.console.sh(f"docker tag {shlex.quote(registry_image)} {shlex.quote(local_name)}")
print(f"🏷️ Tagged as: {local_name}")
self.rich_console.print(f"[bold green]✅ Successfully pulled and tagged image[/bold green]")
self.rich_console.print(f"[dim]{'='*80}[/dim]")
Expand Down Expand Up @@ -689,7 +689,7 @@ def get_mount_arg(self, mount_datapaths: typing.List) -> str:
for mount_datapath in mount_datapaths:
if mount_datapath:
mount_args += (
f"-v {mount_datapath['path']}:{mount_datapath['home']}"
f"-v {shlex.quote(mount_datapath['path'])}:{shlex.quote(mount_datapath['home'])}"
)
if (
"readwrite" in mount_datapath
Expand All @@ -703,7 +703,7 @@ def get_mount_arg(self, mount_datapaths: typing.List) -> str:
if "docker_mounts" in self.context.ctx:
for mount_arg in self.context.ctx["docker_mounts"].keys():
mount_args += (
f"-v {self.context.ctx['docker_mounts'][mount_arg]}:{mount_arg} "
f"-v {shlex.quote(self.context.ctx['docker_mounts'][mount_arg])}:{shlex.quote(mount_arg)} "
)

return mount_args
Expand Down
Loading