Skip to content

refactor(v2-review): shell injection hardening, bug fixes, and test cleanup#122

Merged
coketaste merged 7 commits into
developfrom
coketaste/v2-quality-improve
May 19, 2026
Merged

refactor(v2-review): shell injection hardening, bug fixes, and test cleanup#122
coketaste merged 7 commits into
developfrom
coketaste/v2-quality-improve

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Summary

  • Harden all shell-interpolated Docker commands with shlex.quote() to prevent injection via image names, container names, mount paths, and context paths
  • Fix TypeError crash when KFD topology is inaccessible on ROCm < 6.4.1 systems (kfd_renderDs is None)
  • Fix infinite loop in deployment monitoring when a job is cancelled (CANCELLED missing from terminal status set)
  • Consolidate pytest config into pyproject.toml as single source of truth (delete redundant pytest.ini that was silently overriding it)
  • Clean up test infrastructure: remove sys.path hack, deduplicate marker registration, fix global error handler state leak

Changes

Security hardening (shlex.quote())

File What's quoted
core/docker.py container_name, image, mount paths, cwd in docker run
execution/docker_builder.py docker_image, dockerfile, docker_context in docker build
execution/container_runner.py registry_image, local_name in docker pull/tag/rmi; mount paths in get_mount_arg()
orchestration/run_orchestrator.py image_name in docker image inspect and docker pull

Bug fixes

  • core/context.py — Guard kfd_renderDs is None before len() call, with a clear error message pointing at KFD topology permissions
  • deployment/base.py — Add DeploymentStatus.CANCELLED to the terminal set in _monitor_until_complete() so cancelled jobs don't spin forever

Test & config cleanup

  • Delete pytest.ini — was overriding all pyproject.toml [tool.pytest.ini_options] settings silently
  • pyproject.toml — Fix python_paths typo → pythonpath, add --strict-markers, filterwarnings, minversion, and full marker list (10 markers)
  • tests/conftest.py — Remove sys.path insertion (now handled by pythonpath) and pytest_configure() marker duplication
  • tests/unit/test_error_handling.py — Add setup_method/teardown_method to reset global error handler between tests
  • tests/unit/test_shell_quoting.py — 11 new tests validating quoting across all hardened code paths

Test plan

  • pytest tests/unit/ -v — all 437 tests pass (verified locally)
  • Verify shell quoting tests cover malicious inputs (; rm -rf /, $(whoami)) and safe inputs
  • Confirm no regressions in build/run workflows with normal image names and paths

…onfig cleanup

- Harden shell commands with shlex.quote() in docker.py, container_runner.py,
  docker_builder.py, and run_orchestrator.py to prevent injection via
  user-controlled values (image names, paths, container names)
- Fix TypeError when kfd_renderDs is None on restricted ROCm < 6.4.1 systems
- Add CANCELLED to deployment monitor terminal states to prevent infinite loop
- Consolidate pytest config into pyproject.toml and delete redundant pytest.ini
- Remove sys.path hack and duplicate marker registration from conftest.py
- Fix global error handler state leak in test_error_handling.py
- Add test_shell_quoting.py with 11 tests validating quoting behavior

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@coketaste coketaste self-assigned this May 3, 2026
Copilot AI review requested due to automatic review settings May 3, 2026 21:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens several Docker-related shell commands against injection, fixes a couple of runtime bugs in GPU mapping and deployment monitoring, and consolidates pytest configuration into pyproject.toml while cleaning up test infrastructure.

Changes:

  • Add shlex.quote() escaping for user-controlled values interpolated into docker run/build/pull/tag/rmi/inspect commands across core execution paths.
  • Fix ROCm < 6.4.1 crash when KFD topology is inaccessible, and prevent infinite monitoring loops on cancelled deployments.
  • Consolidate pytest config into pyproject.toml, remove redundant marker registration / sys.path hacks, and add new unit tests for shell quoting.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/madengine/core/docker.py Quote container name, image, mounts, and CWD in docker run command construction.
src/madengine/execution/docker_builder.py Quote image/dockerfile/context in docker build command (but other shell interpolations remain).
src/madengine/execution/container_runner.py Quote image names in docker pull/tag/rmi and quote mount paths in mount args.
src/madengine/orchestration/run_orchestrator.py Quote image name in docker image inspect and docker pull.
src/madengine/core/context.py Raise a clear error when KFD topology is unavailable on ROCm < 6.4.1 instead of crashing.
src/madengine/deployment/base.py Treat CANCELLED as terminal to stop monitoring loops.
tests/unit/test_shell_quoting.py Add unit tests validating quoting behavior across hardened command paths.
tests/unit/test_error_handling.py Reset global error handler between tests to prevent state leakage.
tests/conftest.py Remove sys.path insertion and redundant marker registration.
pyproject.toml Make pytest configuration the source of truth, enable --strict-markers, and define markers/warnings.
pytest.ini Remove redundant config file that could override pyproject.toml.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/unit/test_shell_quoting.py Outdated
Comment thread tests/unit/test_shell_quoting.py Outdated
Comment thread src/madengine/execution/docker_builder.py
coketaste and others added 2 commits May 11, 2026 15:34
…n up test imports

Add shlex.quote() to remaining unquoted shell interpolations in
docker_builder.py (grep, docker manifest inspect, docker tag, docker push,
head commands). Remove unused pytest and tempfile imports and dead temp
file block from test_shell_quoting.py.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 11, 2026 20:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread tests/unit/test_shell_quoting.py
Comment thread pyproject.toml Outdated
Copilot AI review requested due to automatic review settings May 13, 2026 13:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread src/madengine/deployment/presets/k8s/defaults.json
coketaste and others added 2 commits May 19, 2026 11:40
[tool.pytest.ini_options].minversion is the minimum pytest version,
not Python. The previous value "3.8" was the Python version. Use 7.0
to match the floor actually required by the config — pythonpath was
added in pytest 7.0.0.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 16:05
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.

CHANGELOG.md [2.0.3]:
- Added: K8s `storage_class` field + preset default change from
  `local_path_storage_class: "local-path"` to `storage_class: "nfs-banff"`
- Changed: pytest config consolidation into pyproject.toml; minversion
  corrected to 7.0; conftest.py cleanup
- Fixed: `kfd_renderDs is None` TypeError on restricted ROCm < 6.4.1;
  deployment monitor infinite loop on CANCELLED jobs
- Security: extended shell-injection hardening (shlex.quote) across
  docker.py, container_runner.py, docker_builder.py, run_orchestrator.py
- Tests: new test_shell_quoting.py (11 tests); test_error_handling.py
  global-state-leak fix

examples/k8s-configs/README.md:
- Update storage-class table to show new `storage_class` fallback chain
- Document 2.0.3 preset default change (`local-path` -> `nfs-banff` for
  single-node results PVC) with override snippet for clusters that still
  want local-path
- Mark `local_path_storage_class` as removed-from-preset-but-still-honoured

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@coketaste coketaste merged commit 9a3f839 into develop May 19, 2026
coketaste added a commit that referenced this pull request May 28, 2026
Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release
date (2026-05-26 — finalized upstream via #122) and graduating this
branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28,
since slurm_multi / --use-image / --build-on-compute are feature work.

Auto-merged from upstream:
- fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126)
- docs/wiki/index.html (wiki path rename, #129)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants