fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment#126

Merged
coketaste merged 5 commits into develop from coketaste/fix-local-launcher
May 27, 2026

Conversation

@coketaste
Collaborator

Summary

  • Docker local mode had no mechanism to set MAD_MULTI_NODE_RUNNER, unlike SLURM and K8s which generate it in their deployment layers. This caused distributed training scripts (e.g.
    Megatron-LM train_7b.sh) to fail with Error: MULTI_NODE_RUNNER is not defined on local runs.
  • Adds _generate_local_launcher_command() to ContainerRunner and calls it in run_container() after GPU resolution, generating the appropriate single-node launcher command
    (torchrun, deepspeed, etc.) and injecting it as the MAD_MULTI_NODE_RUNNER env var.
  • No-op for models that don't use MAD_MULTI_NODE_RUNNER — the env var is simply unused.

Details

SLURM sets MAD_MULTI_NODE_RUNNER in job.sh.j2 and K8s sets it in kubernetes_launcher_mixin.py. Docker local had no equivalent, so any model script referencing
$MAD_MULTI_NODE_RUNNER would fail.

The fix reuses the already-resolved launcher variable in run_container() and maps it to the correct single-node command:

  • torchrun/megatron/torchtitan → torchrun --standalone --nproc_per_node=N
  • deepspeed → deepspeed --num_gpus=N
  • vllm/sglang/primus → empty (these manage their own process spawning)
  • Defaults to torchrun when launcher is a deployment-level value like "docker" or "native"

Only sets the env var if the user hasn't already provided it via docker_env_vars.
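
The mapping and the "don't override the user" guard described above can be sketched roughly as follows. This is an illustrative standalone version, not the code merged in this PR: the function names, the `nproc` parameter, and the module-level sets are hypothetical, and two behaviours are assumptions (any unrecognized launcher value falls back to torchrun, and an empty command is still injected for vllm/sglang/primus).

```python
from typing import Dict

# Hypothetical groupings; the real logic lives in ContainerRunner.
SELF_SPAWNING = {"vllm", "sglang", "primus"}  # manage their own processes


def generate_local_launcher_command(launcher: str, nproc: int) -> str:
    """Map a launcher name to a single-node launcher command string."""
    name = (launcher or "").strip().lower()
    if name in SELF_SPAWNING:
        return ""
    if name == "deepspeed":
        return f"deepspeed --num_gpus={nproc}"
    # torchrun/megatron/torchtitan, plus deployment-level values such as
    # "docker" or "native". Assumption: every other value also lands here.
    return f"torchrun --standalone --nproc_per_node={nproc}"


def inject_runner(docker_env_vars: Dict[str, str], launcher: str, nproc: int) -> None:
    """Set MAD_MULTI_NODE_RUNNER unless the user already provided it."""
    if "MAD_MULTI_NODE_RUNNER" in docker_env_vars:
        return  # a user-supplied value takes precedence
    # Assumption: an empty command is injected as an empty string.
    docker_env_vars["MAD_MULTI_NODE_RUNNER"] = generate_local_launcher_command(
        launcher, nproc
    )
```

With this shape, a model script that never reads `$MAD_MULTI_NODE_RUNNER` is unaffected, since the variable is only added to the container environment and nothing else changes.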

Test plan

  • Run a model that uses $MAD_MULTI_NODE_RUNNER (e.g. Megatron-LM train_7b) locally — verify it no longer fails with MULTI_NODE_RUNNER is not defined
  • Run a model that doesn't use $MAD_MULTI_NODE_RUNNER (e.g. HuggingFace script) — verify no regression
  • Verify SLURM/K8s paths are unaffected (change is scoped to ContainerRunner.run_container())

Docker local mode had no mechanism to set MAD_MULTI_NODE_RUNNER, unlike
SLURM (job.sh.j2) and K8s (kubernetes_launcher_mixin.py) which generate
it in their deployment layers. This caused train_7b.sh (Megatron-LM) to
fail with 'Error: MULTI_NODE_RUNNER is not defined' on local runs.

Add _generate_local_launcher_command() to ContainerRunner that generates
the appropriate single-node distributed process launcher command, and call
it in run_container() after GPU resolution. Reuses the already-resolved
launcher variable (lines 327-372) to stay consistent with existing launcher
parsing conventions. Defaults to torchrun for backward compatibility.
Supports torchrun, megatron-lm, torchtitan, deepspeed, vllm, sglang, and
primus launchers. The env var is simply unused for models that hardcode
their own launcher (e.g. HuggingFace scripts).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@coketaste coketaste self-assigned this May 19, 2026
Copilot AI review requested due to automatic review settings May 19, 2026 22:38

Copilot AI left a comment

Pull request overview

This PR aims to align Docker local execution with SLURM/Kubernetes by generating and injecting MAD_MULTI_NODE_RUNNER for single-node distributed runs, preventing model scripts that rely on $MAD_MULTI_NODE_RUNNER from failing in local Docker mode.

Changes:

  • Add ContainerRunner._generate_local_launcher_command() to map launcher types (torchrun/deepspeed/etc.) to a single-node launcher command.
  • In ContainerRunner.run_container(), generate MAD_MULTI_NODE_RUNNER after GPU resolution and inject it into docker_env_vars if the user didn’t already provide it.

Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
coketaste and others added 2 commits May 22, 2026 15:03
The MAD_MULTI_NODE_RUNNER generation block in run_container() referenced
`launcher` from create_run_details_dict(), a different method's local
scope. Resolve the launcher inline using the same priority chain
(additional_context → model_info → MAD_LAUNCHER env).
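
A minimal sketch of that priority chain, for illustration only. The function name and the `"launcher"` dictionary key are assumptions, not taken from the PR; only the ordering (additional_context, then model_info, then the MAD_LAUNCHER environment variable) comes from the commit message.

```python
import os
from typing import Any, Mapping, Optional


def resolve_launcher(
    additional_context: Mapping[str, Any],
    model_info: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty launcher value in priority order."""
    env = os.environ if environ is None else environ
    candidates = (
        additional_context.get("launcher"),  # hypothetical key name
        model_info.get("launcher"),          # hypothetical key name
        env.get("MAD_LAUNCHER"),
    )
    for candidate in candidates:
        if candidate:
            return str(candidate)
    return None  # caller decides the default (torchrun, per the PR text)
```

Resolving the value inside the method that uses it avoids depending on a local variable from another method's scope, which is the bug this commit describes.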

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 22, 2026 19:49

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Copilot AI review requested due to automatic review settings May 27, 2026 00:13
@coketaste coketaste merged commit 8d86f45 into develop May 27, 2026
1 check failed
@coketaste coketaste review requested due to automatic review settings May 27, 2026 00:34
coketaste added a commit that referenced this pull request May 28, 2026
Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release
date (2026-05-26 — finalized upstream via #122) and graduating this
branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28,
since slurm_multi / --use-image / --build-on-compute are feature work.

Auto-merged from upstream:
- fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126)
- docs/wiki/index.html (wiki path rename, #129)

2 participants