
DeepEP, torch.compile and Fix Megatron Training Bug#646

Merged
FurtherAI merged 11 commits into main from austin/deepep_compile_and_trainability_main
Apr 10, 2026
Conversation

@FurtherAI
Collaborator

Summary: 1.15x faster Megatron training and it actually trains now.

DeepEP

DeepEP provides faster expert parallel (EP) communication. EP communication and the pre/post-processing work surrounding it take roughly as long as the actual expert MLP computation (at least on Qwen 3 30B A3B), so improvements here matter. DeepEP gives us a ~1.05x speedup, and the gap between DeepEP and Megatron's default path may grow in multi-node settings.

torch.compile

We add torch.compile to the model layers and disable compilation in regions that are not compatible. This gives a ~1.10x speedup on top of DeepEP. I did not test max-autotune or CUDA graphs here, just basic compilation.

Megatron Training

We noticed that Megatron failed to train in the simple yes-no-maybe example. This was caused by the parameter offload: Megatron expects parameter data tensors to stay constant, and our offload/reload created new tensors, so Megatron lost track of them during updates. We now use Megatron's offload API to do this properly.

We also remove the optimizer offload, since the optimizer is loaded from disk at the start of each job anyway.

Megatron Provider Options

We expose environment variables for controlling Megatron parallelism. We will refactor the configuration system at some point so these can be set more naturally; for now this is the minimal control plane.
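For illustration, a hypothetical run configuration using variable names that appear elsewhere in this PR (the values here are examples, not recommendations):

```shell
# Hypothetical env overrides; names come from this PR, values are illustrative.
export ART_DISABLE_MEGATRON_COMPILE=0             # keep torch.compile enabled
export ART_MEGATRON_MOE_SHARED_EXPERT_OVERLAP=false
export ART_MEGATRON_MOE_DEEPEP_NUM_SMS=20         # SMs reserved for DeepEP comm kernels
```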

@FurtherAI FurtherAI requested a review from Kovbo April 9, 2026 02:43
@Kovbo
Collaborator

Kovbo commented Apr 9, 2026

Nice, training works for me! My AI had these comments:

  1. _compile_enabled() is case-inconsistent (src/art/megatron/train.py:944-949)
    return os.environ.get("ART_DISABLE_MEGATRON_COMPILE", "0") in {"0", "false", "False"}
    Setting ART_DISABLE_MEGATRON_COMPILE=FALSE (uppercase) returns False, i.e. disables compile — the opposite of
    what the user intends. Compare with _env_flag in provider.py which lowercases properly. Use the same helper.

  2. Wasted optimizer build in _run_service_loop (src/art/megatron/train.py:1257)
    build_training_runtime builds the optimizer, then after_job() is called immediately on entry, which sets
    runtime.optimizer = None and discards it. The optimizer is then rebuilt per-job inside _load_lora_and_optimizer.
    The initial build is dead work — either skip it in build_training_runtime, or skip the initial after_job()
    call. Also del optimizer after runtime.optimizer = None is a no-op (just a local var).

  3. Fragile/misleading guard in _iter_megatron_param_buffers (src/art/megatron/offload.py:15-21)
    chunk_buffers = getattr(chunk, "buffers", None)
    if callable(chunk_buffers):
        raise RuntimeError("Megatron chunk is missing distributed param buffers")
    This relies on Megatron's DDP shadowing nn.Module.buffers (a method) with an instance-attribute list. If
    Megatron ever renames the attribute, this fires with a misleading error. Better: check isinstance(chunk,
    megatron.core.distributed.DistributedDataParallel) and read its actual attribute (buffers /
    param_and_grad_buffers) explicitly.

  4. Silent override of user setting in _apply_runtime_env_overrides (src/art/megatron/provider.py)
    After honoring ART_MEGATRON_MOE_SHARED_EXPERT_OVERLAP, the code unconditionally forces
    provider.moe_shared_expert_overlap = False whenever overlap_moe_expert_parallel_comm is on. If the user
    explicitly set both to true via env, their setting is silently dropped. Worth raising an error or warning
    instead of silently overriding.

  5. ART_MEGATRON_MOE_DEEPEP_NUM_SMS=none semantics (src/art/megatron/provider.py)
    If set to "none", _env_optional_int returns (True, None) so provider.moe_deepep_num_sms is left untouched, and
    the auto-default branch is also skipped (because the var is in os.environ). Net effect: "none" leaves whatever
    value the provider had. That's at minimum surprising vs the other env keys. Consider documenting or making
    "none" mean "use auto-default".

  6. local/backend.py drop-warning is per-tokenized-result, not per-trajectory
    The "Dropping ... from N trajectories" filter drops individual TokenizedResults longer than sequence_length. A
    single trajectory with one too-long sample loses just that sample (which can desync advantage stats for that
    group). Worth confirming that's intended — the prior truncate_long_results=True path silently handled this
    differently.
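The case-sensitivity fix in item 1 could be as simple as routing every flag through one lowercasing helper. A sketch (not the repo's actual `_env_flag`; the accepted truthy values are an assumption):

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Case-insensitive boolean env var parse, mirroring the shared helper
    the review suggests reusing."""
    return os.environ.get(name, default).strip().lower() in {"1", "true", "yes"}


os.environ["ART_DISABLE_MEGATRON_COMPILE"] = "FALSE"
compile_enabled = not env_flag("ART_DISABLE_MEGATRON_COMPILE")
# Uppercase "FALSE" now behaves the same as "false": compile stays enabled.
```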

@FurtherAI FurtherAI merged commit dec6b3a into main Apr 10, 2026
5 checks passed
@FurtherAI FurtherAI deleted the austin/deepep_compile_and_trainability_main branch April 10, 2026 08:46