
DeepEP, torch.compile and Fix Megatron Training Bug#646

Merged
FurtherAI merged 11 commits into main from austin/deepep_compile_and_trainability_main
Apr 10, 2026
Conversation

@FurtherAI
Collaborator

Summary: 1.15x faster Megatron training and it actually trains now.

DeepEP

DeepEP provides faster expert parallel (EP) communication. EP communication and the pre/post-processing work surrounding it take roughly as long as the actual expert MLP computation (at least on Qwen 3 30B A3B), so improvements here matter. DeepEP gives us a ~1.05x speedup, and the gap between DeepEP and Megatron's default path may grow in multi-node settings.

torch.compile

We add torch.compile to the model layers and disable compilation in regions that are not compatible. This gives a ~1.10x speedup on top of DeepEP. I did not test max-autotune or CUDA graphs here, just basic compilation.

Megatron Training

We noticed that Megatron failed to train in the simple yes-no-maybe example. This was caused by the parameter offload: Megatron expects parameter data tensors to stay constant, and our offload/reload created new tensors, so Megatron lost track of them during updates. We now use Megatron's offload API to do this properly.

We also remove the optimizer offload, since the optimizer is loaded from disk at the start of each job anyway.

Megatron Provider Options

We expose environment variables for controlling Megatron parallelism. We will refactor the configuration system at some point so these can be set more naturally; for now this is the minimal control plane.
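For illustration, a hypothetical run configuration using variable names that appear elsewhere in this PR (the values here are examples, not recommendations):

```shell
# Hypothetical env overrides; names come from this PR, values are illustrative.
export ART_DISABLE_MEGATRON_COMPILE=0             # keep torch.compile enabled
export ART_MEGATRON_MOE_SHARED_EXPERT_OVERLAP=false
export ART_MEGATRON_MOE_DEEPEP_NUM_SMS=20         # SMs reserved for DeepEP comm kernels
```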

@FurtherAI FurtherAI requested a review from Kovbo April 9, 2026 02:43
@Kovbo
Collaborator

Kovbo commented Apr 9, 2026

Nice, training works for me! My AI had these comments:

  1. _compile_enabled() is case-inconsistent (src/art/megatron/train.py:944-949)
    return os.environ.get("ART_DISABLE_MEGATRON_COMPILE", "0") in {"0", "false", "False"}
    Setting ART_DISABLE_MEGATRON_COMPILE=FALSE (uppercase) returns False, i.e. disables compile — the opposite of
    what the user intends. Compare with _env_flag in provider.py which lowercases properly. Use the same helper.

  2. Wasted optimizer build in _run_service_loop (src/art/megatron/train.py:1257)
    build_training_runtime builds the optimizer, then after_job() is called immediately on entry, which sets
    runtime.optimizer = None and discards it. The optimizer is then rebuilt per-job inside _load_lora_and_optimizer.
    The initial build is dead work — either skip it in build_training_runtime, or skip the initial after_job()
    call. Also del optimizer after runtime.optimizer = None is a no-op (just a local var).

  3. Fragile/misleading guard in _iter_megatron_param_buffers (src/art/megatron/offload.py:15-21)
    chunk_buffers = getattr(chunk, "buffers", None)
    if callable(chunk_buffers):
        raise RuntimeError("Megatron chunk is missing distributed param buffers")
    This relies on Megatron's DDP shadowing nn.Module.buffers (a method) with an instance-attribute list. If
    Megatron ever renames the attribute, this fires with a misleading error. Better: check isinstance(chunk,
    megatron.core.distributed.DistributedDataParallel) and read its actual attribute (buffers /
    param_and_grad_buffers) explicitly.

  4. Silent override of user setting in _apply_runtime_env_overrides (src/art/megatron/provider.py)
    After honoring ART_MEGATRON_MOE_SHARED_EXPERT_OVERLAP, the code unconditionally forces
    provider.moe_shared_expert_overlap = False whenever overlap_moe_expert_parallel_comm is on. If the user
    explicitly set both to true via env, their setting is silently dropped. Worth raising an error or warning
    instead of silently overriding.

  5. ART_MEGATRON_MOE_DEEPEP_NUM_SMS=none semantics (src/art/megatron/provider.py)
    If set to "none", _env_optional_int returns (True, None) so provider.moe_deepep_num_sms is left untouched, and
    the auto-default branch is also skipped (because the var is in os.environ). Net effect: "none" leaves whatever
    value the provider had. That's at minimum surprising vs the other env keys. Consider documenting or making
    "none" mean "use auto-default".

  6. local/backend.py drop-warning is per-tokenized-result, not per-trajectory
    The "Dropping ... from N trajectories" filter drops individual TokenizedResults longer than sequence_length. A
    single trajectory with one too-long sample loses just that sample (which can desync advantage stats for that
    group). Worth confirming that's intended — the prior truncate_long_results=True path silently handled this
    differently.
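The case-sensitivity fix in item 1 could be as simple as routing every flag through one lowercasing helper. A sketch (not the repo's actual `_env_flag`; the accepted truthy values are an assumption):

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Case-insensitive boolean env var parse, mirroring the shared helper
    the review suggests reusing."""
    return os.environ.get(name, default).strip().lower() in {"1", "true", "yes"}


os.environ["ART_DISABLE_MEGATRON_COMPILE"] = "FALSE"
compile_enabled = not env_flag("ART_DISABLE_MEGATRON_COMPILE")
# Uppercase "FALSE" now behaves the same as "false": compile stays enabled.
```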

@FurtherAI FurtherAI merged commit dec6b3a into main Apr 10, 2026
5 checks passed
@FurtherAI FurtherAI deleted the austin/deepep_compile_and_trainability_main branch April 10, 2026 08:46