feat(moe): generalize v5 batched-expert remap to Qwen3/DeepSeek (Phase 3) #110

Merged: drunkcoding merged 2 commits into dev from feat/transformers-v5-batched-experts-all on Jun 24, 2026

@drunkcoding


Summary

Running the migration under real transformers 5.3.0 (not just import-probing) revealed that v5's batched expert storage (mlp.experts.gate_up_proj [E, 2I, H]) is a general pattern, not Mixtral-only: Qwen3-MoE, DeepSeek-V2/V3, and OLMoE all use it. The Mixtral-only remap (#108/#109) didn't cover them, so those archs would silently break expert routing on v5 (parse_expert_id → (None, None)).

This generalizes _remap_mixtral_v5_experts → _remap_v5_batched_experts(state_dict, config), arch-aware:

| Arch | Output per-expert keys | Block path |
|---|---|---|
| Mixtral | experts.{E}.w1/w3/w2.weight | .mlp → .block_sparse_moe (+ gate moved) |
| Qwen3 / DeepSeek | experts.{E}.gate_proj/up_proj/down_proj.weight | keep .mlp |
| GPT-OSS | excluded (own batched + MXFP4 path) | |

gate_up_proj[e].chunk(2, dim=0) → gate/up; down_proj[e] → down.
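
The split above can be sketched as follows. This is a minimal illustration, not the PR's `_remap_v5_batched_experts`: the function name `remap_batched_experts`, the `ARCH_LAYOUT` table, the `arch` string argument (the real function takes a `config`), and the full key prefixes such as `model.layers.0.` are all assumptions. Nested lists stand in for tensors so the sketch runs without PyTorch; on a real tensor, `fused[:half]` and `fused[half:]` select the same rows as `gate_up_proj[e].chunk(2, dim=0)`.

```python
# Hypothetical per-arch layout: (gate, up, down) weight names and target block path.
ARCH_LAYOUT = {
    "mixtral": (("w1", "w3", "w2"), ".block_sparse_moe."),
    "qwen3_moe": (("gate_proj", "up_proj", "down_proj"), ".mlp."),
}


def remap_batched_experts(state_dict, arch):
    """Expand batched expert entries into per-expert keys (illustrative sketch)."""
    if arch not in ARCH_LAYOUT:
        # Archs without an entry (e.g. GPT-OSS) are skipped: keys stay unchanged.
        return dict(state_dict)
    (gate_name, up_name, down_name), block = ARCH_LAYOUT[arch]
    out = {}
    for key, value in state_dict.items():
        # Move the block path if the arch needs it (this also moves the router gate).
        new_key = key.replace(".mlp.", block)
        if key.endswith(".mlp.experts.gate_up_proj"):
            base = new_key[: -len("gate_up_proj")]
            for e, fused in enumerate(value):  # fused: [2I, H]
                half = len(fused) // 2
                out[f"{base}{e}.{gate_name}.weight"] = fused[:half]  # first I rows
                out[f"{base}{e}.{up_name}.weight"] = fused[half:]    # last I rows
        elif key.endswith(".mlp.experts.down_proj"):
            base = new_key[: -len("down_proj")]
            for e, down in enumerate(value):
                out[f"{base}{e}.{down_name}.weight"] = down
        else:
            out[new_key] = value
    return out


# Toy v5-style state dict with E=2 experts, I=1, H=2 (assumed key names).
v5_state = {
    "model.layers.0.mlp.gate.weight": [[0.5, 0.5]],
    "model.layers.0.mlp.experts.gate_up_proj": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    "model.layers.0.mlp.experts.down_proj": [[[9], [10]], [[11], [12]]],
}
mixtral = remap_batched_experts(v5_state, "mixtral")
qwen3 = remap_batched_experts(v5_state, "qwen3_moe")
```

A state dict with no batched keys passes through with its values untouched, which mirrors the v4 no-op case listed under Tests.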

Validation (under real transformers 5.3.0)

Built synthetic tiny models and ran the remap end-to-end:

  • Mixtral: 0 unmatched, 0.0 numeric diff.
  • Qwen3-MoE: 0 unmatched, 0.0 numeric diff.
  • DeepSeek-V3: 0 unmatched, 0.0 numeric diff, shared_experts preserved (untouched).
  • GPT-OSS: correctly skipped (keys unchanged).

Tests (8, test_mixtral_v5_remap.py)

Per-arch shape + naming, numeric equivalence (Mixtral & Qwen3), parse_expert_id integration (Mixtral & Qwen3), GPT-OSS skip, v4 no-op.

Full-suite verification

  • 497 passed, 2 skipped on both transformers 4.57.6 AND 5.3.0 — fully backward-compatible.

Scope note

Still no transformers version pin bump. With this, all runtime-supported MoE archs (Mixtral, Qwen3, DeepSeek, GPT-OSS; NLLB uses a separate non-batched layout) handle v5 checkpoints. The pin bump + a real-checkpoint GPU golden-decode remain as the final gate (see .sisyphus/plans/transformers-v5-migration.md).

drunkcoding added 2 commits June 24, 2026 09:15
transformers v5 stores batched experts (mlp.experts.gate_up_proj) not just
for Mixtral but for all MoE archs. Generalize _remap_mixtral_v5_experts ->
_remap_v5_batched_experts(state_dict, config), arch-aware:
- Mixtral: block .mlp -> .block_sparse_moe, per-expert w1/w3/w2.weight.
- Qwen3 / DeepSeek: keep .mlp, per-expert gate_proj/up_proj/down_proj.weight.
- GPT-OSS excluded (own batched + MXFP4 path).

Verified under real transformers 5.3.0 on synthetic Mixtral, Qwen3-MoE, and
DeepSeek-V3 (incl. shared_experts preserved): 0 unmatched parse_expert_id,
0.0 numeric diff vs the v5 expert forward. Full suite: 497 passed on BOTH
transformers 4.57.6 and 5.3.0. Per-arch unit tests added.

The numeric-equivalence tests used unseeded random tensors with atol=1e-6,
which is below float32 accumulation-order noise (worst ~3e-5 across seeds),
causing a flaky failure on Python 3.12. Seed torch.manual_seed(0) for
determinism and use atol=1e-4 (seeded worst-case ~7.6e-6). The remap math
is exact (weight split/copy); only fused-vs-split accumulation order differs.
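
The tolerance reasoning in the commit message above can be illustrated with a small standard-library sketch. It is an analogy under stated assumptions, not the PR's test: `fused_dot` and `split_dot` are hypothetical stand-ins for the fused and split expert forwards, `random.seed(0)` plays the role of `torch.manual_seed(0)`, and Python floats are float64, so the reordering noise here is far smaller than the float32 figures quoted above.

```python
import math
import random

# Reassociating a floating-point sum can change the last bits of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)


def fused_dot(weights, x):
    """Accumulate every product in a single pass (stand-in for the fused path)."""
    total = 0.0
    for w, v in zip(weights, x):
        total += w * v
    return total


def split_dot(weights, x):
    """Accumulate each half separately, then add (stand-in for the split path)."""
    half = len(weights) // 2
    return fused_dot(weights[:half], x[:half]) + fused_dot(weights[half:], x[half:])


random.seed(0)  # fixed seed: every run compares the same inputs
weights = [random.uniform(-1.0, 1.0) for _ in range(1024)]
x = [random.uniform(-1.0, 1.0) for _ in range(1024)]

fused = fused_dot(weights, x)
split = split_dot(weights, x)

# Compare with an absolute tolerance instead of exact equality, since the two
# paths add the same terms in a different order.
results_match = math.isclose(fused, split, abs_tol=1e-4)
```

Seeding removes run-to-run variation in the inputs, and the tolerance absorbs the order-of-accumulation difference; neither changes the fact that the weights themselves are copied exactly.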
drunkcoding merged commit b33d5fa into dev on Jun 24, 2026
8 checks passed
drunkcoding added a commit that referenced this pull request Jun 24, 2026
…111)

All v5 code-side migration is complete (#106-#110): dtype shim,
PretrainedConfig import, and arch-aware batched-expert remap for
Mixtral/Qwen3/DeepSeek. Validated under real transformers 5.3.0 and 5.12.1:
full CPU suite 497 passed, all adapters import, remap exact across archs.

Resolves the dependabot transformers alert (Trainer RCE, fixed in 5.0.0rc3+;
not exploitable here since this is inference-only, but moves us onto a
supported, security-patched line).

Co-authored-by: drunkcoding <leyang.xue@ed.ac.uk>