feat(moe): generalize v5 batched-expert remap to Qwen3/DeepSeek (Phase 3)#110
Merged
added 2 commits
June 24, 2026 09:15
transformers v5 stores batched experts (mlp.experts.gate_up_proj) for all MoE archs, not just Mixtral. Generalize _remap_mixtral_v5_experts -> _remap_v5_batched_experts(state_dict, config), arch-aware:
- Mixtral: block .mlp -> .block_sparse_moe, per-expert w1/w3/w2.weight.
- Qwen3 / DeepSeek: keep .mlp, per-expert gate_proj/up_proj/down_proj.weight.
- GPT-OSS excluded (own batched + MXFP4 path).

Verified under real transformers 5.3.0 on synthetic Mixtral, Qwen3-MoE, and DeepSeek-V3 (incl. shared_experts preserved): 0 unmatched parse_expert_id, 0.0 numeric diff vs the v5 expert forward. Full suite: 497 passed on BOTH transformers 4.57.6 and 5.3.0. Per-arch unit tests added.
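The arch-aware remap described in this commit can be sketched roughly as follows. This is an illustration only, not the code in the PR: `remap_batched_experts`, the `model_type` strings, and the blanket `.mlp.` rename for Mixtral are assumptions, and NumPy arrays stand in for torch tensors (`np.split(t, 2, axis=0)` plays the role of `t.chunk(2, dim=0)`).

```python
import numpy as np

# Hypothetical per-arch naming tables (gate, up, down); the real remap may differ.
MIXTRAL_NAMES = ("w1", "w3", "w2")
DEFAULT_NAMES = ("gate_proj", "up_proj", "down_proj")  # Qwen3 / DeepSeek


def remap_batched_experts(state_dict, model_type):
    """Split batched [E, ...] expert tensors into per-expert weights (sketch)."""
    if model_type == "gpt_oss":
        return dict(state_dict)  # excluded: GPT-OSS has its own batched path
    is_mixtral = model_type == "mixtral"
    gate_n, up_n, down_n = MIXTRAL_NAMES if is_mixtral else DEFAULT_NAMES
    out = {}
    for key, tensor in state_dict.items():
        # Assumption: for Mixtral every ".mlp." block key moves to ".block_sparse_moe.".
        new_key = key.replace(".mlp.", ".block_sparse_moe.") if is_mixtral else key
        if key.endswith(".mlp.experts.gate_up_proj"):
            prefix = new_key[: -len("gate_up_proj")]
            for e in range(tensor.shape[0]):
                # Fused [2I, H] weight: first half is gate, second half is up.
                gate, up = np.split(tensor[e], 2, axis=0)
                out[f"{prefix}{e}.{gate_n}.weight"] = gate
                out[f"{prefix}{e}.{up_n}.weight"] = up
        elif key.endswith(".mlp.experts.down_proj"):
            prefix = new_key[: -len("down_proj")]
            for e in range(tensor.shape[0]):
                out[f"{prefix}{e}.{down_n}.weight"] = tensor[e]
        else:
            out[new_key] = tensor  # non-expert keys pass through
    return out


# Tiny synthetic state dict: E experts, intermediate size I, hidden size H.
E, I, H = 2, 3, 4
sd = {
    "model.layers.0.mlp.experts.gate_up_proj":
        np.arange(E * 2 * I * H, dtype=np.float32).reshape(E, 2 * I, H),
    "model.layers.0.mlp.experts.down_proj": np.zeros((E, H, I), dtype=np.float32),
}
qwen = remap_batched_experts(sd, "qwen3_moe")
mixtral = remap_batched_experts(sd, "mixtral")
gpt_oss = remap_batched_experts(sd, "gpt_oss")
```

Because the remap only slices and copies weights, a state dict that already uses per-expert keys passes through this sketch unchanged, which mirrors the v4 no-op case the tests cover.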
The numeric-equivalence tests used unseeded random tensors with atol=1e-6, which is below float32 accumulation-order noise (worst ~3e-5 across seeds), causing a flaky failure on Python 3.12. Seed torch.manual_seed(0) for determinism and use atol=1e-4 (seeded worst-case ~7.6e-6). The remap math is exact (weight split/copy); only fused-vs-split accumulation order differs.
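A minimal sketch of the seeded comparison described above, assuming a SiLU-gated expert MLP and using NumPy in place of torch (so `np.random.default_rng(0)` stands in for `torch.manual_seed(0)`). The function names and tensor sizes are illustrative, not taken from the test file.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def expert_forward_fused(x, gate_up, down):
    # One matmul over the fused [2I, H] weight, then split the activations.
    gate_act, up_act = np.split(x @ gate_up.T, 2, axis=-1)
    return (silu(gate_act) * up_act) @ down.T


def expert_forward_split(x, gate, up, down):
    # Two separate matmuls over the remapped per-expert weights.
    return (silu(x @ gate.T) * (x @ up.T)) @ down.T


rng = np.random.default_rng(0)  # fixed seed makes the comparison reproducible
H, I, T = 16, 32, 8             # hidden size, intermediate size, tokens
x = rng.standard_normal((T, H)).astype(np.float32)
gate_up = (0.05 * rng.standard_normal((2 * I, H))).astype(np.float32)
down = (0.05 * rng.standard_normal((H, I))).astype(np.float32)
gate, up = np.split(gate_up, 2, axis=0)  # exact copy of the fused halves

fused = expert_forward_fused(x, gate_up, down)
split = expert_forward_split(x, gate, up, down)
max_diff = float(np.max(np.abs(fused - split)))

# The weights are identical; only float32 accumulation order can differ,
# so the tolerance is loose enough to absorb that noise.
assert np.allclose(fused, split, atol=1e-4)
```

The weights on both paths are bit-identical, so any nonzero `max_diff` comes from the matmul kernels summing in a different order, which is why the tolerance rather than the remap is the thing to tune.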
drunkcoding added a commit that referenced this pull request on Jun 24, 2026
…111) All v5 code-side migration is complete (#106-#110): dtype shim, PretrainedConfig import, and arch-aware batched-expert remap for Mixtral/Qwen3/DeepSeek. Validated under real transformers 5.3.0 and 5.12.1: full CPU suite 497 passed, all adapters import, remap exact across archs. Resolves the dependabot transformers alert (Trainer RCE, fixed in 5.0.0rc3+; not exploitable here since this is inference-only, but moves us onto a supported, security-patched line). Co-authored-by: drunkcoding <leyang.xue@ed.ac.uk>
Summary
Running the migration under real transformers 5.3.0 (not just import-probing) revealed that v5's batched expert storage (mlp.experts.gate_up_proj, shape [E, 2I, H]) is a general pattern, not Mixtral-only: Qwen3-MoE, DeepSeek-V2/V3, and OLMoE all use it. The Mixtral-only remap (#108/#109) didn't cover them, so those archs would silently break expert routing on v5 (parse_expert_id → (None, None)).

This generalizes _remap_mixtral_v5_experts → _remap_v5_batched_experts(state_dict, config), arch-aware:

| Arch | Per-expert keys | Block |
| --- | --- | --- |
| Mixtral | experts.{E}.w1/w3/w2.weight | .mlp → .block_sparse_moe (+ gate moved) |
| Qwen3 / DeepSeek | experts.{E}.gate_proj/up_proj/down_proj.weight | .mlp (kept) |

gate_up_proj[e].chunk(2, dim=0) → gate/up; down_proj[e] → down.

Validation (under real transformers 5.3.0)
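Built synthetic tiny models (Mixtral, Qwen3-MoE, and DeepSeek-V3 with shared_experts preserved) and ran the remap end-to-end: 0 unmatched parse_expert_id keys and 0.0 numeric diff vs the v5 expert forward.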
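The "0 unmatched" check can be pictured with a small key parser. The repository's real parse_expert_id is not shown on this page, so `parse_expert_id_sketch` and its regex below are hypothetical stand-ins; only the (None, None) result for unmatched keys is taken from the PR text.

```python
import re

# Hypothetical pattern: a layer index, an MoE block name, then a numbered expert.
_EXPERT_KEY = re.compile(r"\.layers\.(\d+)\.(?:mlp|block_sparse_moe)\.experts\.(\d+)\.")


def parse_expert_id_sketch(key):
    """Return (layer, expert) for a per-expert weight key, else (None, None)."""
    match = _EXPERT_KEY.search(key)
    if match is None:
        return (None, None)
    return (int(match.group(1)), int(match.group(2)))


# Remapped per-expert keys parse cleanly for both naming schemes.
qwen_id = parse_expert_id_sketch("model.layers.3.mlp.experts.7.gate_proj.weight")
mixtral_id = parse_expert_id_sketch("model.layers.0.block_sparse_moe.experts.1.w2.weight")

# An un-remapped v5 batched key has no expert index, so it does not match.
batched_id = parse_expert_id_sketch("model.layers.3.mlp.experts.gate_up_proj")

# Shared experts are not routed experts and are expected to stay unmatched.
shared_id = parse_expert_id_sketch("model.layers.3.mlp.shared_experts.gate_proj.weight")
```

Under this sketch, an un-remapped batched key is exactly the kind of input that would fall through to (None, None), which is the silent routing failure the remap is meant to prevent.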
Tests (8, test_mixtral_v5_remap.py)
Per-arch shape + naming, numeric equivalence (Mixtral & Qwen3), parse_expert_id integration (Mixtral & Qwen3), GPT-OSS skip, v4 no-op.

Full-suite verification
497 passed on both transformers 4.57.6 and 5.3.0.
Scope note
Still no transformers version pin bump. With this, all runtime-supported MoE archs (Mixtral, Qwen3, DeepSeek, GPT-OSS; NLLB uses a separate non-batched layout) handle v5 checkpoints. The pin bump + a real-checkpoint GPU golden-decode remain as the final gate (see .sisyphus/plans/transformers-v5-migration.md).