feat(moe): generalize v5 batched-expert remap to Qwen3/DeepSeek (Phase 3) by drunkcoding · Pull Request #110 · EfficientMoE/MoE-Infinity

drunkcoding · 2026-06-24T09:15:33Z

Summary

Running the migration under real transformers 5.3.0 (not just import-probing) revealed that v5's batched expert storage (mlp.experts.gate_up_proj [E, 2I, H]) is a general pattern, not Mixtral-only — Qwen3-MoE, DeepSeek-V2/V3, and OLMoE all use it. The Mixtral-only remap (#108/#109) didn't cover them, so those archs would silently break expert routing on v5 (parse_expert_id → (None, None)).

This generalizes _remap_mixtral_v5_experts → _remap_v5_batched_experts(state_dict, config), arch-aware:

| Arch | Output per-expert keys | Block path |
|---|---|---|
| Mixtral | `experts.{E}.w1/w3/w2.weight` | `.mlp` → `.block_sparse_moe` (+ gate moved) |
| Qwen3 / DeepSeek | `experts.{E}.gate_proj/up_proj/down_proj.weight` | keep `.mlp` |
| GPT-OSS | excluded (own batched + MXFP4 path) | n/a |

gate_up_proj[e].chunk(2, dim=0) → gate/up; down_proj[e] → down.
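As a rough illustration of the split described above, here is a minimal sketch of an arch-aware remap. It is not the code from this PR: the function name `remap_batched_experts_sketch`, the `arch` string argument (the real helper takes a `config`), the naming table, and the `model.layers.{L}` key prefix are all assumptions made for the example. It uses plain nested lists so it runs without torch; slicing the first axis in half stands in for `chunk(2, dim=0)`. The Mixtral router gate move is left out.

```python
import re

# Hypothetical naming table: (gate, up, down) per-expert names for each arch.
PER_EXPERT_NAMES = {
    "mixtral": ("w1", "w3", "w2"),
    "qwen3_moe": ("gate_proj", "up_proj", "down_proj"),
    "deepseek_v3": ("gate_proj", "up_proj", "down_proj"),
}

# Matches only the batched expert tensors, e.g. "...mlp.experts.gate_up_proj".
_BATCHED = re.compile(r"^(?P<prefix>.*)\.mlp\.experts\.(?P<kind>gate_up_proj|down_proj)$")


def remap_batched_experts_sketch(state_dict, arch):
    """Expand batched [E, ...] expert tensors into per-expert keys."""
    if arch not in PER_EXPERT_NAMES:
        # Archs outside the table (e.g. GPT-OSS) are returned unchanged.
        return dict(state_dict)
    gate_name, up_name, down_name = PER_EXPERT_NAMES[arch]
    block = "block_sparse_moe" if arch == "mixtral" else "mlp"
    out = {}
    for key, value in state_dict.items():
        m = _BATCHED.match(key)
        if m is None:
            out[key] = value  # non-expert and shared_experts keys pass through
            continue
        base = f"{m['prefix']}.{block}.experts"
        for e, w in enumerate(value):  # iterate over the expert axis E
            if m["kind"] == "gate_up_proj":
                half = len(w) // 2  # first I rows are gate, last I rows are up
                out[f"{base}.{e}.{gate_name}.weight"] = w[:half]
                out[f"{base}.{e}.{up_name}.weight"] = w[half:]
            else:
                out[f"{base}.{e}.{down_name}.weight"] = w
    return out


# Toy checkpoint: E=2 experts, 2I=4, H=2.
sd = {
    "model.layers.0.mlp.experts.gate_up_proj": [
        [[1, 2], [3, 4], [5, 6], [7, 8]],
        [[9, 10], [11, 12], [13, 14], [15, 16]],
    ],
    "model.layers.0.mlp.experts.down_proj": [
        [[1, 0], [0, 1]],
        [[2, 0], [0, 2]],
    ],
    "model.layers.0.mlp.shared_experts.gate_proj.weight": [[0.5]],
}
mixtral_out = remap_batched_experts_sketch(sd, "mixtral")
qwen_out = remap_batched_experts_sketch(sd, "qwen3_moe")
gptoss_out = remap_batched_experts_sketch(sd, "gpt_oss")
```

Because the sketch only slices and copies, the per-expert weights are bit-identical to the corresponding rows of the batched tensor, which matches the "remap math is exact" claim in the follow-up commit.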

Validation (under real transformers 5.3.0)

Built synthetic tiny models and ran the remap end-to-end:

- Mixtral: 0 unmatched, 0.0 numeric diff.
- Qwen3-MoE: 0 unmatched, 0.0 numeric diff.
- DeepSeek-V3: 0 unmatched, 0.0 numeric diff, shared_experts preserved (untouched).
- GPT-OSS: correctly skipped (keys unchanged).

Tests (8, `test_mixtral_v5_remap.py`)

Per-arch shape + naming, numeric equivalence (Mixtral & Qwen3), parse_expert_id integration (Mixtral & Qwen3), GPT-OSS skip, v4 no-op.
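To picture the parse_expert_id integration check, the sketch below uses a stand-in parser. The real `parse_expert_id` lives in the repository and is not shown in this PR; the regex, the `(layer, expert)` return order, and the name `parse_expert_id_sketch` are assumptions for illustration only. The point is that a parser looking for a numeric expert index finds none in a batched key, which is how a `(None, None)` result can arise.

```python
import re

# Hypothetical pattern: a layer index followed later by "experts.<E>.".
_PER_EXPERT = re.compile(r"layers\.(\d+)\..*experts\.(\d+)\.")


def parse_expert_id_sketch(key):
    """Return (layer_id, expert_id), or (None, None) if the key has no expert index."""
    m = _PER_EXPERT.search(key)
    if m is None:
        return (None, None)
    return (int(m.group(1)), int(m.group(2)))


# Batched v5 key: no numeric expert index, so nothing to parse.
batched = parse_expert_id_sketch("model.layers.3.mlp.experts.gate_up_proj")

# Per-expert keys as produced by the remap for Mixtral and Qwen3 / DeepSeek.
mixtral = parse_expert_id_sketch("model.layers.3.block_sparse_moe.experts.5.w1.weight")
qwen = parse_expert_id_sketch("model.layers.3.mlp.experts.5.gate_proj.weight")
```

Under these assumptions, an integration test would assert that every remapped expert key parses to a concrete pair while the original batched key does not.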

Full-suite verification

497 passed, 2 skipped on both transformers 4.57.6 AND 5.3.0 — fully backward-compatible.

Scope note

Still no transformers version pin bump. With this, all runtime-supported MoE archs (Mixtral, Qwen3, DeepSeek, GPT-OSS; NLLB uses a separate non-batched layout) handle v5 checkpoints. The pin bump + a real-checkpoint GPU golden-decode remain as the final gate (see .sisyphus/plans/transformers-v5-migration.md).

transformers v5 stores batched experts (mlp.experts.gate_up_proj) not just for Mixtral but for all MoE archs. Generalize _remap_mixtral_v5_experts -> _remap_v5_batched_experts(state_dict, config), arch-aware:

- Mixtral: block .mlp -> .block_sparse_moe, per-expert w1/w3/w2.weight.
- Qwen3 / DeepSeek: keep .mlp, per-expert gate_proj/up_proj/down_proj.weight.
- GPT-OSS excluded (own batched + MXFP4 path).

Verified under real transformers 5.3.0 on synthetic Mixtral, Qwen3-MoE, and DeepSeek-V3 (incl. shared_experts preserved): 0 unmatched parse_expert_id, 0.0 numeric diff vs the v5 expert forward. Full suite: 497 passed on BOTH transformers 4.57.6 and 5.3.0. Per-arch unit tests added.

The numeric-equivalence tests used unseeded random tensors with atol=1e-6, which is below float32 accumulation-order noise (worst ~3e-5 across seeds), causing a flaky failure on Python 3.12. Seed torch.manual_seed(0) for determinism and use atol=1e-4 (seeded worst-case ~7.6e-6). The remap math is exact (weight split/copy); only fused-vs-split accumulation order differs.
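The reasoning in this commit message (seed the inputs, then compare with a tolerance above accumulation-order noise) can be sketched without torch. The example below is an analogy, not the test from this PR: it uses the stdlib `random` module and Python float64 in place of `torch.manual_seed` and float32 tensors, and the tolerance `1e-9` is chosen for float64 rather than taken from the source. All function names are hypothetical.

```python
import math
import random


def make_inputs(seed, n=1024):
    """Seeded inputs: the same seed always yields the same vectors."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ws = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return xs, ws


def dot_forward(xs, ws):
    """Accumulate products left to right."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += x * w
    return acc


def dot_reversed(xs, ws):
    """Same products, accumulated right to left (a different rounding path)."""
    acc = 0.0
    for x, w in zip(reversed(xs), reversed(ws)):
        acc += x * w
    return acc


xs, ws = make_inputs(seed=0)
a = dot_forward(xs, ws)
b = dot_reversed(xs, ws)

# Compare with an absolute tolerance instead of ==; the two sums are
# mathematically equal but need not be bitwise equal in floating point.
close = math.isclose(a, b, rel_tol=0.0, abs_tol=1e-9)
```

Seeding makes any remaining difference reproducible from run to run, so a tolerance can be set once against the observed worst case instead of failing intermittently.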

…111) All v5 code-side migration is complete (#106-#110): dtype shim, PretrainedConfig import, and arch-aware batched-expert remap for Mixtral/Qwen3/DeepSeek. Validated under real transformers 5.3.0 and 5.12.1: full CPU suite 497 passed, all adapters import, remap exact across archs. Resolves the dependabot transformers alert (Trainer RCE, fixed in 5.0.0rc3+; not exploitable here since this is inference-only, but moves us onto a supported, security-patched line). Co-authored-by: drunkcoding <leyang.xue@ed.ac.uk>

drunkcoding added 2 commits June 24, 2026 09:15

drunkcoding merged commit b33d5fa into dev Jun 24, 2026
8 checks passed

drunkcoding mentioned this pull request Jun 24, 2026

build(deps): bump transformers to >=5.3.0,<6 (closes CVE-2026-1839) #111

Merged

drunkcoding merged 2 commits into dev from feat/transformers-v5-batched-experts-all
