What's Changed
- Add preliminary code by @From00 in #1
- add spec utils by @FeixLiu in #2
- add enums.py and identity_op.py by @GuoxiaWang in #3
- add vpp_simulator by @Waynezee in #5
- PaddleFleet distributed initialization and ProcessGroup create. by @Hz188 in #8
- add codestyle workflow by @swgu98 in #6
- Complete parallel_state.py by @Hz188 in #12
- Trans transformer block/layer by @FeixLiu in #11
- [CodeStyle] Ignore `PLC0414` in `__init__.py` files by @SigureMo in #13
- Complete process_groups_config.py doc and fix typo by @Hz188 in #15
- [setup] Support source installation of Paddle-Fleet by @risemeup1 in #14
- improve parallel_state.py for EPHcg & Hcg by @Hz188 in #16
- [CI] Add approval workflow by @swgu98 in #10
- support pipeline_parallel schedules by @AlAuAu in #9
- dev global vars, yaml parse by @Hz188 in #18
- [CI] fix approval workflow by @swgu98 in #20
- Add Attention and RoPE classes by @lshpku in #17
- add mtp by @FeixLiu in #22
- unittest test_schedules bugfix by @AlAuAu in #24
- add gpt_model specs by @GuoxiaWang in #23
- add mlp layer by @blacksheep-Aristotle in #26
- Trans ln by @FeixLiu in #25
- Add PackedSeqParams class by @lshpku in #30
- add psp by @FeixLiu in #31
- Update LanguageModelEmbedding and add unittest by @lshpku in #27
- Update RoPE and add unittest by @lshpku in #28
- [CI] add test workflow by @swgu98 in #19
- add pipeline/utils.py by @blacksheep-Aristotle in #33
- [CI] change ut dir by @swgu98 in #34
- [CI] git name by @swgu98 in #35
- Placeholder for building transformer configuration parsing by @Hz188 in #32
- [CI] Add typos white list by @swgu98 in #36
- change sublayer to sublayer_spec by @FeixLiu in #39
- fix some test_gpt_model dependencies by @GuoxiaWang in #40
- Add global Timers for logging by @huangjiyi in #21
- add check_initialized for dp group by @Hz188 in #41
- add coverage scripts by @XieYunshen in #38
- fix coverage scripts by @XieYunshen in #45
- First successful run of GPTModel model definition by @GuoxiaWang in #44
- Change fleet core to paddlefleet by @From00 in #46
- fix coverage bug by @risemeup1 in #47
- Some fixes to successful run glm4.5 in PaddleFormers by @From00 in #49
- single card test by @swgu98 in #50
- mv paddlefleet to src by @risemeup1 in #52
- use paddle12.6 in single card test by @risemeup1 in #53
- Add GPTModelEstimator by @huangjiyi in #59
- Fix attention dim order by @lshpku in #60
- fix ffn_hidden_size is None when init gpt_mlp by @blacksheep-Aristotle in #56
- Add estimate_mfu by @huangjiyi in #62
- refine pyproject.toml by @risemeup1 in #63
- [CI] add uv pre-commit by @swgu98 in #65
- [CI] Add nemo megatron approval by @swgu98 in #51
- Add set_logging and get_logger by @huangjiyi in #64
- bugfix init GPTModel by @GuoxiaWang in #54
- [CodeStyle][Ruff] update ruff target-version to `py310` by @ooooo-create in #66
- Support custom op by @zhangbo9674 in #48
- 【lora】fix model for Lora by @xiaoguoguo626807 in #68
- Ci/multi card config by @swgu98 in #67
- [CI] add approval for model_parallel_config.py & transformer_config.py by @swgu98 in #70
- [MoE] Add Base MoE Layer by @hushenwei2000 in #61
- Delete sharded_state_dict to support FC save/load by @changeyoung98 in #71
- Align to PaddleFormers by @Waynezee in #72
- Ignore files generated from `uv sync` for custom ops by @ooooo-create in #69
- [Bug_Fix] fix attention_mask & skip check expert_tensor_parallel_group by @xuxinyi389 in #73
- Fix moe_layer config by @From00 in #74
- Fix save_tensors bugs and disable jit by @From00 in #75
- Add tensor parallel functions by @pkuzyc in #29
- Support TP Sharding EP For GLM4.5 by @xuxinyi389 in #76
- move spec utils to paddlefleet by @FeixLiu in #78
- Cherry pp layers by @FeixLiu in #80
- add non pipeline execution by @LiYuRio in #81
- Shared weight test by @FeixLiu in #86
- modify pylayer bug by @xiaoguoguo626807 in #87
- refine non-pp scheduler by @LiYuRio in #89
- Support MTP in GLM4.5 and add unittest by @lshpku in #55
- Use original cross_entropy and re-open the loss check in unit test by @pkuzyc in #84
- Fix rope dim order by @lshpku in #91
- support PipelineParallel by @AlAuAu in #92
- fix single card run by @huangjiyi in #90
- Fix bug in tensor_parallel unit tests by @pkuzyc in #93
- [MoE Layer] Fix EP Hang when No Tokens are Distributed in the Rank by @hushenwei2000 in #83
- pp License fix by @AlAuAu in #95
- [CI] add integration test glm by @swgu98 in #85
- Add sharded_state_dict for TP by @changeyoung98 in #94
- [CI] fix bypass by @swgu98 in #97
- Add instructions for copilot reviewer by @risemeup1 in #96
- [Feature] Add test instruction by @risemeup1 in #98
- disable test_layers.py by @swgu98 in #99
- [CI] Delete sed by @swgu98 in #101
- rename config fields to align huggingface by @Hz188 in #82
- Fix bias grad reduction of bias_geglu_back by @lshpku in #100
- fix config by @Waynezee in #108
- support pipeline_parallel_withinterleave by @AlAuAu in #102
- [Feature] Add nightly wheel publishing workflow by @swgu98 in #107
- [CI] Remove redundant AK/SK exports in nightly publish workflow by @swgu98 in #115
- support PipelineParallelWithInterleaveFthenB and VPPFhenBInBalancedMemory by @AlAuAu in #113
- turn off deepep on ampere and fix logging by @huangjiyi in #109
- add llava_model and clip_vit model by @blacksheep-Aristotle in #105
- support distributed_model by @AlAuAu in #111
- fix deterministic by @Waynezee in #116
- 【modelconfig】Change model layer name to support hf model by @xiaoguoguo626807 in #118
- support fp8 fusion node by @deepllz in #114
- Move sdpa before kv broadcast by @lshpku in #121
- Support fuse rope by @xuxinyi389 in #117
- model_config_and_dpo_support. by @wtmlon in #106
- Fix bugs in vocab_parallel_cross_entropy and VocabParallelEmbedding by @pkuzyc in #104
- Change name 2 by @xiaoguoguo626807 in #122
- Sequence parallel for GPTModel by @pkuzyc in #125
- Refine custom ops compile by @zhangbo9674 in #126
- add single card test and a100 test by @huangjiyi in #124
- Use Abi3 for building whl by @risemeup1 in #128
- Add setup test by @risemeup1 in #133
- add config by @Waynezee in #120
- add cp for paddlefleet by @Wennie396 in #129
- add coverage by @tianlef in #131
- Fix sharded_state_dict for single card by @changeyoung98 in #135
- fix numel block cpu by @huangjiyi in #136
- [CI] Add PR paddle wheel by @swgu98 in #137
- [CI]fix_uv_sync by @tianlef in #138
- Fix bugs in sequence parallel and add unit test by @pkuzyc in #139
- [CI] Revert paddleformers commit for integration test by @swgu98 in #140
- [Refactor] Split tokens_stable_unzip.cu into modular CUDA files by @ooooo-create in #141
- 【fused_moe】fix Moe fp8_utils.py bwd by @xiaoguoguo626807 in #142
- support matmul_bwd by @xuxinyi389 in #134
- Add dedicated FusedRMSNorm class by @lshpku in #147
- [CI] Add customop approval in `ci/check_approval.sh` by @ooooo-create in #145
- 【fp8】expert weight stop gradient = True can't apply_backward_hook by @xiaoguoguo626807 in #149
- [Pipeline Parallel] support pipeline parallel for gpt model by @LiYuRio in #112
- [CI] glm45 a100 by @swgu98 in #154
- [CI] add flags by @swgu98 in #155
- Support DeepEPTopKRouter by @xuxinyi389 in #146
- Gpt pp ut by @FeixLiu in #156
- [CI] Add qwen precision & Update CI by @swgu98 in #162
- [CI] Add version for wheel by @swgu98 in #163
- 【model name】update ppmodel state_dict name by @xiaoguoguo626807 in #160
- [CI] single card test on h20 by @swgu98 in #167
- GLM multi card test by @xuxinyi389 in #166
- Support fuse_swiglu_scale by @xuxinyi389 in #164
- add attn_mask_startend_row_indices for flashmask by @Wennie396 in #159
- 【config, pp】delete pipeline_dtype ; add model func by @xiaoguoguo626807 in #169
- Clean some useless code by @ooooo-create in #150
- [CI] Update config name by @swgu98 in #174
- [MoE Layer] Add BF16 GroupedGEMM and Unit Tests by @hushenwei2000 in #127
- [2025-12-11-17:21] Bump `uv.lock` by @ooooo-create in #173
- fix cp bugs and add unit test for context parallel by @Wennie396 in #144
- Precision Change by @Waynezee in #184
- Add recompute by @Waynezee in #178
- add fp8_dispatch && shared_expert_overlap && offline quant by @Waynezee in #158
- Fix DeepEPTopKRouter for sp by @From00 in #186
- Support GLM45 with pipeline parallel by @LiYuRio in #168
- Move `paddlefleet.extensions.ops` to `paddlefleet.ops` by @ooooo-create in #176
- [CI] Add `Merge PR to test branch` to `Approval` workflow and fix known-first-party in `pyproject.toml` by @ooooo-create in #190
- [CI] add `rerun` workflow by @ooooo-create in #180
- [CI]incremental coverage by @tianlef in #157
- cache cos and sin for rope by @huangjiyi in #153
- [CI]change loss by @tianlef in #194
- [DeepGEMM] Support `DeepGEMM` as a submodule by @ooooo-create in #191
- add empty layer by @FeixLiu in #189
- [Compat] Add triton to torch_proxy scope by @ooooo-create in #201
- Update `.github/actions/check-bypass/action.yml` by @ooooo-create in #202
- [DeepGEMM] Fix deep_gemm install by @ooooo-create in #203
- [CI] change to cli by @swgu98 in #198
- add_recompute_modules by @Waynezee in #196
- [CI]find error for log by @tianlef in #200
- [3rdparty] add check for uninitialized submodules by @ooooo-create in #204
- bug fix for moe by @FeixLiu in #199
- Revert "[CI]find error for log" by @swgu98 in #210
- fix by @swgu98 in #208
- [CI]a100 case add: gated_linear_unit: true by @tianlef in #212
- [CI]fix ci config for cli by @tianlef in #214
- [Infra] Add `instructions` for faster local dev and remove `cpplint, clang-format` local hooks by @ooooo-create in #187
- 【Lora】fix lora pylayer bug by @xiaoguoguo626807 in #220
- Add printing of incremental coverage information by @XieYunshen in #193
- [Pipeline Parallel] NoPipelineParallel bugfix by @AlAuAu in #197
- [CI] add sft+lora by @swgu98 in #216
- fix recompute by @Waynezee in #221
- Bump `uv.lock` by @ooooo-create in #177
- [CI] Add new workflow to auto update `uv.lock` by @ooooo-create in #183
- [CI] add moe_router_force_load_balancing by @swgu98 in #228
- [DeepEP] Add `DeepEP` as a submodule by @ooooo-create in #215
- [BugFix] Fix update_dependencies.yml with limited disk space by @ooooo-create in #233
- [CI] Add `reopened` activity to trigger `pull_request` event in `Approval.yml` by @ooooo-create in #236
- [CI]fix config for pretrain memory error by @tianlef in #231
- add dict feature in function eval_batch & rename empty layer config by @Hz188 in #222
- [CI]change loss by @tianlef in #238
- [CI]change config by @tianlef in #244
- [Compat] Refine `paddle.compat.enable_torch_proxy` usage by @ooooo-create in #243
- [CI] deal exit code 250 by @tianlef in #209
- update precision by @swgu98 in #245
- 【】delete Random warning only print once by @xiaoguoguo626807 in #247
- support fused_swiglu_bwd by @xuxinyi389 in #239
- pp model support dpo. by @wtmlon in #181
- [CI]fix exit code of pt log file by @tianlef in #249
- [MoE Layer] Add Grouped GEMM Fused Expert Weights Version by @hushenwei2000 in #175
- unify subbatch by @xuxinyi389 in #240
- [CI] add release3.3 paddle by @swgu98 in #255
- [CI] add release3.3 single card by @swgu98 in #256
- [CI] change shell to formers by @swgu98 in #258
- [bugfix] fix pp empty layer config bug by @Hz188 in #259
- Formalize deep_gemm unittests by @A-nnonymous in #250
- fix lora bug by @xiaoguoguo626807 in #261
- Support rrattention in flashmask by @LLSGYN in #227
- fix_recompute_fused_rope by @huangjiyi in #264
- Fix loss diff for distributed strategies by @changeyoung98 in #254
- open fusion of swiglu by @xuxinyi389 in #251
- TopKRouter by @xuxinyi389 in #260
- Reduce GLM memory consumption by @zhangting2020 in #266
- [CI] del nemo megatron by @swgu98 in #275
- [CI] add qwen3moe by @swgu98 in #273
- [CI]Add glm dpo && coverage change by @tianlef in #274
- [CI] Grouped GEMM Integrated Test by @hushenwei2000 in #277
- fix flash_mask_cp by @Wennie396 in #219
- [BugFix] Add nvidia-nvshmem-cu12 limit to avoid multiple definitions by @ooooo-create in #285
- [MoE Layer] Implement barrier_ep for Synchronization by @hushenwei2000 in #272
- fix cp fused_rope by @Wennie396 in #278
- Fix TransToDataType dtype cast error by @sneaxiy in #290
- chore 🤖: Bump `uv.lock` (2026-01-04) by @github-actions[bot] in #291
- bug fix by @FeixLiu in #288
- Add sharded_state_dict for group_gemm by @changeyoung98 in #279
- remove unused operations and disable sequence_parallel when tp <= 1 by @Waynezee in #289
- [3rdparty][DeepEP] Bump DeepEP by @ooooo-create in #299
- [CI] single card unittest use uv build by @swgu98 in #296
- [3rdparty][DeepEP] Bump DeepEP by @ooooo-create in #300
- [CI] precision test by @swgu98 in #295
- [MoE Layer] Fix Deep GEMM k_group Kernel Calling by @hushenwei2000 in #305
- [CI] install dependencies of paddlefleet with cache by @swgu98 in #306
- [Sonicmoe] Add Sonicmoe as a submodule by @ooooo-create in #287
- [CI]Fix exit code check logic for multi card unit test by @tianlef in #303
- use uv build --wheel by @ooooo-create in #317
- chore 🤖: Bump `uv.lock` (2026-01-06) by @github-actions[bot] in #313
- align config by @Waynezee in #304
- fix cp unittest by @Wennie396 in #307
- Add `check_patchelf_exists` and bump sonic-moe by @ooooo-create in #326
- fix seq_aux_loss by @xuxinyi389 in #318
- [CI] update precision method by @swgu98 in #315
- [MoE Layer] Fix Router topk_weigtht in noaux_tc Method by @hushenwei2000 in #329
- [Feature] Add dynamic CUDA version-based dependency resolution by @ooooo-create in #293
- [CI]add cpu compile by @tianlef in #328
- [CI] coverage change to release by @swgu98 in #334
- [CI]disable multi card by @tianlef in #335
- tokens_unzip_gather support ue8m0 by @DanielSun11 in #310
- [CI] coverage by @swgu98 in #336
- Qwen3 vl by @blacksheep-Aristotle in #323
- [Build] Add git hash by @ooooo-create in #333
- [CI]fix coverage by @tianlef in #340
- [Build] Remove .o files from wheel before packaging by @ooooo-create in #330
- [fix]GLM45 pretrain fp8 on cuda126 by @tianlef in #342
- [MoE Layer] Support deepgemm Padding to tile_M by @hushenwei2000 in #282
- fix ut by @Waynezee in #347
- [CI] nightly multi python by @swgu98 in #344
- fix pname miss in grouped moe by @liufengwei0103 in #325
- fix rope bug by @blacksheep-Aristotle in #338
- [CI] add cancel by @swgu98 in #349
- disable fp8 and deepep when cuda12.6 by @risemeup1 in #345
- [MoE Layer] Delete moe_deep_gemm Config by @hushenwei2000 in #312
- Fix bug for tokens_unzip_gather_kernel by @DanielSun11 in #341
- fix router precision by @xuxinyi389 in #348
- Fix the bug for MultiModalRope when mbs>1 by @pkuzyc in #351
- Fix tensor model parallel world size return logic by @XieYunshen in #353
- bump sonic-moe by @ooooo-create in #355
- [CE]ADD CE by @tianlef in #316
- [CI] paddle release tag by @swgu98 in #352
- Fix the bug when get cp rank and size in rope by @pkuzyc in #358
- fix layer_norm bug by @blacksheep-Aristotle in #350
- fix seq_aux_loss by @Wennie396 in #361
- [Recompute] adapt rr and support dict in selective recompute by @Waynezee in #294
- 【moe】add moe_fuse config only lora use by @xiaoguoguo626807 in #366
- Fix the mis-match name bug of gelu_pytorch_tanh act by @pkuzyc in #363
- [CI]fix coverage by @tianlef in #369
- [DeepEP] Switch to `paddlefleet.ops.deep_ep` by @ooooo-create in #301
- [CI] add timeout by @swgu98 in #380
- support glm vpp overlap by @LiYuRio in #234
- [ThirdParty] Bump sonic-moe version to reduce launch triton kernel overhead by @SigureMo in #381
- [CE]add multi version python pipe by @tianlef in #357
- [MoE Layer] Default use Paddle batched_gemm when enable moe_grouped_gemm by @hushenwei2000 in #370
- fix_rr_rules by @Waynezee in #383
- [MoE Layer] Add moe_ep_barrier configuration by @hushenwei2000 in #373
- [MoE Layer] Fix AllToAll Implementation when TP > 1 by @hushenwei2000 in #360
- Revert "[DeepEP] Switch to `paddlefleet.ops.deep_ep`" by @XieYunshen in #382
- add high_precision_rope by @blacksheep-Aristotle in #377
- fix_rope and seq_aux_loss by @Waynezee in #376
- Update Paddle dependency version by @swgu98 in #387
- [CI] Update grouped_gemm Unit Test for CUDA13 by @hushenwei2000 in #388
- Modify the qwen3vl mrope calculation logic by @qhpeklh5959 in #379
- [CE]Sonic moe by @tianlef in #386
- add manual by @swgu98 in #391
- manual wheel update by @swgu98 in #392
- adapter sonic_moe by @xingmingyyj in #365
- fix fusion rope in cp by @Waynezee in #396
- add op tests by @xingmingyyj in #395
- [ThirdParty] Bump sonic-moe version to patch `paddle.empty` to support distributed env by @SigureMo in #402
- [Test]test release by @tianlef in #399
- [CI] Add a workflow to cherry-pick PR by add label with `cherry-pick: <target_branch>` format by @ooooo-create in #404
- [CE] fix install paddlefleet use pip cache by @tianlef in #405
- support glm refined recompute in vpp overlap by @AlAuAu in #407
- Fix dispatch node grad's autocast by @AlAuAu in #412
- [BugFix] Fix NoPipelineParallel init to support parallel parameter broadcasting by @huangjiyi in #411
- [CE]fix ce install extra-index-url by @tianlef in #414
- fix fp8 by @swgu98 in #409
- [CE]change cuda130 test machine group by @tianlef in #419
- hack test in paddle by @swgu98 in #416
- Support ue8m0 for Fuse_SPAQ by @Eddie-Wang1120 in #292
- fix p2p overlap by @FeixLiu in #408
- [UE8M0] Support Fuse_stack_transpose_fp8_quant by @Eddie-Wang1120 in #417
- [Docs] update `CONTRIBUTING.md` by @ooooo-create in #425
- [CI] nightly group by @swgu98 in #429
- [CE]change ce to fleet/formers release branch by @tianlef in #423
- opt FusedWeightedSwigluActQuantKernel by @zhengshengning in #422
- [UE8M0] Support Fused_Transpose_Split_Quant by @Eddie-Wang1120 in #378
- fix pp overlap with moe_grouped_gemm by @AlAuAu in #432
- [CE]Add ce loss check by @tianlef in #431
- fix manual build wheel commit by @swgu98 in #394
- support glm vpp overlap with dense layer by @AlAuAu in #438
- test: improve env var detection in moe test by @ooooo-create in #430
- Mtp support by @FeixLiu in #442
- Add layer number in moe router needed by rl r3 by @liufengwei0103 in #443
- [CI]add pr check by @tianlef in #454
- update precision method by @swgu98 in #446
- fix vpp overlap precision by @AlAuAu in #445
- hack mtp test for test by @tianlef in #457
- [UE8M0] fix bug for fuse_stack by @Eddie-Wang1120 in #447
- HACK test_gpt_pp_with_moe.py by @tianlef in #460
- [CE]fix install error && change tag paddle to release by @tianlef in #459
- 【】fix Qwen3vl moe BMMFunction bug by @xiaoguoguo626807 in #455
- [CE]Fix ce install paddle of release by @tianlef in #461
- Disable rr_attn_estimate_triton_op test by @LLSGYN in #462
- [Build] Clean python compile file in final wheel pkg by @ooooo-create in #458
- Disable test_tokens_unzip_gather by @DanielSun11 in #463
- add xpu backend by @yongqiangma in #452
- Recover test_tokens_unzip_gather by @DanielSun11 in #465
- Support mtp_input and hidden_states fusing with no vpp-overlap by @AlAuAu in #467
- fix fp8_quant_weight by @Waynezee in #468
- [CE]fix paddle install for multi python paddle dev by @tianlef in #466
- The release branch is packaged and uploaded to the self-hosted nightly build repository by @swgu98 in #471
- release wheel by @swgu98 in #476
- [MoE] Add moe_use_pfcc_deepep flag by @ooooo-create in #474
- [CI]change paddle to release latest by @tianlef in #479
- 【】tmp Qwen3vl moe lora fix by @xiaoguoguo626807 in #478
- Revert "[CI]change paddle to release latest" by @tianlef in #483
- fix mtp layer bug by @AlAuAu in #486
- Remove split_group_gemm by @Waynezee in #485
- update version by @swgu98 in #492
- 【qwen3vl 】fix model param name to decrease aoa by @xiaoguoguo626807 in #482
- [HACK] multi card test of moe for paddle by @tianlef in #495
- [MoE Layer] Change moe_grouped_gemm Not Support Warning to Error by @hushenwei2000 in #487
- [CI] get formers dev by @swgu98 in #499
- [CE]change release to formers 2.0 && dev to formers dev by @tianlef in #493
- 【lazy_init】change lm_head param init func to support lazy_init by @xiaoguoguo626807 in #502
- [ThirdParty] Bump DeepEP to avoid performance regression by @SigureMo in #504
- 【Lazy init】 change layers.py param init method to support lazy_init by @xiaoguoguo626807 in #503
- [CE] add dev multi python by @tianlef in #512
- update _dtype by @BossPi in #508
- [CI]release ci change to formers release by @tianlef in #509
- [JIT] disable dy2st in fleet only by @SigureMo in #515
- [ThirdParty] Set `moe_use_pfcc_deepep` default value to `True` by @SigureMo in #517
- [CI] Update cherry-pick template to pass CI check by @SigureMo in #519
- manual build by @swgu98 in #506
- [CE]add qwen3vl by @tianlef in #524
- 【cherry-pick】lazy_init 503 by @xiaoguoguo626807 in #527
- 【cherry-pick 】fix layer dtype 508 by @xiaoguoguo626807 in #526
- Add a general-purpose encoder by @qhpeklh5959 in #531
- change paddle to 3.3 release by @tianlef in #529
- [CodeStyle] fix some large tensor issues by @zrr1999 in #514
- move empty after mtp by @FeixLiu in #537
- [CE]change paddle url by @tianlef in #542
- auto_tuner by @xuxinyi389 in #534
- 【BugFix】refine moe_layer.py by @risemeup1 in #540
- [ThirdParty] Cleanup `moe_use_pfcc_deepep` configuration by @SigureMo in #541
- [CodeStyle] Integrate ast-grep to pre-commit hooks by @zrr1999 in #513
- [CE]bug fix by @tianlef in #548
- Check release fix by @swgu98 in #550
- add nightly whl to bos by @swgu98 in #551
- Add FP8 linear layer by @pkuzyc in #544
- [BugFix] refine moe_layer.py by @risemeup1 in #557
- refine pyproject.toml by @risemeup1 in #556
- [Fleet CI]add a100 by @tianlef in #553
- fix_llava by @xuxinyi389 in #558
- [CE]add a100 case by @tianlef in #562
- [Fleet CI DEV] add qwen3vl by @tianlef in #565
- [CI] Remove cherry-pick labels after workflow completion by @ShigureNyako in #567
- [CE]fix ce workflow by @tianlef in #568
- [CI]change diff for unit test by @tianlef in #570
- Temporarily bypass loss check until PR77876 is merged by @A-nnonymous in #573
- remove fuse_rms_norm config by @huangjiyi in #578
- Fix gpt_model by @xuxinyi389 in #579
- [CI/CE]Add GLM PT EP4 && FIX CE by @tianlef in #583
- [xpu] xpu backend run pass by @yongqiangma in #523
- change paddle version for protobuf by @tianlef in #586
- unify_gpt_model by @xuxinyi389 in #589
- [large tensor] fix CUDA extensions int64 overflow for large tensor dimensions by @zrr1999 in #561
- [MoE Layer]: Support Gated Shared Expert by @hushenwei2000 in #593
- Fix some test about enable_partial_send_recv(True)+sequence_parallel(True) by @xuxinyi389 in #590
- [Fleet CE]new CE by @tianlef in #585
- fix moe_layer_freq config by @Waynezee in #599
- [CE]fix ce precision scripts by @tianlef in #601
- [GLM5]Add MLA by @changeyoung98 in #574
- speed_up by @xuxinyi389 in #609
- 【kimi】fix Mla+flashmask bug by @xiaoguoguo626807 in #608
- 【kimi】fix moe_layer_pattern by @xiaoguoguo626807 in #611
- [CI]fix coverage by @tianlef in #614
- 【kimi】fix kimik25 bug by @xiaoguoguo626807 in #613
- Modify the sys path for unit tests by @swgu98 in #616
- 【kimi】fix kimik2 yarn rope by @xiaoguoguo626807 in #626
- [CI]add total coverage and fix ce precision by @tianlef in #623
- hack recompute with vpp overlap by @AlAuAu in #632
- [CE]fix weekly CE by @tianlef in #630
- [CE]fix ce bug by @tianlef in #634
- [CE]fix ce by @tianlef in #635
- fix auto-parallel by @sevenan2 in #587
- [Paddle Version]change to 38eee703e79 by @tianlef in #641
- [Arch]Add more GPU arches support by @swgu98 in #645
- 【kimik2】Kimik2 fix attention padding by @xiaoguoguo626807 in #649
- pass through rotary_pos_cos/sin by @xuxinyi389 in #648
- support skip mtp by @xingmingyyj in #597
- [CE]modify CE running time and max parallel by @tianlef in #651
- 【kimiK25 vision 】kimik25 vision by @xiaoguoguo626807 in #610
- fix MLA by @changeyoung98 in #637
- Update paddlepaddle-gpu dependency to version 260319 by @hushenwei2000 in #654
- Fast path for rope when cp_size==1 by @xuxinyi389 in #647
- [CI] Upgrade GitHub Actions for Node 24 compatibility by @ShigureNyako in #661
- [Paddle] change to 3.3.1 by @tianlef in #663
- [Qwen3.5] GatedDeltaNet by @pkuzyc in #624
- remove WeightOnlyMTPLayer weights stop_gradient by @xingmingyyj in #674
- support subbatch in language loss by @wangyuwen1999 in #672
- [CI]change ci type by @tianlef in #659
- Fleet mtp upgrade dev by @wtmlon in #673
- Remove no delay_scale_loss branch by @Waynezee in #642
- Gated Attention by @pkuzyc in #625
- 【kimik2】fix yarn_rope bug by @xiaoguoguo626807 in #678
- FA4 integration by @Waynezee in #638
- add build_nvshemem.sh by @risemeup1 in #684
- unit test for subbatch by @wangyuwen1999 in #677
- [wip] qwen vl sp by @FeixLiu in #665
- [TEST]fix subbatch test by @tianlef in #687
- Switch to Paddle-provided NVSHMEM wheels and add SM103 support by @swgu98 in #686
- [Test]fix subbatch test by @tianlef in #690
- [XPU] Support fused_swiglu_scale forward and backward via decomposed … by @ZhangX-21 in #671
- [XPU] support qwen_vl_30b in xpu device by @ZhangX-21 in #670
- Add block attention residuals by @pkuzyc in #693
- [ThirdParty] update sonic_moe and fa4 by @Waynezee in #689
- [Unit TEST]skip 'test_gpt_pp.py' in paddle unit test until pr 78013 merged by @tianlef in #698
- 【MLA】support mla down_proj use notplinear by @xiaoguoguo626807 in #692
- Add Fuse_vision_rope For VIT by @Eddie-Wang1120 in #697
- [Qwen3.5] Add vision encoder for qwen3.5 by @pkuzyc in #605
- [TEST]fix test_layers sys path by @tianlef in #706
- bug fix for qwen by @FeixLiu in #704
- [Coverage]add full coverage by @tianlef in #682
- Optimize VisionModel rope position encoding and packed seq attention by @huangjiyi in #695
- [CE] delete nvidia-cutlass-dsl==4.2.1 by @tianlef in #717
- Replace rms_norm and swiglu by paddle api by @zhangbo9674 in #714
- Support qwen35 by @xuxinyi389 in #724
- [CI] add some new function and ci update to python 3.13 by @tianlef in #718
- [Paddle]change paddle to 3.3.1.post20260403+ef0820a64e9 by @tianlef in #729
- [Refactor] Restructure rope_utils.py: consolidate config and inline fp32 by @huangjiyi in #732
- Add DSA Module by @xingmingyyj in #683
- fix moe lora gemm by @Lcysabcu in #709
- Fix the bug of block attention residuals in AMP by @pkuzyc in #719
- [Refactor] Remove config dependency from internal RoPE functions by @huangjiyi in #740
- update nvshmem by @swgu98 in #742
- [Benchmark]intergreation_test by @xuxinyi389 in #738
- [CI]del wget miniconda by @tianlef in #745
- [CI]change docker of python 3.12 by @tianlef in #747
- [bugfix] Fix fp8 run by @Waynezee in #728
- Add qk_norm_type config for per_layer qk_norm by @changeyoung98 in #734
- update submodule DeepEP by @youge325 in #712
- Add moe_deep_gemm config && replace matmul_add to linear by @zhangbo9674 in #751
- Update Paddle to 3.3.1.post20260409+52c898ee9ac by @DanielSun11 in #754
- arm by @swgu98 in #758
- [Feature] Support fused vision RoPE kernel for high_precision_rope by @huangjiyi in #736
- Use _new_shared_tensor for input in save_for_backward by @hushenwei2000 in #759
- [Cleanup] Replace torch proxy alias with public compat API by @ShigureNyako in #763
- Add arm pipeline by @tianlef in #741
- 20260404 add ai edited test by @liuhao2638 in #694
- [CI]fix ci unit coverage && load ecosystem lib by @tianlef in #753
- Replace fused_stack_quant and fused_weighted_swiglu_act_quant by custom_op by @DanielSun11 in #768
- [Feat] Align flashmask `_C_ops` API with the latest overlap ver API by @Enigmatisms in #774
- [CE]fix CE bug by @tianlef in #779
- support dense mtp by @xingmingyyj in #764
- [FP8] Use `fuse_weighted_swiglu_fp8_quant` in MoELayer by @SigureMo in #781
- fix NoPipelineParallel.eval_batch by @WYB27 in #778
- [CI]add more concurrency of single card by @tianlef in #775
- align PF & EC by @xuxinyi389 in #777
- 20260412 add ai edited test by @liuhao2638 in #772
- [Pipeline Parallel] Migrate PP Components to Paddle by @hushenwei2000 in #563
- Fix: skip qk fuse rope when high_precision_rope is enabled by @huangjiyi in #782
- Fix Unit Tests import PaddleFleet PP to Paddle PP by @hushenwei2000 in #793
- Remove 30 failing ai edited single card tests by @liuhao2638 in #798
- fix rng bug by @FeixLiu in #773
- Add Moe balanced logging and module profiler by @huangjiyi in #791
- update deepep by @zoooo0820 in #795
- Rm fused rmsnorm for align by @DanielSun11 in #787
- [CI]fix precision update by @tianlef in #801
- [CE]fix ce by @tianlef in #799
- add RRAttention readme by @GuoxiaWang in #806
- Sonic moe support blackwell by @risemeup1 in #811
- [CI]fix cherry pick by @tianlef in #814
- [FIX] Add CUDA check for fused RoPE ops to support non-CUDA devices by @G2uge in #800
- [V2 align] align EC&PF by @xuxinyi389 in #813
- Fix align rng for moe_router_force_load_balancing by @Xing-lil in #817
- [DotProductAttention] Support _attn_implementation config for eager attn by @zhanghonggeng in #815
- [BugFix] fix CudaError 700 in fuse_stack_transpose when N>64 by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/804
- log MoE aux/zloss into balance logs by @huangjiyi in https://github.com/PaddlePaddle/PaddleFleet/pull/824
- [CI]add skip logic for only document pr by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/818
- add commitid file by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/827
- [CE] change paddle release url by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/826
- [Paddle] change paddle version to 3.4 by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/830
- separate mtp head & loss for pp balance by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/822
- Fix FP8Linear scale error in blackwell by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/836
- fix: increase test dimensions to avoid ue8m0 scale shape truncation in CUDA kernel by @zhanghonggeng in https://github.com/PaddlePaddle/PaddleFleet/pull/837
- gate attn for mla by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/832
- Temporarily disable gpt_pp test by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/834
- [ThirdParty] [FA] update flash-attention to support use_varlen in flashmask and add several bug fix. by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/838
- [New feature] Support MTP mask by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/829
- update baseline for b by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/845
- [Fix] refine recompute test by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/835
- Fix gates accumulation seq mismatch with ec by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/820
- update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/847
- fix sonic moe test by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/846
- 【Fix Tests】Fix test_layers unit tests in the CI parallel environment by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/848
- Fix dsa cuda 700 by @changeyoung98 in https://github.com/PaddlePaddle/PaddleFleet/pull/850
- [New feature] Add Latent MoE support with hidden state compression/decompression projections by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/851
- Embedding and MOE support masking by input_ids by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/844
- set rng flag for weight initialization by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/853
- Support learnable routed_scaling_factor by @From00 in https://github.com/PaddlePaddle/PaddleFleet/pull/854
- Support per-depth MTP input_ids for MoE routing mask by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/856
- [align] dense mlp by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/857
- use paddle.set_flag to freeze weight initialization for cp multicards case by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/859
- flash_mask_path_fix and align z_loss by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/864
- add subbatch && auto_subbatch by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/821
- Add the RecomputeWithoutOutput util by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/862
- Support configure deep_ep num_sms and buffer size by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/861
- [MLA] Support zero-copy slice for q_rope calculation by @hushenwei2000 in https://github.com/PaddlePaddle/PaddleFleet/pull/849
- Add qk_norm_fusion by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/839
- Add RRAttention Paddle release by @LLSGYN in https://github.com/PaddlePaddle/PaddleFleet/pull/812
- [CI] disable torch check by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/869
- Fuse lm_head with cross entropy loss to reduce memory usage and speed up by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/866
- fix: ensure shared_experts use original hidden_size in Latent MoE by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/868
- [FA] enable flashmask_use_varlen by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/860
- Support Kgroupgemm in PaddleFleet by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/871
- Add align mode for aux_loss when enable mtp by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/873
- [FA] add clone for startend_row_indices by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/875
- 【fd】fix fd decoder input by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/877
- Add Test For Kgroupgemm and Fix bug by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/876
- Update submodule DeepEP by @youge325 in https://github.com/PaddlePaddle/PaddleFleet/pull/762
- Align z_loss with EC by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/888
- [EP] Add HybridEP support by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/852
- Add triton fused sigmoid gate for MLA by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/872
- Add align mode for fused_ce loss by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/886
- fix qk_norm_fusion use by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/892
- Revert "[EP] Add HybridEP support" by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/893
- update paddle.deps version by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/897
- Add fused_moe_topk and fused_routing_map by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/842
- Remove log in fp8 quant by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/900
- [Unit Test] change test_pp.py by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/899
- fix sharded_state_dict by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/901
- [bugfix] fix rope in mla by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/906
- [CE] fix cp case && paddle release change to dev by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/909
- Adapt routing_map_fusion use by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/908
- rm mtp preprocess in modeling by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/883
- [CUDAGraph] add autocudagraph by @DrRyanHuang in https://github.com/PaddlePaddle/PaddleFleet/pull/896
- [XPU] add xpu nightly build whl by @plusNew001 in https://github.com/PaddlePaddle/PaddleFleet/pull/912
- support dw overlap by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/885
- [Build] Improve XPU build dependency version handling by @G2uge in https://github.com/PaddlePaddle/PaddleFleet/pull/913
- fix dsa args name by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/914
- [XPU] fix nightly build bug by @plusNew001 in https://github.com/PaddlePaddle/PaddleFleet/pull/915
- Split paddlefleet ops into standalone package via uv workspace by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/760
- fix ci ce bug by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/918
- refine yaml by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/920
- [BugFix] Fix paddlefleet-ops version pinning in build by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/922
- version get from PADDLEFLEET_VERSION by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/925
- update ops whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/923
- Reapply "[EP] Add HybridEP support" (#893) by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/895
- remove empty_cache in languageloss by @AlAuAu in https://github.com/PaddlePaddle/PaddleFleet/pull/924
- [CI]fix ci upload for python 3.12 by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/930
- ops whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/931
- [Build] fix build system for sonic-moe/quack uv sync by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/921
- [EP] Skip build HybridEP on CUDA 12.6 by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/933
- fix ops build whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/934
- fix manual whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/940
- Add dw_p2p_overlap config by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/928
- [OPs] Build DeepEP on detected CUDA arch only and raise error on invalid CUDA arch settings and sdist by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/938
- fix qwen35 bug by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/926
- [OPS]mv third_party from paddlefleet to paddlefleet_Ops by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/945
- [OPs] Error on load ecosystem library and add `apache-tvm-ffi` to dependencies for SonicMoE by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/944
- Fix quack import error by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/948
- [CI] update pr cache ops by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/949
- support ue8m0 by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/894
- update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/955
- Add inplace SwigluProbsGradKernel by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/833
- [Chore] Bump sonic-moe to PR#25 (epilogue refactor + wgrad fix) by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/950
- 20260512 add ai edited test by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/942
- Revert "Add inplace SwigluProbsGradKernel" by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/961
- [ops] update build ops by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/965
- [Deps] Bump `paddlepaddle-gpu` to `3.4.0.post20260514+38bb5555a0a` by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/963
- [CE]fix ce by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/969
- [Execute Infrastructure] Support commit-level wheel packaging for the develop branch and paddle-dev ops builds by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/973
- disable upload ops on paddle dev by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/975
- [OPS] Remove redundant ops namespace layer and flatten import paths by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/974
- change paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/976
- [OPs] Skip DeepEP build and enable ccache for wheel builds by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/977
- [New features] Support XPU backend for fused_bias_swiglu backward by @G2uge in https://github.com/PaddlePaddle/PaddleFleet/pull/962
- fix vit rope by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/978
- add moe_routed_expert_use_bias to TransformerConfig by @huangjiyi in https://github.com/PaddlePaddle/PaddleFleet/pull/972
- [CE]fix ce by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/983
- [align mode] align mtp by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/982
- [Performance Optimization] bump sonic-moe submodule to paddle branch HEAD (04a6848) by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/984
- 20260515 add ai edited tests by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/964
- [Enhancement] Support large tensor int64 indexing in fused_swiglu_bwd by @zrr1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/967
- [OPs][HybridEP] Add HybridEP buffer SM configuration by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/981
- add fused_mla_yarn_rope_apply for mla by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/870
- Unify moe_grouped_gemm && moe_expert_fusion by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/917
- support mtp and cp in paddlefleet's EB dataflow by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/985
- Using put_along_axis_ to reduce mem in Topk Router by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/946
- [OPs][HybridEP] Update HybridEP submodule by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/986
- fix gdn padding mask prob by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/994
- [ops] update build time by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/992
- Refactor DSA Module by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/987
- [TransformerConfig] refactor(config): delete use_latent_moe config by @hushenwei2000 in https://github.com/PaddlePaddle/PaddleFleet/pull/988
- [CI] add fleet log analysis bot by @zjjlivein in https://github.com/PaddlePaddle/PaddleFleet/pull/997
- [DeepSeekV4] CSA/HCA implement by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/957
- Align init method with megatron by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/991
- Fix auto subbatch expert_id offset calculation by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/911
- test: add AI coverage tests by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/998
- [Distributed Strategy] Add Refined Recompute support for DeepEP combine overlap by @wangbingguang2026 in https://github.com/PaddlePaddle/PaddleFleet/pull/995
- fleet build time by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/996
- [CE]disable test_ai_moe_layer_5.py by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1012
- [DeepSeekV4] support mtp layer hybrid attn by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1008
- [CI]Modify pr template by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1015
- test: update AI MoE coverage and filenames by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/1014
- [feat(mHC)]: Add basic implementation of manifold hyper connection(mHC) by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1000
- [CI] Use GitHub Actions step timeouts in workflows by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1018
- [CE]fix action bill problem by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1028
- [Devs] fix coverage path mapping in CI by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1021
- [Config] add mtp loss by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1017
- Fix auto subbatch expert node indexing by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1009
- update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1033
- [New features] support FastDeploy fallback to Fleet MLA model by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/1005
- [CUDAGraph] Teardown autocudagraph cache to prevent state pollution by @DrRyanHuang in https://github.com/PaddlePaddle/PaddleFleet/pull/990
- fix paddle ci bot post comment by @zjjlivein in https://github.com/PaddlePaddle/PaddleFleet/pull/1032
- [BugFix] use int32 for kgroupgemm by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1016
- [New features] Add hash routing and softplus score functions by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/1024
- mhc fix sp/tp bug by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1036
- Support k_grouped_gemm in auto_subbatch fallback by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1037
- 【fastdeploy】Fix need_do_attention when not use MLA by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/1041
- [XPU][CI] inject paddlepaddle-xpu version in publish_wheel_nightly_xpu… by @G2uge in https://github.com/PaddlePaddle/PaddleFleet/pull/1042
- [Operator]unify_swiglu by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1043
- [New features] Add ClampSwiGLU series operators by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/1025
- [New Feature] Fix KV cache and RoPE for greedy inference on MoE models by @liym27 in https://github.com/PaddlePaddle/PaddleFleet/pull/904
- Refine contribute md by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/1050
- [CI]fix multi cards failed by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1046
- Fix deepseek v4 by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/1047
- Add fp8 sonic moe in MoELayer by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/879
- [OPs][SonicMoE] Update `sonic-moe` submodule for compat import by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1056
- Mtp magic send by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1031
- [Bug fixes] fix duplicate m_indices generation in fp8_utils by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/1002
- [BugFix] Fix kv_b_proj global dim for Muon optimizer with GQA MLA by @liym27 in https://github.com/PaddlePaddle/PaddleFleet/pull/1057
- [Improvements] Support large tensor indexing in SwiGLU CUDA extensions by @zrr1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/1011
- [CI] Avoid approval false positives from base drift by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1060
- update flash-attention submodule by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1058
- [New features] support GQA SWA for MiMo by @GuoxiaWang in https://github.com/PaddlePaddle/PaddleFleet/pull/1039
- Support static subbatch (moe_subbatch_token_num_after_dispatch) with moe_deep_gemm=True by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1053
- [DeepSeekV4]support fused sparse attn tilelang kernel by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1054
- [XPU] add xpu ops build by @plusNew001 in https://github.com/PaddlePaddle/PaddleFleet/pull/1061
- [Custom OP] Add fused_swiglu_probs_bwd to paddlefleet_ops by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/989
- update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1067
- Revert "Mtp magic send (#1031)" by @XieYunshen in https://github.com/PaddlePaddle/PaddleFleet/pull/1064
- nvshmem on cuda132 by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1072
- [BugFix] Revert MLA GQA changes: use num_attention_heads instead of num_query_groups by @liym27 in https://github.com/PaddlePaddle/PaddleFleet/pull/1074
- Revert "[New features] support GQA SWA for MiMo (#1039)" by @GuoxiaWang in https://github.com/PaddlePaddle/PaddleFleet/pull/1076
- fix align mode FA calling by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/1077
- [New features][Bug fixes] Add TileLang CSA compressed indexer for DSv4 and fix ColumnLinear stop_gradient backward by @YJMSTR in https://github.com/PaddlePaddle/PaddleFleet/pull/1052
- Support magic_init by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/1075
- [CI] Fix ARM ops package build by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1084
- Reapply "Mtp magic send (#1031)" (#1064) by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1081
- mhc add fused kernels by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1055
- [Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy func and testcase by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/1078
- [Bug fixes] pass fa_version from forward to backward to keep fwd/bwd consistent by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1085
- [CI] Remove Python 3.10 PR package artifact flow by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1089
- fix mhc by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1091
- [New features] Sliding Window Attention for GQA and VHA by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1087
- [DeepSeekV4]align precision for megatron in dsv4 attn by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1088
- assert magic for pp=1 by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1092
- fix fuse_weighted_swiglu_fp8_quant 0_size by @zhengshengning in https://github.com/PaddlePaddle/PaddleFleet/pull/1083
- Make SonicMoE's weight layout consistent with GroupedGEMM by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/1086
- add fleet ci bot feedback collector by @zjjlivein in https://github.com/PaddlePaddle/PaddleFleet/pull/1082
- Fix mhc mtp bug in RL by @liuruyan in https://github.com/PaddlePaddle/PaddleFleet/pull/1096
- [New features] Support MLA sliding window attention by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1099
- [Bugfix] fix mtp_input_mask when enable sp by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1100
- Dsv4 align moe kernel by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1102
- [Bugfix] fix attention sink process in tilelang by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1105
- [BUG FIX] fix rope-theta for MLA by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1107
- [Improvements][DSv4 mHC] Align HyperConnection linear/aggregate numerics with Megatron under FLAGS_use_accuracy_compatible_kernel by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1103
- [New features] Update flash-attention submodule to support cutlass-dsl (cu13) by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1108
- Support cutlass-dsl (cu13) by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1111
- [0-size] paddlefleet_fused_swiglu_probs_bwd by @zhengshengning in https://github.com/PaddlePaddle/PaddleFleet/pull/1104
- [OPs] Integrate `cudnn-frontend` to `paddlefleet-ops` by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/1114
- [CI] Increase single-card test timeout by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1116
- fix_rope by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/1106
- 【fd】support decode yarn_rope_emb by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/1117
- [Improvements] align MTP loss and mHC kernel with Megatron accuracy-compatible path by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1112
- [Bug fixes] Fix rotary_pos_emb/swa_rotary_pos_emb None guard in Context Parallel for GQA by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1119
- [align mode] add v11 model by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1118
- [Environment Adaptation] add CUDA 13.2 (cu132) build support by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1120
- add high_precision_mhc for float32 or bfloat16 calc in mHC by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1098
- fix radio by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1090
- Set cast_to_low_precision to false in csa by @changeyoung98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1125
- [Bug fixes] Fix moe_deep_gemm disable when moe fusion enable by @AlAuAu in https://github.com/PaddlePaddle/PaddleFleet/pull/1128
- add more assert by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1115
- Add logs for mtp_loss by @From00 in https://github.com/PaddlePaddle/PaddleFleet/pull/903
- Check the 0-size issue of the custom operator in paddleFleetOp by @zhengshengning in https://github.com/PaddlePaddle/PaddleFleet/pull/1121
- update approve by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1130
- [fix] restoring position_id's value when not in align mode by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1129
- [Performance Optimization] Migrate CSA indexer backward to cuDNN frontend by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1094
- [OPs] Integrate `flash_mla` to `paddlefleet-ops` by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1132
- [Execute Infrastructure] Gate PaddleFleet GPT tests by repo flag by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1143
- [Feat][Bug fixes] Contextual parallelism support for DSV4 sparse-attention and some bug fixes. by @Enigmatisms in https://github.com/PaddlePaddle/PaddleFleet/pull/1139
- fix time out by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1149
- Fix csa for alignment by @changeyoung98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1148
- [Distributed Strategy][Bug fixes]fix PP hang caused by indexer loss log by @YJMSTR in https://github.com/PaddlePaddle/PaddleFleet/pull/1136
- change nvshmem on cuda132 by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1123
- support gqa selective recompute for gated_attn by @AlAuAu in https://github.com/PaddlePaddle/PaddleFleet/pull/1131
- [DSv4 mHC] Eliminate unnecessary synchronous memcpy by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1122
- update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1156
- Support Document Mask in CSA by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/1150
- [CI]fix release ci by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1165
- [Cherry-pick] fix mhc enable_grad in first fwd (#1164) by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1168
- [Bugfix] fix weight grad when enable sequence_parallel (#1154) by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1162
- [Bug Fix] Fix sonic_moe flag bug by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/1163
- [release/0.3] update ops by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1171
- [release/0.3] Move csa dtype align to FLAGS_use_accuracy_compatible_kernel mode by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1172
- [CI] fix release ci update ops by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1180
- 【Cherry-pick】Fix decoder_input not scattered in CP without MTP by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/1179
- [release/0.3][CI] Pin tvm-ffi below 0.1.12 for paddlefleet_ops by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1185
- [release/0.3][CI] disable single tests by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1188
- [release/0.3][Bug fixes] Revert disabling single-card tests by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1201
- [release/0.3][codex] Add -s to pytest test scripts by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1199
- [release/0.3] Fix CP scatter placement in gpt_embedding to occur before RoPE generation by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1204
- fix layer_idx if training with empty layer in head (#1181) by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1196
- [release/0.3][CI] Fix ARM ops package build dependencies by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1212
- [release/0.3] Fix build wheel by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1215
- [release/0.3][New features] refactor VHA and support Context Parallel by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1197
- [release/0.3] Add assertion to prevent sequence_parallel with CP scatter in plain path by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1213
- add logging print for index loss (#1178) by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1189
- [release/0.3] fix rope for q token by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1219
- [CP]Add cuDNN backend for CSA sparse attention by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1222
- [Bug fix] Use pad_token_id for mask by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/1224
- [CP] update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1228
- [release/0.3][New Feature] pass swa rope emb to MTP layer by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1235
- [release/0.3][align] add tp/sp in align mode by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1241
- [Cherry-Pick][New Feature] Add SharedKV for VHA by @GuoxiaWang in https://github.com/PaddlePaddle/PaddleFleet/pull/1242
- mHC fix H_res transpose (#1238) by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1240
- [release/0.3][DSV4 Attn][Fix] Fix document mask padding logic by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1249
- [Cherry-Pick][New features] Add Multimax LM-head fused CE support by @ZhouYuxuanYX in https://github.com/PaddlePaddle/PaddleFleet/pull/1244
- support autosubbatch mem info with legacy allocator by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1251
- [release/0.3][BugFix] fix align mode when cp = 1 by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1259
- [DSv4 CSA] Replace tensor iteration with tolist by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1257
- [DSv4] Add triton fused_apply_mla_rope_inplace for q and o (#1231) by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1258
- support QAT by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/1260
- [release/0.3][Performance Optimization] Migrate CSA indexer forward to cuDNN frontend by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1263
- [release/0.3][Devs][DSv4] Remove dead TileLang CSA indexer loss by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1270
- [release/0.3] Using Sparse attention API by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1278
New Contributors
- @From00 made their first contribution in #1
- @risemeup1 made their first contribution in #14
- @blacksheep-Aristotle made their first contribution in #26
- @XieYunshen made their first contribution in #38
- @zhangbo9674 made their first contribution in #48
- @hushenwei2000 made their first contribution in #61
- @xuxinyi389 made their first contribution in #73
- @LiYuRio made their first contribution in #81
- @deepllz made their first contribution in #114
- @A-nnonymous made their first contribution in #250
- @LLSGYN made their first contribution in #227
- @zhangting2020 made their first contribution in #266
- @sneaxiy made their first contribution in #290
- @github-actions[bot] made their first contribution in #291
- @liufengwei0103 made their first contribution in #325
- @qhpeklh5959 made their first contribution in #379
- @yongqiangma made their first contribution in #452
- @BossPi made their first contribution in #508
- @zrr1999 made their first contribution in #514
- @sevenan2 made their first contribution in #587
- @ZhangX-21 made their first contribution in #671
- @Lcysabcu made their first contribution in #709
- @youge325 made their first contribution in #712
- @liuhao2638 made their first contribution in #694
- @zoooo0820 made their first contribution in #795
- @G2uge made their first contribution in #800
- @Xing-lil made their first contribution in #817
- @zhanghonggeng made their first contribution in #815
- @adam-xiaoyao made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/851
- @xxyux made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/897
- @DrRyanHuang made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/896
- @plusNew001 made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/912
- @wangbingguang2026 made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/995
- @liym27 made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/904
- @baoqiwen made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/1058
Full Changelog: https://github.com/PaddlePaddle/PaddleFleet/commits/v0.3.0