Release PaddleFleet version 0.3.0 · PaddlePaddle/PaddleFleet

What's Changed

Add preliminary code by @From00 in #1
add spec utils by @FeixLiu in #2
add enums.py and identity_op.py by @GuoxiaWang in #3
add vpp_simulator by @Waynezee in #5
PaddleFleet distributed initialization and ProcessGroup create. by @Hz188 in #8
add codestyle workflow by @swgu98 in #6
Complete parallel_state.py by @Hz188 in #12
Trans transformer block/layer by @FeixLiu in #11
[CodeStyle] Ignore PLC0414 in __init__.py files by @SigureMo in #13
Complete process_groups_config.py doc and fix typo by @Hz188 in #15
[setup] Support source installation of Paddle-Fleet by @risemeup1 in #14
improve parallel_state.py for EPHcg & Hcg by @Hz188 in #16
[CI] Add approval workflow by @swgu98 in #10
support pipeline_parallel schedules by @AlAuAu in #9
dev global vars, yaml parse by @Hz188 in #18
[CI] fix approval workflow by @swgu98 in #20
Add Attention and RoPE classes by @lshpku in #17
add mtp by @FeixLiu in #22
unittest test_schedules bugfix by @AlAuAu in #24
add gpt_model specs by @GuoxiaWang in #23
add mlp layer by @blacksheep-Aristotle in #26
Trans ln by @FeixLiu in #25
Add PackedSeqParams class by @lshpku in #30
add psp by @FeixLiu in #31
Update LanguageModelEmbedding and add unittest by @lshpku in #27
Update RoPE and add unittest by @lshpku in #28
[CI] add test workflow by @swgu98 in #19
add pipeline/utils.py by @blacksheep-Aristotle in #33
[CI] change ut dir by @swgu98 in #34
[CI] git name by @swgu98 in #35
Placeholder for building transformer configuration parsing by @Hz188 in #32
[CI] Add typos white list by @swgu98 in #36
change sublayer to sublayer_spec by @FeixLiu in #39
fix some test_gpt_model dependencies by @GuoxiaWang in #40
Add global Timers for logging by @huangjiyi in #21
add check_initialized for dp group by @Hz188 in #41
add coverage scripts by @XieYunshen in #38
fix coverage scripts by @XieYunshen in #45
First successful run of GPTModel model definition by @GuoxiaWang in #44
Change fleet core to paddlefleet by @From00 in #46
fix coverage bug by @risemeup1 in #47
Some fixes to successfully run glm4.5 in PaddleFormers by @From00 in #49
single card test by @swgu98 in #50
mv paddlefleet to src by @risemeup1 in #52
use paddle12.6 in single card test by @risemeup1 in #53
Add GPTModelEstimator by @huangjiyi in #59
Fix attention dim order by @lshpku in #60
fix ffn_hidden_size is None when init gpt_mlp by @blacksheep-Aristotle in #56
Add estimate_mfu by @huangjiyi in #62
refine pyproject.toml by @risemeup1 in #63
[CI] add uv pre-commit by @swgu98 in #65
[CI] Add nemo megatron approval by @swgu98 in #51
Add set_logging and get_logger by @huangjiyi in #64
bugfix init GPTModel by @GuoxiaWang in #54
[CodeStyle][Ruff] update ruff target-version to py310 by @ooooo-create in #66
Support custom op by @zhangbo9674 in #48
【lora】fix model for Lora by @xiaoguoguo626807 in #68
Ci/multi card config by @swgu98 in #67
[CI] add approval for model_parallel_config.py & transformer_config.py by @swgu98 in #70
[MoE] Add Base MoE Layer by @hushenwei2000 in #61
Delete sharded_state_dict to support FC save/load by @changeyoung98 in #71
Align to PaddleFormers by @Waynezee in #72
Ignore files generated from uv sync for custom ops by @ooooo-create in #69
[Bug_Fix] fix attention_mask & skip check expert_tensor_parallel_group by @xuxinyi389 in #73
Fix moe_layer config by @From00 in #74
Fix save_tensors bugs and disable jit by @From00 in #75
Add tensor parallel functions by @pkuzyc in #29
Support TP Sharding EP For GLM4.5 by @xuxinyi389 in #76
move spec utils to paddlefleet by @FeixLiu in #78
Cherry pp layers by @FeixLiu in #80
add non pipeline execution by @LiYuRio in #81
Shared weight test by @FeixLiu in #86
modify pylayer bug by @xiaoguoguo626807 in #87
refine non-pp scheduler by @LiYuRio in #89
Support MTP in GLM4.5 and add unittest by @lshpku in #55
Use original cross_entropy and re-open the loss check in unit test by @pkuzyc in #84
Fix rope dim order by @lshpku in #91
support PipelineParallel by @AlAuAu in #92
fix single card run by @huangjiyi in #90
Fix bug in tensor_parallel unit tests by @pkuzyc in #93
[MoE Layer] Fix EP Hang when No Tokens are Distributed in the Rank by @hushenwei2000 in #83
pp License fix by @AlAuAu in #95
[CI] add integration test glm by @swgu98 in #85
Add sharded_state_dict for TP by @changeyoung98 in #94
[CI] fix bypass by @swgu98 in #97
Add instructions for copilot reviewer by @risemeup1 in #96
[Feature] Add test instruction by @risemeup1 in #98
disable test_layers.py by @swgu98 in #99
[CI] Delete sed by @swgu98 in #101
rename config fields to align huggingface by @Hz188 in #82
Fix bias grad reduction of bias_geglu_back by @lshpku in #100
fix config by @Waynezee in #108
support pipeline_parallel_withinterleave by @AlAuAu in #102
[Feature] Add nightly wheel publishing workflow by @swgu98 in #107
[CI] Remove redundant AK/SK exports in nightly publish workflow by @swgu98 in #115
support PipelineParallelWithInterleaveFthenB and VPPFhenBInBalancedMemory by @AlAuAu in #113
turn off deepep on ampere and fix logging by @huangjiyi in #109
add llava_model and clip_vit model by @blacksheep-Aristotle in #105
support distributed_model by @AlAuAu in #111
fix deterministic by @Waynezee in #116
【modelconfig】Change model layer name to support hf model by @xiaoguoguo626807 in #118
support fp8 fusion node by @deepllz in #114
Move sdpa before kv broadcast by @lshpku in #121
Support fuse rope by @xuxinyi389 in #117
model_config_and_dpo_support. by @wtmlon in #106
Fix bugs in vocab_parallel_cross_entropy and VocabParallelEmbedding by @pkuzyc in #104
Change name 2 by @xiaoguoguo626807 in #122
Sequence parallel for GPTModel by @pkuzyc in #125
Refine custom ops compile by @zhangbo9674 in #126
add single card test and a100 test by @huangjiyi in #124
Use Abi3 for building whl by @risemeup1 in #128
Add setup test by @risemeup1 in #133
add config by @Waynezee in #120
add cp for paddlefleet by @Wennie396 in #129
add coverage by @tianlef in #131
Fix sharded_state_dict for single card by @changeyoung98 in #135
fix numel block cpu by @huangjiyi in #136
[CI] Add PR paddle wheel by @swgu98 in #137
[CI]fix_uv_sync by @tianlef in #138
Fix bugs in sequence parallel and add unit test by @pkuzyc in #139
[CI] Revert paddleformers commit for integration test by @swgu98 in #140
[Refactor] Split tokens_stable_unzip.cu into modular CUDA files by @ooooo-create in #141
【fused_moe】fix Moe fp8_utils.py bwd by @xiaoguoguo626807 in #142
support matmul_bwd by @xuxinyi389 in #134
Add dedicated FusedRMSNorm class by @lshpku in #147
[CI] Add customop approval in ci/check_approval.sh by @ooooo-create in #145
【fp8】expert weight stop gradient = True can't apply_backward_hook by @xiaoguoguo626807 in #149
[Pipeline Parallel] support pipeline parallel for gpt model by @LiYuRio in #112
[CI] glm45 a100 by @swgu98 in #154
[CI] add flags by @swgu98 in #155
Support DeepEPTopKRouter by @xuxinyi389 in #146
Gpt pp ut by @FeixLiu in #156
[CI] Add qwen precision & Update CI by @swgu98 in #162
[CI] Add version for wheel by @swgu98 in #163
【model name】update ppmodel state_dict name by @xiaoguoguo626807 in #160
[CI] single card test on h20 by @swgu98 in #167
GLM multi card test by @xuxinyi389 in #166
Support fuse_swiglu_scale by @xuxinyi389 in #164
add attn_mask_startend_row_indices for flashmask by @Wennie396 in #159
【config, pp】delete pipeline_dtype ; add model func by @xiaoguoguo626807 in #169
Clean some useless code by @ooooo-create in #150
[CI] Update config name by @swgu98 in #174
[MoE Layer] Add BF16 GroupedGEMM and Unit Tests by @hushenwei2000 in #127
[2025-12-11-17:21] Bump uv.lock by @ooooo-create in #173
fix cp bugs and add unit test for context parallel by @Wennie396 in #144
Precision Change by @Waynezee in #184
Add recompute by @Waynezee in #178
add fp8_dispatch && shared_expert_overlap && offline quant by @Waynezee in #158
Fix DeepEPTopKRouter for sp by @From00 in #186
Support GLM45 with pipeline parallel by @LiYuRio in #168
Move paddlefleet.extensions.ops to paddlefleet.ops by @ooooo-create in #176
[CI] Add Merge PR to test branch to Approval workflow and fix known-first-party in pyproject.toml by @ooooo-create in #190
[CI] add rerun workflow by @ooooo-create in #180
[CI]incremental coverage by @tianlef in #157
cache cos and sin for rope by @huangjiyi in #153
[CI]change loss by @tianlef in #194
[DeepGEMM] Support DeepGEMM as a submodule by @ooooo-create in #191
add empty layer by @FeixLiu in #189
[Compat] Add triton to torch_proxy scope by @ooooo-create in #201
Update .github/actions/check-bypass/action.yml by @ooooo-create in #202
[DeepGEMM] Fix deep_gemm install by @ooooo-create in #203
[CI] change to cli by @swgu98 in #198
add_recompute_modules by @Waynezee in #196
[CI]find error for log by @tianlef in #200
[3rdparty] add check for uninitialized submodules by @ooooo-create in #204
bug fix for moe by @FeixLiu in #199
Revert "[CI]find error for log" by @swgu98 in #210
fix by @swgu98 in #208
[CI]a100 case add: gated_linear_unit: true by @tianlef in #212
[CI]fix ci config for cli by @tianlef in #214
[Infra] Add instructions for faster local dev and remove cpplint, clang-format local hooks by @ooooo-create in #187
【Lora】fix lora pylayer bug by @xiaoguoguo626807 in #220
Add printing of incremental coverage information by @XieYunshen in #193
[Pipeline Parallel] NoPipelineParallel bugfix by @AlAuAu in #197
[CI] add sft+lora by @swgu98 in #216
fix recompute by @Waynezee in #221
Bump uv.lock by @ooooo-create in #177
[CI] Add new workflow to auto update uv.lock by @ooooo-create in #183
[CI] add moe_router_force_load_balancing by @swgu98 in #228
[DeepEP] Add DeepEP as a submodule by @ooooo-create in #215
[BugFix] Fix update_dependencies.yml with limited disk space by @ooooo-create in #233
[CI] Add reopened activity to trigger pull_request event in Approval.yml by @ooooo-create in #236
[CI]fix config for pretrain memory error by @tianlef in #231
add dict feature in function eval_batch & rename empty layer config by @Hz188 in #222
[CI]change loss by @tianlef in #238
[CI]change config by @tianlef in #244
[Compat] Refine paddle.compat.enable_torch_proxy usage by @ooooo-create in #243
[CI] deal exit code 250 by @tianlef in #209
update precision by @swgu98 in #245
【】delete Random warning only print once by @xiaoguoguo626807 in #247
support fused_swiglu_bwd by @xuxinyi389 in #239
pp model support dpo. by @wtmlon in #181
[CI]fix exit code of pt log file by @tianlef in #249
[MoE Layer] Add Grouped GEMM Fused Expert Weights Version by @hushenwei2000 in #175
unify subbatch by @xuxinyi389 in #240
[CI] add release3.3 paddle by @swgu98 in #255
[CI] add release3.3 single card by @swgu98 in #256
[CI] change shell to formers by @swgu98 in #258
[bugfix] fix pp empty layer config bug by @Hz188 in #259
Formalize deep_gemm unittests by @A-nnonymous in #250
fix lora bug by @xiaoguoguo626807 in #261
Support rrattention in flashmask by @LLSGYN in #227
fix_recompute_fused_rope by @huangjiyi in #264
Fix loss diff for distributed strategies by @changeyoung98 in #254
open fusion of swiglu by @xuxinyi389 in #251
TopKRouter by @xuxinyi389 in #260
Reduce GLM memory consumption by @zhangting2020 in #266
[CI] del nemo megatron by @swgu98 in #275
[CI] add qwen3moe by @swgu98 in #273
[CI]Add glm dpo && coverage change by @tianlef in #274
[CI] Grouped GEMM Integrated Test by @hushenwei2000 in #277
fix flash_mask_cp by @Wennie396 in #219
[BugFix] Add nvidia-nvshmem-cu12 limit to avoid multiple definitions by @ooooo-create in #285
[MoE Layer] Implement barrier_ep for Synchronization by @hushenwei2000 in #272
fix cp fused_rope by @Wennie396 in #278
Fix TransToDataType dtype cast error by @sneaxiy in #290
chore 🤖: Bump uv.lock (2026-01-04) by @github-actions[bot] in #291
bug fix by @FeixLiu in #288
Add sharded_state_dict for group_gemm by @changeyoung98 in #279
remove unused operations and disable sequence_parallel when tp <= 1 by @Waynezee in #289
[3rdparty][DeepEP] Bump DeepEP by @ooooo-create in #299
[CI] single card unittest use uv build by @swgu98 in #296
[3rdparty][DeepEP] Bump DeepEP by @ooooo-create in #300
[CI] precision test by @swgu98 in #295
[MoE Layer] Fix Deep GEMM k_group Kernel Calling by @hushenwei2000 in #305
[CI] install dependencies of paddlefleet with cache by @swgu98 in #306
[Sonicmoe] Add Sonicmoe as a submodule by @ooooo-create in #287
[CI]Fix exit code check logic for multi card unit test by @tianlef in #303
use uv build --wheel by @ooooo-create in #317
chore 🤖: Bump uv.lock (2026-01-06) by @github-actions[bot] in #313
align config by @Waynezee in #304
fix cp unittest by @Wennie396 in #307
Add check_patchelf_exists and bump sonic-moe by @ooooo-create in #326
fix seq_aux_loss by @xuxinyi389 in #318
[CI] update precision method by @swgu98 in #315
[MoE Layer] Fix Router topk_weight in noaux_tc Method by @hushenwei2000 in #329
[Feature] Add dynamic CUDA version-based dependency resolution by @ooooo-create in #293
[CI]add cpu compile by @tianlef in #328
[CI] coverage change to release by @swgu98 in #334
[CI]disable multi card by @tianlef in #335
tokens_unzip_gather support ue8m0 by @DanielSun11 in #310
[CI] coverage by @swgu98 in #336
Qwen3 vl by @blacksheep-Aristotle in #323
[Build] Add git hash by @ooooo-create in #333
[CI]fix coverage by @tianlef in #340
[Build] Remove .o files from wheel before packaging by @ooooo-create in #330
[fix]GLM45 pretrain fp8 on cuda126 by @tianlef in #342
[MoE Layer] Support deepgemm Padding to tile_M by @hushenwei2000 in #282
fix ut by @Waynezee in #347
[CI] nightly multi python by @swgu98 in #344
fix pname miss in grouped moe by @liufengwei0103 in #325
fix rope bug by @blacksheep-Aristotle in #338
[CI] add cancel by @swgu98 in #349
disable fp8 and deepep when cuda12.6 by @risemeup1 in #345
[MoE Layer] Delete moe_deep_gemm Config by @hushenwei2000 in #312
Fix bug for tokens_unzip_gather_kernel by @DanielSun11 in #341
fix router precision by @xuxinyi389 in #348
Fix the bug for MultiModalRope when mbs>1 by @pkuzyc in #351
Fix tensor model parallel world size return logic by @XieYunshen in #353
bump sonic-moe by @ooooo-create in #355
[CE]ADD CE by @tianlef in #316
[CI] paddle release tag by @swgu98 in #352
Fix the bug when get cp rank and size in rope by @pkuzyc in #358
fix layer_norm bug by @blacksheep-Aristotle in #350
fix seq_aux_loss by @Wennie396 in #361
[Recompute] adapt rr and support dict in selective recompute by @Waynezee in #294
【moe】add moe_fuse config only lora use by @xiaoguoguo626807 in #366
Fix the mis-match name bug of gelu_pytorch_tanh act by @pkuzyc in #363
[CI]fix coverage by @tianlef in #369
[DeepEP] Switch to paddlefleet.ops.deep_ep by @ooooo-create in #301
[CI] add timeout by @swgu98 in #380
support glm vpp overlap by @LiYuRio in #234
[ThirdParty] Bump sonic-moe version to reduce launch triton kernel overhead by @SigureMo in #381
[CE]add multi version python pipe by @tianlef in #357
[MoE Layer] Default use Paddle batched_gemm when enable moe_grouped_gemm by @hushenwei2000 in #370
fix_rr_rules by @Waynezee in #383
[MoE Layer] Add moe_ep_barrier configuration by @hushenwei2000 in #373
[MoE Layer] Fix AllToAll Implementation when TP > 1 by @hushenwei2000 in #360
Revert "[DeepEP] Switch to paddlefleet.ops.deep_ep" by @XieYunshen in #382
add high_precision_rope by @blacksheep-Aristotle in #377
fix_rope and seq_aux_loss by @Waynezee in #376
Update Paddle dependency version by @swgu98 in #387
[CI] Update grouped_gemm Unit Test for CUDA13 by @hushenwei2000 in #388
Modify qwen3vl mrope computation logic by @qhpeklh5959 in #379
[CE]Sonic moe by @tianlef in #386
add manual by @swgu98 in #391
manual wheel update by @swgu98 in #392
adapter sonic_moe by @xingmingyyj in #365
fix fusion rope in cp by @Waynezee in #396
add op tests by @xingmingyyj in #395
[ThirdParty] Bump sonic-moe version to patch paddle.empty to support distributed env by @SigureMo in #402
[Test]test release by @tianlef in #399
[CI] Add a workflow to cherry-pick PR by add label with cherry-pick: <target_branch> format by @ooooo-create in #404
[CE] fix install paddlefleet use pip cache by @tianlef in #405
support glm refined recompute in vpp overlap by @AlAuAu in #407
Fix dispatch node grad's autocast by @AlAuAu in #412
[BugFix] Fix NoPipelineParallel init to support parallel parameter broadcasting by @huangjiyi in #411
[CE]fix ce install extra-index-url by @tianlef in #414
fix fp8 by @swgu98 in #409
[CE]change cuda130 test machine group by @tianlef in #419
hack test in paddle by @swgu98 in #416
Support ue8m0 for Fuse_SPAQ by @Eddie-Wang1120 in #292
fix p2p overlap by @FeixLiu in #408
[UE8M0] Support Fuse_stack_transpose_fp8_quant by @Eddie-Wang1120 in #417
[Docs] update CONTRIBUTING.md by @ooooo-create in #425
[CI] nightly group by @swgu98 in #429
[CE]change ce to fleet/formers release branch by @tianlef in #423
opt FusedWeightedSwigluActQuantKernel by @zhengshengning in #422
[UE8M0] Support Fused_Transpose_Split_Quant by @Eddie-Wang1120 in #378
fix pp overlap with moe_grouped_gemm by @AlAuAu in #432
[CE]Add ce loss check by @tianlef in #431
fix manual build wheel commit by @swgu98 in #394
support glm vpp overlap with dense layer by @AlAuAu in #438
test: improve env var detection in moe test by @ooooo-create in #430
Mtp support by @FeixLiu in #442
Add layer number in moe router needed by rl r3 by @liufengwei0103 in #443
[CI]add pr check by @tianlef in #454
update precision method by @swgu98 in #446
fix vpp overlap precision by @AlAuAu in #445
hack mtp test for test by @tianlef in #457
[UE8M0] fix bug for fuse_stack by @Eddie-Wang1120 in #447
HACK test_gpt_pp_with_moe.py by @tianlef in #460
[CE]fix install error && change tag paddle to release by @tianlef in #459
【】fix Qwen3vl moe BMMFunction bug by @xiaoguoguo626807 in #455
[CE]Fix ce install paddle of release by @tianlef in #461
Disable rr_attn_estimate_triton_op test by @LLSGYN in #462
[Build] Clean python compile file in final wheel pkg by @ooooo-create in #458
Disable test_tokens_unzip_gather by @DanielSun11 in #463
add xpu backend by @yongqiangma in #452
Recover test_tokens_unzip_gather by @DanielSun11 in #465
Support mtp_input and hidden_states fusing with no vpp-overlap by @AlAuAu in #467
fix fp8_quant_weight by @Waynezee in #468
[CE]fix paddle install for multi python paddle dev by @tianlef in #466
The release branch is packaged and uploaded to the self-hosted nightly build repository by @swgu98 in #471
release wheel by @swgu98 in #476
[MoE] Add moe_use_pfcc_deepep flag by @ooooo-create in #474
[CI]change paddle to release latest by @tianlef in #479
【】tmp Qwen3vl moe lora fix by @xiaoguoguo626807 in #478
Revert "[CI]change paddle to release latest" by @tianlef in #483
fix mtp layer bug by @AlAuAu in #486
Remove split_group_gemm by @Waynezee in #485
update version by @swgu98 in #492
【qwen3vl 】fix model param name to decrease aoa by @xiaoguoguo626807 in #482
[HACK] multi card test of moe for paddle by @tianlef in #495
[MoE Layer] Change moe_grouped_gemm Not Support Warning to Error by @hushenwei2000 in #487
[CI] get formers dev by @swgu98 in #499
[CE]change release to formers 2.0 && dev to formers dev by @tianlef in #493
【lazy_init】change lm_head param init func to support lazy_init by @xiaoguoguo626807 in #502
[ThirdParty] Bump DeepEP to avoid performance regression by @SigureMo in #504
【Lazy init】 change layers.py param init method to support lazy_init by @xiaoguoguo626807 in #503
[CE] add dev multi python by @tianlef in #512
update _dtype by @BossPi in #508
[CI]release ci change to formers release by @tianlef in #509
[JIT] disable dy2st in fleet only by @SigureMo in #515
[ThirdParty] Set moe_use_pfcc_deepep default value to True by @SigureMo in #517
[CI] Update cherry-pick template to pass CI check by @SigureMo in #519
manual build by @swgu98 in #506
[CE]add qwen3vl by @tianlef in #524
【cherry-pick】lazy_init 503 by @xiaoguoguo626807 in #527
【cherry-pick 】fix layer dtype 508 by @xiaoguoguo626807 in #526
Add a generic encoder by @qhpeklh5959 in #531
change paddle to 3.3 release by @tianlef in #529
[CodeStyle] fix some large tensor issues by @zrr1999 in #514
move empty after mtp by @FeixLiu in #537
[CE]change paddle url by @tianlef in #542
auto_tuner by @xuxinyi389 in #534
【BugFix】refine moe_layer.py by @risemeup1 in #540
[ThirdParty] Cleanup moe_use_pfcc_deepep configuration by @SigureMo in #541
[CodeStyle] Integrate ast-grep to pre-commit hooks by @zrr1999 in #513
[CE]bug fix by @tianlef in #548
Check release fix by @swgu98 in #550
add nightly whl to bos by @swgu98 in #551
Add FP8 linear layer by @pkuzyc in #544
[BugFix] refine moe_layer.py by @risemeup1 in #557
refine pyproject.toml by @risemeup1 in #556
[Fleet CI]add a100 by @tianlef in #553
fix_llava by @xuxinyi389 in #558
[CE]add a100 case by @tianlef in #562
[Fleet CI DEV] add qwen3vl by @tianlef in #565
[CI] Remove cherry-pick labels after workflow completion by @ShigureNyako in #567
[CE]fix ce workflow by @tianlef in #568
[CI]change diff for unit test by @tianlef in #570
Temporarily bypass loss check until PR77876 is merged by @A-nnonymous in #573
remove fuse_rms_norm config by @huangjiyi in #578
Fix gpt_model by @xuxinyi389 in #579
[CI/CE]Add GLM PT EP4 && FIX CE by @tianlef in #583
[xpu] xpu backend run pass by @yongqiangma in #523
change paddle version for protobuf by @tianlef in #586
unify_gpt_model by @xuxinyi389 in #589
[large tensor] fix CUDA extensions int64 overflow for large tensor dimensions by @zrr1999 in #561
[MoE Layer]: Support Gated Shared Expert by @hushenwei2000 in #593
Fix some test about enable_partial_send_recv(True)+sequence_parallel(True) by @xuxinyi389 in #590
[Fleet CE]new CE by @tianlef in #585
fix moe_layer_freq config by @Waynezee in #599
[CE]fix ce precision scripts by @tianlef in #601
[GLM5]Add MLA by @changeyoung98 in #574
speed_up by @xuxinyi389 in #609
【kimi】fix Mla+flashmask bug by @xiaoguoguo626807 in #608
【kimi】fix moe_layer_pattern by @xiaoguoguo626807 in #611
[CI]fix coverage by @tianlef in #614
【kimi】fix kimik25 bug by @xiaoguoguo626807 in #613
Modify the sys path for unit tests by @swgu98 in #616
【kimi】fix kimik2 yarn rope by @xiaoguoguo626807 in #626
[CI]add total coverage and fix ce precision by @tianlef in #623
hack recompute with vpp overlap by @AlAuAu in #632
[CE]fix weekly CE by @tianlef in #630
[CE]fix ce bug by @tianlef in #634
[CE]fix ce by @tianlef in #635
fix auto-parallel by @sevenan2 in #587
[Paddle Version]change to 38eee703e79 by @tianlef in #641
[Arch]Add more GPU arches support by @swgu98 in #645
【kimik2】Kimik2 fix attention padding by @xiaoguoguo626807 in #649
pass through rotary_pos_cos/sin by @xuxinyi389 in #648
support skip mtp by @xingmingyyj in #597
[CE]modify CE running time and max parallel by @tianlef in #651
【kimiK25 vision 】kimik25 vision by @xiaoguoguo626807 in #610
fix MLA by @changeyoung98 in #637
Update paddlepaddle-gpu dependency to version 260319 by @hushenwei2000 in #654
Fast path for rope when cp_size==1 by @xuxinyi389 in #647
[CI] Upgrade GitHub Actions for Node 24 compatibility by @ShigureNyako in #661
[Paddle] change to 3.3.1 by @tianlef in #663
[Qwen3.5] GatedDeltaNet by @pkuzyc in #624
remove WeightOnlyMTPLayer weights stop_gradient by @xingmingyyj in #674
support subbatch in language loss by @wangyuwen1999 in #672
[CI]change ci type by @tianlef in #659
Fleet mtp upgrade dev by @wtmlon in #673
Remove no delay_scale_loss branch by @Waynezee in #642
Gated Attention by @pkuzyc in #625
【kimik2】fix yarn_rope bug by @xiaoguoguo626807 in #678
FA4 integration by @Waynezee in #638
add build_nvshemem.sh by @risemeup1 in #684
unit test for subbatch by @wangyuwen1999 in #677
[wip] qwen vl sp by @FeixLiu in #665
[TEST]fix subbatch test by @tianlef in #687
Switch to Paddle-provided NVSHMEM wheels and add SM103 support by @swgu98 in #686
[Test]fix subbatch test by @tianlef in #690
[XPU] Support fused_swiglu_scale forward and backward via decomposed … by @ZhangX-21 in #671
[XPU] support qwen_vl_30b in xpu device by @ZhangX-21 in #670
Add block attention residuals by @pkuzyc in #693
[ThirdParty] update sonic_moe and fa4 by @Waynezee in #689
[Unit TEST]skip 'test_gpt_pp.py' in paddle unit test until pr 78013 merged by @tianlef in #698
【MLA】support mla down_proj use notplinear by @xiaoguoguo626807 in #692
Add Fuse_vision_rope For VIT by @Eddie-Wang1120 in #697
[Qwen3.5] Add vision encoder for qwen3.5 by @pkuzyc in #605
[TEST]fix test_layers sys path by @tianlef in #706
bug fix for qwen by @FeixLiu in #704
[Coverage]add full coverage by @tianlef in #682
Optimize VisionModel rope position encoding and packed seq attention by @huangjiyi in #695
[CE] delete nvidia-cutlass-dsl==4.2.1 by @tianlef in #717
Replace rms_norm and swiglu by paddle api by @zhangbo9674 in #714
Support qwen35 by @xuxinyi389 in #724
[CI] add some new function and ci update to python 3.13 by @tianlef in #718
[Paddle]change paddle to 3.3.1.post20260403+ef0820a64e9 by @tianlef in #729
[Refactor] Restructure rope_utils.py: consolidate config and inline fp32 by @huangjiyi in #732
Add DSA Module by @xingmingyyj in #683
fix moe lora gemm by @Lcysabcu in #709
Fix the bug of block attention residuals in AMP by @pkuzyc in #719
[Refactor] Remove config dependency from internal RoPE functions by @huangjiyi in #740
update nvshmem by @swgu98 in #742
[Benchmark]intergreation_test by @xuxinyi389 in #738
[CI]del wget miniconda by @tianlef in #745
[CI]change docker of python 3.12 by @tianlef in #747
[bugfix] Fix fp8 run by @Waynezee in #728
Add qk_norm_type config for per_layer qk_norm by @changeyoung98 in #734
update submodule DeepEP by @youge325 in #712
Add moe_deep_gemm config && replace matmul_add to linear by @zhangbo9674 in #751
Update Paddle to 3.3.1.post20260409+52c898ee9ac by @DanielSun11 in #754
arm by @swgu98 in #758
[Feature] Support fused vision RoPE kernel for high_precision_rope by @huangjiyi in #736
Use _new_shared_tensor for input in save_for_backward by @hushenwei2000 in #759
[Cleanup] Replace torch proxy alias with public compat API by @ShigureNyako in #763
Add arm pipeline by @tianlef in #741
20260404 add ai edited test by @liuhao2638 in #694
[CI]fix ci unit coverage && load ecosystem lib by @tianlef in #753
Replace fused_stack_quant and fused_weighted_swiglu_act_quant by custom_op by @DanielSun11 in #768
[Feat] Align flashmask _C_ops API with the latest overlap ver API by @Enigmatisms in #774
[CE]fix CE bug by @tianlef in #779
support dense mtp by @xingmingyyj in #764
[FP8] Use fuse_weighted_swiglu_fp8_quant in MoELayer by @SigureMo in #781
fix NoPipelineParallel.eval_batch by @WYB27 in #778
[CI]add more concurrency of single card by @tianlef in #775
align PF & EC by @xuxinyi389 in #777
20260412 add ai edited test by @liuhao2638 in #772
[Pipeline Parallel] Migrate PP Components to Paddle by @hushenwei2000 in #563
Fix: skip qk fuse rope when high_precision_rope is enabled by @huangjiyi in #782
Fix Unit Tests import PaddleFleet PP to Paddle PP by @hushenwei2000 in #793
Remove 30 failing ai edited single card tests by @liuhao2638 in #798
fix rng bug by @FeixLiu in #773
Add Moe balanced logging and module profiler by @huangjiyi in #791
update deepep by @zoooo0820 in #795
Rm fused rmsnorm for align by @DanielSun11 in #787
[CI]fix precision update by @tianlef in #801
[CE]fix ce by @tianlef in #799
add RRAttention readme by @GuoxiaWang in #806
Sonic moe support blackwell by @risemeup1 in #811
[CI]fix cherry pick by @tianlef in #814
[FIX] Add CUDA check for fused RoPE ops to support non-CUDA devices by @G2uge in #800
[V2 align] align EC&PF by @xuxinyi389 in #813
Fix align rng for moe_router_force_load_balancing by @Xing-lil in #817
[DotProductAttention] Support _attn_implementation config for eager attn by @zhanghonggeng in #815
[BugFix] fix CudaError 700 in fuse_stack_transpose when N>64 by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/804
log MoE aux/zloss into balance logs by @huangjiyi in https://github.com/PaddlePaddle/PaddleFleet/pull/824
[CI]add skip logic for only document pr by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/818
add commitid file by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/827
[CE] change paddle release url by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/826
[Paddle] change paddle version to 3.4 by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/830
separate mtp head & loss for pp balance by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/822
Fix FP8Linear scale error in blackwell by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/836
fix: increase test dimensions to avoid ue8m0 scale shape truncation in CUDA kernel by @zhanghonggeng in https://github.com/PaddlePaddle/PaddleFleet/pull/837
gate attn for mla by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/832
Temporarily disable gpt_pp test by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/834
[ThirdParty] [FA] update flash-attention to support use_varlen in flashmask and add several bug fix. by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/838
[New feature] Support MTP mask by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/829
update baseline for b by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/845
[Fix] refine recompute test by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/835
Fix gates accumulation seq mismatch with ec by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/820
update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/847
fix sonic moe test by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/846
【Fix Tests】Fix test_layers unit tests in the CI parallel environment by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/848
Fix dsa cuda 700 by @changeyoung98 in https://github.com/PaddlePaddle/PaddleFleet/pull/850
[New feature] Add Latent MoE support with hidden state compression/decompression projections by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/851
Embedding and MOE support masking by input_ids by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/844
set rng flag for weight initialization by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/853
Support learnable routed_scaling_factor by @From00 in https://github.com/PaddlePaddle/PaddleFleet/pull/854
Support per-depth MTP input_ids for MoE routing mask by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/856
[align] dense mlp by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/857
use paddle.set_flag to freeze weight initialization for cp multicards case by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/859
flash_mask_path_fix and align z_loss by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/864
add subbatch && auto_subbatch by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/821
Add the RecomputeWithoutOutput util by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/862
Support configure deep_ep num_sms and buffer size by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/861
[MLA] Support zero-copy slice for q_rope calculation by @hushenwei2000 in https://github.com/PaddlePaddle/PaddleFleet/pull/849
Add qk_norm_fusion by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/839
Add RRAttention Paddle release by @LLSGYN in https://github.com/PaddlePaddle/PaddleFleet/pull/812
[CI] disable torch check by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/869
Fuse lm_head with cross entropy loss to reduce memory usage and speed up by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/866
fix: ensure shared_experts use original hidden_size in Latent MoE by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/868
[FA] enable flashmask_use_varlen by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/860
Support Kgroupgemm in PaddleFleet by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/871
Add align mode for aux_loss when enable mtp by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/873
[FA] add clone for startend_row_indices by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/875
【fd】fix fd decoder input by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/877
Add Test For Kgroupgemm and Fix bug by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/876
Update submodule DeepEP by @youge325 in https://github.com/PaddlePaddle/PaddleFleet/pull/762
Align z_loss with EC by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/888
[EP] Add HybridEP support by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/852
Add triton fused sigmoid gate for MLA by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/872
Add align mode for fused_ce loss by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/886
fix qk_norm_fusion use by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/892
Revert "[EP] Add HybridEP support" by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/893
update paddle.deps version by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/897
Add fused_moe_topk and fused_routing_map by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/842
Remove log in fp8 quant by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/900
[Unit Test] change test_pp.py by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/899
fix sharded_state_dict by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/901
[bugfix] fix rope in mla by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/906
[CE] fix cp case && paddle release change to dev by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/909
Adapt routing_map_fusion use by @Xing-lil in https://github.com/PaddlePaddle/PaddleFleet/pull/908
rm mtp preprocess in modeling by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/883
[CUDAGraph] add autocudagraph by @DrRyanHuang in https://github.com/PaddlePaddle/PaddleFleet/pull/896
[XPU] add xpu nightly build whl by @plusNew001 in https://github.com/PaddlePaddle/PaddleFleet/pull/912
support dw overlap by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/885
[Build] Improve XPU build dependency version handling by @G2uge in https://github.com/PaddlePaddle/PaddleFleet/pull/913
fix dsa args name by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/914
[XPU] fix nightly build bug by @plusNew001 in https://github.com/PaddlePaddle/PaddleFleet/pull/915
Split paddlefleet ops into standalone package via uv workspace by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/760
fix ci ce bug by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/918
refine yaml by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/920
[BugFix] Fix paddlefleet-ops version pinning in build by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/922
version get from PADDLEFLEET_VERSION by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/925
update ops whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/923
Reapply "[EP] Add HybridEP support" (#893) by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/895
remove empty_cache in languageloss by @AlAuAu in https://github.com/PaddlePaddle/PaddleFleet/pull/924
[CI]fix ci upload for python 3.12 by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/930
ops whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/931
[Build] fix build system for sonic-moe/quack uv sync by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/921
[EP] Skip build HybridEP on CUDA 12.6 by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/933
fix ops build whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/934
fix manual whl by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/940
Add dw_p2p_overlap config by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/928
[OPs] Build DeepEP on detected CUDA arch only and raise error on invalid CUDA arch settings and sdist by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/938
fix qwen35 bug by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/926
[OPS]mv third_party from paddlefleet to paddlefleet_Ops by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/945
[OPs] Error on load ecosystem library and add apache-tvm-ffi to dependencies for SonicMoE by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/944
Fix quack import error by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/948
[CI] update pr cache ops by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/949
support ue8m0 by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/894
update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/955
Add inplace SwigluProbsGradKernel by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/833
[Chore] Bump sonic-moe to PR#25 (epilogue refactor + wgrad fix) by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/950
20260512 add ai edited test by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/942
Revert "Add inplace SwigluProbsGradKernel" by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/961
[ops] update build ops by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/965
[Deps] Bump paddlepaddle-gpu to 3.4.0.post20260514+38bb5555a0a by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/963
[CE]fix ce by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/969
[Execute Infrastructure] Support commit-level wheel packaging on the develop branch and paddle-dev ops builds by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/973
disable upload ops on paddle dev by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/975
[OPS] Remove redundant ops namespace layer and flatten import paths by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/974
change paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/976
[OPs] Skip DeepEP build and enable ccache for wheel builds by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/977
[New features] Support XPU backend for fused_bias_swiglu backward by @G2uge in https://github.com/PaddlePaddle/PaddleFleet/pull/962
fix vit rope by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/978
add moe_routed_expert_use_bias to TransformerConfig by @huangjiyi in https://github.com/PaddlePaddle/PaddleFleet/pull/972
[CE]fix ce by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/983
[align mode] align mtp by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/982
[Performance Optimization] bump sonic-moe submodule to paddle branch HEAD (04a6848) by @A-nnonymous in https://github.com/PaddlePaddle/PaddleFleet/pull/984
20260515 add ai edited tests by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/964
[Enhancement] Support large tensor int64 indexing in fused_swiglu_bwd by @zrr1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/967
[OPs][HybridEP] Add HybridEP buffer SM configuration by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/981
add fused_mla_yarn_rope_apply for mla by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/870
Unify moe_grouped_gemm && moe_expert_fusion by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/917
support mtp and cp in paddlefleet's EB dataflow by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/985
Using put_along_axis_ to reduce mem in Topk Router by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/946
[OPs][HybridEP] Update HybridEP submodule by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/986
fix gdn padding mask prob by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/994
[ops] update build time by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/992
Refactor DSA Module by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/987
[TransformerConfig] refactor(config): delete use_latent_moe config by @hushenwei2000 in https://github.com/PaddlePaddle/PaddleFleet/pull/988
[CI] add fleet log analysis bot by @zjjlivein in https://github.com/PaddlePaddle/PaddleFleet/pull/997
[DeepSeekV4] CSA/HCA implement by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/957
Align init method with megatron by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/991
Fix auto subbatch expert_id offset calculation by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/911
test: add AI coverage tests by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/998
[Distributed Strategy] Add Refined Recompute support for DeepEP combine overlap by @wangbingguang2026 in https://github.com/PaddlePaddle/PaddleFleet/pull/995
fleet build time by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/996
[CE]disable test_ai_moe_layer_5.py by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1012
[DeepSeekV4] support mtp layer hybrid attn by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1008
[CI]Modify pr template by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1015
test: update AI MoE coverage and filenames by @liuhao2638 in https://github.com/PaddlePaddle/PaddleFleet/pull/1014
[feat(mHC)]: Add basic implementation of manifold hyper connection(mHC) by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1000
[CI] Use GitHub Actions step timeouts in workflows by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1018
[CE]fix action bill problem by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1028
[Devs] fix coverage path mapping in CI by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1021
[Config] add mtp loss by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1017
Fix auto subbatch expert node indexing by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1009
update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1033
[New features] support FastDeploy fallback to Fleet MLA model by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/1005
[CUDAGraph] Teardown autocudagraph cache to prevent state pollution by @DrRyanHuang in https://github.com/PaddlePaddle/PaddleFleet/pull/990
fix paddle ci bot post comment by @zjjlivein in https://github.com/PaddlePaddle/PaddleFleet/pull/1032
[BugFix] use int32 for kgroupgemm by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1016
[New features] Add hash routing and softplus score functions by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/1024
mhc fix sp/tp bug by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1036
Support k_grouped_gemm in auto_subbatch fallback by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1037
【fastdeploy】Fix need_do_attention when not use MLA by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/1041
[XPU][CI] inject paddlepaddle-xpu version in publish_wheel_nightly_xpu… by @G2uge in https://github.com/PaddlePaddle/PaddleFleet/pull/1042
[Operator]unify_swiglu by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1043
[New features] Add the ClampSwiGLU family of operators by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/1025
[New Feature] Fix KV cache and RoPE for greedy inference on MoE models by @liym27 in https://github.com/PaddlePaddle/PaddleFleet/pull/904
Refine contribute md by @risemeup1 in https://github.com/PaddlePaddle/PaddleFleet/pull/1050
[CI]fix multi cards failed by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1046
Fix deepseek v4 by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/1047
Add fp8 sonic moe in MoELayer by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/879
[OPs][SonicMoE] Update sonic-moe submodule for compat import by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1056
Mtp magic send by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1031
[Bug fixes] fix duplicate m_indices generation in fp8_utils by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/1002
[BugFix] Fix kv_b_proj global dim for Muon optimizer with GQA MLA by @liym27 in https://github.com/PaddlePaddle/PaddleFleet/pull/1057
[Improvements] Support large tensor indexing in SwiGLU CUDA extensions by @zrr1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/1011
[CI] Avoid approval false positives from base drift by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1060
update flash-attention submodule by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1058
[New features] support GQA SWA for MiMo by @GuoxiaWang in https://github.com/PaddlePaddle/PaddleFleet/pull/1039
Support static subbatch (moe_subbatch_token_num_after_dispatch) with moe_deep_gemm=True by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1053
[DeepSeekV4]support fused sparse attn tilelang kernel by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1054
[XPU] add xpu ops build by @plusNew001 in https://github.com/PaddlePaddle/PaddleFleet/pull/1061
[Custom OP] Add fused_swiglu_probs_bwd to paddlefleet_ops by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/989
update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1067
Revert "Mtp magic send (#1031)" by @XieYunshen in https://github.com/PaddlePaddle/PaddleFleet/pull/1064
nvshmem on cuda132 by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1072
[BugFix] Revert MLA GQA changes: use num_attention_heads instead of num_query_groups by @liym27 in https://github.com/PaddlePaddle/PaddleFleet/pull/1074
Revert "[New features] support GQA SWA for MiMo (#1039)" by @GuoxiaWang in https://github.com/PaddlePaddle/PaddleFleet/pull/1076
fix align mode FA calling by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/1077
[New features][Bug fixes] Add TileLang CSA compressed indexer for DSv4 and fix ColumnLinear stop_gradient backward by @YJMSTR in https://github.com/PaddlePaddle/PaddleFleet/pull/1052
Support magic_init by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/1075
[CI] Fix ARM ops package build by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1084
Reapply "Mtp magic send (#1031)" (#1064) by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1081
mhc add fused kernels by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1055
[Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy func and testcase by @adam-xiaoyao in https://github.com/PaddlePaddle/PaddleFleet/pull/1078
[Bug fixes] pass fa_version from forward to backward to keep fwd/bwd consistent by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1085
[CI] Remove Python 3.10 PR package artifact flow by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1089
fix mhc by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1091
[New features] Sliding Window Attention for GQA and VHA by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1087
[DeepSeekV4]align precision for megatron in dsv4 attn by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1088
assert magic for pp=1 by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1092
fix fuse_weighted_swiglu_fp8_quant 0_size by @zhengshengning in https://github.com/PaddlePaddle/PaddleFleet/pull/1083
Make SonicMoE's weight layout consistent with GroupedGEMM by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/1086
add fleet ci bot feedback collector by @zjjlivein in https://github.com/PaddlePaddle/PaddleFleet/pull/1082
Fix mhc mtp bug in RL by @liuruyan in https://github.com/PaddlePaddle/PaddleFleet/pull/1096
[New features] Support MLA sliding window attention by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1099
[Bugfix] fix mtp_input_mask when enable sp by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1100
Dsv4 align moe kernel by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1102
[Bugfix] fix attention sink process in tilelang by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1105
[BUG FIX] fix rope-theta for MLA by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1107
[Improvements][DSv4 mHC] Align HyperConnection linear/aggregate numerics with Megatron under FLAGS_use_accuracy_compatible_kernel by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1103
[New features] Update flash-attention submodule to support cutlass-dsl (cu13) by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1108
Support cutlass-dsl (cu13) by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1111
[0-size] paddlefleet_fused_swiglu_probs_bwd by @zhengshengning in https://github.com/PaddlePaddle/PaddleFleet/pull/1104
[OPs] Integrate cudnn-frontend to paddlefleet-ops by @SigureMo in https://github.com/PaddlePaddle/PaddleFleet/pull/1114
[CI] Increase single-card test timeout by @ShigureNyako in https://github.com/PaddlePaddle/PaddleFleet/pull/1116
fix_rope by @Eddie-Wang1120 in https://github.com/PaddlePaddle/PaddleFleet/pull/1106
【fd】support decode yarn_rope_emb by @xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleFleet/pull/1117
[Improvements] align MTP loss and mHC kernel with Megatron accuracy-compatible path by @xuxinyi389 in https://github.com/PaddlePaddle/PaddleFleet/pull/1112
[Bug fixes] Fix rotary_pos_emb/swa_rotary_pos_emb None guard in Context Parallel for GQA by @xxyux in https://github.com/PaddlePaddle/PaddleFleet/pull/1119
[align mode] add v11 model by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1118
[Environment Adaptation] add CUDA 13.2 (cu132) build support by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1120
add high_precision_mhc for float32 or bfloat16 calc in mHC by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1098
fix radio by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1090
Set cast_to_low_precision to false in csa by @changeyoung98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1125
[Bug fixes] Fix moe_deep_gemm disable when moe fusion enable by @AlAuAu in https://github.com/PaddlePaddle/PaddleFleet/pull/1128
add more assert by @FeixLiu in https://github.com/PaddlePaddle/PaddleFleet/pull/1115
Add logs for mtp_loss by @From00 in https://github.com/PaddlePaddle/PaddleFleet/pull/903
Check the 0-size issue of the custom operator in paddleFleetOp by @zhengshengning in https://github.com/PaddlePaddle/PaddleFleet/pull/1121
update approve by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1130
[fix] restoring position_id's value when not in align mode by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1129
[Performance Optimization] Migrate CSA indexer backward to cuDNN frontend by @baoqiwen in https://github.com/PaddlePaddle/PaddleFleet/pull/1094
[OPs] Integrate flash_mla to paddlefleet-ops by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1132
[Execute Infrastructure] Gate PaddleFleet GPT tests by repo flag by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1143
[Feat][Bug fixes] Contextual parallelism support for DSV4 sparse-attention and some bug fixes. by @Enigmatisms in https://github.com/PaddlePaddle/PaddleFleet/pull/1139
fix time out by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1149
Fix csa for alignment by @changeyoung98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1148
[Distributed Strategy][Bug fixes]fix PP hang caused by indexer loss log by @YJMSTR in https://github.com/PaddlePaddle/PaddleFleet/pull/1136
change nvshmem on cuda132 by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1123
support gqa selective recompute for gated_attn by @AlAuAu in https://github.com/PaddlePaddle/PaddleFleet/pull/1131
[DSv4 mHC] Eliminate unnecessary synchronous memcpy by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1122
update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1156
Support Document Mask in CSA by @umiswing in https://github.com/PaddlePaddle/PaddleFleet/pull/1150
[CI]fix release ci by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1165
[Cherry-pick] fix mhc enable_grad in first fwd (#1164) by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1168
[Bugfix] fix weight grad when enable sequence_parallel (#1154) by @Waynezee in https://github.com/PaddlePaddle/PaddleFleet/pull/1162
[Bug Fix] Fix sonic_moe flag bug by @pkuzyc in https://github.com/PaddlePaddle/PaddleFleet/pull/1163
[release/0.3] update ops by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1171
[release/0.3] Move csa dtype align to FLAGS_use_accuracy_compatible_kernel mode by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1172
[CI] fix release ci update ops by @tianlef in https://github.com/PaddlePaddle/PaddleFleet/pull/1180
【Cherry-pick】Fix decoder_input not scattered in CP without MTP by @xingmingyyj in https://github.com/PaddlePaddle/PaddleFleet/pull/1179
[release/0.3][CI] Pin tvm-ffi below 0.1.12 for paddlefleet_ops by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1185
[release/0.3][CI] disable single tests by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1188
[release/0.3][Bug fixes] Revert disabling single-card tests by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1201
[release/0.3][codex] Add -s to pytest test scripts by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1199
[release/0.3] Fix CP scatter placement in gpt_embedding to occur before RoPE generation by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1204
fix layer_idx if training with empty layer in head (#1181) by @Hz188 in https://github.com/PaddlePaddle/PaddleFleet/pull/1196
[release/0.3][CI] Fix ARM ops package build dependencies by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1212
[release/0.3] Fix build wheel by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1215
[release/0.3][New features] refactor VHA and support Context Parallel by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1197
[release/0.3] Add assertion to prevent sequence_parallel with CP scatter in plain path by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1213
add logging print for index loss (#1178) by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1189
[release/0.3] fix rope for q token by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1219
[CP]Add cuDNN backend for CSA sparse attention by @ForFishes in https://github.com/PaddlePaddle/PaddleFleet/pull/1222
[Bug fix] Use pad_token_id for mask by @DanielSun11 in https://github.com/PaddlePaddle/PaddleFleet/pull/1224
[CP] update paddle by @swgu98 in https://github.com/PaddlePaddle/PaddleFleet/pull/1228
[release/0.3][New Feature] pass swa rope emb to MTP layer by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1235
[release/0.3][align] add tp/sp in align mode by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1241
[Cherry-Pick][New Feature] Add SharedKV for VHA by @GuoxiaWang in https://github.com/PaddlePaddle/PaddleFleet/pull/1242
mHC fix H_res transpose (#1238) by @Wennie396 in https://github.com/PaddlePaddle/PaddleFleet/pull/1240
[release/0.3][DSV4 Attn][Fix] Fix document mask padding logic by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1249
[Cherry-Pick][New features] Add Multimax LM-head fused CE support by @ZhouYuxuanYX in https://github.com/PaddlePaddle/PaddleFleet/pull/1244
support autosubbatch mem info with legacy allocator by @Difers in https://github.com/PaddlePaddle/PaddleFleet/pull/1251
[release/0.3][BugFix] fix align mode when cp = 1 by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1259
[DSv4 CSA] Replace tensor iteration with tolist by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1257
[DSv4] Add triton fused_apply_mla_rope_inplace for q and o (#1231) by @lshpku in https://github.com/PaddlePaddle/PaddleFleet/pull/1258
support QAT by @wangyuwen1999 in https://github.com/PaddlePaddle/PaddleFleet/pull/1260
[release/0.3][Performance Optimization] Migrate CSA indexer forward to cuDNN frontend by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1263
[release/0.3][Devs][DSv4] Remove dead TileLang CSA indexer loss by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1270
[release/0.3] Using Sparse attention API by @risemeup1111 in https://github.com/PaddlePaddle/PaddleFleet/pull/1278

New Contributors

@From00 made their first contribution in #1
@risemeup1 made their first contribution in #14
@blacksheep-Aristotle made their first contribution in #26
@XieYunshen made their first contribution in #38
@zhangbo9674 made their first contribution in #48
@hushenwei2000 made their first contribution in #61
@xuxinyi389 made their first contribution in #73
@LiYuRio made their first contribution in #81
@deepllz made their first contribution in #114
@A-nnonymous made their first contribution in #250
@LLSGYN made their first contribution in #227
@zhangting2020 made their first contribution in #266
@sneaxiy made their first contribution in #290
@github-actions[bot] made their first contribution in #291
@liufengwei0103 made their first contribution in #325
@qhpeklh5959 made their first contribution in #379
@yongqiangma made their first contribution in #452
@BossPi made their first contribution in #508
@zrr1999 made their first contribution in #514
@sevenan2 made their first contribution in #587
@ZhangX-21 made their first contribution in #671
@Lcysabcu made their first contribution in #709
@youge325 made their first contribution in #712
@liuhao2638 made their first contribution in #694
@zoooo0820 made their first contribution in #795
@G2uge made their first contribution in #800
@Xing-lil made their first contribution in #817
@zhanghonggeng made their first contribution in #815
@adam-xiaoyao made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/851
@xxyux made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/897
@DrRyanHuang made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/896
@plusNew001 made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/912
@wangbingguang2026 made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/995
@liym27 made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/904
@baoqiwen made their first contribution in https://github.com/PaddlePaddle/PaddleFleet/pull/1058

Full Changelog: https://github.com/PaddlePaddle/PaddleFleet/commits/v0.3.0
