
Conversation

@phlrain phlrain commented Mar 26, 2025

Before submitting

  • Lint code. If there are lint issues, please format the code first.
```bash
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Run pre-commit on the changed files individually
pre-commit run --file XXXX.py
```
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.
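For the test-case item above, a minimal sketch of what such a case can look like, assuming a pytest-style layout under `tests/`; the file name and the behavior under test are placeholders, not taken from this PR:

```python
# tests/test_example.py — illustrative placeholder, not a test from this PR.
import paddle

def test_matmul_shape():
    # Smoke-check a basic op so the touched code path shows up in codecov.
    x = paddle.ones([2, 3])
    y = paddle.ones([3, 4])
    assert paddle.matmul(x, y).shape == [2, 4]
```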

PR types

PR changes

Description

zhangbo9674 and others added 30 commits March 13, 2025 16:24
* add distributed run

* fix topo

* add distributed print
* Add fused_swiglu_act(transpose)_quant op to extern op in gpt-3

* Polishing code.

* remove unnecessary lines.

* remove unnecessary lines in cu

* Add padding function to fused_swiglu_act_quant_op
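Since the commits above add and extend fused_swiglu_act(transpose)_quant, here is a minimal unfused SwiGLU reference for orientation, assuming the fused op computes this activation and additionally quantizes (and optionally transposes and pads) the result in one kernel; the helper name and the half-split layout are assumptions, not the op's real contract:

```python
import paddle
import paddle.nn.functional as F

def swiglu_ref(x):
    # Assumed layout: the projection holds the gate and up halves side by side.
    gate, up = paddle.chunk(x, 2, axis=-1)
    # SiLU(gate) * up is the SwiGLU activation the fused kernel presumably
    # computes before quantizing in the same pass.
    return F.silu(gate) * up
```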
* [Distribution] Support DualPipeV for deepseek

* add

* fix

* add

* add
* commit for save

* revert gpu sum in add loss for mtp
* add flag DSV3_USE_FP8_GEMM

* fix
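A hedged sketch of one common way a toggle like DSV3_USE_FP8_GEMM is consumed; the flag name comes from the commit above, but how the codebase actually reads it, and the `fp8_gemm` hook, are assumptions:

```python
import os
import paddle

# Assumed convention: an environment variable gates the FP8 GEMM path.
DSV3_USE_FP8_GEMM = os.getenv("DSV3_USE_FP8_GEMM", "0").lower() in ("1", "true")

def maybe_fp8_linear(x, weight, fp8_gemm=None):
    if DSV3_USE_FP8_GEMM and fp8_gemm is not None:
        return fp8_gemm(x, weight)   # hypothetical FP8 kernel entry point
    return paddle.matmul(x, weight)  # default-precision fallback
```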
* add flag DSV3_USE_FP8_GEMM

* fix

* add fp8 comm

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* fix

* fix bug

* fix bug

* replace index_select with gather
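For the index_select-to-gather swap above, the two Paddle APIs select the same rows for a 1-D index; a minimal equivalence check (the performance motivation for the swap is assumed):

```python
import paddle

x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
idx = paddle.to_tensor([2, 0])

before = paddle.index_select(x, idx, axis=0)  # old call
after = paddle.gather(x, idx, axis=0)         # replacement used here
assert bool((before == after).all())          # identical row selection
```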

* close fuse_moe

* fix dequant bug
* fix dequant bug

* fix bug

* fix bug

* fix
…reparation for GroupedGEMM use in training. (#10190)

* Add regroup_tokens op and optest, fix topk_to_multihot setup.py

* Add test file

* fix miscs

* Add tokens_unzip & weighted_zip in preparation for fp8-groupedgemm
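A minimal NumPy sketch of the assumed semantics of the two ops above: "unzip" permutes routed tokens into contiguous per-expert groups for the grouped GEMM, and "weighted zip" restores token order while combining expert outputs with the gate probabilities. The names mirror the commit, but the real contracts (layouts, fp8 handling, multi-hot routing) are assumptions:

```python
import numpy as np

def tokens_unzip(tokens, expert_idx):
    # Stable sort by expert id -> contiguous per-expert batches.
    order = np.argsort(expert_idx, kind="stable")
    return tokens[order], order

def weighted_zip(expert_out, order, probs):
    # Scatter expert outputs back to token order, weight by gate probs.
    out = np.empty_like(expert_out)
    out[order] = expert_out
    return out * probs[:, None]
```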

* Added expert_idx output to tokens_unzip op.

* Fix prob datatype issue.

* Implemented double input&output regroup op.

* Further fix bf16 issues.

* Fix implicit bug.

* Change the unzip op to save more useful data

* Refactor and combine tokens_unzip_and_zip.

* Fixed concurrent semaphore bug.

* delete synchronize in zip op, and starting to add guided unzip kernel.

* Add fp8 support for unzip op, but cannot fake a tensor for testing.

* Added guided_unzip op.

* Modified guided_unzip to satisfy real usage.

* polish code

* Fix typos and polish & tested code
* Added two fused ops, refactored some old swiglu code.

* delete unnecessary print.
* optimize attention impl

* optimize_attention_output_linear_fp8_memory
* Revert "add timer for deepep (#10211)"

This reverts commit a874a9b.

* revert timer
* Support overlap for fusion moe, fix memory leak of fusion moe

* fix

* fix conflict
lshpku and others added 30 commits August 20, 2025 19:51
* fix

* add fa3 rc

---------

Co-authored-by: zhangbo9674 <zhangbo54@baidu.com>
* CE bug fix

* Update trainer_callback.py

* CE bug fix

* Update modeling.py

---------

Co-authored-by: zhangbo9674 <zhangbo54@baidu.com>
* fix gate prob

* remove useless code
* fix

* update quant cache

---------

Co-authored-by: zhangbo9674 <zhangbo54@baidu.com>
* add_stepped_rc

* polish code

* remove stepevent

* fix bug
Co-authored-by: zhangbo9674 <zhangbo54@baidu.com>
* fix_fa3_mem

* update strategy