[Speculate Decoding] Fix step_idx semantics in reasoning_phase_token_constraint and speculate set_value kernels#7349
Conversation
Thanks for your contribution!
Force-pushed: 22e8a13 → 612db12 → d2b540d → 9088ea1
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@ Coverage Diff @@
##           develop    #7349   +/- ##
==========================================
  Coverage         ?   74.38%
==========================================
  Files            ?      383
  Lines            ?    53614
  Branches         ?     8412
==========================================
  Hits             ?    39882
  Misses           ?    11015
  Partials         ?     2717
```
Force-pushed: 9088ea1 → 8747dae
```diff
- // Shadow state: initialize from target model (prefill just finished)
- step_idx[tid] = target_step - 1;
+ // target_step is the number of history tokens; it is 0 after prefill
+ step_idx[tid] = target_step;
```
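The semantics being changed here can be modeled with a small, hypothetical Python sketch (the helper names `init_draft_state` and `append_token` are illustrative, not from the codebase): `step_idx` counts already-generated history tokens, so it is 0 right after prefill, and after each emission the newest token lives at `pre_ids[step_idx - 1]`.

```python
# Toy model of the new step_idx semantics (illustrative, not the CUDA kernel).
def init_draft_state(target_step: int) -> int:
    # New semantics: copy the target model's history count verbatim.
    return target_step


def append_token(pre_ids: list, step_idx: int, token: int) -> int:
    # Emit one token: the history grows by 1, then slot step_idx - 1
    # holds the newest token (pre_ids[0] is the first output token).
    step_idx += 1
    pre_ids[step_idx - 1] = token
    return step_idx


pre_ids = [-1] * 8
step = init_draft_state(0)   # prefill just finished -> 0 history tokens
for tok in [11, 12, 13]:
    step = append_token(pre_ids, step, tok)
print(step, pre_ids[:3])     # 3 [11, 12, 13]
```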
An earlier conclusion was that the MTP model's step_idx aligns with the target model's step_idx only after MTP's first decode step finishes. What is the reason for directly assigning the target's step_idx here during preprocessing?
```diff
    }
- } else if (step_idx[tid] == 1 /*Encoder Step*/) {
+ } else if (step_idx[tid] == 0 /*Encoder Step*/) {
    pre_ids_all_now[1] = draft_token_now[0];
```
Is the step_idx == 0 check for the prefill phase still correct in all scenarios? Have you considered rescheduling: when an evicted query is pulled back in, its step_idx should not start from 0.
```diff
    base_model_draft_tokens_now[substep + 1] = token_this_time;
    step_idx[tid] += seq_len_this_time;
- pre_ids_now[step_idx[tid]] = token_this_time;
+ pre_ids_now[step_idx[tid] - 1] = token_this_time;
```
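A minimal Python stand-in for the off-by-one this hunk fixes (`write_token` is a hypothetical helper, not kernel code): once `pre_ids_now[0]` holds the first output token, the newest token belongs at `step_idx - 1` after the increment, not at `step_idx`.

```python
# Repro sketch of the off-by-one under the new semantics (assumed, not the kernel).
def write_token(pre_ids: list, step_idx: int, seq_len: int, token: int,
                fixed: bool = True) -> int:
    step_idx += seq_len
    # Old indexing wrote pre_ids[step_idx]; new indexing writes step_idx - 1.
    idx = step_idx - 1 if fixed else step_idx
    pre_ids[idx] = token
    return step_idx


buggy = [-1] * 4
ok = [-1] * 4
write_token(buggy, 0, 1, 7, fixed=False)  # lands at index 1, leaves a hole at 0
write_token(ok, 0, 1, 7, fixed=True)      # lands at index 0 as intended
print(buggy, ok)  # [-1, 7, -1, -1] [7, -1, -1, -1]
```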
Are this change and all the ones below caused by MTP starting to decode from the target model's step? If so, compared with the previous behavior, why must decoding start from the main model's step_idx?
Force-pushed: 8747dae → 64b2013 → 0987851 → 18bd0a5
PaddlePaddle-bot
left a comment
🤖 AI Code Review · 2026-04-14 11:58 CST
📋 Review Summary
PR overview: fixes the step_idx semantics adaptation in the reasoning_phase_token_constraint kernel
Scope of change: custom_ops/gpu_ops/reasoning_phase_token_constraint.cu, tests/operators/test_reasoning_phase_token_constraint.py
Impact tags: [Speculative Decoding] [OP]
📝 PR Convention Check
The PR title carries a valid [Speculative Decoding] tag, and the description fills in Motivation and Modifications in detail; it meets the convention.
Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | PR as a whole | The PR description is inconsistent with the actual changes: it mentions modifying 5 files, but only 2 were modified |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/draft_model/draft_model_set_value_by_flags.cu | This file still uses the old step_idx semantics; it should be updated in sync |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/speculate_set_value_by_flags_and_idx.cu | This file still uses the old step_idx semantics; it should be updated in sync |
| 🟡 Suggestion | tests/operators/test_draft_model_set_value_by_flags.py | This test still uses the old step_idx semantics; it should be updated in sync |
Overall assessment
The changes to reasoning_phase_token_constraint.cu and test_reasoning_phase_token_constraint.py are correct and successfully adapt to the new step_idx semantics. However, the PR description is seriously inconsistent with the actual changes: it says 5 files would be modified to adapt to the new semantics, but only 2 were. The remaining 3 files (draft_model_set_value_by_flags.cu, speculate_set_value_by_flags_and_idx.cu, test_draft_model_set_value_by_flags.py) still use the old step_idx semantics, which may make the speculate decoding kernels behave inconsistently with one another and affect system correctness.
Suggestions:
- Add the step_idx semantics changes for the remaining 3 files, or state that they will be handled in other PRs
- If the work is split across multiple PRs, describe the batching plan in the description to avoid the problems of an incomplete semantics change
```diff
- int64_t t1 = (cur_step >= 1) ? pre_ids_now[cur_step - 1] : -1;
- int64_t t2 = (cur_step >= 2) ? pre_ids_now[cur_step - 2] : -1;
- int64_t t3 = (cur_step >= 3) ? pre_ids_now[cur_step - 3] : -1;
+ int64_t t0 = (cur_step >= 1) ? pre_ids_now[cur_step - 1] : -1;
```
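The guarded reads can be sketched in Python (`last4` is an illustrative stand-in for the CUDA indexing, not kernel code): under the new semantics the token k steps back lives at `pre_ids_now[cur_step - 1 - k]`, and histories shorter than k + 1 tokens read a sentinel -1.

```python
# Sketch of reading the 4 most recent tokens under the new step_idx semantics.
def last4(pre_ids: list, cur_step: int) -> list:
    # Token k steps back sits at index cur_step - 1 - k; guard short histories.
    return [pre_ids[cur_step - 1 - k] if cur_step >= k + 1 else -1
            for k in range(4)]


pre_ids = [10, 20, 30, 40, 50]
print(last4(pre_ids, 5))  # [50, 40, 30, 20]
print(last4(pre_ids, 2))  # [20, 10, -1, -1]
```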
🔴 Bug fix verification
Under the new semantics, pre_ids_now[0] is the first output token, and pre_ids_now[cur_step - 1] correctly points to the most recently written token (when cur_step >= 1). The change is correct.
Note: the PR description says draft_model_set_value_by_flags.cu and speculate_set_value_by_flags_and_idx.cu would also be modified, but neither file is included in this PR; please confirm whether this is an omission.
```diff
- token_ids_all[0, 1] = self.line_break_id
- token_ids_all[0, 2] = self.think_end_id
+ # batch 0: pattern \n <think_end> \n \n at pre_ids_now[0..3]
+ token_ids_all[0, 0] = self.line_break_id
```
🟡 Suggestion
The test-case changes are consistent with the new semantics: with step_idx = 4, pre_ids_now[0..3] holds 4 history tokens, which verifies the kernel's logic for reading the 4 most recent tokens.
Suggestion: the test_draft_model_set_value_by_flags.py test mentioned in the PR description should also be updated in sync; please confirm whether it needs modification.
Pull request overview
This PR adapts to the new step_idx semantics in the speculate decoding path (step_idx now means "number of history tokens already generated", and pre_ids_now[0] is the first output token). It corrects the index arithmetic the reasoning-phase state machine uses to read the most recent tokens, and updates the corresponding unit tests.
Changes:
- In `update_reasoning_status_kernel`, change the reads of the 4 most recent tokens from `pre_ids_now[cur_step - k]` to `pre_ids_now[cur_step - 1 - k]`, and raise the state-transition gate from `cur_step >= 3` to `cur_step >= 4`.
- Update the token pattern and comments constructed in `tests/operators/test_reasoning_phase_token_constraint.py` to match the new `step_idx`/`pre_ids_now` semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| custom_ops/gpu_ops/reasoning_phase_token_constraint.cu | Fix the indices and boundary condition the reasoning state machine uses to read recent tokens, matching the new step_idx semantics |
| tests/operators/test_reasoning_phase_token_constraint.py | Adjust test-data construction and comments so the state-transition logic is covered correctly under the new semantics |
```diff
  // x = 1 : Generating "\n</think>\n\n"
  //   - Model is emitting the explicit boundary pattern
  //   - In non-MTP mode, accept_num is implicitly 1
  //     and does not need to be manually set
  //   - In MTP mode, accept_num must be 1 in verify kernel
  //
  // Transition condition (x = 1 -> x = 2):
- //   - step_idx >= 3
+ //   - step_idx >= 4
  //   - pre_ids[-4:] exactly match:
  //     "\n</think>\n\n"
  //
```
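The x = 1 -> x = 2 gate described in the comment block above can be sketched in Python (the token ids and the `should_transition` helper are placeholders, not the real vocabulary or kernel):

```python
# Hedged sketch of the x = 1 -> x = 2 transition gate: fire only when
# step_idx >= 4 and the last 4 tokens are exactly "\n", "</think>", "\n", "\n".
LINE_BREAK, THINK_END = 100, 200  # assumed ids, not the real vocabulary


def should_transition(pre_ids: list, step_idx: int) -> bool:
    if step_idx < 4:  # fewer than 4 history tokens: pattern cannot be complete
        return False
    tail = pre_ids[step_idx - 4:step_idx]
    return tail == [LINE_BREAK, THINK_END, LINE_BREAK, LINE_BREAK]


ids = [LINE_BREAK, THINK_END, LINE_BREAK, LINE_BREAK]
print(should_transition(ids, 4))  # True
print(should_transition(ids, 3))  # False: gate requires step_idx >= 4
```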
```diff
+ token_ids_all[1, 2] = self.think_end_id

  self.token_ids_all = paddle.to_tensor(token_ids_all, dtype="int64")
  self.prompt_lens = paddle.zeros([self.bs, 1], dtype="int64")
```
…el (PaddlePaddle#7349) Co-authored-by: guanshihui <guanshihui@baidu.com>
…7402, #7445 to release/online/20260415 (#7447)
* [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166)
  - speculate_limit_thinking_content_length: update current_base_step to step_idx + 1 (step_idx now records the history count before the current round); remove the incorrect step_idx decrement on accept_num truncation; mark the step_idx param as const.
  - speculate_set_stop_value_multi_seqs: fix the can_stop gate to use step_idx_now + accept_num >= min_token_limit; fix the skip check and pre_ids_idx formula (remove the stale -accept_num offset); use a <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix the accept_tokens index (remove the -1).
  - Update unit tests for the speculate_set_stop_value_multi_seqs kernel.
* [Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349)
* [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402)
* [Interrupt reasoning] Add interrupt_requests control command support

Co-authored-by: guanshihui <guanshihui@baidu.com>
Motivation
Adapt to the new step_idx semantics. This PR updates the related kernel and its unit tests in sync.
Modifications
- custom_ops/gpu_ops/reasoning_phase_token_constraint.cu: fix the index logic in `update_reasoning_status_kernel` for reading the 4 most recent history tokens. Since step_idx is now the history token count, `pre_ids_now[cur_step - 1]` correctly points to the most recently written token, and the `cur_step >= 4` boundary check matches accordingly.
- tests/operators/test_reasoning_phase_token_constraint.py: adjust the step_idx values used by the tests.
Usage or Command
Run the unit tests to verify: