[Speculate Decoding] Fix step_idx semantics in reasoning_phase_token_constraint and speculate set_value kernels#7349
Conversation
Thanks for your contribution!
Force-pushed: 22e8a13 → 612db12 → d2b540d → 9088ea1
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@ Coverage Diff @@
##           develop    #7349   +/- ##
==========================================
  Coverage         ?   74.38%
==========================================
  Files            ?      383
  Lines            ?    53614
  Branches         ?     8412
==========================================
  Hits             ?    39882
  Misses           ?    11015
  Partials         ?     2717
```
Force-pushed: 9088ea1 → 8747dae
```diff
- // Shadow state: initialize from target model (prefill just finished)
- step_idx[tid] = target_step - 1;
+ // target_step is the number of history tokens; it is 0 after prefill
+ step_idx[tid] = target_step;
```
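The semantics being changed here can be modeled with a small, hypothetical Python sketch (the helper names `init_draft_state` and `append_token` are illustrative, not from the codebase): `step_idx` counts already-generated history tokens, so it is 0 right after prefill, and after each emission the newest token lives at `pre_ids[step_idx - 1]`.

```python
# Toy model of the new step_idx semantics (illustrative, not the CUDA kernel).
def init_draft_state(target_step: int) -> int:
    # New semantics: copy the target model's history count verbatim.
    return target_step


def append_token(pre_ids: list, step_idx: int, token: int) -> int:
    # Emit one token: the history grows by 1, then slot step_idx - 1
    # holds the newest token (pre_ids[0] is the first output token).
    step_idx += 1
    pre_ids[step_idx - 1] = token
    return step_idx


pre_ids = [-1] * 8
step = init_draft_state(0)   # prefill just finished -> 0 history tokens
for tok in [11, 12, 13]:
    step = append_token(pre_ids, step, tok)
print(step, pre_ids[:3])     # 3 [11, 12, 13]
```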
An earlier conclusion was that the MTP model's step_idx aligns with the target model's step_idx only after MTP's first decode step finishes. What is the reason for directly assigning the target's step_idx here during preprocessing?
```diff
    }
- } else if (step_idx[tid] == 1 /*Encoder Step*/) {
+ } else if (step_idx[tid] == 0 /*Encoder Step*/) {
    pre_ids_all_now[1] = draft_token_now[0];
```
Is the step_idx == 0 check for the prefill phase still correct in all scenarios? Have you considered rescheduling: when an evicted query is pulled back in, its step_idx should not start from 0.
```diff
    base_model_draft_tokens_now[substep + 1] = token_this_time;
    step_idx[tid] += seq_len_this_time;
- pre_ids_now[step_idx[tid]] = token_this_time;
+ pre_ids_now[step_idx[tid] - 1] = token_this_time;
```
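A minimal Python stand-in for the off-by-one this hunk fixes (`write_token` is a hypothetical helper, not kernel code): once `pre_ids_now[0]` holds the first output token, the newest token belongs at `step_idx - 1` after the increment, not at `step_idx`.

```python
# Repro sketch of the off-by-one under the new semantics (assumed, not the kernel).
def write_token(pre_ids: list, step_idx: int, seq_len: int, token: int,
                fixed: bool = True) -> int:
    step_idx += seq_len
    # Old indexing wrote pre_ids[step_idx]; new indexing writes step_idx - 1.
    idx = step_idx - 1 if fixed else step_idx
    pre_ids[idx] = token
    return step_idx


buggy = [-1] * 4
ok = [-1] * 4
write_token(buggy, 0, 1, 7, fixed=False)  # lands at index 1, leaves a hole at 0
write_token(ok, 0, 1, 7, fixed=True)      # lands at index 0 as intended
print(buggy, ok)  # [-1, 7, -1, -1] [7, -1, -1, -1]
```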
Are this change and all the ones below caused by MTP starting to decode from the target model's step? If so, compared with the previous behavior, why must decoding start from the main model's step_idx?
Force-pushed: 8747dae → 64b2013 → 0987851 → 18bd0a5
PaddlePaddle-bot
left a comment
🤖 AI Code Review · 2026-04-14 11:58 CST
📋 Review Summary
PR overview: fixes the step_idx semantics adaptation in the reasoning_phase_token_constraint kernel
Scope of change: custom_ops/gpu_ops/reasoning_phase_token_constraint.cu, tests/operators/test_reasoning_phase_token_constraint.py
Impact tags: [Speculative Decoding] [OP]
📝 PR Convention Check
The PR title carries a valid [Speculative Decoding] tag, and the description fills in Motivation and Modifications in detail; it meets the convention.
Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | PR as a whole | The PR description is inconsistent with the actual changes: it mentions modifying 5 files, but only 2 were modified |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/draft_model/draft_model_set_value_by_flags.cu | This file still uses the old step_idx semantics; it should be updated in sync |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/speculate_set_value_by_flags_and_idx.cu | This file still uses the old step_idx semantics; it should be updated in sync |
| 🟡 Suggestion | tests/operators/test_draft_model_set_value_by_flags.py | This test still uses the old step_idx semantics; it should be updated in sync |
Overall assessment
The changes to reasoning_phase_token_constraint.cu and test_reasoning_phase_token_constraint.py are correct and successfully adapt to the new step_idx semantics. However, the PR description is seriously inconsistent with the actual changes: it says 5 files would be modified to adapt to the new semantics, but only 2 were. The remaining 3 files (draft_model_set_value_by_flags.cu, speculate_set_value_by_flags_and_idx.cu, test_draft_model_set_value_by_flags.py) still use the old step_idx semantics, which may make the speculate decoding kernels behave inconsistently with one another and affect system correctness.
Suggestions:
- Add the step_idx semantics changes for the remaining 3 files, or state that they will be handled in other PRs
- If the work is split across multiple PRs, describe the batching plan in the description to avoid the problems of an incomplete semantics change
```diff
- int64_t t1 = (cur_step >= 1) ? pre_ids_now[cur_step - 1] : -1;
- int64_t t2 = (cur_step >= 2) ? pre_ids_now[cur_step - 2] : -1;
- int64_t t3 = (cur_step >= 3) ? pre_ids_now[cur_step - 3] : -1;
+ int64_t t0 = (cur_step >= 1) ? pre_ids_now[cur_step - 1] : -1;
```
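The guarded reads can be sketched in Python (`last4` is an illustrative stand-in for the CUDA indexing, not kernel code): under the new semantics the token k steps back lives at `pre_ids_now[cur_step - 1 - k]`, and histories shorter than k + 1 tokens read a sentinel -1.

```python
# Sketch of reading the 4 most recent tokens under the new step_idx semantics.
def last4(pre_ids: list, cur_step: int) -> list:
    # Token k steps back sits at index cur_step - 1 - k; guard short histories.
    return [pre_ids[cur_step - 1 - k] if cur_step >= k + 1 else -1
            for k in range(4)]


pre_ids = [10, 20, 30, 40, 50]
print(last4(pre_ids, 5))  # [50, 40, 30, 20]
print(last4(pre_ids, 2))  # [20, 10, -1, -1]
```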
🔴 Bug fix verification
Under the new semantics, pre_ids_now[0] is the first output token, and pre_ids_now[cur_step - 1] correctly points to the most recently written token (when cur_step >= 1). The change is correct.
Note: the PR description says draft_model_set_value_by_flags.cu and speculate_set_value_by_flags_and_idx.cu would also be modified, but neither file is included in this PR; please confirm whether this is an omission.
```diff
- token_ids_all[0, 1] = self.line_break_id
- token_ids_all[0, 2] = self.think_end_id
+ # batch 0: pattern \n <think_end> \n \n at pre_ids_now[0..3]
+ token_ids_all[0, 0] = self.line_break_id
```
🟡 Suggestion
The test-case changes are consistent with the new semantics: with step_idx = 4, pre_ids_now[0..3] holds 4 history tokens, which verifies the kernel's logic for reading the 4 most recent tokens.
Suggestion: the test_draft_model_set_value_by_flags.py test mentioned in the PR description should also be updated in sync; please confirm whether it needs modification.
Pull request overview
This PR adapts to the new step_idx semantics in the speculate decoding path (step_idx now means "number of history tokens already generated", and pre_ids_now[0] is the first output token). It corrects the index arithmetic the reasoning-phase state machine uses to read the most recent tokens, and updates the corresponding unit tests.
Changes:
- In `update_reasoning_status_kernel`, change the reads of the 4 most recent tokens from `pre_ids_now[cur_step - k]` to `pre_ids_now[cur_step - 1 - k]`, and raise the state-transition gate from `cur_step >= 3` to `cur_step >= 4`.
- Update the token pattern and comments constructed in `tests/operators/test_reasoning_phase_token_constraint.py` to match the new `step_idx`/`pre_ids_now` semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| custom_ops/gpu_ops/reasoning_phase_token_constraint.cu | Fix the indices and boundary condition the reasoning state machine uses to read recent tokens, matching the new step_idx semantics |
| tests/operators/test_reasoning_phase_token_constraint.py | Adjust test-data construction and comments so the state-transition logic is covered correctly under the new semantics |
```diff
  // x = 1 : Generating "\n</think>\n\n"
  //   - Model is emitting the explicit boundary pattern
  //   - In non-MTP mode, accept_num is implicitly 1
  //     and does not need to be manually set
  //   - In MTP mode, accept_num must be 1 in verify kernel
  //
  // Transition condition (x = 1 -> x = 2):
- //   - step_idx >= 3
+ //   - step_idx >= 4
  //   - pre_ids[-4:] exactly match:
  //     "\n</think>\n\n"
  //
```
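The x = 1 -> x = 2 gate described in the comment block above can be sketched in Python (the token ids and the `should_transition` helper are placeholders, not the real vocabulary or kernel):

```python
# Hedged sketch of the x = 1 -> x = 2 transition gate: fire only when
# step_idx >= 4 and the last 4 tokens are exactly "\n", "</think>", "\n", "\n".
LINE_BREAK, THINK_END = 100, 200  # assumed ids, not the real vocabulary


def should_transition(pre_ids: list, step_idx: int) -> bool:
    if step_idx < 4:  # fewer than 4 history tokens: pattern cannot be complete
        return False
    tail = pre_ids[step_idx - 4:step_idx]
    return tail == [LINE_BREAK, THINK_END, LINE_BREAK, LINE_BREAK]


ids = [LINE_BREAK, THINK_END, LINE_BREAK, LINE_BREAK]
print(should_transition(ids, 4))  # True
print(should_transition(ids, 3))  # False: gate requires step_idx >= 4
```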
```diff
+ token_ids_all[1, 2] = self.think_end_id

  self.token_ids_all = paddle.to_tensor(token_ids_all, dtype="int64")
  self.prompt_lens = paddle.zeros([self.bs, 1], dtype="int64")
```
…el (PaddlePaddle#7349) Co-authored-by: guanshihui <guanshihui@baidu.com>
…7402, #7445 to release/online/20260415 (#7447)
* [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166)
  - speculate_limit_thinking_content_length: update current_base_step to step_idx + 1 (step_idx now records the history count before the current round); remove the incorrect step_idx decrement on accept_num truncation; mark the step_idx param as const.
  - speculate_set_stop_value_multi_seqs: fix the can_stop gate to use step_idx_now + accept_num >= min_token_limit; fix the skip check and pre_ids_idx formula (remove the stale -accept_num offset); use a <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix the accept_tokens index (remove the -1).
  - Update unit tests for the speculate_set_stop_value_multi_seqs kernel.
* [Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349)
* [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402)
* [Interrupt reasoning] Add interrupt_requests control command support

Co-authored-by: guanshihui <guanshihui@baidu.com>
Motivation
Adapt to the new step_idx semantics. This PR updates the related kernel and its unit tests in sync.
Modifications
- custom_ops/gpu_ops/reasoning_phase_token_constraint.cu: fix the index logic in `update_reasoning_status_kernel` for reading the 4 most recent history tokens. Since step_idx is now the history token count, `pre_ids_now[cur_step - 1]` correctly points to the most recently written token, and the `cur_step >= 4` boundary check matches accordingly.
- tests/operators/test_reasoning_phase_token_constraint.py: adjust the step_idx values used by the tests.
Usage or Command
Run the unit tests to verify: