[Optimization] TopP=1.0 using _random_sample #7892
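Thanks for your contribution!

The CI report is generated from the following code (refreshed every 30 minutes):
1 Task overview: all Required tasks have passed (10/10), so nothing currently blocks merging. 1 Optional task failed and 1 Optional workflow is still pending; these are for reference only.
2 Task status summary
2.1 Required tasks: 10/10 passed
2.2 Optional tasks: 30/32 passed
3 Failure details (required only): no required tasks failed.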
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7892 +/- ##
==========================================
Coverage ? 63.62%
==========================================
Files ? 462
Lines ? 64532
Branches ? 9892
==========================================
Hits ? 41056
Misses ? 20699
Partials ? 2777
Flags with carried forward coverage won't be shown.
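As a quick consistency check on the Codecov figures above (the report itself does not show this arithmetic), hits, misses, and partials add up to the Lines total, and Codecov's documented ratio hits / (hits + misses + partials) reproduces the 63.62% coverage. A minimal Python sketch:

```python
def codecov_coverage(hits: int, misses: int, partials: int) -> float:
    """Coverage ratio as Codecov defines it: partially covered lines count as not covered."""
    return hits / (hits + misses + partials)


# Figures taken from the Coverage Diff table above
hits, misses, partials = 41056, 20699, 2777
assert hits + misses + partials == 64532  # equals the "Lines" row
print(f"{codecov_coverage(hits, misses, partials):.2%}")  # → 63.62%
```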
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-22 18:39:01
📋 Review Summary
PR overview: when top_p=1.0, skip top_p truncation and take a `_random_sample` fast path based on the Gumbel-max trick instead; also extract `dispatch_top_k_renorm_probs` into its own function and remove a redundant CPU→GPU copy of `topp_seed`.
Scope of changes: model_executor/layers/sample/, worker/gpu_model_runner.py, worker/input_batch.py, scripts/run_pre_ce.sh
Impact tags: [Optimization] [OP] [CI]
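`dispatch_top_k_renorm_probs` is described here only as the extracted "top_k renorm probs" step; its signature and batching are not shown on this page. Conceptually, top-k renormalization keeps a row's k largest probabilities, zeroes the rest, and rescales so the row sums to 1. A single-row, pure-Python sketch; the function name and the convention that `k <= 0` disables filtering are assumptions, not FastDeploy's API:

```python
def top_k_renorm_probs(probs: list[float], k: int) -> list[float]:
    """Keep the k largest probabilities, zero out the rest, renormalize to sum 1.

    Assumption for this sketch: k <= 0 or k >= len(probs) means "no top-k filter".
    Ties at the k-th value are broken in favor of the lower index.
    """
    n = len(probs)
    if 0 < k < n:
        # Rank indices by descending probability, then by index for stable ties
        ranked = sorted(range(n), key=lambda i: (-probs[i], i))
        keep = set(ranked[:k])
        probs = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(probs)
    return [p / total for p in probs]


print(top_k_renorm_probs([0.1, 0.4, 0.2, 0.3], k=2))  # roughly [0.0, 0.571, 0.0, 0.429]
```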
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py:1046 | The other hardware ModelRunners do not populate top_p_list, so they cannot use the fast path |
| ❓ Question | fastdeploy/model_executor/layers/sample/sampler.py:1068 | Explicitly passing top_p/top_k in _normal_sample is redundant |
| ❓ Question | scripts/run_pre_ce.sh:10 | Whether torch==2.8.0 is available on PyPI, and how removing triton affects CI paths |
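📝 PR Convention Check
The title tag [Optimization] is an official tag and the format is compliant. The description structure (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) is complete and substantive. None of the Checklist items are checked; please update them to reflect the actual state.
Suggested description (can be copied directly; only the Checklist check states need updating):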
## Motivation
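With top_p=1.0 no top_p truncation is needed, so the Gumbel-max-trick `_random_sample` is used directly in place of the full `top_k_top_p_sampling`. This avoids unnecessary GPU computation and speeds up sampling during inference.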
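The Gumbel-max trick behind the fast path replaces sort-and-cumsum sampling with a single argmax over noise-perturbed log-probabilities. The following is a single-row, pure-Python sketch of the technique, not FastDeploy's `_random_sample`, which runs batched on the GPU and may use the equivalent exponential-noise form argmax(p_i / E_i) with E_i ~ Exp(1); the name `gumbel_max_sample` is hypothetical:

```python
import math
import random


def gumbel_max_sample(probs: list[float], rng: random.Random) -> int:
    """Draw index i with probability probs[i] using the Gumbel-max trick.

    argmax_i (log p_i + G_i), with G_i i.i.d. standard Gumbel noise, is
    distributed exactly as Categorical(probs): no sort and no cumulative sum.
    """
    best_idx, best_score = -1, -math.inf
    for i, p in enumerate(probs):
        if p <= 0.0:
            continue  # log(0) = -inf, so a zero-probability token can never win
        u = rng.random()
        while u == 0.0:  # rng.random() is in [0, 1); resample to keep log(u) finite
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # inverse-CDF sample of Gumbel(0, 1)
        score = math.log(p) + gumbel
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


rng = random.Random(0)
probs = [0.1, 0.6, 0.3]
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max_sample(probs, rng)] += 1
print([c / 20000 for c in counts])  # empirical frequencies, close to probs
```

With top_p=1.0 no probability mass is cut, so top-p filtering is an identity step and sampling reduces to drawing from the distribution as-is, which is exactly what the argmax above does.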
## Modifications
- `fastdeploy/model_executor/layers/sample/meta_data.py`: add a `top_p_list` field to `SamplingMetadata` (a Python list, used to quickly check whether every value is 1.0)
- `fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py`: extract the top_k renorm probs logic into a standalone function `dispatch_top_k_renorm_probs`; remove the redundant CPU→GPU copy of `topp_seed`
- `fastdeploy/model_executor/layers/sample/sampler.py`: add a `_sample_from_probs` function that takes the `_random_sample` fast path when every request in the batch has top_p=1.0, and otherwise keeps the original `top_k_top_p_sampling` path
- `fastdeploy/worker/input_batch.py`: add initialization, swap, and reset logic for the `top_p_list` list
- `fastdeploy/worker/gpu_model_runner.py`: populate `top_p_list` in `insert_tasks_v1` and pass it into `SamplingMetadata` in `_prepare_inputs`
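The `input_batch.py` change (initialize, swap, and reset `top_p_list`) amounts to keeping a host-side list aligned slot-for-slot with the other per-request buffers. A minimal sketch with a hypothetical class `TopPListSketch`; the reset fill value of 1.0 is an assumption, the 0.7 default on insert mirrors the `gpu_model_runner.py` line quoted in the review, and a plain `dict` stands in for the request object:

```python
class TopPListSketch:
    """Hypothetical host-side mirror of per-slot top_p values."""

    def __init__(self, max_num_seqs: int, reset_value: float = 1.0):
        self.reset_value = reset_value  # assumed fill value; not taken from the PR
        self.top_p_list = [reset_value] * max_num_seqs

    def insert(self, idx: int, request: dict) -> None:
        # Same default as the insert_tasks_v1 line quoted in the review
        self.top_p_list[idx] = request.get("top_p", 0.7)

    def swap(self, i: int, j: int) -> None:
        # Keep the list aligned with the other per-slot buffers when slots move
        self.top_p_list[i], self.top_p_list[j] = self.top_p_list[j], self.top_p_list[i]

    def reset(self) -> None:
        self.top_p_list = [self.reset_value] * len(self.top_p_list)


batch = TopPListSketch(4)
batch.insert(0, {"top_p": 1.0})
batch.insert(1, {})  # no top_p in the request: falls back to 0.7
batch.swap(0, 1)
print(batch.top_p_list)  # → [0.7, 1.0, 1.0, 1.0]
```

Keeping the values in a Python list, as the description says, lets the all-1.0 check run on the host; reading the same values out of a device tensor would typically force a device-to-host copy and synchronization on every step.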
## Usage or Command
N/A (performance optimization; the interface is unchanged and users do not need to change how they use it)
## Accuracy Tests
N/A (this change does not alter the sampling logic for top_p < 1.0; only top_p=1.0 switches to an equivalent random-sampling path)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
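- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall Assessment
The overall approach is clear and sound: the fast-path condition (top_p_list is all 1.0 and is_cuda()) is correct, and removing the redundant CPU→GPU copy of topp_seed is also a valid optimization. The main concerns are that the other hardware ModelRunners are not kept in sync, and whether torch==2.8.0 is available for the CI script.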
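The fast-path condition described above can be sketched as a standalone predicate. This is a hypothetical helper: in the PR the check lives inside `_sample_from_probs` and its exact code is not shown on this page, and here `is_cuda` is passed in as a plain bool rather than queried from Paddle. It also covers the `top_p_list=None` case raised below for the other hardware runners:

```python
from typing import Optional, Sequence


def need_top_p_sampling(top_p_list: Optional[Sequence[float]], is_cuda: bool) -> bool:
    """Return True when the full top_k_top_p_sampling path is required.

    The fast path is taken only on CUDA and only when every request in the batch
    has top_p == 1.0. A missing list (None), as on runners that do not populate
    it, conservatively keeps the full path.
    """
    if top_p_list is None or not is_cuda:
        return True
    # top_p comes straight from the request, so an exact comparison with 1.0 is assumed
    return any(p != 1.0 for p in top_p_list)


print(need_top_p_sampling([1.0, 1.0], is_cuda=True))  # → False (fast path)
print(need_top_p_sampling(None, is_cuda=True))        # → True (full path)
```

Treating `None` as "must run top-p" is the safe default: a runner that never fills the list degrades to the slower path instead of silently skipping truncation.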
sampling_metadata.top_k,
sampling_metadata.top_k_list,
sampling_metadata,
top_p=sampling_metadata.top_p,
❓ Question: in _normal_sample, explicitly passing top_p=sampling_metadata.top_p / top_k=sampling_metadata.top_k is redundant.
_sample_from_probs already handles this internally:
if top_p is None:
    top_p = sampling_metadata.top_p
if top_k is None:
    top_k = sampling_metadata.top_k
Calling _sample_from_probs(probs, sampling_metadata, topp_seed=sampling_metadata.seed) directly is enough; there is no need to pass these arguments again.
continue
assert len(request.eos_token_ids) == self.model_config.eos_tokens_lens
self.share_inputs["top_p_list"][idx] = request.get("top_p", 0.7)
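🟡 Suggestion: the other hardware ModelRunners (gcu_model_runner.py, xpu_model_runner.py, metax_model_runner.py, hpu_model_runner.py) neither populate nor pass top_p_list in insert_tasks_v1 / _prepare_inputs, so the SamplingMetadata built on these platforms has top_p_list=None.
In _sample_from_probs, need_top_p_sampling stays True when top_p_list is None, so these platforms never get the top_p=1.0 fast path. They fall back to the slower path rather than crashing, but the hardware backends are left out of sync.
Consider adding top_p_list initialization and passing to each hardware runner, or explaining in the PR description why the other platforms do not yet support this optimization.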
https://paddle-qa.bj.bcebos.com/FastDeploy/torch-2.6.0-cp310-cp310-manylinux1_x86_64.whl \
https://paddle-qa.bj.bcebos.com/FastDeploy/triton-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
https://paddle-qa.bj.bcebos.com/FastDeploy/xgrammar-0.1.19-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# Use prebuilt wheel files to install xgrammar==0.1.19 and torch==2.8.0 specifically for the CI environment
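❓ Question: is torch==2.8.0 currently available on PyPI? Please confirm that this version has been released, to avoid CI installation failures.
Also, the original script installed the triton-3.2.0 wheel. With it removed, any CI test cases that exercise the FD_SAMPLING_CLASS=triton path will fail on `import triton`. Please confirm whether the triton install should be kept, or whether triton-path tests should be skipped in CI.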
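If the triton wheel is dropped from `run_pre_ce.sh`, the skip option the reviewer mentions could look like the following `unittest` sketch. The test class is hypothetical and does not reflect FastDeploy's actual test layout; only the `FD_SAMPLING_CLASS=triton` path comes from the review:

```python
import importlib.util
import unittest


def triton_available() -> bool:
    """True if the `triton` package can be found in the current environment."""
    return importlib.util.find_spec("triton") is not None


@unittest.skipUnless(triton_available(), "triton is not installed in this CI environment")
class TritonSamplingPathTest(unittest.TestCase):
    """Hypothetical test for the FD_SAMPLING_CLASS=triton sampling path."""

    def test_triton_importable(self):
        # Without the skip guard above, this import would raise ImportError
        # once run_pre_ce.sh no longer installs the triton wheel.
        import triton  # noqa: F401
```

Guarding at the class level skips the whole triton suite in one place, so the rest of CI keeps running whether or not the wheel is installed.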
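## Motivation
With top_p=1.0 no top_p truncation is needed, so the Gumbel-max-trick `_random_sample` is used directly in place of the full `top_k_top_p_sampling`. This avoids unnecessary GPU computation and speeds up sampling during inference.
## Modifications
- `fastdeploy/model_executor/layers/sample/meta_data.py`: add a `top_p_list` field to `SamplingMetadata` (a Python list, used to quickly check whether every value is 1.0)
- `fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py`: extract the top_k renorm probs logic into a standalone function `dispatch_top_k_renorm_probs`; remove the redundant CPU→GPU copy of `topp_seed`
- `fastdeploy/model_executor/layers/sample/sampler.py`: add a `_sample_from_probs` function that takes the `_random_sample` fast path when every request in the batch has top_p=1.0, and otherwise keeps the original `top_k_top_p_sampling` path
- `fastdeploy/worker/input_batch.py`: add initialization, swap, and reset logic for the `top_p_list` list
- `fastdeploy/worker/gpu_model_runner.py`: populate `top_p_list` in `insert_tasks_v1` and pass it into `SamplingMetadata` in `_prepare_inputs`
## Usage or Command
N/A (performance optimization; the interface is unchanged and users do not need to change how they use it)
## Accuracy Tests
N/A (this change does not alter the sampling logic for top_p < 1.0; only top_p=1.0 switches to an equivalent random-sampling path)
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.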