[Others]【Hackathon 10th Spring No.49】【RFC V2.0】Speculative decoding ngram GPU parallel kernel design by cloudforge1 · Pull Request #1295 · PaddlePaddle/community

cloudforge1 · 2026-04-04T08:54:45Z

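Summary

V2.0 is an inline update to the already-merged RFC (NKNaN V1.0, PR #1213) that adds the complete GPU parallel kernel design.

Changes

Inline override of 20260207_refine_speculate_decoding_ngram_for_fastdeploy.md:

§3 Design approach: expanded from 12 lines into a complete two-phase parallel architecture (Phase 1 search + Phase 2 CUB BlockScan)
§3 Key technical points: atomicMin64 leftmost match, early exit, template-specialized search, scratch buffer caching
§4 New tests and acceptance: CI benchmark data for 33 configurations (9×–1,885× speedup vs. the CPU path), including a standalone CPU kernel benchmark (#7203)
§1 Overview update: adds the production-scale target (bsz=256, seq=128K) and the zero-copy goal
Metadata update: V2.0, two authors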
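
To make the Phase 1 idea concrete, the following is a minimal sequential Python sketch of a leftmost n-gram match, not the CUDA kernel from the RFC. It assumes that each candidate start position is tested independently (one GPU thread per position in the real kernel) and that the smallest matching index wins through a 64-bit atomic minimum. The names `leftmost_match`, `propose_draft`, and `NO_MATCH` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Optional

# Sentinel meaning "no match yet"; assumed to mirror a 64-bit slot
# initialised to its maximum value before the search starts.
NO_MATCH = 2**63 - 1


def leftmost_match(haystack: List[int], pattern: List[int],
                   order: Optional[Iterable[int]] = None) -> Optional[int]:
    """Return the smallest start index at which `pattern` occurs in `haystack`."""
    n = len(pattern)
    if n == 0 or n > len(haystack):
        return None
    best = NO_MATCH
    # `order` models the arbitrary order in which threads may run.
    starts = range(len(haystack) - n + 1) if order is None else order
    for start in starts:
        if start >= best:
            # Early exit: a smaller index is already recorded, so this
            # position cannot improve the result.
            continue
        if haystack[start:start + n] == pattern:
            best = min(best, start)  # stands in for an atomic 64-bit min
    return None if best == NO_MATCH else best


def propose_draft(haystack: List[int], suffix: List[int], max_draft: int) -> List[int]:
    """Copy up to `max_draft` tokens that follow the leftmost match of `suffix`."""
    pos = leftmost_match(haystack, suffix)
    if pos is None:
        return []
    begin = pos + len(suffix)
    return haystack[begin:begin + max_draft]
```

Because the result is a minimum over all matching positions, it does not depend on the order in which positions are visited, which is what lets the search run in parallel while still returning the same answer as a left-to-right scan.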

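Related PRs

#6960: two-phase parallel implementation (up to 1,722× vs CPU path)
#7136: BlockScan Phase 2 + template-specialization optimized version (up to 1,885× vs CPU path)
#7203: standalone CPU kernel benchmark (isolates compute vs D2H transfer)
#7200: record of how the benchmark targets evolved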
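
The exact job of the BlockScan step in Phase 2 is not spelled out in this description. As a hedged illustration only, the sketch below assumes that Phase 2 turns per-request draft-token counts into running offsets with an exclusive prefix sum and clamps each request against a shared token budget. The function names and the budget semantics are assumptions, and the plain Python loop stands in for what CUB's BlockScan would compute cooperatively inside a thread block.

```python
from typing import List


def exclusive_scan(counts: List[int]) -> List[int]:
    """Exclusive prefix sum: out[i] is the sum of counts[0..i-1]."""
    out = []
    running = 0
    for c in counts:
        out.append(running)
        running += c
    return out


def clamp_to_budget(counts: List[int], budget: int) -> List[int]:
    """Trim per-request draft counts so that their total stays within `budget`.

    Assumption for illustration: requests are served in index order, and a
    request may only use what earlier requests left over.
    """
    offsets = exclusive_scan(counts)
    return [max(0, min(c, budget - off)) for c, off in zip(counts, offsets)]
```

For example, `clamp_to_budget([3, 0, 4, 2], budget=6)` gives `[3, 0, 3, 0]`: the offsets are `[0, 3, 3, 7]`, so the third request is cut to the 3 tokens still available and the fourth gets none. Each output depends only on its own count and its offset, so once the scan is done every request can be clamped independently.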

V2.0 inline override of existing RFC:
- §3 expanded from 12 lines to complete two-phase parallel architecture
- Added Phase 1/Phase 2 kernel design, atomicMin64, CUB BlockScan
- Added template specialization (PR #7136), early-exit, scratch cache
- Added §4 with 27-config CI benchmark data (1.27x-1700x speedup)
- Updated metadata: V2.0, dual-author (NKNaN V1.0 + cloudforge1 V2.0)

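- Remove the implementation-PR row (实现PR); no merged RFC has this
- Remove the required-Paddle-version row (依赖飞桨版本); not in merged RFCs
- Remove hackathon tag from the task name (任务名称); matches NKNaN V1.0 + fuzhenxin
- Section structure: 1 Overview → 2 Current state → 3 Design → 4 Testing → 5 Schedule → 6 Impact + References; matches fuzhenxin (merged) and deepseek (merged) patterns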

paddle-bot · 2026-04-04T08:55:01Z

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

- IP notice: 21→19 µs, 722→1,885× speedup
- §5.2: Replace old scaling numbers with latest #7136 CI data
  - Production (bsz=32): 939→276 CPU, 661→19 GPU, 1.42→14.17×
  - Extreme (bsz=256, 131K): 275ms→284ms CPU, 162→151 GPU, 1,700→1,885×
  - Added high-concurrency (bsz=512): 72,640 CPU, 71 GPU, 1,030×
  - 27→33 benchmark configs across 7 dimensions
- §5.3: Added reviewer expanded criteria (issue #7200)
- §7: Updated impact with specific production/extreme numbers
- §2: Condensed raw C++ to pseudo-code summary
- Added issue #7200 reference, V2.0 changelog

cloudforge1 added 2 commits April 4, 2026 09:54

paddle-bot bot added the contributor label Apr 4, 2026

luotao1 mentioned this pull request Apr 4, 2026

【Hackathon 10th】Open Source Contribution Individual Challenge · Spring Festival Special Season PaddlePaddle/Paddle#77429

Open

This was referenced Apr 4, 2026

[Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized PaddlePaddle/FastDeploy#7136

Open

[Optimization]【Hackathon 10th Spring No.49】Port ngram_match and hybrid_mtp_ngram kernels to CUDA PaddlePaddle/FastDeploy#6960

Open

cloudforge1 wants to merge 3 commits into PaddlePaddle:master from cloudforge1:task/049-rfc-parallel-ngram-kernel