[Others] [Hackathon 10th Spring No.49] [RFC V2.0] Speculative decoding ngram GPU parallel kernel design #1295

Open
cloudforge1 wants to merge 3 commits into PaddlePaddle:master from cloudforge1:task/049-rfc-parallel-ngram-kernel

cloudforge1 commented Apr 4, 2026

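Summary

V2.0 is an inline update to the already-merged RFC (NKNaN V1.0, PR #1213) that adds a complete GPU parallel kernel design.

Changes

Inline override of 20260207_refine_speculate_decoding_ngram_for_fastdeploy.md:

  • §3 Design approach: expanded from 12 lines to a complete two-phase parallel architecture (Phase 1 search + Phase 2 CUB BlockScan)
  • §3 Key technical points: atomicMin64 leftmost match, early-exit mechanism, template-specialized search, scratch buffer caching
  • §4 New tests and acceptance: CI benchmark data for 33 configurations (9×–1,885× speedup vs CPU path), including a standalone CPU kernel benchmark (#7203)
  • §1 Overview update: adds production-scale figures (bsz=256, seq=128K) and the zero-copy goal
  • Metadata update: V2.0, dual authors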
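For readers unfamiliar with the leftmost-match idea named in the key technical points, here is a minimal host-side C++ sketch, not the kernel from the RFC. It assumes that each GPU thread tests one candidate start position and publishes a hit through a 64-bit atomic minimum, and that a thread can exit early once a match further left is already recorded. All names (`atomic_min`, `search_position`, `leftmost_match`) are hypothetical, and a plain loop stands in for the thread grid.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel meaning "no match recorded yet".
constexpr int64_t kNoMatch = std::numeric_limits<int64_t>::max();

// CAS-loop atomic minimum; stands in for a 64-bit atomicMin on the GPU.
inline void atomic_min(std::atomic<int64_t>& target, int64_t value) {
    int64_t current = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `current`, so the loop
    // re-checks whether `value` is still smaller before retrying.
    while (value < current &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
    }
}

// Work of one hypothetical "thread": does `ngram` start at `pos`?
inline void search_position(const std::vector<int64_t>& haystack,
                            const std::vector<int64_t>& ngram,
                            int64_t pos,
                            std::atomic<int64_t>& best) {
    // Early exit: a match at or to the left of `pos` is already recorded,
    // so this position cannot improve the result.
    if (best.load(std::memory_order_relaxed) <= pos) return;
    for (std::size_t k = 0; k < ngram.size(); ++k) {
        if (haystack[static_cast<std::size_t>(pos) + k] != ngram[k]) return;
    }
    atomic_min(best, pos);
}

// Returns the leftmost start index of `ngram` in `haystack`, or -1.
inline int64_t leftmost_match(const std::vector<int64_t>& haystack,
                              const std::vector<int64_t>& ngram) {
    if (ngram.empty() || ngram.size() > haystack.size()) return -1;
    std::atomic<int64_t> best{kNoMatch};
    const int64_t last =
        static_cast<int64_t>(haystack.size() - ngram.size());
    // Positions are visited right-to-left on purpose: the result must not
    // depend on the order in which "threads" finish.
    for (int64_t pos = last; pos >= 0; --pos) {
        search_position(haystack, ngram, pos, best);
    }
    const int64_t result = best.load(std::memory_order_relaxed);
    return result == kNoMatch ? -1 : result;
}
```

Because the minimum is taken atomically, the answer is the same for any visiting order, which is the property a parallel search needs in order to reproduce a sequential left-to-right scan. For example, `leftmost_match({1, 2, 3, 1, 2, 3, 4}, {1, 2, 3})` returns 0 even though the match at index 3 is visited first.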

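Related PRs

  • #6960: two-phase parallel implementation (up to 1,722× vs CPU path)
  • #7136: BlockScan Phase 2 + template-specialization optimized version (up to 1,885× vs CPU path)
  • #7203: standalone CPU kernel benchmark (isolates compute vs D2H transfer)
  • #7200: record of how the benchmark targets evolved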

V2.0 inline override of existing RFC:
- §3 expanded from 12 lines to complete two-phase parallel architecture
- Added Phase 1/Phase 2 kernel design, atomicMin64, CUB BlockScan
- Added template specialization (PR #7136), early-exit, scratch cache
- Added §4 with 27-config CI benchmark data (1.27x-1700x speedup)
- Updated metadata: V2.0, dual-author (NKNaN V1.0 + cloudforge1 V2.0)
- Remove 实现PR (implementation PR) row (no merged RFC has this)
- Remove 依赖飞桨版本 (required Paddle version) row (not in merged RFCs)
- Remove hackathon tag from 任务名称 (task name) (matches NKNaN V1.0 + fuzhenxin)
- Section structure: 1 Overview → 2 Current state → 3 Design → 4 Testing → 5 Schedule → 6 Impact + References
  Matches fuzhenxin (merged) and deepseek (merged) patterns
paddle-bot bot commented Apr 4, 2026

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

- IP notice: 21→19 µs, 722→1,885× speedup
- §5.2: Replace old scaling numbers with latest #7136 CI data
  - Production (bsz=32): 939→276 CPU, 661→19 GPU, 1.42→14.17×
  - Extreme (bsz=256, 131K): 275ms→284ms CPU, 162→151 GPU, 1,700→1,885×
  - Added high-concurrency (bsz=512): 72,640 CPU, 71 GPU, 1,030×
- 27→33 benchmark configs across 7 dimensions
- §5.3: Added reviewer expanded criteria (issue #7200)
- §7: Updated impact with specific production/extreme numbers
- §2: Condensed raw C++ to pseudo-code summary
- Added issue #7200 reference, V2.0 changelog