[None][docs] add GVR Top-K technical blog #13714
…— Heuristic Top-K for Blackwell
Add technical blog post (EN) documenting the heuristic-guided Top-K kernel for DeepSeek-V3.2 sparse attention on NVIDIA Blackwell GPUs.
Key contents:
- Temporal correlation analysis of indexer scores (RoPE/YaRN Toeplitz theory)
- Four-phase heuristic algorithm: preIdx stats → interpolation search → ballot-free collect → histogram+snap partition
- Single-CTA micro-kernel design with ~60 KB shared memory
- Kernel benchmarks: 1.32×–2.11× speedup on real SWE-Bench-64K data (B200)
- End-to-end accuracy validation on 5 benchmarks (no degradation)
- Integration into TensorRT-LLM via configurable dispatch path
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
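The four-phase flow named in this commit can be illustrated with a minimal CPU sketch in plain Python. This is not the kernel described in the blog. The function name `gvr_topk`, the default `target` of 1.5×k (which loosely mirrors the f_target=3072 mentioned in a later commit for `index_topk=2048`), and the use of `heapq.nlargest` as a stand-in for the histogram+snap partition are all assumptions made for illustration.

```python
import heapq


def count_at_least(scores, threshold):
    """Verify step: count how many scores clear the threshold."""
    return sum(1 for s in scores if s >= threshold)


def gvr_topk(scores, prev_idx, k, target=None, max_iters=4):
    """Exact Top-K indices of `scores`, seeded by the previous step's Top-K.

    Hypothetical illustration only; names and defaults are assumptions.
    """
    target = target if target is not None else k + k // 2

    # Phase 1 (preIdx stats): look up the current scores of last step's winners.
    prev_scores = [scores[i] for i in prev_idx]
    lo, hi = min(prev_scores), max(prev_scores)
    f_lo, f_hi = count_at_least(scores, lo), count_at_least(scores, hi)
    if f_lo < k:
        # Fewer than k distinct previous indices: fall back to keeping everything.
        lo, f_lo = min(scores), len(scores)

    # Phase 2 (interpolation search): move the threshold toward `target`
    # candidates, accepting a new lower bound only if at least k survive.
    for _ in range(max_iters):
        if f_lo <= target or f_lo == f_hi or hi <= lo:
            break
        t = lo + (hi - lo) * (f_lo - target) / (f_lo - f_hi)
        f_t = count_at_least(scores, t)
        if f_t >= k:
            lo, f_lo = t, f_t
        else:
            hi, f_hi = t, f_t

    # Phase 3 (collect): gather every index that clears the verified threshold.
    candidates = [i for i, s in enumerate(scores) if s >= lo]

    # Phase 4 (exact selection): heapq stands in for histogram+snap partition.
    return heapq.nlargest(k, candidates, key=lambda i: scores[i])
```

Because the threshold is only raised while at least `k` candidates remain, every score outside the candidate set is strictly below every score inside it, so the final selection stays exact. The guess only affects how much work the last phase has to do.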
Replace inline LaTeX (\text{}, {,}, $..$ in tables) with plain Unicode
and text equivalents for correct rendering on GitHub Flavored Markdown.
Block math ($$...$$) with \text{} is kept as-is (renders correctly).
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…edundant scripts
- Wrap all tables in <div align="center"> for page-level centering
- Use :---: separator for cell-level centering
- Restore $...$ math in table cells and inline text where GitHub renders correctly
- Remove duplicate trtllm-eval script block from reproduction section
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
- Remove split "$...$B" patterns (B outside math block)
- Use plain text in table cells for maximum compatibility
- Fix inline α/≈ symbols with proper $\alpha\approx$ LaTeX
- Replace $\sim$ with ~ for inline approximations
- Wrap NUM_WARPS formula in code backticks
- Use $N = 8192$ instead of {,} formatting
- Ensure $I+1\approx 3$–$4$ renders correctly
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
GitHub fails to render "$X \approx A$–$B$" correctly (the second $..$ block is parsed as standalone math). Replace all split-range patterns with plain Unicode: "X ≈ A–B". Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
- Replace ^* with ^\ast in math (GitHub Markdown consumes * as italic)
- Replace $N > 200$K with plain N > 200K (K outside math block)
- Use A_{m} instead of A_m (protect subscript from Markdown)
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
GitHub-flavored Markdown can misinterpret ^* as italic markup inside LaTeX math blocks. Using ^\ast avoids this ambiguity. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace LaTeX math ($\mathcal{N}$, $A_m$, etc.) with Unicode
equivalents (𝒩, Aₘ, etc.) in the data-source comparison table.
GitHub's math renderer is unreliable inside Markdown table cells;
Unicode characters render correctly everywhere.
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…etics
- Regenerate Phase 2 secant method diagram with corrected geometry: T₂ now at the exact intersection of Secant 2 and f_target=3072
- Use S-shaped CDF survival curve for f(T) instead of exponential
- Place all labels in clear white space with dashed arrow leaders
- Align pmax with the secant 1 / f(T) curve intersection
- Increase font sizes for better readability
- Update ZH blog table with Unicode math (sync with EN version)
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Condense and refocus the Future Work around five key directions: multi-CTA support for ultra-long sequences (N>200K), prefill-phase analytical prediction without temporal history, cross-model generalization (RocketKV, NSA), multi-batch / MTP>1 variable-length unified tuning, and next-generation GPU architecture adaptation. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…o-action Condense Acknowledgement section and add an invitation for the community to contribute to TensorRT-LLM and the GPU inference ecosystem. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Distinguish batch=1 (heavily tuned) from multi-batch (functionally supported but not yet performance-optimized). Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Revert to the original AI-generated secant diagram from commit 12ba374, which the team preferred over the matplotlib replacement. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…cape Expand the introduction to cover the broader sparse attention ecosystem (DSA, NSA, MoBA, RocketKV, Quest, SAGE-KV) that relies on Top-K selection, motivating kernel-level optimization as sequences grow into 100K+. Position DSA as the concrete case study while noting the approach generalizes to any method with temporal Top-K correlation. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Keep the Chinese version as a local-only file, not pushed to remote. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…name to GVR
- Add GPU Top-K literature review paragraph (RadiK ICS24, Zois ADMS19, Zhang SC23, Key approximate Top-K 2024) to Introduction
- Clarify baseline is SC23 evolution on Blackwell by the same team
- Fix DSA formula citation to DeepSeek-V3 technical report
- Rename algorithm to "GVR Top-K" in complexity table
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…GVR/heuristic naming
- Rename "End-to-End Throughput" to "End-to-End Min-Latency Benchmark on B200", remove throughput rows from table, keep latency metrics
- Fix TOC entry to match updated section title
- Add "heuristic-guided approach" natural phrasing in Introduction
- Rename algorithm to "GVR Top-K" in complexity table
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Move green star from Secant②/f(T) curve intersection to the correct position at Secant②/f_target=3072 intersection, with T₂ as the corresponding x-coordinate. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…m branding
- Update end-to-end min-latency benchmark with 4 independent trials on ISL=131K/OSL=32K, including mean and standard deviation
- Refine "GVR Top-K" branding across all technical diagrams (PNGs) to match manuscript terminology
- Improve algorithm flow diagram design with better alignment, larger fonts, and professional design
- Remove legacy "I=1-2" footer from algorithm flow diagram for cleaner presentation
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Update GVR Top-K documentation to match the merged Scheme X dispatcher, current index_topk support, and sparse attention enablement guidance. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Move the GVR Top-K article and its media assets to the next available tech blog number so it no longer conflicts with the existing blog19 DWDP article. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace temporary manuscript and fork links with the official arXiv DOI so the GVR Top-K blog is ready for external publication. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Use a technical-blog title that aligns with the file name while keeping the formal arXiv title in the references. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Clarify the public GVR Top-K enablement path, update the LLM API field description, and remove internal dispatcher terminology from user-facing docs. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Mention upcoming DeepSeek V4-style long-sequence indexer Top-K configurations and the planned GVR Top-K support for index_topk 512 and 1024. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
📝 Walkthrough
This PR adds documentation for the Guess-Verify-Refine (GVR) Top-K optimization feature for DeepSeek sparse attention on Blackwell GPUs, including a comprehensive new blog post explaining the algorithm and integration, updated feature documentation with configuration examples, and clarified API docstrings.
Changes
GVR Top-K Feature Documentation
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~7 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md`:
- Around line 470-485: Replace the deprecated flag `--extra_llm_api_options` with the canonical `--config` in the `trtllm-bench` example. Update the CLI example that invokes `trtllm-bench` (the command shown with model `deepseek-ai/DeepSeek-V3.2-Exp`) to use `--config <config.yml>` instead of `--extra_llm_api_options <config.yml>`, so it follows the docs convention for config-file flags.
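A fix like this is easy to miss in other docs examples, so a small check can help. The snippet below is a hypothetical helper, not part of TensorRT-LLM or its CI: it rewrites the deprecated flag in Markdown text and reports how many occurrences it changed. The function name `canonicalize_flags` and the sample command line are assumptions for illustration; only the two flag names come from the review comment.

```python
import re

DEPRECATED = "--extra_llm_api_options"
CANONICAL = "--config"

# Match the flag only as a whole token, so a longer option name that merely
# starts with the deprecated spelling is left untouched.
_FLAG_RE = re.compile(r"(?<![\w-])" + re.escape(DEPRECATED) + r"(?![\w-])")


def canonicalize_flags(markdown_text):
    """Return (new_text, replacement_count) with the deprecated flag rewritten."""
    return _FLAG_RE.subn(CANONICAL, markdown_text)


if __name__ == "__main__":
    # Hypothetical one-line excerpt; the real command in the blog is longer.
    sample = "trtllm-bench ... --extra_llm_api_options config.yml"
    fixed, count = canonicalize_flags(sample)
    print(count, fixed)
```

Returning the replacement count lets a caller fail a docs check whenever the count is non-zero, rather than silently rewriting files.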
⛔ Files ignored due to path filters (11)
- `docs/source/blogs/media/tech_blog21_algorithm_flow.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_dispatch_logic.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_e2e_tept8_osl1k_bar.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_hit_ratio.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_indexer_topk.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_phase4_detail.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_real_data_bars.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_secant_method.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_synthetic_scaling.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_temporal_correlation_diagram.png` is excluded by `!**/*.png`
- `docs/source/blogs/media/tech_blog21_topk_dispatch_flowchart.png` is excluded by `!**/*.png`
📒 Files selected for processing (4)
- `docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md`
- `docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md`
- `docs/source/features/sparse-attention.md`
- `tensorrt_llm/llmapi/llm_args.py`
Use the canonical trtllm-bench --config flag in the GVR Top-K blog and include the pre-commit yapf formatting for the LLM API field description. Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Addressed CodeRabbit's comment by replacing /bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #46588 [ run ] triggered by Bot. Commit:
/bot --help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
Details
- run: Launch build/test pipelines. All previously running jobs will be killed.
- kill: Kill all running builds associated with pull request.
- skip: Skip testing for latest commit on pull request.
- reuse-pipeline: Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot kill
PR_Github #46590 [ kill ] triggered by Bot. Commit:
/bot skip --comment "No need to run full CI"
PR_Github #46590 [ kill ] completed with state
PR_Github #46592 [ skip ] triggered by Bot. Commit:
PR_Github #46592 [ skip ] completed with state
Hi @hchings @chang-l @venkywonka, gentle ping for review when you have a chance. Current status:
What this PR does:
Relationship to prior GVR Top-K PRs:
Thanks!
Summary
This PR adds Tech Blog 21 for Guess-Verify-Refine (GVR) Top-K, the data-aware exact Top-K path for DeepSeek Sparse Attention (DSA) decode on Blackwell. The blog explains why decode-time indexer Top-K becomes a long-context bottleneck, how GVR uses temporal correlation from the previous decode step, how it fits into TensorRT-LLM, and how users can enable it.
- Updates the `DeepSeekSparseAttentionConfig.enable_heuristic_topk` field description so the LLM API reference explicitly names GVR Top-K, current `index_topk=2048` support, and fallback behavior.
Related PRs
Key Files
- `docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md`
- `docs/source/blogs/media/tech_blog21_*.png`
- `docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md`
- `docs/source/features/sparse-attention.md`
- `tensorrt_llm/llmapi/llm_args.py` (`enable_heuristic_topk`)
API / User-Facing Docs
No new API is introduced in this PR. It documents the existing `enable_heuristic_topk` option and clarifies that:
- GVR Top-K is enabled with `DeepSeekSparseAttentionConfig(enable_heuristic_topk=True)` or the equivalent YAML config.
- It currently supports `index_topk=2048` on Blackwell (SM100+).
- `index_topk=512/1024` support is previewed for future long-sequence DSA workloads.
Test plan
- `python3 -m py_compile tensorrt_llm/llmapi/llm_args.py`.
- `tensorrt_llm/llmapi/llm_args.py` with IDE lints.
Author
Long Cheng 243710427+longcheng-nv@users.noreply.github.com
Made-with: gpt-5.5-high
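For the "equivalent YAML config" mentioned under API / User-Facing Docs, the following is a minimal sketch of what such a file might contain, written out by a small stdlib-only Python helper. Only `enable_heuristic_topk` and `index_topk=2048` come from this PR description. The top-level `sparse_attention_config` key, the `algorithm: dsa` entry, the helper name `write_gvr_config`, and the file name are assumptions and should be checked against `docs/source/features/sparse-attention.md`.

```python
from pathlib import Path

# Assumed layout: `sparse_attention_config` and `algorithm: dsa` are not
# confirmed by this PR text; the other two fields are named in it.
CONFIG_YAML = """\
sparse_attention_config:
  algorithm: dsa
  index_topk: 2048
  enable_heuristic_topk: true
"""


def write_gvr_config(path):
    """Write the sketch config to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(CONFIG_YAML, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical file name; pass the result to the tool's config-file flag.
    print(write_gvr_config("gvr_topk_config.yml"))
```

Keeping the option in a config file, rather than in code, matches the review comment above that asks the docs to use the config-file flag in CLI examples.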