
[None][docs] add GVR Top-K technical blog#13714

Merged
juney-nvidia merged 27 commits into NVIDIA:main from longcheng-nv:tmp/heuristic_topK_report
May 3, 2026

Conversation

@longcheng-nv
Collaborator

@longcheng-nv longcheng-nv commented May 3, 2026

Summary

This PR adds Tech Blog 21 for Guess-Verify-Refine (GVR) Top-K, the data-aware exact Top-K path for DeepSeek Sparse Attention (DSA) decode on Blackwell. The blog explains why decode-time indexer Top-K becomes a long-context bottleneck, how GVR uses temporal correlation from the previous decode step, how it fits into TensorRT-LLM, and how users can enable it.

  • New GVR Top-K technical blog: introduces the motivation, temporal-correlation observation, four-phase Guess/Verify/Refine algorithm, exactness story, TensorRT-LLM dispatch path, and B200 performance results.
  • Supporting media assets: adds Tech Blog 21 figures for the DSA indexer Top-K flow, temporal correlation, algorithm phases, dispatch logic, single-op results, and end-to-end TPOT reduction.
  • Cross-blog linkage: updates Tech Blog 15's DSA Top-K section to point readers to the newer GVR Top-K article.
  • User-facing docs/API wording: updates sparse attention docs and DeepSeekSparseAttentionConfig.enable_heuristic_topk field description so the LLM API reference explicitly names GVR Top-K, current index_topk=2048 support, and fallback behavior.
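The temporal-correlation observation summarized above can be made concrete with a small, self-contained sketch (synthetic scores only; this is not the TensorRT-LLM implementation): measure how much of one decode step's Top-K index set carries over to the next step.

```python
import random

def topk_indices(scores, k):
    """Indices of the k largest scores."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

def hit_ratio(prev_scores, curr_scores, k):
    """Fraction of the current step's Top-K already present in the previous step's Top-K."""
    return len(topk_indices(prev_scores, k) & topk_indices(curr_scores, k)) / k

random.seed(0)
n, k = 8192, 2048
# Synthetic indexer scores: a stable per-position component plus small
# per-step noise, mimicking the strong step-to-step correlation the blog reports.
base = [random.gauss(0.0, 1.0) for _ in range(n)]
step_t  = [b + random.gauss(0.0, 0.02) for b in base]
step_t1 = [b + random.gauss(0.0, 0.02) for b in base]

print(f"hit ratio: {hit_ratio(step_t, step_t1, k):.3f}")  # close to 1.0 when scores are correlated
```

A hit ratio near 1.0 is what lets GVR seed its threshold guess from the previous decode step.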

Related PRs

Key Files

| File | Description |
| --- | --- |
| docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md | Main GVR Top-K technical blog |
| docs/source/blogs/media/tech_blog21_*.png | Figures used by Tech Blog 21 |
| docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md | Cross-link from the original DSA Top-K discussion |
| docs/source/features/sparse-attention.md | Sparse attention docs with optional GVR Top-K enablement |
| tensorrt_llm/llmapi/llm_args.py | LLM API field description for enable_heuristic_topk |

API / User-Facing Docs

No new API is introduced in this PR. It documents the existing enable_heuristic_topk option and clarifies that:

  • GVR Top-K is opt-in through DeepSeekSparseAttentionConfig(enable_heuristic_topk=True) or the equivalent YAML config.
  • The current GVR fast path supports index_topk=2048 on Blackwell (SM100+).
  • Planned index_topk=512/1024 support is previewed for future long-sequence DSA workloads.
  • Unsupported configurations fall back to the production insertion/radix Top-K path.
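As a sketch of that opt-in path (the field names index_topk and enable_heuristic_topk come from this PR; the surrounding key nesting is an assumption and may differ from the actual sparse-attention docs):

```yaml
# Illustrative extra-options YAML; only the two commented fields come from this PR.
sparse_attention_config:
  algorithm: DSA
  index_topk: 2048             # the only size the current GVR fast path supports
  enable_heuristic_topk: true  # opt in; unsupported setups fall back to the production path
```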

Test plan

  • Documentation-only change.
  • Verified local relative links and image paths in the new blog.
  • Ran python3 -m py_compile tensorrt_llm/llmapi/llm_args.py.
  • Checked tensorrt_llm/llmapi/llm_args.py with IDE lints.

Author

Long Cheng 243710427+longcheng-nv@users.noreply.github.com

Made-with: gpt-5.5-high

…— Heuristic Top-K for Blackwell

Add technical blog post (EN) documenting the heuristic-guided Top-K
kernel for DeepSeek-V3.2 sparse attention on NVIDIA Blackwell GPUs.

Key contents:
- Temporal correlation analysis of indexer scores (RoPE/YaRN Toeplitz theory)
- Four-phase heuristic algorithm: preIdx stats → interpolation search →
  ballot-free collect → histogram+snap partition
- Single-CTA micro-kernel design with ~60 KB shared memory
- Kernel benchmarks: 1.32×–2.11× speedup on real SWE-Bench-64K data (B200)
- End-to-end accuracy validation on 5 benchmarks (no degradation)
- Integration into TensorRT-LLM via configurable dispatch path
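The four phases listed above can be sketched as a toy scalar model (a simplified illustration of the idea, not the single-CTA CUDA kernel): start from a guessed threshold, drive the survivor count f(T) = |{i : sᵢ ≥ T}| toward k with secant-style iterations, then collect the surviving candidates and refine to the exact Top-K.

```python
def survival_count(scores, t):
    """f(T): how many scores are >= T."""
    return sum(1 for s in scores if s >= t)

def gvr_topk(scores, k, t_guess, iters=6):
    """Toy Guess-Verify-Refine Top-K over a Python list (illustration only)."""
    lo, hi = min(scores), max(scores)
    # Guess: two probes seeded by a threshold carried over from the previous step.
    t0, t1 = t_guess, (t_guess + lo) / 2.0
    f0, f1 = survival_count(scores, t0), survival_count(scores, t1)
    # Verify: secant-style iterations that drive f(T) toward k.
    for _ in range(iters):
        if f1 == k or f1 == f0:
            break
        t_next = t1 + (k - f1) * (t1 - t0) / (f1 - f0)
        t_next = min(max(t_next, lo), hi)
        t0, f0 = t1, f1
        t1, f1 = t_next, survival_count(scores, t_next)
    # Collect: everything at or above the final threshold
    # (fall back to all positions if the threshold overshot).
    cand = [i for i, s in enumerate(scores) if s >= t1]
    if len(cand) < k:
        cand = list(range(len(scores)))
    # Refine: exact Top-K over the (usually small) candidate set,
    # so the result is exact regardless of how good the guess was.
    cand.sort(key=lambda i: scores[i], reverse=True)
    return set(cand[:k])
```

Because the refinement step runs an exact Top-K over a superset of the true Top-K, a poor guess only costs extra iterations, never correctness — consistent with the exactness story in the blog.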

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace inline LaTeX (\text{}, {,}, $..$ in tables) with plain Unicode
and text equivalents for correct rendering on GitHub Flavored Markdown.
Block math ($$...$$) with \text{} is kept as-is (renders correctly).

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…edundant scripts

- Wrap all tables in <div align="center"> for page-level centering
- Use :---: separator for cell-level centering
- Restore $...$ math in table cells and inline text where GitHub renders correctly
- Remove duplicate trtllm-eval script block from reproduction section

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
- Remove split "$...$B" patterns (B outside math block)
- Use plain text in table cells for maximum compatibility
- Fix inline α/≈ symbols with proper $\alpha\approx$ LaTeX
- Replace $\sim$ with ~ for inline approximations
- Wrap NUM_WARPS formula in code backticks
- Use $N = 8192$ instead of {,} formatting
- Ensure $I+1\approx 3$–$4$ renders correctly

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
GitHub fails to render "$X \approx A$–$B$" correctly (the second
$..$ block is parsed as standalone math). Replace all split-range
patterns with plain Unicode: "X ≈ A–B".

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
- Replace ^* with ^\ast in math (GitHub Markdown consumes * as italic)
- Replace $N > 200$K with plain N > 200K (K outside math block)
- Use A_{m} instead of A_m (protect subscript from Markdown)

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
GitHub-flavored Markdown can misinterpret ^* as italic markup
inside LaTeX math blocks. Using ^\ast avoids this ambiguity.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace LaTeX math ($\mathcal{N}$, $A_m$, etc.) with Unicode
equivalents (𝒩, Aₘ, etc.) in the data-source comparison table.
GitHub's math renderer is unreliable inside Markdown table cells;
Unicode characters render correctly everywhere.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…etics

- Regenerate Phase 2 secant method diagram with corrected geometry:
  T₂ now at the exact intersection of Secant 2 and f_target=3072
- Use S-shaped CDF survival curve for f(T) instead of exponential
- Place all labels in clear white space with dashed arrow leaders
- Align pmax with the secant 1 / f(T) curve intersection
- Increase font sizes for better readability
- Update ZH blog table with Unicode math (sync with EN version)

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Condense and refocus the Future Work around five key directions:
multi-CTA support for ultra-long sequences (N>200K), prefill-phase
analytical prediction without temporal history, cross-model
generalization (RocketKV, NSA), multi-batch / MTP>1 variable-length
unified tuning, and next-generation GPU architecture adaptation.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…o-action

Condense Acknowledgement section and add an invitation for the
community to contribute to TensorRT-LLM and the GPU inference
ecosystem.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Distinguish batch=1 (heavily tuned) from multi-batch (functionally
supported but not yet performance-optimized).

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Revert to the original AI-generated secant diagram from commit
12ba374, which the team preferred over the matplotlib replacement.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…cape

Expand the introduction to cover the broader sparse attention ecosystem
(DSA, NSA, MoBA, RocketKV, Quest, SAGE-KV) that relies on Top-K
selection, motivating kernel-level optimization as sequences grow into
100K+. Position DSA as the concrete case study while noting the
approach generalizes to any method with temporal Top-K correlation.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Keep the Chinese version as a local-only file, not pushed to remote.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…name to GVR

- Add GPU Top-K literature review paragraph (RadiK ICS24, Zois ADMS19,
  Zhang SC23, Key approximate Top-K 2024) to Introduction
- Clarify baseline is SC23 evolution on Blackwell by the same team
- Fix DSA formula citation to DeepSeek-V3 technical report
- Rename algorithm to "GVR Top-K" in complexity table

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…GVR/heuristic naming

- Rename "End-to-End Throughput" to "End-to-End Min-Latency Benchmark
  on B200", remove throughput rows from table, keep latency metrics
- Fix TOC entry to match updated section title
- Add "heuristic-guided approach" natural phrasing in Introduction
- Rename algorithm to "GVR Top-K" in complexity table

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Move green star from Secant②/f(T) curve intersection to the correct
position at Secant②/f_target=3072 intersection, with T₂ as the
corresponding x-coordinate.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…m branding

- Update end-to-end min-latency benchmark with 4 independent trials on
  ISL=131K/OSL=32K, including mean and standard deviation
- Refine "GVR Top-K" branding across all technical diagrams (PNGs) to
  match manuscript terminology
- Improve algorithm flow diagram design with better alignment, larger
  fonts, and professional design
- Remove legacy "I=1-2" footer from algorithm flow diagram for cleaner
  presentation

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Update GVR Top-K documentation to match the merged Scheme X dispatcher, current index_topk support, and sparse attention enablement guidance.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Move the GVR Top-K article and its media assets to the next available tech blog number so it no longer conflicts with the existing blog19 DWDP article.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace temporary manuscript and fork links with the official arXiv DOI so the GVR Top-K blog is ready for external publication.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Use a technical-blog title that aligns with the file name while keeping the formal arXiv title in the references.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Clarify the public GVR Top-K enablement path, update the LLM API field description, and remove internal dispatcher terminology from user-facing docs.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Mention upcoming DeepSeek V4-style long-sequence indexer Top-K configurations and the planned GVR Top-K support for index_topk 512 and 1024.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
@longcheng-nv longcheng-nv requested review from a team as code owners May 3, 2026 09:21
@longcheng-nv longcheng-nv requested review from chang-l and hchings May 3, 2026 09:21
@coderabbitai
Contributor

coderabbitai Bot commented May 3, 2026

📝 Walkthrough

Walkthrough

This PR adds documentation for the Guess-Verify-Refine (GVR) Top-K optimization feature for DeepSeek sparse attention on Blackwell GPUs, including a comprehensive new blog post explaining the algorithm and integration, updated feature documentation with configuration examples, and clarified API docstrings.

Changes

GVR Top-K Feature Documentation

| Layer / File(s) | Summary |
| --- | --- |
| **Configuration API**<br>tensorrt_llm/llmapi/llm_args.py | DeepSeekSparseAttentionConfig.enable_heuristic_topk docstring updated to describe GVR Top-K behavior, supported conditions (index_topk=2048 on Blackwell/SM100+), and fallback to the production Top-K path when prerequisites are not met. |
| **Feature Quick-Start**<br>docs/source/features/sparse-attention.md | Added Python and YAML examples for optional GVR Top-K acceleration with index_topk=2048 and enable_heuristic_topk=True, plus description of opt-in behavior and runtime dispatcher behavior. |
| **Technical Deep-Dive**<br>docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md | New comprehensive blog post explaining GVR algorithm phases (guess, verify, candidate collection, exact refinement), TensorRT-LLM integration points, configuration controls, fallback conditions, operator-level and end-to-end performance/accuracy results, and reproduction instructions. |
| **Cross-Reference**<br>docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md | Added follow-up note linking to blog21 and describing GVR Top-K as a further optimization using temporal correlation with hardware-aware dispatch between GVR and radix paths. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~7 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][docs] add GVR Top-K technical blog' is clear and directly related to the main change: adding a technical blog documenting GVR Top-K. It follows the repository's title template and concisely summarizes the primary contribution. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description covers all required template sections: it explains the issue/solution, details test coverage (documentation-only), and includes a complete PR checklist review. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


Inline comments:
In
`@docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md`:
- Around line 470-485: Replace the deprecated flag --extra_llm_api_options with
the canonical --config in the trtllm-bench example; update the CLI example that
invokes trtllm-bench (the command shown with model
deepseek-ai/DeepSeek-V3.2-Exp) to use --config <config.yml> instead of
--extra_llm_api_options <config.yml> so it follows the docs convention for
config-file flags.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d35394ae-02b0-4761-bb4c-139aa8c324b9

📥 Commits

Reviewing files that changed from the base of the PR and between 8311568 and 9bfa633.

⛔ Files ignored due to path filters (11)
  • docs/source/blogs/media/tech_blog21_algorithm_flow.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_dispatch_logic.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_e2e_tept8_osl1k_bar.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_hit_ratio.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_indexer_topk.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_phase4_detail.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_real_data_bars.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_secant_method.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_synthetic_scaling.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_temporal_correlation_diagram.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_topk_dispatch_flowchart.png is excluded by !**/*.png
📒 Files selected for processing (4)
  • docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md
  • docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md
  • docs/source/features/sparse-attention.md
  • tensorrt_llm/llmapi/llm_args.py

Use the canonical trtllm-bench --config flag in the GVR Top-K blog and include the pre-commit yapf formatting for the LLM API field description.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
@longcheng-nv
Collaborator Author

Addressed CodeRabbit's comment by replacing --extra_llm_api_options <config.yml> with the canonical --config <config.yml> in the trtllm-bench example. Also included the CI pre-commit/yapf formatting update for llm_args.py.

/bot run --disable-fail-fast

@longcheng-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46588 [ run ] triggered by Bot. Commit: c839365 Link to invocation

@juney-nvidia
Collaborator

/bot --help

@github-actions

github-actions Bot commented May 3, 2026

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.
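For instance, several of the options documented above can be combined in one invocation (the values shown are the examples from the help text itself):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" --stage-list "*PerfSanity*"
```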

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@juney-nvidia
Collaborator

/bot kill

Collaborator

@juney-nvidia juney-nvidia left a comment


LGTM

@tensorrt-cicd
Collaborator

PR_Github #46590 [ kill ] triggered by Bot. Commit: c839365 Link to invocation

@juney-nvidia
Collaborator

/bot skip --comment "No need to run full CI"

@tensorrt-cicd
Collaborator

PR_Github #46590 [ kill ] completed with state SUCCESS. Commit: c839365
Successfully killed previous jobs for commit c839365

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46592 [ skip ] triggered by Bot. Commit: c839365 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46592 [ skip ] completed with state SUCCESS. Commit: c839365
Skipping testing for commit c839365

Link to invocation

@longcheng-nv
Collaborator Author

Hi @hchings @chang-l @venkywonka, gentle ping for review when you have a chance.

Current status:

  • GitHub checks are green: DCO, pre-commit, PR title, checklist, and base freshness all pass.
  • Blossom CI is green (Bot Pipeline Skipped: SUCCESS, Release Check: SUCCESS) after @juney-nvidia skipped full CI for this docs-only PR.
  • CodeRabbit's actionable comment has been addressed (trtllm-bench example now uses --config).

What this PR does:

  • Adds Tech Blog 21 for Guess-Verify-Refine (GVR) Top-K, covering the motivation, algorithm intuition, TensorRT-LLM integration, enablement path, fallback behavior, and performance/accuracy results.
  • Adds the blog media assets and links the existing DeepSeek-V3.2 Top-K discussion in Tech Blog 15 to the new GVR article.
  • Updates sparse attention docs and the LLM API field description so enable_heuristic_topk is clearly documented as GVR Top-K.

Relationship to prior GVR Top-K PRs:

Thanks!

@juney-nvidia juney-nvidia merged commit 9dce3fc into NVIDIA:main May 3, 2026
8 checks passed
