
[None][docs] add GVR Top-K technical blog#13714

Merged
juney-nvidia merged 27 commits into NVIDIA:main from longcheng-nv:tmp/heuristic_topK_report
May 3, 2026

Conversation

@longcheng-nv
Collaborator

@longcheng-nv longcheng-nv commented May 3, 2026

Summary

This PR adds Tech Blog 21 for Guess-Verify-Refine (GVR) Top-K, the data-aware exact Top-K path for DeepSeek Sparse Attention (DSA) decode on Blackwell. The blog explains why decode-time indexer Top-K becomes a long-context bottleneck, how GVR uses temporal correlation from the previous decode step, how it fits into TensorRT-LLM, and how users can enable it.

  • New GVR Top-K technical blog: introduces the motivation, temporal-correlation observation, four-phase Guess/Verify/Refine algorithm, exactness story, TensorRT-LLM dispatch path, and B200 performance results.
  • Supporting media assets: adds Tech Blog 21 figures for the DSA indexer Top-K flow, temporal correlation, algorithm phases, dispatch logic, single-op results, and end-to-end TPOT reduction.
  • Cross-blog linkage: updates Tech Blog 15's DSA Top-K section to point readers to the newer GVR Top-K article.
  • User-facing docs/API wording: updates sparse attention docs and DeepSeekSparseAttentionConfig.enable_heuristic_topk field description so the LLM API reference explicitly names GVR Top-K, current index_topk=2048 support, and fallback behavior.
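The temporal-correlation observation summarized above can be made concrete with a small, self-contained sketch (synthetic scores only; this is not the TensorRT-LLM implementation): measure how much of one decode step's Top-K index set carries over to the next step.

```python
import random

def topk_indices(scores, k):
    """Indices of the k largest scores."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

def hit_ratio(prev_scores, curr_scores, k):
    """Fraction of the current step's Top-K already present in the previous step's Top-K."""
    return len(topk_indices(prev_scores, k) & topk_indices(curr_scores, k)) / k

random.seed(0)
n, k = 8192, 2048
# Synthetic indexer scores: a stable per-position component plus small
# per-step noise, mimicking the strong step-to-step correlation the blog reports.
base = [random.gauss(0.0, 1.0) for _ in range(n)]
step_t  = [b + random.gauss(0.0, 0.02) for b in base]
step_t1 = [b + random.gauss(0.0, 0.02) for b in base]

print(f"hit ratio: {hit_ratio(step_t, step_t1, k):.3f}")  # close to 1.0 when scores are correlated
```

A hit ratio near 1.0 is what lets GVR seed its threshold guess from the previous decode step.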

Related PRs

Key Files

| File | Description |
| --- | --- |
| docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md | Main GVR Top-K technical blog |
| docs/source/blogs/media/tech_blog21_*.png | Figures used by Tech Blog 21 |
| docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md | Cross-link from the original DSA Top-K discussion |
| docs/source/features/sparse-attention.md | Sparse attention docs with optional GVR Top-K enablement |
| tensorrt_llm/llmapi/llm_args.py | LLM API field description for enable_heuristic_topk |

API / User-Facing Docs

No new API is introduced in this PR. It documents the existing enable_heuristic_topk option and clarifies that:

  • GVR Top-K is opt-in through DeepSeekSparseAttentionConfig(enable_heuristic_topk=True) or the equivalent YAML config.
  • The current GVR fast path supports index_topk=2048 on Blackwell (SM100+).
  • Planned index_topk=512/1024 support is previewed for future long-sequence DSA workloads.
  • Unsupported configurations fall back to the production insertion/radix Top-K path.
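As a sketch of that opt-in path (the field names index_topk and enable_heuristic_topk come from this PR; the surrounding key nesting is an assumption and may differ from the actual sparse-attention docs):

```yaml
# Illustrative extra-options YAML; only the two commented fields come from this PR.
sparse_attention_config:
  algorithm: DSA
  index_topk: 2048             # the only size the current GVR fast path supports
  enable_heuristic_topk: true  # opt in; unsupported setups fall back to the production path
```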

Test plan

  • Documentation-only change.
  • Verified local relative links and image paths in the new blog.
  • Ran python3 -m py_compile tensorrt_llm/llmapi/llm_args.py.
  • Checked tensorrt_llm/llmapi/llm_args.py with IDE lints.

Author

Long Cheng 243710427+longcheng-nv@users.noreply.github.com

Made-with: gpt-5.5-high

…— Heuristic Top-K for Blackwell

Add technical blog post (EN) documenting the heuristic-guided Top-K
kernel for DeepSeek-V3.2 sparse attention on NVIDIA Blackwell GPUs.

Key contents:
- Temporal correlation analysis of indexer scores (RoPE/YaRN Toeplitz theory)
- Four-phase heuristic algorithm: preIdx stats → interpolation search →
  ballot-free collect → histogram+snap partition
- Single-CTA micro-kernel design with ~60 KB shared memory
- Kernel benchmarks: 1.32×–2.11× speedup on real SWE-Bench-64K data (B200)
- End-to-end accuracy validation on 5 benchmarks (no degradation)
- Integration into TensorRT-LLM via configurable dispatch path
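The four phases listed above can be sketched as a toy scalar model (a simplified illustration of the idea, not the single-CTA CUDA kernel): start from a guessed threshold, drive the survivor count f(T) = |{i : sᵢ ≥ T}| toward k with secant-style iterations, then collect the surviving candidates and refine to the exact Top-K.

```python
def survival_count(scores, t):
    """f(T): how many scores are >= T."""
    return sum(1 for s in scores if s >= t)

def gvr_topk(scores, k, t_guess, iters=6):
    """Toy Guess-Verify-Refine Top-K over a Python list (illustration only)."""
    lo, hi = min(scores), max(scores)
    # Guess: two probes seeded by a threshold carried over from the previous step.
    t0, t1 = t_guess, (t_guess + lo) / 2.0
    f0, f1 = survival_count(scores, t0), survival_count(scores, t1)
    # Verify: secant-style iterations that drive f(T) toward k.
    for _ in range(iters):
        if f1 == k or f1 == f0:
            break
        t_next = t1 + (k - f1) * (t1 - t0) / (f1 - f0)
        t_next = min(max(t_next, lo), hi)
        t0, f0 = t1, f1
        t1, f1 = t_next, survival_count(scores, t_next)
    # Collect: everything at or above the final threshold
    # (fall back to all positions if the threshold overshot).
    cand = [i for i, s in enumerate(scores) if s >= t1]
    if len(cand) < k:
        cand = list(range(len(scores)))
    # Refine: exact Top-K over the (usually small) candidate set,
    # so the result is exact regardless of how good the guess was.
    cand.sort(key=lambda i: scores[i], reverse=True)
    return set(cand[:k])
```

Because the refinement step runs an exact Top-K over a superset of the true Top-K, a poor guess only costs extra iterations, never correctness — consistent with the exactness story in the blog.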

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace inline LaTeX (\text{}, {,}, $..$ in tables) with plain Unicode
and text equivalents for correct rendering on GitHub Flavored Markdown.
Block math ($$...$$) with \text{} is kept as-is (renders correctly).

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…edundant scripts

- Wrap all tables in <div align="center"> for page-level centering
- Use :---: separator for cell-level centering
- Restore $...$ math in table cells and inline text where GitHub renders correctly
- Remove duplicate trtllm-eval script block from reproduction section

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
- Remove split "$...$B" patterns (B outside math block)
- Use plain text in table cells for maximum compatibility
- Fix inline α/≈ symbols with proper $\alpha\approx$ LaTeX
- Replace $\sim$ with ~ for inline approximations
- Wrap NUM_WARPS formula in code backticks
- Use $N = 8192$ instead of {,} formatting
- Ensure $I+1\approx 3$–$4$ renders correctly

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
GitHub fails to render "$X \approx A$–$B$" correctly (the second
$..$ block is parsed as standalone math). Replace all split-range
patterns with plain Unicode: "X ≈ A–B".

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
- Replace ^* with ^\ast in math (GitHub Markdown consumes * as italic)
- Replace $N > 200$K with plain N > 200K (K outside math block)
- Use A_{m} instead of A_m (protect subscript from Markdown)

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
GitHub-flavored Markdown can misinterpret ^* as italic markup
inside LaTeX math blocks. Using ^\ast avoids this ambiguity.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace LaTeX math ($\mathcal{N}$, $A_m$, etc.) with Unicode
equivalents (𝒩, Aₘ, etc.) in the data-source comparison table.
GitHub's math renderer is unreliable inside Markdown table cells;
Unicode characters render correctly everywhere.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…etics

- Regenerate Phase 2 secant method diagram with corrected geometry:
  T₂ now at the exact intersection of Secant 2 and f_target=3072
- Use S-shaped CDF survival curve for f(T) instead of exponential
- Place all labels in clear white space with dashed arrow leaders
- Align pmax with the secant 1 / f(T) curve intersection
- Increase font sizes for better readability
- Update ZH blog table with Unicode math (sync with EN version)

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Condense and refocus the Future Work around five key directions:
multi-CTA support for ultra-long sequences (N>200K), prefill-phase
analytical prediction without temporal history, cross-model
generalization (RocketKV, NSA), multi-batch / MTP>1 variable-length
unified tuning, and next-generation GPU architecture adaptation.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…o-action

Condense Acknowledgement section and add an invitation for the
community to contribute to TensorRT-LLM and the GPU inference
ecosystem.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Distinguish batch=1 (heavily tuned) from multi-batch (functionally
supported but not yet performance-optimized).

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Revert to the original AI-generated secant diagram from commit
12ba374, which the team preferred over the matplotlib replacement.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…cape

Expand the introduction to cover the broader sparse attention ecosystem
(DSA, NSA, MoBA, RocketKV, Quest, SAGE-KV) that relies on Top-K
selection, motivating kernel-level optimization as sequences grow into
100K+. Position DSA as the concrete case study while noting the
approach generalizes to any method with temporal Top-K correlation.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Keep the Chinese version as a local-only file, not pushed to remote.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…name to GVR

- Add GPU Top-K literature review paragraph (RadiK ICS24, Zois ADMS19,
  Zhang SC23, Key approximate Top-K 2024) to Introduction
- Clarify baseline is SC23 evolution on Blackwell by the same team
- Fix DSA formula citation to DeepSeek-V3 technical report
- Rename algorithm to "GVR Top-K" in complexity table

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…GVR/heuristic naming

- Rename "End-to-End Throughput" to "End-to-End Min-Latency Benchmark
  on B200", remove throughput rows from table, keep latency metrics
- Fix TOC entry to match updated section title
- Add "heuristic-guided approach" natural phrasing in Introduction
- Rename algorithm to "GVR Top-K" in complexity table

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Move green star from Secant②/f(T) curve intersection to the correct
position at Secant②/f_target=3072 intersection, with T₂ as the
corresponding x-coordinate.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
…m branding

- Update end-to-end min-latency benchmark with 4 independent trials on
  ISL=131K/OSL=32K, including mean and standard deviation
- Refine "GVR Top-K" branding across all technical diagrams (PNGs) to
  match manuscript terminology
- Improve algorithm flow diagram design with better alignment, larger
  fonts, and professional design
- Remove legacy "I=1-2" footer from algorithm flow diagram for cleaner
  presentation

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Update GVR Top-K documentation to match the merged Scheme X dispatcher, current index_topk support, and sparse attention enablement guidance.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Move the GVR Top-K article and its media assets to the next available tech blog number so it no longer conflicts with the existing blog19 DWDP article.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Replace temporary manuscript and fork links with the official arXiv DOI so the GVR Top-K blog is ready for external publication.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Use a technical-blog title that aligns with the file name while keeping the formal arXiv title in the references.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Clarify the public GVR Top-K enablement path, update the LLM API field description, and remove internal dispatcher terminology from user-facing docs.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
Mention upcoming DeepSeek V4-style long-sequence indexer Top-K configurations and the planned GVR Top-K support for index_topk 512 and 1024.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
@longcheng-nv longcheng-nv requested review from a team as code owners May 3, 2026 09:21
@longcheng-nv longcheng-nv requested review from chang-l and hchings May 3, 2026 09:21
@coderabbitai
Contributor

coderabbitai Bot commented May 3, 2026

📝 Walkthrough

Walkthrough

This PR adds documentation for the Guess-Verify-Refine (GVR) Top-K optimization feature for DeepSeek sparse attention on Blackwell GPUs, including a comprehensive new blog post explaining the algorithm and integration, updated feature documentation with configuration examples, and clarified API docstrings.

Changes

GVR Top-K Feature Documentation

| Layer / File(s) | Summary |
| --- | --- |
| **Configuration API**<br>tensorrt_llm/llmapi/llm_args.py | DeepSeekSparseAttentionConfig.enable_heuristic_topk docstring updated to describe GVR Top-K behavior, supported conditions (index_topk=2048 on Blackwell/SM100+), and fallback to the production Top-K path when prerequisites are not met. |
| **Feature Quick-Start**<br>docs/source/features/sparse-attention.md | Added Python and YAML examples for optional GVR Top-K acceleration with index_topk=2048 and enable_heuristic_topk=True, plus description of opt-in behavior and runtime dispatcher behavior. |
| **Technical Deep-Dive**<br>docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md | New comprehensive blog post explaining GVR algorithm phases (guess, verify, candidate collection, exact refinement), TensorRT-LLM integration points, configuration controls, fallback conditions, operator-level and end-to-end performance/accuracy results, and reproduction instructions. |
| **Cross-Reference**<br>docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md | Added follow-up note linking to blog21 and describing GVR Top-K as a further optimization using temporal correlation with hardware-aware dispatch between GVR and radix paths. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~7 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][docs] add GVR Top-K technical blog' is clear and directly related to the main change: adding a technical blog documenting GVR Top-K. It follows the repository's title template and concisely summarizes the primary contribution. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description covers all required template sections: it explains the issue/solution, details test coverage (documentation-only), and includes a complete PR checklist review. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


Inline comments:
In
`@docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md`:
- Around line 470-485: Replace the deprecated flag --extra_llm_api_options with
the canonical --config in the trtllm-bench example; update the CLI example that
invokes trtllm-bench (the command shown with model
deepseek-ai/DeepSeek-V3.2-Exp) to use --config <config.yml> instead of
--extra_llm_api_options <config.yml> so it follows the docs convention for
config-file flags.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d35394ae-02b0-4761-bb4c-139aa8c324b9

📥 Commits

Reviewing files that changed from the base of the PR and between 8311568 and 9bfa633.

⛔ Files ignored due to path filters (11)
  • docs/source/blogs/media/tech_blog21_algorithm_flow.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_dispatch_logic.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_e2e_tept8_osl1k_bar.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_hit_ratio.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_indexer_topk.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_phase4_detail.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_real_data_bars.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_secant_method.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_synthetic_scaling.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_temporal_correlation_diagram.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_topk_dispatch_flowchart.png is excluded by !**/*.png
📒 Files selected for processing (4)
  • docs/source/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.md
  • docs/source/blogs/tech_blog/blog21_Temporal_Correlation_Meets_Sparse_Attention.md
  • docs/source/features/sparse-attention.md
  • tensorrt_llm/llmapi/llm_args.py

Use the canonical trtllm-bench --config flag in the GVR Top-K blog and include the pre-commit yapf formatting for the LLM API field description.

Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
@longcheng-nv
Collaborator Author

Addressed CodeRabbit's comment by replacing --extra_llm_api_options <config.yml> with the canonical --config <config.yml> in the trtllm-bench example. Also included the CI pre-commit/yapf formatting update for llm_args.py.

/bot run --disable-fail-fast

@longcheng-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46588 [ run ] triggered by Bot. Commit: c839365 Link to invocation

@juney-nvidia
Collaborator

/bot --help

@github-actions

github-actions Bot commented May 3, 2026

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.
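For instance, several of the options documented above can be combined in one invocation (the values shown are the examples from the help text itself):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" --stage-list "*PerfSanity*"
```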

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@juney-nvidia
Collaborator

/bot kill

Collaborator

@juney-nvidia juney-nvidia left a comment


LGTM

@tensorrt-cicd
Collaborator

PR_Github #46590 [ kill ] triggered by Bot. Commit: c839365 Link to invocation

@juney-nvidia
Collaborator

/bot skip --comment "No need to run full CI"

@tensorrt-cicd
Collaborator

PR_Github #46590 [ kill ] completed with state SUCCESS. Commit: c839365
Successfully killed previous jobs for commit c839365

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46592 [ skip ] triggered by Bot. Commit: c839365 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46592 [ skip ] completed with state SUCCESS. Commit: c839365
Skipping testing for commit c839365

Link to invocation

@longcheng-nv
Collaborator Author

Hi @hchings @chang-l @venkywonka, gentle ping for review when you have a chance.

Current status:

  • GitHub checks are green: DCO, pre-commit, PR title, checklist, and base freshness all pass.
  • Blossom CI is green (Bot Pipeline Skipped: SUCCESS, Release Check: SUCCESS) after @juney-nvidia skipped full CI for this docs-only PR.
  • CodeRabbit's actionable comment has been addressed (trtllm-bench example now uses --config).

What this PR does:

  • Adds Tech Blog 21 for Guess-Verify-Refine (GVR) Top-K, covering the motivation, algorithm intuition, TensorRT-LLM integration, enablement path, fallback behavior, and performance/accuracy results.
  • Adds the blog media assets and links the existing DeepSeek-V3.2 Top-K discussion in Tech Blog 15 to the new GVR article.
  • Updates sparse attention docs and the LLM API field description so enable_heuristic_topk is clearly documented as GVR Top-K.

Relationship to prior GVR Top-K PRs:

Thanks!

@juney-nvidia juney-nvidia merged commit 9dce3fc into NVIDIA:main May 3, 2026
8 checks passed
