Skip to content

[#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats#11922

Merged
eopXD merged 5 commits intoNVIDIA:mainfrom
wojciech-wais:fix/clamp-usednumblocks-non-negative-kv-cache-stats
Apr 8, 2026
Merged

[#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats#11922
eopXD merged 5 commits intoNVIDIA:mainfrom
wojciech-wais:fix/clamp-usednumblocks-non-negative-kv-cache-stats

Conversation

@wojciech-wais
Copy link
Copy Markdown
Contributor

@wojciech-wais wojciech-wais commented Mar 4, 2026

In disaggregated serving (prefill mode), getNumFreeBlocks() can momentarily exceed getMaxNumBlocks(), causing getNumAllocatedBlocks() to return negative values. This propagates through KV cache stats as negative usedNumBlocks, which is invalid for metrics consumers.

Clamp the result of getNumAllocatedBlocks() to a minimum of zero using std::max to prevent negative block counts from being reported.

Fixes #11879

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a potential underflow issue in KV cache block allocation that could cause incorrect memory management.
  • Chores

    • Updated copyright year.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…tats

In disaggregated serving (prefill mode), getNumFreeBlocks() can
momentarily exceed getMaxNumBlocks(), causing getNumAllocatedBlocks()
to return negative values. This propagates through KV cache stats as
negative usedNumBlocks, which is invalid for metrics consumers.

Clamp the result of getNumAllocatedBlocks() to a minimum of zero
using std::max to prevent negative block counts from being reported.

Fixes NVIDIA#11879

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
@wojciech-wais wojciech-wais requested a review from a team as a code owner March 4, 2026 21:10
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

A defensive programming fix to the KV cache manager that prevents negative block count calculations by clamping the result to a minimum value of 0 using std::max. Also updates the copyright year from 2022-2024 to 2022-2026.

Changes

Cohort / File(s) Summary
KV Cache Block Underflow Protection
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
Adds std::max(SizeType32{0}, ...) guard to the allocated blocks calculation to prevent negative values in disagg/prefill mode. Updates copyright year to 2026.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description lacks required sections. While it explains the issue, it is missing the 'Description' section header with details and omits 'Test Coverage' entirely, which are mandatory template sections. Add a proper 'Description' section explaining the problem and solution, and fill in the 'Test Coverage' section listing the relevant tests that validate this fix (e.g., KV cache block allocation tests).
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: clamping usedNumBlocks to non-negative in KV cache stats, directly addressing the core fix in the changeset.
Linked Issues check ✅ Passed The code change directly addresses issue #11879 by clamping getNumAllocatedBlocks() to zero, preventing negative usedNumBlocks from being reported in metrics.
Out of Scope Changes check ✅ Passed All changes are scoped to the fix: clamping allocated blocks calculation and updating copyright year. No unrelated modifications were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 4, 2026
Copy link
Copy Markdown
Collaborator

@eopXD eopXD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you!

@eopXD
Copy link
Copy Markdown
Collaborator

eopXD commented Mar 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #37825 [ run ] triggered by Bot. Commit: 6dec99a Link to invocation

@eopXD eopXD self-requested a review March 5, 2026 07:06
Copy link
Copy Markdown
Collaborator

@eopXD eopXD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, revoking the approval. The fix should root to why getNumFreeBlocks is greater than getMaxNumBlocks which causes the negative number. This is not a good fix to the problem.

… root cause

Add a warning log in WindowBlockManager::getNumFreeBlocks() when the
primary free block count exceeds the total block count. This diagnostic
helps identify the root cause of negative usedNumBlocks values in
disaggregated/prefill mode.

Root cause analysis: after swapMemoryPoolBlockOffset() in
offload/onboard paths, getCacheLevel() returns the block's new
physical location rather than its original queue membership. This
causes mNumFreeBlocksPerLevel counters in the eviction policy to
become desynchronized blocks claimed from one level get released
to the other level, inflating the primary free counter beyond the
actual primary pool size. The existing verifyQueueIntegrity() check
detects this exact condition.

The clamp in getNumAllocatedBlocks() (prior commit) prevents negative
values from propagating to stats consumers while the accounting
inconsistency is investigated further.

Refs NVIDIA#11879

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #37825 [ run ] completed with state SUCCESS. Commit: 6dec99a
/LLM/main/L0_MergeRequest_PR pipeline #29286 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Link to invocation

Comment thread cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp Outdated
…nt exceeding max

Replace TLLM_LOG_WARNING with TLLM_CHECK_WITH_INFO assertion when
numFreeBlocks exceeds getMaxNumBlocks() to catch block accounting
inconsistencies in CI. Remove the now-unnecessary std::max clamp in
getNumAllocatedBlocks() since the assertion ensures the invariant.

Signed-off-by: Wojtek Marczenko <wojciech.marczenko@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
@wojciech-wais wojciech-wais force-pushed the fix/clamp-usednumblocks-non-negative-kv-cache-stats branch from 9f136d4 to d2bf822 Compare March 11, 2026 07:23
@karljang
Copy link
Copy Markdown
Collaborator

@wojciech-wais,
Sorry. We’ve encountered a conflict here. Could you please resolve it? Thank you.

…e-stats

Signed-off-by: Wojciech Wais <12673503+wojciech-wais@users.noreply.github.com>
@wojciech-wais
Copy link
Copy Markdown
Contributor Author

@karljang
Sure.
Should be sorted out now.

@karljang
Copy link
Copy Markdown
Collaborator

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40459 [ run ] triggered by Bot. Commit: 9ad94b0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40459 [ run ] completed with state FAILURE. Commit: 9ad94b0
/LLM/main/L0_MergeRequest_PR pipeline #31548 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@karljang
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40487 [ run ] triggered by Bot. Commit: 9ad94b0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40487 [ run ] completed with state SUCCESS. Commit: 9ad94b0
/LLM/main/L0_MergeRequest_PR pipeline #31575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@wojciech-wais
Copy link
Copy Markdown
Contributor Author

PR_Github #40487 [ run ] completed with state SUCCESS. Commit: 9ad94b0 /LLM/main/L0_MergeRequest_PR pipeline #31575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

I've reviewed those tests failures and IMO they are not strictly related to my change.

@karljang
Copy link
Copy Markdown
Collaborator

karljang commented Apr 3, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41731 [ run ] triggered by Bot. Commit: f69f601 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41731 [ run ] completed with state SUCCESS. Commit: f69f601
/LLM/main/L0_MergeRequest_PR pipeline #32632 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@karljang
Copy link
Copy Markdown
Collaborator

karljang commented Apr 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41801 [ run ] triggered by Bot. Commit: f69f601 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41801 [ run ] completed with state SUCCESS. Commit: f69f601
/LLM/main/L0_MergeRequest_PR pipeline #32695 completed with status: 'SUCCESS'

CI Report

Link to invocation

@karljang
Copy link
Copy Markdown
Collaborator

karljang commented Apr 4, 2026

@eopXD
Could you please review this pr again? Thank you!

@nvpohanh
Copy link
Copy Markdown
Collaborator

nvpohanh commented Apr 7, 2026

@eopXD could you review this? thanks

@karljang karljang requested a review from eopXD April 8, 2026 00:11
@eopXD eopXD enabled auto-merge (squash) April 8, 2026 09:19
@eopXD eopXD merged commit 6776857 into NVIDIA:main Apr 8, 2026
5 checks passed
Doloxetine pushed a commit to Doloxetine/TensorRT-LLM that referenced this pull request Apr 8, 2026
…tats (NVIDIA#11922)

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
Signed-off-by: Wojtek Marczenko <wojciech.marczenko@gmail.com>
Signed-off-by: Wojciech Wais <12673503+wojciech-wais@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
Signed-off-by: Doloxetine <youliyuan92@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: KvEvent Metrics, usedNumBlocks, can have negative block sizes in disagg/prefill mode

6 participants