[#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by wojciech-wais · Pull Request #11922 · NVIDIA/TensorRT-LLM

wojciech-wais · 2026-03-04T21:10:55Z

In disaggregated serving (prefill mode), getNumFreeBlocks() can momentarily exceed getMaxNumBlocks(), causing getNumAllocatedBlocks() to return negative values. This propagates through KV cache stats as negative usedNumBlocks, which is invalid for metrics consumers.

Clamp the result of getNumAllocatedBlocks() to a minimum of zero using std::max to prevent negative block counts from being reported.

Fixes #11879

Summary by CodeRabbit

Bug Fixes
- Fixed a potential underflow issue in KV cache block allocation that could cause incorrect memory management.
Chores
- Updated copyright year.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…tats In disaggregated serving (prefill mode), getNumFreeBlocks() can momentarily exceed getMaxNumBlocks(), causing getNumAllocatedBlocks() to return negative values. This propagates through KV cache stats as negative usedNumBlocks, which is invalid for metrics consumers. Clamp the result of getNumAllocatedBlocks() to a minimum of zero using std::max to prevent negative block counts from being reported. Fixes NVIDIA#11879 Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>

coderabbitai · 2026-03-04T21:13:51Z

📝 Walkthrough

Walkthrough

A defensive programming fix to the KV cache manager that prevents negative block count calculations by clamping the result to a minimum value of 0 using std::max. Also updates the copyright year from 2022-2024 to 2022-2026.

Changes

Cohort / File(s)	Summary
KV Cache Block Underflow Protection `cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`	Adds `std::max(SizeType32{0}, ...)` guard to the allocated blocks calculation to prevent negative values in disagg/prefill mode. Updates copyright year to 2026.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The PR description lacks required sections. While it explains the issue, it is missing the 'Description' section header with details and omits 'Test Coverage' entirely, which are mandatory template sections.	Add a proper 'Description' section explaining the problem and solution, and fill in the 'Test Coverage' section listing the relevant tests that validate this fix (e.g., KV cache block allocation tests).

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: clamping usedNumBlocks to non-negative in KV cache stats, directly addressing the core fix in the changeset.
Linked Issues check	✅ Passed	The code change directly addresses issue `#11879` by clamping getNumAllocatedBlocks() to zero, preventing negative usedNumBlocks from being reported in metrics.
Out of Scope Changes check	✅ Passed	All changes are scoped to the fix: clamping allocated blocks calculation and updating copyright year. No unrelated modifications were introduced.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

eopXD

Looks good. Thank you!

eopXD · 2026-03-05T07:00:27Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-05T07:06:11Z

PR_Github #37825 [ run ] triggered by Bot. Commit: 6dec99a Link to invocation

eopXD

Sorry, revoking the approval. The fix should root to why getNumFreeBlocks is greater than getMaxNumBlocks which causes the negative number. This is not a good fix to the problem.

… root cause Add a warning log in WindowBlockManager::getNumFreeBlocks() when the primary free block count exceeds the total block count. This diagnostic helps identify the root cause of negative usedNumBlocks values in disaggregated/prefill mode. Root cause analysis: after swapMemoryPoolBlockOffset() in offload/onboard paths, getCacheLevel() returns the block's new physical location rather than its original queue membership. This causes mNumFreeBlocksPerLevel counters in the eviction policy to become desynchronized blocks claimed from one level get released to the other level, inflating the primary free counter beyond the actual primary pool size. The existing verifyQueueIntegrity() check detects this exact condition. The clamp in getNumAllocatedBlocks() (prior commit) prevents negative values from propagating to stats consumers while the accounting inconsistency is investigated further. Refs NVIDIA#11879 Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>

tensorrt-cicd · 2026-03-05T14:17:03Z

PR_Github #37825 [ run ] completed with state SUCCESS. Commit: 6dec99a
/LLM/main/L0_MergeRequest_PR pipeline #29286 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Link to invocation

…nt exceeding max Replace TLLM_LOG_WARNING with TLLM_CHECK_WITH_INFO assertion when numFreeBlocks exceeds getMaxNumBlocks() to catch block accounting inconsistencies in CI. Remove the now-unnecessary std::max clamp in getNumAllocatedBlocks() since the assertion ensures the invariant. Signed-off-by: Wojtek Marczenko <wojciech.marczenko@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>

karljang · 2026-03-25T23:43:12Z

@wojciech-wais,
Sorry. We’ve encountered a conflict here. Could you please resolve it? Thank you.

…e-stats Signed-off-by: Wojciech Wais <12673503+wojciech-wais@users.noreply.github.com>

wojciech-wais · 2026-03-26T07:00:17Z

@karljang
Sure.
Should be sorted out now.

karljang · 2026-03-26T15:27:32Z

/bot run

tensorrt-cicd · 2026-03-26T15:35:06Z

PR_Github #40459 [ run ] triggered by Bot. Commit: 9ad94b0 Link to invocation

tensorrt-cicd · 2026-03-26T21:25:58Z

PR_Github #40459 [ run ] completed with state FAILURE. Commit: 9ad94b0
/LLM/main/L0_MergeRequest_PR pipeline #31548 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

karljang · 2026-03-27T05:48:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-27T05:54:41Z

PR_Github #40487 [ run ] triggered by Bot. Commit: 9ad94b0 Link to invocation

tensorrt-cicd · 2026-03-27T09:41:40Z

PR_Github #40487 [ run ] completed with state SUCCESS. Commit: 9ad94b0
/LLM/main/L0_MergeRequest_PR pipeline #31575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

wojciech-wais · 2026-03-27T14:53:04Z

PR_Github #40487 [ run ] completed with state SUCCESS. Commit: 9ad94b0 /LLM/main/L0_MergeRequest_PR pipeline #31575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR

If you cannot view the failures, ask the CI triggerer to share details

Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

I've reviewed those tests failures and IMO they are not strictly related to my change.

…e-stats

karljang · 2026-04-03T19:33:53Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T19:39:35Z

PR_Github #41731 [ run ] triggered by Bot. Commit: f69f601 Link to invocation

tensorrt-cicd · 2026-04-04T02:59:01Z

PR_Github #41731 [ run ] completed with state SUCCESS. Commit: f69f601
/LLM/main/L0_MergeRequest_PR pipeline #32632 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

karljang · 2026-04-04T05:08:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-04T05:14:38Z

PR_Github #41801 [ run ] triggered by Bot. Commit: f69f601 Link to invocation

tensorrt-cicd · 2026-04-04T07:17:35Z

PR_Github #41801 [ run ] completed with state SUCCESS. Commit: f69f601
/LLM/main/L0_MergeRequest_PR pipeline #32695 completed with status: 'SUCCESS'

CI Report

Link to invocation

karljang · 2026-04-04T07:50:29Z

@eopXD
Could you please review this pr again? Thank you!

nvpohanh · 2026-04-07T05:27:33Z

@eopXD could you review this? thanks

…tats (NVIDIA#11922) Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com> Signed-off-by: Wojtek Marczenko <wojciech.marczenko@gmail.com> Signed-off-by: Wojciech Wais <12673503+wojciech-wais@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com> Signed-off-by: Doloxetine <youliyuan92@gmail.com>

wojciech-wais requested a review from a team as a code owner March 4, 2026 21:10

svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 4, 2026

eopXD approved these changes Mar 5, 2026

View reviewed changes

eopXD self-requested a review March 5, 2026 07:06

eopXD requested changes Mar 5, 2026

View reviewed changes

eopXD reviewed Mar 6, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp Outdated

wojciech-wais force-pushed the fix/clamp-usednumblocks-non-negative-kv-cache-stats branch from 9f136d4 to d2bf822 Compare March 11, 2026 07:23

Merge branch 'main' into fix/clamp-usednumblocks-non-negative-kv-cach…

9ad94b0

…e-stats Signed-off-by: Wojciech Wais <12673503+wojciech-wais@users.noreply.github.com>

Merge branch 'main' into fix/clamp-usednumblocks-non-negative-kv-cach…

f69f601

…e-stats

karljang requested a review from eopXD April 8, 2026 00:11

eopXD approved these changes Apr 8, 2026

View reviewed changes

eopXD enabled auto-merge (squash) April 8, 2026 09:19

eopXD merged commit 6776857 into NVIDIA:main Apr 8, 2026
5 checks passed

Conversation

wojciech-wais commented Mar 4, 2026 • edited by eopXD Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

eopXD left a comment

Choose a reason for hiding this comment

Uh oh!

eopXD commented Mar 5, 2026

Uh oh!

tensorrt-cicd commented Mar 5, 2026

Uh oh!

eopXD left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Mar 5, 2026

Uh oh!

Uh oh!

karljang commented Mar 25, 2026

Uh oh!

wojciech-wais commented Mar 26, 2026

Uh oh!

karljang commented Mar 26, 2026

Uh oh!

tensorrt-cicd commented Mar 26, 2026

Uh oh!

tensorrt-cicd commented Mar 26, 2026

Uh oh!

karljang commented Mar 27, 2026

Uh oh!

tensorrt-cicd commented Mar 27, 2026

Uh oh!

tensorrt-cicd commented Mar 27, 2026

Uh oh!

wojciech-wais commented Mar 27, 2026

Uh oh!

karljang commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 4, 2026

Uh oh!

karljang commented Apr 4, 2026

Uh oh!

tensorrt-cicd commented Apr 4, 2026

Uh oh!

tensorrt-cicd commented Apr 4, 2026

Uh oh!

karljang commented Apr 4, 2026

Uh oh!

nvpohanh commented Apr 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

wojciech-wais commented Mar 4, 2026 •

edited by eopXD

Loading

coderabbitai bot commented Mar 4, 2026 •

edited

Loading