fix: bound InfiniLM paged decode by block table capacity by voltjia · Pull Request #740 · InfiniTensor/InfiniOps

voltjia · 2026-06-23T07:18:27Z

Summary

Clamp split-KV decode sequence length to the token capacity reachable through the supplied block table.
Apply the same bound to both warp and CTA split-KV decode helpers.

Motivation

Paged decode receives cache_lens plus a block table with max_num_blocks_per_seq entries per sequence. If a sequence's cache_lens value exceeds the number of tokens addressable through the table (max_num_blocks_per_seq * page_block_size), split-KV decode can index past the last block-table entry. The kernel should scan only the tokens reachable through the provided table.
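To make the failure mode concrete, the following host-side sketch counts how many block-table entries a scan over a given sequence length would touch and compares that with the table size. It is an illustration only: `BlocksTouched` and `ExceedsBlockTable` are hypothetical helper names that do not come from the PR, and the use of 64-bit integers is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): number of block-table entries that a
// scan over `seq_len` tokens touches when each block holds `page_block_size`
// tokens. This is a ceiling division.
inline std::int64_t BlocksTouched(std::int64_t seq_len,
                                  std::int64_t page_block_size) {
  assert(seq_len >= 0 && page_block_size > 0);
  return (seq_len + page_block_size - 1) / page_block_size;
}

// Hypothetical helper (not from the PR): true when scanning `seq_len` tokens
// would need more entries than the block table provides for one sequence.
inline bool ExceedsBlockTable(std::int64_t seq_len,
                              std::int64_t max_num_blocks_per_seq,
                              std::int64_t page_block_size) {
  assert(max_num_blocks_per_seq >= 0);
  return BlocksTouched(seq_len, page_block_size) > max_num_blocks_per_seq;
}
```

For example, with a page_block_size of 16 and a max_num_blocks_per_seq of 4, the table addresses 64 tokens. A sequence length of 100 would touch 7 entries, three more than the table holds, which is the situation this PR guards against.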

This PR keeps the change isolated to that bounds fix. It does not include the TP operator-cache fix from #738 or the random-sample cache-key fix from #739.

Closes #

Type of Change

feat — new feature / new operator / new platform
fix — bug fix
perf — performance improvement (no behavioral change)
refactor — code restructuring without behavior change
test — adding or fixing tests only
docs — documentation only
build / ci — build system / CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Platforms Affected

Smoke Test Result

clang-format -i src/native/cuda/ops/paged_attention_infinilm/kernel.cuh

git diff --check
# passed

ruff format --check and ruff check were not run because ruff is not installed in the remote host/container used for validation, and this PR does not modify Python files.

Test Results on Supported Platforms

Platform	Affected	Build / Smoke Result	Full Result / Notes
NVIDIA	Yes	C++ formatting and whitespace checks passed.	No standalone full-platform test was run for this isolated kernel bounds change.
Iluvatar	No	N/A	N/A
MetaX	No	N/A	N/A
Cambricon	No	N/A	N/A
Moore	No	N/A	N/A
Ascend	No	N/A	N/A

Benchmark / Performance Impact

No performance impact is expected. The change caps the scanned range only when cache_lens exceeds the reachable block-table capacity; otherwise the scanned range is unchanged.

Notes for Reviewers

The computed bound is max_num_blocks_per_seq * page_block_size.
Both split-KV helper variants use the bounded sequence length before computing shard ranges.
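The two notes above can be sketched on the host side as follows. This is a minimal sketch under stated assumptions, not the kernel code from the PR: `BoundSeqLen`, `ShardRange`, and `SplitKvShard` are hypothetical names, and the even ceiling-division split across shards is an assumed partitioning scheme chosen for illustration. The only parts taken from the PR are the bound itself (max_num_blocks_per_seq * page_block_size) and the ordering, in which the bound is applied before shard ranges are computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: clamp the sequence length to the token capacity
// reachable through the block table (bound taken from the PR notes).
inline std::int64_t BoundSeqLen(std::int64_t cache_len,
                                std::int64_t max_num_blocks_per_seq,
                                std::int64_t page_block_size) {
  assert(cache_len >= 0 && max_num_blocks_per_seq >= 0 && page_block_size > 0);
  const std::int64_t capacity = max_num_blocks_per_seq * page_block_size;
  return std::min(cache_len, capacity);
}

// Half-open token range [begin, end) handled by one split-KV shard.
struct ShardRange {
  std::int64_t begin;
  std::int64_t end;
};

// Hypothetical helper: assumed even split of the already-bounded sequence
// length across `num_splits` shards. Trailing shards may be empty.
inline ShardRange SplitKvShard(std::int64_t bounded_seq_len,
                               std::int64_t num_splits,
                               std::int64_t split_idx) {
  assert(bounded_seq_len >= 0 && num_splits > 0);
  assert(split_idx >= 0 && split_idx < num_splits);
  const std::int64_t shard_len =
      (bounded_seq_len + num_splits - 1) / num_splits;
  const std::int64_t begin = std::min(split_idx * shard_len, bounded_seq_len);
  const std::int64_t end = std::min(begin + shard_len, bounded_seq_len);
  return {begin, end};
}
```

Because every shard range is derived from the bounded length, no shard can end past the table's capacity. With 4 table entries of 16 tokens each and a cache length of 100, the bounded length is 64, so the last of 4 shards covers tokens 48 up to 64 rather than running on toward 100.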

fix: bound paged decode by block-table capacity

1a58f39

voltjia requested a review from a team June 23, 2026 07:18

voltjia merged commit 7cd674a into master Jun 23, 2026
16 of 18 checks passed

voltjia deleted the fix/paged-attention-seqlen-bound branch June 23, 2026 08:16
