fix: bound InfiniLM paged decode by block table capacity #740

Merged
voltjia merged 1 commit into master from fix/paged-attention-seqlen-bound on Jun 23, 2026

@voltjia (Collaborator) commented Jun 23, 2026

Summary

  • Clamp split-KV decode sequence length to the token capacity reachable through the supplied block table.
  • Apply the same bound to both warp and CTA split-KV decode helpers.

Motivation

Paged decode receives cache_lens plus a block table with max_num_blocks_per_seq entries per sequence. If a sequence length exceeds the number of tokens addressable through the table (max_num_blocks_per_seq * page_block_size), split-KV decode can iterate beyond the accessible block-table entries. The kernel should only scan tokens reachable through the provided table.
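
The bound described above can be sketched on the host side as follows. This is a minimal illustration, not the kernel code from this PR: the helper names `BoundedSeqLen` and `BlockIndexOf` are hypothetical, and the real helpers in kernel.cuh would be device functions with their own signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: parameter names mirror the PR description, not the
// actual kernel signature. Returns how many tokens decode may scan.
constexpr std::int64_t BoundedSeqLen(std::int64_t cache_len,
                                     std::int64_t max_num_blocks_per_seq,
                                     std::int64_t page_block_size) {
  // Number of tokens addressable through one sequence's block-table entries.
  const std::int64_t capacity = max_num_blocks_per_seq * page_block_size;
  return std::min(cache_len, capacity);
}

// Hypothetical helper: block-table entry that holds token `t`.
inline std::int64_t BlockIndexOf(std::int64_t t, std::int64_t page_block_size) {
  assert(page_block_size > 0);
  return t / page_block_size;
}

// 4 blocks of 16 tokens address 64 tokens, so a length of 100 is clamped.
static_assert(BoundedSeqLen(100, 4, 16) == 64, "clamped to table capacity");
// A length within the capacity is left unchanged.
static_assert(BoundedSeqLen(50, 4, 16) == 50, "unchanged when in range");
```

Without the clamp, token 99 in this example would map to block-table entry 6, which is outside a table with 4 entries; with the clamp, the last scanned token (63) maps to entry 3.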

This PR keeps the change isolated to that bounds fix. It does not include the TP operator-cache fix from #738 or the random-sample cache-key fix from #739.

Closes #

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system / CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Smoke Test Result

clang-format -i src/native/cuda/ops/paged_attention_infinilm/kernel.cuh

git diff --check
# passed

ruff format --check and ruff check were not run because ruff is not installed in the remote host/container used for validation, and this PR does not modify Python files.

Test Results on Supported Platforms

| Platform | Affected | Build / Smoke Result | Full Result / Notes |
| --- | --- | --- | --- |
| NVIDIA | Yes | C++ formatting and whitespace checks passed. | No standalone full-platform test was run for this isolated kernel bounds change. |
| Iluvatar | No | N/A | N/A |
| MetaX | No | N/A | N/A |
| Cambricon | No | N/A | N/A |
| Moore | No | N/A | N/A |
| Ascend | No | N/A | N/A |

Benchmark / Performance Impact

No expected performance impact. The change only caps the scanned range when cache_lens exceeds the reachable block-table capacity.

Notes for Reviewers

  • The computed bound is max_num_blocks_per_seq * page_block_size.
  • Both split-KV helper variants use the bounded sequence length before computing shard ranges.
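
The ordering in the second note (clamp first, then compute shard ranges) can be illustrated with a small host-side sketch. Everything here is an assumption made for illustration: the even ceil-division split, the `ShardRange` struct, and the function names are hypothetical and may differ from how the warp and CTA helpers actually partition the sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open token range [begin, end) scanned by one split.
struct ShardRange {
  std::int64_t begin;
  std::int64_t end;
};

// Hypothetical even split of `seq_len` tokens across `num_splits` shards.
inline ShardRange ShardRangeFor(std::int64_t seq_len, int num_splits,
                                int split_idx) {
  assert(num_splits > 0 && split_idx >= 0 && split_idx < num_splits);
  // Ceil division so that the shards together cover every token.
  const std::int64_t per_shard = (seq_len + num_splits - 1) / num_splits;
  const std::int64_t begin =
      std::min<std::int64_t>(split_idx * per_shard, seq_len);
  const std::int64_t end = std::min<std::int64_t>(begin + per_shard, seq_len);
  return {begin, end};
}

// Clamp the sequence length to the block-table capacity, then shard it.
inline ShardRange BoundedShardRange(std::int64_t cache_len,
                                    std::int64_t max_num_blocks_per_seq,
                                    std::int64_t page_block_size,
                                    int num_splits, int split_idx) {
  const std::int64_t capacity = max_num_blocks_per_seq * page_block_size;
  const std::int64_t bounded_len = std::min(cache_len, capacity);
  return ShardRangeFor(bounded_len, num_splits, split_idx);
}
```

With 4 blocks of 16 tokens, cache_len = 100, and 4 splits, the last shard in this sketch covers tokens [48, 64) instead of [75, 100), so every token it visits maps to a block-table entry below max_num_blocks_per_seq.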

@voltjia voltjia requested a review from a team June 23, 2026 07:18
@voltjia voltjia merged commit 7cd674a into master Jun 23, 2026
16 of 18 checks passed
@voltjia voltjia deleted the fix/paged-attention-seqlen-bound branch June 23, 2026 08:16