fix: preserve q/k/v quantizer mapping in AST attention patching#1307

Open
Brumbelow wants to merge 1 commit into NVIDIA:main from Brumbelow:fix/issue-1064-kv-attention-ast-ordering

Conversation


@Brumbelow Brumbelow commented Apr 21, 2026

Summary

Preserve q/k/v quantizer wiring when register_attention_for_kv_quant() patches AST-generated attention wrappers.

Motivation

The old AST patching logic relied on the breadth-first order of ast.walk(), which can visit nested and sequential attention matmuls in an order different from runtime evaluation. That could attach q/k/v quantizers to the wrong operands.
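The ordering difference can be demonstrated with the standard ast module alone. The snippet below is an illustrative sketch, not the plugin's actual code:

```python
import ast

# A fused attention expression: the inner matmul (q @ k^T) executes first,
# but breadth-first traversal visits the outer matmul first.
src = "out = torch.matmul(torch.matmul(q, k.transpose(-2, -1)), v)"
tree = ast.parse(src)

def is_matmul(node):
    return (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "matmul")

# ast.walk is breadth-first: parents are yielded before children.
walk_order = [n for n in ast.walk(tree) if is_matmul(n)]

# Depth-first post-order yields children before parents,
# matching the order the matmuls actually execute.
def post_order(node):
    for child in ast.iter_child_nodes(node):
        yield from post_order(child)
    yield node

eval_order = [n for n in post_order(tree) if is_matmul(n)]

# walk_order[0] is the outer (attn @ v) matmul, while eval_order[0] is the
# inner (q @ k^T) score matmul that actually runs first.
```

A patcher that assumes "first node seen = first matmul executed" therefore gets the mapping backwards under ast.walk for nested expressions like this one.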

Changes

  • switch attention matmul collection to deterministic post-order traversal
  • patch the first matmul as q/k score computation and the second as attention/value aggregation
  • keep the transpose wrapper only on the key operand for per-token KV-cache quantization
  • add sequential unit coverage for torch.matmul, torch.bmm, and @
  • assert that q, k, and v quantizers see the expected tensors while preserving forward outputs
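The patching idea behind the first two bullets can be sketched with a stdlib ast.NodeTransformer: calling generic_visit before patching gives post-order, so the score matmul is instrumented first and the attention/value matmul second. The quantizer attribute names (q_bmm_quantizer, k_bmm_quantizer, v_bmm_quantizer) and the rewrite itself are illustrative assumptions, not the plugin's actual implementation:

```python
import ast

def _wrap(expr, quantizer_attr):
    # Build `self.<quantizer_attr>(<expr>)` around an operand expression.
    return ast.Call(
        func=ast.Attribute(value=ast.Name(id="self", ctx=ast.Load()),
                           attr=quantizer_attr, ctx=ast.Load()),
        args=[expr], keywords=[])

class KVQuantPatcher(ast.NodeTransformer):
    """Illustrative sketch: instrument the two attention matmuls in
    execution (post-order) order."""
    def __init__(self):
        self.seen = 0  # matmul nodes patched so far

    def visit_Call(self, node):
        self.generic_visit(node)  # patch children first -> post-order
        if isinstance(node.func, ast.Attribute) and node.func.attr == "matmul":
            if self.seen == 0:    # first matmul: q @ k^T score computation
                node.args[0] = _wrap(node.args[0], "q_bmm_quantizer")
                node.args[1] = _wrap(node.args[1], "k_bmm_quantizer")
            elif self.seen == 1:  # second matmul: attn @ v aggregation
                node.args[1] = _wrap(node.args[1], "v_bmm_quantizer")
            self.seen += 1
        return node

tree = ast.parse("out = torch.matmul(torch.matmul(q, k_t), v)")
patched = ast.unparse(ast.fix_missing_locations(KVQuantPatcher().visit(tree)))
```

Because the counter advances in post-order, the q/k quantizers land on the inner score matmul and the v quantizer on the outer aggregation matmul, regardless of nesting depth.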

Testing

Run with:

  • python -m pytest tests/unit/torch/quantization/plugins/test_attention_quant.py
  • python -m pytest tests/unit/torch/quantization/test_quantize_replace.py
  • pre-commit run --all-files

Checklist

  • Backward compatible
  • Followed contribution guidelines; no copied code
  • Added tests
  • No docs changes (no API changes)

Additional information:
Closes #1064.

Summary by CodeRabbit

  • New Features

    • Enhanced attention quantization with improved operand instrumentation and more accurate quantizer application order.
    • Better determinism when identifying quantization targets within attention mechanisms.
  • Tests

    • Added comprehensive test coverage for attention quantization verification.
    • New parametrized tests validate quantizer behavior and ensure numerical correctness of quantized attention outputs across different attention implementations.

Signed-off-by: Andrew Brumbelow <andrewbrumbelow@gmail.com>
@Brumbelow Brumbelow requested a review from a team as a code owner April 21, 2026 02:43
@Brumbelow Brumbelow requested a review from sychen52 April 21, 2026 02:43

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

The changes fix quantizer wiring in attention mechanisms by replacing breadth-first AST traversal with depth-first post-order traversal for node collection, updating operand indexing logic via a new helper, and reordering which quantizers are applied to BMM and binary matmul operations. Tests validate correct quantizer invocation.

Changes

  • Attention Plugin Core Logic — modelopt/torch/quantization/plugins/attention.py
    Introduced collect_attention_nodes() for depth-first post-order AST traversal; added a get_operand_indices() helper to determine which operands to instrument; generalized transpose behavior via a transpose_quantizers collection; reordered quantizer application targets for the len(bmm_nodes) == 2 and len(bin_matmul_nodes) == 2 cases to patch the correct operand indices.
  • Attention Quantization Tests — tests/unit/torch/quantization/plugins/test_attention_quant.py
    Added three sequential attention modules (SequentialMatmulAttention, SequentialBMMAttention, SequentialBinMatmulAttention) that compute attention via explicit matmul/bmm/@ operations; introduced RecordingIdentityQuantizer to record cloned inputs; added the parametrized test test_kv_quant_sequential_attention_wiring, which validates that quantizers are invoked exactly once with the expected q, k, v operands.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 18.75%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (5 passed)
  • Title check — ✅ Passed: the title "fix: preserve q/k/v quantizer mapping in AST attention patching" clearly and specifically summarizes the main change.
  • Linked Issues check — ✅ Passed: the PR addresses issue #1064 by implementing deterministic AST traversal, preserving correct q/k/v quantizer wiring, and adding test coverage for sequential attention modules.
  • Out of Scope Changes check — ✅ Passed: all changes (AST patching logic, operand indexing, and corresponding test cases) are directly related to fixing the q/k/v quantizer mapping issue.
  • Security Anti-Patterns — ✅ Passed: review of attention.py and test_attention_quant.py found no anti-patterns (no eval/exec, torch.load with weights_only=False, numpy.load with allow_pickle=True, trust_remote_code=True, or # nosec suppressions).
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.



Development

Successfully merging this pull request may close these issues.

Bug for register_attention_for_kv_quant