[AIROCMLIR-707] Make attentionSweeps split_kv filter device-memory-aware#2366

Closed
bogdan-petkovic wants to merge 1 commit into ROCm:develop from bogdan-petkovic:bogdan-petkovic/attention-sweeps-device-memory-filter

Conversation

@bogdan-petkovic
Contributor

Motivation

Some attention sweep samples with large split_kv generate very high temporary storage pressure and are currently rejected as FAIL after entering the pipeline.
This PR adds an early sweep-side prefilter so memory-heavy splitKV cases can be skipped before expensive execution, reducing wasted sweep effort while keeping existing compiler/runtime behavior unchanged.

Technical Details

Updated mlir/utils/performance/attentionSweeps.py to add a device-memory-aware splitKV prefilter in sampling:

  • Added splitKV extra-storage estimator for sampled attention shapes.
  • Added default splitKV limit policy based on visible device memory:
    • deviceMem / 8, clamped to [1 GiB, 8 GiB]
    • fallback to 1.5 GiB when memory query is unavailable.
  • Added CLI control:
    • --splitkv-extra-bytes-limit (non-negative int64-validated override).

Refactored sample filtering to track reasons separately:

  • MAX_TOKENS filter
  • splitKV extra-storage filter
  • cumulative reporting across initial and refill sampling batches.

Added focused unit tests in mlir/utils/performance/tests/test_attentionSweeps.py:

  • limit policy behavior (clamp/fallback),
  • splitKV estimator behavior,
  • per-reason filter accounting.

No compiler/verifier logic was changed in this PR; scope is limited to sweep generation/filtering behavior.
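
The default limit policy described above (deviceMem / 8, clamped to [1 GiB, 8 GiB], with a 1.5 GiB fallback) can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `default_splitkv_extra_bytes_limit` and the convention of passing `None` when the memory query is unavailable are assumptions.

```python
from typing import Optional

GIB = 1024 ** 3  # bytes in one GiB


def default_splitkv_extra_bytes_limit(device_mem_bytes: Optional[int]) -> int:
    """Hypothetical helper: derive the splitKV extra-storage limit in bytes.

    device_mem_bytes is the visible device memory, or None when the
    memory query is unavailable (assumed convention for this sketch).
    """
    if device_mem_bytes is None or device_mem_bytes <= 0:
        # Fallback stated in the PR description: 1.5 GiB.
        return (3 * GIB) // 2
    # deviceMem / 8, clamped to the range [1 GiB, 8 GiB].
    return max(1 * GIB, min(8 * GIB, device_mem_bytes // 8))


if __name__ == "__main__":
    for mem in (None, 4 * GIB, 32 * GIB, 192 * GIB):
        print(mem, default_splitkv_extra_bytes_limit(mem))
```

Under this policy a 4 GiB device is clamped up to the 1 GiB floor, a 32 GiB device gets 4 GiB, and a 192 GiB device is clamped down to the 8 GiB ceiling.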

Test Plan

  • Run attentionSweeps.py through CI to validate end-to-end sweep behavior with the new splitKV prefilter.
  • Confirm that memory-heavy splitKV cases are filtered before expensive execution stages.

Test Result

  • CI attention-sweep run

Submission Checklist

Signed-off-by: bogdan-petkovic <bogdan.petkovic@htecgroup.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the attention sweep generator to proactively filter out sampled configurations with large split_kv settings that are likely to create excessive temporary-storage pressure, using a device-memory-aware default limit (with a CLI override). This keeps the compiler/runtime behavior unchanged while reducing wasted sweep effort on configurations likely to OOM or time out later in the pipeline.

Changes:

  • Add a splitKV extra temporary-storage estimator and use it as an early sampling-time filter.
  • Add a default splitKV limit policy derived from visible device VRAM (with clamping/fallback) and a --splitkv-extra-bytes-limit override.
  • Improve filtering accounting/reporting by tracking reasons (MAX_TOKENS vs splitKV extra-storage) and reporting cumulative totals across refill batches.
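
The per-reason accounting with cumulative totals across refill batches could look roughly like the sketch below. Everything here is illustrative: the function `filter_samples`, the sample fields `tokens` and `extra_bytes`, and the reason labels are assumptions, and the estimator is passed in as a callable because the PR's actual estimation formula is not shown on this page.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, int]  # hypothetical sample representation


def filter_samples(samples: Iterable[Sample],
                   max_tokens: int,
                   splitkv_limit: int,
                   estimate_extra_bytes: Callable[[Sample], int],
                   totals: Counter) -> List[Sample]:
    """Keep samples that pass both filters; count each rejection by reason.

    `totals` is shared across calls so counts accumulate over the
    initial batch and every refill batch.
    """
    kept = []
    for sample in samples:
        if sample["tokens"] > max_tokens:
            totals["MAX_TOKENS"] += 1
        elif estimate_extra_bytes(sample) > splitkv_limit:
            totals["splitKV extra-storage"] += 1
        else:
            kept.append(sample)
    return kept


# Usage: one shared Counter across the initial and refill batches.
totals = Counter()
estimator = lambda s: s["extra_bytes"]  # stand-in for the real estimator
initial = [{"tokens": 10, "extra_bytes": 5}, {"tokens": 100, "extra_bytes": 5}]
refill = [{"tokens": 10, "extra_bytes": 50}]
kept = filter_samples(initial, 64, 16, estimator, totals)
kept += filter_samples(refill, 64, 16, estimator, totals)
print(len(kept), dict(totals))
```

Each rejected sample is attributed to the first filter it fails, so the per-reason counts sum to the total number of filtered samples.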


Comment on lines +458 to +464
parser.add_argument(
    '--splitkv-extra-bytes-limit',
    type=_parse_nonnegative_int64,
    default=None,
    help=("Max allowed estimated splitKV extra temporary storage (bytes). "
          "If unset, a device-based default is used (deviceMem/8 clamped to "
          "[1 GiB, 8 GiB], fallback 1.5 GiB)."))
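
The body of `_parse_nonnegative_int64` is not shown in this excerpt. A validator of that kind, used as an argparse `type=` callable, might be written as in the sketch below; the name `parse_nonnegative_int64` and the error messages are assumptions for illustration.

```python
import argparse

INT64_MAX = 2 ** 63 - 1


def parse_nonnegative_int64(text: str) -> int:
    """Hypothetical argparse type: accept integers in [0, 2**63 - 1]."""
    try:
        value = int(text, 10)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {text!r}")
    if value < 0 or value > INT64_MAX:
        raise argparse.ArgumentTypeError(
            f"value must be in [0, {INT64_MAX}], got {value}")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--splitkv-extra-bytes-limit",
                    type=parse_nonnegative_int64,
                    default=None)

# Explicit override of 1 GiB; leaving the flag unset yields None, which
# the caller would replace with the device-based default.
args = parser.parse_args(["--splitkv-extra-bytes-limit", "1073741824"])
print(args.splitkv_extra_bytes_limit)
```

Raising `argparse.ArgumentTypeError` lets argparse report the problem as a normal usage error instead of a traceback.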