
[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn#7210

Open
xiaoxiaohehe001 wants to merge 2 commits into PaddlePaddle:develop from xiaoxiaohehe001:fix_flash_mask_attn_sm90

Conversation

Collaborator

@xiaoxiaohehe001 xiaoxiaohehe001 commented Apr 7, 2026

Motivation

  • Fix the source of the batch_size derivation in DispatchFlashAttentionMask: derive it from cu_seq_k instead of cu_seq_q
  • Comment out the runtime PADDLE_ENFORCE shape checks

Background

In the SM90 flash mask attention operator, the input shapes of cu_seqlens_q and seq_lens_encoder may be preallocated along a max_batch dimension, so their actual valid length can be smaller than the first dimension of the tensor. In that case, deriving batch_size as cu_seq_q.dims()[0] - 1 yields an oversized value (equal to max_batch rather than the real batch size), making the batch dimension of the subsequent kernel launch incorrect.

cu_seq_k is always filled for the real batch size, so deriving batch_size from cu_seq_k instead yields the correct value.

Also, because the shape of cu_seqlens_q / seq_lens_encoder may be max_batch (larger than the actual batch size), existing assertions such as PADDLE_ENFORCE(batch_size == seq_len_encoder.dims()[0]) would fail spuriously, so those checks are commented out for now.

(screenshot attached: infoflow 2026-04-07 15-45-41)

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 7, 2026 07:44
@paddle-bot

paddle-bot bot commented Apr 7, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR targets the SM90 flash_mask_attention custom operator: it fixes a kernel-launch dimension error caused by deriving batch_size from the wrong tensor, and relaxes (removes) some runtime shape assertions so that inputs preallocated to max_batch with a smaller effective batch are accepted.

Changes:

  • Derive batch_size from cu_seqlens_k instead of cu_seqlens_q (avoids the oversized batch_size caused by max_batch preallocation).
  • Comment out / remove the runtime shape checks requiring seq_len_encoder and batch_size to match exactly (avoids spurious failures).


PADDLE_ENFORCE(k_token_num == v_input.dims()[0], "Unmatched shape");
PADDLE_ENFORCE(head_dim == 128, "Unmatched shape");
PADDLE_ENFORCE(batch_size > 0, "Unmatched shape");

Copilot AI Apr 7, 2026


Deriving batch_size from cu_seq_k here is reasonable, but with the consistency checks between seq_len_encoder / cu_seq_q and batch_size removed, if the real batch_size from cu_seq_k exceeds seq_len_encoder.dims()[0] or cu_seq_q.dims()[0] - 1, the kernel still launches with grid_dims.z = batch_size and accesses seq_len_encoder[bidb] / cu_seq_q[bidb], producing out-of-bounds reads and undefined behavior. Consider relaxing the original "==" checks to lower-bound checks (e.g. seq_len_encoder.dims()[0] >= batch_size and cu_seq_q.dims()[0] >= batch_size + 1) so that at least no OOB access can occur; if smaller shapes must be allowed, params.batch_size / the launch grid need to shrink accordingly.

Suggested change
PADDLE_ENFORCE(batch_size > 0, "Unmatched shape");
PADDLE_ENFORCE(batch_size > 0, "Unmatched shape");
PADDLE_ENFORCE(seq_len_encoder.dims()[0] >= batch_size, "Unmatched shape");
PADDLE_ENFORCE(cu_seq_q.dims()[0] >= batch_size + 1, "Unmatched shape");

Comment on lines +52 to 56
const int batch_size = cu_seq_k.dims()[0] - 1;

PADDLE_ENFORCE(k_token_num == v_input.dims()[0], "Unmatched shape");
PADDLE_ENFORCE(head_dim == 128, "Unmatched shape");
PADDLE_ENFORCE(batch_size > 0, "Unmatched shape");

Copilot AI Apr 7, 2026


This change is meant to support the case where cu_seqlens_q / seq_lens_encoder are preallocated to max_batch while the effective batch is smaller. Consider adding a unit test covering that case (e.g. construct cu_seq_q / seq_len_encoder with a first dimension larger than the actual batch_size, while cu_seq_k still has the real batch_size + 1 entries), so that a future change restoring the "==" assertions or deriving batch_size from cu_seq_q again would be caught as a regression.

Copilot generated this review using guidance from repository custom instructions.
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@18f0124). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7210   +/-   ##
==========================================
  Coverage           ?   73.25%           
==========================================
  Files              ?      376           
  Lines              ?    52949           
  Branches           ?     8264           
==========================================
  Hits               ?    38789           
  Misses             ?    11443           
  Partials           ?     2717           
Flag | Coverage Δ
GPU  | 73.25% <ø> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


