perf(build): batch generated dispatch builds #609

Merged
voltjia merged 1 commit into master from perf/generated-dispatch-batching
May 15, 2026

@voltjia commented May 15, 2026

Summary

  • Split generated dispatch implementation definitions across multiple generated source files.
  • Keep one shared generated_dispatch.h declaration header while batching per-operator dispatch definitions.
  • Mark the shared IndexToOffset header helper inline so generated dispatch shards can include the same implementation headers safely.
  • Add INFINIOPS_DISPATCH_BATCH_SIZE to tune generated dispatch source batch size.
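The batching described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/generate_wrappers.py`: the function name `batch_operators`, the operator names, and the way the batch size is passed in are all assumptions. Only the `generated_dispatch_<n>.cc` naming follows the generation smoke output in this PR.

```python
from typing import Dict, List


def batch_operators(operators: List[str], batch_size: int) -> Dict[str, List[str]]:
    """Map each generated shard file name to the operators it would define.

    Hypothetical helper: the real generator may group operators differently.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")

    shards: Dict[str, List[str]] = {}
    for start in range(0, len(operators), batch_size):
        index = start // batch_size
        # Shard names mirror the files listed in the generation smoke output.
        shards[f"generated_dispatch_{index}.cc"] = operators[start:start + batch_size]
    return shards


# Illustrative operator names; a batch size of 2 stands in for whatever
# value INFINIOPS_DISPATCH_BATCH_SIZE is set to.
shards = batch_operators(["add", "gemm", "rms_norm"], batch_size=2)
```

A smaller batch size yields more shards, so more translation units can compile in parallel, at the cost of re-parsing the shared headers once per shard.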

Motivation

The generated dispatch translation unit can become large as bindings grow. Splitting it into multiple generated source files lets the build system compile dispatch definitions in parallel and reduces single-translation-unit compiler pressure.

Closes # N/A

Type of Change

  • N/A feat — new feature / new operator / new platform
  • N/A fix — bug fix
  • perf — performance improvement
  • N/A refactor — code restructuring without behavior change
  • N/A test — adding or fixing tests only
  • N/A docs — documentation only
  • N/A build / ci — build system or CI configuration
  • N/A chore — tooling, formatting, or other non-code changes
  • N/A Breaking change

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

Direct profile, PYTEST_WORKERS=1.

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | 4151 passed, 1375 skipped in 312.73s | Full pytest tests/. Matches #604/#605/#606/#607/#608 pass/skip counts. |
| Iluvatar | Yes | 3651 passed, 375 skipped in 259.43s | Full pytest tests/. Matches #604/#605/#606/#607/#608 pass/skip counts. |
| MetaX | Yes | 5795 passed, 1447 skipped in 351.71s | Full pytest tests/. Matches #604/#605/#606/#607/#608 pass/skip counts. |
| Cambricon | Yes | 3073 passed, 3857 skipped in 891.89s | Full pytest tests/. Preserves #607's fix for the previous integer-input failures. |
| Moore | Yes | 300 failed, 5459 passed, 1483 skipped in 536.58s | Full pytest tests/. Same known pre-existing tests/test_gemm.py MUSA failures as #604/#605/#606/#607/#608. |
| Ascend | Yes | 3828 passed, 138 skipped in 509.84s; container exit code 137 after pytest | Full pytest tests/. No pytest failures in this run; same post-test exit-code behavior as #605/#606/#607/#608 and no regression versus #604. |

Compared with the most recently merged baseline PRs (#604, #605, #606, #607, and #608), this PR shows no regression in build status, collected test coverage, or failure count.

Full validation summaries
ruff:
1 file already formatted
All checks passed!

clang-format:
clang-format --dry-run --Werror src/common/generic_utils.h

generation smoke:
generated/bindings/generated_dispatch_0.cc
generated/bindings/generated_dispatch_1.cc

nvidia build=0 test=0
4151 passed, 1375 skipped in 312.73s

iluvatar build=0 test=0
3651 passed, 375 skipped in 259.43s

metax build=0 test=0
5795 passed, 1447 skipped in 351.71s

cambricon build=0 test=0
3073 passed, 3857 skipped in 891.89s

moore build=0 test=1
300 failed, 5459 passed, 1483 skipped in 536.58s

ascend build=0 test=137
3828 passed, 138 skipped in 509.84s

Benchmark / Performance Impact

Expected to improve build parallelism for generated dispatch bindings. Runtime behavior is unchanged.

Notes for Reviewers

This only changes generated binding source layout. generated_dispatch.h still declares the same entrypoints, while generated source files are split by batches of operators. IndexToOffset is made inline because generated dispatch shards can now include the same implementation headers in multiple translation units.
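To illustrate why the helper needs `inline`, here is a rough sketch of how two shards might be rendered. The `render_shard` function, the operator names, and the include path are hypothetical; the PR only states that the shards include the same implementation headers, and `src/common/generic_utils.h` is the header listed under clang-format.

```python
from typing import List


def render_shard(operators: List[str], impl_headers: List[str]) -> str:
    """Render the text of one generated dispatch shard (illustrative only)."""
    lines = ['#include "generated_dispatch.h"']
    # Every shard pulls in the same implementation headers.
    lines += [f'#include "{header}"' for header in impl_headers]
    lines.append("")
    lines += [
        f"// Dispatch definition for `{op}` would be emitted here."
        for op in operators
    ]
    return "\n".join(lines) + "\n"


# Assumed include path for the header that defines `IndexToOffset`.
shared_headers = ["common/generic_utils.h"]
shard_0 = render_shard(["add", "gemm"], shared_headers)
shard_1 = render_shard(["rms_norm"], shared_headers)
```

Both rendered shards include the same header, so a non-`inline` function defined in that header would be compiled into each translation unit and could fail at link time with a duplicate symbol. Marking it `inline` permits identical definitions across translation units.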


Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzz.
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit.
  • No stray merge commits from master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal.
  • No dead code, debug prints, or unrelated TODOs were added.
  • No unrelated formatting churn.
  • N/A Public API changes.

General Code Hygiene

  • Code is self-explanatory.
  • Modified files end with a newline.
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments are wrapped in backticks.
  • Comments are in English.
  • Comments are complete sentences.

C++ Specific

  • Code follows the existing C++ style.
  • clang-format was run on the modified C++ header.
  • N/A clang-tidy was not run for this small helper linkage fix.
  • N/A Operator parameter order.
  • No exceptions are thrown.
  • N/A Error and warning messages.
  • N/A Kernel file naming.
  • N/A Kernel/launcher separation.
  • N/A Constructor initializer list order.
  • Member/namespace spacing follows surrounding style.
  • N/A New operators.
  • N/A Raw new/delete.

Python Specific

  • Code follows the surrounding Python style.
  • ruff check scripts/generate_wrappers.py passed.
  • ruff format --check scripts/generate_wrappers.py passed.
  • N/A New comments.
  • No blank line rule violations were introduced.
  • N/A Docstrings.
  • N/A Type hints.

Testing

  • Full platform testing was run across NVIDIA, Iluvatar, MetaX, Cambricon, Moore, and Ascend.
  • N/A No new tests were added.
  • N/A No test parameterization changes.
  • N/A No new flaky tests.
  • Regression coverage: full-platform builds compile and link the generated dispatch shards.

Build, CI, and Tooling

  • Fresh package build passed on all supported platforms.
  • N/A compile_commands.json.
  • N/A Backend auto-detection.
  • N/A CUDA-like mutual exclusion.
  • N/A CI workflows are expected to validate formatting.
  • N/A Runtime dependencies.

Documentation

  • N/A User-facing documentation changes.
  • N/A New public utilities.
  • N/A Breaking changes.

Security and Safety

  • No secrets or personal data are committed.
  • N/A Third-party code.
  • N/A Unsafe pointer arithmetic.

@voltjia force-pushed the perf/generated-dispatch-batching branch from 1f7d42d to bbd9546 on May 15, 2026 08:45
@voltjia marked this pull request as ready for review on May 15, 2026 09:06
@voltjia requested review from a team, Ziminli and crapromer on May 15, 2026 09:06

voltjia commented May 15, 2026

@wooway777 for the initial review, @Ziminli for the final review.

Ziminli previously approved these changes on May 15, 2026
@voltjia force-pushed the perf/generated-dispatch-batching branch from bbd9546 to 17d7823 on May 15, 2026 09:23
@voltjia requested review from Ziminli and wooway777 on May 15, 2026 09:41
@voltjia merged commit 6dc0879 into master on May 15, 2026
4 checks passed
@voltjia deleted the perf/generated-dispatch-batching branch on May 15, 2026 10:52

3 participants