feat(ascend): add Ascend framework layer — runtime, type mapping, bui…#46
Merged
Force-pushed from fb9f42f to 62fb25a.
Per-platform author comments: nv | metax | iluvatar | cambricon | moore | ascend
Force-pushed from 80acc8b to 7628b2f.
voltjia requested changes (Apr 13, 2026).
zhangyue207 pushed a commit that referenced this pull request (Apr 13, 2026).
- Rename `toAclDtype` → `ToAclDtype`, `isIntegerDtype` → `IsIntegerDtype` (Google C++ Style Guide PascalCase).
- Reorder `switch` cases in `ToAclDtype` to match the `DataType` enum definition.
- Simplify the `device_.h` include to `#include "device.h"`.
- Add Markdown backticks to code references in comments and help messages.
- Add blank lines before `return`/`if` per the CONTRIBUTING.md Python style rules.
- Reorder pybind11-generated params: `Handle` (`stream`) before `Config` (`implementation_index`), matching the `Operator::call` signature.
- Rename `Matmul` → `MatMul` (ONNX convention), params → `input`/`other`/`out`; remove `trans_a`/`trans_b` (use `Gemm` for transposed matmul).
- Rename `AddRmsNorm` params: `x1`/`x2`/`gamma` → `input`/`other`/`weight`, `y_out`/`x_out` → `out`/`rstd_out` (PyTorch conventions).
- Rename `skip_unsupported_dtype` → `skip_unsupported_dtypes`.
- Replace `get_npu_stream` with a generic `get_stream(device)` using `torch.accelerator.current_stream` with device-specific fallbacks.
- Reorder `_PLATFORM_TO_TORCH_DEVICE` with `nvidia` first.
…ld integration

Add Ascend platform scaffolding:

- `device_.h`: `DeviceEnabled<kAscend>` specialization
- `data_type_.h`: `toAclDtype()`, `isIntegerDtype()`
- `common.h`: `buildAclTensor()` with optional transpose
- `workspace_pool_.h`: stream-keyed workspace allocator
- `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset)
- 5 new operator base classes (`AddRmsNorm`, `FlashAttention`, `Matmul`, `ReshapeAndCache`, `RotaryEmbedding`)

Integrate into the CMake build system, Python binding generation (stream + optional tensor support), and the examples runtime API.
…emove missing include

- Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to match the generic 4-arg / 3-arg calling convention used by the examples.
- Assert the `aclrtMalloc` return value in `WorkspacePool::ensure()`.
- Remove the `ascend/gemm/kernel.h` include from `runtime_api.h` (the file does not exist until the kernels commit).
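The arity adaptation can be sketched with stand-in functions. The real ACL signatures are `aclrtMemcpy(dst, destMax, src, count, kind)` and `aclrtMemset(dst, destMax, value, count)`; the generic convention assumed below is `cudaMemcpy`/`cudaMemset`-shaped, and reusing `count` as the destination capacity is one plausible way the lambdas could bridge the gap (the fake functions and the exact adapter shape are illustrative, not the PR's code):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for the ACL runtime calls; they wrap libc and
// return 0 on success, mimicking the aclError convention.
int FakeAclrtMemcpy(void *dst, std::size_t dst_max, const void *src,
                    std::size_t count, int kind) {
    (void)dst_max;
    (void)kind;
    std::memcpy(dst, src, count);
    return 0;
}

int FakeAclrtMemset(void *dst, std::size_t dst_max, int value,
                    std::size_t count) {
    (void)dst_max;
    std::memset(dst, value, count);
    return 0;
}

// Lambdas adapt the 5-arg / 4-arg vendor calls to the generic 4-arg /
// 3-arg convention by folding the destination-capacity argument into
// `count`.
auto generic_memcpy = [](void *dst, const void *src, std::size_t count,
                         int kind) {
    return FakeAclrtMemcpy(dst, count, src, count, kind);
};

auto generic_memset = [](void *dst, int value, std::size_t count) {
    return FakeAclrtMemset(dst, count, value, count);
};
```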
- Add an Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`.
- Add a `get_npu_stream()` helper and NPU device detection in test utils.
- Add a `skip_unsupported_dtype` fixture for Ascend in conftest.
- Update `runtime_api.h` with the Ascend backend entry.
The `aclrtMalloc` call was the sole expression inside `assert()`, so it was compiled away in release builds (NDEBUG). This left the workspace buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR (161001) for any operation that requires workspace (e.g. alpha != 1.0).
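The failure mode generalizes to any side-effecting call placed inside `assert()`. A minimal sketch, using a fake allocator in place of `aclrtMalloc` (names are illustrative, not the real ACL API or the PR's `WorkspacePool` code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for aclrtMalloc: returns 0 on success.
int FakeMalloc(void **ptr, std::size_t size) {
    *ptr = std::malloc(size);
    return *ptr != nullptr ? 0 : 1;
}

// Buggy pattern: the allocation is the assert's only expression, so
// compiling with -DNDEBUG removes the call entirely and the caller
// receives a null buffer.
void *EnsureBuggy(std::size_t size) {
    void *buf = nullptr;
    assert(FakeMalloc(&buf, size) == 0);
    return buf;  // null in release builds
}

// Fixed pattern: perform the call unconditionally, assert only on the
// captured result.
void *EnsureFixed(std::size_t size) {
    void *buf = nullptr;
    int rc = FakeMalloc(&buf, size);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warning under NDEBUG
    return buf;
}
```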
`CudaCausalSoftmax` was missing `#include "cuda/runtime_utils.h"`, causing `RuntimeUtils` to be undefined.

Drop `std::forward` from the `Operator::make` nested lambda: NVCC instantiates the body during SFINAE invocability checks even inside `if constexpr` false branches, causing template resolution failures. All operator constructors take parameters by value, so passing lvalues has identical semantics.
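The equivalence argument can be sketched as follows. `Op` and `MakeWithoutForward` are illustrative stand-ins, not the project's actual `Operator::make`; the point is only that when constructors take parameters by value, forwarding buys at most one saved move:

```cpp
#include <memory>
#include <string>
#include <utility>

// Toy operator whose constructor takes its parameter by value, as the
// commit message says all operator constructors do.
struct Op {
    std::string name;
    explicit Op(std::string n) : name(std::move(n)) {}
};

// Factory without std::forward: args are taken by value and passed as
// lvalues. This sidesteps the NVCC issue of instantiating the
// std::forward call during SFINAE checks, at the cost of one extra
// move/copy versus perfect forwarding.
template <typename T, typename... Args>
std::unique_ptr<T> MakeWithoutForward(Args... args) {
    return std::make_unique<T>(args...);
}
```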
Upgrade base image from `nvcr.io/nvidia/pytorch:24.10-py3` (CUDA 12.6) to `25.12-py3` (CUDA 13.1), aligning CI with the local dev environment. Restore `std::forward<Args>(args)...` in `Operator::make`, as the NVCC bug that required dropping it is fixed in the newer toolkit.
Narrowing `Tensor::Size` (`unsigned long`) to `int64_t` is an error on MetaX's clang-based compiler (`-Wc++11-narrowing`).
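A minimal sketch of the usual fix, assuming the narrowing occurred in a brace-initialization context (`Size` and `ToInt64` are stand-in names; the real `Tensor` layout is not shown in the commit message):

```cpp
#include <cstdint>

// Stand-in for Tensor::Size from the commit message.
using Size = unsigned long;

std::int64_t ToInt64(Size n) {
    // Brace initialization such as `std::int64_t{n}` is rejected by
    // clang-based compilers (-Wc++11-narrowing) because unsigned long
    // does not convert losslessly to int64_t. An explicit static_cast
    // marks the conversion as intentional.
    return static_cast<std::int64_t>(n);
}
```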
- Add blank lines between struct/class members per the style guide.
- Capitalize comments and use backtick syntax for code refs in `matmul.h`.
- Move `import re` to module level in `generate_wrappers.py`.
- Add blank lines before `for`/`return` per PEP 8 in `generate_wrappers.py`.
- Replace `-k npu` with `--devices ascend` in the CI config.
- Fix `ruff format` violations in `generate_wrappers.py` and `test_gemm.py`.
- Fix `ruff isort` violation: move `import re` into the stdlib group.
- Add backticks around identifiers in comments (`numel()`, `operator()`, `make()`, `torch_npu`, `uint16`/`uint32`/`uint64`).
- Add a missing blank line after the `if` block in `skip_unsupported_dtype`.
- Remove `.worktrees/` from the project `.gitignore` (it belongs in the global gitignore).
…peAndCache`, `RotaryEmbedding`
The codegen script `generate_wrappers.py` uses `_snake_to_pascal()` to derive the class name from the filename: `matmul` → `Matmul`, but the class was renamed to `MatMul` (ONNX convention). Renaming the file to `mat_mul.h` makes `_snake_to_pascal("mat_mul")` → `MatMul`, fixing the `IndexError: list index out of range` build failure.
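The filename-to-class mapping can be illustrated with a plausible conversion helper. This is a sketch of the behavior the commit message describes, not the actual `_snake_to_pascal` implementation from `generate_wrappers.py`:

```python
def snake_to_pascal(name: str) -> str:
    """Convert a snake_case filename stem to PascalCase.

    Splits on underscores and capitalizes each segment, so the
    underscore placement in the filename controls the interior
    capitalization of the generated class name.
    """
    return "".join(part.capitalize() for part in name.split("_"))


# The rename in this commit hinges on this distinction:
#   "matmul"  -> "Matmul"   (no underscore, single capitalized word)
#   "mat_mul" -> "MatMul"   (underscore yields the ONNX-style name)
```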
The CI image has not been rebuilt with the 25.12 base yet, so NVCC 12.6 (in `24.10-py3`) still instantiates the `std::forward` call inside `if constexpr` false branches. Drop `std::forward`; all operator constructors take parameters by value, so passing lvalues is equivalent.
This reverts commit bf9e4b1.
Force-pushed from bf9e4b1 to 7398f9f.
voltjia approved these changes (Apr 14, 2026).