[WIP] feat: add InfiniOps as optional kernel provider by chen2021673 · Pull Request #161 · InfiniTensor/InfiniTrain

chen2021673 · 2026-05-28T09:04:57Z

Summary

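Introduces InfiniOps as an optional kernel provider, enabled with USE_INFINIOPS=ON.
The plug-in point is Dispatcher::Call(): Dispatcher adds a whitelist hook in GetKernel. Keys that hit the whitelist are routed to the InfiniOps registry; keys that miss fall back to the default CUDA kernel.

The three higher-level operators linear, matmul, and outer no longer call GemmCuda directly; they now call Dispatcher::Instance().Call<void>({device.type(), "Gemm"}, ...). InfiniOps therefore only needs to provide Gemm once to transparently cover all three operators, with no per-operator wrappers.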
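The routing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `DeviceType`, `KernelKey`, `KernelFn`, the member names, and `MakeDemoDispatcher` are all hypothetical stand-ins, and the kernels simply return the name of the provider that served them.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical, simplified stand-ins; the real InfiniTrain types differ.
enum class DeviceType { kCPU, kCUDA };
using KernelKey = std::pair<DeviceType, std::string>;
using KernelFn = std::function<std::string()>;  // returns the serving provider's name

class Dispatcher {
public:
    // Default per-device registration (device + op name).
    void Register(const KernelKey &key, KernelFn fn) { kernels_[key] = std::move(fn); }
    // Provider registration, keyed by op name only.
    void RegisterProvider(const std::string &name, KernelFn fn) {
        provider_kernels_[name] = std::move(fn);
    }
    // Marks an op name as eligible for provider routing.
    void Allow(const std::string &name) { whitelist_.insert(name); }

    // Whitelisted names that the provider implements are routed to it;
    // everything else falls back to the per-device table.
    const KernelFn &GetKernel(const KernelKey &key) const {
        if (whitelist_.count(key.second) != 0) {
            auto hit = provider_kernels_.find(key.second);
            if (hit != provider_kernels_.end()) { return hit->second; }
        }
        auto it = kernels_.find(key);
        if (it == kernels_.end()) { throw std::runtime_error("kernel not found: " + key.second); }
        return it->second;
    }

private:
    std::map<KernelKey, KernelFn> kernels_;
    std::map<std::string, KernelFn> provider_kernels_;
    std::unordered_set<std::string> whitelist_;
};

// Builds a dispatcher where only "Gemm" is whitelisted for the provider.
inline Dispatcher MakeDemoDispatcher() {
    Dispatcher d;
    d.Register({DeviceType::kCUDA, "Gemm"}, [] { return std::string("cuda"); });
    d.Register({DeviceType::kCUDA, "SubForward"}, [] { return std::string("cuda"); });
    d.RegisterProvider("Gemm", [] { return std::string("infiniops"); });
    d.Allow("Gemm");
    return d;
}
```

With this setup, a lookup for `Gemm` resolves to the provider kernel, while `SubForward` (not whitelisted) still resolves to the default CUDA entry.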

Changes

Build: adds a USE_INFINIOPS CMake option; third_party/InfiniOps is integrated as a submodule; when enabled, it is pulled in via add_subdirectory and linked as InfiniOps::infiniops.
Dispatcher: GetKernel gains an InfiniOps lookup hook; behavior is unchanged for keys that miss the whitelist.
Registry: adds InfiniOpsRegistry (a map separate from the main Dispatcher), the REGISTER_INFINIOPS_KERNEL macro, and a global whitelist (currently Gemm and AddForward).
Adapter: adapter.{h,cc} provides ToOpsDataType / ToOpsDevice / ToOpsTensor to bridge types and tensors; the dtype/device lookup tables are inline const std::unordered_map.
Handle: the CUDA backend injects the stream through a handle factory, so the adapter is not hard-wired to the CUDA runtime headers.
InfiniOps operator wrappers: gemm.cc and elementwise.cc (AddForward).
Existing CUDA kernel changes: common/gemm.{cu,cuh} renames GemmCuda to Gemm and registers it through REGISTER_KERNEL; all GEMM calls in linear/matmul/outer go through Dispatcher.
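As a rough illustration of the adapter's table-based dtype bridging, here is a minimal sketch assuming C++17 (for `inline` variables). The enums `DataType` and `OpsDataType`, their members, and `kDataTypeMap` are hypothetical; only the name `ToOpsDataType` and the use of an `inline const std::unordered_map` come from the description above.

```cpp
#include <cassert>
#include <stdexcept>
#include <unordered_map>

namespace demo {

// Hypothetical framework-side and provider-side dtype enums.
enum class DataType { kFLOAT32, kBFLOAT16, kINT64 };
enum class OpsDataType { kF32, kBF16 };

// Lookup table shared across translation units (C++17 inline variable).
// kINT64 is deliberately left unmapped to show the error path.
inline const std::unordered_map<DataType, OpsDataType> kDataTypeMap = {
    {DataType::kFLOAT32, OpsDataType::kF32},
    {DataType::kBFLOAT16, OpsDataType::kBF16},
};

// Translates a framework dtype to the provider's dtype, failing loudly
// on anything the table does not cover.
inline OpsDataType ToOpsDataType(DataType dtype) {
    auto it = kDataTypeMap.find(dtype);
    if (it == kDataTypeMap.end()) { throw std::invalid_argument("unsupported dtype"); }
    return it->second;
}

}  // namespace demo
```

Keeping the mapping in a table rather than a switch means adding a dtype is a one-line change, at the cost of a hash lookup per conversion.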

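Testing

In the current single-GPU tests, performance and precision are aligned.

Issue: InfiniTrain's native CUDA GEMM hard-codes CUBLAS_COMPUTE_32F in gemm.cu (line 63), whereas the InfiniOps NVIDIA GEMM originally used CUBLAS_COMPUTE_32F_FAST_TF32 for fp32. I manually modified the InfiniOps source here.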
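To see why the two compute types can produce different results, the host-side sketch below approximates TF32 input precision by zeroing the low 13 of FP32's 23 mantissa bits, leaving the 10 bits TF32 keeps. This is only an illustration: it does not call cuBLAS, real hardware may round rather than truncate, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Approximates TF32 input precision by clearing the low 13 mantissa bits
// of an FP32 value. Real hardware may round instead of truncating.
inline float TruncateToTf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // keep sign, 8 exponent bits, top 10 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

// Plain FP32 dot product: full-precision inputs, FP32 accumulation.
inline float DotFp32(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) { acc += a[i] * b[i]; }
    return acc;
}

// TF32-like dot product: inputs reduced to 10 mantissa bits, FP32 accumulation.
inline float DotTf32Like(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) { acc += TruncateToTf32(a[i]) * TruncateToTf32(b[i]); }
    return acc;
}
```

For an input such as 1 + 2^-12, the reduced-precision path sees exactly 1.0, so the two dot products disagree even for a single element. Differences of this kind are why the compute types need to match before comparing precision between the two GEMM paths.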

TODO: fix the compile warnings.

Wire InfiniOps in as a pluggable kernel provider keyed at the GEMM level: Dispatcher consults a per-key whitelist hook and routes registered ops to InfiniOps, falling back to the default CUDA kernel otherwise. linear, matmul and outer now invoke Gemm via Dispatcher rather than calling the cuBLAS wrapper directly, so InfiniOps Gemm transparently covers all three.

kilinchange · 2026-05-29T07:32:42Z

Why does this need its own registry instead of reusing InfiniTrain's existing one?

See the reply below.

kilinchange · 2026-05-29T07:36:24Z

The contents of this header are fine, but it should not be exposed as a public header under include. Put it in infini_train/src/kernels/common for now.

kilinchange · 2026-05-29T07:39:43Z

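There should be no extra branch for InfiniOps here; when the MetaX (沐曦) kernels were integrated, this part did not need to change.

The difference between MetaX and InfiniOps here is that InfiniOps needs to be decoupled from the device.

MetaX/MACA is one of InfiniTrain's device backends. Like CUDA and CPU, its kernels are registered per device + op, so it can keep using the existing REGISTER_KERNEL(device, kernel_name, kernel_func).

InfiniOps, however, wraps execution for several devices. It is not one specific device but a kernel provider / backend provider: internally it adapts to the NVIDIA, CPU, MUSA, Moore, and other backends based on the handle/tensor device. The framework therefore needs a provider-policy layer before or inside Dispatcher::GetKernel. That is also why infiniops_registry.h reimplements registration: the REGISTER_KERNEL interface cannot be reused as-is, because the device parameter has to become a backend parameter.

Alternatively, instead of a specialized InfiniOpsRegistry, Dispatcher could gain a generic REGISTER_KERNEL_BACKEND(backend, kernel_name, kernel_func), with InfiniOps calling REGISTER_KERNEL_BACKEND("InfiniOps", kernel_name, kernel_func). That might keep dispatcher.h a little tidier.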
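One way the proposed generic macro could look is sketched below. Only the macro name and its three-argument shape come from the comment above; `BackendRegistry`, its methods, the `KernelFn` signature, and the toy `AddForward` are assumptions made for illustration, and the real Dispatcher would differ.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy kernel signature; real kernels take tensors, not ints.
using KernelFn = int (*)(int, int);

// Hypothetical registry keyed by (backend name, kernel name).
class BackendRegistry {
public:
    static BackendRegistry &Instance() {
        static BackendRegistry registry;  // constructed on first use
        return registry;
    }
    // Returns false if the (backend, name) pair was already registered.
    bool Register(const std::string &backend, const std::string &name, KernelFn fn) {
        return kernels_.emplace(std::make_pair(backend, name), fn).second;
    }
    // Returns nullptr when no kernel is registered for the pair.
    KernelFn Find(const std::string &backend, const std::string &name) const {
        auto it = kernels_.find({backend, name});
        return it == kernels_.end() ? nullptr : it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, KernelFn> kernels_;
};

// Registers a kernel during static initialization of the translation unit.
#define REGISTER_KERNEL_BACKEND(backend, kernel_name, kernel_func) \
    static const bool g_registered_##kernel_name =                 \
        BackendRegistry::Instance().Register(backend, #kernel_name, kernel_func);

int AddForward(int a, int b) { return a + b; }
REGISTER_KERNEL_BACKEND("InfiniOps", AddForward, AddForward)
```

The function-local static in `Instance()` sidesteps static-initialization-order problems when registrations live in several translation units.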

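My understanding is that REGISTER_KERNEL(device, kernel_name, kernel_func) can be reused to register the InfiniOps kernels for each device. The framework layer does not need to care how the underlying operator library is implemented; it only registers the kernel for each device as needed, similar to: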
https://github.com/InfiniTensor/InfiniTensor/blob/fcd1fb0299e181f841918c4db4e5f13a18a2ae60/src/kernels/infiniop/element_wise.cc#L36

kilinchange · 2026-05-29T07:40:55Z

@@ -0,0 +1,25 @@
+#include "infini_train/include/core/kernel_provider/infiniops/adapter.h"


This isn't the header that corresponds to this file, is it?

Done, fixed the header order.

kilinchange · 2026-05-29T07:48:13Z

+
+} // namespace infini_train::kernel_provider::infiniops
+
+REGISTER_INFINIOPS_KERNEL(AddForward, infini_train::kernel_provider::infiniops::AddForward)


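If the separate registration mechanism for InfiniOps exists only to change the registration key, it doesn't seem necessary. Just register per platform.

So you mean register per platform even though the platform parameter isn't actually used?

The platform parameter here mainly lets the framework declare explicitly which platforms have a kernel registered (similar to a whitelist). If a platform has no kernel registered, the dispatcher can fail immediately.

Which device is actually used at execution time is still decided in the adapter layer: ToOpsDevice converts it before it is passed to InfiniOps.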
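The per-platform approach suggested here might look like the following sketch: the same provider kernel is registered under each supported device key, an unregistered platform fails at lookup, and the device is translated inside the kernel. All types and names except `ToOpsDevice` are hypothetical, and the toy kernel just returns the translated device.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

namespace per_device_demo {

enum class DeviceType { kCPU, kCUDA, kMACA };  // hypothetical framework devices
enum class OpsDevice { kCpu, kNvidia };        // hypothetical provider devices

// Adapter-layer translation from framework device to provider device.
inline OpsDevice ToOpsDevice(DeviceType device) {
    switch (device) {
    case DeviceType::kCPU: return OpsDevice::kCpu;
    case DeviceType::kCUDA: return OpsDevice::kNvidia;
    default: throw std::invalid_argument("device not supported by provider");
    }
}

using KernelFn = OpsDevice (*)(DeviceType);

// One provider kernel serves every registered device; it resolves the
// execution device itself through the adapter.
inline OpsDevice ProviderAddForward(DeviceType device) { return ToOpsDevice(device); }

class Dispatcher {
public:
    void Register(DeviceType device, const std::string &name, KernelFn fn) {
        kernels_[{device, name}] = fn;
    }
    // The device in the key acts as a whitelist: no entry, no kernel.
    KernelFn GetKernel(DeviceType device, const std::string &name) const {
        auto it = kernels_.find({device, name});
        if (it == kernels_.end()) { throw std::out_of_range("no kernel registered: " + name); }
        return it->second;
    }

private:
    std::map<std::pair<DeviceType, std::string>, KernelFn> kernels_;
};

}  // namespace per_device_demo
```

Registering `ProviderAddForward` under both `kCPU` and `kCUDA` keeps the existing device + op key unchanged, while a lookup for `kMACA` throws because nothing was registered for it.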

kilinchange · 2026-05-29T07:52:34Z

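This part is a necessary, general abstraction of the gemm interface and is unrelated to InfiniOps. Consider submitting it as a separate PR and merging it first.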

kilinchange · 2026-05-29T07:54:35Z

 // FIXME: Requires stride tracking in the Tensor class before this can be implemented
 // correctly. Currently always returns true as a placeholder. The contiguous guard in
-// elementwise.cu ensures non-contiguous tensors fall back to the broadcast path.
+// the elementwise provider ensures non-contiguous tensors fall back to the broadcast path.


This doesn't need to change, does it?

I think it should say elementwise provider; at the very least, elementwise.maca has the same logic.

kilinchange · 2026-05-29T07:55:50Z

    std::shared_ptr<Tensor> Contiguous();
    // FIXME: Currently returns true unconditionally. Requires stride tracking in the Tensor
-    // class before this can be implemented correctly. The guard in elementwise.cu ensures
+    // class before this can be implemented correctly. The elementwise broadcast guard ensures


No need to change this, right?

Shouldn't this not be limited to .cu?

chen2021673 · 2026-06-04T02:48:42Z

Split into two PRs:
#166
#167

chen2021673 force-pushed the infiniops_plug_in branch from 614baf6 to 91b309a Compare May 28, 2026 09:10

chen2021673 changed the title ~~feat: add InfiniOps as optional kernel provider~~ [WIP] feat: add InfiniOps as optional kernel provider May 28, 2026

chen2021673 force-pushed the infiniops_plug_in branch from 91b309a to c7e3c27 Compare May 28, 2026 09:23

chen2021673 force-pushed the infiniops_plug_in branch from c7e3c27 to 865b51c Compare May 29, 2026 06:56

kilinchange reviewed May 29, 2026


chen2021673 closed this Jun 4, 2026

chen2021673 mentioned this pull request Jun 4, 2026

[WIP] feat: add InfiniOps as optional kernel provider #167

Open

chen2021673 deleted the infiniops_plug_in branch June 5, 2026 01:55

