diff --git a/DEV.md b/DEV.md
index 407ea5c1f..77920c2bc 100644
--- a/DEV.md
+++ b/DEV.md
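@@ -6,12 +6,12 @@ Dear developers, thank you for contributing to the InfiniCore open-source project! This document will
 
 ### Project modules
 
+- infinicore: unified compute framework. Provides Python and C++ interfaces and supports multiple hardware platforms.
+- infinirt: unified low-level runtime library. Provides a C interface; depends on infini-utils.
+- infiniop: unified low-level operator library. Provides a C interface; depends on infinirt. Besides the C++ operator implementations, it also includes operators implemented with NineToothed (Triton), whose source files must be generated by a script before compilation. After installation, you can run the unit test scripts in `test/infiniop` to test it.
+- infiniccl: unified communication library. Provides a C interface; depends on infinirt.
 - infini-utils: utility code shared by all modules.
-- infinirt: runtime library; depends on infini-utils.
-- infiniop: operator library; depends on infinirt. Besides the C++ operator implementations, it also includes operators implemented with NineToothed (Triton), whose source files must be generated by a script before compilation. After installation, you can run the unit test scripts in `test/infiniop` to test it.
-- infiniccl: communication library; depends on infinirt.
 - utils-test: test code for the utility library; depends on infini-utils.
-- infiniop-test: operator library test framework code. Unlike the unit tests, it reads gguf test case files (see the [`test case documentation`](test/infiniop-test/README.md)). infiniop must be installed before use.
 - infiniccl-test: test code for the communication library; infiniccl must be installed before use.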
 
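 ### Directory structure
@@ -21,10 +21,18 @@ Dear developers, thank you for contributing to the InfiniCore open-source project! This document will
 ├── xmake/*.lua  # xmake build configuration for each platform, including platform-specific build steps
 │    
 ├── include/  # Public header directory; copied into the install directory on installation
-│   ├── infiniop/*.h  # InfiniOP operator library sub-headers
-│   ├── *.h  # Core module headers
+│   ├── infinicore/*.hpp  # InfiniCore compute library headers (C++)
+│   ├── infiniop/*.h  # InfiniOP operator library sub-headers (C)
+│   ├── *.h/.hpp  # Core module headers
 │ 
 ├── src/  # Source directory for each module, containing source files and non-public headers
+│   ├── infinicore/ # InfiniCore source directory
+│   │   ├── context/  # Tensor runtime / hardware context management sources
+│   │   ├── nn/  # Machine learning module sources
+│   │   ├── ops/  # Tensor operator sources
+│   │   ├── pybind/  # pybind interface sources
+│   │   ├── tensor/  # Tensor library sources
+│   │ 
 │   ├── infiniop/ # InfiniOP operator library source directory
 │   │   ├── devices/  # Common code for each device platform
 │   │   ├── ops/ # Operator implementations
@@ -35,15 +43,14 @@ Dear developers, thank you for contributing to the InfiniCore open-source project! This document will
 │   │   ├── elementwise/  # Common code for element-wise operators
 │   │   ├── *.h  # Core struct definitions
 │   │
-│   ├── infiniop-test/  # InfiniOP operator library test framework
 │   ├── infinirt/ # InfiniRT runtime library source directory
 │   ├── infiniccl/ # InfiniCCL collective communication library source directory
 │  
 ├── test/ # Test source directory
+│   ├── infinicore/ # InfiniCore test directory
+│   │       ├── ops/*.py     # Operator unit test scripts (depend on each platform's PyTorch)
 │   ├── infiniop/ # InfiniOP operator library unit test directory
 │   │       ├── *.py     # Unit test scripts (depend on each platform's PyTorch)
-│   ├── infiniop-test/
-│   │       ├── test_generate/ # Test case generation scripts for the operator library test framework
 │  
 ├── scripts/ # Script directory
 │   ├── install.py # Build and install script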
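@@ -64,13 +71,7 @@ Dear developers, thank you for contributing to the InfiniCore open-source project! This document will
 
 ### How to develop a new operator
 
-1. Design the operator interface according to the operator definition and add operator documentation to the [`InfiniCore documentation`](https://github.com/InfiniTensor/InfiniCore-Documentation). Submit a documentation PR.
-2. Add the operator header in `include/infiniop/` and include it in `include/infiniop.h`.
-3. Add an operator implementation directory under `src/infiniop/ops/` and create an `operator.cc` file in it that implements the interfaces declared in the header.
-4. Add platform-specific implementations in `src/infiniop/ops/[op]/[device]/`. Reuse common platform code (such as element-wise and reduction computation), and during development put code that can be reused in the future in the corresponding common code directory. For example, a CUDA kernel can be shared by multiple platforms; consider implementing it in a header and using it from multiple source files.
-5. Once the operator compiles and installs successfully, add a unit test script in `test/infiniop/` that compares correctness and performance against the PyTorch implementation. Test cases should cover the operator's common data types and shapes. After the tests pass, you can add the test to the `scripts/python_test.py` one-click test script (so that the GitHub automated tests also cover the operator).
-6. Add a test case script for the operator to the `test/infiniop-test/` operator test framework. The script should contain a class that builds gguf test cases for the operator and add a few random test cases in its main function. Verify that the random gguf test cases pass the framework's test program.
-7. Submit a code PR following the process.
+- If you want to develop a new operator in C++ and hardware-native languages, read the [infinicore::ops development guide](/src/infinicore/ops/README.md).
 
 ### C++ naming and coding conventions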
 
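@@ -138,6 +139,8 @@ Dear developers, thank you for contributing to the InfiniCore open-source project! This document will
     int getMaxValue() const;
     ```
 
+    Interfaces in InfiniCore that are aligned with torch use `snake_case`.
+
 4. Write const/volatile qualifiers before the type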
 
     ```c++
diff --git a/README.md b/README.md
index 67b5c807a..e00251949 100644
--- a/README.md
+++ b/README.md
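@@ -26,9 +26,17 @@ InfiniCore is a cross-platform unified programming toolkit that, for the functionality of different chip platforms
 
 API definitions and usage are described in the [`InfiniCore documentation`](https://github.com/InfiniTensor/InfiniCore-Documentation).
 
+## Dependencies
+
+- [Xmake](https://xmake.io/): cross-platform build tool used to compile the InfiniCore project.
+- [gcc-11](https://gcc.gnu.org/) or later, or [clang-16](https://clang.llvm.org/): base compiler; must support the C++17 standard.
+- [Python>=3.10](https://www.python.org/)
+  - [PyTorch](https://pytorch.org/): optional, used for comparison tests.
+- Toolkits for each hardware platform: refer to each vendor's official documentation (for example, the NVIDIA platform requires the CUDA Toolkit).
+
 ## Configuration and usage
 
-### Submodules
+### I. Clone the project
 
 Because the repository contains submodules, add `--recursive` or `--recurse-submodules` when cloning, for example: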
 
@@ -42,7 +50,27 @@ git clone --recursive https://github.com/InfiniTensor/InfiniCore.git
 git submodule update --init --recursive
 ```
 
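-### One-click installation
+If you need to develop NineToothed operators locally (that is, modify the NineToothed operator library), we recommend cloning the [NineToothed operator library](https://github.com/InfiniTensor/ntops) separately and installing it from the local copy: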
+
+```shell
+git clone https://github.com/InfiniTensor/ntops.git
+cd ntops
+pip install -e .
+```
+
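+### II. Build and install
+
+The InfiniCore project mainly consists of:
+
+1. Low-level C libraries (InfiniOP/InfiniRT/InfiniCCL): [`one-click installation`](#one-click-installation-of-the-low-level-libraries) | [`manual installation`](#manual-installation-of-the-low-level-libraries);
+2. InfiniCore C++ library: [`installation commands`](#2-install-the-c-library)
+3. InfiniCore Python package (depends on the [NineToothed operator library](https://github.com/InfiniTensor/ntops)): [`installation commands`](#3-install-the-python-package)
+
+The three must be built and installed in this order.
+
+#### 1. Install the low-level libraries
+
+##### One-click installation of the low-level libraries
 
 The `scripts/` directory provides the `install.py` installation script. Usage:
 
@@ -69,11 +97,17 @@ python scripts/install.py [XMAKE_CONFIG_FLAGS]
 | `--ninetoothed=[y\|n]`   | Whether to build the NineToothed implementations | n
 | `--ccl=[y\|n]`           | Whether to build the InfiniCCL communication library interface implementation | n
 
-### Manual installation
+##### Manual installation of the low-level libraries
 
 0. Generate NineToothed operators (optional)
 
-    See the [Using NineToothed](#using-ninetoothed) section.
+   - Clone and install the [NineToothed operator library](https://github.com/InfiniTensor/ntops).
+
+   - In the `InfiniCore` folder, run the following command to AOT-compile the NineToothed operators in the library:
+
+     ```shell
+     PYTHONPATH=${PYTHONPATH}:src python scripts/build_ntops.py
+     ```
 
 1. Project configuration
 
@@ -118,91 +152,60 @@ python scripts/install.py [XMAKE_CONFIG_FLAGS]
 
    Set the `INFINI_ROOT` and `LD_LIBRARY_PATH` environment variables as instructed by the output.
 
-### Run tests
-
-#### Run the Python operator tests
+#### 2. Install the C++ library
 
 ```shell
-python test/infiniop/[operator].py [--cpu | --nvidia | --cambricon | --ascend]
-```
-
-#### Run all Python operator tests with one command
-
-```shell
-python scripts/python_test.py [--cpu | --nvidia | --cambricon | --ascend]
+xmake build _infinicore
+xmake install _infinicore
 ```
 
-#### Operator test framework
-
-See the `test/infiniop-test` directory
-
-#### Communication library (InfiniCCL) tests
-
-Build (InfiniCCL must be installed first):
+#### 3. Install the Python package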
 
 ```shell
-xmake build infiniccl-test
+pip install .
 ```
 
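-Run the tests on the NVIDIA platform (all visible cards are used automatically):
+or
 
 ```shell
-infiniccl-test --nvidia
+pip install -e .
 ```
 
-### `infinicore` Python package
+Note: during development we recommend the `-e` option (`pip install -e .`), so that changes to `python/infinicore` take effect immediately, and changes to the C++ layer only require running `xmake build _infinicore && xmake install _infinicore` to take effect.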
 
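-#### Build
+### III. Run tests
 
-1. Perform the [manual installation](#manual-installation).
-2. Build and install the internal dependency library `_infinicore`:
+#### Run the InfiniCore Python operator interface tests
 
-```shell
-xmake build _infinicore
+```bash
+python test/infinicore/run.py --nvidia --verbose --bench
 ```
 
-#### Install
+Use `-h` to see more options.
 
-1. Install `_infinicore`:
+#### Run the InfiniOP operator tests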
 
 ```shell
-xmake install _infinicore
-```
-
-2. Install `infinicore`:
-
-```shell
-pip install .
+# Test a single operator
+python test/infiniop/[operator].py [--cpu | --nvidia | --cambricon | --ascend]
+# Test all operators
+python scripts/python_test.py [--cpu | --nvidia | --cambricon | --ascend]
 ```
 
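-Note: during development we recommend the `-e` option (`pip install -e .`), so that changes to `python/infinicore` take effect immediately, and changes to the C++ layer only require running `xmake build _infinicore && xmake install _infinicore` to take effect.
-
-### Using NineToothed
-
-[NineToothed](https://github.com/InfiniTensor/ninetoothed) is a domain-specific language (DSL) based on Triton that provides higher-level abstractions. Using NineToothed lowers the barrier to operator development and improves development efficiency.
-
-InfiniCore can already integrate operators implemented with NineToothed, but compilation of these implementations is disabled by default. To build the NineToothed implementations in the library, use `--ninetoothed=y` and complete the following preparation before running the one-click installation script:
+#### Communication library (InfiniCCL) tests
 
-1. Install NineToothed and the [NineToothed operator library](https://github.com/InfiniTensor/ntops):
+Build (the InfiniCCL library from the low-level libraries must be installed first):
 
 ```shell
-git clone https://github.com/InfiniTensor/ntops.git
-cd ntops
-pip install -e .
+xmake build infiniccl-test
 ```
 
-Note: when `ntops` is installed, `ninetoothed` is installed along with it as a dependency.
-
-2. In the `InfiniCore` folder, run the following command to AOT-compile the NineToothed operators in the library:
+Run the tests on the NVIDIA platform (all visible cards are used automatically):
 
 ```shell
-PYTHONPATH=${PYTHONPATH}:src python scripts/build_ntops.py
+infiniccl-test --nvidia
 ```
 
-Note: if NineToothed-related files have been modified and InfiniCore needs to be rebuilt, the command above must also be re-run to regenerate the operators.
-
-3. Follow the instructions above for [one-click installation](#one-click-installation) or [manual installation](#manual-installation).
-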
-
 ## How to contribute
 
 See the [`InfiniCore developer guide`](DEV.md).
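The manual-installation step in the README change above asks the user to set `INFINI_ROOT` and `LD_LIBRARY_PATH` as instructed by the installer output. A minimal sketch of a pre-flight check for those two variables follows; the helper name `missing_env_vars` is hypothetical and not part of InfiniCore, and the check only tests that the variables are non-empty, not that they point at a valid installation.

```python
import os
from typing import List, Mapping

# The two variables the README asks the user to set after installation.
REQUIRED_VARS = ("INFINI_ROOT", "LD_LIBRARY_PATH")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in `env`.

    Hypothetical helper for illustration; it does not validate the paths.
    """
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report which variables still need to be set in the current shell.
print("Missing variables:", missing_env_vars())
```

Running it before `xmake build _infinicore` or the test scripts gives an early hint when the shell has not been configured yet.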
diff --git a/src/infinicore/ops/README.md b/src/infinicore/ops/README.md
new file mode 100644
index 000000000..dff34d1e6
--- /dev/null
+++ b/src/infinicore/ops/README.md
@@ -0,0 +1,236 @@
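+# infinicore::ops development guide
+
+The infinicore::ops module contains the interfaces and implementations of all C++ operators in InfiniCore. External users can call operators through the C++ interfaces defined in `include/infinicore/ops/*OPNAME*/*OPNAME*.h`. Some operators are also exposed to the Python frontend through pybind.
+
+## Development guide
+
+### 1. Operator definition
+
+Create the header `include/infinicore/ops/*OPNAME*/*OPNAME*.h` and, based on the operator name, define the operator class and the external compute interfaces (both in-place and out-of-place modes). Operator names must be unique.
+
+An operator class mainly consists of:
+
+- A schema definition, which describes the form of the operator's input and output parameters.
+- An execute function, which carries the operator's compute logic.
+- A dispatcher, which registers the operator's kernel implementations for different devices. Within a process, each operator has exactly one global dispatcher, and each device can have only one kernel implementation registered at a time; registering again can overwrite the previous implementation. For details, see `include/infinicore/ops/common/dispatcher.hpp`.
+
+The header for the example `Matmul` operator is as follows: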
+
+```c++
+#pragma once
+
+#include "../device.hpp"
+#include "common/op.hpp"
+
+namespace infinicore::op {
+class Matmul {
+public:
+    using schema = void (*)(Tensor, Tensor, Tensor);
+    static void execute(Tensor c, Tensor a, Tensor b);
+    static common::OpDispatcher<schema> &dispatcher();
+};
+
+Tensor matmul(Tensor a, Tensor b);
+void matmul_(Tensor c, Tensor a, Tensor b);
+}
+```
+
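+### 2. Operator implementation
+
+Implement the operator's compute logic in `src/infinicore/ops/*OPNAME*/*OPNAME*.cpp`.
+
+- The execute function uses the operator's dispatcher to call the kernel for the current hardware.
+- The compute interfaces use the execute function to implement the operator's interface logic in both in-place and out-of-place modes. In-place interface function names end with `_` and write the output into the given parameter; out-of-place interfaces create a new Tensor for the output.
+
+The implementation of the example `Matmul` operator is as follows: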
+
+```c++
+#include "infinicore/ops/matmul.hpp"
+
+namespace infinicore::op {
+
+common::OpDispatcher<Matmul::schema> &Matmul::dispatcher() {
+    static common::OpDispatcher<Matmul::schema> dispatcher_;
+    return dispatcher_;
+}
+
+void Matmul::execute(Tensor c, Tensor a, Tensor b) {
+    dispatcher().lookup(context::getDevice().getType())(c, a, b);
+}
+
+Tensor matmul(Tensor a, Tensor b) {
+    Shape shape = a->shape();
+    Size size = a->ndim();
+    shape[size - 1] = b->size(size - 1);
+    auto c = Tensor::empty(shape, a->dtype(), a->device());
+    matmul_(c, a, b);
+    return c;
+}
+
+void matmul_(Tensor c, Tensor a, Tensor b) {
+    Matmul::execute(c, a, b);
+}
+}
+```
+
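+### 3. Kernel registration
+
+Add the operator's kernel implementations in the `src/infinicore/ops/*OPNAME*/` directory and register them with the operator's dispatcher. You can register a kernel implementation (a function pointer) for a single device, multiple devices, or all platforms, and you can use the `override_existing` mode to overwrite a previous implementation. For details, see `include/infinicore/ops/common/dispatcher.hpp`:
+
+```c++
+// Register a kernel implementation for one device
+void registerDevice(Device::Type device_type, Fn fn, bool override_existing = true);
+
+// Register a kernel implementation for multiple devices
+void registerDevice(std::initializer_list<Device::Type> device_types, Fn fn, bool override_existing = true);
+
+// Register a kernel implementation for all platforms
+void registerAll(Fn fn, bool override_existing = true);
+
+// Look up a kernel implementation
+Fn lookup(Device::Type device_type) const;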
+```
+
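+If you register the same kernel implementation for multiple (or all) devices, you must implement dispatching across those devices yourself. For example, the InfiniOP operator library in this framework keeps its operator interfaces consistent across platforms and dispatches automatically based on the current device type, so a single compute function is registered for all platforms. Taking the Matmul operator as an example:
+
+```c++
+namespace infinicore::op::matmul_impl::infiniop {
+
+// InfiniOP operator cache (per thread)
+thread_local common::OpCache<size_t, infiniopGemmDescriptor_t> caches(
+    100,
+    [](infiniopGemmDescriptor_t &desc) {
+        if (desc != nullptr) {
+            INFINICORE_CHECK_ERROR(infiniopDestroyGemmDescriptor(desc));
+            desc = nullptr;
+        }
+    });
+
+// Compute function
+void calculate(Tensor c, Tensor a, Tensor b) {
+    // ...
+    INFINICORE_CHECK_ERROR(infiniopGemm(
+        desc, workspace->data(), workspace_size,
+        c->data(), a->data(), b->data(), 1.f, 0.f, context::getStream()));
+}
+
+// Register the InfiniOP implementation for all platforms when InfiniCore is loaded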
+static bool registered = []() {
+    Matmul::dispatcher().registerAll(&calculate, false);
+    return true;
+}();
+
+}
+```
+
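+You can follow the example above to implement and register kernels separately for different platforms. Remember to add build rules for the source files in `xmake/*.lua` and to isolate platform-specific code so that the project still compiles on other platforms. You can register kernels at load time with `static bool registered = []() {...}` as in the example above, but avoid registering different kernels for the same operator more than once at load time: the outcome is undefined, because the initialization order of static objects in different translation units is not specified. You can also register operators explicitly at run time.
+
+If you want to implement an operator through the InfiniOP library, see the [`InfiniOP developer documentation`](/src/infiniop/README.md).
+
+### 4. Python interface
+
+To expose a C++ operator to the Python frontend through pybind11, add the corresponding header in the `src/infinicore/pybind11/ops/*OPNAME*/` directory and call it from `src/infinicore/pybind11/ops.hpp`. Then add a Python file for the operator in the `python/infinicore/ops/` directory that implements your Python interface by calling the pybind interface you just defined, and expose the Python interface through `python/infinicore/__init__.py`.
+
+### 5. Python tests
+
+After implementing the Python interface, add the corresponding operator test script in `/test/infinicore/ops/` and make sure the tests pass. Tests in this directory share a unified test framework in which most test functionality is already implemented, such as building random tensors from shapes and automatically testing operator correctness and performance. You need to subclass `BaseOperatorTest` and implement the operator-specific methods such as `get_test_cases`, `get_tensor_dtypes`, `get_tolerance_map`, `torch_operator`, and `infinicore_operator`. `torch_operator` is the PyTorch implementation used for comparison, and `infinicore_operator` is your InfiniCore implementation. Taking the silu operator as an example: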
+
+```python
+class OpTest(BaseOperatorTest):
+    """SiLU test with simplified test case parsing"""
+
+    def __init__(self):
+        super().__init__("SiLU")
+
+    def get_test_cases(self):
+        return _TEST_CASES
+
+    def get_tensor_dtypes(self):
+        return _TENSOR_DTYPES
+
+    def get_tolerance_map(self):
+        return _TOLERANCE_MAP
+
+    def torch_operator(self, input, out=None, **kwargs):
+        # SiLU implementation: input * sigmoid(input)
+        sigmoid_input = torch.sigmoid(input)
+        result = input * sigmoid_input
+        if out is not None:
+            out.copy_(result)
+            return out
+        return result
+
+    def infinicore_operator(self, input, out=None, **kwargs):
+        return infinicore.silu(input, out=out)
+```
+
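+In the test script, add test cases for the operator. Refer to the definition of the `TestCase` class and provide the shapes, data types, and strides of the input and output tensors, as well as the values of any other parameters. You can specify whether the operator computation uses in-place or out-of-place mode. As in the example, you can write test cases more concisely and parse the test case data with the `parse_test_cases` function.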
+
+```python
+_TEST_CASES_DATA = [
+    # Basic 2D SiLU
+    (TestCase.BOTH, (2, 4), None, None),
+    (TestCase.BOTH, (128, 64), None, None),
+    # 3D SiLU
+    (TestCase.BOTH, (2, 4, 8), None, None),
+    (TestCase.BOTH, (4, 48, 6), None, None),
+    # Strided tensors
+    (TestCase.BOTH, (1, 2048), (4096, 1), (4096, 1)),
+    (TestCase.BOTH, (6, 2560), (2048, 1), (2560, 1)),
+    # Mixed cases
+    (TestCase.BOTH, (8, 16, 32), None, None),
+    # Large tensors
+    (TestCase.BOTH, (16, 5632), None, None),
+    (TestCase.BOTH, (4, 4, 5632), None, None),
+]
+
+def parse_test_cases(data):
+    """
+    Parse silu test case data according to format:
+    (operation_mode, shape, input_strides, output_strides)
+    """
+    operation_mode = data[0]
+    shape = data[1]
+    input_strides = data[2] if len(data) > 2 else None
+    output_strides = data[3] if len(data) > 3 else None
+
+    # Create input specifications
+    inputs = []
+
+    # Tensor input
+    if input_strides is not None:
+        inputs.append(TensorSpec.from_strided_tensor(shape, input_strides))
+    else:
+        inputs.append(TensorSpec.from_tensor(shape))
+
+    # Output tensor
+    if output_strides is not None:
+        output = TensorSpec.from_strided_tensor(shape, output_strides)
+    else:
+        output = TensorSpec.from_tensor(shape)
+
+    return TestCase(operation_mode, inputs, output)
+
+
+# Parse test cases
+_TEST_CASES = [parse_test_cases(data) for data in _TEST_CASES_DATA]
+```
+
+For operators that support multiple precisions, you can specify the error tolerance within which a test passes.
+
+```python
+_TENSOR_DTYPES = [infinicore.float16, infinicore.bfloat16, infinicore.float32]
+
+
+_TOLERANCE_MAP = {
+    infinicore.float16: {"atol": 1e-3, "rtol": 1e-3},
+    infinicore.float32: {"atol": 1e-5, "rtol": 1e-5},
+    infinicore.bfloat16: {"atol": 5e-3, "rtol": 1e-2},
+}
+```
+
+Run the test command to check the operator's correctness and performance:
+
+```bash
+python test/infinicore/run.py --ops matmul --nvidia --verbose --bench
+```
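The registration rules in section 3 of the guide above (one kernel per device, optional overwrite, register-for-all) can be modeled in a few lines of Python. This is only a toy model for reasoning about the semantics, not the C++ `OpDispatcher`: the class name `OpDispatcherModel`, the device list, and the behavior of skipping an existing entry when `override_existing` is false are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

# Illustrative device names (taken from the test script flags); an assumption,
# not the real Device::Type enumeration.
ALL_DEVICES = ("cpu", "nvidia", "cambricon", "ascend")


class OpDispatcherModel:
    """Toy model of a per-operator dispatcher: at most one kernel per device."""

    def __init__(self) -> None:
        self._table: Dict[str, Callable] = {}

    def register_device(self, device: str, fn: Callable,
                        override_existing: bool = True) -> None:
        # Assumption: with override_existing=False an existing entry is kept.
        if override_existing or device not in self._table:
            self._table[device] = fn

    def register_all(self, fn: Callable, override_existing: bool = True) -> None:
        # Register the same kernel for every known device.
        for device in ALL_DEVICES:
            self.register_device(device, fn, override_existing)

    def lookup(self, device: str) -> Optional[Callable]:
        # Return the registered kernel, or None when nothing is registered.
        return self._table.get(device)


def generic_kernel(x):
    return ("generic", x)


def nvidia_kernel(x):
    return ("nvidia", x)


dispatcher = OpDispatcherModel()
dispatcher.register_device("nvidia", nvidia_kernel)
# Mirrors registerAll(&calculate, false): fill the gaps, keep existing entries.
dispatcher.register_all(generic_kernel, override_existing=False)
```

Under these assumptions a device-specific kernel registered first survives a later non-overriding `register_all`, which is the pattern the InfiniOP example relies on when it passes `false` as the second argument.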
diff --git a/src/infiniop/README.md b/src/infiniop/README.md
new file mode 100644
index 000000000..b4d4059e1
--- /dev/null
+++ b/src/infiniop/README.md
@@ -0,0 +1,48 @@
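+# InfiniOP developer documentation
+
+InfiniOP is the unified low-level operator framework under InfiniCore. It provides a unified multi-stage C interface for the same operator across different platforms.
+
+## Development workflow
+
+1. Design the operator interface according to the operator definition and add operator documentation to the [`InfiniCore documentation`](https://github.com/InfiniTensor/InfiniCore-Documentation). Submit a documentation PR.
+
+2. Add the operator header in `include/infiniop/` and include it in `include/infiniop.h`. Each operator exposes interfaces to create the operator descriptor, get the workspace size, execute the operator, and destroy the operator descriptor. For example: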
+
+    ```c
+    #ifndef __INFINIOP_ADD_API_H__
+    #define __INFINIOP_ADD_API_H__
+
+    #include "../operator_descriptor.h"
+
+    typedef struct InfiniopDescriptor *infiniopAddDescriptor_t;
+
+    __C __export infiniStatus_t infiniopCreateAddDescriptor(infiniopHandle_t handle,
+                                                            infiniopAddDescriptor_t *desc_ptr,
+                                                            infiniopTensorDescriptor_t c,
+                                                            infiniopTensorDescriptor_t a,
+                                                            infiniopTensorDescriptor_t b);
+
+    __C __export infiniStatus_t infiniopGetAddWorkspaceSize(infiniopAddDescriptor_t desc, size_t *size);
+
+    __C __export infiniStatus_t infiniopAdd(infiniopAddDescriptor_t desc,
+                                            void *workspace,
+                                            size_t workspace_size,
+                                            void *c,
+                                            const void *a,
+                                            const void *b,
+                                            void *stream);
+
+    __C __export infiniStatus_t infiniopDestroyAddDescriptor(infiniopAddDescriptor_t desc);
+
+    #endif
+    ```
+
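+    Operators that need no workspace on any platform may omit the get-workspace-size interface.
+
+3. Add an operator implementation directory under `src/infiniop/ops/`, and create an `operator.cc` file in it that implements the interfaces in the header and dispatches to the kernels of different platforms according to the hardware environment. You can also put code that is common to the operator across all platforms in this directory; for example, `causal_softmax/info.h` contains common information gathering and input/output checks used when creating the Causal Softmax operator descriptor. For element-wise operators, most of the logic other than the compute kernel is the same, so you can use the common code in `src/infiniop/elementwise/` to adapt an operator quickly.
+
+4. Add platform-specific implementations in `src/infiniop/ops/[op]/[device]/`. Reuse common platform code, such as reduction computation (`src/infiniop/reduce/`), and during development put code that can be reused in the future in the corresponding common code directory.
+
+    Some CUDA kernels can be shared by multiple CUDA-capable platforms; consider implementing them in a header and using them from multiple source files. For example, `mul/cuda/kernel.cuh` contains only device-side code and is included by the source code of multiple CUDA-capable platforms.
+
+5. Once the operator compiles and installs successfully, add a unit test script in `test/infiniop/` that compares correctness and performance against the PyTorch implementation. You can model your script on the existing test scripts to use the common test utilities. Test cases should cover the operator's common data types and shapes. After the tests pass, you can add the test to the `scripts/python_test.py` one-click test script (so that the GitHub automated tests also cover the operator).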
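Step 5 above compares an operator's output against a PyTorch reference. A minimal sketch of what such a correctness check typically computes is shown below, using the element-wise rule `|actual - expected| <= atol + rtol * |expected|` (the rule used by `torch.allclose`). Whether the InfiniOP test utilities apply exactly this rule is an assumption, and the helper name `within_tolerance` is hypothetical.

```python
from typing import Sequence


def within_tolerance(actual: Sequence[float], expected: Sequence[float],
                     atol: float = 1e-5, rtol: float = 1e-5) -> bool:
    """Check |a - e| <= atol + rtol * |e| for every pair of elements.

    Hypothetical helper for illustration; it works on flat Python sequences
    rather than tensors.
    """
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


# Reference result of an element-wise add, and an output with a small error.
a = [1.0, 2.0, 3.0]
b = [0.5, 0.25, 0.125]
reference = [x + y for x, y in zip(a, b)]
device_output = [value + 1e-6 for value in reference]

print(within_tolerance(device_output, reference))
```

Lower-precision data types usually need looser `atol`/`rtol` values than `float32`, which is why the tests take tolerances per data type.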