Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

【Hackathon 5th No.28】为 Paddle 新增 slice_scatter API #668

Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
210 changes: 210 additions & 0 deletions rfcs/APIs/20230928_api_design_for_slice_scatter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
# paddle.slice_scatter 设计文档

|API名称 | paddle.slice_scatter |
|---|---|
|提交作者 | VOIDMalkuth |
|提交时间 | 2023-09-28 |
|版本号 | V1.0 |
|依赖飞桨版本 | develop |
|文件名 | 20200928_api_design_for_slice_scatter.md |


# 一、概述
## 1、相关背景

为了提升飞桨API丰富度,需要为飞桨扩充API `paddle.slice_scatter`

本API属于飞桨开源个人贡献赛API开发任务[No.28:为 Paddle 新增 slice_scatter API](https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_5th/%E3%80%90PaddlePaddle%20Hackathon%205th%E3%80%91%E5%BC%80%E6%BA%90%E8%B4%A1%E7%8C%AE%E4%B8%AA%E4%BA%BA%E6%8C%91%E6%88%98%E8%B5%9B%20API%20%E5%BC%80%E5%8F%91%E4%BB%BB%E5%8A%A1%E5%90%88%E9%9B%86.md#no28%E4%B8%BA-paddle-%E6%96%B0%E5%A2%9E-slice_scatter-api)的任务。

## 2、功能目标

对于一个Tensor,根据给定的轴和切片表示的索引范围,返回一个新的Tensor,其结果等价于将value 中的值填充到该Tensor上,例如当指定轴为1,索引为start=1, end=5, step=2的切片时,与x[:, 1:5:2] = value 结果相似,但与前者不同的是,不会直接修改x的值,而是返回预期赋值后的结果,即额外进行了拷贝操作

预期该API支持

- paddle.slice_scatter 作为独立的函数调用
- Tensor.slice_scatter,作为 Tensor 的方法使用

## 3、意义

为飞桨增加根据给定的轴和切片索引范围,返回替换掉部分切片索引值的新Tensor,提升飞桨API丰富度。

# 二、飞桨现状

目前飞桨缺少相关功能实现

# 三、业内方案调研

## PyTorch

PyTorch 中有 API `torch.slice_scatter(input, src, dim=0, start=None, end=None, step=1) → Tensor` 以及对应的`torch.Tensor.slice_scatter`

其介绍为:

> Embeds the values of the src tensor into input at the given dimension. This function returns a tensor with fresh storage; it does not create a view.

参数表为:
- `Tensor` input: the input tensor.
- `Tensor` src: The tensor to embed into input
- `int` dim: the dimension to insert the slice into
- `Optional[int]` start: the start index of where to insert the slice
- `Optional[int]` end: the end index of where to insert the slice
- `int` step: the how many elements to skip in

### 前向实现

PyTorch在通常的训练过程中通过调用ATen算子来实现相关操作,可以参考[aten/src/ATen/native/TensorShape.cpp#L3956](https://github.com/pytorch/pytorch/blob/34ded743998e2f6fa9677db0cb02658bcc657f05/aten/src/ATen/native/TensorShape.cpp#L3956)。

```cpp
at::Tensor slice_scatter(const at::Tensor& self, const at::Tensor& src, int64_t dim, c10::optional<int64_t> start, c10::optional<int64_t> end, int64_t step) {
// See Note [*_scatter ops preserve strides]
auto output = clone_preserve_strides(self);
auto slice = output.slice(dim, start, end, step);
TORCH_CHECK(slice.sizes() == src.sizes(), "expected src to have a size equal to the slice of self. src size = ", src.sizes(), ", slice size = ", slice.sizes());
slice.copy_(src);
return output;
}
```

在Torch 2.0版本新推出的TorchDynamo中,其默认后端Inductor针对`slice_scatter`进行生成的代码如下,具体代码可以参考[torch/_inductor/lowering.py#L2108](https://github.com/pytorch/pytorch/blob/34ded743998e2f6fa9677db0cb02658bcc657f05/torch/_inductor/lowering.py#L2108)
```python
@register_lowering(aten.slice_scatter, type_promotion_kind=None)
def slice_scatter(x, src, dim=0, start=None, end=None, step=1):
assert x.get_dtype() == src.get_dtype()
x_loader = x.make_loader()
dim = _validate_dim(x, dim, 0)
dim_size = x.get_size()[dim]
if start is not None and V.graph.sizevars.evaluate_expr(sympy.Lt(start, 0)):
start = start + dim_size
if end is not None and V.graph.sizevars.evaluate_expr(sympy.Lt(end, 0)):
end = end + dim_size
if start is None:
start = 0
if end is None or V.graph.sizevars.statically_known_leq(x.get_size()[dim], end):
end = dim_size

src_size = list(x.get_size())
src_size[dim] = FloorDiv(sympy.expand(end - start), sympy.expand(step))
src = expand(src, src_size)
src_loader = src.make_loader()

def inner_fn(idx):
if start == 0 and end == dim_size and step == 1:
# selecting every element is the same as just src.clone()
return src_loader(idx)

idx_dim = ops.index_expr(idx[dim], torch.int64)
src_idx = list(idx)
src_idx[dim] = FloorDiv(idx[dim] - start, step)

mask = []
if start != 0:
mask.append(
ops.ge(
idx_dim,
ops.index_expr(sympy.expand(start), torch.int64),
)
)
if end != dim_size:
mask.append(
ops.lt(
idx_dim,
ops.index_expr(sympy.expand(end), torch.int64),
)
)
if step != 1:
mask.append(
ops.eq(
ops.index_expr(
ModularIndexing(idx[dim] - start, 1, step), torch.int64
),
ops.constant(0, torch.torch.int64),
)
)
assert mask
mask = functools.reduce(ops.and_, mask)
src_val = ops.masked(
mask,
lambda: src_loader(src_idx),
0 if is_integer_type(x) else 0.0,
)
return ops.where(
mask,
src_val,
x_loader(idx),
)

return Pointwise.create(
device=x.get_device(),
dtype=x.get_dtype(),
inner_fn=inner_fn,
ranges=list(x.get_size()),
)
```

### 反向实现

PyTorch的Autograd中定义了相应的反向实现,可以参考[tools/autograd/derivatives.yaml#L1491](https://github.com/pytorch/pytorch/blob/34ded743998e2f6fa9677db0cb02658bcc657f05/tools/autograd/derivatives.yaml#L1491)

```yaml
- name: slice_scatter(Tensor self, Tensor src, int dim=0, SymInt? start=None, SymInt? end=None, SymInt step=1) -> Tensor
self: slice_scatter_symint(grad, zeros_like(src), dim, start, end, step)
src: grad.slice_symint(dim, start, end, step)
result: auto_linear
```

可以看到,对于两个输入的反向梯度均有相关定义描述

# 四、对比分析

Torch中默认使用C++实现算子,并且手动添加了对应的反向信息等。但C++中的算子仍然以Tensor级别进行操作,调用的也是Tensor的API完成操作,因此完全可以在Python中实现,并且使用autograd技术自动完成反向操作。而Torch的Inductor直接生成IR则仅适用于TorchDynamo场景。

# 五、设计思路与实现方案

## 命名与参数设计

添加 Python API:
```python
paddle.slice_scatter(input, src, dim=0, start=None, end=None, step=1)
```

参数表:

- input: (Tensor) 输入的 tensor。数据类型支持 `float16`、`float32`、`float64`、`int8`、`int16`、`int32`、`int64`、`bfloat16`。
- src: (Tensor) 输入的 tensor。数据类型与input一致,形状与`[start:end:step]`构成的slice一致。
- dim: (int) 将要插入src slice的维度。
- start: (Optional[int]) 待插入slice位置的起始index。
- end: (Optional[int]) 待插入slice位置的结束index。
- step: (int) 待插入slice的步长。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已经修改


## 底层OP设计

直接使用 Python API 实现,无需设计底层 OP。

## API实现方案

- 使用`paddle.Tensor.clone`拷贝一个Tensor
- 使用PythonAPI组合,以索引或切片方式在新Tensor上赋值以完成功能。
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不建议直接使用索引切片的方式实现,首先有额外的不必要解析耗时,其次索引赋值是原地的(因此这里才提到需要clone),建议参考set_value OP

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的方案本质上是:

a = src.clone()
a[...,start:end:step,...] = input

事实上Torch的实现也比较类似,但是Torch中是使用Aten在C++中实现的。

set_value我注意到文档中写道This API is ONLY available in Dygraph mode,似乎和我们的要求不符?想麻烦老师说明如何参考。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a = src.clone()
a[...,start:end:step,...] = input

这个写法底层的本质仍然是调用的set_value OP,这个目前没暴露成api,可以参考下 OP定义https://github.com/PaddlePaddle/Paddle/blob/864779c75bc9d57a67442935c49e7903943edbc0/paddle/phi/kernels/cpu/set_value_kernel.cc#L15

以及调用方法 https://github.com/PaddlePaddle/Paddle/blob/864779c75bc9d57a67442935c49e7903943edbc0/python/paddle/base/variable_index.py#L555

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

收到,我去看一看

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我看了一下,set_value OP其实完成的就是将Tensor的某些位置赋值给另一个Tensor。示例代码这里是inplace完成的。

老师的意思是,我们实现时仿照_setitem_impl_函数,通过set_value的C++ OP调用完成,同时取消inplace设置么?这样做应该就可以省去老师所说的解析开销?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_value的C++ OP调用完成,同时取消inplace设置么?这样做应该就可以省去老师所说的解析开销?

是的,解析开销主要指从x[:] = 1的调用形式,逐步确认要调用哪个OP,以及准备这个OP的输入。这里由于我们已经知道应该调用哪个OP了,直接调用即可。此外,静态图还是需要append_op的方式,这个请参考上述代码或其他API的写法


# 六、测试和验收的考量

- 覆盖动态图和静态图的测试场景
- 覆盖 CPU、GPU 两种测试场景
- 支持各种Tensor精度,FP32、FP64、FP16、INT8、INT16、INT32、INT64
- 需要检查前向和反向计算的精度正确性,但由于Python 组合方式新增的 API 反向计算已经在各组合 API 单测中分别验证了,因此不必过多关注
- 处理0维输入数据
- 处理可选参数不存在或不一致的情况

# 七、可行性分析和排期规划

有业内方案实现作为参考,相关PythonAPI均有实现,可以在开源贡献个人挑战赛期间完成。

# 八、影响面

对其他模块暂无影响

# 名词解释

# 附件及参考资料

[PyTorch slice_scatter 文档](https://pytorch.org/docs/stable/generated/torch.slice_scatter.html)