
implement as strided #7275

Merged · 12 commits · Jan 26, 2022

Conversation

@lcylcy (Contributor) commented Jan 17, 2022


const std::vector<int32_t>& stride,
const int32_t& storage_offset) const {
MutableAttrMap attrs;
JUST(attrs.SetAttr<std::vector<int32_t>>("size", size));
Contributor:

Do we need to validate the vector sizes and element values of size and stride here?

@lcylcy (author):

OK, will do.
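
For reference, a minimal sketch of what such a check might look like; the CHECK_*_OR_RETURN macros and the exact call site are assumptions for illustration, not code from this PR:

    // Hypothetical validation near the top of AsStridedFunctor::operator().
    // Assumes OneFlow's CHECK_*_OR_RETURN macros from oneflow/core/common/maybe.h.
    CHECK_EQ_OR_RETURN(size.size(), stride.size())
        << "size and stride must have the same number of elements";
    for (size_t i = 0; i < size.size(); ++i) {
      CHECK_GE_OR_RETURN(size[i], 0) << "size[" << i << "] must be non-negative";
      CHECK_GE_OR_RETURN(stride[i], 0) << "stride[" << i << "] must be non-negative";
    }
    CHECK_GE_OR_RETURN(storage_offset, 0) << "storage_offset must be non-negative";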

@hjchen2 (Contributor) commented Jan 21, 2022

The title "Lcy as strided" is indeed not very standard; changing it to "implement as_strided" would be better. A personal GitHub id generally shouldn't appear in the title anyway.

add_docstr(
oneflow.as_strided,
r"""
Create a view of an existing torch.Tensor input with specified size, stride and storage_offset.
Contributor:

torch.Tensor -> oneflow.Tensor. Also, for documentation that is basically copied from PyTorch, it's best to add a note saying that it comes from PyTorch.

@lcylcy (author):

OK, will do.

@lcylcy (author) commented Jan 21, 2022 via email

@lcylcy (author) commented Jan 21, 2022

> The title "Lcy as strided" is indeed not very standard; changing it to "implement as_strided" would be better. A personal GitHub id generally shouldn't appear in the title anyway.

Where do I change that?

@lcylcy closed this on Jan 21, 2022
const std::shared_ptr<one::Tensor>& input,
const std::vector<int32_t>& size,
const std::vector<int32_t>& stride,
const int32_t& storage_offset) const {
Contributor:

Here, and in AsStridedFunctor above, I think const int32_t& storage_offset would be better as plain int32_t storage_offset:

  • If the motivation is that passing by reference is more efficient than passing by value, there is no such gain for primitive types: a reference is still passed as a pointer underneath, and sizeof(pointer) is 8 on 64-bit platforms, so it actually costs more than passing an int32_t directly.
  • If the intent is just to add a const qualifier so the function body cannot modify it, that works without the reference too, though it's usually unnecessary.
  • Even just from the standpoint of typing fewer characters, keeping declarations as short as possible is preferable.
  • I know there are many similar signatures in the functors, and they are completely fine, but that doesn't mean they are necessarily optimal.

The above is just my personal view, offered for discussion; it's fine either way whether you change it or not.
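
A tiny standalone illustration of the point above; the function names are made up for the example and are not from this PR:

    #include <cstdint>
    #include <vector>

    // A const int32_t& is passed as an 8-byte pointer under the hood, so plain
    // by-value int32_t is simpler and no more expensive. The reference still
    // pays off for the vectors, which would otherwise be copied.
    int64_t OffsetByValue(int32_t storage_offset) { return storage_offset; }
    int64_t OffsetByConstRef(const int32_t& storage_offset) { return storage_offset; }

    int32_t FirstStride(const std::vector<int32_t>& stride) {
      return stride.empty() ? 0 : stride.front();
    }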

Comment on lines 58 to 59
ctx->size = JUST(composed_attrs.GetAttr<std::vector<int32_t> >("size"));
ctx->stride = JUST(composed_attrs.GetAttr<std::vector<int32_t> >("stride"));
Contributor:

The space between the two consecutive > > here is a holdover from the C++98 era; it's no longer needed nowadays.
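
Purely as illustration, both forms below parse the same way under C++11 and later:

    #include <cstdint>
    #include <vector>

    // C++98 required the space so ">>" would not be lexed as the right-shift
    // operator; since C++11 both forms are accepted, and the space-free form is preferred.
    std::vector<std::vector<int32_t> > old_style;  // legal, but dated
    std::vector<std::vector<int32_t>> new_style;   // fine since C++11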

@lcylcy reopened this on Jan 21, 2022
template<typename T>
struct AsStridedFunctor final {
Maybe<void> operator()(ep::Stream* stream, const T* input_buf, T* output_buf, const int64_t* dest_dims, const int32_t* stride,
const int32_t dest_num_dims, const int32_t storage_offset, const int32_t input_num, const int32_t output_num) {
Contributor:

For primitive types passed by value here, I'd suggest not using the const qualifier. The function is very short, so there's no need to be this careful: you won't modify the parameter yourself, and others won't casually modify it when they use the code either.
This doesn't have to be changed; just raising it for discussion. Curious what you think.

template<typename T>
struct AsStridedGradFunctor final {
Maybe<void> operator()(ep::Stream* stream, const T* dy_buf, T* dx_buf, const int64_t* dy_dims, const int32_t* stride,
const int32_t dy_num_dims, const int32_t storage_offset, const int32_t dx_num, const int32_t dy_num) {
Contributor:

Same as above.

@lcylcy changed the title from "Lcy as strided" to "implement as strided" on Jan 21, 2022
~CpuAsStridedKernel() = default;

private:
void Compute(user_op::KernelComputeContext* ctx) const override {
Contributor:

Shouldn't a declaration like this be added here:

using user_op::OpKernel::Compute;

Otherwise you can get a fairly long block of warnings. It may not happen for .cpp files, but I ran into it when compiling a .cu file with clang (probably the diagonal_kernel.cu file you wrote). For the cause of this issue, see this link: https://stackoverflow.com/questions/21462908/warning-overloaded-virtual-function-baseprocess-is-only-partially-overridde
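
A minimal sketch of where the suggested using-declaration would sit; the class skeleton is abbreviated from the kernel in this PR, and the include line is an assumption:

    #include "oneflow/core/framework/framework.h"  // assumed include for user_op::OpKernel

    namespace oneflow {

    class CpuAsStridedKernel final : public user_op::OpKernel {
     public:
      CpuAsStridedKernel() = default;
      ~CpuAsStridedKernel() = default;

      using user_op::OpKernel::Compute;  // keep the other Compute overloads visible

     private:
      void Compute(user_op::KernelComputeContext* ctx) const override {
        // ... kernel body unchanged ...
      }
      bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }
    };

    }  // namespace oneflow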

Comment on lines 82 to 89
const auto size = ctx->Attr<std::vector<int32_t>>("size");
const auto stride = ctx->Attr<std::vector<int32_t>>("stride");
const int32_t storage_offset = ctx->Attr<int32_t>("storage_offset");

size_t dest_num_dims = output->shape().NumAxes();
const int64_t *dest_dims = output->shape().ptr();
const size_t input_num = input->shape().Count(0);
const size_t output_num = output->shape().Count(0);
Contributor:

I feel all the const here could be dropped, and instead of spelling out the types explicitly you could mostly use auto; for the reasoning see roughly Items 1 through 6 of Effective Modern C++. Of course it's also fine not to change it; this is just a personal view, offered for discussion.
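
For illustration, the snippet above rewritten with auto; this is only a sketch of the suggestion, and whether to keep const remains a style choice:

    // Same attribute reads as above, with auto doing the type deduction.
    auto size = ctx->Attr<std::vector<int32_t>>("size");         // copies the attr vector
    auto stride = ctx->Attr<std::vector<int32_t>>("stride");
    auto storage_offset = ctx->Attr<int32_t>("storage_offset");

    auto dest_num_dims = output->shape().NumAxes();
    const auto* dest_dims = output->shape().ptr();               // deduced as const int64_t*
    auto input_num = input->shape().Count(0);
    auto output_num = output->shape().Count(0);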

Comment on lines 107 to 114
const auto size = ctx->Attr<std::vector<int32_t>>("size");
const auto stride = ctx->Attr<std::vector<int32_t>>("stride");
const int32_t storage_offset = ctx->Attr<int32_t>("storage_offset");

size_t dy_num_dims = dy->shape().NumAxes();
const int64_t *dy_dims = dy->shape().ptr();
const size_t dx_num = dx->shape().Count(0);
const size_t dy_num = dy->shape().Count(0);
Contributor:

On the use of const and auto, see the earlier comments; just for discussion.

~GpuAsStridedGradKernel() = default;

private:
void Compute(user_op::KernelComputeContext* ctx) const override {
@wyushun (Contributor), Jan 21, 2022:

As mentioned earlier, the warning issue may exist here too (when compiling with clang; I haven't tried gcc).


Contributor:

The warning looks roughly like this, for your reference:

../oneflow/user/kernels/diagonal_kernel.cu(109): warning: overloaded virtual function "oneflow::user_op::OpKernel::Compute" is only partially overridden in class "oneflow::GpuDiagonalBackwardKernel<half>"
          detected during:
            instantiation of class "oneflow::GpuDiagonalBackwardKernel<T> [with T=half]" 
../oneflow/core/framework/op_kernel.h(330): here
            instantiation of "oneflow::user_op::OpKernel *oneflow::user_op::NewOpKernel<T>() [with T=oneflow::GpuDiagonalBackwardKernel<half>]" 
../oneflow/core/framework/user_op_kernel_registry.h(84): here
            instantiation of "oneflow::user_op::OpKernelRegistry &oneflow::user_op::OpKernelRegistry::SetCreateFn<T>() [with T=oneflow::GpuDiagonalBackwardKernel<half>]" 
(152): here

../oneflow/user/kernels/diagonal_kernel.cu(109): warning: overloaded virtual function "oneflow::user_op::OpKernel::Compute" is only partially overridden in class "oneflow::GpuDiagonalBackwardKernel<float>"
          detected during:
            instantiation of class "oneflow::GpuDiagonalBackwardKernel<T> [with T=float]" 
../oneflow/core/framework/op_kernel.h(330): here
            instantiation of "oneflow::user_op::OpKernel *oneflow::user_op::NewOpKernel<T>() [with T=oneflow::GpuDiagonalBackwardKernel<float>]" 
../oneflow/core/framework/user_op_kernel_registry.h(84): here
            instantiation of "oneflow::user_op::OpKernelRegistry &oneflow::user_op::OpKernelRegistry::SetCreateFn<T>() [with T=oneflow::GpuDiagonalBackwardKernel<float>]" 
(153): here

../oneflow/user/kernels/diagonal_kernel.cu(109): warning: overloaded virtual function "oneflow::user_op::OpKernel::Compute" is only partially overridden in class "oneflow::GpuDiagonalBackwardKernel<double>"
          detected during:
            instantiation of class "oneflow::GpuDiagonalBackwardKernel<T> [with T=double]" 
../oneflow/core/framework/op_kernel.h(330): here
            instantiation of "oneflow::user_op::OpKernel *oneflow::user_op::NewOpKernel<T>() [with T=oneflow::GpuDiagonalBackwardKernel<double>]" 
../oneflow/core/framework/user_op_kernel_registry.h(84): here
            instantiation of "oneflow::user_op::OpKernelRegistry &oneflow::user_op::OpKernelRegistry::SetCreateFn<T>() [with T=oneflow::GpuDiagonalBackwardKernel<double>]" 
(154): here

@lcylcy (author):

Yes, that's the file I wrote earlier. So that's what was going on.

Comment on lines 25 to 32
device = random_device()
x = random_pytorch_tensor(
ndim=4,
dim1=random(3, 6),
dim2=random(3, 6),
dim3=random(3, 6),
dim4=random(3, 6),
).to(device)
Contributor:

I don't know whether this op supports all of 1, 2, 3, 4, and 5 dimensions. If it does, you could randomize the dimensionality and test each one; you can refer to this code:

    # random ndim in range [1, 5]
    ndim = np.random.randint(1, 6)
    dim0 = np.random.randint(4, 10) * 8
    dim1 = np.random.randint(4, 10) * 8
    dim2 = np.random.randint(4, 10) * 8
    dim3 = np.random.randint(4, 10) * 8
    dim4 = np.random.randint(4, 10) * 8
    if ndim == 1:
        x = random_pytorch_tensor(1, dim0)
    elif ndim == 2:
        x = random_pytorch_tensor(2, dim0, dim1)
    elif ndim == 3:
        x = random_pytorch_tensor(3, dim0, dim1, dim2)
    elif ndim == 4:
        x = random_pytorch_tensor(4, dim0, dim1, dim2, dim3)
    elif ndim == 5:
        x = random_pytorch_tensor(5, dim0, dim1, dim2, dim3, dim4)

@lcylcy (author):

OK, will do.

>>> import oneflow as flow

>>> input = flow.rand(2,3,5)
>>> output = flow.as_strided(input, (2,3,3), (1,2,3), 1)
Contributor:

Does this part of the documentation need to show the contents of input and output? I see that PyTorch's torch.as_strided docs display them. What is our convention here? Just follow whatever our requirements are; this is only for reference.

@lcylcy (author):

There doesn't seem to be a rule about that.

@github-actions:

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 136.6ms (= 13661.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.3ms (= 13830.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.01 (= 138.3ms / 136.6ms)

OneFlow resnet50 time: 78.7ms (= 7874.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.6ms (= 8464.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.07 (= 84.6ms / 78.7ms)

OneFlow resnet50 time: 54.5ms (= 10894.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 11552.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.06 (= 57.8ms / 54.5ms)

OneFlow resnet50 time: 42.3ms (= 8456.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.7ms (= 8730.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.03 (= 43.7ms / 42.3ms)

OneFlow resnet50 time: 41.4ms (= 8270.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 37.8ms (= 7560.7ms / 200, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.91 (= 37.8ms / 41.4ms)

OneFlow resnet50 time: 147.4ms (= 14739.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 157.6ms (= 15761.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 157.6ms / 147.4ms)

OneFlow resnet50 time: 89.7ms (= 8967.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 97.2ms (= 9720.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.08 (= 97.2ms / 89.7ms)

OneFlow resnet50 time: 63.4ms (= 12687.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.0ms (= 14605.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 73.0ms / 63.4ms)

OneFlow resnet50 time: 68.1ms (= 13616.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.3ms (= 13062.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.96 (= 65.3ms / 68.1ms)

OneFlow resnet50 time: 62.9ms (= 12582.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 57.3ms (= 11459.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.91 (= 57.3ms / 62.9ms)

@github-actions:

CI failed when running job: cuda-speed-test. PR label automerge has been removed

@lcylcy removed the request for review from oneflow-ci-bot on January 26, 2022 16:00
@github-actions:

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 136.6ms (= 13655.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.6ms (= 14060.5ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.03 (= 140.6ms / 136.6ms)

OneFlow resnet50 time: 78.2ms (= 7820.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.8ms (= 8484.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 84.8ms / 78.2ms)

OneFlow resnet50 time: 51.8ms (= 10353.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 53.6ms (= 10720.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.04 (= 53.6ms / 51.8ms)

OneFlow resnet50 time: 41.7ms (= 8330.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 46.3ms (= 9262.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.11 (= 46.3ms / 41.7ms)

OneFlow resnet50 time: 37.9ms (= 7577.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.8ms (= 8757.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.16 (= 43.8ms / 37.9ms)

OneFlow resnet50 time: 147.1ms (= 14707.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 155.0ms (= 15497.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.05 (= 155.0ms / 147.1ms)

OneFlow resnet50 time: 89.3ms (= 8930.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 98.8ms (= 9875.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 98.8ms / 89.3ms)

OneFlow resnet50 time: 64.2ms (= 12837.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.7ms (= 14139.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.10 (= 70.7ms / 64.2ms)

OneFlow resnet50 time: 58.9ms (= 11774.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.8ms (= 12551.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 62.8ms / 58.9ms)

OneFlow resnet50 time: 60.7ms (= 12138.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 59.7ms (= 11935.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.98 (= 59.7ms / 60.7ms)
