Support tensor.to()/to_local() #5271

clackhan · 2021-06-22T09:22:42Z

Adds support for local_tensor.to(...) and, in symmetric mode, for the GPU version of local_tensor.to_consistent(...). Example code follows:
Demo on process 0

>>> import oneflow.experimental as flow
>>> import numpy as np
>>> ndarr = np.asarray([[7, 9, 5], [3,6,9]])
>>> x = flow.Tensor(ndarr)
>>> y = x.to("cuda")
>>> y
tensor([[7., 9., 5.],
        [3., 6., 9.]], device='cuda:0', dtype=oneflow.float32)
>>> p = flow.placement("cuda", {0:range(2)})
>>> z = y.to_consistent([flow.sbp.broadcast], p)
>>> z.to_local()
tensor([[7., 9., 5.],
        [3., 6., 9.]], device='cuda:0', dtype=oneflow.float32)
>>> m = y.to_consistent([flow.sbp.split(0)], p)
>>> m.shape
flow.Size([4, 3])
>>> m.to_local()
tensor([[7., 9., 5.],
        [3., 6., 9.]], device='cuda:0', dtype=oneflow.float32)
>>> n = y.to_consistent([flow.sbp.partial_sum], p)
>>> n.to_local()
tensor([[7., 9., 5.],
        [3., 6., 9.]], device='cuda:0', dtype=oneflow.float32)
>>>

Demo on process 1

>>> import oneflow.experimental as flow
>>> import numpy as np
>>> ndarr = np.asarray([[1,-4, 5], [2,-3,7]])
>>> x = flow.Tensor(ndarr)
>>> y = x.to("cuda")
>>> y
tensor([[ 1., -4.,  5.],
        [ 2., -3.,  7.]], device='cuda:1', dtype=oneflow.float32)
>>> p = flow.placement("cuda", {0:range(2)})
>>> z = y.to_consistent([flow.sbp.broadcast], p)
>>> z.to_local()
tensor([[7., 9., 5.],
        [3., 6., 9.]], device='cuda:1', dtype=oneflow.float32)
>>> m = y.to_consistent([flow.sbp.split(0)], p)
>>> m.shape
flow.Size([4, 3])
>>> m.to_local()
tensor([[ 1., -4.,  5.],
        [ 2., -3.,  7.]], device='cuda:1', dtype=oneflow.float32)
>>> n = y.to_consistent([flow.sbp.partial_sum], p)
>>> n.to_local()
tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:1', dtype=oneflow.float32)
>>>
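
The two transcripts above can be summarized with a single-process NumPy sketch. It is only an illustration of the local results printed in the demos, not OneFlow code: `simulate_to_consistent` and the string labels for the SBP kinds are hypothetical names, and the `partial_sum` branch simply mirrors the output shown above (rank 0 keeps its data, the other rank holds zeros).

```python
import numpy as np


def simulate_to_consistent(local_tensors, sbp):
    """Hypothetical helper: return (logical_shape, per-rank local tensors).

    `local_tensors` holds one array per rank, all of the same shape,
    which corresponds to the symmetric case used in the demos.
    """
    locals_ = [np.asarray(t, dtype=np.float32) for t in local_tensors]
    shape = locals_[0].shape
    if sbp == "broadcast":
        # Every rank ends up holding rank 0's data.
        return shape, [locals_[0].copy() for _ in locals_]
    if sbp == "split(0)":
        # Each rank keeps its own slice; the logical tensor stacks them on axis 0.
        logical_shape = (sum(t.shape[0] for t in locals_),) + shape[1:]
        return logical_shape, locals_
    if sbp == "partial_sum":
        # Mirrors the demo output: rank 0 keeps its data, other ranks are zero.
        return shape, [locals_[0].copy()] + [np.zeros_like(t) for t in locals_[1:]]
    raise ValueError(f"unsupported sbp: {sbp}")


rank0 = [[7, 9, 5], [3, 6, 9]]
rank1 = [[1, -4, 5], [2, -3, 7]]

for sbp in ("broadcast", "split(0)", "partial_sum"):
    logical_shape, per_rank = simulate_to_consistent([rank0, rank1], sbp)
    print(sbp, "logical shape:", logical_shape)
    for rank, local in enumerate(per_rank):
        print(f"  rank {rank} to_local():", local.tolist())
```

With `split(0)` the logical shape becomes (4, 3) while each rank's local tensor stays (2, 3), matching the `m.shape` and `m.to_local()` lines in the demos.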

oneflow/api/python/framework/tensor.cpp

oneflow/core/framework/op_expr.h

oneflow/api/python/framework/tensor.cpp

oneflow/python/framework/distribute.py

oneflow/core/framework/op_interpreter/eager_consistent_op_interpreter.cpp

…support_tensor_to/to_local Conflicts: oneflow/core/framework/tensor.h

…support_tensor_to/to_local

…support_tensor_to/to_local Conflicts: oneflow/core/framework/op_expr.cpp oneflow/core/framework/tensor.h

oneflow/core/autograd/gradient_funcs/consistent_cast.cpp

oneflow/core/functional/impl/consistent_cast.cpp

oneflow/core/autograd/gradient_funcs/eager_nccl_broadcast.cpp

github-actions · 2021-08-02T11:20:59Z

Speed stats:

GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 139.3ms (= 6966.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.8ms (= 6340.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.10 (= 139.3ms / 126.8ms)

PyTorch resnet50 time: 85.4ms (= 4271.5ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.1ms (= 3707.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.15 (= 85.4ms / 74.1ms)

PyTorch resnet50 time: 57.8ms (= 2888.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.2ms (= 2362.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 57.8ms / 47.2ms)

PyTorch resnet50 time: 47.7ms (= 2386.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 41.6ms (= 2078.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.15 (= 47.7ms / 41.6ms)

PyTorch resnet50 time: 43.5ms (= 2174.4ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 45.6ms (= 2280.0ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.95 (= 43.5ms / 45.6ms)

daquexian · 2021-08-02T16:04:33Z

oneflow/user/kernels/eager_nccl_kernels.cu

+    int64_t cur_parallel_id =
+        CHECK_JUST(parallel_desc->ParallelId4MachineDeviceId(cur_machine_id, cur_machine_id));
+    if (cur_parallel_id != root) {
+      Memset<DeviceType::kGPU>(ctx->device_ctx(), out->mut_dptr(), 0,


What is the reason for this? The op is named reduce, but reduce itself does not include this zeroing step.

This logic should be removed; it came from an earlier, outdated idea that was somewhat flawed.
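
To make the point of this thread concrete: a plain reduce writes the combined result into the root's output buffer only and does not touch the outputs on other ranks. The NumPy sketch below simulates that in one process. It is not the OneFlow kernel; `simulate_reduce` and the sentinel value -1 are assumptions made for illustration.

```python
import numpy as np


def simulate_reduce(inputs, outputs, root):
    """Hypothetical helper: sum all ranks' inputs into the root's output.

    Output buffers of non-root ranks are deliberately left unchanged,
    i.e. no zero-fill happens there.
    """
    outputs[root][...] = np.sum(inputs, axis=0)
    return outputs


inputs = [np.array([1.0, 2.0]), np.array([10.0, 20.0])]
# Pre-fill outputs with a sentinel so untouched buffers are easy to spot.
outputs = [np.full(2, -1.0), np.full(2, -1.0)]

simulate_reduce(inputs, outputs, root=0)
print("root output:    ", outputs[0].tolist())
print("non-root output:", outputs[1].tolist())
```

Any zeroing of non-root outputs, as in the quoted Memset call, is therefore extra behavior layered on top of reduce rather than part of it.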

github-actions · 2021-08-02T16:44:09Z

CI failed, removing label automerge

github-actions · 2021-08-03T01:23:12Z

Speed stats:

GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 146.5ms (= 7322.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.7ms (= 6337.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.16 (= 146.5ms / 126.7ms)

PyTorch resnet50 time: 84.0ms (= 4199.7ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.1ms (= 3704.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 84.0ms / 74.1ms)

PyTorch resnet50 time: 58.6ms (= 2931.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.7ms (= 2384.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.23 (= 58.6ms / 47.7ms)

PyTorch resnet50 time: 49.8ms (= 2489.8ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 42.6ms (= 2132.2ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.17 (= 49.8ms / 42.6ms)

PyTorch resnet50 time: 43.1ms (= 2154.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 43.5ms (= 2176.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.99 (= 43.1ms / 43.5ms)

clackhan added 5 commits June 22, 2021 14:39

support_tensor_to/to_local

00cda8d

export consistent_tensor.to_local()

3a38a74

refine code

3053004

export tensor.to()...

65ad1c5

refine code

76fcd6f

clackhan added enhancement system labels Jun 22, 2021

clackhan requested review from lixinqi, strint, daquexian, hjchen2 and oneflow-ci-bot June 22, 2021 09:22

oneflow-ci-bot removed their request for review June 22, 2021 10:39