Optimize ReduceSumFloatCudaKernel with GEMM #7684

liujuncheng · 2022-03-04T10:09:08Z

No description provided.
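Going only by the PR title, the underlying idea can be illustrated as follows: summing a matrix along one axis is equivalent to multiplying it by a vector of ones, so the reduction can be handed to a tuned GEMM routine. The sketch below is a pure-Python illustration of that identity. The names `matmul` and `reduce_sum_via_gemm` are hypothetical, and nothing here is taken from the kernel code in this PR.

```python
def matmul(a, b):
    """Naive (m x k) @ (k x n) matrix product on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def reduce_sum_via_gemm(x, axis):
    """Sum a 2-D matrix along `axis` by multiplying with a vector of ones.

    Hypothetical helper for illustration only; not OneFlow's API.
    """
    rows, cols = len(x), len(x[0])
    if axis == 1:
        # (rows x cols) @ (cols x 1) -> (rows x 1): one sum per row.
        ones = [[1.0] for _ in range(cols)]
        return [r[0] for r in matmul(x, ones)]
    if axis == 0:
        # (1 x rows) @ (rows x cols) -> (1 x cols): one sum per column.
        ones = [[1.0] * rows]
        return matmul(ones, x)[0]
    raise ValueError("axis must be 0 or 1 for a 2-D matrix")


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
row_sums = reduce_sum_via_gemm(x, axis=1)
col_sums = reduce_sum_via_gemm(x, axis=0)
```

On a GPU the same product would presumably be dispatched to a library GEMM call rather than a hand-written loop; whether and when that beats a dedicated reduction kernel depends on the shapes involved.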

github-actions · 2022-03-07T06:06:58Z

Speed stats:

GPU Name: GeForce GTX 1080

✔️ OneFlow resnet50 time: 128.3ms (= 12828.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.5ms (= 13848.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.08 (= 138.5ms / 128.3ms)

✔️ OneFlow resnet50 time: 77.7ms (= 7770.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.9ms (= 8390.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 83.9ms / 77.7ms)

OneFlow resnet50 time: 49.1ms (= 9828.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.6ms (= 11524.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 57.6ms / 49.1ms)

OneFlow resnet50 time: 40.3ms (= 8055.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 46.7ms (= 9340.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.16 (= 46.7ms / 40.3ms)

OneFlow resnet50 time: 39.5ms (= 7892.6ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.0ms (= 7794.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.99 (= 39.0ms / 39.5ms)

✔️ OneFlow resnet50 time: 142.1ms (= 14214.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.8ms (= 16379.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 163.8ms / 142.1ms)

OneFlow resnet50 time: 89.1ms (= 8914.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 106.8ms (= 10677.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 106.8ms / 89.1ms)

OneFlow resnet50 time: 59.3ms (= 11851.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.7ms (= 15138.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 75.7ms / 59.3ms)

OneFlow resnet50 time: 53.3ms (= 10666.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.3ms (= 13860.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 69.3ms / 53.3ms)

OneFlow resnet50 time: 47.1ms (= 9420.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.6ms (= 12916.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 64.6ms / 47.1ms)

* Optimize ReduceSumFloatCudaKernel with GEMM
* Fix WITH_CUDA

Optimize ReduceSumFloatCudaKernel with GEMM

f4bdaa1

liujuncheng added op enhancement labels Mar 4, 2022

liujuncheng requested review from guo-ran and MARD1NO March 4, 2022 10:09

Fix WITH_CUDA

5ffd20b

guo-ran approved these changes Mar 4, 2022

MARD1NO approved these changes Mar 4, 2022

liujuncheng added 2 commits March 7, 2022 10:21

Merge branch 'master' into dev_gemm_reduce_sum

bd23446

Merge branch 'master' into dev_gemm_reduce_sum

8281f34

liujuncheng requested a review from oneflow-ci-bot March 7, 2022 04:05

liujuncheng merged commit 687dcdd into master Mar 7, 2022

liujuncheng deleted the dev_gemm_reduce_sum branch March 7, 2022 07:15

marigoold pushed a commit that referenced this pull request Mar 15, 2022

Optimize ReduceSumFloatCudaKernel with GEMM (#7684)

4d491f6

* Optimize ReduceSumFloatCudaKernel with GEMM
* Fix WITH_CUDA

wyg1997 pushed a commit that referenced this pull request Mar 17, 2022

Optimize ReduceSumFloatCudaKernel with GEMM (#7684)

6645f2a

* Optimize ReduceSumFloatCudaKernel with GEMM
* Fix WITH_CUDA
