avoid data transfer #25810

Merged: zhangting2020 merged 1 commit into PaddlePaddle:develop from range_op on Jul 31, 2020
Conversation

@zhangting2020 zhangting2020 (Contributor) commented Jul 29, 2020

PR types

Performance optimization

PR changes

OPs

Describe

avoid data transfer

In the Python API of range, start, end, and step can each be a scalar (float32 | float64 | int32 | int64) or a Variable, but the C++ interface only accepts Tensors and requires those Tensors to be on the CPU.
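
For orientation, here is a minimal sketch of how such a Python wrapper typically turns scalar arguments into tensors; the helper name _wrap_scalar is an illustrative assumption, not the actual fluid implementation:

import paddle.fluid as fluid
from paddle.fluid.framework import Variable

def _wrap_scalar(value, dtype):
    # Illustrative assumption: a scalar argument is wrapped into a
    # 1-element tensor via fill_constant. Without device_guard("cpu"),
    # that tensor is created on the executor's place, i.e. on the GPU
    # under CUDAPlace(0), which is exactly what forces the copies
    # measured below.
    if isinstance(value, Variable):
        return value
    return fluid.layers.fill_constant(shape=[1], dtype=dtype, value=value)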

case 1

If start, end, and step are all constants, the behavior in GPU mode is:

  • The Python side constructs 3 tensors but does not place them on the CPU, so in GPU mode all 3 tensors live on the GPU
  • The compute stage then has to copy the 3 tensors to the CPU
  • Test code:
import paddle.fluid as fluid
from paddle.fluid import profiler

times = 1000
place = fluid.CUDAPlace(0)
# start, end, and step are all Python constants.
res = fluid.layers.range(0, 10, 2, "int32")
exe = fluid.Executor(place)

exe.run(fluid.default_startup_program())
profiler.start_profiler("All", "OpDetail")
for i in range(times):
    out = exe.run(fluid.default_main_program(),
                  fetch_list=[res])
profiler.stop_profiler("total", "./profile/test")
  • In the profiling report below, data copies occur in the range/compute stage:
-------------------------     Overhead Summary      -------------------------

Total time: 300.892
  Computation time       Total: 209.609     Ratio: 69.6625%
  Framework overhead     Total: 91.283      Ratio: 30.3375%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 4000        Total: 115.244     Ratio: 38.3008%
  GpuMemcpyAsync         Calls: 3000        Total: 81.1267     Ratio: 26.9621%
  GpuMemcpySync          Calls: 1000        Total: 34.1173     Ratio: 11.3387%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::range                             1000        130.601     123.384089 (0.944742)   7.216688 (0.055258)     0.120114    1.19753     0.130601    0.434046
  thread0::range/compute                   1000        117.693     110.476077 (0.938682)   7.216688 (0.061318)     0.108894    1.1703      0.117693    0.391147
    GpuMemcpyAsync:GPU->CPU                3000        81.1267     75.253854 (0.927609)    5.872823 (0.072391)     0.020564    0.744466    0.0270422   0.269621
  thread0::range/infer_shape               1000        2.89403     2.894027 (1.000000)     0.000000 (0.000000)     0.00236     0.019389    0.00289403  0.00961817
  thread0::range/prepare_data              1000        2.11671     2.116714 (1.000000)     0.000000 (0.000000)     0.001657    0.019414    0.00211671  0.00703481
thread0::fill_constant                     3000        127.082     123.004245 (0.967915)   4.077426 (0.032085)     0.024857    3.61854     0.0423606   0.42235
  thread0::fill_constant/compute           3000        91.9158     87.838392 (0.955640)    4.077426 (0.044360)     0.017812    3.545       0.0306386   0.305478
  thread0::fill_constant/infer_shape       3000        7.75713     7.757128 (1.000000)     0.000000 (0.000000)     0.001481    0.030565    0.00258571  0.0257805
  thread0::fill_constant/prepare_data      3000        6.41637     6.416367 (1.000000)     0.000000 (0.000000)     0.001127    0.017772    0.00213879  0.0213245
thread0::fetch                             1000        43.2092     41.269054 (0.955099)    1.940120 (0.044901)     0.03906     0.335819    0.0432092   0.143604
  GpuMemcpySync:GPU->CPU                   1000        34.1173     32.177135 (0.943134)    1.940120 (0.056866)     0.031359    0.054941    0.0341173   0.113387

case 2

If start, end, and step are a mix of constants and tensors (for example, start and end are constants while step is a tensor placed on the CPU), the behavior in GPU mode is:

  • The Python side constructs 2 tensors for start and end, which will live on the GPU; step, however, is already on the CPU
  • In the prepare data stage, step is copied to the GPU
  • In the compute stage, all 3 tensors (start, end, and step) are copied to the CPU
  • Test code:
import paddle.fluid as fluid
from paddle.fluid import profiler

times = 1000
place = fluid.CUDAPlace(0)
# step is a tensor explicitly placed on the CPU via device_guard;
# start and end remain Python constants.
with fluid.device_guard("cpu"):
    step = fluid.layers.fill_constant(shape=[1], dtype="int32", value=2)
res = fluid.layers.range(0, 10, step, "int32")
exe = fluid.Executor(place)

exe.run(fluid.default_startup_program())
profiler.start_profiler("All", "OpDetail")
for i in range(times):
    out = exe.run(fluid.default_main_program(),
                  fetch_list=[res])
profiler.stop_profiler("total", "./profile/test")
  • Data copies occur in both the range/prepare_data and the range/compute stages:
-------------------------     Overhead Summary      -------------------------

Total time: 331.715
  Computation time       Total: 188.172     Ratio: 56.727%
  Framework overhead     Total: 143.543     Ratio: 43.273%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 5000        Total: 138.619     Ratio: 41.7886%
  GpuMemcpyAsync         Calls: 3000        Total: 71.2764     Ratio: 21.4873%
  GpuMemcpySync          Calls: 2000        Total: 67.3424     Ratio: 20.3013%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::range                             1000        173.659     164.589584 (0.947772)   9.069838 (0.052228)     0.161416    0.931827    0.173659    0.52352
  thread0::range/compute                   1000        106.423     99.196153 (0.932095)    7.226681 (0.067905)     0.098196    0.177043    0.106423    0.320826
    GpuMemcpyAsync:GPU->CPU                3000        71.2764     65.405500 (0.917631)    5.870941 (0.082369)     0.01986     0.053954    0.0237588   0.214873
  thread0::range/prepare_data              1000        55.2602     53.417038 (0.966646)    1.843157 (0.033354)     0.05008     0.746606    0.0552602   0.16659
    GpuMemcpySync:CPU->GPU                 1000        35.1423     33.299096 (0.947552)    1.843157 (0.052448)     0.03157     0.656173    0.0351423   0.105941
  thread0::range/infer_shape               1000        3.52006     3.520062 (1.000000)     0.000000 (0.000000)     0.002808    0.024192    0.00352006  0.0106117
thread0::fill_constant                     3000        117.459     114.746186 (0.976905)   2.712756 (0.023095)     0.02662     3.45638     0.039153    0.354096
  thread0::fill_constant/compute           3000        81.7491     79.036379 (0.966816)    2.712756 (0.033184)     0.014608    3.4386      0.0272497   0.246444
  thread0::fill_constant/infer_shape       3000        7.49599     7.495994 (1.000000)     0.000000 (0.000000)     0.00146     0.026778    0.00249866  0.0225977
  thread0::fill_constant/prepare_data      3000        6.59352     6.593523 (1.000000)     0.000000 (0.000000)     0.00106     0.019621    0.00219784  0.0198771
thread0::fetch                             1000        40.5964     38.653231 (0.952135)    1.943129 (0.047865)     0.036823    0.066273    0.0405964   0.122383
  GpuMemcpySync:GPU->CPU                   1000        32.2001     30.257006 (0.939655)    1.943129 (0.060345)     0.029047    0.055709    0.0322001   0.0970718

Effect of this PR

The changes in this PR are as follows (a sketch of the Python-side change follows this list):

  • On the Python side, use device_guard to place start, end, and step on the CPU
  • Override GetKernelTypeForVar to avoid performing a data transform in the prepare data stage
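
A minimal sketch of the first change point, reusing the hypothetical wrapper from the sketch above (the helper name is again an illustrative assumption, not the actual diff): pinning the scalar-to-tensor conversion to the CPU means the tensors are already where the C++ range kernel expects them, and the GetKernelTypeForVar override on the C++ side then lets the framework accept the inputs in their current place instead of transforming them during prepare data.

import paddle.fluid as fluid
from paddle.fluid.framework import Variable

def _wrap_scalar_on_cpu(value, dtype):
    # Force the 1-element tensor onto the CPU, matching what the C++
    # range kernel requires, so prepare_data no longer copies it to the
    # GPU and compute no longer copies it back.
    if isinstance(value, Variable):
        return value
    with fluid.device_guard("cpu"):
        return fluid.layers.fill_constant(shape=[1], dtype=dtype,
                                          value=value)
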
  • Result for case 1:
-------------------------     Overhead Summary      -------------------------

Total time: 172.298
  Computation time       Total: 74.1934     Ratio: 43.061%
  Framework overhead     Total: 98.105      Ratio: 56.939%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 1000        Total: 43.8563     Ratio: 25.4537%
  GpuMemcpySync          Calls: 1000        Total: 43.8563     Ratio: 25.4537%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fill_constant                     3000        60.86       60.860046 (1.000000)    0.000000 (0.000000)     0.010521    0.451017    0.0202867   0.353225
  thread0::fill_constant/compute           3000        27.5878     27.587828 (1.000000)    0.000000 (0.000000)     0.004371    0.357691    0.00919594  0.160117
  thread0::fill_constant/infer_shape       3000        7.46453     7.464531 (1.000000)     0.000000 (0.000000)     0.001366    0.020888    0.00248818  0.0433233
  thread0::fill_constant/prepare_data      3000        6.23739     6.237388 (1.000000)     0.000000 (0.000000)     0.001089    0.018195    0.00207913  0.0362011
thread0::range                             1000        59.3519     57.977650 (0.976845)    1.374296 (0.023155)     0.051529    3.33877     0.0593519   0.344472
  thread0::range/compute                   1000        46.6056     45.231303 (0.970512)    1.374296 (0.029488)     0.03995     3.31218     0.0466056   0.270494
  thread0::range/infer_shape               1000        2.82845     2.828454 (1.000000)     0.000000 (0.000000)     0.002314    0.018182    0.00282845  0.016416
  thread0::range/prepare_data              1000        2.11326     2.113256 (1.000000)     0.000000 (0.000000)     0.001704    0.021893    0.00211326  0.0122651
thread0::fetch                             1000        52.0864     50.119871 (0.962245)    1.966519 (0.037755)     0.047372    0.901293    0.0520864   0.302303
  GpuMemcpySync:GPU->CPU                   1000        43.8563     41.889790 (0.955160)    1.966519 (0.044840)     0.039766    0.866008    0.0438563   0.254537
  • Result for case 2:
-------------------------     Overhead Summary      -------------------------

Total time: 168.058
  Computation time       Total: 72.2687     Ratio: 43.0021%
  Framework overhead     Total: 95.7898     Ratio: 56.9979%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 1000        Total: 42.8717     Ratio: 25.51%
  GpuMemcpySync          Calls: 1000        Total: 42.8717     Ratio: 25.51%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fill_constant                     3000        59.4739     59.473936 (1.000000)    0.000000 (0.000000)     0.010374    0.529456    0.0198246   0.353888
  thread0::fill_constant/compute           3000        27.0991     27.099101 (1.000000)    0.000000 (0.000000)     0.004413    0.435549    0.00903303  0.161248
  thread0::fill_constant/infer_shape       3000        7.0175      7.017504 (1.000000)     0.000000 (0.000000)     0.001326    0.021101    0.00233917  0.0417563
  thread0::fill_constant/prepare_data      3000        6.21491     6.214908 (1.000000)     0.000000 (0.000000)     0.000966    0.022756    0.00207164  0.0369806
thread0::range                             1000        57.6061     56.197842 (0.975554)    1.408245 (0.024446)     0.049659    3.34316     0.0576061   0.342774
  thread0::range/compute                   1000        45.1696     43.761325 (0.968823)    1.408245 (0.031177)     0.038493    3.31839     0.0451696   0.268773
  thread0::range/infer_shape               1000        2.67173     2.671733 (1.000000)     0.000000 (0.000000)     0.002283    0.018722    0.00267173  0.0158976
  thread0::range/prepare_data              1000        2.05729     2.057289 (1.000000)     0.000000 (0.000000)     0.001587    0.022467    0.00205729  0.0122415
thread0::fetch                             1000        50.9784     49.009426 (0.961376)    1.968981 (0.038624)     0.047114    0.800315    0.0509784   0.303337
  GpuMemcpySync:GPU->CPU                   1000        42.8717     40.902721 (0.954073)    1.968981 (0.045927)     0.039616    0.764595    0.0428717   0.2551
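
As a quick sanity check that the optimized op still produces the expected values, the following snippet (our illustration, not part of the PR) compares the fetched result against the values of Python's built-in range:

import numpy as np
import paddle.fluid as fluid

place = fluid.CUDAPlace(0)
res = fluid.layers.range(0, 10, 2, "int32")
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
out, = exe.run(fluid.default_main_program(), fetch_list=[res])
# The fetched array should equal [0, 2, 4, 6, 8].
assert np.array_equal(out, np.arange(0, 10, 2, dtype="int32"))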

@CLAassistant

CLAassistant commented Jul 29, 2020

CLA assistant check
All committers have signed the CLA.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot-old

paddle-bot-old bot commented Jul 29, 2020

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@zhangting2020 zhangting2020 force-pushed the range_op branch 2 times, most recently from 1f681df to 82a7947, on July 29, 2020 17:02
@zhangting2020 zhangting2020 merged commit 2d24f56 into PaddlePaddle:develop Jul 31, 2020
@zhangting2020 zhangting2020 deleted the range_op branch July 31, 2020 12:20