avoid data transfer #25810

Merged: zhangting2020 merged 1 commit into PaddlePaddle:develop from range_op on Jul 31, 2020
Conversation

@zhangting2020 zhangting2020 (Contributor) commented Jul 29, 2020

PR types

Performance optimization

PR changes

OPs

Describe

avoid data transfer

In the Python API of range, start, end, and step can each be a scalar (float32 | float64 | int32 | int64) or a Variable, but the C++ interface only accepts Tensors and requires those Tensors to be on the CPU.
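
For orientation, here is a minimal sketch of how such a Python wrapper typically turns scalar arguments into tensors; the helper name _wrap_scalar is an illustrative assumption, not the actual fluid implementation:

import paddle.fluid as fluid
from paddle.fluid.framework import Variable

def _wrap_scalar(value, dtype):
    # Illustrative assumption: a scalar argument is wrapped into a
    # 1-element tensor via fill_constant. Without device_guard("cpu"),
    # that tensor is created on the executor's place, i.e. on the GPU
    # under CUDAPlace(0), which is exactly what forces the copies
    # measured below.
    if isinstance(value, Variable):
        return value
    return fluid.layers.fill_constant(shape=[1], dtype=dtype, value=value)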

case 1

If start, end, and step are all constants, the behavior in GPU mode is:

  • The Python side constructs 3 tensors but does not place them on the CPU, so in GPU mode all 3 tensors live on the GPU
  • The compute stage then has to copy the 3 tensors to the CPU
  • Test code:
import paddle.fluid as fluid
from paddle.fluid import profiler

times = 1000
place = fluid.CUDAPlace(0)
# start, end, and step are all Python constants.
res = fluid.layers.range(0, 10, 2, "int32")
exe = fluid.Executor(place)

exe.run(fluid.default_startup_program())
profiler.start_profiler("All", "OpDetail")
for i in range(times):
    out = exe.run(fluid.default_main_program(),
                  fetch_list=[res])
profiler.stop_profiler("total", "./profile/test")
  • In the profiling report below, data copies occur in the range/compute stage:
-------------------------     Overhead Summary      -------------------------

Total time: 300.892
  Computation time       Total: 209.609     Ratio: 69.6625%
  Framework overhead     Total: 91.283      Ratio: 30.3375%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 4000        Total: 115.244     Ratio: 38.3008%
  GpuMemcpyAsync         Calls: 3000        Total: 81.1267     Ratio: 26.9621%
  GpuMemcpySync          Calls: 1000        Total: 34.1173     Ratio: 11.3387%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::range                             1000        130.601     123.384089 (0.944742)   7.216688 (0.055258)     0.120114    1.19753     0.130601    0.434046
  thread0::range/compute                   1000        117.693     110.476077 (0.938682)   7.216688 (0.061318)     0.108894    1.1703      0.117693    0.391147
    GpuMemcpyAsync:GPU->CPU                3000        81.1267     75.253854 (0.927609)    5.872823 (0.072391)     0.020564    0.744466    0.0270422   0.269621
  thread0::range/infer_shape               1000        2.89403     2.894027 (1.000000)     0.000000 (0.000000)     0.00236     0.019389    0.00289403  0.00961817
  thread0::range/prepare_data              1000        2.11671     2.116714 (1.000000)     0.000000 (0.000000)     0.001657    0.019414    0.00211671  0.00703481
thread0::fill_constant                     3000        127.082     123.004245 (0.967915)   4.077426 (0.032085)     0.024857    3.61854     0.0423606   0.42235
  thread0::fill_constant/compute           3000        91.9158     87.838392 (0.955640)    4.077426 (0.044360)     0.017812    3.545       0.0306386   0.305478
  thread0::fill_constant/infer_shape       3000        7.75713     7.757128 (1.000000)     0.000000 (0.000000)     0.001481    0.030565    0.00258571  0.0257805
  thread0::fill_constant/prepare_data      3000        6.41637     6.416367 (1.000000)     0.000000 (0.000000)     0.001127    0.017772    0.00213879  0.0213245
thread0::fetch                             1000        43.2092     41.269054 (0.955099)    1.940120 (0.044901)     0.03906     0.335819    0.0432092   0.143604
  GpuMemcpySync:GPU->CPU                   1000        34.1173     32.177135 (0.943134)    1.940120 (0.056866)     0.031359    0.054941    0.0341173   0.113387

case 2

If start, end, and step are a mix of constants and tensors (for example, start and end are constants while step is a tensor placed on the CPU), the behavior in GPU mode is:

  • The Python side constructs 2 tensors for start and end, which will live on the GPU; step, however, is already on the CPU
  • In the prepare data stage, step is copied to the GPU
  • In the compute stage, all 3 tensors (start, end, and step) are copied to the CPU
  • Test code:
import paddle.fluid as fluid
from paddle.fluid import profiler

times = 1000
place = fluid.CUDAPlace(0)
# step is a tensor explicitly placed on the CPU via device_guard;
# start and end remain Python constants.
with fluid.device_guard("cpu"):
    step = fluid.layers.fill_constant(shape=[1], dtype="int32", value=2)
res = fluid.layers.range(0, 10, step, "int32")
exe = fluid.Executor(place)

exe.run(fluid.default_startup_program())
profiler.start_profiler("All", "OpDetail")
for i in range(times):
    out = exe.run(fluid.default_main_program(),
                  fetch_list=[res])
profiler.stop_profiler("total", "./profile/test")
  • Data copies occur in both the range/prepare_data and the range/compute stages:
-------------------------     Overhead Summary      -------------------------

Total time: 331.715
  Computation time       Total: 188.172     Ratio: 56.727%
  Framework overhead     Total: 143.543     Ratio: 43.273%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 5000        Total: 138.619     Ratio: 41.7886%
  GpuMemcpyAsync         Calls: 3000        Total: 71.2764     Ratio: 21.4873%
  GpuMemcpySync          Calls: 2000        Total: 67.3424     Ratio: 20.3013%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::range                             1000        173.659     164.589584 (0.947772)   9.069838 (0.052228)     0.161416    0.931827    0.173659    0.52352
  thread0::range/compute                   1000        106.423     99.196153 (0.932095)    7.226681 (0.067905)     0.098196    0.177043    0.106423    0.320826
    GpuMemcpyAsync:GPU->CPU                3000        71.2764     65.405500 (0.917631)    5.870941 (0.082369)     0.01986     0.053954    0.0237588   0.214873
  thread0::range/prepare_data              1000        55.2602     53.417038 (0.966646)    1.843157 (0.033354)     0.05008     0.746606    0.0552602   0.16659
    GpuMemcpySync:CPU->GPU                 1000        35.1423     33.299096 (0.947552)    1.843157 (0.052448)     0.03157     0.656173    0.0351423   0.105941
  thread0::range/infer_shape               1000        3.52006     3.520062 (1.000000)     0.000000 (0.000000)     0.002808    0.024192    0.00352006  0.0106117
thread0::fill_constant                     3000        117.459     114.746186 (0.976905)   2.712756 (0.023095)     0.02662     3.45638     0.039153    0.354096
  thread0::fill_constant/compute           3000        81.7491     79.036379 (0.966816)    2.712756 (0.033184)     0.014608    3.4386      0.0272497   0.246444
  thread0::fill_constant/infer_shape       3000        7.49599     7.495994 (1.000000)     0.000000 (0.000000)     0.00146     0.026778    0.00249866  0.0225977
  thread0::fill_constant/prepare_data      3000        6.59352     6.593523 (1.000000)     0.000000 (0.000000)     0.00106     0.019621    0.00219784  0.0198771
thread0::fetch                             1000        40.5964     38.653231 (0.952135)    1.943129 (0.047865)     0.036823    0.066273    0.0405964   0.122383
  GpuMemcpySync:GPU->CPU                   1000        32.2001     30.257006 (0.939655)    1.943129 (0.060345)     0.029047    0.055709    0.0322001   0.0970718

Effect of this PR

The changes in this PR are as follows (a sketch of the Python-side change follows this list):

  • On the Python side, use device_guard to place start, end, and step on the CPU
  • Override GetKernelTypeForVar to avoid performing a data transform in the prepare data stage
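
A minimal sketch of the first change point, reusing the hypothetical wrapper from the sketch above (the helper name is again an illustrative assumption, not the actual diff): pinning the scalar-to-tensor conversion to the CPU means the tensors are already where the C++ range kernel expects them, and the GetKernelTypeForVar override on the C++ side then lets the framework accept the inputs in their current place instead of transforming them during prepare data.

import paddle.fluid as fluid
from paddle.fluid.framework import Variable

def _wrap_scalar_on_cpu(value, dtype):
    # Force the 1-element tensor onto the CPU, matching what the C++
    # range kernel requires, so prepare_data no longer copies it to the
    # GPU and compute no longer copies it back.
    if isinstance(value, Variable):
        return value
    with fluid.device_guard("cpu"):
        return fluid.layers.fill_constant(shape=[1], dtype=dtype,
                                          value=value)
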
  • Result for case 1:
-------------------------     Overhead Summary      -------------------------

Total time: 172.298
  Computation time       Total: 74.1934     Ratio: 43.061%
  Framework overhead     Total: 98.105      Ratio: 56.939%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 1000        Total: 43.8563     Ratio: 25.4537%
  GpuMemcpySync          Calls: 1000        Total: 43.8563     Ratio: 25.4537%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fill_constant                     3000        60.86       60.860046 (1.000000)    0.000000 (0.000000)     0.010521    0.451017    0.0202867   0.353225
  thread0::fill_constant/compute           3000        27.5878     27.587828 (1.000000)    0.000000 (0.000000)     0.004371    0.357691    0.00919594  0.160117
  thread0::fill_constant/infer_shape       3000        7.46453     7.464531 (1.000000)     0.000000 (0.000000)     0.001366    0.020888    0.00248818  0.0433233
  thread0::fill_constant/prepare_data      3000        6.23739     6.237388 (1.000000)     0.000000 (0.000000)     0.001089    0.018195    0.00207913  0.0362011
thread0::range                             1000        59.3519     57.977650 (0.976845)    1.374296 (0.023155)     0.051529    3.33877     0.0593519   0.344472
  thread0::range/compute                   1000        46.6056     45.231303 (0.970512)    1.374296 (0.029488)     0.03995     3.31218     0.0466056   0.270494
  thread0::range/infer_shape               1000        2.82845     2.828454 (1.000000)     0.000000 (0.000000)     0.002314    0.018182    0.00282845  0.016416
  thread0::range/prepare_data              1000        2.11326     2.113256 (1.000000)     0.000000 (0.000000)     0.001704    0.021893    0.00211326  0.0122651
thread0::fetch                             1000        52.0864     50.119871 (0.962245)    1.966519 (0.037755)     0.047372    0.901293    0.0520864   0.302303
  GpuMemcpySync:GPU->CPU                   1000        43.8563     41.889790 (0.955160)    1.966519 (0.044840)     0.039766    0.866008    0.0438563   0.254537
  • Result for case 2:
-------------------------     Overhead Summary      -------------------------

Total time: 168.058
  Computation time       Total: 72.2687     Ratio: 43.0021%
  Framework overhead     Total: 95.7898     Ratio: 56.9979%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 1000        Total: 42.8717     Ratio: 25.51%
  GpuMemcpySync          Calls: 1000        Total: 42.8717     Ratio: 25.51%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fill_constant                     3000        59.4739     59.473936 (1.000000)    0.000000 (0.000000)     0.010374    0.529456    0.0198246   0.353888
  thread0::fill_constant/compute           3000        27.0991     27.099101 (1.000000)    0.000000 (0.000000)     0.004413    0.435549    0.00903303  0.161248
  thread0::fill_constant/infer_shape       3000        7.0175      7.017504 (1.000000)     0.000000 (0.000000)     0.001326    0.021101    0.00233917  0.0417563
  thread0::fill_constant/prepare_data      3000        6.21491     6.214908 (1.000000)     0.000000 (0.000000)     0.000966    0.022756    0.00207164  0.0369806
thread0::range                             1000        57.6061     56.197842 (0.975554)    1.408245 (0.024446)     0.049659    3.34316     0.0576061   0.342774
  thread0::range/compute                   1000        45.1696     43.761325 (0.968823)    1.408245 (0.031177)     0.038493    3.31839     0.0451696   0.268773
  thread0::range/infer_shape               1000        2.67173     2.671733 (1.000000)     0.000000 (0.000000)     0.002283    0.018722    0.00267173  0.0158976
  thread0::range/prepare_data              1000        2.05729     2.057289 (1.000000)     0.000000 (0.000000)     0.001587    0.022467    0.00205729  0.0122415
thread0::fetch                             1000        50.9784     49.009426 (0.961376)    1.968981 (0.038624)     0.047114    0.800315    0.0509784   0.303337
  GpuMemcpySync:GPU->CPU                   1000        42.8717     40.902721 (0.954073)    1.968981 (0.045927)     0.039616    0.764595    0.0428717   0.2551
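
As a quick sanity check that the optimized op still produces the expected values, the following snippet (our illustration, not part of the PR) compares the fetched result against the values of Python's built-in range:

import numpy as np
import paddle.fluid as fluid

place = fluid.CUDAPlace(0)
res = fluid.layers.range(0, 10, 2, "int32")
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
out, = exe.run(fluid.default_main_program(), fetch_list=[res])
# The fetched array should equal [0, 2, 4, 6, 8].
assert np.array_equal(out, np.arange(0, 10, 2, dtype="int32"))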

@CLAassistant

CLAassistant commented Jul 29, 2020

CLA assistant check
All committers have signed the CLA.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot-old

paddle-bot-old bot commented Jul 29, 2020

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@zhangting2020 zhangting2020 force-pushed the range_op branch 2 times, most recently from 1f681df to 82a7947, on July 29, 2020 17:02
@zhangting2020 zhangting2020 merged commit 2d24f56 into PaddlePaddle:develop Jul 31, 2020
@zhangting2020 zhangting2020 deleted the range_op branch July 31, 2020 12:20