fix the bug of profile update #22207

wangchaochaohu · 2020-01-10T06:57:38Z

Fix incorrect output in the OP-level print. (Using the "_op" name suffix to identify the level is wrong in some cases.)

------------------------>     Profiling Report     <-------------------------

Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by  in descending order in the same thread

Event                                                                            Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
recurrent_grad                                                                   10          436.392     402.093838 (0.921405)   34.298122 (0.078595)    43.1854     44.113      43.6392     0.56828
  recurrent_grad/rnn_memory_helper_grad                                          1800        42.7983     40.060860 (0.936039)    2.737399 (0.063961)     0.017887    0.10409     0.0237768   0.0557329
    recurrent_grad/rnn_memory_helper_grad/fill_constant                          40          1.32887     1.272839 (0.957835)     0.056032 (0.042165)     0.022293    0.067966    0.0332218   0.00173049
    recurrent_grad/rnn_memory_helper_grad/GpuMemcpyAsync(same_gpu):GPU->GPU      1760        30.503      27.821594 (0.912095)    2.681367 (0.087905)     0.01313     0.043434    0.0173312   0.0397217
  recurrent_grad/sum                                                             2000        91.2966     82.077039 (0.899015)    9.219606 (0.100985)     0.023783    0.085409    0.0456483   0.118889
  recurrent_grad/elementwise_mul_grad                                            1200        34.1634     31.600653 (0.924986)    2.562742 (0.075014)     0.023277    0.055969    0.0284695   0.0444884
  recurrent_grad/sigmoid_grad                                                    1200        30.5571     28.620562 (0.936627)    1.936503 (0.063373)     0.020975    0.051665    0.0254642   0.0397922
  recurrent_grad/tanh_grad                                                       800         20.2586     18.982027 (0.936986)    1.276572 (0.063014)     0.02176     0.047516    0.0253232   0.0263812
  recurrent_grad/elementwise_add_grad                                            800         23.8116     22.299915 (0.936517)    1.511640 (0.063483)     0.023671    0.050221    0.0297644   0.031008
  recurrent_grad/slice_grad                                                      1600        46.3089     42.952063 (0.927511)    3.356878 (0.072489)     0.0237      0.072194    0.0289431   0.0603046
  recurrent_grad/matmul_grad                                                     400         38.7708     28.053024 (0.723561)    10.717781 (0.276439)    0.087226    0.14261     0.096927    0.0504883
  recurrent_grad/concat_grad                                                     400         14.007      13.164969 (0.939886)    0.842009 (0.060114)     0.030201    0.05642     0.0350174   0.0182402
  recurrent_grad/fill_constant                                                   40          1.12651     1.060274 (0.941199)     0.066240 (0.058801)     0.022679    0.042416    0.0281628   0.00146697
  recurrent_grad/GpuMemcpyAsync(same_gpu):GPU->GPU                               50          0.877214    0.806462 (0.919345)     0.070752 (0.080655)     0.013038    0.029127    0.0175443   0.00114233
recurrent                                                                        10          262.144     244.144337 (0.931336)   17.999832 (0.068664)    26.023      26.4119     26.2144     0.34137
  recurrent/concat                                                               400         13.4365     12.603564 (0.938010)    0.832922 (0.061990)     0.026359    0.090599    0.0335912   0.0174973
  recurrent/matmul                                                               400         23.5848     18.438992 (0.781817)    5.145800 (0.218183)     0.051514    0.138007    0.058962    0.0307127
  recurrent/elementwise_add                                                      800         25.4895     24.109520 (0.945859)    1.380024 (0.054141)     0.020585    0.097694    0.0318619   0.0331931
  recurrent/slice                                                                1600        47.4693     44.473107 (0.936882)    2.996178 (0.063118)     0.023984    0.168199    0.0296683   0.0618157
  recurrent/sigmoid                                                              1200        29.6012     27.497583 (0.928936)    2.103578 (0.071064)     0.020077    0.079791    0.0246676   0.0385474
  recurrent/elementwise_mul                                                      1200        29.0849     27.336452 (0.939884)    1.748472 (0.060116)     0.020059    0.044415    0.0242374   0.0378751
  recurrent/tanh                                                                 800         18.6475     17.366281 (0.931291)    1.281246 (0.068709)     0.01927     0.088796    0.0233094   0.0242833
  recurrent/rnn_memory_helper                                                    1800        38.8382     36.394516 (0.937080)    2.443708 (0.062920)     0.01686     0.053759    0.0215768   0.0505761
    recurrent/rnn_memory_helper/GpuMemcpyAsync(same_gpu):GPU->GPU                1800        29.9421     27.498431 (0.918386)    2.443708 (0.081614)     0.012972    0.04417     0.0166345   0.0389914
  recurrent/GpuMemcpyAsync(same_gpu):GPU->GPU                                    50          0.761822    0.693918 (0.910866)     0.067904 (0.089134)     0.013035    0.024854    0.0152364   0.000992063
matmul_grad                                                                      10          13.4956     0.743462 (0.055089)     12.752136 (0.944911)    1.33943     1.35865     1.34956     0.0175743
matmul                                                                           10          8.66756     0.456752 (0.052697)     8.210811 (0.947303)     0.860508    0.873576    0.866756    0.0112871
reduce_sum                                                                       80          4.71451     4.200428 (0.890958)     0.514078 (0.109042)     0.032381    0.11502     0.0589313   0.00613934
lookup_table                                                                     10          4.1311      4.029816 (0.975484)     0.101280 (0.024516)     0.060565    3.46527     0.41311     0.00537961
slice                                                                            80          2.91346     2.734096 (0.938437)     0.179360 (0.061563)     0.027227    0.201424    0.0364182   0.00379397
Fetch                                                                            40          2.85583     2.732436 (0.956793)     0.123392 (0.043207)     0.030818    0.146202    0.0713957   0.00371893
  Fetch/GpuMemcpyAsync:GPU->CPU                                                  40          1.39677     1.273379 (0.911659)     0.123392 (0.088341)     0.022792    0.051174    0.0349193   0.00181891
elementwise_mul                                                                  70          2.78601     2.184320 (0.784031)     0.601692 (0.215969)     0.029019    0.058539    0.0398002   0.00362801
transpose2                                                                       60          2.7026      2.549483 (0.943344)     0.153120 (0.056656)     0.024803    0.258202    0.0450434   0.0035194
softmax_with_cross_entropy                                                       10          2.53699     0.684673 (0.269877)     1.852312 (0.730123)     0.249168    0.265694    0.253698    0.00330372
square                                                                           70          2.30585     1.769183 (0.767258)     0.536669 (0.232742)     0.021543    0.067092    0.0329407   0.00300274
reshape2                                                                         100         1.96014     1.960142 (1.000000)     0.000000 (0.000000)     0.012659    0.260715    0.0196014   0.00255254
BufferedReader:MemoryCopy                                                        10          1.93579     1.891955 (0.977353)     0.043840 (0.022647)     0.144695    0.334621    0.193579    0.00252084
  BufferedReader:MemoryCopy/GpuMemcpyAsync:CUDAPinned->GPU                       20          1.20644     1.162603 (0.963662)     0.043840 (0.036338)     0.01371     0.219806    0.0603222   0.00157106
eager_deletion                                                                   590         1.59629     1.596294 (1.000000)     0.000000 (0.000000)     0.000834    0.03695     0.00270558  0.00207873
GpuMemcpyAsync:CPU->GPU                                                          30          1.55189     1.426967 (0.919500)     0.124928 (0.080500)     0.009081    0.931243    0.0517298   0.00202092
slice_grad                                                                       40          1.30013     1.193727 (0.918163)     0.106399 (0.081837)     0.025923    0.055385    0.0325032   0.00169306
reshape2_grad                                                                    100         1.24655     1.246548 (1.000000)     0.000000 (0.000000)     0.009139    0.027106    0.0124655   0.00162329
fill_zeros_like                                                                  40          1.12792     1.073166 (0.951458)     0.054751 (0.048542)     0.020381    0.049453    0.0281979   0.0014688
sum                                                                              30          1.07655     0.990853 (0.920397)     0.085696 (0.079603)     0.022263    0.060042    0.035885    0.00140191
sgd                                                                              10          1.02794     0.336584 (0.327436)     0.691355 (0.672564)     0.100568    0.108045    0.102794    0.00133861
concat                                                                           20          1.0055      0.951136 (0.945930)     0.054368 (0.054070)     0.044418    0.061511    0.0502752   0.00130939
elementwise_add_grad                                                             10          0.934496    0.383394 (0.410268)     0.551102 (0.589732)     0.087899    0.113423    0.0934496   0.00121692
softmax_with_cross_entropy_grad                                                  10          0.90735     0.446359 (0.491937)     0.460991 (0.508063)     0.087241    0.094091    0.090735    0.00118157
transpose2_grad                                                                  20          0.818608    0.757360 (0.925180)     0.061248 (0.074820)     0.030318    0.062308    0.0409304   0.00106601
elementwise_add                                                                  10          0.79809     0.388715 (0.487057)     0.409375 (0.512943)     0.078665    0.081977    0.079809    0.00103929
lookup_table_grad                                                                10          0.750815    0.541855 (0.721689)     0.208960 (0.278311)     0.071066    0.082765    0.0750815   0.00097773
FastThreadedSSAGraphExecutorPrepare                                              10          0.581394    0.581394 (1.000000)     0.000000 (0.000000)     0.050093    0.074508    0.0581394   0.000757105
reduce_mean                                                                      10          0.513593    0.477145 (0.929033)     0.036448 (0.070967)     0.047377    0.062268    0.0513593   0.000668813
fill_constant                                                                    10          0.491622    0.478470 (0.973248)     0.013152 (0.026752)     0.044415    0.067598    0.0491622   0.000640202
elementwise_max                                                                  10          0.474872    0.457688 (0.963813)     0.017184 (0.036187)     0.044684    0.060225    0.0474872   0.00061839
reduce_sum_grad                                                                  10          0.418248    0.397832 (0.951187)     0.020416 (0.048813)     0.03994     0.047595    0.0418248   0.000544653
reduce_mean_grad                                                                 10          0.403936    0.369216 (0.914046)     0.034720 (0.085954)     0.038054    0.042925    0.0403936   0.000526015
Scale LossGrad                                                                   10          0.348069    0.329541 (0.946769)     0.018528 (0.053231)     0.031085    0.04493     0.0348069   0.000453264
  Scale LossGrad/GpuMemcpyAsync:CPU->GPU                                         10          0.23927     0.220742 (0.922564)     0.018528 (0.077436)     0.021873    0.030115    0.023927    0.000311583
elementwise_div                                                                  10          0.347669    0.331477 (0.953427)     0.016192 (0.046573)     0.029805    0.044416    0.0347669   0.000452743
sqrt                                                                             10          0.280445    0.263838 (0.940783)     0.016607 (0.059217)     0.026611    0.031311    0.0280445   0.000365202
read                                                                             10          0.236643    0.236643 (1.000000)     0.000000 (0.000000)     0.020287    0.033824    0.0236643   0.000308162
  read/read                                                                      10          0.184806    0.184806 (1.000000)     0.000000 (0.000000)     0.016098    0.02902     0.0184806   0.000240659
create_double_buffer_reader                                                      10          0.073456    0.073456 (1.000000)     0.000000 (0.000000)     0.004812    0.012155    0.0073456   9.56562e-05
ScopeBufferedMonitor::post_local_exec_scopes_process                             10          0.045795    0.045795 (1.000000)     0.000000 (0.000000)     0.003517    0.007116    0.0045795   5.96354e-05
ScopeBufferedMonitor::pre_local_exec_scopes_process                              10          0.017431    0.017431 (1.000000)     0.000000 (0.000000)     0.00142     0.002097    0.0017431   2.26991e-05
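
As a reading aid for the report above, the per-row columns appear to be related by simple arithmetic: Ave. looks like Total divided by Calls, and the CPU and GPU ratios look like each time divided by Total. This is inferred from the printed numbers, not from the profiler source. A minimal sketch that checks it on the `recurrent_grad` row (the `Row` struct and helper names are hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for one report row; not a Paddle type.
struct Row {
  double calls;
  double total_ms;
  double cpu_ms;
  double gpu_ms;
};

// Values copied from the recurrent_grad row of the merged report.
Row RecurrentGradRow() { return {10, 436.392, 402.093838, 34.298122}; }

// Assumed relations between the columns.
double Average(const Row& r) { return r.total_ms / r.calls; }
double CpuRatio(const Row& r) { return r.cpu_ms / r.total_ms; }
double GpuRatio(const Row& r) { return r.gpu_ms / r.total_ms; }

// True when two values agree within the given tolerance.
bool Near(double a, double b, double tol) { return std::fabs(a - b) < tol; }
```

With these definitions the row's printed values (Ave. 43.6392, CPU ratio 0.921405, GPU ratio 0.078595) are reproduced to within rounding.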

------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by  in descending order in the same thread

Event                                                                                     Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread1::BufferedReader:MemoryCopy                                                        10          1.93579     1.891955 (0.977353)     0.043840 (0.022647)     0.144695    0.334621    0.193579    1
  thread1::BufferedReader:MemoryCopy/GpuMemcpyAsync:CUDAPinned->GPU                       20          1.20644     1.162603 (0.963662)     0.043840 (0.036338)     0.01371     0.219806    0.0603222   0.623229
thread0::recurrent_grad                                                                   10          436.392     402.093838 (0.921405)   34.298122 (0.078595)    43.1854     44.113      43.6392     0.569716
  thread0::recurrent_grad/rnn_memory_helper_grad                                          1800        42.7983     40.060860 (0.936039)    2.737399 (0.063961)     0.017887    0.10409     0.0237768   0.0558738
    thread0::recurrent_grad/rnn_memory_helper_grad/fill_constant                          40          1.32887     1.272839 (0.957835)     0.056032 (0.042165)     0.022293    0.067966    0.0332218   0.00173486
    thread0::recurrent_grad/rnn_memory_helper_grad/GpuMemcpyAsync(same_gpu):GPU->GPU      1760        30.503      27.821594 (0.912095)    2.681367 (0.087905)     0.01313     0.043434    0.0173312   0.0398221
  thread0::recurrent_grad/sum                                                             2000        91.2966     82.077039 (0.899015)    9.219606 (0.100985)     0.023783    0.085409    0.0456483   0.119189
  thread0::recurrent_grad/elementwise_mul_grad                                            1200        34.1634     31.600653 (0.924986)    2.562742 (0.075014)     0.023277    0.055969    0.0284695   0.0446008
  thread0::recurrent_grad/sigmoid_grad                                                    1200        30.5571     28.620562 (0.936627)    1.936503 (0.063373)     0.020975    0.051665    0.0254642   0.0398927
  thread0::recurrent_grad/tanh_grad                                                       800         20.2586     18.982027 (0.936986)    1.276572 (0.063014)     0.02176     0.047516    0.0253232   0.0264479
  thread0::recurrent_grad/elementwise_add_grad                                            800         23.8116     22.299915 (0.936517)    1.511640 (0.063483)     0.023671    0.050221    0.0297644   0.0310863
  thread0::recurrent_grad/slice_grad                                                      1600        46.3089     42.952063 (0.927511)    3.356878 (0.072489)     0.0237      0.072194    0.0289431   0.060457
  thread0::recurrent_grad/matmul_grad                                                     400         38.7708     28.053024 (0.723561)    10.717781 (0.276439)    0.087226    0.14261     0.096927    0.0506159
  thread0::recurrent_grad/concat_grad                                                     400         14.007      13.164969 (0.939886)    0.842009 (0.060114)     0.030201    0.05642     0.0350174   0.0182863
  thread0::recurrent_grad/fill_constant                                                   40          1.12651     1.060274 (0.941199)     0.066240 (0.058801)     0.022679    0.042416    0.0281628   0.00147068
  thread0::recurrent_grad/GpuMemcpyAsync(same_gpu):GPU->GPU                               50          0.877214    0.806462 (0.919345)     0.070752 (0.080655)     0.013038    0.029127    0.0175443   0.00114522
thread0::recurrent                                                                        10          262.144     244.144337 (0.931336)   17.999832 (0.068664)    26.023      26.4119     26.2144     0.342233
  thread0::recurrent/concat                                                               400         13.4365     12.603564 (0.938010)    0.832922 (0.061990)     0.026359    0.090599    0.0335912   0.0175415
  thread0::recurrent/matmul                                                               400         23.5848     18.438992 (0.781817)    5.145800 (0.218183)     0.051514    0.138007    0.058962    0.0307903
  thread0::recurrent/elementwise_add                                                      800         25.4895     24.109520 (0.945859)    1.380024 (0.054141)     0.020585    0.097694    0.0318619   0.033277
  thread0::recurrent/slice                                                                1600        47.4693     44.473107 (0.936882)    2.996178 (0.063118)     0.023984    0.168199    0.0296683   0.0619719
  thread0::recurrent/sigmoid                                                              1200        29.6012     27.497583 (0.928936)    2.103578 (0.071064)     0.020077    0.079791    0.0246676   0.0386448
  thread0::recurrent/elementwise_mul                                                      1200        29.0849     27.336452 (0.939884)    1.748472 (0.060116)     0.020059    0.044415    0.0242374   0.0379708
  thread0::recurrent/tanh                                                                 800         18.6475     17.366281 (0.931291)    1.281246 (0.068709)     0.01927     0.088796    0.0233094   0.0243446
  thread0::recurrent/rnn_memory_helper                                                    1800        38.8382     36.394516 (0.937080)    2.443708 (0.062920)     0.01686     0.053759    0.0215768   0.0507039
    thread0::recurrent/rnn_memory_helper/GpuMemcpyAsync(same_gpu):GPU->GPU                1800        29.9421     27.498431 (0.918386)    2.443708 (0.081614)     0.012972    0.04417     0.0166345   0.0390899
  thread0::recurrent/GpuMemcpyAsync(same_gpu):GPU->GPU                                    50          0.761822    0.693918 (0.910866)     0.067904 (0.089134)     0.013035    0.024854    0.0152364   0.00099457
thread0::matmul_grad                                                                      10          13.4956     0.743462 (0.055089)     12.752136 (0.944911)    1.33943     1.35865     1.34956     0.0176187
thread0::matmul                                                                           10          8.66756     0.456752 (0.052697)     8.210811 (0.947303)     0.860508    0.873576    0.866756    0.0113156
thread0::reduce_sum                                                                       80          4.71451     4.200428 (0.890958)     0.514078 (0.109042)     0.032381    0.11502     0.0589313   0.00615486
thread0::lookup_table                                                                     10          4.1311      4.029816 (0.975484)     0.101280 (0.024516)     0.060565    3.46527     0.41311     0.00539321
thread0::slice                                                                            80          2.91346     2.734096 (0.938437)     0.179360 (0.061563)     0.027227    0.201424    0.0364182   0.00380356
thread0::Fetch                                                                            40          2.85583     2.732436 (0.956793)     0.123392 (0.043207)     0.030818    0.146202    0.0713957   0.00372833
  thread0::Fetch/GpuMemcpyAsync:GPU->CPU                                                  40          1.39677     1.273379 (0.911659)     0.123392 (0.088341)     0.022792    0.051174    0.0349193   0.00182351
thread0::elementwise_mul                                                                  70          2.78601     2.184320 (0.784031)     0.601692 (0.215969)     0.029019    0.058539    0.0398002   0.00363718
thread0::transpose2                                                                       60          2.7026      2.549483 (0.943344)     0.153120 (0.056656)     0.024803    0.258202    0.0450434   0.00352829
thread0::softmax_with_cross_entropy                                                       10          2.53699     0.684673 (0.269877)     1.852312 (0.730123)     0.249168    0.265694    0.253698    0.00331207
thread0::square                                                                           70          2.30585     1.769183 (0.767258)     0.536669 (0.232742)     0.021543    0.067092    0.0329407   0.00301033
thread0::reshape2                                                                         100         1.96014     1.960142 (1.000000)     0.000000 (0.000000)     0.012659    0.260715    0.0196014   0.002559
thread0::eager_deletion                                                                   590         1.59629     1.596294 (1.000000)     0.000000 (0.000000)     0.000834    0.03695     0.00270558  0.00208399
thread0::GpuMemcpyAsync:CPU->GPU                                                          30          1.55189     1.426967 (0.919500)     0.124928 (0.080500)     0.009081    0.931243    0.0517298   0.00202602
thread0::slice_grad                                                                       40          1.30013     1.193727 (0.918163)     0.106399 (0.081837)     0.025923    0.055385    0.0325032   0.00169733
thread0::reshape2_grad                                                                    100         1.24655     1.246548 (1.000000)     0.000000 (0.000000)     0.009139    0.027106    0.0124655   0.00162739
thread0::fill_zeros_like                                                                  40          1.12792     1.073166 (0.951458)     0.054751 (0.048542)     0.020381    0.049453    0.0281979   0.00147251
thread0::sum                                                                              30          1.07655     0.990853 (0.920397)     0.085696 (0.079603)     0.022263    0.060042    0.035885    0.00140545
thread0::sgd                                                                              10          1.02794     0.336584 (0.327436)     0.691355 (0.672564)     0.100568    0.108045    0.102794    0.00134199
thread0::concat                                                                           20          1.0055      0.951136 (0.945930)     0.054368 (0.054070)     0.044418    0.061511    0.0502752   0.0013127
thread0::elementwise_add_grad                                                             10          0.934496    0.383394 (0.410268)     0.551102 (0.589732)     0.087899    0.113423    0.0934496   0.00122
thread0::softmax_with_cross_entropy_grad                                                  10          0.90735     0.446359 (0.491937)     0.460991 (0.508063)     0.087241    0.094091    0.090735    0.00118456
thread0::transpose2_grad                                                                  20          0.818608    0.757360 (0.925180)     0.061248 (0.074820)     0.030318    0.062308    0.0409304   0.00106871
thread0::elementwise_add                                                                  10          0.79809     0.388715 (0.487057)     0.409375 (0.512943)     0.078665    0.081977    0.079809    0.00104192
thread0::lookup_table_grad                                                                10          0.750815    0.541855 (0.721689)     0.208960 (0.278311)     0.071066    0.082765    0.0750815   0.0009802
thread0::FastThreadedSSAGraphExecutorPrepare                                              10          0.581394    0.581394 (1.000000)     0.000000 (0.000000)     0.050093    0.074508    0.0581394   0.000759019
thread0::reduce_mean                                                                      10          0.513593    0.477145 (0.929033)     0.036448 (0.070967)     0.047377    0.062268    0.0513593   0.000670504
thread0::fill_constant                                                                    10          0.491622    0.478470 (0.973248)     0.013152 (0.026752)     0.044415    0.067598    0.0491622   0.00064182
thread0::elementwise_max                                                                  10          0.474872    0.457688 (0.963813)     0.017184 (0.036187)     0.044684    0.060225    0.0474872   0.000619953
thread0::reduce_sum_grad                                                                  10          0.418248    0.397832 (0.951187)     0.020416 (0.048813)     0.03994     0.047595    0.0418248   0.000546029
thread0::reduce_mean_grad                                                                 10          0.403936    0.369216 (0.914046)     0.034720 (0.085954)     0.038054    0.042925    0.0403936   0.000527345
thread0::Scale LossGrad                                                                   10          0.348069    0.329541 (0.946769)     0.018528 (0.053231)     0.031085    0.04493     0.0348069   0.000454409
  thread0::Scale LossGrad/GpuMemcpyAsync:CPU->GPU                                         10          0.23927     0.220742 (0.922564)     0.018528 (0.077436)     0.021873    0.030115    0.023927    0.000312371
thread0::elementwise_div                                                                  10          0.347669    0.331477 (0.953427)     0.016192 (0.046573)     0.029805    0.044416    0.0347669   0.000453887
thread0::sqrt                                                                             10          0.280445    0.263838 (0.940783)     0.016607 (0.059217)     0.026611    0.031311    0.0280445   0.000366125
thread0::read                                                                             10          0.236643    0.236643 (1.000000)     0.000000 (0.000000)     0.020287    0.033824    0.0236643   0.000308941
  thread0::read/read                                                                      10          0.184806    0.184806 (1.000000)     0.000000 (0.000000)     0.016098    0.02902     0.0184806   0.000241267
thread0::create_double_buffer_reader                                                      10          0.073456    0.073456 (1.000000)     0.000000 (0.000000)     0.004812    0.012155    0.0073456   9.58979e-05
thread0::ScopeBufferedMonitor::post_local_exec_scopes_process                             10          0.045795    0.045795 (1.000000)     0.000000 (0.000000)     0.003517    0.007116    0.0045795   5.97861e-05
thread0::ScopeBufferedMonitor::pre_local_exec_scopes_process                              10          0.017431    0.017431 (1.000000)     0.000000 (0.000000)     0.00142     0.002097    0.0017431   2.27564e-05

zhaoyuchen2018 · 2020-01-10T09:30:37Z

paddle/fluid/platform/profiler.cc

      for (auto it = child_map.begin(); it != child_map.end(); it++) {
        if (it->first == event_item.name) {
          table.push_back(it->second);
-          do_next = it->second.name.rfind(op_end_str) ==
-                    (it->second.name.length() - op_end_str.length());
+          if (!do_next)


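What is this piece of code for? Also, it could actually be written as: do_next = it->second.name.rfind(op_end_str) !=
(it->second.name.length() - op_end_str.length())

if (do_next)
print_depth_next = print_depth + 1; Why is print_depth only incremented by 1 here? Weren't many items pushed in earlier?

This is a control to avoid printing detail information under an OP, such as compute (later it will be controlled by a parameter added through Python).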
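
For context on the suffix test discussed in this thread, here is a minimal sketch of why an `rfind`-based check can misfire. It assumes `op_end_str` holds the "_op" suffix mentioned in the description; the function names are hypothetical and are not part of profiler.cc.

```cpp
#include <cassert>
#include <string>

// Suffix test in the style of the removed lines: compare the position of the
// last occurrence of `suffix` with the position where a suffix would start.
bool EndsWithRfind(const std::string& name, const std::string& suffix) {
  return name.rfind(suffix) == (name.length() - suffix.length());
}

// Safer variant: check the lengths first, then compare the tail directly.
bool EndsWith(const std::string& name, const std::string& suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

When the name is exactly one character shorter than the suffix, `rfind` returns `std::string::npos` and the unsigned subtraction wraps around to the same value, so `EndsWithRfind("ab", "_op")` reports a match while `EndsWith("ab", "_op")` does not. Even a correct suffix test still depends on every event following the naming convention.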

zhaoyuchen2018

LGTM

Xreki · 2020-01-13T01:47:46Z

Stop distinguishing events by a name suffix; each event should have a defined role.
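
One way to read this suggestion is to attach an explicit role to each event and let the printer branch on it. The sketch below is only an illustration under that assumption: `EventRole`, `EventItem`, and `ShouldPrintChildren` are hypothetical names, not the profiler's actual types.

```cpp
#include <cassert>
#include <string>

// Hypothetical role tag carried by every event (illustrative names only).
enum class EventRole { kOperator, kOther };

// Hypothetical event record: the role is stored, not derived from the name.
struct EventItem {
  std::string name;
  EventRole role;
};

// Decide from the role whether the printer should descend into an event's
// children. Operator details are shown only when explicitly requested.
bool ShouldPrintChildren(const EventItem& item, bool print_op_detail) {
  if (item.role == EventRole::kOperator) {
    return print_op_detail;
  }
  return true;
}
```

With a stored role, renaming an event or adding one whose name happens to end in "_op" no longer changes what gets printed.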

* Add the first implememtation of fusion_group op PaddlePaddle#19621 (#3) * Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc. test=develop * Call CUDA driver api to launch the kernel compiled by nvrtc. test=develop * Disable for mac and windows. test=develop * Refine the codes to support manually specified num_threads and workload_per_thread. test=develop * Refine the CUDA kernel to support large dims. test=develop * Add DeviceCodePool to manage all device codes. * Add the first implementation fusion_group op. * Add unit-test for fusion_group op. * Add the check of result. * Add the check of nvrtc in unit-test. test=develop * Add comment to explain the inputs, outputs and features of fusion_group op. test=develop * Disable fusion_group op for mac and windows. test=develop * Make the compiling of device code return status instead of hanging up. test=develop * Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API. * Unify fusion_group_op's input and output names. test=develop * Add the check of CUDA driver library in unittest. test=develop * Enable generating code for a given subgraph. PaddlePaddle#21126 (#4) * Enable generating code for a given subgraph. * Support sorting the subgraph. * Remove the rearange of expressions because we use the sorted subgraph directly. * Enable generating code for a subgraph which is composed of grad ops. * Use expression information to check the accuracy in unittest. * Separate load and store from computation expressions. test=develop * Improve the loading statements in generated codes. test=develop * Remove unused arguments from formal list. test=develop * Enable the detection of subgraph of grad ops. * Generate code for detected subgraph in fusion_group_pass. * Add an option in BuildStrategy to enable fusion_group_pass and add unittest. test=develop * Fix a bug when checking whether the shape of all inputs are the same. 
* Add debug information. * Remove subgraph_detector from inference/analysis to the common framework/ir directory. (#5) test=develop * Call subgraph_detector in fusion_group pass. test=develop * Disable fusion_group when WITH_GPU is OFF. test=develop * Refine all PADDLE_ENFORCE message. test=develop * Fix the case that some inputs are not defined in grad ops, and set op_role for fused op. test=develop * add backward gradient computation for op argsort (PaddlePaddle#22203) * add backward gradient computation for op argsort test=developo * use pre-commit test=develop * fix the bug of profile update (PaddlePaddle#22207) * fix the bug of profile update test=develop * add NotImplementedError for multi optimizers (PaddlePaddle#22181) * add NotImplementedError for multi optimizers used on multi-places . test=develop * assert error only if num_devices>1. test=develop * set test_optimizer_in_control_flow in CMakeLists for using multi-GPU.test=develop * support fluid-lite subgraph run resnet test=develop (PaddlePaddle#22191) - Added a unit test that runs resnet in fluid-lite subgraph mode - Updated the git commit id of the Lite dependency * fix bug fot test_dygraph_mnist_fp16.py, test=develop (PaddlePaddle#22222) * Check dygraph weight name (PaddlePaddle#22140) * add parameter check; test=develop * change parameter name checker in dygraph guard; test=develop * fix test layers error; test=develop * revert some code to develop; test=develop * fix exampel error; test=develop * fix comment error; test=develop * fix comment error; test=develop * only import used test case and function(PaddlePaddle#22208) Co-authored-by: FlyingQianMM <245467267@qq.com> Co-authored-by: wangchaochaohu <wangchao66@baidu.com> Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com> Co-authored-by: Wilber <jiweibo1028@outlook.com> Co-authored-by: zhongpu <2013000149@qq.com> Co-authored-by: hong <43953930+phlrain@users.noreply.github.com> Co-authored-by: Zhang Ting <709968123@qq.com>

fix the bug of profile bug test=develop

127012b

wangchaochaohu requested review from Xreki, zhaoyuchen2018 and luotao1 January 10, 2020 06:57

fix print error test=develop

380850b

wangchaochaohu changed the title ~~fix the bug of profile bug~~ fix the bug of profile update Jan 10, 2020

zhaoyuchen2018 reviewed Jan 10, 2020


zhaoyuchen2018 self-requested a review January 10, 2020 09:37

zhaoyuchen2018 approved these changes Jan 10, 2020


wangchaochaohu merged commit 621d3e0 into PaddlePaddle:develop Jan 10, 2020

Xreki mentioned this pull request Jan 19, 2020

add flag to control profile level in python API #22319

Merged
