Compare Inference Perf Between CPU and MKLDNN #10651

Closed

tensor-tang opened this issue May 15, 2018 · 4 comments

tensor-tang (Contributor) commented May 15, 2018

Run a quick comparison on test_inference_image_classification to establish a baseline, at least:

  1. compare time op by op, and in total.
  2. measure the creation time of the MKLDNN primitives.

Based on dfdcb7e

@tensor-tang tensor-tang self-assigned this May 15, 2018
@luotao1 luotao1 added the Intel label May 15, 2018

tensor-tang commented May 15, 2018

ResNet Inference Results (average run_inference time, in ms)

name      multi-threads    single thread
MKLDNN    138.282          719.249
CPU       564.056          709.05

Server: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets, 6 cores per socket

Details:

BatchSize: 64

Repeat: 100
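A quick sanity read of the summary table above (a sketch; the numbers are the averaged run_inference times from the profiler reports below):

```python
# Speedup of MKLDNN over plain CPU, from the average run_inference times (ms).
mkldnn = {"multi-threads": 138.282, "single thread": 719.249}
cpu    = {"multi-threads": 564.056, "single thread": 709.05}

for mode in mkldnn:
    print(f"{mode}: {cpu[mode] / mkldnn[mode]:.2f}x")
```

Multi-threaded, MKLDNN comes out roughly 4x faster; single-threaded, the two are about on par.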

MKLDNN multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.20932     0.009643    0.062216    0.0177841
thread0::init_program    1           2.68691     2.68691     2.68691     2.68691


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.299036    0.002371    0.004543    0.00299036
thread0::conv2d             3300        9000.44     0.195592    30.1177     2.72741
thread0::elementwise_add    4900        4070.66     0.008136    20.9259     0.830748
thread0::relu               3100        606.868     0.074293    4.75193     0.195764
thread0::pool2d             100         20.8408     0.123946    1.13076     0.208408
thread0::mul                100         7.87092     0.072069    0.109052    0.0787092
thread0::softmax            100         8.86674     0.042954    1.37156     0.0886674
thread0::fetch              100         0.731889    0.00641     0.017074    0.00731889
thread0::run_inference      100         13828.2     115.068     402.65      138.282
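In these profiler reports the Ave. column is simply Total / Calls; checking conv2d from the MKLDNN multi-thread run above:

```python
# The profiler's Ave. column is Total / Calls (both in ms).
total_ms, calls = 9000.44, 3300   # conv2d row from the report above
ave = total_ms / calls
print(f"{ave:.5f}")  # 2.72741, matching the reported Ave.
```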

CPU multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.14652     0.009442    0.053964    0.0168606
thread0::init_program    1           2.61936     2.61936     2.61936     2.61936


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.322211    0.00269     0.008291    0.00322211
thread0::conv2d             3300        51572.6     1.13987     225.607     15.6281
thread0::elementwise_add    4900        4071.9      0.006965    11.3786     0.831
thread0::relu               3100        590.259     0.073982    4.52276     0.190406
thread0::pool2d             100         31.1838     0.306259    0.401829    0.311838
thread0::mul                100         3.02121     0.026938    0.076059    0.0302121
thread0::softmax            100         25.8391     0.254502    0.295943    0.258391
thread0::fetch              100         0.680716    0.005706    0.017241    0.00680716
thread0::run_inference      100         56405.6     454.744     1020.46     564.056

MKLDNN single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.11259     0.009476    0.0491      0.0163616
thread0::init_program    1           2.58213     2.58213     2.58213     2.58213


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.308051    0.002513    0.004302    0.00308051
thread0::conv2d             3300        67372.5     1.09394     37.2228     20.4159
thread0::elementwise_add    4900        3785.25     0.00826     4.89141     0.772501
thread0::relu               3100        578.755     0.073902    0.583639    0.186695
thread0::pool2d             100         53.1591     0.518617    0.572609    0.531591
thread0::mul                100         8.33622     0.076171    0.103348    0.0833622
thread0::softmax            100         17.0244     0.167629    0.179503    0.170244
thread0::fetch              100         0.713337    0.006472    0.013599    0.00713337
thread0::run_inference      100         71924.9     711.922     762.468     719.249

CPU single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.1284      0.009774    0.04971     0.0165941
thread0::init_program    1           2.63472     2.63472     2.63472     2.63472


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.304634    0.002511    0.00463     0.00304634
thread0::conv2d             3300        66386.6     1.1244      35.5913     20.1172
thread0::elementwise_add    4900        3774.94     0.006815    5.1618      0.770395
thread0::relu               3100        581.059     0.07387     0.830908    0.187438
thread0::pool2d             100         31.0233     0.306008    0.362504    0.310233
thread0::mul                100         2.73551     0.025045    0.036044    0.0273551
thread0::softmax            100         26.2478     0.258148    0.299739    0.262478
thread0::fetch              100         0.653839    0.005754    0.015972    0.00653839
thread0::run_inference      100         70905       700.366     750.701     709.05

How to reproduce

# 1. Compile
git clone https://github.com/tensor-tang/Paddle paddle
cd paddle
git checkout compare
mkdir build && cd build
cmake .. -DWITH_GPU=OFF -DWITH_TESTING=ON -DCMAKE_INSTALL_PREFIX=./tmp
make -j `nproc`
make install

# 2. Run the Python test; this saves the ResNet model
make test ARGS="-R test_image_classification -V"

# 3. Run the inference test
cd ..
./build/paddle/fluid/inference/tests/book/test_inference_image_classification_resnet --dirname=./build/python/paddle/fluid/tests/book/image_classification_resnet.inference.model --batch_size=64 --repeat=100

This gives the MKLDNN multi-thread result; for a single thread, pin the process to one core:

taskset -c 0 ./build/paddle/fluid/inference/tests/book/test_inference_image_classification_resnet --dirname=./build/python/paddle/fluid/tests/book/image_classification_resnet.inference.model --batch_size=64 --repeat=100

For the plain CPU numbers, set use_mkldnn=False at ./build/python/paddle/fluid/tests/book/test_image_classification.py:313, then rerun Steps 2 and 3.
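One way to script that flag change (a sketch only; the exact `use_mkldnn=True` spelling in the file is an assumption, not verified against the source):

```python
# Hypothetical helper: rewrite use_mkldnn=True to use_mkldnn=False in the test
# script before rerunning Steps 2 and 3. Flag spelling is an assumption.
from pathlib import Path

def disable_mkldnn(script: Path) -> None:
    script.write_text(script.read_text().replace("use_mkldnn=True",
                                                 "use_mkldnn=False"))

# disable_mkldnn(Path("./build/python/paddle/fluid/tests/book/"
#                     "test_image_classification.py"))
```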

The changes

develop...tensor-tang:compare

Next move

I will open a PR that changes the ProgramDesc directly from test_inference_*.cc.

luotao1 (Contributor) commented May 16, 2018

Could you give a PR URL so we can easily see the changes between your branch and develop?
Also, could you give results for batch size 1?

tensor-tang (Contributor, Author) commented

As for batch size 1 (average run_inference time, in ms):

name      multi-threads    single thread
MKLDNN    11.8795          15.8887
CPU       10.1359          12.3644
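The same ratio computed for batch size 1 (a sketch using the averages above) shows plain CPU slightly ahead of MKLDNN at this size:

```python
# cpu_ms / mkldnn_ms below 1.0 means MKLDNN is the slower of the two.
mkldnn = {"multi-threads": 11.8795, "single thread": 15.8887}
cpu    = {"multi-threads": 10.1359, "single thread": 12.3644}

for mode in mkldnn:
    print(f"{mode}: {cpu[mode] / mkldnn[mode]:.2f}x")
```

Both ratios come out below 1 (about 0.85 and 0.78), i.e. MKLDNN loses its batch-64 advantage at batch size 1.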

MKLDNN multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.03365     0.009594    0.033669    0.0152007
thread0::init_program    1           2.51944     2.51944     2.51944     2.51944


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.319115    0.00236     0.00419     0.00319115
thread0::conv2d             3300        957.289     0.128773    2.30153     0.290088
thread0::elementwise_add    4900        98.6889     0.006231    0.487361    0.0201406
thread0::relu               3100        19.9029     0.004581    0.082508    0.00642029
thread0::pool2d             100         13.2406     0.082838    1.40292     0.132406
thread0::mul                100         6.07581     0.051283    0.127585    0.0607581
thread0::softmax            100         6.89751     0.030691    0.678422    0.0689751
thread0::fetch              100         0.661207    0.004258    0.018983    0.00661207
thread0::run_inference      100         1187.95     9.49811     20.9113     11.8795

MKLDNN single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.01823     0.00948     0.032522    0.014974
thread0::init_program    1           2.5322      2.5322      2.5322      2.5322


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.210431    0.001891    0.007563    0.00210431
thread0::conv2d             3300        1417.37     0.088962    1.31571     0.429505
thread0::elementwise_add    4900        80.7512     0.00611     0.691986    0.0164798
thread0::relu               3100        18.3715     0.004308    0.025278    0.00592629
thread0::pool2d             100         6.167       0.059295    0.080151    0.06167
thread0::mul                100         4.34947     0.042111    0.051461    0.0434947
thread0::softmax            100         1.72373     0.016533    0.022925    0.0172373
thread0::fetch              100         0.334084    0.003117    0.010444    0.00334084
thread0::run_inference      100         1588.87     15.7445     17.2864     15.8887

CPU single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.02967     0.009555    0.032814    0.0151422
thread0::init_program    1           2.5184      2.5184      2.5184      2.5184


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.186704    0.001724    0.002616    0.00186704
thread0::conv2d             3300        1085.46     0.033286    0.968292    0.328928
thread0::elementwise_add    4900        75.329      0.004157    0.046556    0.0153733
thread0::relu               3100        17.7532     0.004074    0.01435     0.00572683
thread0::pool2d             100         1.14296     0.010944    0.016306    0.0114296
thread0::mul                100         0.762121    0.00725     0.011288    0.00762121
thread0::softmax            100         0.94436     0.009064    0.014038    0.0094436
thread0::fetch              100         0.252252    0.002306    0.00582     0.00252252
thread0::run_inference      100         1236.44     12.2918     13.2701     12.3644

CPU multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.05451     0.009667    0.034995    0.0155075
thread0::init_program    1           2.57237     2.57237     2.57237     2.57237


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.195629    0.001808    0.00283     0.00195629
thread0::conv2d             3300        855.584     0.034638    4.23663     0.259268
thread0::elementwise_add    4900        78.6017     0.004243    0.049165    0.0160412
thread0::relu               3100        18.6085     0.004336    0.013074    0.00600273
thread0::pool2d             100         1.18143     0.01134     0.015879    0.0118143
thread0::mul                100         0.813435    0.00767     0.012044    0.00813435
thread0::softmax            100         0.952499    0.009162    0.013097    0.00952499
thread0::fetch              100         0.260175    0.002331    0.005984    0.00260175
thread0::run_inference      100         1013.59     8.99383     17.3052     10.1359

shanyi15 (Collaborator) commented
Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
