Compare Inference Perf Between CPU and MKLDNN #10651

Closed

tensor-tang opened this issue May 15, 2018 · 4 comments

tensor-tang (Contributor) commented May 15, 2018

Run a quick comparison on test_inference_image_classification to establish a baseline, at least:

  1. compare time op by op, and in total.
  2. measure the creation time of the MKLDNN primitives.

Based on dfdcb7e

@tensor-tang tensor-tang self-assigned this May 15, 2018
@luotao1 luotao1 added the Intel label May 15, 2018

tensor-tang commented May 15, 2018

ResNet Inference Results (average run_inference time, in ms)

name      multi-threads    single thread
MKLDNN    138.282          719.249
CPU       564.056          709.05

Server: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets, 6 cores per socket

Details:

BatchSize: 64

Repeat: 100
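A quick sanity read of the summary table above (a sketch; the numbers are the averaged run_inference times from the profiler reports below):

```python
# Speedup of MKLDNN over plain CPU, from the average run_inference times (ms).
mkldnn = {"multi-threads": 138.282, "single thread": 719.249}
cpu    = {"multi-threads": 564.056, "single thread": 709.05}

for mode in mkldnn:
    print(f"{mode}: {cpu[mode] / mkldnn[mode]:.2f}x")
```

Multi-threaded, MKLDNN comes out roughly 4x faster; single-threaded, the two are about on par.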

MKLDNN multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.20932     0.009643    0.062216    0.0177841
thread0::init_program    1           2.68691     2.68691     2.68691     2.68691


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.299036    0.002371    0.004543    0.00299036
thread0::conv2d             3300        9000.44     0.195592    30.1177     2.72741
thread0::elementwise_add    4900        4070.66     0.008136    20.9259     0.830748
thread0::relu               3100        606.868     0.074293    4.75193     0.195764
thread0::pool2d             100         20.8408     0.123946    1.13076     0.208408
thread0::mul                100         7.87092     0.072069    0.109052    0.0787092
thread0::softmax            100         8.86674     0.042954    1.37156     0.0886674
thread0::fetch              100         0.731889    0.00641     0.017074    0.00731889
thread0::run_inference      100         13828.2     115.068     402.65      138.282
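In these profiler reports the Ave. column is simply Total / Calls; checking conv2d from the MKLDNN multi-thread run above:

```python
# The profiler's Ave. column is Total / Calls (both in ms).
total_ms, calls = 9000.44, 3300   # conv2d row from the report above
ave = total_ms / calls
print(f"{ave:.5f}")  # 2.72741, matching the reported Ave.
```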

CPU multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.14652     0.009442    0.053964    0.0168606
thread0::init_program    1           2.61936     2.61936     2.61936     2.61936


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.322211    0.00269     0.008291    0.00322211
thread0::conv2d             3300        51572.6     1.13987     225.607     15.6281
thread0::elementwise_add    4900        4071.9      0.006965    11.3786     0.831
thread0::relu               3100        590.259     0.073982    4.52276     0.190406
thread0::pool2d             100         31.1838     0.306259    0.401829    0.311838
thread0::mul                100         3.02121     0.026938    0.076059    0.0302121
thread0::softmax            100         25.8391     0.254502    0.295943    0.258391
thread0::fetch              100         0.680716    0.005706    0.017241    0.00680716
thread0::run_inference      100         56405.6     454.744     1020.46     564.056

MKLDNN single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.11259     0.009476    0.0491      0.0163616
thread0::init_program    1           2.58213     2.58213     2.58213     2.58213


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.308051    0.002513    0.004302    0.00308051
thread0::conv2d             3300        67372.5     1.09394     37.2228     20.4159
thread0::elementwise_add    4900        3785.25     0.00826     4.89141     0.772501
thread0::relu               3100        578.755     0.073902    0.583639    0.186695
thread0::pool2d             100         53.1591     0.518617    0.572609    0.531591
thread0::mul                100         8.33622     0.076171    0.103348    0.0833622
thread0::softmax            100         17.0244     0.167629    0.179503    0.170244
thread0::fetch              100         0.713337    0.006472    0.013599    0.00713337
thread0::run_inference      100         71924.9     711.922     762.468     719.249

CPU single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.1284      0.009774    0.04971     0.0165941
thread0::init_program    1           2.63472     2.63472     2.63472     2.63472


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.304634    0.002511    0.00463     0.00304634
thread0::conv2d             3300        66386.6     1.1244      35.5913     20.1172
thread0::elementwise_add    4900        3774.94     0.006815    5.1618      0.770395
thread0::relu               3100        581.059     0.07387     0.830908    0.187438
thread0::pool2d             100         31.0233     0.306008    0.362504    0.310233
thread0::mul                100         2.73551     0.025045    0.036044    0.0273551
thread0::softmax            100         26.2478     0.258148    0.299739    0.262478
thread0::fetch              100         0.653839    0.005754    0.015972    0.00653839
thread0::run_inference      100         70905       700.366     750.701     709.05

How to reproduce

# 1. Compile
git clone https://github.com/tensor-tang/Paddle paddle
cd paddle
git checkout compare
mkdir build && cd build
cmake .. -DWITH_GPU=OFF -DWITH_TESTING=ON -DCMAKE_INSTALL_PREFIX=./tmp
make -j `nproc`
make install

# 2. Run the Python test; this saves the ResNet model
make test ARGS="-R test_image_classification -V"

# 3. Run the inference test
cd ..
./build/paddle/fluid/inference/tests/book/test_inference_image_classification_resnet --dirname=./build/python/paddle/fluid/tests/book/image_classification_resnet.inference.model --batch_size=64 --repeat=100

This gives the MKLDNN multi-thread result; for a single thread, pin the process to one core:

taskset -c 0 ./build/paddle/fluid/inference/tests/book/test_inference_image_classification_resnet --dirname=./build/python/paddle/fluid/tests/book/image_classification_resnet.inference.model --batch_size=64 --repeat=100

For the plain CPU numbers, set use_mkldnn=False at ./build/python/paddle/fluid/tests/book/test_image_classification.py:313, then rerun Steps 2 and 3.
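One way to script that flag change (a sketch only; the exact `use_mkldnn=True` spelling in the file is an assumption, not verified against the source):

```python
# Hypothetical helper: rewrite use_mkldnn=True to use_mkldnn=False in the test
# script before rerunning Steps 2 and 3. Flag spelling is an assumption.
from pathlib import Path

def disable_mkldnn(script: Path) -> None:
    script.write_text(script.read_text().replace("use_mkldnn=True",
                                                 "use_mkldnn=False"))

# disable_mkldnn(Path("./build/python/paddle/fluid/tests/book/"
#                     "test_image_classification.py"))
```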

The changes

develop...tensor-tang:compare

Next move

I will open a PR that changes the ProgramDesc directly from test_inference_*.cc.

luotao1 (Contributor) commented May 16, 2018

Could you give a PR URL so we can easily see the changes between your branch and develop?
Also, could you give results for batch size 1?

tensor-tang (Contributor, Author) commented

As for batch size 1 (average run_inference time, in ms):

name      multi-threads    single thread
MKLDNN    11.8795          15.8887
CPU       10.1359          12.3644
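The same ratio computed for batch size 1 (a sketch using the averages above) shows plain CPU slightly ahead of MKLDNN at this size:

```python
# cpu_ms / mkldnn_ms below 1.0 means MKLDNN is the slower of the two.
mkldnn = {"multi-threads": 11.8795, "single thread": 15.8887}
cpu    = {"multi-threads": 10.1359, "single thread": 12.3644}

for mode in mkldnn:
    print(f"{mode}: {cpu[mode] / mkldnn[mode]:.2f}x")
```

Both ratios come out below 1 (about 0.85 and 0.78), i.e. MKLDNN loses its batch-64 advantage at batch size 1.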

MKLDNN multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.03365     0.009594    0.033669    0.0152007
thread0::init_program    1           2.51944     2.51944     2.51944     2.51944


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.319115    0.00236     0.00419     0.00319115
thread0::conv2d             3300        957.289     0.128773    2.30153     0.290088
thread0::elementwise_add    4900        98.6889     0.006231    0.487361    0.0201406
thread0::relu               3100        19.9029     0.004581    0.082508    0.00642029
thread0::pool2d             100         13.2406     0.082838    1.40292     0.132406
thread0::mul                100         6.07581     0.051283    0.127585    0.0607581
thread0::softmax            100         6.89751     0.030691    0.678422    0.0689751
thread0::fetch              100         0.661207    0.004258    0.018983    0.00661207
thread0::run_inference      100         1187.95     9.49811     20.9113     11.8795

MKLDNN single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.01823     0.00948     0.032522    0.014974
thread0::init_program    1           2.5322      2.5322      2.5322      2.5322


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.210431    0.001891    0.007563    0.00210431
thread0::conv2d             3300        1417.37     0.088962    1.31571     0.429505
thread0::elementwise_add    4900        80.7512     0.00611     0.691986    0.0164798
thread0::relu               3100        18.3715     0.004308    0.025278    0.00592629
thread0::pool2d             100         6.167       0.059295    0.080151    0.06167
thread0::mul                100         4.34947     0.042111    0.051461    0.0434947
thread0::softmax            100         1.72373     0.016533    0.022925    0.0172373
thread0::fetch              100         0.334084    0.003117    0.010444    0.00334084
thread0::run_inference      100         1588.87     15.7445     17.2864     15.8887

CPU single thread

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.02967     0.009555    0.032814    0.0151422
thread0::init_program    1           2.5184      2.5184      2.5184      2.5184


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.186704    0.001724    0.002616    0.00186704
thread0::conv2d             3300        1085.46     0.033286    0.968292    0.328928
thread0::elementwise_add    4900        75.329      0.004157    0.046556    0.0153733
thread0::relu               3100        17.7532     0.004074    0.01435     0.00572683
thread0::pool2d             100         1.14296     0.010944    0.016306    0.0114296
thread0::mul                100         0.762121    0.00725     0.011288    0.00762121
thread0::softmax            100         0.94436     0.009064    0.014038    0.0094436
thread0::fetch              100         0.252252    0.002306    0.00582     0.00252252
thread0::run_inference      100         1236.44     12.2918     13.2701     12.3644

CPU multi-threads

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                    Calls       Total       Min.        Max.        Ave.
thread0::load            68          1.05451     0.009667    0.034995    0.0155075
thread0::init_program    1           2.57237     2.57237     2.57237     2.57237


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.
thread0::feed               100         0.195629    0.001808    0.00283     0.00195629
thread0::conv2d             3300        855.584     0.034638    4.23663     0.259268
thread0::elementwise_add    4900        78.6017     0.004243    0.049165    0.0160412
thread0::relu               3100        18.6085     0.004336    0.013074    0.00600273
thread0::pool2d             100         1.18143     0.01134     0.015879    0.0118143
thread0::mul                100         0.813435    0.00767     0.012044    0.00813435
thread0::softmax            100         0.952499    0.009162    0.013097    0.00952499
thread0::fetch              100         0.260175    0.002331    0.005984    0.00260175
thread0::run_inference      100         1013.59     8.99383     17.3052     10.1359

shanyi15 (Collaborator) commented
Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
