
performance gain with ACL #2

Open
kaishijeng opened this issue Jul 1, 2017 · 26 comments
@kaishijeng

I did performance profiling of classification with the BVLC reference model, comparing original Caffe and CaffeOnACL, and saw some gain, but not as big as I was hoping. Is this also what you observe on your platform?
I used the following command on a Firefly RK3399:

./build/examples/cpp_classification/classification.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg

and measured the time spent as below:

std::vector<Prediction> Classifier::Classify(const cv::Mat& img, int N) {
  // Warm-up call so that one-time setup cost is excluded from the timing.
  std::vector<float> output = Predict(img);

  std::clock_t begin = std::clock();
  output = Predict(img);

  N = std::min<int>(labels_.size(), N);
  std::vector<int> maxN = Argmax(output, N);
  std::vector<Prediction> predictions;
  for (int i = 0; i < N; ++i) {
    int idx = maxN[i];
    predictions.push_back(std::make_pair(labels_[idx], output[idx]));
  }

  std::clock_t end = std::clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time spent: " << elapsed_secs << std::endl;

  return predictions;
}

The time measurements for Caffe and CaffeOnACL are below:

CaffeOnACL
Time spent: 4.53536
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

Original Caffe
Time spent: 5.5306
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

@honggui
Contributor

honggui commented Jul 2, 2017

Yes, kaishijeng. The performance gain percentage we got is similar to what you got. Because of the time spent loading the model's parameters, a real classification application will be much faster (the parameters only need to be loaded once).

@kaishijeng
Author

kaishijeng commented Jul 2, 2017 via email

@honggui
Contributor

honggui commented Jul 2, 2017

Kaishijeng, your measured time is much longer than what I measured. In the Arm Compute Library, there is a line "force_number_of_threads(0)" in the file src/runtime/CPP/CPPScheduler.cpp. You may change that line to "force_number_of_threads(1)" and try again.

@kaishijeng
Author

kaishijeng commented Jul 2, 2017 via email

@honggui
Contributor

honggui commented Jul 3, 2017

Hi Kaishijeng,
I made a mistake: that line is not in ACL 17.06. You can call CPPScheduler::set_num_threads(1) to try it.
To enable GPU mode, use Caffe::set_mode(Caffe::GPU) (see examples/cpp_classification/classification_gpu.cpp for an example).
Best regards,
Honggui

@kaishijeng
Author

set_num_threads(1) reduced the time by 0.3 s, from 4.5 to 4.2.
What numbers do you get in your test?

Thanks,

@honggui
Contributor

honggui commented Jul 6, 2017

kaishijeng,
the log is listed below. (Including setup time, it is 1.794151 s; excluding setup time, it is 0.62415 s per forward.)
Regards,
Honggui

firefly@firefly:~/caffeOnACL$ ./build/examples/cpp_classification/classification_profiling.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg
LOGACL<0>
LOGACL: 0
---------- Prediction for examples/images/cat.jpg ----------
used time: 1795
Input/output shape for each layer ... total: 24

LAYER IDX: 23 name: prob type: Softmax
bottom fc8: 1 1000
top prob: 1 1000

LAYER IDX: 22 name: fc8 type: InnerProduct
bottom fc7: 1 4096
top fc8: 1 1000

LAYER IDX: 21 name: drop7 type: Dropout
bottom fc7: 1 4096
top fc7: 1 4096

LAYER IDX: 20 name: relu7 type: ReLU
bottom fc7: 1 4096
top fc7: 1 4096

LAYER IDX: 19 name: fc7 type: InnerProduct
bottom fc6: 1 4096
top fc7: 1 4096

LAYER IDX: 18 name: drop6 type: Dropout
bottom fc6: 1 4096
top fc6: 1 4096

LAYER IDX: 17 name: relu6 type: ReLU
bottom fc6: 1 4096
top fc6: 1 4096

LAYER IDX: 16 name: fc6 type: InnerProduct
bottom pool5: 1 256 6 6
top fc6: 1 4096

LAYER IDX: 15 name: pool5 type: Pooling
bottom conv5: 1 256 13 13
top pool5: 1 256 6 6

LAYER IDX: 14 name: relu5 type: ReLU
bottom conv5: 1 256 13 13
top conv5: 1 256 13 13

LAYER IDX: 13 name: conv5 type: Convolution
bottom conv4: 1 384 13 13
top conv5: 1 256 13 13

LAYER IDX: 12 name: relu4 type: ReLU
bottom conv4: 1 384 13 13
top conv4: 1 384 13 13

LAYER IDX: 11 name: conv4 type: Convolution
bottom conv3: 1 384 13 13
top conv4: 1 384 13 13

LAYER IDX: 10 name: relu3 type: ReLU
bottom conv3: 1 384 13 13
top conv3: 1 384 13 13

LAYER IDX: 9 name: conv3 type: Convolution
bottom norm2: 1 256 13 13
top conv3: 1 384 13 13

LAYER IDX: 8 name: norm2 type: LRN
bottom pool2: 1 256 13 13
top norm2: 1 256 13 13

LAYER IDX: 7 name: pool2 type: Pooling
bottom conv2: 1 256 27 27
top pool2: 1 256 13 13

LAYER IDX: 6 name: relu2 type: ReLU
bottom conv2: 1 256 27 27
top conv2: 1 256 27 27

LAYER IDX: 5 name: conv2 type: Convolution
bottom norm1: 1 96 27 27
top conv2: 1 256 27 27

LAYER IDX: 4 name: norm1 type: LRN
bottom pool1: 1 96 27 27
top norm1: 1 96 27 27

LAYER IDX: 3 name: pool1 type: Pooling
bottom conv1: 1 96 55 55
top pool1: 1 96 27 27

LAYER IDX: 2 name: relu1 type: ReLU
bottom conv1: 1 96 55 55
top conv1: 1 96 55 55

LAYER IDX: 1 name: conv1 type: Convolution
bottom data: 1 3 227 227
top conv1: 1 96 55 55

LAYER IDX: 0 name: data type: Input
top data: 1 3 227 227
Time for each layer ... sum of all layers is : 1794151

LAYER IDX: 23 name: prob type: Softmax ratio: 0
time stat: total: 0 count: 1 average: 0 start: 597045632 end: 597045632

LAYER IDX: 22 name: fc8 type: InnerProduct ratio: 4.23632
time stat: total: 76006 count: 1 average: 76006 start: 596969626 end: 597045632

LAYER IDX: 21 name: drop7 type: Dropout ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596969626 end: 596969626

LAYER IDX: 20 name: relu7 type: ReLU ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596969626 end: 596969626

LAYER IDX: 19 name: fc7 type: InnerProduct ratio: 20.903
time stat: total: 375031 count: 1 average: 375031 start: 596594595 end: 596969626

LAYER IDX: 18 name: drop6 type: Dropout ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596594595 end: 596594595

LAYER IDX: 17 name: relu6 type: ReLU ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596594595 end: 596594595

LAYER IDX: 16 name: fc6 type: InnerProduct ratio: 42.5307
time stat: total: 763065 count: 1 average: 763065 start: 595831530 end: 596594595

LAYER IDX: 15 name: pool5 type: Pooling ratio: 1.05905
time stat: total: 19001 count: 1 average: 19001 start: 595811528 end: 595830529

LAYER IDX: 14 name: relu5 type: ReLU ratio: 0
time stat: total: 0 count: 1 average: 0 start: 595811528 end: 595811528

LAYER IDX: 13 name: conv5 type: Convolution ratio: 1.61653
time stat: total: 29003 count: 1 average: 29003 start: 595782525 end: 595811528

LAYER IDX: 12 name: relu4 type: ReLU ratio: 0.0557367
time stat: total: 1000 count: 1 average: 1000 start: 595781525 end: 595782525

LAYER IDX: 11 name: conv4 type: Convolution ratio: 2.73132
time stat: total: 49004 count: 1 average: 49004 start: 595732521 end: 595781525

LAYER IDX: 10 name: relu3 type: ReLU ratio: 0.0557367
time stat: total: 1000 count: 1 average: 1000 start: 595731521 end: 595732521

LAYER IDX: 9 name: conv3 type: Convolution ratio: 10.7581
time stat: total: 193016 count: 1 average: 193016 start: 595538505 end: 595731521

LAYER IDX: 8 name: norm2 type: LRN ratio: 0.334476
time stat: total: 6001 count: 1 average: 6001 start: 595532504 end: 595538505

LAYER IDX: 7 name: pool2 type: Pooling ratio: 1.95095
time stat: total: 35003 count: 1 average: 35003 start: 595497501 end: 595532504

LAYER IDX: 6 name: relu2 type: ReLU ratio: 0.222947
time stat: total: 4000 count: 1 average: 4000 start: 595493501 end: 595497501

LAYER IDX: 5 name: conv2 type: Convolution ratio: 8.97438
time stat: total: 161014 count: 1 average: 161014 start: 595332487 end: 595493501

LAYER IDX: 4 name: norm1 type: LRN ratio: 0.390212
time stat: total: 7001 count: 1 average: 7001 start: 595325486 end: 595332487

LAYER IDX: 3 name: pool1 type: Pooling ratio: 1.11484
time stat: total: 20002 count: 1 average: 20002 start: 595305484 end: 595325486

LAYER IDX: 2 name: relu1 type: ReLU ratio: 0.33442
time stat: total: 6000 count: 1 average: 6000 start: 595299484 end: 595305484

LAYER IDX: 1 name: conv1 type: Convolution ratio: 2.73132
time stat: total: 49004 count: 1 average: 49004 start: 595250480 end: 595299484

LAYER IDX: 0 name: data type: Input ratio: 0
time stat: total: 0 count: 1 average: 0 start: 595250480 end: 595250480


STATS for 10 repetitions: ...
Total time: 624150 per forward
Each layer stats: ...
23: used time: 100 ratio: 0.0160218 enter count: 1
22: used time: 18001 ratio: 2.88416 enter count: 1
21: used time: 0 ratio: 0 enter count: 1
20: used time: 0 ratio: 0 enter count: 1
19: used time: 68005 ratio: 10.8957 enter count: 1
18: used time: 0 ratio: 0 enter count: 1
17: used time: 0 ratio: 0 enter count: 1
16: used time: 181514 ratio: 29.0819 enter count: 1
15: used time: 23601 ratio: 3.78145 enter count: 1
14: used time: 200 ratio: 0.0320596 enter count: 1
13: used time: 22701 ratio: 3.63722 enter count: 1
12: used time: 200 ratio: 0.0320436 enter count: 1
11: used time: 42503 ratio: 6.80979 enter count: 1
10: used time: 400 ratio: 0.0640872 enter count: 1
9: used time: 67305 ratio: 10.7835 enter count: 1
8: used time: 4200 ratio: 0.672979 enter count: 1
7: used time: 26802 ratio: 4.29418 enter count: 1
6: used time: 1100 ratio: 0.17624 enter count: 1
5: used time: 109508 ratio: 17.5453 enter count: 1
4: used time: 5100 ratio: 0.817159 enter count: 1
3: used time: 15501 ratio: 2.48357 enter count: 1
2: used time: 2400 ratio: 0.384587 enter count: 1
1: used time: 35002 ratio: 5.60806 enter count: 1
0: used time: 0 ratio: 0 enter count: 1

time cost top 10 layers are: ...
16: used time: 181514 ratio: 29.0819 enter count: 1
5: used time: 109508 ratio: 17.5453 enter count: 1
19: used time: 68005 ratio: 10.8957 enter count: 1
9: used time: 67305 ratio: 10.7835 enter count: 1
11: used time: 42503 ratio: 6.80979 enter count: 1
1: used time: 35002 ratio: 5.60806 enter count: 1
7: used time: 26802 ratio: 4.29418 enter count: 1
15: used time: 23601 ratio: 3.78145 enter count: 1
13: used time: 22701 ratio: 3.63722 enter count: 1
22: used time: 18001 ratio: 2.88416 enter count: 1
Top cost layers occupied: 95.3213

0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

@kaishijeng
Author

How did you get the log?
I ran the same command below and got only the classification result, with no profiling log.

./build/examples/cpp_classification/classification_profiling.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg

---------- Prediction for examples/images/cat.jpg ----------
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

@honggui
Contributor

honggui commented Jul 6, 2017

hi kaishijeng,
You can enable profiling messages by setting "USE_PROFILING" in Makefile.config:
CPU_ONLY := 1
USE_PROFILING := 1
USE_ACL := 1
Regards,
Honggui

@kaishijeng
Author

Below is my profiling result, which is very similar to what you get:

STATS for 10 repetitions: ...
Total time: 607204 per forward
Each layer stats: ...
23: used time: 4400 ratio: 0.724633 enter count: 1
22: used time: 4800 ratio: 0.790509 enter count: 1
21: used time: 0 ratio: 0 enter count: 1
20: used time: 400 ratio: 0.0658757 enter count: 1
19: used time: 18400 ratio: 3.03028 enter count: 1
18: used time: 0 ratio: 0 enter count: 1
17: used time: 800 ratio: 0.131751 enter count: 1
16: used time: 53200 ratio: 8.76154 enter count: 1
15: used time: 114400 ratio: 18.8406 enter count: 1
14: used time: 2000 ratio: 0.329379 enter count: 1
13: used time: 13600 ratio: 2.23979 enter count: 1
12: used time: 2800 ratio: 0.461147 enter count: 1
11: used time: 16800 ratio: 2.7668 enter count: 1
10: used time: 1200 ratio: 0.197644 enter count: 1
9: used time: 46400 ratio: 7.64165 enter count: 1
8: used time: 34400 ratio: 5.66533 enter count: 1
7: used time: 126800 ratio: 20.8828 enter count: 1
6: used time: 3600 ratio: 0.592882 enter count: 1
5: used time: 55600 ratio: 9.15677 enter count: 1
4: used time: 15200 ratio: 2.50329 enter count: 1
3: used time: 53600 ratio: 8.82741 enter count: 1
2: used time: 4800 ratio: 0.790509 enter count: 1
1: used time: 34000 ratio: 5.59949 enter count: 1
0: used time: 0 ratio: 0 enter count: 1

time cost top 10 layers are: ...
7: used time: 126800 ratio: 20.8828 enter count: 1
15: used time: 114400 ratio: 18.8406 enter count: 1
5: used time: 55600 ratio: 9.15677 enter count: 1
3: used time: 53600 ratio: 8.82741 enter count: 1
16: used time: 53200 ratio: 8.76154 enter count: 1
9: used time: 46400 ratio: 7.64165 enter count: 1
8: used time: 34400 ratio: 5.66533 enter count: 1
1: used time: 34000 ratio: 5.59949 enter count: 1
19: used time: 18400 ratio: 3.03028 enter count: 1
11: used time: 16800 ratio: 2.7668 enter count: 1
Top cost layers occupied: 91.1726

@kaishijeng
Author

STATS for 10 repetitions: ...
Total time: 607204 per forward

Does it mean the time per forward is 607 ms?

Thanks,

@honggui
Contributor

honggui commented Jul 7, 2017

Hi Kaishijeng,
Yes,you are right.
Regards,
Honggui

@kaishijeng
Author

kaishijeng commented Jul 7, 2017 via email

@austingg

austingg commented Jul 7, 2017

@kaishijeng you may use

your/caffe/binary/caffe time -model alexnet.prototxt

@kaishijeng @honggui By the way, did you test the performance on a desktop processor? Are there any statistics from mobile devices? Also, in the doc, ACL_NEON seems slower than official Caffe with OpenBLAS. Which devices were tested? There seems to be a long way to go if testing on a 32-bit platform, since 32-bit OpenBLAS doesn't use NEON for speed-up.

@kaishijeng
Author

kaishijeng commented Jul 7, 2017 via email

@kaishijeng
Author

If the above numbers are a fair comparison, then ACL gives a 2.5x speedup over pure CPU on the Firefly platform.

I saw there is a MxnetonACL on GitHub. I am not sure whether there is a plan for a TensorflowOnACL, because I use the TensorFlow framework for most of my ML projects.

@xhbdahai
Contributor

Hi Kaishijeng:
There is no clear plan for TensorFlow so far.

@kaishijeng
Author

Thanks for an update.

@psyhtest

@xhbdahai, @honggui, @openailab-sh You will invariably end up with more questions about benchmarking Caffe-on-ACL against Caffe (or indeed other frameworks). Have you considered using / contributing to CK-Caffe? It's part of a growing suite of AI benchmarking tools based on Collective Knowledge, also including e.g. CK-Caffe2, CK-TensorFlow, CK-TensorRT, CK-KaNN.

For example, we have released benchmarking data for the Firefly-RK3399 platform that @kaishijeng uses.

In particular, for the batch size of 2 (the smallest we have measured) on AlexNet (the closest to CaffeNet we have measured), we have obtained the following data for forward propagation (inference):

  • OpenBLAS: 695 ms
  • clBLAS: ~3700 ms
  • ViennaCL: ~3650 ms
  • CLBlast: ~4500 ms
  • libDNN w/ CLBlast: ~2160 ms
  • CLBlast (tuned by dividiti): ~1320 ms

(I can easily benchmark CaffeNet with the batch size of 1 if you are interested.)

Would you be interested in collaborating on adding Caffe-on-ACL to CK-Caffe?

@psyhtest

psyhtest commented Jul 11, 2017

As an added bonus, we already support ACL package and crowdbenchmarking across mobile devices.

@OAIL

OAIL commented Jul 13, 2017

@psyhtest Adding CaffeOnACL to CK-Caffe is a good idea. We will give you feedback after estimating the effort.

@psyhtest

@OAIL How is the effort looking to you? :)

@baynaa7

baynaa7 commented Feb 26, 2018

Hello @honggui, I am testing CaffeOnACL vs Caffe on a TX2 board.
However, the classification example on AlexNet gives the following results.
The arguments are exactly the same as kaishijeng's.
CaffeOnACL: elapsed time: [2.28925] seconds
Caffe: elapsed time: [1.2105] seconds

Note: both are running the CPU version.

Any possible hypotheses for these results?
Thanks in advance.

@honggui
Contributor

honggui commented Feb 27, 2018

Hi pcub,
The performance of the different layers varies: some may be better with ACL, and others may be better with OpenBLAS. Refer to https://github.com/OAID/Caffe-HRT/blob/master/acl_openailab/user_manual.pdf to find the proper library for each operator.
BTW, OpenBLAS's threads seem to affect ACL's threads a lot. Sometimes we can use "export OPENBLAS_NUM_THREADS=1" to lower the side effect.
Best Regards,
Honggui
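For anyone hitting the same contention, the workaround can be applied before launching the benchmark; a minimal sketch (the invocation in the comment follows the commands used earlier in the thread):

```shell
# Pin OpenBLAS to a single worker so its thread pool does not fight
# with ACL's own threads for the cores.
export OPENBLAS_NUM_THREADS=1

# Then run the benchmark as before, e.g.:
# ./build/examples/cpp_classification/classification.bin \
#     models/bvlc_reference_caffenet/deploy.prototxt \
#     models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
#     data/ilsvrc12/imagenet_mean.binaryproto \
#     data/ilsvrc12/synset_words.txt \
#     examples/images/cat.jpg
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```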

@baynaa7

baynaa7 commented Feb 28, 2018

Thanks @honggui
export OPENBLAS_NUM_THREADS=1 works.

@Steven9402

Hi honggui, I want to test the performance of a face recognition application with multiple threads. Where can I add "CPPScheduler::set_num_threads(x)" to enable a multi-threaded test for ACL? The executable I use is OAID/FaceRecognition/bin/face-recognition.cpp. Also, I want to know if there is an interface to modify the number of threads used. Thanks a lot!
