
performance gain with ACL #2

Open
kaishijeng opened this issue Jul 1, 2017 · 26 comments
@kaishijeng

I did performance profiling of classification with the BVLC reference model, comparing original Caffe and CaffeOnACL, and saw some gain, but not as big as I was hoping. Is this also what you observe on your platform?
I used the following command on a Firefly RK3399:

./build/examples/cpp_classification/classification.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg

and measured the time spent as below:

std::vector<Prediction> Classifier::Classify(const cv::Mat& img, int N) {
  // Warm-up call so that one-time setup cost is excluded from the timing.
  std::vector<float> output = Predict(img);

  std::clock_t begin = std::clock();
  output = Predict(img);

  N = std::min<int>(labels_.size(), N);
  std::vector<int> maxN = Argmax(output, N);
  std::vector<Prediction> predictions;
  for (int i = 0; i < N; ++i) {
    int idx = maxN[i];
    predictions.push_back(std::make_pair(labels_[idx], output[idx]));
  }

  std::clock_t end = std::clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time spent: " << elapsed_secs << std::endl;

  return predictions;
}

The time measurements for Caffe and CaffeOnACL are below:

CaffeOnACL
Time spent: 4.53536
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

Original Caffe
Time spent: 5.5306
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

@honggui
Contributor

honggui commented Jul 2, 2017

Yes, kaishijeng. The performance gain percentage we got is similar to what you got. Because of the time spent loading the model's parameters, a real classification application will be much faster (the parameters only need to be loaded once).

@kaishijeng
Author

kaishijeng commented Jul 2, 2017 via email

@honggui
Contributor

honggui commented Jul 2, 2017

Kaishijeng, your measured time is much longer than what I measured. In the Arm Compute Library, there is a line "force_number_of_threads(0)" in the file src/runtime/CPP/CPPScheduler.cpp. You may change that line to "force_number_of_threads(1)" and try again.

@kaishijeng
Author

kaishijeng commented Jul 2, 2017 via email

@honggui
Contributor

honggui commented Jul 3, 2017

Hi Kaishijeng,
I made a mistake: that line is not in ACL 17.06. You can call CPPScheduler::set_num_threads(1) to try it.
To enable GPU mode, use Caffe::set_mode(Caffe::GPU) (see examples/cpp_classification/classification_gpu.cpp for an example).
Best regards,
Honggui

@kaishijeng
Author

set_num_threads(1) reduced the time by 0.3 s, from 4.5 to 4.2.
What numbers do you get in your test?

Thanks,

@honggui
Contributor

honggui commented Jul 6, 2017

kaishijeng,
the log is listed below. (Including setup time, it is 1.794151 s; excluding setup time, it is 0.62415 s per forward.)
Regards,
Honggui

firefly@firefly:~/caffeOnACL$ ./build/examples/cpp_classification/classification_profiling.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg
LOGACL<0>
LOGACL: 0
---------- Prediction for examples/images/cat.jpg ----------
used time: 1795
Input/output shape for each layer ... total: 24

LAYER IDX: 23 name: prob type: Softmax
bottom fc8: 1 1000
top prob: 1 1000

LAYER IDX: 22 name: fc8 type: InnerProduct
bottom fc7: 1 4096
top fc8: 1 1000

LAYER IDX: 21 name: drop7 type: Dropout
bottom fc7: 1 4096
top fc7: 1 4096

LAYER IDX: 20 name: relu7 type: ReLU
bottom fc7: 1 4096
top fc7: 1 4096

LAYER IDX: 19 name: fc7 type: InnerProduct
bottom fc6: 1 4096
top fc7: 1 4096

LAYER IDX: 18 name: drop6 type: Dropout
bottom fc6: 1 4096
top fc6: 1 4096

LAYER IDX: 17 name: relu6 type: ReLU
bottom fc6: 1 4096
top fc6: 1 4096

LAYER IDX: 16 name: fc6 type: InnerProduct
bottom pool5: 1 256 6 6
top fc6: 1 4096

LAYER IDX: 15 name: pool5 type: Pooling
bottom conv5: 1 256 13 13
top pool5: 1 256 6 6

LAYER IDX: 14 name: relu5 type: ReLU
bottom conv5: 1 256 13 13
top conv5: 1 256 13 13

LAYER IDX: 13 name: conv5 type: Convolution
bottom conv4: 1 384 13 13
top conv5: 1 256 13 13

LAYER IDX: 12 name: relu4 type: ReLU
bottom conv4: 1 384 13 13
top conv4: 1 384 13 13

LAYER IDX: 11 name: conv4 type: Convolution
bottom conv3: 1 384 13 13
top conv4: 1 384 13 13

LAYER IDX: 10 name: relu3 type: ReLU
bottom conv3: 1 384 13 13
top conv3: 1 384 13 13

LAYER IDX: 9 name: conv3 type: Convolution
bottom norm2: 1 256 13 13
top conv3: 1 384 13 13

LAYER IDX: 8 name: norm2 type: LRN
bottom pool2: 1 256 13 13
top norm2: 1 256 13 13

LAYER IDX: 7 name: pool2 type: Pooling
bottom conv2: 1 256 27 27
top pool2: 1 256 13 13

LAYER IDX: 6 name: relu2 type: ReLU
bottom conv2: 1 256 27 27
top conv2: 1 256 27 27

LAYER IDX: 5 name: conv2 type: Convolution
bottom norm1: 1 96 27 27
top conv2: 1 256 27 27

LAYER IDX: 4 name: norm1 type: LRN
bottom pool1: 1 96 27 27
top norm1: 1 96 27 27

LAYER IDX: 3 name: pool1 type: Pooling
bottom conv1: 1 96 55 55
top pool1: 1 96 27 27

LAYER IDX: 2 name: relu1 type: ReLU
bottom conv1: 1 96 55 55
top conv1: 1 96 55 55

LAYER IDX: 1 name: conv1 type: Convolution
bottom data: 1 3 227 227
top conv1: 1 96 55 55

LAYER IDX: 0 name: data type: Input
top data: 1 3 227 227
Time for each layer ... sum of all layers is : 1794151

LAYER IDX: 23 name: prob type: Softmax ratio: 0
time stat: total: 0 count: 1 average: 0 start: 597045632 end: 597045632

LAYER IDX: 22 name: fc8 type: InnerProduct ratio: 4.23632
time stat: total: 76006 count: 1 average: 76006 start: 596969626 end: 597045632

LAYER IDX: 21 name: drop7 type: Dropout ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596969626 end: 596969626

LAYER IDX: 20 name: relu7 type: ReLU ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596969626 end: 596969626

LAYER IDX: 19 name: fc7 type: InnerProduct ratio: 20.903
time stat: total: 375031 count: 1 average: 375031 start: 596594595 end: 596969626

LAYER IDX: 18 name: drop6 type: Dropout ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596594595 end: 596594595

LAYER IDX: 17 name: relu6 type: ReLU ratio: 0
time stat: total: 0 count: 1 average: 0 start: 596594595 end: 596594595

LAYER IDX: 16 name: fc6 type: InnerProduct ratio: 42.5307
time stat: total: 763065 count: 1 average: 763065 start: 595831530 end: 596594595

LAYER IDX: 15 name: pool5 type: Pooling ratio: 1.05905
time stat: total: 19001 count: 1 average: 19001 start: 595811528 end: 595830529

LAYER IDX: 14 name: relu5 type: ReLU ratio: 0
time stat: total: 0 count: 1 average: 0 start: 595811528 end: 595811528

LAYER IDX: 13 name: conv5 type: Convolution ratio: 1.61653
time stat: total: 29003 count: 1 average: 29003 start: 595782525 end: 595811528

LAYER IDX: 12 name: relu4 type: ReLU ratio: 0.0557367
time stat: total: 1000 count: 1 average: 1000 start: 595781525 end: 595782525

LAYER IDX: 11 name: conv4 type: Convolution ratio: 2.73132
time stat: total: 49004 count: 1 average: 49004 start: 595732521 end: 595781525

LAYER IDX: 10 name: relu3 type: ReLU ratio: 0.0557367
time stat: total: 1000 count: 1 average: 1000 start: 595731521 end: 595732521

LAYER IDX: 9 name: conv3 type: Convolution ratio: 10.7581
time stat: total: 193016 count: 1 average: 193016 start: 595538505 end: 595731521

LAYER IDX: 8 name: norm2 type: LRN ratio: 0.334476
time stat: total: 6001 count: 1 average: 6001 start: 595532504 end: 595538505

LAYER IDX: 7 name: pool2 type: Pooling ratio: 1.95095
time stat: total: 35003 count: 1 average: 35003 start: 595497501 end: 595532504

LAYER IDX: 6 name: relu2 type: ReLU ratio: 0.222947
time stat: total: 4000 count: 1 average: 4000 start: 595493501 end: 595497501

LAYER IDX: 5 name: conv2 type: Convolution ratio: 8.97438
time stat: total: 161014 count: 1 average: 161014 start: 595332487 end: 595493501

LAYER IDX: 4 name: norm1 type: LRN ratio: 0.390212
time stat: total: 7001 count: 1 average: 7001 start: 595325486 end: 595332487

LAYER IDX: 3 name: pool1 type: Pooling ratio: 1.11484
time stat: total: 20002 count: 1 average: 20002 start: 595305484 end: 595325486

LAYER IDX: 2 name: relu1 type: ReLU ratio: 0.33442
time stat: total: 6000 count: 1 average: 6000 start: 595299484 end: 595305484

LAYER IDX: 1 name: conv1 type: Convolution ratio: 2.73132
time stat: total: 49004 count: 1 average: 49004 start: 595250480 end: 595299484

LAYER IDX: 0 name: data type: Input ratio: 0
time stat: total: 0 count: 1 average: 0 start: 595250480 end: 595250480


STATS for 10 repetitions: ...
Total time: 624150 per forward
Each layer stats: ...
23: used time: 100 ratio: 0.0160218 enter count: 1
22: used time: 18001 ratio: 2.88416 enter count: 1
21: used time: 0 ratio: 0 enter count: 1
20: used time: 0 ratio: 0 enter count: 1
19: used time: 68005 ratio: 10.8957 enter count: 1
18: used time: 0 ratio: 0 enter count: 1
17: used time: 0 ratio: 0 enter count: 1
16: used time: 181514 ratio: 29.0819 enter count: 1
15: used time: 23601 ratio: 3.78145 enter count: 1
14: used time: 200 ratio: 0.0320596 enter count: 1
13: used time: 22701 ratio: 3.63722 enter count: 1
12: used time: 200 ratio: 0.0320436 enter count: 1
11: used time: 42503 ratio: 6.80979 enter count: 1
10: used time: 400 ratio: 0.0640872 enter count: 1
9: used time: 67305 ratio: 10.7835 enter count: 1
8: used time: 4200 ratio: 0.672979 enter count: 1
7: used time: 26802 ratio: 4.29418 enter count: 1
6: used time: 1100 ratio: 0.17624 enter count: 1
5: used time: 109508 ratio: 17.5453 enter count: 1
4: used time: 5100 ratio: 0.817159 enter count: 1
3: used time: 15501 ratio: 2.48357 enter count: 1
2: used time: 2400 ratio: 0.384587 enter count: 1
1: used time: 35002 ratio: 5.60806 enter count: 1
0: used time: 0 ratio: 0 enter count: 1

time cost top 10 layers are: ...
16: used time: 181514 ratio: 29.0819 enter count: 1
5: used time: 109508 ratio: 17.5453 enter count: 1
19: used time: 68005 ratio: 10.8957 enter count: 1
9: used time: 67305 ratio: 10.7835 enter count: 1
11: used time: 42503 ratio: 6.80979 enter count: 1
1: used time: 35002 ratio: 5.60806 enter count: 1
7: used time: 26802 ratio: 4.29418 enter count: 1
15: used time: 23601 ratio: 3.78145 enter count: 1
13: used time: 22701 ratio: 3.63722 enter count: 1
22: used time: 18001 ratio: 2.88416 enter count: 1
Top cost layers occupied: 95.3213

0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

@kaishijeng
Author

How did you get the log?
I ran the same command below and got only the classification result, with no profiling log.

./build/examples/cpp_classification/classification_profiling.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg

---------- Prediction for examples/images/cat.jpg ----------
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

@honggui
Contributor

honggui commented Jul 6, 2017

hi kaishijeng,
You can enable profiling messages by setting "USE_PROFILING" in Makefile.config:
CPU_ONLY := 1
USE_PROFILING := 1
USE_ACL := 1
Regards,
Honggui

@kaishijeng
Author

Below is my profiling result, which is very similar to what you get:

STATS for 10 repetitions: ...
Total time: 607204 per forward
Each layer stats: ...
23: used time: 4400 ratio: 0.724633 enter count: 1
22: used time: 4800 ratio: 0.790509 enter count: 1
21: used time: 0 ratio: 0 enter count: 1
20: used time: 400 ratio: 0.0658757 enter count: 1
19: used time: 18400 ratio: 3.03028 enter count: 1
18: used time: 0 ratio: 0 enter count: 1
17: used time: 800 ratio: 0.131751 enter count: 1
16: used time: 53200 ratio: 8.76154 enter count: 1
15: used time: 114400 ratio: 18.8406 enter count: 1
14: used time: 2000 ratio: 0.329379 enter count: 1
13: used time: 13600 ratio: 2.23979 enter count: 1
12: used time: 2800 ratio: 0.461147 enter count: 1
11: used time: 16800 ratio: 2.7668 enter count: 1
10: used time: 1200 ratio: 0.197644 enter count: 1
9: used time: 46400 ratio: 7.64165 enter count: 1
8: used time: 34400 ratio: 5.66533 enter count: 1
7: used time: 126800 ratio: 20.8828 enter count: 1
6: used time: 3600 ratio: 0.592882 enter count: 1
5: used time: 55600 ratio: 9.15677 enter count: 1
4: used time: 15200 ratio: 2.50329 enter count: 1
3: used time: 53600 ratio: 8.82741 enter count: 1
2: used time: 4800 ratio: 0.790509 enter count: 1
1: used time: 34000 ratio: 5.59949 enter count: 1
0: used time: 0 ratio: 0 enter count: 1

time cost top 10 layers are: ...
7: used time: 126800 ratio: 20.8828 enter count: 1
15: used time: 114400 ratio: 18.8406 enter count: 1
5: used time: 55600 ratio: 9.15677 enter count: 1
3: used time: 53600 ratio: 8.82741 enter count: 1
16: used time: 53200 ratio: 8.76154 enter count: 1
9: used time: 46400 ratio: 7.64165 enter count: 1
8: used time: 34400 ratio: 5.66533 enter count: 1
1: used time: 34000 ratio: 5.59949 enter count: 1
19: used time: 18400 ratio: 3.03028 enter count: 1
11: used time: 16800 ratio: 2.7668 enter count: 1
Top cost layers occupied: 91.1726

@kaishijeng
Author

STATS for 10 repetitions: ...
Total time: 607204 per forward

Does it mean the time per forward is 607 ms?

Thanks,

@honggui
Contributor

honggui commented Jul 7, 2017

Hi Kaishijeng,
Yes,you are right.
Regards,
Honggui

@kaishijeng
Author

kaishijeng commented Jul 7, 2017 via email

@austingg

austingg commented Jul 7, 2017

@kaishijeng you may use

your/caffe/binary/caffe time -model alexnet.prototxt

@kaishijeng @honggui By the way, did you test the performance on a desktop processor? Are there any statistics from mobile devices? Also, in the doc, ACL_NEON seems slower than official Caffe with OpenBLAS. Which devices were tested? There seems to be a long way to go if testing on a 32-bit platform, since 32-bit OpenBLAS doesn't use NEON for speed-up.

@kaishijeng
Author

kaishijeng commented Jul 7, 2017 via email

@kaishijeng
Author

If the above numbers are a fair comparison, then ACL gives a 2.5x speedup over pure CPU on the Firefly platform.

I saw there is a MxnetonACL on GitHub. I am not sure whether there is a plan for a TensorflowOnACL, because I use the TensorFlow framework for most of my ML projects.

@xhbdahai
Contributor

Hi Kaishijeng:
There is no clear plan for TensorFlow so far.

@kaishijeng
Author

Thanks for an update.

@psyhtest

@xhbdahai, @honggui, @openailab-sh You will invariably end up with more questions about benchmarking Caffe-on-ACL against Caffe (or indeed other frameworks). Have you considered using / contributing to CK-Caffe? It's part of a growing suite of AI benchmarking tools based on Collective Knowledge, also including e.g. CK-Caffe2, CK-TensorFlow, CK-TensorRT, CK-KaNN.

For example, we have released benchmarking data for the Firefly-RK3399 platform that @kaishijeng uses.

In particular, for the batch size of 2 (the smallest we have measured) on AlexNet (the closest to CaffeNet we have measured), we have obtained the following data for forward propagation (inference):

  • OpenBLAS: 695 ms
  • clBLAS: ~3700 ms
  • ViennaCL: ~3650 ms
  • CLBlast: ~4500 ms
  • libDNN w/ CLBlast: ~2160 ms
  • CLBlast (tuned by dividiti): ~1320 ms

(I can easily benchmark CaffeNet with the batch size of 1 if you are interested.)

Would you be interested in collaborating on adding Caffe-on-ACL to CK-Caffe?

@psyhtest

psyhtest commented Jul 11, 2017

As an added bonus, we already support ACL package and crowdbenchmarking across mobile devices.

@OAIL

OAIL commented Jul 13, 2017

@psyhtest Adding CaffeOnACL to CK-Caffe is a good idea. We will give you feedback after estimating the effort.

@psyhtest

@OAIL How is the effort looking to you? :)

@baynaa7

baynaa7 commented Feb 26, 2018

Hello @honggui, I am testing CaffeOnACL vs Caffe on a TX2 board.
However, the classification example on AlexNet gives the following results.
The arguments are exactly the same as kaishijeng's.
CaffeOnACL: elapsed time: [2.28925] seconds
Caffe: elapsed time: [1.2105] seconds

Note: both are running the CPU version.

Any possible hypotheses for these results?
Thanks in advance.

@honggui
Contributor

honggui commented Feb 27, 2018

Hi pcub,
The performance of the different layers varies: some may be better with ACL, and others may be better with OpenBLAS. Refer to https://github.com/OAID/Caffe-HRT/blob/master/acl_openailab/user_manual.pdf to find the proper library for each operator.
BTW, OpenBLAS's threads seem to affect ACL's threads a lot. Sometimes we can use "export OPENBLAS_NUM_THREADS=1" to lower the side effect.
Best Regards,
Honggui
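For anyone hitting the same contention, the workaround can be applied before launching the benchmark; a minimal sketch (the invocation in the comment follows the commands used earlier in the thread):

```shell
# Pin OpenBLAS to a single worker so its thread pool does not fight
# with ACL's own threads for the cores.
export OPENBLAS_NUM_THREADS=1

# Then run the benchmark as before, e.g.:
# ./build/examples/cpp_classification/classification.bin \
#     models/bvlc_reference_caffenet/deploy.prototxt \
#     models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
#     data/ilsvrc12/imagenet_mean.binaryproto \
#     data/ilsvrc12/synset_words.txt \
#     examples/images/cat.jpg
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```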

@baynaa7

baynaa7 commented Feb 28, 2018

Thanks @honggui
export OPENBLAS_NUM_THREADS=1 works.

@Steven9402

Hi honggui, I want to test the performance of a face recognition application with multiple threads. Where can I add "CPPScheduler::set_num_threads(x)" to enable a multi-threaded test for ACL? The executable I use is OAID/FaceRecognition/bin/face-recognition.cpp. Also, I want to know if there is an interface to modify the number of threads used. Thanks a lot!
