uint8 quantized model runs slower than fp32 model #667

Closed
liamsun2019 opened this issue Jul 12, 2022 · 13 comments
Labels
Bug (Something isn't working), Performance, Question (Further information is requested), TIME WAIT (Waiting for an appropriate period for a response before closing the issue)

Comments

@liamsun2019

Hi author,
I ran into a question while doing inference on Cortex-A55 (aarch64) with CpuAcc as the backend. There are 2 models: one is fp32 and the other is uint8 quantized. My tests showed that the fp32 model ran even faster than the uint8 quantized one. I am just curious why this would happen. Please refer to the attachment for the 2 models. In addition, both C++ parser mode and delegate mode have the same issue. I would appreciate your suggestions. Thanks.
test.zip

@liamsun2019
Author

ReduceFp32ToFp16 is set to True in my tests.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

Hi @liamsun2019,

I am getting 2 warnings for GATHER and TRANSPOSE when running your models with CpuAcc, as seen in your issue #666. I just want to confirm these are still present for you so that I can comment correctly on the results.

Running the models on CpuAcc with the following commands, I can confirm the same regression (~245ms vs ~263ms):

./ExecuteNetwork -m u8l.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i X.1 -o 2180 --number-of-threads 1 --iterations 10
./ExecuteNetwork -m fp32.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i input.55 -o 1456 --number-of-threads 1 --iterations 10

From a quick look, I cannot see any operator that runs faster in the fp32 model compared to the uint8 model. The profiling is quite extensive, so I will spend some time looking through it and come back if I find something.
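As an independent cross-check, one could also time both attached models with the stock TfLite Python runtime (no Arm NN involved) to see whether the uint8 vs fp32 gap shows up there too. The sketch below is only illustrative: the filenames are taken from test.zip, and it assumes the tflite_runtime package is available on the target.

import time
import numpy as np
import tflite_runtime.interpreter as tflite

def time_model(path, iterations=10):
    interpreter = tflite.Interpreter(model_path=path, num_threads=1)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Random data of the right shape/dtype is enough for a latency check.
    data = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], data)
    interpreter.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(iterations):
        interpreter.invoke()
    return (time.perf_counter() - start) / iterations * 1000.0

for model in ("fp32.tflite", "u8l.tflite"):
    print(f"{model}: {time_model(model):.1f} ms per inference")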

Kind Regards, Cathal.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

f32.txt
u8.txt

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

One thing I have noticed: average pooling (only used once in your model) is not supported in CpuAcc for uint8 and therefore the operation falls back to CpuRef.

Time cost: ~4000us (CpuRef for uint8) vs ~117us, a difference of ~3900us, i.e. ~3.9ms.
There is a time cost to falling back to CpuRef, due to a memory copy before and after the operation, but it is negligible compared to the above: ~9us and ~5us, at most ~15us.

@morgolock you might have an idea on whether uint8 support for average pooling 2d can be added to the compute library (it seems uint8 max pool 2d support is already there). Perhaps it cannot be added because of some kind of padding? Warning message:
Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

Along with the above, here is what I have discovered:

  • The uint8 model performs quantize and dequantize operations (NeonQuantizeWorkload_Execute_#227 being the biggest time cost), which add up to approx. 3.5-4 ms.
  • CpuAcc pooling 2d is slower for uint8. It can be up to 1 ms slower.
  • CpuAcc concat is slower for uint8. It can be between 1 and 2 ms slower.
  • The CpuRef gather operation can be twice as slow for the uint8 model (1.25 ms vs 0.6 ms).

@catcor01
Collaborator

Hello @liamsun2019,

Falling back to CpuRef is very much degrading your performance. Unfortunately, because many of the transpose and gather operations are not supported on CpuAcc, fallback is inevitable. We do not guarantee that uint8 performance in CpuRef is better than fp32 (it will more than likely be slower because of how it is implemented in Arm NN), which is why you are seeing worse uint8 performance. However, by using the delegate you can fall back to the TfLite runtime rather than CpuRef, which should have efficient uint8 performance compared to float32. You can do that by running the following:

./ExecuteNetwork -m u8l.tflite -f tflite-binary --tflite-executor delegate -c CpuAcc -i X.1 -o 2180 --number-of-threads 1 --iterations 10
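For reference, the same delegate path can be exercised from Python. The snippet below is only a sketch: it assumes the Arm NN delegate was built as libarmnnDelegate.so and is on the loader path, and the option keys follow the delegate documentation.

import tflite_runtime.interpreter as tflite

# Load the Arm NN delegate and request only the CpuAcc backend.
armnn_delegate = tflite.load_delegate(
    library="libarmnnDelegate.so",
    options={"backends": "CpuAcc", "logging-severity": "warning"})

interpreter = tflite.Interpreter(
    model_path="u8l.tflite",
    experimental_delegates=[armnn_delegate],
    num_threads=1)
interpreter.allocate_tensors()
# Operators the delegate does not claim stay on the default TfLite kernels,
# which have efficient uint8 implementations, instead of dropping to CpuRef.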

I hope this will improve the performance of your uint8 model.

Kind Regards, Cathal.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

I have tried to run your model with the delegate and it fails due to the following error:

Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.
Warning: ERROR: Layer of type Pooling2d is not supported on any preferred backend [CpuAcc ]
terminate called after throwing an instance of 'armnn::Exception'
  what():  TfLiteArmnnDelegate: Exception (Failed to assign a backend to each layer) caught from optimize.

@SadikARM provided me with the following information about what is happening:
"
I believe the reason it is not falling back to the TfLite runtime is that IsLayerSupported() first returns true for the Pooling2d layer, which means the layer has already been delegated to Arm NN. Then, somewhere in the flow (it seems during optimization), CpuPool2dKernel::validate_arguments() is called and it throws the error. So at the optimization stage it is too late to fall back to the TfLite runtime, because the graph has already been delegated to Arm NN.
"
I will look into this and make a patch.
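In the meantime, a possible interim workaround (untested here, and it reintroduces the CpuRef cost for that pooling layer discussed above) would be to keep CpuRef in the delegate's backend list so the Pooling2d layer still has a backend to land on:

./ExecuteNetwork -m u8l.tflite -f tflite-binary --tflite-executor delegate -c CpuAcc,CpuRef -i X.1 -o 2180 --number-of-threads 1 --iterations 10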

@liamsun2019
Author

Hi @catcor01,

Many thanks for your time and such a detailed analysis. In my case, I ran these 2 models based on the sample code. I made some modifications while building it, e.g. -DUSE_ARMNN_DELEGATE=0/1, to switch the sample code between delegate and parser mode. I also noticed that there are many transpose/gather operations in the model, and I think that contributes some overhead to the inference time. In delegate mode I have not encountered the errors you listed. I will spend some time conducting more tests.

Thanks
B.R
Liam

@catcor01 added the Bug, Question and Performance labels Jul 27, 2022
@catcor01
Collaborator

Hello @liamsun2019,

A patch has been submitted to master (soon to be changed to main) fixing the above failure for CpuAcc. Your model should now be able to fully run using CpuAcc without the above error being thrown.

Kind Regards, Cathal.

@catcor01 added the TIME WAIT label Jul 29, 2022
MatthewARM pushed a commit that referenced this issue Jul 29, 2022
…fix.

  * Originated from a GitHub issue: #667
  * Initially, Arm NN supports the pool 2D operation because there is no padding
    on the pool2d. Neon failure occurs when padding is followed by average pool 2D
    due to folding optimization.
  * Here we prevent the folding optimization from happening for the above special case
    and add it in as a backend specific optimization.

Signed-off-by: Cathal Corbett <cathal.corbett@arm.com>
Change-Id: Ia0fd90c3a6b4b9d29c81106f154617d2e893e26b
@liamsun2019
Author

Hi @catcor01

Sorry for the late reply. I have been focusing on some other work recently. I will try this patch ASAP. Thanks for your kind help.

nikraj01 pushed a commit that referenced this issue Sep 6, 2022
…fix.
@keidav01
Contributor

@liamsun2019 could you let us know if this patch has fixed your issue? Otherwise, I will close this ticket. Thank you very much.

@liamsun2019
Author

Hi @keidav01,

There's no progress on my side, as my attention has been taken up by some other things so far. You can just close it, and I will verify the patch ASAP. Thanks for your help.

B.R
Liam

@keidav01
Contributor

Thank you @liamsun2019, closing
