uint8 quantized model runs slower than fp32 model #667

Closed
liamsun2019 opened this issue Jul 12, 2022 · 13 comments
Labels
Bug (Something isn't working), Performance, Question (Further information is requested), TIME WAIT (Waiting for an appropriate period for a response before closing the issue)

Comments

@liamsun2019

Hi author,
I ran into a question while doing inference on Cortex-A55 (aarch64) with CpuAcc as the backend. There are 2 models: one is fp32 and the other is uint8 quantized. My tests showed that the fp32 model ran even faster than the uint8 quantized one. I am just curious why this would happen. Please refer to the attachment for the 2 models. In addition, both C++ parser mode and delegate mode have the same issue. I would appreciate your suggestions. Thanks.
test.zip

@liamsun2019
Author

ReduceFp32ToFp16 is set to True in my tests.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

Hi @liamsun2019,

I am getting 2 warnings for GATHER and TRANSPOSE when running your models with CpuAcc, as seen in your issue #666. I just want to confirm these are still present for you so that I can comment correctly on the results.

Running the models on CpuAcc with the following commands, I can confirm the same regression (~245ms vs ~263ms):

./ExecuteNetwork -m u8l.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i X.1 -o 2180 --number-of-threads 1 --iterations 10
./ExecuteNetwork -m fp32.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i input.55 -o 1456 --number-of-threads 1 --iterations 10

From a quick look, I cannot see any operator that runs faster in the fp32 model compared to the uint8 model. The profiling is quite extensive, so I will spend some time looking through it and come back if I find something.
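As an independent cross-check, one could also time both attached models with the stock TfLite Python runtime (no Arm NN involved) to see whether the uint8 vs fp32 gap shows up there too. The sketch below is only illustrative: the filenames are taken from test.zip, and it assumes the tflite_runtime package is available on the target.

import time
import numpy as np
import tflite_runtime.interpreter as tflite

def time_model(path, iterations=10):
    interpreter = tflite.Interpreter(model_path=path, num_threads=1)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Random data of the right shape/dtype is enough for a latency check.
    data = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], data)
    interpreter.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(iterations):
        interpreter.invoke()
    return (time.perf_counter() - start) / iterations * 1000.0

for model in ("fp32.tflite", "u8l.tflite"):
    print(f"{model}: {time_model(model):.1f} ms per inference")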

Kind Regards, Cathal.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

f32.txt
u8.txt

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

One thing I have noticed: average pooling (only used once in your model) is not supported in CpuAcc for uint8 and therefore the operation falls back to CpuRef.

Time cost: ~4000us (CpuRef for uint8) vs ~117us, a difference of ~3900us, i.e. ~3.9ms.
There is a time cost to falling back to CpuRef, due to a memory copy before and after the operation, but it is negligible compared to the above: ~9us and ~5us, at most ~15us.

@morgolock you might have an idea on whether uint8 support for average pooling 2d can be added to the compute library (it seems uint8 max pool 2d support is already there). Perhaps it cannot be added because of some kind of padding? Warning message:
Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

Along with the above, here is what I have discovered:

  • The uint8 model performs quantize and dequantize operations (NeonQuantizeWorkload_Execute_#227 being the biggest time cost), which add up to approx. 3.5-4 ms.
  • CpuAcc pooling 2d is slower for uint8. It can be up to 1 ms slower.
  • CpuAcc concat is slower for uint8. It can be between 1 and 2 ms slower.
  • The CpuRef gather operation can be twice as slow for the uint8 model (1.25 ms vs 0.6 ms).

@catcor01
Collaborator

Hello @liamsun2019,

Falling back to CpuRef is very much degrading your performance. Unfortunately, because many of the transpose and gather operations are not supported on CpuAcc, fallback is inevitable. We do not guarantee that uint8 performance in CpuRef is better than fp32 (it will more than likely be slower because of how it is implemented in Arm NN), which is why you are seeing worse uint8 performance. However, by using the delegate you can fall back to the TfLite runtime rather than CpuRef, which should have efficient uint8 performance compared to float32. You can do that by running the following:

./ExecuteNetwork -m u8l.tflite -f tflite-binary --tflite-executor delegate -c CpuAcc -i X.1 -o 2180 --number-of-threads 1 --iterations 10
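For reference, the same delegate path can be exercised from Python. The snippet below is only a sketch: it assumes the Arm NN delegate was built as libarmnnDelegate.so and is on the loader path, and the option keys follow the delegate documentation.

import tflite_runtime.interpreter as tflite

# Load the Arm NN delegate and request only the CpuAcc backend.
armnn_delegate = tflite.load_delegate(
    library="libarmnnDelegate.so",
    options={"backends": "CpuAcc", "logging-severity": "warning"})

interpreter = tflite.Interpreter(
    model_path="u8l.tflite",
    experimental_delegates=[armnn_delegate],
    num_threads=1)
interpreter.allocate_tensors()
# Operators the delegate does not claim stay on the default TfLite kernels,
# which have efficient uint8 implementations, instead of dropping to CpuRef.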

I hope this will improve the performance of your uint8 model.

Kind Regards, Cathal.

@catcor01
Collaborator

catcor01 commented Jul 19, 2022

I have tried to run your model with the delegate and it fails due to the following error:

Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.
Warning: ERROR: Layer of type Pooling2d is not supported on any preferred backend [CpuAcc ]
terminate called after throwing an instance of 'armnn::Exception'
  what():  TfLiteArmnnDelegate: Exception (Failed to assign a backend to each layer) caught from optimize.

@SadikARM provided me with the following information about what is happening:
"
I believe the reason it is not falling back to the TfLite runtime is that IsLayerSupported() first returns true for the Pooling2d layer, which means the layer has already been delegated to Arm NN. Then, somewhere in the flow (it seems during optimization), CpuPool2dKernel::validate_arguments() is called and it throws the error. So at the optimization stage it is too late to fall back to the TfLite runtime, because the graph has already been delegated to Arm NN.
"
I will look into this and make a patch.
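In the meantime, a possible interim workaround (untested here, and it reintroduces the CpuRef cost for that pooling layer discussed above) would be to keep CpuRef in the delegate's backend list so the Pooling2d layer still has a backend to land on:

./ExecuteNetwork -m u8l.tflite -f tflite-binary --tflite-executor delegate -c CpuAcc,CpuRef -i X.1 -o 2180 --number-of-threads 1 --iterations 10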

@liamsun2019
Author

Hi @catcor01,

Many thanks for your time and such a detailed analysis. In my case, I ran these 2 models based on the sample code. I made some modifications while building it, e.g. -DUSE_ARMNN_DELEGATE=0/1, to switch the sample code between delegate and parser mode. I also noticed that there are many transpose/gather operations in the model, and I think that contributes some overhead to the inference time. In delegate mode I have not encountered the errors you listed. I will spend some time conducting more tests.

Thanks
B.R
Liam

@catcor01 added the Bug, Question and Performance labels Jul 27, 2022
@catcor01
Collaborator

Hello @liamsun2019,

A patch has been submitted to master (soon to be changed to main) fixing the above failure for CpuAcc. Your model should now be able to fully run using CpuAcc without the above error being thrown.

Kind Regards, Cathal.

@catcor01 added the TIME WAIT label Jul 29, 2022
MatthewARM pushed a commit that referenced this issue Jul 29, 2022
…fix.

  * Originated from a GitHub issue: #667
  * Initially, Arm NN supports the pool 2D operation because there is no padding
    on the pool2d. Neon failure occurs when padding is followed by average pool 2D
    due to folding optimization.
  * Here we prevent the folding optimization from happening for the above special case
    and add it in as a backend specific optimization.

Signed-off-by: Cathal Corbett <cathal.corbett@arm.com>
Change-Id: Ia0fd90c3a6b4b9d29c81106f154617d2e893e26b
@liamsun2019
Author

Hi @catcor01

Sorry for the late reply. I have been focusing on some other work recently. I will try this patch ASAP. Thanks for your kind help.

nikraj01 pushed a commit that referenced this issue Sep 6, 2022
…fix.
@keidav01
Contributor

@liamsun2019 could you let us know if this patch has fixed your issue? Otherwise, I will close this ticket. Thank you very much.

@liamsun2019
Author

Hi @keidav01,

There's no progress on my side, as my attention has been taken up by some other things so far. You can just close it, and I will verify the patch ASAP. Thanks for your help.

B.R
Liam

@keidav01
Contributor

Thank you @liamsun2019, closing
