uint8 quantized model runs slower than fp32 model #667
Comments
ReduceFp32ToFp16 is set to True in my tests.
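For reference, a minimal sketch of where that option lives in the Arm NN C++ optimizer API; how the flag was actually set in these tests is not shown in the thread, so this is illustration only:

```cpp
// Minimal sketch: the ReduceFp32ToFp16 flag as exposed by armnn::OptimizerOptions.
// How the flag was set in the tests above is an assumption; illustration only.
#include <armnn/ArmNN.hpp>

armnn::OptimizerOptions MakeOptimizerOptions()
{
    armnn::OptimizerOptions options;
    options.m_ReduceFp32ToFp16 = true; // the setting referred to above
    return options;
}
```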
Hi @liamsun2019, I am getting 2 warnings for GATHER and TRANSPOSE when running your models with CpuAcc, as seen in your issue #666. I just want to confirm these are still present for you, so that I can comment correctly on the results. Running the models on CpuAcc with the following commands, I can confirm the same regression (~245 ms vs ~263 ms):
From a quick look, I cannot see any single operator that runs faster in the fp32 model than in the uint8 model. The profiling is quite extensive, so I will spend some time looking through it and come back if I find something. Kind Regards, Cathal.
One thing I have noticed: average pooling (only used once in your model) is not supported in CpuAcc for uint8, and therefore the operation falls back to CpuRef. Time cost: ~4000 us (CpuRef for uint8) vs ~117 us, a difference of roughly ~3900 us (~3.9 ms). @morgolock you might have an idea on whether uint8 support for average pooling 2d can be added to the Compute Library (it seems uint8 max pool 2d support is already there). Perhaps it cannot be added because of some kind of padding? Warning message:
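The warning text itself is not reproduced above. A minimal sketch, assuming the Arm NN TfLite parser path and a placeholder model file name, of how such fallback warnings can be collected through the messages argument of armnn::Optimize:

```cpp
// Sketch only: surface optimizer warnings (e.g. layers falling back to CpuRef).
// "model_uint8.tflite" is a placeholder file name.
#include <armnn/ArmNN.hpp>
#include <armnnTfLiteParser/ITfLiteParser.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    armnn::IRuntime::CreationOptions runtimeOptions;
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(runtimeOptions);

    auto parser = armnnTfLiteParser::ITfLiteParser::Create();
    armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("model_uint8.tflite");

    // Collect any warnings emitted while the graph is assigned to backends.
    std::vector<std::string> messages;
    armnn::IOptimizedNetworkPtr optNet = armnn::Optimize(
        *network,
        {armnn::BackendId("CpuAcc"), armnn::BackendId("CpuRef")},
        runtime->GetDeviceSpec(),
        armnn::OptimizerOptions(),
        armnn::Optional<std::vector<std::string>&>(messages));

    for (const std::string& msg : messages)
    {
        std::cout << msg << std::endl; // e.g. the CpuRef fallback warning above
    }
    return optNet ? 0 : 1;
}
```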
Along with the above, the following is what I have discovered:
Hello @liamsun2019, Falling back to CpuRef is very much degrading your performance. Unfortunately, because many of the transpose and gather operations are not supported for CpuAcc, fallback is inevitable. We do not guarantee that uint8 performance in CpuRef is better than fp32 (it will more than likely be slower, because of how it is implemented in Arm NN), which is why you are seeing worse uint8 performance. However, by using the delegate you can fall back to the TfLite runtime rather than CpuRef, which should have efficient uint8 performance compared to float32. You can do that by running the following:
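The original command is not reproduced in this thread. As a sketch of the same idea, assuming the Arm NN TfLite delegate C++ API and a placeholder model file name, the delegate can be attached to a TfLite interpreter so that any layer it cannot place on CpuAcc stays with the TfLite runtime:

```cpp
// Sketch only: attach the Arm NN delegate to a TfLite interpreter.
// "model_uint8.tflite" is a placeholder file name.
#include <armnn_delegate.hpp>
#include <tensorflow/lite/interpreter.h>
#include <tensorflow/lite/kernels/register.h>
#include <tensorflow/lite/model.h>
#include <memory>

int main()
{
    auto model = tflite::FlatBufferModel::BuildFromFile("model_uint8.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // Request CpuAcc; layers the delegate rejects remain with the TfLite runtime.
    armnnDelegate::DelegateOptions delegateOptions(armnn::Compute::CpuAcc);
    std::unique_ptr<TfLiteDelegate, decltype(&armnnDelegate::TfLiteArmnnDelegateDelete)>
        armnnDelegateHandle(armnnDelegate::TfLiteArmnnDelegateCreate(delegateOptions),
                            armnnDelegate::TfLiteArmnnDelegateDelete);

    if (interpreter->ModifyGraphWithDelegate(armnnDelegateHandle.get()) != kTfLiteOk ||
        interpreter->AllocateTensors() != kTfLiteOk)
    {
        return 1;
    }
    return interpreter->Invoke() == kTfLiteOk ? 0 : 1;
}
```

ExecuteNetwork can exercise the same delegate path from the command line in recent Arm NN releases, though the exact flag names vary by version.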
I hope this will improve the performance of your uint8 model. Kind Regards, Cathal.
I have tried to run your model with the delegate and it fails due to the following error:
@SadikARM provided me with the following information on what is happening:
Hi @catcor01, Many thanks for your time and such a detailed analysis. Instead, I ran these 2 models based on the sample code. I made some modifications while building them, e.g. -DUSE_ARMNN_DELEGATE=0/1, to apply either the delegate or the parser to the sample code. I also noticed that there are many transpose/gather operations in the model, and I think that contributes some overhead to the inference time. In delegate mode, I have not encountered the errors you listed. I will spend some time conducting more tests. Thanks
Hello @liamsun2019, A patch has been submitted to master (soon to be changed to main) fixing the above failure for CpuAcc. Your model should now be able to fully run using CpuAcc without the above error being thrown. Kind Regards, Cathal.
…fix.
* Originated from a GitHub issue: #667
* Initially, Arm NN supports the pool 2D operation because there is no padding on the pool2d. Neon failure occurs when padding is followed by average pool 2D due to folding optimization.
* Here we prevent the folding optimization from happening for the above special case and add it in as a backend specific optimization.
Signed-off-by: Cathal Corbett <cathal.corbett@arm.com>
Change-Id: Ia0fd90c3a6b4b9d29c81106f154617d2e893e26b
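To make the reasoning above concrete, here is a small standalone illustration (not Arm NN source code; all names and the capability check are invented) of the idea: only fold a preceding Pad layer into an average pool 2D descriptor when the target backend can execute the folded form, otherwise keep the two layers separate:

```cpp
// Standalone illustration of the "backend specific" folding decision described
// above. This is not Arm NN code; types and the capability check are invented.
#include <iostream>
#include <string>

struct Pool2dDesc
{
    int padLeft = 0, padRight = 0, padTop = 0, padBottom = 0;
    bool isAverage = true;
};

// Invented capability query: stands in for "does this backend support average
// pooling with explicit padding folded into its descriptor?".
bool SupportsPaddedAveragePool(const std::string& backend)
{
    return backend != "CpuAcc"; // assumption mirroring the Neon failure above
}

// Fold the Pad layer into the pooling descriptor only when it stays supported.
bool TryFoldPadIntoPool2d(Pool2dDesc& pool, int pad, const std::string& backend)
{
    if (pool.isAverage && !SupportsPaddedAveragePool(backend))
    {
        return false; // keep Pad and AveragePool2d as separate layers
    }
    pool.padLeft += pad; pool.padRight  += pad;
    pool.padTop  += pad; pool.padBottom += pad;
    return true;
}

int main()
{
    Pool2dDesc pool;
    std::cout << std::boolalpha
              << "Folded on CpuAcc: " << TryFoldPadIntoPool2d(pool, 1, "CpuAcc") << '\n'
              << "Folded on CpuRef: " << TryFoldPadIntoPool2d(pool, 1, "CpuRef") << '\n';
}
```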
Hi @catcor01, Sorry for the late reply. I have been focusing on some other work recently. I will try this patch ASAP. Thanks for your kind help.
@liamsun2019 could you let us know if this patch has fixed your issue? Otherwise I will close this ticket. Thank you very much.
Hi @keidav01, There's no progress on my side, since my attention has been absorbed by some other things so far. You can just close it, and I will verify the patch ASAP. Thanks for your help. B.R.
Thank you @liamsun2019, closing.
Hi author,
I encountered a question while doing inference on a Cortex-A55 (aarch64) with CpuAcc as the backend. There are 2 models: one is fp32 and the other is uint8 quantized. My tests showed that the fp32 model ran even faster than the uint8 quantized one. Just curious why this would happen. Please refer to the attachment for the 2 models. In addition, both the C++ parser mode and the delegate mode show the same issue. Appreciate your suggestions. Thanks.
test.zip