
How to reduce CL target run time at the initialisation stage (prepare time) #977

Closed
abhajaswal opened this issue May 23, 2022 · 4 comments

@abhajaswal

Hello, I am trying to use ACL 20.02.

As you know, when we run any ACL example in a loop, for example MobileNet SSD v1, the first call to graph_run() takes about 1 minute.

From the 2nd run onwards, graph_run() takes about 92 ms.

As I understand it, on the first run ACL creates the pipeline and allocates memory/buffers, which is why it takes time. Is there any way to reduce this first-time initialisation cost?

@morgolock morgolock self-assigned this May 23, 2022
@morgolock morgolock added this to the v22.05 milestone May 23, 2022
@abhajaswal

Dear team,

Could you let me know what the root cause could be? Does the pipeline creation take more time?

I need to review my usage of Arm NN further, but if the prepare step takes a lot of time then I need an understanding of why.

Time taken by Arm NN CPU to prepare: 992 ms
Arm NN GPU: 9607 ms

Time taken by the open-source TFLite CPU plugin: 44 ms

@morgolock

Hi @abhajaswal

Could you please try with the latest release 22.05? There have been some improvements in the startup time since 20.02.

In general, both for CPU and GPU, the first iteration is slower because during this run ACL performs various transformations on the tensors to make sure the memory is accessed in the best way possible. All this additional work is done by the operators in their corresponding ::prepare() methods. For example, see ClGemmConv2d: https://github.com/ARM-software/ComputeLibrary/blob/main/src/gpu/cl/operators/ClGemmConv2d.cpp#L617
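To make the first-run cost concrete, here is a minimal Python sketch of the prepare-once pattern described above. The class and method names are illustrative only (not the real ClGemmConv2d API), and the "transform" is a trivial stand-in for the real weight-layout work:

```python
class PreparedOperator:
    """Sketch of a prepare-once operator: the first run pays a one-off
    cost, and every later run skips it via an internal flag."""

    def __init__(self, weights):
        self._weights = weights
        self._packed = None
        self._is_prepared = False
        self.prepare_calls = 0  # instrumentation for this sketch only

    def prepare(self):
        # One-off work: transform the weights into the layout the compute
        # kernels expect. Real ACL operators do this in ::prepare().
        if self._is_prepared:
            return
        self.prepare_calls += 1
        self._packed = list(reversed(self._weights))  # stand-in transform
        self._is_prepared = True

    def run(self, x):
        self.prepare()  # only the first call does the expensive part
        return [x * w for w in self._packed]
```

This is why the second and later iterations drop from roughly a minute to ~92 ms: the guard flag turns prepare() into a no-op after the first run.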

For the OpenCL backend you also have to add the time to compile the OpenCL kernels at runtime, which occurs during configuration. To mitigate this problem you can save the compiled kernels to disk and restore them at runtime. For more information please see the example: https://github.com/ARM-software/ComputeLibrary/blob/main/examples/cl_cache.cpp
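The cl_cache.cpp example linked above is C++ and device-specific, but the underlying compile-or-load idea can be sketched generically. In this hypothetical Python sketch, `compile_fn` stands in for the expensive runtime build step (clBuildProgram in the real OpenCL case); none of these names come from ACL itself:

```python
import hashlib
import os

def load_or_compile(source: str, cache_dir: str, compile_fn):
    """Return a compiled binary for `source`, caching it on disk so
    only the first run pays the compilation cost."""
    key = hashlib.sha256(source.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.isfile(path):
        # Cache hit: reuse the binary saved by an earlier run.
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: compile once, then persist the result for next time.
    binary = compile_fn(source)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

Note that a real kernel cache must also be invalidated when the driver or library version changes, since program binaries are device- and driver-specific.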

Please also be aware that using the OpenCL tuner in ACL can affect the startup time too. For more information please see: https://arm-software.github.io/ComputeLibrary/latest/architecture.xhtml#architecture_opencl_tuner

It would be helpful if you could share the complete command you used to run the example.

@abhajaswal

abhajaswal commented Jul 15, 2022

Thanks! Using cl_cache.bin I was able to reduce the model load time from 20612 ms to 1379 ms.

After restoring cl_cache.bin:
Image read time (From file or camera)
Min: 11 ms
Max: 11 ms
Avg: 11 ms
Image pre-process time
Min: 1 ms
Max: 1 ms
Avg: 1 ms
Model inference time
Min: 70 ms
Max: 70 ms
Avg: 70 ms
Model init/deinit time
Init : 1379 ms
Info: Shutdown time: 61.85 ms

The initial numbers, from the first run when cl_cache.bin was saved:

------------ PERFORMANCE ------------------
Image read time (From file or camera)
Min: 12 ms
Max: 12 ms
Avg: 12 ms
Image pre-process time
Min: 1 ms
Max: 1 ms
Avg: 1 ms
Model inference time
Min: 69 ms
Max: 69 ms
Avg: 69 ms
Model init/deinit time
Init : 20612 ms
Info: Shutdown time: 120.91 ms

-rwxr-xr-x 1 root root 2419612 Jan 2 18:24 armnn_clcahae.bin
-rw-r--r-- 1 root root 23018392 Jul 8 2022 od_tflite_model.tflite

I will have to generate this .bin file for N models, so won't it take up more storage? Could we not reduce the load time without cl_cache?

Actually, I tried the TFLite GPU delegate; its load time is also low, and I did not have to generate a cl_cache.bin for it. So I wonder how the TFLite team optimised the load time, and why with Arm NN/ACL I had to do this.

@morgolock

Hi @abhajaswal

Glad to hear you improved the load time by using prebuilt OpenCL kernels.

I will have to generate this .bin file for N models, so won't it take up more storage?

Yes. If that's a concern, you could easily deflate/inflate the cache at runtime with something like zlib to reduce its size on disk.
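The deflate/inflate suggestion above can be sketched in a few lines. This is a generic illustration using Python's standard-library zlib bindings, not anything ACL ships; the function names are made up for the sketch:

```python
import zlib

def deflate_cache(blob: bytes) -> bytes:
    # Shrink the kernel-cache blob before it is written to disk.
    # Level 9 trades a little CPU time for the best compression ratio.
    return zlib.compress(blob, level=9)

def inflate_cache(data: bytes) -> bytes:
    # Recover the exact original bytes before handing them back
    # to the runtime; zlib compression is lossless.
    return zlib.decompress(data)
```

Compiled program binaries tend to compress well, so keeping N per-model caches compressed and inflating each one at load time is usually a cheap way to cut the on-disk footprint.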

Could we not reduce the load time without cl_cache?

Unfortunately not without a major rework of the library. The OpenCL kernels need to be compiled at runtime, and that compilation is what requires the additional time.

Hope this helps.
