
How to reduce CL target run time at the initialisation stage (prepare time) #977

Closed
abhajaswal opened this issue May 23, 2022 · 4 comments

@abhajaswal

Hello, I am trying to use ACL 20.02.

As you know, when we run any ACL example in a loop, for example MobileNet SSD v1, the first call to graph_run() takes about 1 minute.

From the 2nd run onwards, graph_run() takes about 92 ms.

As I understand it, on the first run ACL creates the pipeline and allocates memory/buffers, which is why it takes time. Is there any way to reduce this first-time initialisation cost?

@morgolock morgolock self-assigned this May 23, 2022
@morgolock morgolock added this to the v22.05 milestone May 23, 2022
@abhajaswal

Dear team,

Could you let me know what the root cause could be? Does the pipeline creation take more time?

I need to review my usage of Arm NN further, but if the prepare step takes a lot of time then I need an understanding of why.

Time taken by Arm NN CPU to prepare: 992 ms
Arm NN GPU: 9607 ms

Time taken by the open-source TFLite CPU plugin: 44 ms

@morgolock

Hi @abhajaswal

Could you please try with the latest release 22.05? There have been some improvements in the startup time since 20.02.

In general, both for CPU and GPU, the first iteration is slower because during this run ACL performs various transformations on the tensors to make sure the memory is accessed in the best way possible. All this additional work is done by the operators in their corresponding ::prepare() methods. For example, see ClGemmConv2d: https://github.com/ARM-software/ComputeLibrary/blob/main/src/gpu/cl/operators/ClGemmConv2d.cpp#L617
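To make the first-run cost concrete, here is a minimal Python sketch of the prepare-once pattern described above. The class and method names are illustrative only (not the real ClGemmConv2d API), and the "transform" is a trivial stand-in for the real weight-layout work:

```python
class PreparedOperator:
    """Sketch of a prepare-once operator: the first run pays a one-off
    cost, and every later run skips it via an internal flag."""

    def __init__(self, weights):
        self._weights = weights
        self._packed = None
        self._is_prepared = False
        self.prepare_calls = 0  # instrumentation for this sketch only

    def prepare(self):
        # One-off work: transform the weights into the layout the compute
        # kernels expect. Real ACL operators do this in ::prepare().
        if self._is_prepared:
            return
        self.prepare_calls += 1
        self._packed = list(reversed(self._weights))  # stand-in transform
        self._is_prepared = True

    def run(self, x):
        self.prepare()  # only the first call does the expensive part
        return [x * w for w in self._packed]
```

This is why the second and later iterations drop from roughly a minute to ~92 ms: the guard flag turns prepare() into a no-op after the first run.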

For the OpenCL backend you also have to add the time to compile the OpenCL kernels at runtime, which occurs during configuration. To mitigate this problem you can save the compiled kernels to disk and restore them at runtime. For more information please see the example: https://github.com/ARM-software/ComputeLibrary/blob/main/examples/cl_cache.cpp
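The cl_cache.cpp example linked above is C++ and device-specific, but the underlying compile-or-load idea can be sketched generically. In this hypothetical Python sketch, `compile_fn` stands in for the expensive runtime build step (clBuildProgram in the real OpenCL case); none of these names come from ACL itself:

```python
import hashlib
import os

def load_or_compile(source: str, cache_dir: str, compile_fn):
    """Return a compiled binary for `source`, caching it on disk so
    only the first run pays the compilation cost."""
    key = hashlib.sha256(source.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.isfile(path):
        # Cache hit: reuse the binary saved by an earlier run.
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: compile once, then persist the result for next time.
    binary = compile_fn(source)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

Note that a real kernel cache must also be invalidated when the driver or library version changes, since program binaries are device- and driver-specific.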

Please also be aware that using the OpenCL tuner in ACL can affect the startup time too. For more information please see: https://arm-software.github.io/ComputeLibrary/latest/architecture.xhtml#architecture_opencl_tuner

It would be helpful if you could share the complete command you used to run the example.

@abhajaswal

abhajaswal commented Jul 15, 2022

Thanks! Using cl_cache.bin I was able to reduce the model load time from 20612 ms to 1379 ms.

After restoring cl_cache.bin:
Image read time (From file or camera)
Min: 11 ms
Max: 11 ms
Avg: 11 ms
Image pre-process time
Min: 1 ms
Max: 1 ms
Avg: 1 ms
Model inference time
Min: 70 ms
Max: 70 ms
Avg: 70 ms
Model init/deinit time
Init : 1379 ms
Info: Shutdown time: 61.85 ms

The initial numbers, from the first run when cl_cache.bin was saved:

------------ PERFORMANCE ------------------
Image read time (From file or camera)
Min: 12 ms
Max: 12 ms
Avg: 12 ms
Image pre-process time
Min: 1 ms
Max: 1 ms
Avg: 1 ms
Model inference time
Min: 69 ms
Max: 69 ms
Avg: 69 ms
Model init/deinit time
Init : 20612 ms
Info: Shutdown time: 120.91 ms

-rwxr-xr-x 1 root root 2419612 Jan 2 18:24 armnn_clcahae.bin
-rw-r--r-- 1 root root 23018392 Jul 8 2022 od_tflite_model.tflite

I will have to generate this .bin file for N models, so won't it take up more storage? Could we not reduce the load time without cl_cache?

Actually, I tried the TFLite GPU delegate; its load time is also low, and I did not have to generate a cl_cache.bin for it. So I wonder how the TFLite team optimised the load time, and why with Arm NN/ACL I had to do this.

@morgolock

Hi @abhajaswal

Glad to hear you improved the load time by using prebuilt OpenCL kernels.

I will have to generate this .bin file for N models, so won't it take up more storage?

Yes. If that's a concern, you could easily deflate/inflate the cache at runtime with something like zlib to reduce its size on disk.
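The deflate/inflate suggestion above can be sketched in a few lines. This is a generic illustration using Python's standard-library zlib bindings, not anything ACL ships; the function names are made up for the sketch:

```python
import zlib

def deflate_cache(blob: bytes) -> bytes:
    # Shrink the kernel-cache blob before it is written to disk.
    # Level 9 trades a little CPU time for the best compression ratio.
    return zlib.compress(blob, level=9)

def inflate_cache(data: bytes) -> bytes:
    # Recover the exact original bytes before handing them back
    # to the runtime; zlib compression is lossless.
    return zlib.decompress(data)
```

Compiled program binaries tend to compress well, so keeping N per-model caches compressed and inflating each one at load time is usually a cheap way to cut the on-disk footprint.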

Could we not reduce the load time without cl_cache?

Unfortunately not without a major rework of the library. The OpenCL kernels need to be compiled at runtime, and that compilation is what requires the additional time.

Hope this helps.
