Implement multithreading independent of OpenMP #1901
OpenMP is available on all major platforms; it even works on iOS. As for the spinlock concern, almost all of the parallel tasks in ncnn are well balanced, so there is not much waiting time. Besides, setting the OMP_WAIT_POLICY environment variable can force threads to sleep immediately instead of spinning.
Your suggestion of using OMP_WAIT_POLICY to disable spinlocking would only be suitable if it could be switched between active and passive on the fly, between network executions. The spinlocks are necessary to get a proper speedup, but they should stop once the network has finished executing, which I believe isn't currently the case. Besides that, setting environment variables is easier said than done inside an iOS or Android application. One final thing to consider: with OpenMP it is not possible to have multiple threadpools. This is a problem when, for example, you want to execute a small model on the little cores in parallel with a large model on the big cores. Since you can't specify which threadpool to use, this isn't possible with the current OpenMP approach.
I don't like setting environment variables either. I didn't find that threads are still spinning after model execution finishes. Spinning only occurs before thread joining inside each parallel loop; once the loop finishes, all the workers sleep. It is possible to have multiple threadpools using OpenMP: a new threadpool is created for each new thread scope, so you can create two threads via std::thread or pthread and let each one enter its own parallel region.
Okay, yes, you may be right about the spinlocks actually stopping, probably after a certain time (maybe I was too impatient). However, using two different threads to set different CPU affinities for OpenMP doesn't actually create two different OpenMP threadpools. From my observations, the OpenMP threadpool is shared between all the threads, meaning that whichever call to set_cpu_powersave was made last determines the affinity of the threads in the pool. I may be able to produce a minimal example to reproduce this behaviour (when I have some time).
Refer to #1623.
Try
I was not aware of that setting! It seems to help quite a bit, but I'm still worried about unnecessary spinlock activity, and I don't think a small value tuned for one CPU works well on another. A larger blocktime wait interval would probably help with portability between higher-end and lower-end CPUs, but it would also mean a larger share of CPU time spent doing nothing. (It seems the default was 200 ms, which is not suitable for real-time applications!) Ideally, we would be able to start the spinlocks at the beginning of model execution and stop them at the end; changing the blocktime before and after model execution doesn't seem to be a suitable way to achieve that.
The default blocktime has been set to 20 ms instead of 200 ms since 4e4f0ba.
Currently the multithreaded execution uses OpenMP to manage the threads and jobs. This has some issues: first, OpenMP is not available on all platforms; second, OpenMP behaviour is not fully consistent across platforms, since different compilers and platforms often support different versions.
Another large issue (which I haven't found a way to work around) is that OpenMP uses spinlocks in the worker threads, i.e. each worker checks for work in a busy loop. This means every worker thread runs at 100% CPU utilization after being spawned, which is pretty much a no-go on mobile or other energy-constrained platforms. It also seems OpenMP threads don't always respect the CPU affinity set with the powersave mode.
I think it may be time to start using a better threadpool management system. There are some very good and portable options out there, such as pthreadpool (also used by TFLite/XNNPACK):
https://github.com/Maratyszcza/pthreadpool
The benefits would include better portability, more control over threads, and smarter thread and CPU usage for multithreaded execution.