
Implement multithreading independent of OpenMP #1901

Closed
Timen opened this issue Jun 30, 2020 · 9 comments

Timen (Contributor) commented Jun 30, 2020

Currently the multithreaded execution uses OpenMP to manage the threads and jobs. However, this has some issues. For one, OpenMP is not available on all platforms. Second, OpenMP is not 100% consistent across platforms; different compilers and platforms often support different versions.

Another large issue (which I haven't found a way to work around) is that OpenMP uses spinlocks (checking for work in a while loop) in the worker threads. This means that every worker thread sits at 100% CPU utilization after being spawned, which is pretty much a no-go on mobile platforms or other energy-scarce platforms. It also seems OpenMP threads don't always respect the CPU affinity set with the powersave mode.

I think it may be time to start using a better threadpool management system. There are some very good and portable options out there, such as pthreadpool (also used by TFLite/XNNPACK):
https://github.com/Maratyszcza/pthreadpool

The benefits would include better portability, more control over threads, and smarter thread and CPU usage for multithreaded execution.
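
For illustration, here is a minimal sketch of the pthreadpool API from the repository above (the compute_square task is a hypothetical example, not from ncnn):

#include <pthreadpool.h>
#include <stddef.h>

// hypothetical task: square one element of the array
static void compute_square(void* context, size_t i)
{
    int* data = (int*)context;
    data[i] = data[i] * data[i];
}

int main()
{
    int data[64];
    for (size_t i = 0; i < 64; i++)
        data[i] = (int)i;

    // create an explicit pool of 4 worker threads
    pthreadpool_t pool = pthreadpool_create(4);

    // run compute_square for i in [0, 64) across the pool's threads
    pthreadpool_parallelize_1d(pool, compute_square, data, 64, 0 /* flags */);

    // tear the pool down explicitly when no more work is coming
    pthreadpool_destroy(pool);
    return 0;
}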

nihui (Member) commented Jun 30, 2020

OpenMP is available on all major platforms; it even works on iOS.
The OpenMP pattern used in ncnn is just the pragma parallel for idiom, which is OpenMP 2.0 and supported by almost all OpenMP implementations. It is simple and works well in general.
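
For context, the idiom in question is essentially the following (a generic sketch of the pattern, not a verbatim excerpt from ncnn):

void relu_inplace(float* data, int size, int num_threads)
{
    // the pragma parallel for idiom: split loop iterations
    // across the runtime-managed worker threads
    #pragma omp parallel for num_threads(num_threads)
    for (int i = 0; i < size; i++)
    {
        if (data[i] < 0.f)
            data[i] = 0.f;
    }
}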

As for the spinlock considerations, almost all the parallel tasks in ncnn are balanced ones, so there is not much waiting time. Besides, setting the OMP_WAIT_POLICY environment variable to passive can force threads not to spin and to sleep immediately.
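
For completeness, if the variable has to be set from code rather than a shell, it must happen before the OpenMP runtime initializes, since most runtimes read it only once at startup. A POSIX-only sketch:

#include <stdlib.h>

int main()
{
    // must run before the first parallel region;
    // it has no effect once the runtime has started
    setenv("OMP_WAIT_POLICY", "passive", 1 /* overwrite */);

    // ... load the net and run inference ...
    return 0;
}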

Timen (Contributor, Author) commented Jun 30, 2020

Your suggestion of using OMP_WAIT_POLICY to disable spinlocking would only be suitable if it could be switched between active and passive on the fly between network executions. The spinlocks are necessary to get a proper speedup, but they should stop once the network has finished executing, which I believe isn't currently the case. Besides that, setting environment variables is easier said than done inside an iOS or Android application.

One final thing to consider is that with OpenMP it is not possible to have multiple threadpools. This is a problem when, for example, you want to execute a small model on the little cores in parallel with a large model on the big cores. Since you can't specify which threadpool to use, this is not possible with the current OpenMP approach.

nihui (Member) commented Jun 30, 2020

I don't like setting environment variables either.

I didn't find that threads are still spinning after the model execution finishes. Spinning only occurs before thread joining inside each parallel loop; once the loop finishes, all the workers sleep.
You can see that the CPU utilization is zero during cooldown in the ncnn benchmark program.

It is possible to have multiple threadpools using OpenMP. A new threadpool will be created for each new thread scope.
Suppose your platform has 2 big cores + 4 little cores, and you want to execute model A on the 2 big cores and model B on the 4 little cores concurrently.

Create two threads via std::thread or pthread:

#include <thread>
#include "cpu.h" // ncnn::set_cpu_powersave
#include "net.h" // ncnn::Net

static ncnn::Net netA; // models are assumed to be loaded elsewhere
static ncnn::Net netB;

void thread_1()
{
    ncnn::set_cpu_powersave(2); // bind to big cores
    netA.opt.num_threads = 2;
    // ... run inference with netA here ...
}

void thread_2()
{
    ncnn::set_cpu_powersave(1); // bind to little cores
    netB.opt.num_threads = 4;
    // ... run inference with netB here ...
}

// in the caller: std::thread a(thread_1), b(thread_2); a.join(); b.join();

Timen (Contributor, Author) commented Jun 30, 2020

Okay, yes, you may be right about the spinlocks actually stopping, probably after a certain time (maybe I was too impatient). However, the suggestion of using two different threads to set different CPU affinities for OpenMP doesn't actually create 2 different OpenMP threadpools. From my observations, the OpenMP threadpool was actually shared between all the threads, meaning that whichever call to set_cpu_powersave was made last determines the affinity of the threads in the threadpool.

I may be able to produce a minimal example to reproduce this behaviour (when I have some time).

nihui (Member) commented Jun 30, 2020

Refer to #1623.

Timen (Contributor, Author) commented Jun 30, 2020

I was also not able to reproduce the behaviour I had previously seen where the OpenMP threadpool was seemingly shared across threads, so that is one issue less. However, I still think the spinlocks used by OpenMP are an issue. For example, I ran the blazeface model with 1 thread and with 4 threads, and capped the number of network executions at 30 per second (30 FPS). With 1 thread, the Snapdragon Profiler shows very minimal utilization of a single core, as one would expect for such a small model (see the profiler load output below).
[Snapdragon Profiler screenshot: 1_thread_load]

However, once I run it with 4 threads, one core is still only minimally utilized, but the other 3 big cores are pegged at 100% utilization, even at 30 FPS, as can be seen in the profiler screenshot below.
[Snapdragon Profiler screenshot: 4_thread_load]

This 100% CPU utilization, even though the model only executes for 2 ms out of every 33 ms, is a big issue on mobile platforms, where it causes unnecessary battery drain and thermal throttling and reduces the CPU time available for other tasks.

nihui (Member) commented Jul 1, 2020

Try kmp_set_blocktime(0);
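
(kmp_set_blocktime() is an extension of the Intel/LLVM OpenMP runtime, declared in omp.h there; it is not part of the OpenMP standard and is not available in e.g. GNU libgomp. A sketch of how it would be used:

#include <omp.h>

int main()
{
    // tell libomp workers to stop spinning and sleep immediately
    // after each parallel region (blocktime is in milliseconds)
    kmp_set_blocktime(0);

    // ... load the net and run inference ...
    return 0;
})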

Timen (Contributor, Author) commented Jul 1, 2020

> Try kmp_set_blocktime(0);

I was not aware of that setting! It seems to help quite a bit, but I'm still worried about unnecessary spinlock activity, and I don't think a small value tuned for one CPU works well on another CPU. A larger blocktime interval would probably help with portability between higher- and lower-end CPUs, but it would also incur a larger percentage of CPU time spent doing nothing. (It seems the default was 200ms, which is not suitable for real-time applications!)

Ideally, we would be able to start the spinlocks at the beginning of the model execution and stop them at the end. Changing the blocktime before and after model execution doesn't seem to be a suitable method to achieve that.
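
For concreteness, the blocktime-toggling pattern being dismissed here would look roughly like this (a hypothetical sketch; the kmp_* calls are libomp-specific, and run_inference() is a stand-in for one forward pass):

void run_once()
{
    kmp_set_blocktime(20); // let workers spin between the many small
                           // parallel regions inside one forward pass
    run_inference();       // hypothetical: execute the network once
    kmp_set_blocktime(0);  // make workers sleep as soon as we go idle
}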

nihui (Member) commented Jul 23, 2020

The default blocktime has been set to 20ms instead of 200ms since 4e4f0ba.
