
Implement multithreading independent of OpenMP #1901

Closed
Timen opened this issue Jun 30, 2020 · 9 comments

Timen (Contributor) commented Jun 30, 2020

Currently the multithreaded execution uses OpenMP to manage the threads and jobs. However, this has some issues. For one, OpenMP is not available on all platforms. Second, OpenMP is not 100% consistent across platforms; different compilers and platforms often support different versions.

Another large issue (which I haven't found a way to work around) is that OpenMP uses spinlocks (checking for work in a while loop) in the worker threads. This means that every worker thread sits at 100% CPU utilization after being spawned, which is pretty much a no-go on mobile platforms or other energy-scarce platforms. It also seems OpenMP threads don't always respect the CPU affinity set with the powersave mode.

I think it may be time to start using a better threadpool management system. There are some very good and portable options out there, such as pthreadpool (also used by TFLite/XNNPACK):
https://github.com/Maratyszcza/pthreadpool

The benefits would include better portability, more control over threads, and smarter thread and CPU usage for multithreaded execution.
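
For illustration, here is a minimal sketch of the pthreadpool API from the repository above (the compute_square task is a hypothetical example, not from ncnn):

#include <pthreadpool.h>
#include <stddef.h>

// hypothetical task: square one element of the array
static void compute_square(void* context, size_t i)
{
    int* data = (int*)context;
    data[i] = data[i] * data[i];
}

int main()
{
    int data[64];
    for (size_t i = 0; i < 64; i++)
        data[i] = (int)i;

    // create an explicit pool of 4 worker threads
    pthreadpool_t pool = pthreadpool_create(4);

    // run compute_square for i in [0, 64) across the pool's threads
    pthreadpool_parallelize_1d(pool, compute_square, data, 64, 0 /* flags */);

    // tear the pool down explicitly when no more work is coming
    pthreadpool_destroy(pool);
    return 0;
}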

nihui (Member) commented Jun 30, 2020

OpenMP is available on all major platforms; it even works on iOS.
The OpenMP pattern used in ncnn is just the pragma parallel for idiom, which is OpenMP 2.0 and supported by almost all OpenMP implementations. It is simple and works well in general.
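
For context, the idiom in question is essentially the following (a generic sketch of the pattern, not a verbatim excerpt from ncnn):

void relu_inplace(float* data, int size, int num_threads)
{
    // the pragma parallel for idiom: split loop iterations
    // across the runtime-managed worker threads
    #pragma omp parallel for num_threads(num_threads)
    for (int i = 0; i < size; i++)
    {
        if (data[i] < 0.f)
            data[i] = 0.f;
    }
}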

As for the spinlock considerations, almost all the parallel tasks in ncnn are balanced ones, so there is not much waiting time. Besides, setting the OMP_WAIT_POLICY environment variable to passive can force threads not to spin and to sleep immediately.
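
For completeness, if the variable has to be set from code rather than a shell, it must happen before the OpenMP runtime initializes, since most runtimes read it only once at startup. A POSIX-only sketch:

#include <stdlib.h>

int main()
{
    // must run before the first parallel region;
    // it has no effect once the runtime has started
    setenv("OMP_WAIT_POLICY", "passive", 1 /* overwrite */);

    // ... load the net and run inference ...
    return 0;
}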

Timen (Contributor, Author) commented Jun 30, 2020

Your suggestion of using OMP_WAIT_POLICY to disable spinlocking would only be suitable if it could be switched between active and passive on the fly between network executions. The spinlocks are necessary to get a proper speedup, but they should stop once the network has finished executing, which I believe isn't currently the case. Besides that, setting environment variables is easier said than done inside an iOS or Android application.

One final thing to consider is that with OpenMP it is not possible to have multiple threadpools. This is a problem when, for example, you want to execute a small model on the little cores in parallel with a large model on the big cores. Since you can't specify which threadpool to use, this is not possible with the current OpenMP approach.

nihui (Member) commented Jun 30, 2020

I don't like setting environment variables either.

I didn't find that threads are still spinning after the model execution finishes. Spinning only occurs before thread joining inside each parallel loop; once the loop finishes, all the workers sleep.
You can see that the CPU utilization is zero during cooldown in the ncnn benchmark program.

It is possible to have multiple threadpools using OpenMP. A new threadpool will be created for each new thread scope.
Suppose your platform has 2 big cores + 4 little cores, and you want to execute model A on the 2 big cores and model B on the 4 little cores concurrently.

Create two threads via std::thread or pthread:

#include <thread>
#include "cpu.h" // ncnn::set_cpu_powersave
#include "net.h" // ncnn::Net

static ncnn::Net netA; // models are assumed to be loaded elsewhere
static ncnn::Net netB;

void thread_1()
{
    ncnn::set_cpu_powersave(2); // bind to big cores
    netA.opt.num_threads = 2;
    // ... run inference with netA here ...
}

void thread_2()
{
    ncnn::set_cpu_powersave(1); // bind to little cores
    netB.opt.num_threads = 4;
    // ... run inference with netB here ...
}

// in the caller: std::thread a(thread_1), b(thread_2); a.join(); b.join();

Timen (Contributor, Author) commented Jun 30, 2020

Okay, yes, you may be right about the spinlocks actually stopping, probably after a certain time (maybe I was too impatient). However, the suggestion of using two different threads to set different CPU affinities for OpenMP doesn't actually create 2 different OpenMP threadpools. From my observations, the OpenMP threadpool was actually shared between all the threads, meaning that whichever call to set_cpu_powersave was made last determines the affinity of the threads in the threadpool.

I may be able to produce a minimal example to reproduce this behaviour (when I have some time).

nihui (Member) commented Jun 30, 2020

Refer to #1623.

Timen (Contributor, Author) commented Jun 30, 2020

I was also not able to reproduce the behaviour I had previously seen where the OpenMP threadpool was seemingly shared across threads, so that is one issue less. However, I still think the spinlocks used by OpenMP are an issue. For example, I ran the blazeface model with 1 thread and with 4 threads, and capped the number of network executions at 30 per second (30 FPS). With 1 thread, the Snapdragon Profiler shows very minimal utilization of a single core, as one would expect for such a small model (see the profiler load output below).
[Snapdragon Profiler screenshot: 1_thread_load]

However, once I run it with 4 threads, one core is still only minimally utilized, but the other 3 big cores are pegged at 100% utilization, even at 30 FPS, as can be seen in the profiler screenshot below.
[Snapdragon Profiler screenshot: 4_thread_load]

This 100% CPU utilization, even though the model only executes for 2 ms out of every 33 ms, is a big issue on mobile platforms, where it causes unnecessary battery drain and thermal throttling and reduces the CPU time available for other tasks.

nihui (Member) commented Jul 1, 2020

Try kmp_set_blocktime(0);
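
(kmp_set_blocktime() is an extension of the Intel/LLVM OpenMP runtime, declared in omp.h there; it is not part of the OpenMP standard and is not available in e.g. GNU libgomp. A sketch of how it would be used:

#include <omp.h>

int main()
{
    // tell libomp workers to stop spinning and sleep immediately
    // after each parallel region (blocktime is in milliseconds)
    kmp_set_blocktime(0);

    // ... load the net and run inference ...
    return 0;
})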

Timen (Contributor, Author) commented Jul 1, 2020

> Try kmp_set_blocktime(0);

I was not aware of that setting! It seems to help quite a bit, but I'm still worried about unnecessary spinlock activity, and I don't think a small value tuned for one CPU works well on another CPU. A larger blocktime interval would probably help with portability between higher- and lower-end CPUs, but it would also incur a larger percentage of CPU time spent doing nothing. (It seems the default was 200ms, which is not suitable for real-time applications!)

Ideally, we would be able to start the spinlocks at the beginning of the model execution and stop them at the end. Changing the blocktime before and after model execution doesn't seem to be a suitable method to achieve that.
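
For concreteness, the blocktime-toggling pattern being dismissed here would look roughly like this (a hypothetical sketch; the kmp_* calls are libomp-specific, and run_inference() is a stand-in for one forward pass):

void run_once()
{
    kmp_set_blocktime(20); // let workers spin between the many small
                           // parallel regions inside one forward pass
    run_inference();       // hypothetical: execute the network once
    kmp_set_blocktime(0);  // make workers sleep as soon as we go idle
}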

nihui (Member) commented Jul 23, 2020

The default blocktime has been set to 20ms instead of 200ms since 4e4f0ba.
