Issue
Running the OpenMP (GCC 7.5) tests in parallel using the ctest -jXX option is far slower than running them one at a time: going from -j1 to -j2 raises walltime from 1:16 to 46:42 and CPU time from 597 s to 33574 s. This OpenMP implementation's thread scheduling doesn't seem to take other processes into account.
| CTest Parallelism (cpp.omp.cpp14) | CPU Time (s) | Walltime (mm:ss) |
|---|---|---|
| -j1 | 597 | 1:16 |
| -j2 | 33574 | 46:42 |
Fix incoming
I've set RUN_SERIAL (https://cmake.org/cmake/help/v3.10/prop_test/RUN_SERIAL.html) on the OMP tests in my changes for #1159.
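For reference, a minimal sketch of marking a test RUN_SERIAL in CMake. The test and target names here (example.omp.test, example_omp_test) are hypothetical placeholders, not the names used in #1159; the actual change may set the property elsewhere in the test-registration code.

```cmake
# Hypothetical names for illustration only; assumes the
# example_omp_test executable target is defined elsewhere.
add_test(NAME example.omp.test COMMAND example_omp_test)

# RUN_SERIAL tells CTest not to run this test concurrently with any
# other test, even when ctest is invoked with -jN.
set_tests_properties(example.omp.test PROPERTIES RUN_SERIAL ON)
```

With this property set, ctest -jN still runs the other tests in parallel but schedules each marked test on its own, so its threads aren't competing with other test processes.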
Other backends
TBB
TBB has some scaling issues, but doesn't fall off a cliff at -j2. On a 6-core x 2-SMT CPU, TBB scales well for a small number of processes:
| CTest Parallelism (cpp.omp.cpp14) | Walltime (s) |
|---|---|
| -j1 | 81 |
| -j2 | 55 |
| -j4 | 45 |
| -j6 | 50 |
| -j8 | 60 |
| -j12 | 68 |
My initial plan was to leave this as-is and let these tests continue to run at the requested parallelism, since CMake doesn't offer parallelism properties with finer control than RUN_SERIAL and all of the parallel configs are faster than -j1.
After discussion with @griwes and @brycelelbach, however, TBB tests should also be marked RUN_SERIAL. The increased runtime is worth it to ensure that the individual test processes run at full threaded parallelism.
CUDA
CUDA scales very favorably with more CPUs, at least in the range I can test. On the same CPU as above while running tests on both GV100 and GP100:
| CTest Parallelism (cpp.cuda.cpp14) | Walltime (s) |
|---|---|
| -j1 | (a "Very Long Time") |
| -j6 | 208 |
| -j8 | 199 |
| -j10 | 197 |
| -j12 | 176 |
These tests will continue to run at the requested parallelism.