Performance issue with many cores #1881

Closed
jeremiedbb opened this issue Nov 22, 2018 · 41 comments · Fixed by #4585

@jeremiedbb

Running the following code on a machine with many cores gives worse performance than limiting the number of threads.

import numpy as np
X = np.random.random_sample((2048, 2048))
%time np.linalg.svd(X)

I built numpy against OpenBLAS master (I observe the same thing with the OpenBLAS shipped with numpy from conda-forge).

Here are the timings I get for the code above. The machine has 88 cores (44 physical cores + hyperthreading).

  • OPENBLAS_NUM_THREADS unset
    CPU times: user 2min 53s, sys: 11min 38s, total: 14min 31s
    Wall time: 11.9 s

  • OPENBLAS_NUM_THREADS=4
    CPU times: user 9.45 s, sys: 3.26 s, total: 12.7 s
    Wall time: 3.29 s

Below is a comparison with MKL:

  • MKL_NUM_THREADS unset
    CPU times: user 57.8 s, sys: 2.46 s, total: 1min
    Wall time: 1.59 s

  • MKL_NUM_THREADS=4
    CPU times: user 8.27 s, sys: 192 ms, total: 8.46 s
    Wall time: 2.16 s

This raises another issue: on CPUs with hyperthreading, OpenBLAS uses the maximum possible number of threads, which is twice the number of physical cores. It should arguably use only as many threads as there are physical cores, because BLAS operations don't really benefit from hyperthreading (hyperthreading is meant to interleave tasks of a different nature, which is the opposite of what BLAS kernels do).
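
A user-side workaround is to cap the thread count at the physical core count before OpenBLAS is initialized. A minimal sketch, assuming psutil is available; the variable must be set before numpy first loads OpenBLAS:

import os
import psutil

# Limit OpenBLAS to physical cores only; must happen before numpy loads OpenBLAS.
os.environ["OPENBLAS_NUM_THREADS"] = str(psutil.cpu_count(logical=False))

import numpy as np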

@martin-frbg
Collaborator

Will need to see what BLAS calls np.linalg.svd makes...
Without promising any improvement, could you please try with the OpenBLAS "develop" branch? ("master" is currently stuck at 0.2.20 so it may actually be older than what ships with numpy; all subsequent releases have been made from the 0.3 branch.)
It will probably depend on the individual hardware and problem size whether hyperthreading helps or not - when I brought up the topic in some other issue a while ago, a contributor from Intel argued against leaving the HT pseudocores unused on reasonably current hardware. The actual issue here could be that OpenBLAS creates too many threads for some subtask with a small matrix size - there have been a few changes in recent versions that address this in individual BLAS functions.

@jeremiedbb
Author

jeremiedbb commented Nov 22, 2018

could you please try with the OpenBLAS "develop" branch?

Sorry, I said master but it was actually against the develop branch. I didn't notice the default branch isn't named master :)

a contributor from Intel argued against leaving the HT pseudocores unused on reasonably current hardware

It's strange because MKL never uses those :)

@jeremiedbb
Author

It seems that svd calls xGESVD or xGESDD, with x = s or d.
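
(For reference, SciPy exposes the driver choice directly, which makes it easy to time the two separately; a small sketch, assuming SciPy is available:)

import numpy as np
from scipy.linalg import svd

X = np.random.random_sample((2048, 2048))
U, s, Vt = svd(X, lapack_driver="gesdd")   # divide-and-conquer driver
U, s, Vt = svd(X, lapack_driver="gesvd")   # classic QR-iteration driver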

@martin-frbg
Collaborator

Thanks. Strictly speaking this is LAPACK (which for OpenBLAS means the non-parallelized, non-optimized "netlib" reference implementation); the task now is to follow their call tree and see which BLAS functions get involved.

@martin-frbg
Collaborator

From http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_ga84fdf22a62b12ff364621e4713ce02f2.html
this seems to be mostly GEMM/GEMV, SWAP, TRMM/TRMV, which I think had already been checked for unnecessary thread creation. (That, however, was mostly about avoiding senseless multithreading for small inputs; most if not all functions still take an all-or-nothing approach, while MKL has been claimed to limit the number of active threads according to problem size.)

@jeremiedbb
Author

Thanks. I guess #678 addresses the same concern, but it's still at the discussion stage :(

About the hyperthreading, I'd strongly argue against using pseudocores. On every machine I tested, there was no improvement from using them; limiting the number of threads to the physical cores was always better or equal. And for the same problems, MKL never used the pseudocores.

Obviously this requires benchmarks, but I think it could be worth it.

@brada4
Contributor

brada4 commented Nov 22, 2018

Before L1TF one could tune x86 performance settings to get more out of a single pseudocore while the other was idle; now they are best left disabled altogether, for both security and performance.

@brada4
Contributor

brada4 commented Nov 22, 2018

I will quickly recheck the threading.
@jeremiedbb, what is your exact CPU (or CPUs)?
Does it help to limit threads to those present in one NUMA node, as seen in numactl -H?

@brada4
Contributor

brada4 commented Nov 22, 2018

(0.3.3 pthread) There is the lock race around malloc, taking 0-0.2-2-6% of time at 1-2-4-HT8 threads (it should be gone in the develop branch after 0.3.3), but otherwise no visible spills; it should indeed grow much worse with lots of cores.
EDIT: indeed, up to 200x more time is consumed in syscalls compared to 4 processors. EDIT2: it is in the initial post.
Could you try whether #1785 gives an improvement in the meantime?

@jeremiedbb
Author

jeremiedbb commented Nov 23, 2018

@brada4 Thanks for looking into it!

The CPU is an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz.
There are 2 sockets with 22 physical cores each, and every core has hyperthreading.

There are 2 NUMA nodes. Limiting threads to a single NUMA node definitely helps, but not as much as limiting threads to physical cores. EDIT: they perform equally.

If I limit to 1 NUMA node, OpenBLAS uses 44 threads on 22 physical cores on 1 NUMA node.
If I limit to physical cores, OpenBLAS uses 44 threads on 44 physical cores across 2 NUMA nodes.

The second case runs twice as fast in wall time and makes half as many syscalls. However, both run slower than using only 4 cores, and trigger far more syscalls.

Could you try whether #1785 gives an improvement in the meantime?

I think I did try it because it was merged 1 month+ ago and I cloned the repo 2 days ago for this test.

@jeremiedbb
Author

I edited my previous comment because at first I limited the cores the code could use, but I didn't limit the number of threads for OpenBLAS (I only used taskset). Apparently, OpenBLAS spawns as many threads as the machine has cores, even in a program restricted to a specific subset of cores by taskset.

@brada4
Contributor

brada4 commented Nov 23, 2018

There is another way: set OPENBLAS_NUM_THREADS=22 (or 11 if cluster-on-die is enabled in the BIOS; check with numactl -H whether you have 2 or 4 NUMA nodes), then rely on the OS to perform proper NUMA bindings. It could quite likely happen that the 32 MB input gets dragged across the NUMA link at pessimal speed to CPUs on other NUMA nodes.
EDIT: I actually do this kind of configuration on multi-user systems for a living, and for other HPC libraries too.
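
If numactl itself is not available, a rough in-process approximation of the CPU-binding part is possible from Python. A sketch; it assumes Linux and that NUMA node 0 owns CPUs 0-21 as reported by lscpu, and it does not bind memory the way numactl --membind would:

import os

node0_cpus = set(range(0, 22))
os.sched_setaffinity(0, node0_cpus)    # pin the current process to node 0's cores

os.environ["OPENBLAS_NUM_THREADS"] = str(len(node0_cpus))
import numpy as np                     # import numpy only after the limits are in place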

@jeremiedbb
Author

jeremiedbb commented Nov 28, 2018

numactl is not installed and I don't have root access. However, lscpu is enough here and shows 2 NUMA nodes (with their assigned core IDs).

When setting OPENBLAS_NUM_THREADS=44, the OS does not group all threads on the same NUMA node. Neither does setting OPENBLAS_NUM_THREADS=22. However, limiting the threads keeps OpenBLAS off the hyperthreads and gives a huge improvement in performance.

When setting OPENBLAS_NUM_THREADS=44 and manually setting the affinity with taskset to use only 1 NUMA node, performance is basically the same as when only setting OPENBLAS_NUM_THREADS. Setting OPENBLAS_NUM_THREADS=22 with the same affinity gives slightly better performance.

It could quite likely happen that the 32 MB input gets dragged across the NUMA link at pessimal speed to CPUs on other NUMA nodes.

I think you're right. With bigger inputs, performance is more in line with what I expected.

In summary:

  • When the input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls, especially on machines with more than one NUMA node. But this is a known limitation of OpenBLAS and there is already an open issue for it, Dynamic number of threads #678.

  • OpenBLAS always tries to use all available cores (physical + hyperthreading). I found that this hurts performance across all the input sizes I experimented with. This has also been reported in OpenMP thread placement and affinity #1653.
    MKL only ever uses physical cores (as written in the documentation of mkl_get_max_threads). I think they do that for a good reason, and it might be worth trying the same.

  • When OpenBLAS is called from within a program whose affinity has been manually set (e.g. using taskset), openblas_get_num_threads will still report all existing cores on the machine, spawning too many threads. I don't know whether this is a known issue or expected behavior.

In the meantime, a simple workaround is to set OPENBLAS_NUM_THREADS according to the input size and environment, for instance as in the sketch below.
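
A minimal sketch of that idea using threadpoolctl, which can change the BLAS thread pool size at runtime (the size threshold is made up purely for illustration):

import numpy as np
from threadpoolctl import threadpool_limits

def svd_limited(X):
    # Made-up heuristic: small problems get few threads, large ones get more.
    n_threads = 4 if max(X.shape) < 4096 else 16
    with threadpool_limits(limits=n_threads, user_api="blas"):
        return np.linalg.svd(X)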

Thanks for your answers @brada4 @martin-frbg !

@martin-frbg
Collaborator

When OpenBLAS is called from within a program whose affinity has been manually set (e.g. using taskset), openblas_get_num_threads will still report all existing cores on the machine, spawning too many threads. I don't know whether this is a known issue or expected behavior.

I thought this should be solved already (#1155) and init.c should only enumerate the CPUs in the current taskset, but I never tried my supposed fix on systems with multiple nodes. (Also, the added code has various exit points for older libc or other restrictions on CPU enumeration; maybe I messed up there.)

@brada4
Contributor

brada4 commented Nov 28, 2018

Syscalls will be wildly reduced in 0.3.4; check #1785.
There is nothing specific to NUMA nodes in that particular code, though applying common sense there were n! potential lock interactions, which got measurably worse with dozens of threads.

MKL does that only since 2018; older versions take all hyperthreads. Do you have any authoritative documentation that hyperthreads are a waste? It is common sense, and a security problem for certain, but there is no clear statement that they should be left out for performance.

@jeremiedbb
Author

Do you have any authoritative documentation that hyperthreads are a waste?

Absolutely not. As I said, it's just the result of experiments. It should not be taken for granted, but I think it's enough to at least open the discussion.

@brada4
Contributor

brada4 commented Nov 28, 2018

There is no statement beyond "go and try this trick", e.g.
https://software.intel.com/en-us/mkl-linux-developer-guide-managing-multi-core-performance

@martin-frbg
Collaborator

martin-frbg commented Nov 28, 2018

It might be worthwhile to experiment with OMP_PROC_BIND and OMP_PLACES as well:
http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html
(And I also note that setting CPU affinity is disabled in the default configuration of OpenBLAS, as it would affect other code in situations where the library is dlopened by an interpreter prior to loading other things.)

@brada4
Contributor

brada4 commented Nov 28, 2018

Actually, there are no timings with regard to system time after the lock patch. Strange how an nproc-fold reduction in syscalls went unnoticed on the 88 detected cores.

@fenrus75
Contributor

fenrus75 commented Dec 3, 2018

Also, 22 is... an unfortunate number.

#1846, which got merged as
f1c0227,

is what I did on Skylake to avoid some really bad behavior; one of these days I need to port a set of these fixes to Haswell/Broadwell.

Basically, what happens without the patch is that OpenBLAS just divides, so

2048 / 22 = 93

and 93 is a really awful number from a SIMD perspective... The patch in that commit gives much better behavior.

@brada4
Contributor

brada4 commented Dec 3, 2018

Not only SIMD; I spotted the same with Level 1 BLAS, like messing up basic (supposedly page, hopefully cache-line) alignment for subsequent threads.

@martin-frbg
Collaborator

martin-frbg commented Dec 3, 2018

Rereading #1846, it would seem sufficient to just copy the "define GEMM_PREFERED_SIZE 32" line from the SKYLAKEX section of param.h to the corresponding HASWELL block (around line 1511 of the current develop/0.3.4 version) and rerun the numpy linalg.svd test.

@fenrus75
Contributor

fenrus75 commented Dec 3, 2018

Maybe. One might need to do the equivalent of dcc5d62 to avoid nasty "we did not think a work chunk could be 0-sized" behavior.

@martin-frbg
Collaborator

Good question whether blas_quickdivide can legitimately return zero (unless the work size is zero, in which case I wonder why we would get to that code at all).

@fenrus75
Contributor

fenrus75 commented Dec 3, 2018

Before the rounding to preferred sizes it wouldn't.
But let's say the width is 64, there are 3 CPUs, and the preferred size is 32.

Before, you get
21 21 22 as sizes;
with the preferred size you get
32 32 0.

(21 is one of those really nasty sizes for performance, so this case is a LOT faster with the rounding up.)

But due to the rounding up you can now get 0-sized work.
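
A rough illustration of that splitting idea (a sketch of the arithmetic only, not the actual OpenBLAS code):

def split(total, nthreads, preferred=None):
    # Each thread takes an even share of what is left; optionally round the
    # share up to a multiple of the preferred size.
    chunks = []
    remaining = total
    for i in range(nthreads):
        chunk = remaining // (nthreads - i)
        if preferred:
            chunk = -(-chunk // preferred) * preferred   # round up to multiple of 'preferred'
        chunk = min(chunk, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(split(64, 3))                # [21, 21, 22]
print(split(64, 3, preferred=32))  # [32, 32, 0]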

@martin-frbg
Collaborator

Not sure I get this: as the (rounded) width gets deducted from the initial m (or n), shouldn't the divider loop exit after assigning two workloads of 32 each?

@brada4
Contributor

brada4 commented Dec 3, 2018

I will check tomorrow regarding my alignment theory.
@fenrus75 there is no alignment/chunk-preference signaling in most cases; say, a Level 1 BLAS thread only knows the number of elements in the vector, but not whether they are S/D/C/Z, i.e. their size.

@brada4
Contributor

brada4 commented Dec 5, 2018

When the input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls

@jeremiedbb could you try this in common.h:

#ifndef YIELDING
// #define YIELDING     sched_yield()
#define YIELDING {};
#endif

(strace -o strace.out benchmark/dgemm.goto 1024 1024 1024)

Modern thread libraries de-schedule idle threads implicitly; there is no need to force it. That is the syscall that was done 20000 times in a 5 s DGEMM.

@martin-frbg I think some small sleep should be here - it is effectively polling the thread status... That yielding syscall burns CPU cycles non-productively, where a trivial sleep would just set a timer for the future. In blas_server.c:

      while(tsqq) {
        YIELDING;   /* <---- 20000 syscalls */
        pthread_mutex_lock(&thread_status[queue->assigned].lock);
        tsqq=thread_status[queue -> assigned].queue;
        pthread_mutex_unlock(&thread_status[queue->assigned].lock);

@martin-frbg
Collaborator

@brada4 remember this was tried before and the results were inconclusive at best.

@brada4
Contributor

brada4 commented Dec 5, 2018

I think it is around the places where the now-retired sched.compat_yield sysctl used to operate; now we are stuck in a world with the non-compat one.
I think pthread_cond_wait is the pthread equivalent that does not involve syscalls.

@treadalus

I just scanned this thread. In my experience on AWS, hyperthreading causes performance issues; I use scripts to turn it off. Btw, I wrote a wrapper for Octave to control the number of threads that would fit nicely under OpenBLAS/benchmark/scripts/OCTAVE/, but I'm not sure how best to submit it.

@fenrus75
Contributor

fenrus75 commented Dec 6, 2018 via email

@fenrus75
Contributor

I will retract my statement based on (new) data.

I have now encountered cases where hyperthreading with OpenBLAS is a severe hindrance.

HT is fine if both threads are doing basic math operations: both do productive work, and while the performance increase is limited, they don't get in each other's way.

HOWEVER

OpenBLAS at times puts a thread in a polling mode doing yield() in a tight loop. This tight loop does a system call (with all the nasty TLB and cache effects that entails) and takes many kernel locks as part of it, etc. The result is that this polling thread utterly destroys the math performance of the other thread on the same core... sometimes in the 2x range.

@martin-frbg
Collaborator

martin-frbg commented Dec 28, 2018

That brings us back to #1560 and #900 - how does the same scenario look with the YIELDING macro from common.h set to a couple of nops (as is done on ARM, POWER and a somewhat random selection of AMD CPUs)?

@fenrus75
Contributor

On x86 (both Intel and AMD) one would use "pause", not "nop" (pause is also a hyperthreading / virtualization / power management hint).

I'll test. My initial fix was to count cores, not hyperthreads.

There is also a root cause behind this that I am not sure of yet; the symptom is lots of time spent in yield(), but the cause is likely some other bug. I've observed in this specific scenario that a lot of "M = 0" dgemm calls get made by the level 3 syrk code... maybe that causes some threads to finish early and just spin.

@fenrus75
Contributor

The nop (pause) does not help much at all (within measurement noise).

The HT fix is still worth almost 2x.

@fenrus75
Contributor

The next experiment was to make blas_thread_server not yield at all and always use condition variables.
A small gain both with and without the HT fix...

(But I have a modern glibc; I suspect it may be doing a spin-then-block wait internally.)

@brada4
Contributor

brada4 commented Dec 29, 2018

The visible problem is that sched_yield makes a syscall while nop does nothing. Some small delay like 0.0001 s could do the same without grinding CPU cycles out of turbo mode.

@fenrus75
Contributor

That was in the tree very, very briefly... but got reverted since it hurt as well.

In the case that seems to matter the most, just using the condition variable, rather than polling first, seems to work just as well as polling.

There's also a slightly different problem, in that if I use taskset to reduce the number of CPUs, performance goes up (I'll do a sweep tomorrow to see if there's a sweet spot, and then whether we can compute where the sweet spot is based on matrix size or something).

@fenrus75
Contributor

fenrus75 commented Dec 30, 2018

On a 10-core/20-thread system, data below. All numbers are cost relative to the single-threaded (1.000) case; lower is better.

Size 1 core 2 cores 3 cores 4 cores 5 cores 6 cores 7 cores 8 cores 9 cores 10 cores 11 threads 12 threads 13 threads 14 threads 15 threads 16 threads 17 threads 18 threads 19 threads 20 threads
96x96x96 1.000 1.181 1.517 1.087 1.146 1.496 1.540 1.065 1.135 1.144 1.152 1.365 1.494 1.515 1.495 1.068 1.159 1.159 1.141 1.078
128x128x128 1.000 0.781 1.062 0.657 0.628 0.912 0.911 0.675 0.763 0.632 0.655 0.841 0.899 0.896 0.914 0.675 0.755 0.760 0.768 0.610
192x192x192 1.000 0.719 0.669 0.648 0.684 0.440 0.406 0.535 0.515 0.664 0.670 0.441 0.440 0.401 0.403 0.539 0.539 0.539 0.541 0.656
256x256x256 1.000 0.687 0.612 0.415 0.455 0.472 0.478 0.331 0.295 0.362 0.359 0.432 0.444 0.495 0.487 0.333 0.331 0.295 0.293 0.366
384x384x384 1.000 0.606 0.429 0.350 0.360 0.256 0.258 0.265 0.277 0.278 0.288 0.291 0.259 0.262 0.269 0.281 0.271 0.290 0.285 0.296
512x512x512 1.000 0.588 0.440 0.323 0.326 0.267 0.262 0.206 0.205 0.210 0.261 0.284 0.292 0.295 0.293 0.216 0.194 0.192 0.193 0.197
768x768x768 1.000 0.259 0.180 0.141 0.126 0.102 0.101 0.082 0.084 0.086 0.133 0.100 0.101 0.091 0.091 0.090 0.088 0.089 0.089 0.088
1024x1024x1024 1.000 0.563 0.398 0.298 0.276 0.235 0.204 0.169 0.157 0.162 0.212 0.216 0.216 0.218 0.217 0.162 0.161 0.161 0.162 0.163
1536x1536x1536 1.000 0.560 0.192 0.146 0.127 0.103 0.094 0.084 0.083 0.068 0.118 0.100 0.094 0.093 0.094 0.080 0.080 0.080 0.074 0.073
2048x2048x2048 1.000 0.366 0.373 0.202 0.167 0.141 0.125 0.110 0.105 0.101 0.145 0.147 0.133 0.128 0.127 0.110 0.106 0.104 0.101 0.102

@thijssteel

I think that besides pure performance, dynamic selection of the number of threads has other advantages:

  • it's probably much better for power usage
  • it makes it a bit clearer to developers how scalable their code is (for example, my latest project, the new QZ algorithm for LAPACK, rarely uses more than 4 threads when linked with MKL). This is valuable information.

I also think it would be interesting to look at the performance for rectangular matrices. The QZ algorithm contains many (k x k x n) multiplications with k << n, and I've found those to be particularly problematic when the threads aren't limited; a small benchmark for that case is sketched below.
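
A minimal sketch of such a benchmark (the shapes are only illustrative; run it under different OPENBLAS_NUM_THREADS settings to compare):

import time
import numpy as np

k, n = 32, 16384                 # k << n, as in the QZ update steps
A = np.random.rand(k, k)
B = np.random.rand(k, n)

start = time.perf_counter()
for _ in range(200):
    A @ B                        # thin (k x k) by (k x n) multiplication
print("elapsed:", time.perf_counter() - start)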
