maybe a problem with 0.3.12 and NumPy? #2970

Closed
mattip opened this issue Nov 2, 2020 · 57 comments

Comments

@mattip
Contributor

mattip commented Nov 2, 2020

Just a heads up that NumPy is seeing some problems with OpenBLAS 0.3.12.

We released NumPy 1.19.3 with OpenBLAS 0.3.12 in order to fix the Windows fmod problems. We had to back it out and are releasing a 1.19.4 with the previous OpenBLAS, plus code that errors out if the fmod bug is detected instead. We got reports that 1.19.3 crashes, xref numpy/numpy#17674 and numpy/numpy#17684. In the first issue, the reporter provided a docker image and code to reproduce the problem, along with the output of some system diagnostics. The second issue is a lack of memory in hstack, which is difficult to attribute to OpenBLAS, so there may be something else going on.
Both issues were "fixed" (or covered up) by the 1.19.4 release candidate with an older OpenBLAS.
Does any of this make sense to you?

Edit: no docker image, just diagnostics

@martin-frbg
Collaborator

No idea. If anything, thread memory requirements of 0.3.12 should be less than before, and I thought NumPy was satisfied with the state of OpenBLAS after application of the fmod workarounds. Are you bundling the actual 0.3.12 release or some snapshot from around that time? And the older OpenBLAS you went back to is what - 0.3.10 or even older?

@charris

charris commented Nov 2, 2020

@martin-frbg Our CI testing showed no problems and the wheels builds succeeded, so 0.3.12 looked good prior to release.

EDIT: I also expect a fair number of people tried 1.19.3 because of the Python 3.9 support. If the problem was universal there would probably have been more bug reports.

@martin-frbg
Collaborator

Well, the only change in 0.3.11/0.3.12 that I would expect to have any effect at library/thread initialization time is the reduction of the BLAS3_MEM_ALLOC_THRESHOLD variable, and its previous default is still available in Makefile.rule (or as a parameter to make). Reducing this actually made crashes in other people's code go away, however. If your "working" OpenBLAS is older than 0.3.10 then we are looking at a scary number of changes, and probably need a git bisect to get anywhere.

@h-vetinari
Contributor

h-vetinari commented Nov 2, 2020

Unfortunately, it's 0.3.9

@brada4
Contributor

brada4 commented Nov 2, 2020

The original issues got closed without a code change:
numpy/numpy#17674 (comment)
numpy/numpy#17684 (comment)

@charris

charris commented Nov 2, 2020

@brada4 We switched back to the earlier library for 1.19.4.

@martin-frbg
Collaborator

Hm, thanks. 0.3.10 (or at least the changes that looked important enough to label with the milestone) was mostly CMake build improvements and a few thread race fixes, nothing that I would expect to blow up as soon as the library gets loaded.

@brada4
Contributor

brada4 commented Nov 2, 2020

The build tag (a32f1dca) on the OpenBLAS .so file is not related to any tags here. Where does it come from, and how do we trace back to the OpenBLAS source code used to build the library?
The backtrace in those issues says the last OpenBLAS thread jumped to 0x0, which should not happen, and without a full backtrace it is hard to tell what made it do so.
Also, given the threads mentioned - would it be possible to run the same crashing docker with OPENBLAS_NUM_THREADS=1 added?
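
A minimal sketch of that single-threaded test from the Python side (assuming the crash can be triggered from a plain session inside the failing container; the matrix sizes are illustrative, not taken from the reports):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must be set before numpy loads the bundled OpenBLAS

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
print((a @ b).sum())  # runs dgemm; if the crash is threading-related, this should now survive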

@mattip
Contributor Author

mattip commented Nov 2, 2020

The descriptive string v0.3.7-527-g79fd006c is the result of git describe --tags --abbrev=8. The OpenBLAS commit is 79fd006c and is somewhere between 0.3.9 and 0.3.10.

@mattip
Contributor Author

mattip commented Nov 2, 2020

According to the comment in the issue there is information at https://drive.google.com/drive/folders/18mcYQu4GGPzwCRj14Pzy8LvfC9bcLid8?usp=sharing on the docker environment where the segfault occurs.

@brada4
Contributor

brada4 commented Nov 2, 2020

It would take ages to reconstruct the failing environment 1:1. The question is whether it is possible to clearly attribute the problem to "OpenBLAS threading" by confirming there is no problem whatsoever when threading is turned off via the environment variable.

The issue mentions /virtualenv/python3.6/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-a32f1dca.3.12.so - a GitHub search points it to this thread (and a full search to the same tag mentioned in the exact numpy binary build), but not to the code - could you help dig it back to an OpenBLAS tag?
I am not picky or anything; it would be easier for all if we knew the numerology behind the file tagging and could admit guilt right away ;-)

@mattip
Contributor Author

mattip commented Nov 2, 2020

@brada4 I am not sure what you are asking. A static build of OpenBLAS is downloaded and built, together with some other things, into the .so you see as part of the NumPy build process. The reason it has a hash is to uniquely bind that .so to the NumPy build; it does not directly reflect a version of OpenBLAS. The exact version of OpenBLAS used in the build of the wheel is the code tagged with the 1.19.3 tag, and is here - you can see the tag is 0.3.12. You can download the wheel and check the result of openblas_get_config(), as we do later in that file, to verify that the OpenBLAS version is the one we thought it was.
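
For illustration, a hedged sketch of one way to do that check from Python on a Linux install (the numpy.libs directory name and glob pattern are assumptions about the manylinux wheel layout, not something documented here):

import ctypes
import glob
import os
import numpy as np

# look next to the installed numpy package for the bundled OpenBLAS shared object
site_dir = os.path.dirname(os.path.dirname(np.__file__))
candidates = (glob.glob(os.path.join(site_dir, "numpy.libs", "libopenblas*.so*"))
              or glob.glob(os.path.join(os.path.dirname(np.__file__), ".libs", "libopenblas*.so*")))

lib = ctypes.CDLL(candidates[0])
lib.openblas_get_config.restype = ctypes.c_char_p
print(lib.openblas_get_config().decode())  # the string should mention OpenBLAS 0.3.12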

@brada4
Contributor

brada4 commented Nov 2, 2020

@mattip thanks, that fills the gap in my understanding.

@carlkl

carlkl commented Nov 2, 2020

One thing that was changed in 0.3.12 is x86_64: clobber all xmm registers after vzeroupper. See also Omit clobbers from vzeroupper until final [PR92190]. And the Microsoft x64 calling convention says that:
user-written assembly language routines must preserve XMM6 and XMM7.
Is this patch really necessary? And is it valid for the Windows 64 ABI?

@carlkl

carlkl commented Nov 2, 2020

Another bit can be found here: Software consequences of extending XMM to YMM (Agner Fog)

However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that register XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.

@mattip
Contributor Author

mattip commented Nov 2, 2020

Both of the users who reported this issue are using Docker on Linux, right?

@martin-frbg
Collaborator

Callee-saves is implemented in the assembly PROLOGUE/EPILOGUE for Windows, so this particular change should play no role here.

@bashtage

bashtage commented Nov 3, 2020

OpenBLAS 0.3.12 is making large memory requests on initialization that depend heavily on the number of threads. With 24 threads, I am seeing:

[screenshot of the memory usage omitted]

This scales pretty linearly with the number of threads when I set $env:OPENBLAS_NUM_THREADS.
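
A rough, Linux-flavoured sketch of how one might chart that scaling (psutil is an assumed extra dependency; the virtual size is where the per-thread buffers tend to show up right after import):

import os
import subprocess
import sys

probe = ("import psutil, numpy; m = psutil.Process().memory_info(); "
         "print(m.vms // 2**20, 'MiB virtual,', m.rss // 2**20, 'MiB resident')")

for n in ("1", "4", "24"):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=n)
    out = subprocess.run([sys.executable, "-c", probe], env=env,
                         capture_output=True, text=True).stdout.strip()
    print("OPENBLAS_NUM_THREADS=" + n + ":", out)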

@bashtage

bashtage commented Nov 3, 2020

When I roll back to 0.3.9, the memory ask is about 1/4 of that:

[screenshot of the memory usage omitted]

I checked 0.3.10 and it also shows high memory usage, so it seems like something between these two releases.

@martin-frbg
Collaborator

That then is probably the GEMM buffer, configurable as BUFFERSIZE at compile time (in Makefile.rule). The default was increased a few times recently to fix overflows at huge matrix sizes - there is certainly a trade-off, and possibly a fundamental design flaw in OpenBLAS, involved. 0.3.12 built with make BUFFERSIZE=20 should have about the same footprint as 0.3.9 (see #1698 for the original issue, #2538 for the motivation to actually change the defaults).

@mattip
Contributor Author

mattip commented Nov 4, 2020

If we shrink the BUFFERSIZE, would we run the risk of running out of room with 24 threads and large matrices? Should NumPy cap the number of threads with openblas_set_num_threads?
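
For reference, a sketch of capping the pool from the Python side with threadpoolctl (a separate package that calls openblas_set_num_threads on the bundled library; this is an assumption about how a user would do it, not something NumPy does automatically):

import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

print(threadpool_info())  # lists the loaded BLAS libraries and their current thread counts

with threadpool_limits(limits=4, user_api="blas"):  # at most 4 OpenBLAS threads in this block
    a = np.random.rand(2000, 2000)
    print(np.trace(a @ a))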

@martin-frbg
Collaborator

martin-frbg commented Nov 4, 2020

1. Yes, in the same way as 0.3.9 and earlier, but not worse.
2. Probably yes, though that seems to depend on the use case (method and context of loading numpy/openblas).

All this is assuming that BUFFERSIZE is actually the culprit - I have not gotten around to trying the dockerfile from the numpy ticket, and that ticket did not mention hardware constraints.

@jonringer

Bumping from 0.3.10 to 0.3.12 in nixpkgs has caused numpy to fail its tests now:
NixOS/nixpkgs#101780 (comment)

stacktrace:

(gdb) backtrace
#0  0x00007fffe33a327a in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#1  0x00007fffe33a1e09 in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#2  0x00007fffe88fc11f in exec_blas () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#3  0x00007fffe87c9d93 in gemm_driver () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#4  0x00007fffe87c9f67 in dgemm_thread_nn () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#5  0x00007fffe86dedc7 in cblas_dgemm () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#6  0x00007fffea2e87c1 in cblas_matrixproduct ()
   from /nix/store/1dynhpbdks2lzpjbkz1i5p0rkm5klfn4-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
lscpu info
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD Ryzen Threadripper 3990X 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3776.522
CPU max MHz:                     2900.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        5800.49
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr ss
                                 e sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_t
                                 sc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe po
                                 pcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalign
                                 sse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc
                                  mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2
                                 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves c
                                 qm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt
                                  lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthres
                                 hold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

@martin-frbg
Collaborator

martin-frbg commented Nov 6, 2020

@jonringer this looks like something else, not the BUFFERSIZE thing. I wonder why the backtrace shows the OpenMP library and a liblapack but nothing that calls itself OpenBLAS - do you have libopenblas aliased to liblapack, or could this be a mixup of library builds? (So far I have not seen mention of numpy itself failing tests with 0.3.12, cf. charris' comment above #2970 (comment).)
(Accidentally closed this due to touchpad malfunction...)

@mattip
Contributor Author

mattip commented Nov 6, 2020

I am a little hesitant about overriding the OpenBLAS default BUFFERSIZE constant in NumPy's build process. OpenBLAS changed it for good reasons, and it could cause incompatibilities. Is refactoring this on the roadmap?

@brada4
Contributor

brada4 commented Nov 9, 2020

"prod with error" is not expected to work. Ubuntu 16 kernel has no provision of AVX512 XSAVE.

@carlkl

carlkl commented Nov 9, 2020

@brada4, do you mean some individual SIMD instruction set extensions are disabled (using XCR0 register) by the virtual OS within Docker, and this is the reason for the segfaults?

@brada4
Contributor

brada4 commented Nov 9, 2020

AVX512 extensions are visible in the KVM guest, but using them leads to numeric eurekas, which likely signify register corruption.
An HWE kernel (4.15.xxxx-generic) on both sides functions properly.
Did not try 18LTS usermode over a 16LTS kernel though.
Proper tests with significance here would be trying OPENBLAS_NUM_THREADS=1 and OPENBLAS_CORETYPE=(haswell|sandybridge) and seeing whether the problem goes or stays.

@mattip
Contributor Author

mattip commented Nov 10, 2020

What are numeric eurekas? If I understand correctly, you are saying that anyone with a CPU that supports AVX512 extensions should be able to crash OpenBLAS by running the OpenBLAS/NumPy tests in a KVM guest?

@martin-frbg
Collaborator

martin-frbg commented Nov 10, 2020

Seems quite unlikely to me - there was code using AVX512 extensions in 0.3.9 already and the DYNAMIC_ARCH builds are supposed to do a runtime check for actual availability. I'd be much more interested in results from a build with the smaller 0.3.9 BUFFERSIZE, or a simple self-contained reproducer. If anything I would expect AVX512 mishandling to result in SIGILL or SIGFPE (or silent garbage results), not SIGSEGV

@mattip
Contributor Author

mattip commented Nov 10, 2020

Had understood the original issue ticket to have a docker image for reproducing the problem

Sorry, I corrected that in the description. FWIW, the information provided indicates the host machine has 128GB and 31 cores. I can't replicate that locally unfortunately. How much memory does the default BUFFER_SIZE allocate per thread?

@martin-frbg
Collaborator

128MB; with 0.3.9 it was only 32MB (written as a bitshift in common_x86_64.h, it is "32<<22" vs "32<<20", hence the suggestion to specify "20" in #2970 (comment)).
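
Spelling out that arithmetic (a quick sanity check, not from the thread):

# per-thread GEMM buffer implied by the bitshift above
print((32 << 22) // 2**20)      # 0.3.10+ default: 128 MiB per thread
print((32 << 20) // 2**20)      # 0.3.9 default:    32 MiB per thread
print(24 * (32 << 22) / 2**30)  # e.g. 24 threads -> 3.0 GiB of buffers alone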

@brada4
Contributor

brada4 commented Nov 10, 2020

What are numeric eurekas? If I understand correctly, you are saying that anyone with a cpu that supports avx512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?

Completely corrupt numeric results - basically all OpenBLAS tests corrupt, ssh drops out often, etc. For anyone running Ubuntu 16 on both sides of KVM and rigging the compiler to have AVX-512 supported: then yes, it stops working.

@jonringer

This may be unrelated to the other discussion items, but I was successful in finding the offending commit. I also verified that reverting the commit allowed me to build numpy and run the numpy tests on a 3990X.
Offending commit: 3094fc6

Testing steps:
Nix allows you to pin certain packages and substitute them in all downstream packages. Builds are hermetic, and usually byte-reproducible.
Started a git bisect from the 0.3.10 release up to the current develop branch: git bisect start HEAD 63b03ef

default.nix
let
  openblas_overlay = self: super: {
    # replace all references of openblas in nixpkgs with local version
    openblas = super.openblas.overrideAttrs (oldAttrs: {
      src = super.lib.cleanSource ./.;
    });
  };
in

# create new package set, referencing local version of openblas
import ../nixpkgs { overlays = [ openblas_overlay ]; }

bisect command:

git bisect run nix-build default.nix -A python3Packages.numpy --no-build-output --cores 128

gist of run: https://gist.github.com/jonringer/3d9351b8f1e153f5a7275975880b4319

This also seems to align with my previous comment #2970 (comment), in which the segfault would occur inside libgomp.

@martin-frbg
Collaborator

libgomp segfault not reproduced on Ryzen5-4600H with current develop (with and without interface64 and symbol suffixing) and gcc 7.5. Rerunning the builds with gcc 9.3 now but no segfault so far.

@mattip
Contributor Author

mattip commented Nov 15, 2020

@jonringer can you try out the NumPy wheels from https://anaconda.org/scipy-wheels-nightly/numpy ? We built the last round of wheels with BUFFERSIZE=20, so it would be nice to know if that will work as a stop-gap for the upcoming NumPy release (assuming the wheels installed via pip install numpy==1.19.3, with stock OpenBLAS 0.3.12, crash for you).

@martin-frbg
Collaborator

Thx @mattip - obviously my 4600 is a poor substitute for a Threadripper, but at least this does now appear to be a separate issue with OpenMP and forks after all. (Wonder how much memory that TR has; obviously a fork() would be a perfect place to run out of it.)

@martin-frbg
Collaborator

@mattip according to #2982 (comment) his problem appears to be unrelated to BUFFERSIZE after all. Curious whether the change fixes "your" numpy problems though

@jonringer

@mattip I don't actually use this library personally or professionally. My main concern is that I'm no longer able to test numpy without a segfault while curating packages on nixpkgs.

@martin-frbg
Collaborator

Hm... the only instance of "fork" in the numpy testsuite seems to be in a case that looks even more benign than the 1024x1024 dgemm used in the fork utest - but I may well be missing something here?

@mattip
Contributor Author

mattip commented Nov 16, 2020

Sorry for the noise. I was grasping at straws; it seems I chose the wrong one.

@Flamefire
Contributor

Hm... the only instance of "fork" in the numpy testsuite seems to be in a case that looks even more benign than the 1024x1024 dgemm used in the fork utest - but I may well be missing something here?

FWIW: Some Python stuff does a fork behind the scenes, such as querying the system name/type/arch, which spawns a process via fork while looking innocent at the usage site, like print(sys.name) (or so).
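
As an illustration of the pattern under suspicion (a hedged sketch, not taken from the numpy testsuite), the sequence looks roughly like this:

import os
import numpy as np

a = np.random.rand(1024, 1024)
_ = a @ a                    # warms up the OpenBLAS/OpenMP worker threads in the parent

pid = os.fork()              # POSIX only; the child inherits only the calling thread
if pid == 0:
    print(np.trace(a @ a))   # should still work in the child if fork handling is correct
    os._exit(0)
else:
    os.waitpid(pid, 0)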

@martin-frbg
Collaborator

If I read the comments on numpy/numpy#17759 correctly, reducing BUFFERSIZE apparently did not fix the numpy issue? The one other change to the startup code in the 0.3.9 ... 0.3.12 timeframe that stands out to me is #2466, but I do not see it causing this kind of problem. (Also, it would not feature in an OpenMP build.)

@mattip
Contributor Author

mattip commented Nov 28, 2020

I think we can close this, or at least treat it as a use case OpenBLAS will not support. It does not seem reasonable to me to open 36 threads on a machine with less than 1GB of memory available to support the threads. NumPy is now using BUFFERSIZE=20 in the PyPI wheels.
