maybe a problem with 0.3.12 and NumPy? #2970

Closed
mattip opened this issue Nov 2, 2020 · 57 comments

Comments

@mattip
Contributor

mattip commented Nov 2, 2020

Just a heads up that NumPy is seeing some problems with OpenBLAS 0.3.12.

We released NumPy 1.19.3 with OpenBLAS 0.3.12 in order to fix the Windows fmod problems. We had to back it out and are releasing a 1.19.4 with the previous OpenBLAS, plus code that errors out if the fmod bug is detected instead. We got reports that 1.19.3 crashes, xref numpy/numpy#17674 and numpy/numpy#17684. In the first issue, the reporter provided a docker image and code to reproduce the problem, along with the output of some system diagnostics. The second issue is a lack of memory in hstack, which is difficult to attribute to OpenBLAS, so there may be something else going on.
Both issues were "fixed" (or covered up) by the 1.19.4 release candidate with an older OpenBLAS.
Does any of this make sense to you?

Edit: no docker image, just diagnostics

@martin-frbg
Collaborator

No idea. If anything, thread memory requirements of 0.3.12 should be less than before, and I thought NumPy was satisfied with the state of OpenBLAS after application of the fmod workarounds. Are you bundling the actual 0.3.12 release or some snapshot from around that time? And the older OpenBLAS you went back to is what - 0.3.10 or even older?

@charris

charris commented Nov 2, 2020

@martin-frbg Our CI testing showed no problems and the wheels builds succeeded, so 0.3.12 looked good prior to release.

EDIT: I also expect a fair number of people tried 1.19.3 because of the Python 3.9 support. If the problem was universal there would probably have been more bug reports.

@martin-frbg
Collaborator

Well, the only change in 0.3.11/0.3.12 that I would expect to have any effect at library/thread initialization time is the reduction of the BLAS3_MEM_ALLOC_THRESHOLD variable, and its previous default is still available in Makefile.rule (or as a parameter to make). Reducing this actually made crashes in other people's code go away, however. If your "working" OpenBLAS is older than 0.3.10 then we are looking at a scary number of changes, and probably need a git bisect to get anywhere.

@h-vetinari
Contributor

h-vetinari commented Nov 2, 2020

Unfortunately, it's 0.3.9

@brada4
Contributor

brada4 commented Nov 2, 2020

The original issues got closed without a code change:
numpy/numpy#17674 (comment)
numpy/numpy#17684 (comment)

@charris

charris commented Nov 2, 2020

@brada4 We switched back to the earlier library for 1.19.4.

@martin-frbg
Collaborator

Hm, thanks. 0.3.10 (or at least the changes that looked important enough to label with the milestone) was mostly CMake build improvements and a few thread race fixes, nothing that I would expect to blow up as soon as the library gets loaded.

@brada4
Contributor

brada4 commented Nov 2, 2020

The build tag (a32f1dca) on the OpenBLAS .so file is not related to any tags here. Where does it come from, and how do we trace back to the OpenBLAS source code used to build the library?
The backtrace in those issues says the last OpenBLAS thread jumped to 0x0, which should not happen, and without a full backtrace it is hard to tell what made it do so.
Also, given the threads mentioned - would it be possible to run the same crashing docker with OPENBLAS_NUM_THREADS=1 added?
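
A minimal sketch of that single-threaded test from the Python side (assuming the crash can be triggered from a plain session inside the failing container; the matrix sizes are illustrative, not taken from the reports):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must be set before numpy loads the bundled OpenBLAS

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
print((a @ b).sum())  # runs dgemm; if the crash is threading-related, this should now survive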

@mattip
Contributor Author

mattip commented Nov 2, 2020

The descriptive string v0.3.7-527-g79fd006c is the result of git describe --tags --abbrev=8. The OpenBLAS commit is 79fd006c and is somewhere between 0.3.9 and 0.3.10.

@mattip
Contributor Author

mattip commented Nov 2, 2020

According to the comment in the issue there is information at https://drive.google.com/drive/folders/18mcYQu4GGPzwCRj14Pzy8LvfC9bcLid8?usp=sharing on the docker environment where the segfault occurs.

@brada4
Contributor

brada4 commented Nov 2, 2020

It would take ages to reconstruct the failing environment 1:1. The question is whether it is possible to clearly attribute the problem to "OpenBLAS threading" by confirming there is no problem whatsoever when threading is turned off via the environment variable.

The issue mentions /virtualenv/python3.6/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-a32f1dca.3.12.so - a GitHub search points it to this thread (and a full search to the same tag mentioned in the exact numpy binary build), but not to the code - could you help dig it back to an OpenBLAS tag?
I am not picky or anything; it would be easier for all if we knew the numerology behind the file tagging and could admit guilt right away ;-)

@mattip
Contributor Author

mattip commented Nov 2, 2020

@brada4 I am not sure what you are asking. A static build of OpenBLAS is downloaded and built, together with some other things, into the .so you see as part of the NumPy build process. The reason it has a hash is to uniquely bind that .so to the NumPy build; it does not directly reflect a version of OpenBLAS. The exact version of OpenBLAS used in the build of the wheel is the code tagged with the 1.19.3 tag, and is here - you can see the tag is 0.3.12. You can download the wheel and check the result of openblas_get_config(), as we do later in that file, to verify that the OpenBLAS version is the one we thought it was.
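
For illustration, a hedged sketch of one way to do that check from Python on a Linux install (the numpy.libs directory name and glob pattern are assumptions about the manylinux wheel layout, not something documented here):

import ctypes
import glob
import os
import numpy as np

# look next to the installed numpy package for the bundled OpenBLAS shared object
site_dir = os.path.dirname(os.path.dirname(np.__file__))
candidates = (glob.glob(os.path.join(site_dir, "numpy.libs", "libopenblas*.so*"))
              or glob.glob(os.path.join(os.path.dirname(np.__file__), ".libs", "libopenblas*.so*")))

lib = ctypes.CDLL(candidates[0])
lib.openblas_get_config.restype = ctypes.c_char_p
print(lib.openblas_get_config().decode())  # the string should mention OpenBLAS 0.3.12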

@brada4
Contributor

brada4 commented Nov 2, 2020

@mattip thanks, that fills the gap in my understanding.

@carlkl

carlkl commented Nov 2, 2020

One thing that was changed in 0.3.12 is x86_64: clobber all xmm registers after vzeroupper. See also Omit clobbers from vzeroupper until final [PR92190]. And the Microsoft x64 calling convention says that:
user-written assembly language routines must preserve XMM6 and XMM7.
Is this patch really necessary? And is it valid for the Windows 64 ABI?

@carlkl

carlkl commented Nov 2, 2020

Another bit can be found here: Software consequences of extending XMM to YMM (Agner Fog)

However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that register XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.

@mattip
Contributor Author

mattip commented Nov 2, 2020

Both of the users who reported this issue are using Docker on Linux, right?

@martin-frbg
Collaborator

Callee-saves is implemented in the assembly PROLOGUE/EPILOGUE for Windows, so this particular change should play no role here.

@bashtage

bashtage commented Nov 3, 2020

OpenBLAS 0.3.12 is making large memory requests on initialization that depend heavily on the number of threads. With 24 threads, I am seeing:

[screenshot of the memory usage omitted]

This scales pretty linearly with the number of threads when I set $env:OPENBLAS_NUM_THREADS.
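
A rough, Linux-flavoured sketch of how one might chart that scaling (psutil is an assumed extra dependency; the virtual size is where the per-thread buffers tend to show up right after import):

import os
import subprocess
import sys

probe = ("import psutil, numpy; m = psutil.Process().memory_info(); "
         "print(m.vms // 2**20, 'MiB virtual,', m.rss // 2**20, 'MiB resident')")

for n in ("1", "4", "24"):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=n)
    out = subprocess.run([sys.executable, "-c", probe], env=env,
                         capture_output=True, text=True).stdout.strip()
    print("OPENBLAS_NUM_THREADS=" + n + ":", out)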

@bashtage

bashtage commented Nov 3, 2020

When I roll back to 0.3.9, the memory ask is about 1/4 of that:

[screenshot of the memory usage omitted]

I checked 0.3.10 and it also shows high memory usage, so it seems like something between these two releases.

@martin-frbg
Collaborator

That then is probably the GEMM buffer, configurable as BUFFERSIZE at compile time (in Makefile.rule). The default was increased a few times recently to fix overflows at huge matrix sizes - there is certainly a trade-off, and possibly a fundamental design flaw in OpenBLAS, involved. 0.3.12 built with make BUFFERSIZE=20 should have about the same footprint as 0.3.9 (see #1698 for the original issue, #2538 for the motivation to actually change the defaults).

@mattip
Contributor Author

mattip commented Nov 4, 2020

If we shrink the BUFFERSIZE, would we run the risk of running out of room with 24 threads and large matrices? Should NumPy cap the number of threads with openblas_set_num_threads?
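
For reference, a sketch of capping the pool from the Python side with threadpoolctl (a separate package that calls openblas_set_num_threads on the bundled library; this is an assumption about how a user would do it, not something NumPy does automatically):

import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

print(threadpool_info())  # lists the loaded BLAS libraries and their current thread counts

with threadpool_limits(limits=4, user_api="blas"):  # at most 4 OpenBLAS threads in this block
    a = np.random.rand(2000, 2000)
    print(np.trace(a @ a))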

@martin-frbg
Collaborator

martin-frbg commented Nov 4, 2020

1. Yes, in the same way as 0.3.9 and earlier, but not worse.
2. Probably yes, though that seems to depend on the use case (method and context of loading numpy/openblas).

All this is assuming that BUFFERSIZE is actually the culprit - I have not gotten around to trying the dockerfile from the numpy ticket, and that ticket did not mention hardware constraints.

@jonringer

Bumping from 0.3.10 to 0.3.12 in nixpkgs has caused numpy to fail its tests now:
NixOS/nixpkgs#101780 (comment)

stacktrace:

(gdb) backtrace
#0  0x00007fffe33a327a in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#1  0x00007fffe33a1e09 in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#2  0x00007fffe88fc11f in exec_blas () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#3  0x00007fffe87c9d93 in gemm_driver () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#4  0x00007fffe87c9f67 in dgemm_thread_nn () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#5  0x00007fffe86dedc7 in cblas_dgemm () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#6  0x00007fffea2e87c1 in cblas_matrixproduct ()
   from /nix/store/1dynhpbdks2lzpjbkz1i5p0rkm5klfn4-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
lscpu info
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD Ryzen Threadripper 3990X 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3776.522
CPU max MHz:                     2900.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        5800.49
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr ss
                                 e sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_t
                                 sc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe po
                                 pcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalign
                                 sse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc
                                  mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2
                                 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves c
                                 qm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt
                                  lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthres
                                 hold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

@martin-frbg
Collaborator

martin-frbg commented Nov 6, 2020

@jonringer this looks like something else, not the BUFFERSIZE thing. I wonder why the backtrace shows the OpenMP library and a liblapack but nothing that calls itself OpenBLAS - do you have libopenblas aliased to liblapack, or could this be a mixup of library builds? (So far I have not seen mention of numpy itself failing tests with 0.3.12, cf. charris' comment above #2970 (comment).)
(Accidentally closed this due to touchpad malfunction...)

@mattip
Contributor Author

mattip commented Nov 6, 2020

I am a little hesitant about overriding the OpenBLAS default BUFFERSIZE constant in NumPy's build process. OpenBLAS changed it for good reasons, and it could cause incompatibilities. Is refactoring this on the roadmap?

@brada4
Contributor

brada4 commented Nov 9, 2020

"prod with error" is not expected to work. Ubuntu 16 kernel has no provision of AVX512 XSAVE.

@carlkl

carlkl commented Nov 9, 2020

@brada4, do you mean some individual SIMD instruction set extensions are disabled (using XCR0 register) by the virtual OS within Docker, and this is the reason for the segfaults?

@brada4
Contributor

brada4 commented Nov 9, 2020

AVX512 extensions are visible in the KVM guest, but using them leads to numeric eurekas, which likely signify register corruption.
An HWE kernel (4.15.xxxx-generic) on both sides functions properly.
Did not try 18LTS usermode over a 16LTS kernel though.
Proper tests with significance here would be trying OPENBLAS_NUM_THREADS=1 and OPENBLAS_CORETYPE=(haswell|sandybridge) and seeing whether the problem goes or stays.

@mattip
Contributor Author

mattip commented Nov 10, 2020

What are numeric eurekas? If I understand correctly, you are saying that anyone with a CPU that supports AVX512 extensions should be able to crash OpenBLAS by running the OpenBLAS/NumPy tests in a KVM guest?

@martin-frbg
Collaborator

martin-frbg commented Nov 10, 2020

Seems quite unlikely to me - there was code using AVX512 extensions in 0.3.9 already and the DYNAMIC_ARCH builds are supposed to do a runtime check for actual availability. I'd be much more interested in results from a build with the smaller 0.3.9 BUFFERSIZE, or a simple self-contained reproducer. If anything I would expect AVX512 mishandling to result in SIGILL or SIGFPE (or silent garbage results), not SIGSEGV

@mattip
Contributor Author

mattip commented Nov 10, 2020

Had understood the original issue ticket to have a docker image for reproducing the problem

Sorry, I corrected that in the description. FWIW, the information provided indicates the host machine has 128GB and 31 cores. I can't replicate that locally unfortunately. How much memory does the default BUFFER_SIZE allocate per thread?

@martin-frbg
Collaborator

128MB; with 0.3.9 it was only 32MB (written as a bitshift in common_x86_64.h, it is "32<<22" vs "32<<20", hence the suggestion to specify "20" in #2970 (comment)).
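
Spelling out that arithmetic (a quick sanity check, not from the thread):

# per-thread GEMM buffer implied by the bitshift above
print((32 << 22) // 2**20)      # 0.3.10+ default: 128 MiB per thread
print((32 << 20) // 2**20)      # 0.3.9 default:    32 MiB per thread
print(24 * (32 << 22) / 2**30)  # e.g. 24 threads -> 3.0 GiB of buffers alone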

@brada4
Contributor

brada4 commented Nov 10, 2020

What are numeric eurekas? If I understand correctly, you are saying that anyone with a cpu that supports avx512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?

Completely corrupt numeric results - basically all OpenBLAS tests corrupt, ssh drops out often, etc. For anyone running Ubuntu 16 on both sides of KVM and rigging the compiler to have AVX-512 supported: then yes, it stops working.

@jonringer

This may be unrelated to the other discussion items, but I was successful in finding the offending commit. I also verified that reverting the commit allowed me to build numpy and run the numpy tests on a 3990X.
Offending commit: 3094fc6

Testing steps:
Nix allows you to pin certain packages and substitute them in all downstream packages. Builds are hermetic, and usually byte-reproducible.
Started a git bisect from the 0.3.10 release up to the current develop branch: git bisect start HEAD 63b03ef

default.nix
let
  openblas_overlay = self: super: {
    # replace all references of openblas in nixpkgs with local version
    openblas = super.openblas.overrideAttrs (oldAttrs: {
      src = super.lib.cleanSource ./.;
    });
  };
in

# create new package set, referencing local version of openblas
import ../nixpkgs { overlays = [ openblas_overlay ]; }

bisect command:

git bisect run nix-build default.nix -A python3Packages.numpy --no-build-output --cores 128

gist of run: https://gist.github.com/jonringer/3d9351b8f1e153f5a7275975880b4319

This also seems to align with my previous comment #2970 (comment), in which the segfault would occur inside libgomp.

@martin-frbg
Collaborator

libgomp segfault not reproduced on Ryzen5-4600H with current develop (with and without interface64 and symbol suffixing) and gcc 7.5. Rerunning the builds with gcc 9.3 now but no segfault so far.

@mattip
Contributor Author

mattip commented Nov 15, 2020

@jonringer can you try out the NumPy wheels from https://anaconda.org/scipy-wheels-nightly/numpy ? We built the last round of wheels with BUFFERSIZE=20, so it would be nice to know if that will work as a stop-gap for the upcoming NumPy release (assuming the wheels installed via pip install numpy==1.19.3, with stock OpenBLAS 0.3.12, crash for you).

@martin-frbg
Collaborator

Thx @mattip - obviously my 4600 is a poor substitute for a Threadripper, but at least this does now appear to be a separate issue with OpenMP and forks after all. (Wonder how much memory that TR has; obviously a fork() would be a perfect place to run out of it.)

@martin-frbg
Collaborator

@mattip according to #2982 (comment) his problem appears to be unrelated to BUFFERSIZE after all. Curious whether the change fixes "your" numpy problems though

@jonringer

@mattip I don't actually use this library personally or professionally. My main concern is that I'm no longer able to test numpy without a segfault while curating packages on nixpkgs.

@martin-frbg
Collaborator

Hm... the only instance of "fork" in the numpy testsuite seems to be in a case that looks even more benign than the 1024x1024 dgemm used in the fork utest - but I may well be missing something here?

@mattip
Contributor Author

mattip commented Nov 16, 2020

Sorry for the noise. I was grasping at straws; it seems I chose the wrong one.

@Flamefire
Contributor

Hm... the only instance of "fork" in the numpy testsuite seems to be in a case that looks even more benign than the 1024x1024 dgemm used in the fork utest - but I may well be missing something here?

FWIW: Some Python stuff does a fork behind the scenes, such as querying the system name/type/arch, which spawns a process via fork while looking innocent at the usage site, like print(sys.name) (or so).
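
As an illustration of the pattern under suspicion (a hedged sketch, not taken from the numpy testsuite), the sequence looks roughly like this:

import os
import numpy as np

a = np.random.rand(1024, 1024)
_ = a @ a                    # warms up the OpenBLAS/OpenMP worker threads in the parent

pid = os.fork()              # POSIX only; the child inherits only the calling thread
if pid == 0:
    print(np.trace(a @ a))   # should still work in the child if fork handling is correct
    os._exit(0)
else:
    os.waitpid(pid, 0)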

@martin-frbg
Collaborator

If I read the comments on numpy/numpy#17759 correctly, reducing BUFFERSIZE apparently did not fix the numpy issue? The one other change to the startup code in the 0.3.9 ... 0.3.12 timeframe that stands out to me is #2466, but I do not see it causing this kind of problem. (Also, it would not feature in an OpenMP build.)

@mattip
Contributor Author

mattip commented Nov 28, 2020

I think we can close this, or at least treat it as a use case OpenBLAS will not support. It does not seem reasonable to me to open 36 threads on a machine with less than 1GB of memory available to support the threads. NumPy is now using BUFFERSIZE=20 in the PyPI wheels.
