gbmv! segfaults for large matrices #1698

Closed
dlfivefifty opened this issue Jul 25, 2018 · 28 comments

@dlfivefifty

I'm getting the following issue when calling gbmv! in Julia; it only occurs when Julia is built against OpenBLAS:

julia> n = 100_000_000; A = randn(3,n); x = randn(n); y = similar(x); BLAS.gbmv!('N', n, 1, 1, 1.0, A, x, 0.0, y);
ERROR: ReadOnlyMemoryError()
Stacktrace:
 [1] gbmv!(::Char, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Float64,2}, ::Array{Float64,1}, ::Float64, ::Array{Float64,1}) at /Users/solver/Projects/julia7/usr/share/julia/stdlib/v0.7/LinearAlgebra/src/blas.jl:652
 [2] top-level scope at none:0

julia> n = 100_000_000; A = randn(3,n); x = randn(n); y = similar(x); BLAS.gbmv!('N', n, 1, 1, 1.0, A, x, 0.0, y);
Segmentation fault: 11

When Julia is built against Apple's system BLAS, the above examples work fine.
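
(Aside, for readers less familiar with gbmv: the 3 x n array A above is standard BLAS band storage for a tridiagonal matrix with kl = ku = 1. A small C sketch of the indexing convention, added here purely for illustration rather than taken from the report:)

#include <stddef.h>

/* BLAS band storage (column-major): full-matrix element (i, j), 0-based,
 * with j - ku <= i <= j + kl, sits in row (ku + i - j) of column j,
 * i.e. at offset (ku + i - j) + j * lda, where lda >= kl + ku + 1.
 * For kl = ku = 1 that is exactly a 3 x n array: superdiagonal in row 0,
 * main diagonal in row 1, subdiagonal in row 2. */
static size_t band_index(size_t i, size_t j, size_t ku, size_t lda) {
    return (ku + i - j) + j * lda;
}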

@martin-frbg
Collaborator

Which version of OpenBLAS is this?

@dlfivefifty
Author

According to the openblas.version file in Julia v0.7-beta, it's

OPENBLAS_BRANCH=v0.3.0
OPENBLAS_SHA1=939452ea9dcb57abdcc3f1278c6db668a4690465

@martin-frbg
Collaborator

Possibly PR #1389 (original issue #1089) was not sufficient. How many cores does your hardware have?

@dlfivefifty
Author

4 cores, I think. It's a 2017 MacBook Pro.

@martin-frbg
Collaborator

Reproduced with a git clone of Julia. gdb claims the crash happens in dscal_kernel_8_zero called from dscal_k_HASWELL (so pretty much the same signature as #1089 indeed).

@brada4
Contributor

brada4 commented Jul 26, 2018

It seems like macOS now adds mprotect-ed guard pages around allocations to catch over- and underflows.

I suspect https://github.com/xianyi/OpenBLAS/blob/25f2d25cfecf6cf850aa5b989f279e5023b6234b/driver/level2/gbmv_thread.c#L98, where I removed a buffer: the padding-looking adjustment should perhaps have gone to x instead (?), and the code may have been wrong both before and after that change (?).

@martin-frbg
Collaborator

Unlikely, as that line is inside an #ifdef TRANSA block that does not apply here. I suspect this is a different but related failure mode, with range_n rather than range_m going out of bounds on the last thread of the set.

@brada4
Contributor

brada4 commented Jul 26, 2018

Is L246 writing past the end of the queue? Same rationale as 529bfc3 (though not reflecting 1:1 onto the files there).
@dlfivefifty can you quickly check whether adding 1 to the queue size in https://github.com/xianyi/OpenBLAS/blob/25f2d25cfecf6cf850aa5b989f279e5023b6234b/driver/level2/gbmv_thread.c#L179 solves the issue?

@martin-frbg
Collaborator

Nope, that does not help. (And interestingly it appears to be a random thread from the middle of the queue array that experiences the segfault. Valgrind is no use because of JuliaLang/julia#27131.)

@brada4
Contributor

brada4 commented Jul 27, 2018

The work divider behaves somewhat counter-intuitively; in this particular case the split is 25(0) 25(0)1 25(0) 24(9).
I will try to plant the same call in another scriptable BLAS consumer.

@brada4
Contributor

brada4 commented Jul 27, 2018

...It does not crash with a single thread where it crashes with multiple threads.

@martin-frbg
Collaborator

Do you have a reproducer that does not need Julia?

@brada4
Contributor

brada4 commented Jul 27, 2018

@martin-frbg It should be identical to SciPy's dgbmv, but that goes through cblas by preference. R exports it via R_ext.h, but no standard or popular package wraps it into a scriptable resource.
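
For what it's worth, at the BLAS level the failing call reduces to a plain dgbmv, so a bare C program along the following lines ought to exercise the same code path (a sketch only, untested here; it assumes an LP64 build exposing the usual Fortran-style dgbmv_ symbol, and it needs several GB of RAM):

#include <stdio.h>
#include <stdlib.h>

/* Fortran-style BLAS prototype (column-major banded storage). */
extern void dgbmv_(const char *trans, const int *m, const int *n,
                   const int *kl, const int *ku, const double *alpha,
                   const double *a, const int *lda, const double *x,
                   const int *incx, const double *beta, double *y,
                   const int *incy);

int main(void) {
    /* Same shape as the Julia example: tridiagonal band, m = n = 100_000_000. */
    const int n = 100000000, kl = 1, ku = 1, lda = 3, inc = 1;
    const double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)lda * n * sizeof(double)); /* band storage, 3 x n */
    double *x = malloc((size_t)n * sizeof(double));
    double *y = malloc((size_t)n * sizeof(double));
    if (!a || !x || !y) return 1;
    for (size_t i = 0; i < (size_t)lda * n; i++) a[i] = 1.0;
    for (size_t i = 0; i < (size_t)n; i++) x[i] = 1.0;
    dgbmv_("N", &n, &n, &kl, &ku, &alpha, a, &lda, x, &inc, &beta, y, &inc);
    printf("y[0] = %f\n", y[0]);
    free(a); free(x); free(y);
    return 0;
}

Compile with something like cc -O2 repro.c -o repro -lopenblas (the exact link line depends on the local install). Per the single-thread observation above, running it with OPENBLAS_NUM_THREADS=1 should presumably avoid the crash, while a default multithreaded run should reproduce it.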

@martin-frbg
Collaborator

Unfortunately, building with -fsanitize=address appears to make the bug go away. (It reappears when running Julia from gdb, but the address sanitizer does not provide any additional information in that context.)

@martin-frbg
Collaborator

Hmm. Perhaps BUFFER_SIZE as defined in common_x86_64.h is actually too small here?

@brada4
Contributor

brada4 commented Jul 30, 2018

It points to the threading threshold in interface/scal.c.
Once that is exceeded, there is a jump to an unmapped area from a place I could not find in the assembly listings.

   0x00007fffe528ecb9 <+201>:   mov    %rcx,%rdi
   0x00007fffe528ecbc <+204>:   lea    0x8(%rsp),%rsi
   0x00007fffe528ecc1 <+209>:   jp     0x7fffe528ed98 <dscal_k_PILEDRIVER+424>
   0x00007fffe528ecc7 <+215>:   jne    0x7fffe528ed98 <dscal_k_PILEDRIVER+424>
   0x00007fffe528eccd <+221>:   callq  0x7fffe528ead0 <- call to unmapped
   0x00007fffe528ecd2 <+226>:   movsd  0x8(%rsp),%xmm0 <- crash eip
   0x00007fffe528ecd8 <+232>:   movsd  (%rsp),%xmm1
   0x00007fffe528ecdd <+237>:   ucomisd %xmm1,%xmm0

I don't know if that is a thread-safety issue (in principle, threading at this level should mean a single-threaded dscal?).

@martin-frbg
Collaborator

Not sure I understand your find - when scal is called from gbmv_thread, it gets passed an invalid buffer pointer ("y") already. (All the setup code in gbmv_thread.c seems to be concerned with partitioning a buffer that is apparently assumed to always be big enough.) That is why I now suspect that setting BUFFER_SIZE to 64<<20 or 128<<20 (as seen in the ppc and ia64 headers respectively) might help. Unfortunately I will not be able to test this before next weekend.
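
Concretely, that experiment is a one-line change of roughly this shape (a sketch; the exact surrounding context in common_x86_64.h may differ between versions):

#define BUFFER_SIZE (128 << 20)   /* x86_64 default is 32 << 20; the ppc and ia64 headers use 64 << 20 and 128 << 20 */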

@martin-frbg
Collaborator

Increasing BUFFER_SIZE does not help. (Increasing the stack size limit does, same as in #1089)

@martin-frbg
Collaborator

martin-frbg commented Sep 1, 2018

Revisiting this with 0.3.3: for some weird reason I cannot reproduce this issue in a shell started from an X session, but it is still there when logged in without X (with the same ulimits set). Could this be another (non)alignment problem? (Actually it is something about the (size of the) environment - it works fine as a regular user or after a plain "su" from there, but crashes with a login shell of the root user. Ulimits appear to be the same in both settings, and there is no obvious difference in the paths or the environment variables, except that the regular user has more of both.)

Rebuilt Julia for the generic target to get around the valgrind issue, but I am still none the wiser.

@martin-frbg
Collaborator

Some of the valgrind traces look as if Julia's garbage collector moved the work array while the OpenBLAS threads were trying to access it, but I cannot really make sense of it yet. What seems clear, however, is that this condition already existed with 0.2.13 (and probably much earlier, but trying older versions gets tedious as they lacked the symbol suffixes for interface64 builds).

@martin-frbg
Collaborator

martin-frbg commented Sep 30, 2018

Revisiting this, it now looks to me as if this is a fundamental issue rather than a coding bug: each thread needs to allocate a memory range big enough to hold the entire matrix from within the "buffer" region that is defined by the BUFFER_SIZE constant in common_<platform>.h. It seems one would need to increase the default 32 << 20 in common_x86_64.h all the way to 1024UL << 20 to accommodate just 4 threads with the n=100_000_000 workload. (Perhaps there also needs to be an upper limit on parallelization in at least some of the interface routines, where some already have lower bounds now?)
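
For a sense of scale, a rough back-of-the-envelope check of these numbers (the exact per-thread requirement depends on how gbmv_thread.c partitions the buffer):

   one n-length double vector: 100_000_000 * 8 bytes ≈ 800 MB
   default BUFFER_SIZE:        32 << 20               =  32 MiB
   suggested BUFFER_SIZE:      1024UL << 20            =   1 GiB

So even a single n-length work vector is well over 20 times the default buffer, which is why nothing short of the 1 GiB setting helps for this workload.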

@martin-frbg martin-frbg added this to the 0.3.5 milestone Dec 2, 2018
@martin-frbg martin-frbg modified the milestones: 0.3.5, 0.3.6 Dec 30, 2018
@martin-frbg
Collaborator

Moving to the next milestone as this is still unclear to me; it could be similar in concept to bug #173.

@martin-frbg martin-frbg modified the milestones: 0.3.6, 0.3.7 Apr 27, 2019
@martin-frbg martin-frbg modified the milestones: 0.3.7, 0.3.8 Aug 11, 2019
@martin-frbg
Collaborator

Made BUFFER_SIZE configurable in the build system for now, to avoid having to hack common_x86_64.h.

@martin-frbg
Collaborator

Closing, as the original issue is fixed by increasing the default size and making it configurable. The requirement for such a big buffer is a fundamental design "feature" that will need revisiting at some point.

@dlfivefifty
Author

Segfaulting by design seems like a weird choice...

@martin-frbg
Collaborator

...a design choice from something like 20 years ago, when computer memory was scarce and it would have been the norm for users to configure the innards of the library for their particular workloads.

@dlfivefifty
Author

An error saying “increase the default size” seems much more sensible.

@martin-frbg
Collaborator

Assuming an appropriate single location for such a check can be found (one that does not cost performance during regular operation). Increasing the default to accommodate your case (and to match what the various internal parameters would have required in the first place) has already led to certain misgivings in the Python camp, so a more fundamental long-term resolution would certainly be desirable.
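
Just to illustrate the kind of guard being discussed (purely a sketch, not OpenBLAS code; the helper name and the per-thread size estimate are made up):

#include <stddef.h>

/* Hypothetical: cap the thread count so that the per-thread staging areas
 * still fit into the shared buffer region (e.g. BUFFER_SIZE). */
static int safe_thread_count(size_t bytes_per_thread, int requested_threads,
                             size_t buffer_size) {
    if (bytes_per_thread == 0) return requested_threads;
    size_t max_threads = buffer_size / bytes_per_thread;
    if (max_threads == 0) return 1;   /* fall back to the single-threaded path */
    return ((size_t)requested_threads <= max_threads) ? requested_threads
                                                      : (int)max_threads;
}

Where exactly such a computation could live without slowing down the common case is of course the open question.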
