Stack overflow on large lu decomposition with OpenBLAS #42591

Closed · chriselrod opened this issue Oct 11, 2021 · 23 comments · Fixed by #42708

Labels: domain:linear algebra (Linear algebra), kind:bug (Indicates an unexpected problem or unintended behavior)

Comments

@chriselrod (Contributor) commented Oct 11, 2021

julia> versioninfo()
Julia Version 1.8.0-DEV.690
Commit bbe93759d5 (2021-10-09 16:35 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_NUM_THREADS = 28

julia> using LinearAlgebra

julia> A = rand(10,10);

julia> lu(A);

julia> A = rand(100,100);

julia> lu(A);

julia> A = rand(1000,1000);

julia> lu(A);
ERROR: StackOverflowError:
Stacktrace:
 [1] getrf!(A::Matrix{Float64})
   @ LinearAlgebra.LAPACK ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lapack.jl:575
 [2] #lu!#155
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:84 [inlined]
 [3] lu(A::Matrix{Float64}, pivot::RowMaximum; check::Bool)
   @ LinearAlgebra ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:79
 [4] lu (repeats 2 times)
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:79 [inlined]
 [5] top-level scope
   @ REPL[13]:1

julia> using MKL

julia> lu(A);

julia> 

So this looks like an OpenBLAS bug rather than a Julia one.

It only shows up when using multiple BLAS threads:

julia> BLAS.get_num_threads()
28

julia> BLAS.set_num_threads(8)

julia> lu(A);
ERROR: StackOverflowError:
Stacktrace:
 [1] getrf!(A::Matrix{Float64})
   @ LinearAlgebra.LAPACK ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lapack.jl:575
 [2] #lu!#155
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:84 [inlined]
 [3] lu(A::Matrix{Float64}, pivot::RowMaximum; check::Bool)
   @ LinearAlgebra ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:79
 [4] lu (repeats 2 times)
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:79 [inlined]
 [5] top-level scope
   @ REPL[11]:1

julia> BLAS.set_num_threads(1)

julia> lu(A);

julia> BLAS.set_num_threads(2)

julia> lu(A);
ERROR: StackOverflowError:
Stacktrace:
 [1] getrf!(A::Matrix{Float64})
   @ LinearAlgebra.LAPACK ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lapack.jl:575
 [2] #lu!#155
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:84 [inlined]
 [3] lu(A::Matrix{Float64}, pivot::RowMaximum; check::Bool)
   @ LinearAlgebra ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:79
 [4] lu (repeats 2 times)
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/lu.jl:79 [inlined]
 [5] top-level scope
   @ REPL[15]:1

julia>
@chriselrod (Contributor, Author)

On another system, where I built OpenBLAS locally instead of using the BinaryBuilder binaries, the error does not occur:

julia> versioninfo()
Julia Version 1.8.0-DEV.660
Commit 153db7a7a8* (2021-10-05 16:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

julia> using LinearAlgebra

julia> A = rand(10,10);

julia> lu(A);

julia> A = rand(100,100);

julia> lu(A);

julia> A = rand(1000,1000);

julia> lu(A);

julia> using MKL

julia> lu(A);

julia> BLAS.get_num_threads()
18

@DilumAluthge (Member) commented Oct 11, 2021

Can you get an rr trace (on Linux x86_64)?

One way to do so is to download the rr_capture.jl script (https://github.com/JuliaLang/julia/blob/master/.buildkite/utilities/rr/rr_capture.jl), and then run your MWE as e.g. julia rr_capture.jl julia mwe.jl.

@chriselrod (Contributor, Author)

I sent the file to you on Slack, because GitHub doesn't allow .tar.zst attachments. I tried renaming it to .txt, but their filter seems smart enough to detect that.

@DilumAluthge (Member)

@chriselrod Could you bisect?

@DilumAluthge (Member)

@Keno @ianatol What would be the best way for @chriselrod to share his rr trace with you? DM it to you on Slack?

DilumAluthge added the kind:bug and domain:linear algebra labels on Oct 11, 2021
DilumAluthge modified the milestone: 1.8 on Oct 11, 2021
chriselrod reopened this on Oct 11, 2021
@chriselrod (Contributor, Author)

> git bisect good
77172e99d18bfe58fe2f38f27a52832a4fd72e66 is the first bad commit
commit 77172e99d18bfe58fe2f38f27a52832a4fd72e66
Author: Valentin Churavy <v.churavy@gmail.com>
Date:   Fri Oct 8 20:47:26 2021 -0400

    bump OpenBLAS_jll

 deps/checksums/openblas          | 184 +++++++++++++++++++--------------------
 stdlib/OpenBLAS_jll/Project.toml |   2 +-
 2 files changed, 93 insertions(+), 93 deletions(-)

@DilumAluthge (Member)

😭

@DilumAluthge (Member)

@Keno Is it possible that this bug was introduced by Reference-LAPACK/lapack#625?

@andreasnoack (Member)

Is it possible that this bug was introduced by Reference-LAPACK/lapack#625?

It can't be those changes. The routines there are related to eigenvalue computations, whereas the issue here is in the LU factorization. It's not obvious where a stack overflow in

ccall((@blasfunc($getrf), libblastrampoline), Cvoid,
      (Ref{BlasInt}, Ref{BlasInt}, Ptr{$elty},
       Ref{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}),
      m, n, A, lda, ipiv, info)
chkargsok(info[])
would come from.

@chriselrod (Contributor, Author) commented Oct 11, 2021

Presumably OpenBLAS is internally allocating a large amount of stack memory for something when running multithreaded getrf on large enough matrices.

@andreasnoack (Member)

Could it be related to the changes mentioned in https://discourse.julialang.org/t/blas-performance-testing-for-julia-1-8/69520 ?

@ViralBShah (Member) commented Oct 11, 2021

Yes, my guess is that we are allocating a lot of stack memory, and a maximum of 4096 threads may be too much. 4096 was a bit too aggressive, I thought, and perhaps we need to go down to 512 for now. This is the code in getrf that is blowing the stack:

https://github.com/xianyi/OpenBLAS/blob/8a87e80c742a146d3900d64046d4f4c7fd58b6b0/lapack/getrf/getrf_parallel.c#L400

I wonder if their OpenMP versions are better on this front.

https://github.com/xianyi/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel_omp.c

You will start seeing these errors in Julia source builds that don't use BinaryBuilder (BB) once #42584 is merged, which brings those builds in line with BB.

@chriselrod Can you provide a reasonably good number for max threads that we can get away with for now? Is it 1024, 512, or even smaller?
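For readers following along, here is a minimal sketch of the allocation pattern being pointed at above. It is not OpenBLAS's actual source; the type and macro names are simplified and the constants are assumptions, but it shows how a per-thread job array sized by the compile-time MAX_CPU_NUMBER makes the stack footprint grow quadratically with that cap:

/* Illustrative only -- not OpenBLAS source.  The real code is in
   lapack/getrf/getrf_parallel.c; names and constants here are assumptions. */
#include <stdio.h>

#define MAX_CPU_NUMBER 4096   /* compile-time thread cap used in the new build */
#define CACHE_LINE_SIZE 8     /* illustrative padding factor */
#define DIVIDE_RATE 2         /* illustrative blocking factor */

typedef struct {
  /* one padded progress slot per peer thread */
  volatile long working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
} job_t;

int main(void) {
  /* Without USE_ALLOC_HEAP, `job_t job[MAX_CPU_NUMBER];` is a stack local, so
     the footprint scales with MAX_CPU_NUMBER squared.  We only compute the
     size here -- actually declaring the array would itself overflow the stack. */
  printf("job[MAX_CPU_NUMBER] would need %zu MiB of stack\n",
         (sizeof(job_t) * MAX_CPU_NUMBER) >> 20);
  return 0;
}

With numbers like these, the array dwarfs a default 8 MiB thread stack, which would explain why the failure only shows up on the multithreaded code path.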

@staticfloat (Sponsor, Member)

I feel we should be able to force USE_ALLOC_HEAP, given that we build with a large maximum thread count, and then use dynamic sizes (e.g. args->nthreads instead of MAX_CPU_NUMBER) to avoid allocating such a large amount.
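A rough sketch of that direction (purely illustrative, not a patch; nthreads stands in for OpenBLAS's args->nthreads and the helper name is made up) would size the bookkeeping by the runtime thread count and put it on the heap:

#include <stdlib.h>

typedef struct {
  volatile long *working;   /* nthreads * slots_per_peer progress slots */
} dyn_job_t;

/* Allocate per-thread job bookkeeping on the heap, sized by the runtime
   thread count instead of the compile-time MAX_CPU_NUMBER. */
dyn_job_t *alloc_jobs(int nthreads, int slots_per_peer) {
  dyn_job_t *job = malloc((size_t)nthreads * sizeof *job);
  if (!job) return NULL;
  for (int i = 0; i < nthreads; i++) {
    job[i].working = calloc((size_t)nthreads * slots_per_peer, sizeof(long));
    if (!job[i].working) {                       /* unwind on failure */
      for (int k = 0; k < i; k++) free((void *)job[k].working);
      free(job);
      return NULL;
    }
  }
  return job;   /* O(nthreads^2) on the heap instead of O(MAX_CPU_NUMBER^2) on the stack */
}

The trade-off raised below about small problems would still apply: a heap allocation per factorization is measurable for tiny matrices, which is presumably why OpenBLAS gates the heap path behind a threshold.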

@ViralBShah (Member) commented Oct 11, 2021

I had not seen ALLOC_HEAP before in the OpenBLAS code. It seems like they already have the dynamic versions. Is it just a matter of always passing USE_ALLOC_HEAP=1 to the OpenBLAS build, and it will do the right thing? I wonder if we may lose performance on small problems.

JuliaLang deleted a comment from staticfloat on Oct 11, 2021
@ViralBShah (Member)

The GETRF_MEM_ALLOC_THRESHOLD is quite low. Given that we are going with 4096 max threads, shouldn't we already be in the ALLOC_HEAP part of the code? Or is MAX_CPU_NUMBER determined at runtime rather than at compile time?

https://github.com/xianyi/OpenBLAS/blob/8a87e80c742a146d3900d64046d4f4c7fd58b6b0/common.h#L394

@staticfloat (Sponsor, Member)

MAX_CPU_NUMBER is set at compile time; what I'm suggesting is that we instead use args->nthreads.

@staticfloat (Sponsor, Member)

I also don't really see why we can't use variable-length arrays (added in C99) and just size things by args->nthreads, even without ALLOC_HEAP.
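For concreteness, the VLA variant might look something like this (an illustrative sketch with made-up names, not a proposed patch; note that C11 made VLAs optional, and the buffer still lives on the stack, just sized by the runtime thread count rather than MAX_CPU_NUMBER):

#include <stdio.h>

/* Scratch space sized by the runtime thread count via a C99 VLA. */
void init_jobs_vla(int nthreads, int slots_per_peer) {
  volatile long working[nthreads][nthreads * slots_per_peer];   /* VLA */
  for (int i = 0; i < nthreads; i++)
    for (int j = 0; j < nthreads * slots_per_peer; j++)
      working[i][j] = 0;
  /* ... per-thread progress flags in working[i][...] would be used here ... */
}

int main(void) {
  init_jobs_vla(8, 16);   /* e.g. 8 threads -> 8*8*16*sizeof(long) = 8 KiB of stack */
  puts("ok");
  return 0;
}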

@ViralBShah (Member)

Wouldn't that require significant patching and testing of OpenBLAS? Not something I want to sign up for myself!

@chriselrod (Contributor, Author)

@chriselrod Can you provide a reasonably good number for max threads that we can get away with for now? Is it 1024, 512, or even smaller?

Are the largest core counts today 128?
The Ampere Altra Max tops out at 128 cores; its data sheet mentions a 2P configuration for 256 cores.

The biggest EPYCs are 64 cores, I believe in up to a 2-socket configuration, for 128 cores. (If a machine has 128 cores/256 threads, we'd prefer 128 BLAS threads.)

Zen 4 will be out in about a year. It is moving from 7 nm to 5 nm, so I'd expect it to bump up core counts again.
On the Intel side, next year's Sapphire Rapids is expected to go up to 56 cores, but a 4-socket configuration could of course still exceed 128.

Depending on how long you want this change to be good for, 256 will probably last a couple of years for x86.
You could go for 512 if you also want to cover another doubling on the ARM side of things. The Neoverse N2 still prioritizes a small core size, so presumably future chips on a smaller node will be able to fit a ton of them.

It'd also be interesting to actually run benchmarks on systems with these huge core counts to see how things scale as a function of matrix size.
Because parallelizing the reduction dimension isn't really practical and thus normally isn't done, we're left with M and N (where M refers to the leading axis of A and C, and N to the trailing axis of B and C, in C = A*B).
With 512 cores, even if M = 10_000, we'd also want to parallelize along the N dimension, which reduces the potential for L3 cache reuse.
It would probably take some work to get good scaling at these sizes.
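As a back-of-the-envelope illustration of that point (hypothetical numbers, not a benchmark): with 512 threads and M = 10_000, splitting only M leaves each thread far fewer rows than a typical GEMM row-block height, so the thread grid has to split N as well, shrinking the slice of B each group can keep hot in shared cache:

#include <stdio.h>

int main(void) {
  int nthreads = 512, M = 10000, N = 10000;
  int mc = 192;                                   /* assumed macro-tile (row-block) height */

  printf("rows per thread if only M is split: %d\n", M / nthreads);   /* ~19 */

  /* Pick a 2-D thread grid pm x pn (pm * pn == nthreads) that keeps each
     thread's row block at least mc tall. */
  int pm = M / mc;
  if (pm > nthreads) pm = nthreads;
  while (pm > 1 && nthreads % pm) pm--;           /* round down to a divisor of nthreads */
  int pn = nthreads / pm;
  printf("thread grid %d x %d -> %d x %d block of C per thread\n",
         pm, pn, M / pm, N / pn);
  return 0;
}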

Note that OpenBLAS doesn't look like it's doing well on non-monolithic memory topologies; e.g. the monolithic Intel 10980XE seems over 4x faster than the AMD 5950X at matmul, instead of the expected roughly 2x.
I did double-check that OpenBLAS 0.3.17 has detection for Zen 3.

So it may take some tuning to get a large percentage of peak flops.

@ViralBShah (Member)

@chriselrod Is 512 good enough to avoid stack overflows?

@vtjnash (Sponsor, Member) commented Oct 12, 2021

Because VLAs were made optional in C11 (compilers are no longer required to support them), but there's always been alloca.
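An alloca version would have essentially the same shape (again just a sketch with made-up names; alloca is a compiler/libc extension rather than standard C, and like a VLA it keeps the buffer on the stack, only sized by the runtime thread count):

#include <alloca.h>
#include <string.h>

void init_jobs_alloca(int nthreads, int slots_per_peer) {
  size_t bytes = (size_t)nthreads * nthreads * slots_per_peer * sizeof(long);
  volatile long *working = alloca(bytes);   /* released automatically on return */
  memset((void *)working, 0, bytes);
  /* index as working[i * nthreads * slots_per_peer + j] */
}

Either way the stack cost drops from O(MAX_CPU_NUMBER^2) to O(nthreads^2), which only helps when the actual thread count stays modest; the heap path avoids the stack limit entirely.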

@ViralBShah (Member)

I'd say the non-monolithic cases are handled by vendor BLAS libraries through our LBT (libblastrampoline) mechanism. We just want to pick reasonable out-of-the-box defaults for OpenBLAS for now. IMO, any real effort we put in should go into our native Julia versions.

@ViralBShah (Member)

@KristofferC This is what the latest OpenBLAS binaries fix.
