[linear algebra] Patch Openblas v0.3.10 to fix EXCEPTION_ACCESS_VIOLATION #42397

Merged

inkydragon
Member

@inkydragon inkydragon commented Sep 27, 2021

Issue Description

OpenBLAS (more precisely, its CBLAS interface) crashes with EXCEPTION_ACCESS_VIOLATION when computing the dot product (dotc/dotu) of two large complex vectors.
Testing shows that the issue occurs in OpenBLAS v0.3.10 and is fixed in v0.3.12 and later.

Conditions for problem reproduction

  • Windows system. Cannot reproduce on macOS, Linux, or even WSL.
  • Intel CPU. Cannot reproduce on AMD CPU.
  • Julia 1.6.x. Julia 1.7.0 uses OpenBLAS v0.3.13, in which the problem is already fixed.
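
To check quickly whether a given session matches the conditions listed above, something like the following sketch may help. The helper name `reported_conditions` is hypothetical (it is not part of this PR), and detecting the CPU vendor by searching the model string for "Intel" is an assumption that may not hold on every machine.

```julia
# Hypothetical helper (not part of the PR): summarizes whether this session
# matches the three conditions reported above.
function reported_conditions()
    cpus = Sys.cpu_info()
    # The model string is vendor-provided text, e.g. "Intel(R) Core(TM) ...".
    # Assumption: Intel CPUs include the word "Intel" in that string.
    cpu_model = isempty(cpus) ? "" : cpus[1].model
    return (
        windows  = Sys.iswindows(),
        intel    = occursin("Intel", cpu_model),
        julia_16 = VERSION.major == 1 && VERSION.minor == 6,
    )
end

conds = reported_conditions()
println(conds)
println("All reported conditions met: ", all(values(conds)))
```

If all three fields are true, the session is a candidate for reproducing the crash; it does not by itself prove that the bundled OpenBLAS is the affected v0.3.10.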

Test code

  1. Use LinearAlgebra.BLAS

Note: To test other DLLs, replace Julia-1.6.x/bin/libopenblas64_.dll with the DLL you want to test.

arrSzie = 10000  # No error
arrSzie = 10001  # Has error
arrType = ComplexF64

using LinearAlgebra.BLAS: dotc, dotu

x = ones(arrType, arrSzie);
y = ones(arrType, arrSzie);

dotc(x, y) # crash: EXCEPTION_ACCESS_VIOLATION
dotu(x, y) # crash

dotc(x, y) == dotu(x, y) == 10001.0 + 0.0im
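
For comparison, the expected value can be computed without calling into OpenBLAS at all. The following is a minimal sketch; `ref_dotc` and `ref_dotu` are hypothetical names introduced here for illustration, not functions from LinearAlgebra.

```julia
# Pure-Julia reference dot products; these never call into OpenBLAS.
ref_dotc(x, y) = sum(conj(a) * b for (a, b) in zip(x, y))  # conjugates x
ref_dotu(x, y) = sum(a * b for (a, b) in zip(x, y))        # no conjugation

n = 10001
x = ones(ComplexF64, n)
y = ones(ComplexF64, n)

# Both references agree for all-ones input, as the BLAS versions should.
ref_dotc(x, y) == ref_dotu(x, y) == n + 0.0im
```

Because these run entirely in Julia, they should work on an affected system too, which makes them usable as a baseline when testing a replacement DLL.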
  2. Call the DLL directly via ccall
arrSzie = 10001
arrType = ComplexF64
# OpenBLAS inside julia uses the `:libopenblas64_` symbol, 
#   so please **RENAME** the DLL you want to test to `libopenblas.dll`.
blasPath = raw"V:\libopenblas.dll"
blasSym = :libopenblas

using LinearAlgebra.BLAS: BlasInt
using Libdl: dlopen

dlopen(blasPath)
# Print BLAS version
strip(unsafe_string(ccall((:openblas_get_config64_, blasSym), Ptr{UInt8}, () )))

result = Ref{ComplexF64}();
x = ones(arrType, arrSzie);
y = ones(arrType, arrSzie);
ccall((:cblas_zdotc_sub64_, blasSym), Cvoid,
    (BlasInt, Ptr{ComplexF64}, BlasInt, Ptr{ComplexF64}, BlasInt, Ptr{ComplexF64}),
    length(x), x, stride(x, 1), y, stride(y, 1), result)
# crash
result[]

result[] == 10001.0 + 0.0im
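
Before running the ccall above against an arbitrary DLL, it may be worth checking that the library loads and actually exports the 64-bit-suffixed symbol. The following is a sketch only; `has_symbol` is a hypothetical helper, and the path in the usage note is the placeholder path from the snippet above.

```julia
using Libdl

# Hypothetical helper: returns true only if `path` can be loaded and
# exports `sym`. Neither failure mode throws.
function has_symbol(path::AbstractString, sym::Symbol)
    handle = Libdl.dlopen(path; throw_error=false)
    handle === nothing && return false
    try
        return Libdl.dlsym(handle, sym; throw_error=false) !== nothing
    finally
        Libdl.dlclose(handle)
    end
end
```

For example, `has_symbol(raw"V:\libopenblas.dll", :cblas_zdotc_sub64_)` would indicate whether the renamed DLL was built with the `64_` symbol suffix that the ccall expects.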

A build with the patch applied is available from the Azure Pipelines run in JuliaPackaging/Yggdrasil#3655.

e.g.: OpenBLAS.v0.3.10.x86_64-w64-mingw32-libgfortran5

Patch code

Using git bisect (#40963 (comment)), I found that commit b205323 appears to fix this issue.

However, I'm not sure whether this fully fixes the issue; more testing may be needed.


Fixes #40963.

Note: Buildbot uses the binaries produced by BinaryBuilder.jl, so the bug is not fixed in the CI output artifacts.
To test the fix, download the patched BinaryBuilder.jl build from JuliaPackaging/Yggdrasil#3655 and use it to replace the libopenblas64_.dll that ships with Julia.

We may need to merge JuliaPackaging/Yggdrasil#3655 first to get the repaired OpenBLAS_jll.jl, and then update the OpenBLAS version in Julia's dependencies.

@inkydragon inkydragon changed the title Patch Openblas v0.3.10 to fix EXCEPTION_ACCESS_VIOLATION [linear algebra] Patch Openblas v0.3.10 to fix EXCEPTION_ACCESS_VIOLATION Sep 29, 2021
@ViralBShah
Member

This looks good. @KristofferC Can we merge?

@DilumAluthge
Member

We should probably merge JuliaPackaging/Yggdrasil#3655 first, right? cc: @staticfloat

@ViralBShah
Member

ViralBShah commented Oct 1, 2021

It doesn't matter. This is just fixing the Julia source build. Once the Yggdrasil PR is merged, we can bump deps/Versions.make (and, I am sure, two other places that I always forget), at which point the fix will start appearing in the binaries.

@vchuravy
Member

vchuravy commented Oct 1, 2021

Once JuliaRegistries/General#45875 lands we should update versions and refresh checksums

@ViralBShah ViralBShah added the linear algebra Linear algebra label Oct 1, 2021
@inkydragon
Member Author

After updating the version number manually, I updated the checksums with make -f contrib/refresh_checksums.mk openblas.

Is there anything else I need to do?

OpenBLAS.v0.3.10+8.x86_64-w64-mingw32-libgfortran4.tar.gz/md5/d616bef77c9d1b0f199ebcb8a5ccf908
OpenBLAS.v0.3.10+8.x86_64-w64-mingw32-libgfortran4.tar.gz/sha512/b12a9310be4a934b86b073ed03cea18e30f01039399162409b1dff1e097d75430cbaa1d7d5cf0972d32907137c54d455e266aeb854ad86f7bbb309b01c0d9dd1
OpenBLAS.v0.3.10+8.x86_64-w64-mingw32-libgfortran5.tar.gz/md5/733f0803a1c835a94beeec70ef713512
OpenBLAS.v0.3.10+8.x86_64-w64-mingw32-libgfortran5.tar.gz/sha512/97fcbfd8af8ad281203529e2d31bf5a4892831ebf2805d09b373ce3bd5fb108f33436ea42b08287d283b7b45e4be0f5433409a9bc5d8252941f30cd03851a839
openblas-63b03efc2af332c88b86d4fd8079d00f4b439adf.tar.gz/md5/3d692acc6927454f620a4c493bdb159d
openblas-63b03efc2af332c88b86d4fd8079d00f4b439adf.tar.gz/sha512/cf89f6db1b6366833d29a1dc718ea0b8f61d162f70695c33fc94afbaba232605630a7a7cc3d3b9bed7493ec85402b65180ca99c3101de7141d6f2919318f55c1
Member Author

The two checksum lines above were not updated. Is this the expected behavior of the contrib/refresh_checksums.mk script?

Member

Yes, those are the sources.
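
The checksum entries in this thread appear to follow a name/algorithm/digest layout. As an illustration, a downloaded tarball could be checked against a sha512 entry roughly as follows. This is a sketch under that layout assumption; `parse_checksum_line`, `sha512_hex`, and `verify_sha512` are hypothetical helpers and not part of contrib/refresh_checksums.mk. Only sha512 is covered because MD5 is not available in Julia's standard library.

```julia
using SHA

# Split a "name/algorithm/digest" entry into its three parts.
function parse_checksum_line(line::AbstractString)
    parts = split(strip(line), '/')
    length(parts) == 3 || error("expected name/algorithm/digest, got: $line")
    return (name = String(parts[1]), algo = String(parts[2]), digest = String(parts[3]))
end

# Hex-encoded SHA-512 digest of a file's contents.
sha512_hex(path::AbstractString) = open(io -> bytes2hex(sha512(io)), path)

# Compare a file on disk against a sha512 checksum entry.
function verify_sha512(path::AbstractString, line::AbstractString)
    entry = parse_checksum_line(line)
    entry.algo == "sha512" || error("not a sha512 entry: $(entry.algo)")
    return sha512_hex(path) == lowercase(entry.digest)
end
```

Calling `verify_sha512` with the path of a downloaded tarball and the matching sha512 entry returns `true` when the digests agree.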

@inkydragon
Member Author

I downloaded and tested buildbot/package_win64 and confirmed that the problem has been fixed.

julia> versioninfo()
Julia Version 1.6.3-pre.85
Commit 10de22d171 (2021-10-01 03:32 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

julia> arrSzie = 10000;
julia> arrSzie = 10001;  # Has error
julia> arrType = ComplexF64;

julia> using LinearAlgebra.BLAS: dotc, dotu
julia> x = ones(arrType, arrSzie);
julia> y = ones(arrType, arrSzie);

julia> dotc(x, y) # crash: EXCEPTION_ACCESS_VIOLATION
10001.0 + 0.0im

julia> dotu(x, y) # crash
10001.0 + 0.0im

julia> dotc(x, y) == dotu(x, y) == 10001.0 + 0.0im
true
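
The manual check above could also be expressed as a regression test. The following is only a sketch of what such a test might look like; it is not the test added to test/LinearAlgebra by this PR, whose exact contents are not shown in this thread. Note that on an affected build the process would crash rather than report a test failure.

```julia
using Test
using LinearAlgebra.BLAS: dotc, dotu

@testset "complex dot on vectors around length 10000 (issue #40963)" begin
    # 10000 was reported to work and 10001 to crash on affected builds.
    for n in (10000, 10001)
        x = ones(ComplexF64, n)
        y = ones(ComplexF64, n)
        @test dotc(x, y) == n
        @test dotu(x, y) == n
    end
end
```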

@inkydragon
Member Author

The SparseArrays/higherorderfns, SparseArrays/sparsevector, SparseArrays/sparse, Dates/arithmetic, and cmdlineargs tests failed.
The logs point to a worker communication problem, which does not seem to be related to this PR.

Failed test log
SparseArrays/higherorderfns        (9) |         failed at 2021-10-01T01:16:10.649
ProcessExitedException(9)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:710
  [2] wait()
    @ Base ./task.jl:770
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:106
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:389
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:383
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:599
  [7] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:390
  [8] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421
  [9] macro expansion
    @ /usr/home/julia/buildbot/w2_tester/tester_freebsd64/build/share/julia/test/runtests.jl:217 [inlined]
 [10] (::var"#29#39"{Vector{Task}, var"#print_testworker_errored#35"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#33"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
    @ Main ./task.jl:411
SparseArrays/sparsevector          (8) |         failed at 2021-10-01T01:16:10.935
ProcessExitedException(8)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:710
  [2] wait()
    @ Base ./task.jl:770
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:106
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:389
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:383
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:599
  [7] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:390
  [8] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421
  [9] macro expansion
    @ /usr/home/julia/buildbot/w2_tester/tester_freebsd64/build/share/julia/test/runtests.jl:217 [inlined]
 [10] (::var"#29#39"{Vector{Task}, var"#print_testworker_errored#35"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#33"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
    @ Main ./task.jl:411
Worker 9 terminated.
Worker 5 terminated.
SparseArrays/sparse                (5) |         failed at 2021-10-01T01:16:11.004
ProcessExitedException(5)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:710
  [2] wait()
    @ Base ./task.jl:770
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:106
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:389
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:383
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:599
  [7] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:390
  [8] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421
  [9] macro expansion
    @ /usr/home/julia/buildbot/w2_tester/tester_freebsd64/build/share/julia/test/runtests.jl:217 [inlined]
 [10] (::var"#29#39"{Vector{Task}, var"#print_testworker_errored#35"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#33"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
    @ Main ./task.jl:411
Worker 2 terminated.
Dates/arithmetic                   (2) |         failed at 2021-10-01T01:16:11.067
ProcessExitedException(2)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:710
  [2] wait()
    @ Base ./task.jl:770
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:106
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:389
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:383
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:599
  [7] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:390
  [8] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421
  [9] macro expansion
    @ /usr/home/julia/buildbot/w2_tester/tester_freebsd64/build/share/julia/test/runtests.jl:217 [inlined]
 [10] (::var"#29#39"{Vector{Task}, var"#print_testworker_errored#35"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#33"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
    @ Main ./task.jl:411
Dates/conversions                 (10) |        started at 2021-10-01T01:16:24.837
Dates/conversions                 (10) |     4.47 |   0.04 |  0.9 |      94.32 |   324.80
LibGit2/online                    (10) |        started at 2021-10-01T01:16:30.385
download                          (11) |        started at 2021-10-01T01:16:33.966
cmdlineargs                        (3) |         failed at 2021-10-01T01:16:37.147
Test Failed at /usr/home/julia/buildbot/w2_tester/tester_freebsd64/build/share/julia/test/cmdlineargs.jl:651
  Expression: occursin(r"\.jl:(\d+)", bt)
   Evaluated: occursin(r"\.jl:(\d+)", "")
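
To pull the list of failing tests out of a long log like the one above, a small parser can help. This is a sketch that assumes the `name (worker) | failed at timestamp` layout seen in this log; `FAILED_LINE` and `failed_tests` are hypothetical names.

```julia
# Matches lines such as:
#   SparseArrays/sparse                (5) |         failed at 2021-10-01T01:16:11.004
const FAILED_LINE = r"^(\S+)\s+\((\d+)\)\s+\|\s+failed at (\S+)"

# Collect (test, worker, time) for every "failed at" line in the log text.
function failed_tests(log::AbstractString)
    failures = NamedTuple{(:test, :worker, :time), Tuple{String, Int, String}}[]
    for line in eachline(IOBuffer(log))
        m = match(FAILED_LINE, line)
        m === nothing && continue
        push!(failures, (test = String(m[1]), worker = parse(Int, m[2]), time = String(m[3])))
    end
    return failures
end
```

Lines reporting `started at` or timing columns do not match the pattern and are skipped.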

@vchuravy vchuravy merged commit ec4c102 into JuliaLang:backports-release-1.6 Oct 2, 2021
@vchuravy
Member

vchuravy commented Oct 2, 2021

Thanks!

@vchuravy
Member

vchuravy commented Oct 2, 2021

(Missed that this was against 1.6)

KristofferC pushed a commit that referenced this pull request Nov 11, 2021
…TION (#42397)

* [OpenBLAS] cherry pick one patch to fix `zdot` crash

xref issue: #40963
cherry pick: OpenMathLib/OpenBLAS@b205323

* [test/LinearAlgebra] test Issue #40963

* [OpenBLAS] Update version

* [OpenBLAS] Update checksums
@inkydragon inkydragon deleted the openblas-v0.3.10-patch branch January 7, 2022 02:55
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
…TION (#42397)

* [OpenBLAS] cherry pick one patch to fix `zdot` crash

xref issue: #40963
cherry pick: OpenMathLib/OpenBLAS@b205323

* [test/LinearAlgebra] test Issue #40963

* [OpenBLAS] Update version

* [OpenBLAS] Update checksums