Inaccurate result for neoversev1 cblas_sdot kernel only when using single thread #5466

@giordano

This issue was reported on an NVIDIA Grace CPU. The reproducer is in Julia, so it is not entirely minimal, but it should give you an idea of how to reproduce the problem with a direct call to the library:

$ OPENBLAS_VERBOSE=2 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random

julia> Random.seed!(42);

julia> A = randn(Float32, 10_000_000);

julia> B = copy(A);

julia> BLAS.set_num_threads(2)
Core: neoversev1

julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991972f6

julia> BLAS.set_num_threads(1)

julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6

julia> two_threads ≈ one_thread
false

julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6

julia> expected_result ≈ one_thread
false

julia> expected_result ≈ two_threads
true
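
To make the three comparisons in the neoversev1 session easier to read, here is a minimal sketch of the arithmetic behind them. It assumes the comparisons use Julia's `≈` (`isapprox`) with its default tolerance, which for `Float32` is a relative tolerance of `sqrt(eps(Float32))` and no absolute tolerance. The values are copied from the session; nothing here calls OpenBLAS.

```julia
# Values copied from the neoversev1 session above.
two_threads     = 9.991972f6
one_thread      = 9.987286f6
expected_result = 9.99456f6

# With no explicit tolerances, isapprox(x, y) for Float32 checks
#   abs(x - y) <= sqrt(eps(Float32)) * max(abs(x), abs(y))
rtol = sqrt(eps(Float32))

for (name, value) in (("two_threads", two_threads), ("one_thread", one_thread))
    # Relative difference against the pure-Julia result.
    reldiff = abs(expected_result - value) / max(abs(expected_result), abs(value))
    println(name, ": relative difference = ", reldiff,
            ", rtol = ", rtol,
            ", ≈ expected_result: ", expected_result ≈ value)
end
```

The single-threaded result misses the pure-Julia value by roughly twice the default tolerance, while the two-threaded result stays inside it.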

The call to BLAS.dot reduces to a call to cblas_sdot, so that should be the offending OpenBLAS kernel. This is with OpenBLAS v0.3.29; I haven't tried v0.3.30. Three conditions are needed to reproduce the issue: the vector needs to have 10 million elements (it doesn't reproduce with 1 million or fewer); the two vectors need to have the same elements (but need not be the same vector in memory, which is why I used B = copy(A)); and the average value must be subtracted from each vector, otherwise it doesn't reproduce, which means the arrays contain many small numbers.
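
The conditions above can be collected into a small script. This is only a sketch: `check_sdot` is a hypothetical helper name, the Float64 reference sum is my own choice of baseline (the session uses a Float32 `sum` instead), and the strided method `BLAS.dot(n, X, incx, Y, incy)` is used because its arguments mirror those of cblas_sdot. Whether the one-thread result is actually off will depend on the OpenBLAS version and the kernel selected at runtime.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: compute the Float32 BLAS dot product with a given
# number of BLAS threads and compare it against a Float64 reference that
# does not go through BLAS.
function check_sdot(x::Vector{Float32}, y::Vector{Float32}; threads::Integer)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length"))
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(threads)
    try
        # Strided form: n, X, incx, Y, incy (same argument order as cblas_sdot).
        got = BLAS.dot(length(x), x, 1, y, 1)
        # Reference accumulated in Float64 in plain Julia.
        ref = sum(Float64(a) * Float64(b) for (a, b) in zip(x, y))
        return (; threads, got, ref, relerr = abs(got - ref) / abs(ref))
    finally
        # Restore the caller's thread setting.
        BLAS.set_num_threads(old)
    end
end

Random.seed!(42)
A = randn(Float32, 10_000_000)
B = copy(A)                      # same elements, different memory
x = A .- mean(A)                 # subtracting the mean is needed to reproduce
y = B .- mean(B)

for t in (1, 2)
    println(check_sdot(x, y; threads = t))
end
```

Running the script once per core type (for example with OPENBLAS_CORETYPE set as in the sessions in this report) gives a relative error per kernel and thread count that can be compared directly.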

This is specific to the neoversev1 kernel: armv8 and neoversen1 don't seem to have this accuracy issue. No one else has reported it before, even though this code comes from Julia's test suite, so many people have been running it for a long time:

$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=armv8 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random

julia> Random.seed!(42);

julia> A = randn(Float32, 10_000_000);

julia> B = copy(A);

julia> BLAS.set_num_threads(2)
Core: armv8

julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991984f6

julia> BLAS.set_num_threads(1)

julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.991984f6

julia> two_threads ≈ one_thread
true

julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6

julia> expected_result ≈ one_thread
true

julia> expected_result ≈ two_threads
true

julia>
$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=neoversen1 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random

julia> Random.seed!(42);

julia> A = randn(Float32, 10_000_000);

julia> B = copy(A);

julia> BLAS.set_num_threads(2)
Core: neoversen1

julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.994247f6

julia> BLAS.set_num_threads(1)

julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.993646f6

julia> two_threads ≈ one_thread
true

julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6

julia> expected_result ≈ one_thread
true

julia> expected_result ≈ two_threads
true

Originally reported at JuliaLang/julia#59664.
