Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic behavior calculating opnorm(A) #27960

Closed
KlausC opened this issue Jul 6, 2018 · 17 comments
Closed

Non-deterministic behavior calculating opnorm(A) #27960

KlausC opened this issue Jul 6, 2018 · 17 comments
Labels
bug Indicates an unexpected problem or unintended behavior linear algebra Linear algebra regression Regression in behavior compared to a previous version upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@KlausC
Copy link
Contributor

KlausC commented Jul 6, 2018

The following experiment shows, that the maximal singular value calculated for a dense matrix is not calculated reliably in some cases. There is a relative standard deviation of 3e-4 in this example. Same as for svd(AE).S[1] is observed for svdvals(AE)[1] and opnorm(AE).
The behaviour seems to be correct for smaller matrices (e.g. (500,500) worked as expected), while it was reproducible for larger matrices.

As far as I saw, the calculations are done in the external openblas dependency.

julia> versioninfo()
Julia Version 0.7.0-beta.154
Commit d5531c3995 (2018-07-05 13:54 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)

julia> using Statistics

julia> A = sprand(500, 999, 0.1);

julia> AE = Matrix(A);

julia> x = [svd(AE).S[1] for i in 1:20];

julia> std(x) / mean(x)
0.00029392541804423026

julia> x
20-element Array{Float64,1}:
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.622748322695166
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.6689164955749  
 35.65800675664228 
 35.66898738453416 
 35.6689164955749  
 35.6689164955749  

julia> 
@antoine-levitt
Copy link
Contributor

antoine-levitt commented Jul 6, 2018

Can't reproduce on

Julia Version 0.7.0-beta.0
Commit f41b1ecaec (2018-06-24 01:32 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

I always get the same results. Possibly a different RNG seed, or skylake vs ivybridge?

The results should be deterministic: this is a well-conditioned operation. Is the bug present on A' instead of A? Are the other singular values correct? Maybe fill an issue with OpenBLAS?

@StefanKarpinski
Copy link
Sponsor Member

Could threading be the cause of the non-determinism?

@KlausC
Copy link
Contributor Author

KlausC commented Jul 7, 2018

Answering the questions:

  1. there is no difference, if A is replaced by A'
  2. not only the largest singular value, but all are affected
  3. Actually I don't know how to disable threading, so I can't answer @StefanKarpinski 's question. Starting julia -p 1 did not make a difference.

Additional observations:

  1. the bug does not appear on v0.6.3
  2. the bug is not related to uninitialized work area for the Fortran function dgesdd
  3. the bug also can be reproduced with smaller matrix - see code below.
  4. with this example, the result is sometimes correct. If there is a background job running in parallel,
    there are much less correct results (1.9% vs. 85.0% fail in 1000 runs). That would support Stefan's idea.
julia> VERSION
v"0.7.0-beta.154"
# (This is a rank-1 matrix; all except the first SV should be zero)
julia> function fool(n,m)
row = ones(n); col = [0.99999^(i-11) for i = 1:m]; A = col * row';
U, S, V = svd(A'); S[1:4]
end
fool (generic function with 1 method)

julia> fool(129, 290)
gesdd!: zeroed workspace
4-element Array{Float64,1}:
 193.14741234711033      
   1.9484878017769987    
   0.8204758008731955    
   1.9288304472938805e-13

julia> fool(129, 290)
gesdd!: zeroed workspace
4-element Array{Float64,1}:
 193.1475921487633       
   2.1055592380484165    
   0.6977071657771906    
   2.0038999448505881e-13

Again, the old version succeeds:

julia> VERSION
v"0.6.3"

julia> function fool(n,m); row = ones(n); col = [0.99999^(i-11) for i = 1:m]; A = col * row'; U, S, V = svd(A'); (norm(row)*norm(col), S[1:4]); end
fool (generic function with 1 method)

julia> fool(129, 290)
(193.15681226288552, [193.157, 3.15201e-13, 1.93002e-14, 1.93002e-14])

julia> function fool(n,m); row = ones(n); col = [0.99999^(i-11) for i = 1:m]; A = col * row'; U, S, V = svd(A'); S[1:4]; end
fool (generic function with 1 method)

julia> fool(129, 290)
4-element Array{Float64,1}:
 193.157      
   3.15201e-13
   1.93002e-14
   1.93002e-14

@antoine-levitt
Copy link
Contributor

Try BLAS.set_num_threads

@KlausC
Copy link
Contributor Author

KlausC commented Jul 7, 2018

Bingo! Thanks Antoine.
LinearAlgebra.BLAS.set_num_thread(1): 100% correct
LinearAlgebra.BLAS.set_num_thread(2): 98.3% correct
LinearAlgebra.BLAS.set_num_thread(4): 6.4% correct
That looks like an upstream bug/incompatibility.
I got the following configuration:

julia> LinearAlgebra.BLAS.openblas_get_config()
"USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge MAX_THREADS=16"

By the way, on v0.6.3 I have:

julia> Base.BLAS.openblas_get_config()
"DYNAMIC_ARCH NO_AFFINITY Sandybridge MAX_THREADS=128"

How should I proceed? (setting num_threads=1 in my startup-config would be a cheap work-around).
Is it worth while to determine the OpenBlas versions in use and file a bug to OpenBlas/LAPACK?

@ViralBShah
Copy link
Member

0.6.3 shipped with a very old version of openblas. 0.6.4 will at least ship with 0.3.0, and master is on 0.3.1. So, if you can still reproduce on 0.6.4 (should ship next week) or master, I would report upstream. Otherwise close this.

@jebej
Copy link
Contributor

jebej commented Jul 7, 2018

Probably the same issue: OpenMathLib/OpenBLAS#1666

@antoine-levitt
Copy link
Contributor

@jebej Unlikely, the offending commit in that issue was 12 days ago. @ViralBShah the bug is present on master, not on 0.6, so it's probably worth it to open an issue in OpenBLAS.

@KlausC The good thing is you've managed to reduce it to a matrix failing, so you can just write it to a file and post it on a bug report in OpenBLAS.

@jebej
Copy link
Contributor

jebej commented Jul 7, 2018

@antoine-levitt IIUC that commit made it into OpenBLAS 0.3.1, which made it in Julia on July 1st.

@andreasnoack andreasnoack added linear algebra Linear algebra upstream The issue is with an upstream dependency, e.g. LLVM regression Regression in behavior compared to a previous version bug Indicates an unexpected problem or unintended behavior labels Jul 9, 2018
@andreasnoack
Copy link
Member

I think we should revert the 0.3.1 upgrade. See #28002. This is a quite serious bug that could affect anything that uses threaded BLAS which is a lot.

@KristofferC
Copy link
Sponsor Member

Or add the OpenMathLib/OpenBLAS@5f2a3c0 patch perhaps?

@andreasnoack
Copy link
Member

Yes. We could also do that. Are we aware of any fixes we need from 0.3.1? It might be a bit simpler not to introduce a patch and just wait for 0.3.2.

@ViralBShah
Copy link
Member

There are no known 0.3.1 fixes that we need, to the best of my knowledge.

@ViralBShah
Copy link
Member

ViralBShah commented Jul 9, 2018

The upstream patch is not merged yet. Given that we want 0.7-beta2 out soon, reverting seems better, and then moving to 0.3.2 before the final 0.7 release seems lesser work.

@KristofferC
Copy link
Sponsor Member

The upstream patch is not merged yet.

Hmm, OpenMathLib/OpenBLAS#1667 seems merged.

There is a comment saying (OpenMathLib/OpenBLAS#1673 (comment))

Would be a pity if we had to revert all that thread initialization speedup we gained with 0.3.1.

but I don't know how big that speedup is.

@KlausC
Copy link
Contributor Author

KlausC commented Jul 10, 2018

After openblas 0.3.0 has been re-established, the bug cannot be reproduced on my configuration.

The used version is found in julia/deps/openblas.version.
The actual version of libopenblas64_.so linked to julia is julia/usr/lib/libopenblas64_.so. This is a copy of what is found in one of julia/deps/scratch/openblas-*. Unfortunately a simple make did not reflect the changes in julia/deps/openblas.version.
In order to get the correct version of the library installed there, it was necessary to remove julia/usr, then run make.

@KristofferC
Copy link
Sponsor Member

Fixed by #28002

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior linear algebra Linear algebra regression Regression in behavior compared to a previous version upstream The issue is with an upstream dependency, e.g. LLVM
Projects
None yet
Development

No branches or pull requests

7 participants