Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelized dgetrf doesn't detect matrix as singular #4505

Open
jmtilli opened this issue Feb 18, 2024 · 9 comments
Open

Parallelized dgetrf doesn't detect matrix as singular #4505

jmtilli opened this issue Feb 18, 2024 · 9 comments

Comments

@jmtilli
Copy link

jmtilli commented Feb 18, 2024

I have the following matrix:

10,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,10,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,10,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,0,
0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,
0,0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,
0,0,0,-5.6353902507748659,0,0,15.635391250774864,0,0,-10,-9.9999999999999995e-07,0,0,0,
0,0,0,0,-5.6353902507748659,0,0,25.635391250774866,0,-10,-10,-9.9999999999999995e-07,0,0,
0,0,0,0,0,-5.6353902507748659,0,0,25.635390250774865,-10,0,-10,0,0,
0,0,0,0,0,0,-10,-10,-10,30,0,0,0,0,
0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,0,20.000001999999999,-10,-9.9999999999999995e-07,0,
0,0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,-10,30.001000860019595,0,-10.000999860019597,
0,0,0,0,0,0,0,0,0,0,-9.9999999999999995e-07,0,5.635391250774866,-5.6353902507748659,
0,0,0,0,0,0,0,0,0,0,0,-10.000999860019597,-5.6353902507748659,15.636390110794462,

...which should be singular. At least getting its inverse in GNU Octave shows that it's singular to machine precision.

If I try to do LU decomposition for it using OpenBLAS:

#include <stdio.h>

#define lapack_int int
#define LAPACK_dgetrf dgetrf_
void LAPACK_dgetrf(const lapack_int*, const lapack_int*, double*, const lapack_int*, lapack_int*, lapack_int*);

double M[] = {
10,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,10,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,10,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,0,
0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,
0,0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,
0,0,0,-5.6353902507748659,0,0,15.635391250774864,0,0,-10,-9.9999999999999995e-07,0,0,0,
0,0,0,0,-5.6353902507748659,0,0,25.635391250774866,0,-10,-10,-9.9999999999999995e-07,0,0,
0,0,0,0,0,-5.6353902507748659,0,0,25.635390250774865,-10,0,-10,0,0,
0,0,0,0,0,0,-10,-10,-10,30,0,0,0,0,
0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,0,20.000001999999999,-10,-9.9999999999999995e-07,0,
0,0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,-10,30.001000860019595,0,-10.000999860019597,
0,0,0,0,0,0,0,0,0,0,-9.9999999999999995e-07,0,5.635391250774866,-5.6353902507748659,
0,0,0,0,0,0,0,0,0,0,0,-10.000999860019597,-5.6353902507748659,15.636390110794462,
};

int main(int argc, char **argv)
{
        const int n = 14;
        int ipiv[14];
        int info = 0;
        LAPACK_dgetrf(&n, &n, M, &n, ipiv, &info);
        printf("Info %d\n", info);
        return 0;
}

...it prints info 0. However, if I set the environment variable OPENBLAS_NUM_THREADS to 1, it prints 14.

So it incorrectly detects the matrix as decomposable if I run it with many threads, and only detects it correctly as singular if I run it with one thread. The separator seems to be 3 vs 4 threads: with 3 threads it prints info 14 but with 4 threads it prints info 0.

I have OpenBLAS 0.3.8. I understand that in newer versions of OpenBLAS, small matrices are always handled with only one thread so it improves performance, so testing the issue with a new OpenBLAS isn't even possible. However, since the calculation happens differently depending on the number of threads, I think there's an issue somewhere that should be tracked down. I would expect OpenBLAS to give identical results no matter how many threads are in use.

@martin-frbg
Copy link
Collaborator

I cannot immediately reproduce this with 0.3.26 (after disabling the code that enforces single-threading of course), what is your hardware please ?

@jmtilli
Copy link
Author

jmtilli commented Feb 21, 2024

The hardware is a VirtualBox virtual machine running Ubuntu 20.04 LTS on top of Windows 11 host. CPU is Intel Core i7-10510U and hardware virtualization is being used.

@martin-frbg
Copy link
Collaborator

Thank you - that should be Haswell as far as OpenBLAS is concerned (unless your vbox setup hides the avx2 capability),
I cannot reproduce the problem there although I cannot immediately point to a particular code change that would have fixed it

@martin-frbg
Copy link
Collaborator

Appears to be a loss of precision somewhere in the Sandybridge kernels...

@jmtilli
Copy link
Author

jmtilli commented Feb 21, 2024

Here are the cpuinfo flags:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d arch_capabilities

...so it doesn't hide the avx2.

@martin-frbg
Copy link
Collaborator

The old version may have used a different GEMM kernel, and GETRF behaviour depends on its unroll factor - which seems to explain the discrepancy I've found with Sandybridge. In one case, the workload gets split up again where it appears the problem comes in.

@martin-frbg
Copy link
Collaborator

Interestingly, the plain unoptimized Fortran code of the "netlib" Reference-LAPACK implementation does not recognize this particular matrix as singular either. And so far I have not come across any other (trivially) singular matrix that similarly slips through either codes. This is looking more and more like a weird corner case of machine precision (and/or compiler code generation) to me.

@jmtilli
Copy link
Author

jmtilli commented Mar 1, 2024

Maybe this is more like an issue due to limited floating point precision rather than a real problem. I understand that in parallel versions, calculations can happen in different order and the usual mathematical rules of associativity for example do not necessarily apply. I encountered the matrix on a circuit nodal analysis problem that was incorrectly presented, and it should indeed be singular, since the problem presentation was incorrect and important elements were missing from the matrix.

@martin-frbg
Copy link
Collaborator

Probably yes - and it is interesting that even the simple single-threaded Fortran code of the Reference LAPACK does not treat this matrix as singular on my machine, so compiler version or the math libraries on your system may play a role here as well. Not sure if I want to calculate the determinant by hand, but I don't think I see any obviously interdependent rows or columns that would make it unequivocally singular. (Octave probably uses either OpenBLAS or the Reference BLAS/LAPACK so is no independent proof)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants