Bus error in dgemm_oncopy_HASWELL instead of OOM #10487

Closed
rened opened this issue Mar 12, 2015 · 17 comments
Labels
upstream The issue is with an upstream dependency, e.g. LLVM

@rened
Member

rened commented Mar 12, 2015

Originally reported here: https://groups.google.com/forum/#!topic/julia-users/rfuMGrf-dK4

I can reproduce this on both 0.3.6 and a 10-day-old master with the following code:

  using TSne, MNIST
  data, labels = traindata()
  Y = tsne(data, 2, 50, 1000, 20.0)

my versioninfo():

Julia Version 0.4.0-dev+3639
Commit 7f7e9ae* (2015-03-01 22:49 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Error message:

signal (10): Bus error: 10
dgemm_oncopy_HASWELL at /Users/rene/local/devjulia/usr/lib/libopenblas.dylib (unknown line)
inner_thread at /Users/rene/local/devjulia/usr/lib/libopenblas.dylib (unknown line)
blas_thread_server at /Users/rene/local/devjulia/usr/lib/libopenblas.dylib (unknown line)
_pthread_body at /usr/lib/system/libsystem_pthread.dylib (unknown line)
_pthread_struct_init at /usr/lib/system/libsystem_pthread.dylib (unknown line)
@ViralBShah
Member

Cc: @xianyi

@ViralBShah ViralBShah added the upstream The issue is with an upstream dependency, e.g. LLVM label Mar 12, 2015
@ViralBShah ViralBShah added this to the 0.4 milestone Mar 12, 2015
@ViralBShah
Member

Is it possible to come up with a standalone minimal example for the failure?

@rened
Member Author

rened commented Mar 12, 2015

After some digging:

X = rand(100,6000); X'*X  # => works
X = rand(100,60000); X'*X  # => fails

Running it in gdb consumes my entire RAM, so I don't have a stack trace, unfortunately.
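
A back-of-the-envelope size check is consistent with the two cases above: for an m-by-n matrix X, the result of X'*X is n-by-n, so its size grows with the square of n. The sketch below does that arithmetic in C (chosen to match the OpenBLAS sources discussed in this thread). `gram_result_bytes` is a hypothetical helper for illustration, not part of Julia or OpenBLAS, and it assumes 8-byte doubles.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold the n-by-n double-precision result of X'*X,
 * where X is m-by-n. Hypothetical helper, for illustration only. */
static uint64_t gram_result_bytes(uint64_t n)
{
    /* Precondition: with n <= 2^30 the product n*n*8 fits in 64 bits. */
    assert(n <= (UINT64_C(1) << 30));
    return n * n * (uint64_t)sizeof(double);
}
```

With n = 6000 this gives 288,000,000 bytes (about 288 MB); with n = 60000 it gives 28,800,000,000 bytes (about 28.8 GB) for the result alone, which plausibly exceeds the RAM of the machine in question.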

@rened
Member Author

rened commented Mar 12, 2015

... ok, this is just due to being out of memory apparently... on this system

julia> versioninfo()
Julia Version 0.3.6
Commit 0c24dca (2015-02-17 22:12 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

I get a clean MemoryError indicating out of memory, and can continue working, instead of the crash with "Bus error".
Is there anything we can do to improve the behavior on Haswell?

@andreasnoack
Member

I can reproduce this on my Mac. @ViralBShah should we detect the memory issue before calling OpenBLAS or is it OpenBLAS that doesn't handle this correctly?

@xianyi

xianyi commented Mar 13, 2015

@andreasnoack, I think OpenBLAS does handle the failed memory allocation: it just exits the program.

@ViralBShah
Member

@xianyi Should we be providing our own xerbla to deal with such cases to avoid the exit? In general, we never want openblas to exit, but to return an error code or an exception back to Julia.

@andreasnoack If the above is not possible, then we may want to build such safeguards on the Julia side before calling openblas.
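
One possible shape for such a safeguard is a size check done before the BLAS call. The sketch below is written in C only to match the OpenBLAS snippets in this thread; an actual check would live in Julia's BLAS wrappers. All names here (`size_check`, `mul_checked`, `check_result_size`) are hypothetical, and the memory budget is assumed to be supplied by the caller rather than queried from the OS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for the pre-call size check. */
enum size_check { SIZE_OK = 0, SIZE_OVERFLOW = 1, SIZE_TOO_LARGE = 2 };

/* Multiply a*b into *out; return 0 instead of wrapping on overflow. */
static int mul_checked(uint64_t a, uint64_t b, uint64_t *out)
{
    assert(out != NULL);
    if (a != 0 && b > UINT64_MAX / a)
        return 0;
    *out = a * b;
    return 1;
}

/* Decide whether a rows-by-cols result with elem_size-byte elements
 * fits within budget_bytes (a caller-supplied limit, an assumption here). */
static enum size_check check_result_size(uint64_t rows, uint64_t cols,
                                         uint64_t elem_size,
                                         uint64_t budget_bytes)
{
    uint64_t elems, bytes;
    if (!mul_checked(rows, cols, &elems) ||
        !mul_checked(elems, elem_size, &bytes))
        return SIZE_OVERFLOW;
    return bytes <= budget_bytes ? SIZE_OK : SIZE_TOO_LARGE;
}
```

A check like this only covers the result matrix. Work buffers that OpenBLAS allocates internally are outside its reach, so it would be a partial mitigation at best.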

@xianyi

xianyi commented Mar 13, 2015

@ViralBShah , how could I use xerbla to deal with this case?

@ViralBShah
Member

xerbla is not really relevant for this: http://www.netlib.org/lapack/explore-3.1.1-html/xerbla.f.html

Perhaps have a way to just return an error code when allocation fails, instead of calling abort? Maybe a build parameter?
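
A minimal sketch of what "return an error code, selectable at build time" could look like is given below. It is not OpenBLAS code: `demo_status`, `demo_alloc_buffer`, and the `DEMO_EXIT_ON_ALLOC_FAILURE` macro are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes a caller such as Julia could inspect. */
enum demo_status { DEMO_OK = 0, DEMO_ALLOC_FAILED = 1 };

/* Allocate a work buffer of the given size into *out.
 * Default build: report failure through the return value.
 * With -DDEMO_EXIT_ON_ALLOC_FAILURE (hypothetical switch): print and exit. */
static enum demo_status demo_alloc_buffer(size_t bytes, void **out)
{
    assert(out != NULL);
    *out = malloc(bytes);
    /* malloc(0) may legitimately return NULL, so only treat NULL as a
     * failure when a non-zero size was requested. */
    if (*out == NULL && bytes != 0) {
#ifdef DEMO_EXIT_ON_ALLOC_FAILURE
        fprintf(stderr, "Memory alloc failed\n");
        exit(1);
#else
        return DEMO_ALLOC_FAILED;
#endif
    }
    return DEMO_OK;
}
```

The caller then decides how to surface the failure, for example by raising an exception on the Julia side, instead of the whole process terminating.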

@vtjnash
Member

vtjnash commented Apr 16, 2015

fwiw, lapack-netlib ~~usually~~ always calls xerbla in the case of a malloc failure

alternatively, we could potentially trap calls to exit and convert them to calls to jl_throw

@vtjnash
Member

vtjnash commented Apr 16, 2015

presumably, this would be as simple as commenting out the calls to exit, as shown below. but I don't see where in dgemm_oncopy_HASWELL there would have been memory allocated (on the worker thread) @xianyi

jameson@julia:~/julia/deps/openblas$ git diff
diff --git a/interface/imatcopy.c b/interface/imatcopy.c
index 89f0ec8..62e8d0f 100644
--- a/interface/imatcopy.c
+++ b/interface/imatcopy.c
@@ -127,7 +127,7 @@ void CNAME( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows,
        if ( b == NULL )
        {
                printf("Memory alloc failed\n");
-               exit(1);
+               return; //exit(1);
        }

        if ( order == BlasColMajor )
diff --git a/interface/zimatcopy.c b/interface/zimatcopy.c
index 3f273cf..19fce0a 100644
--- a/interface/zimatcopy.c
+++ b/interface/zimatcopy.c
@@ -133,7 +133,7 @@ void CNAME( enum CBLAS_ORDER CORDER, enum CBLAS_TRANSPOSE CTRANS, blasint crows,
         if ( b == NULL )
         {
                 printf("Memory alloc failed\n");
-                exit(1);
+                return; //exit(1);
         }

@xianyi

xianyi commented Apr 17, 2015

The source file of dgemm_oncopy_HASWELL is kernel/generic/gemm_ncopy_4.c. There isn't any memory allocation in it. The buffer is allocated in the worker thread.

@vtjnash vtjnash modified the milestones: 0.4.x, 0.4.0 Apr 19, 2015
@vtjnash vtjnash changed the title Bus error in dgemm_oncopy_HASWELL Bus error in dgemm_oncopy_HASWELL instead of OOM Apr 19, 2015
@rened
Member Author

rened commented Aug 10, 2015

Bump - still present on current master. I would propose to add a check for this on the Julia side until upstream handles this cleanly? Rationale: 0.4 will attract a lot of new users, better not segfault for trivial stuff.

@xianyi

xianyi commented Aug 10, 2015

@ViralBShah, I want to reproduce this bug on my Mac. However, I cannot figure out how to build the OpenBLAS develop branch under Julia.

I edited julia/deps/openblas.version v0.2.14->develop. However, the build broke.

@yuyichao
Contributor

@xianyi Have you tried USE_SYSTEM_BLAS=1?

@tkelman
Contributor

tkelman commented Aug 11, 2015

> I edited julia/deps/openblas.version v0.2.14->develop. However, the build broke.

What happened? That should work, though I think you'll need to set a specific SHA1. That's what we primarily use; the branch is secondary.

@tkelman
Contributor

tkelman commented Nov 1, 2015

I think this is fixed (gives an OutOfMemoryError()) on 0.4 and master. Leave a comment or open a new issue if not.

@tkelman tkelman closed this as completed Nov 1, 2015