Recurrent intermittent travis test failure #4016

Closed
kmsquire opened this Issue · 7 comments

4 participants

@kmsquire
Collaborator

Happens both on clang (here and here) and gcc (here)

$ cd /tmp/julia/share/julia/test && /tmp/julia/bin/julia runtests.jl all
    From worker 4:       * numbers
    From worker 5:       * strings
    From worker 3:       * keywordargs
    From worker 6:       * unicode
    From worker 2:       * core
    From worker 7:       * collections
    From worker 9:       * remote
    From worker 8:       * hashing
    From worker 9:       * iostring
    From worker 3:       * arrayops
    From worker 9:       * linalg
    From worker 8:       * blas
    From worker 6:       * fft
    From worker 2:       * dsp
    From worker 7:       * sparse
    From worker 5:       * bitarray
    From worker 8:       * random
Worker 2 terminated.
ERROR: read: end of file
 in read at iobuffer.jl:68
 in read at stream.jl:609
 in anonymous at task.jl:797

ERROR: ProcessExitedException()
 in yield at multi.jl:1490
 in wait at task.jl:105
 in wait_full at multi.jl:545
 in remotecall_fetch at multi.jl:645
 in remotecall_fetch at multi.jl:650
 in anonymous at multi.jl:1332
at /tmp/julia/share/julia/test/runtests.jl:20
The command "/tmp/julia/bin/julia runtests.jl all" exited with 1.

It's also unclear where task.jl:797 comes from, since task.jl only has 164 lines; it may actually be stream.jl:797.
The other backtrace locations are iobuffer.jl:68 and stream.jl:609.

I was looking to see if there might be a race condition in IOBuffer, e.g., where isopen() becomes false in wait_nb before data is written, or the readnotify condition is notified before the buffer is filled, etc., but didn't see anything obvious.

@staticfloat
Owner

I can reproduce this on my OSX box as well: if I just run make testall in a loop, I eventually hit it.

If there's anything I can do to help debug this, let me know.
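
For anyone else trying to reproduce this, a minimal sketch of the "run it in a loop until it fails" approach is below. The helper name run_until_fail and the run limit are made up for illustration and are not part of the Julia build; the only thing taken from this thread is the make testall command.

```shell
#!/bin/sh
# Re-run a command until it fails or a run limit is reached.
# run_until_fail is a hypothetical helper for this sketch, not a Julia tool.
run_until_fail() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # "$@" is the command under test, e.g. make testall
        if ! "$@"; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}

# Example, run from the top of a Julia checkout (limit of 100 is arbitrary):
#   run_until_fail 100 make testall
```

Capping the number of runs keeps the loop from spinning forever if the failure turns out not to reproduce on a given machine.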

@kmsquire
Collaborator

Had I paid more attention, I would have noticed that the problem occurs during the DSP tests, where the worker terminates. The following is sufficient to cause a segfault on the two Linux systems that I tried:

julia> ;cd test
/home/kmsquire/Source/julia/test

julia> using Base.Test

julia> while true
           include("dsp.jl")
       end
Segmentation fault (core dumped)

@Keno
Owner

It might be worth running this one under valgrind with MEMDEBUG enabled. I did so earlier today (for unrelated reasons) and saw a fair number of invalid reads/writes, though I can't rule out that those were caused by my changes.

@kmsquire
Collaborator

@loladiro, will do. Right now, in the debugger, I can see that there's memory corruption.

julia> while true
           include("dsp.jl")
       end

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
489     p->freelist = p->freelist->next;
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) backtrace
#0  0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
#1  0x00007ffff72540ef in allocobj (sz=368) at gc.c:981
#2  0x00007ffff7242008 in _new_array (atype=0x67fa60, ndims=1, dims=0x7fffffffb7f0) at array.c:80
#3  0x00007ffff7242c95 in jl_alloc_array_1d (atype=0x67fa60, nr=40) at array.c:297
#4  0x00007ffff0e382a9 in ?? ()
#5  0x00007fffffffb930 in ?? ()
#6  0x01007ffff7242008 in ?? ()
#7  0x0000000003c48350 in ?? ()
#8  0x0000004e00000000 in ?? ()
#9  0x000000000000000b in ?? ()
#10 0x0000000000000008 in ?? ()
#11 0x000000000000000b in ?? ()
#12 0x00007fffffffb960 in ?? ()
#13 0x0000000200000100 in ?? ()
#14 0x0000000000ad90c0 in ?? ()
#15 0x0000000000000580 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb) print p     
$1 = (pool_t *) 0x7ffff7fcdb68
(gdb) print *p
$2 = {osize = 384, pages = 0x3e5c380, freelist = 0x4009000000000000}
(gdb) print *(p.freelist)
Cannot access memory at address 0x4009000000000000
(gdb) 

@Keno
Owner

Yup, that's the one I saw earlier today as well. I'm not quite sure, but I think it might be related to the size of the work array in gesdd, which was changed recently.

@kmsquire kmsquire closed this issue from a commit
@kmsquire kmsquire Match Octave/Numpy's minimum RWORK size for gesdd. Fixes #4016
gesdd's RWORK size was recently changed to match the netlib header.
However, the minimum size calculated results in a segfault.  Both
Octave and Numpy use a different minimum size, and testing verified
that anything smaller than this leads to a segfault.

Numpy: https://github.com/numpy/numpy/blob/master/numpy/linalg/umath_linalg.c.src#L2922
Octave: http://hg.savannah.gnu.org/hgweb/octave/file/2f1729cae08f/liboctave/numeric/CmplxSVD.cc#l184
9f60588
@kmsquire kmsquire closed this in 9f60588
@kmsquire
Collaborator

FWIW, I reported this upstream back when this problem originally appeared, and the documentation of ZGESDD was recently fixed (although the fix won't appear in an LAPACK release until sometime this summer).

@kmsquire kmsquire referenced this issue from a commit
@JeffBezanson JeffBezanson fix #3966
the size of the RWORK array in zgesdd was wrong.
484a9f0