Recurrent intermittent travis test failure #4016

kmsquire opened this Issue Aug 11, 2013 · 7 comments


The Julia Language member

Happens both on clang (here and here) and gcc (here)

$ cd /tmp/julia/share/julia/test && /tmp/julia/bin/julia runtests.jl all
    From worker 4:       * numbers
    From worker 5:       * strings
    From worker 3:       * keywordargs
    From worker 6:       * unicode
    From worker 2:       * core
    From worker 7:       * collections
    From worker 9:       * remote
    From worker 8:       * hashing
    From worker 9:       * iostring
    From worker 3:       * arrayops
    From worker 9:       * linalg
    From worker 8:       * blas
    From worker 6:       * fft
    From worker 2:       * dsp
    From worker 7:       * sparse
    From worker 5:       * bitarray
    From worker 8:       * random
Worker 2 terminated.
ERROR: read: end of file
 in read at iobuffer.jl:68
 in read at stream.jl:609
 in anonymous at task.jl:797

ERROR: ProcessExitedException()
 in yield at multi.jl:1490
 in wait at task.jl:105
 in wait_full at multi.jl:545
 in remotecall_fetch at multi.jl:645
 in remotecall_fetch at multi.jl:650
 in anonymous at multi.jl:1332
at /tmp/julia/share/julia/test/runtests.jl:20
The command "/tmp/julia/bin/julia runtests.jl all" exited with 1.
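
For context, a ProcessExitedException at the master is what a remotecall_fetch raises when the worker it is talking to dies mid-call. A minimal, hypothetical illustration in current Julia syntax (in 2013 these functions lived in Base rather than the Distributed stdlib):

    # Hypothetical illustration (modern Julia), not from the original report:
    # a worker that dies in the middle of a remote call surfaces at the caller
    # as a ProcessExitedException, as seen in the Travis log above.
    using Distributed
    addprocs(1)
    w = first(workers())
    try
        remotecall_fetch(() -> exit(), w)   # worker process exits during the call
    catch err
        @show err                           # typically ProcessExitedException(w)
    end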

It's also unclear where task.jl:797 is, since task.jl only has 164 lines; that frame possibly comes from stream.jl:797. The other backtrace locations are iobuffer.jl:68 and stream.jl:609.

I was looking to see if there might be a race condition in IOBuffer (e.g., isopen() becoming false in wait_nb before the data is written, or the readnotify condition being notified before the buffer is filled), but I didn't see anything obvious.
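
One way to probe for that kind of read-before-write race from the public API is a stress loop like the one below. This is a hypothetical sketch in current Julia syntax, not code from the original investigation; Base.BufferStream is used only as a convenient stream backed by an in-memory buffer.

    # Hypothetical stress test (not from the original report): an async writer
    # fills a buffered stream and closes it while a blocking reader drains it;
    # a race of the kind described above would show up as a short read.
    function stress(iters = 10_000)
        for _ in 1:iters
            bs = Base.BufferStream()
            @async begin
                write(bs, rand(UInt8, 64))   # fill the buffer ...
                close(bs)                    # ... then close, racing the reader
            end
            data = read(bs)                  # read everything up to EOF
            @assert length(data) == 64
        end
    end
    stress()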

The Julia Language member

I can get this on my OS X box as well: if I just run make testall in a loop, I eventually hit it.

If there's anything I can do to help debug this, let me know.

The Julia Language member

So, had I paid more attention, I would have noticed that the problem occurs during the DSP tests, which is where the worker terminates. The following is sufficient to cause a segfault on two Linux systems that I tried:

julia> ;cd test

julia> using Base.Test

julia> while true
Segmentation fault (core dumped)
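
The loop body is cut off above. A hedged guess at the shape of the reproduction, based on the observation earlier in this comment that the problem occurs during the DSP tests ("dsp.jl" is an assumption, not taken from the original):

    # hypothetical reconstruction; "dsp.jl" is assumed from the DSP-test observation
    julia> while true
               include("dsp.jl")
           end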
The Julia Language member

It might be worth valgrinding this one with MEMDEBUG enabled. I ran valgrind earlier today (unrelatedly) and saw a fair number of invalid reads/writes, though I can't rule out that those were caused by my own changes.

The Julia Language member

@loladiro, will do. Right now, in the debugger, I can see that there's memory corruption.

julia> while true

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
489     p->freelist = p->freelist->next;
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) backtrace
#0  0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
#1  0x00007ffff72540ef in allocobj (sz=368) at gc.c:981
#2  0x00007ffff7242008 in _new_array (atype=0x67fa60, ndims=1, dims=0x7fffffffb7f0) at array.c:80
#3  0x00007ffff7242c95 in jl_alloc_array_1d (atype=0x67fa60, nr=40) at array.c:297
#4  0x00007ffff0e382a9 in ?? ()
#5  0x00007fffffffb930 in ?? ()
#6  0x01007ffff7242008 in ?? ()
#7  0x0000000003c48350 in ?? ()
#8  0x0000004e00000000 in ?? ()
#9  0x000000000000000b in ?? ()
#10 0x0000000000000008 in ?? ()
#11 0x000000000000000b in ?? ()
#12 0x00007fffffffb960 in ?? ()
#13 0x0000000200000100 in ?? ()
#14 0x0000000000ad90c0 in ?? ()
#15 0x0000000000000580 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb) print p     
$1 = (pool_t *) 0x7ffff7fcdb68
(gdb) print *p
$2 = {osize = 384, pages = 0x3e5c380, freelist = 0x4009000000000000}
(gdb) print *(p.freelist)
Cannot access memory at address 0x4009000000000000
The Julia Language member

Yup, that's the one I saw earlier today as well. I'm not quite sure, but I think it might be related to the size of the work array in gesdd that was changed recently.

@kmsquire added a commit that closed this issue Aug 13, 2013
@kmsquire Match Octave/Numpy's minimum RWORK size for gesdd. Fixes #4016
gesdd's RWORK size was recently changed to match the netlib header.
However, the minimum size calculated that way results in a segfault.
Both Octave and Numpy use a different (larger) minimum size, and testing
verified that anything smaller than that leads to a segfault.

@kmsquire closed this in 9f60588 Aug 13, 2013
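
The commit message above concerns the minimum RWORK length passed to the complex gesdd routines. For context only, here is a sketch of the lower bound that the corrected LAPACK documentation later settled on for the case where singular vectors are requested; the exact expression used in commit 9f60588 is not reproduced here and may differ:

    # Hypothetical sketch, not the literal commit: a lower bound on the RWORK
    # length for ZGESDD/CGESDD when singular vectors are requested, following
    # the corrected LAPACK documentation.
    function gesdd_rwork_min(m::Integer, n::Integer)
        mn, mx = minmax(m, n)
        max(5*mn*mn + 5*mn, 2*mx*mn + 2*mn*mn + mn)
    end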
The Julia Language member

FWIW, I reported this upstream back when this problem originally appeared, and the documentation of ZGESDD was recently fixed (although the fix won't appear in an LAPACK release until sometime this summer).

@kmsquire referenced this issue Mar 19, 2015
@JeffBezanson fix #3966
the size of the RWORK array in zgesdd was wrong.