Recurrent intermittent travis test failure #4016

kmsquire opened this Issue · 7 comments

4 participants


This happens on both clang (here and here) and gcc (here).

$ cd /tmp/julia/share/julia/test && /tmp/julia/bin/julia runtests.jl all
    From worker 4:       * numbers
    From worker 5:       * strings
    From worker 3:       * keywordargs
    From worker 6:       * unicode
    From worker 2:       * core
    From worker 7:       * collections
    From worker 9:       * remote
    From worker 8:       * hashing
    From worker 9:       * iostring
    From worker 3:       * arrayops
    From worker 9:       * linalg
    From worker 8:       * blas
    From worker 6:       * fft
    From worker 2:       * dsp
    From worker 7:       * sparse
    From worker 5:       * bitarray
    From worker 8:       * random
Worker 2 terminated.
ERROR: read: end of file
 in read at iobuffer.jl:68
 in read at stream.jl:609
 in anonymous at task.jl:797

ERROR: ProcessExitedException()
 in yield at multi.jl:1490
 in wait at task.jl:105
 in wait_full at multi.jl:545
 in remotecall_fetch at multi.jl:645
 in remotecall_fetch at multi.jl:650
 in anonymous at multi.jl:1332
at /tmp/julia/share/julia/test/runtests.jl:20
The command "/tmp/julia/bin/julia runtests.jl all" exited with 1.

It's also unclear where task.jl:797 is, since task.jl has only 164 lines; the frame may actually be stream.jl:797.
The other backtrace locations are iobuffer.jl:68 and stream.jl:609.

I was looking to see if there might be a race condition in IOBuffer, e.g., where isopen() becomes false in wait_nb before data is written, or the readnotify condition is notified before the buffer is filled, etc., but didn't see anything obvious.


I can get this on my OSX box as well. If I just run make testall in a loop, I eventually hit it.

If there's anything I can do to help debug this, let me know.


Had I paid more attention, I would have noticed that the problem occurs during the DSP tests, which is where the worker terminates. The following is sufficient to cause a segfault on two Linux systems that I tried:

julia> ;cd test

julia> using Base.Test

julia> while true
Segmentation fault (core dumped)

It might be worth valgrinding this one with MEMDEBUG enabled. I did so earlier today (for unrelated reasons) and saw a fair number of invalid reads/writes, though I can't rule out that those were caused by my changes.


@loladiro, will do. Right now, in the debugger, I can see that there's memory corruption.

julia> while true

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
489     p->freelist = p->freelist->next;
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) backtrace
#0  0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
#1  0x00007ffff72540ef in allocobj (sz=368) at gc.c:981
#2  0x00007ffff7242008 in _new_array (atype=0x67fa60, ndims=1, dims=0x7fffffffb7f0) at array.c:80
#3  0x00007ffff7242c95 in jl_alloc_array_1d (atype=0x67fa60, nr=40) at array.c:297
#4  0x00007ffff0e382a9 in ?? ()
#5  0x00007fffffffb930 in ?? ()
#6  0x01007ffff7242008 in ?? ()
#7  0x0000000003c48350 in ?? ()
#8  0x0000004e00000000 in ?? ()
#9  0x000000000000000b in ?? ()
#10 0x0000000000000008 in ?? ()
#11 0x000000000000000b in ?? ()
#12 0x00007fffffffb960 in ?? ()
#13 0x0000000200000100 in ?? ()
#14 0x0000000000ad90c0 in ?? ()
#15 0x0000000000000580 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb) print p     
$1 = (pool_t *) 0x7ffff7fcdb68
(gdb) print *p
$2 = {osize = 384, pages = 0x3e5c380, freelist = 0x4009000000000000}
(gdb) print *(p.freelist)
Cannot access memory at address 0x4009000000000000

Yup, that's the one I saw earlier today as well. I'm not quite sure, but I think it might be related to the size of the work array in gesdd, which was changed recently.

@kmsquire closed this issue from a commit
@kmsquire: Match Octave/Numpy's minimum RWORK size for gesdd. Fixes #4016
gesdd's RWORK size was recently changed to match the netlib header.
However, the minimum size calculated results in a segfault.  Both
Octave and Numpy use a different minimum size, and testing verified
that anything smaller than this leads to a segfault.

@kmsquire closed this in 9f60588

FWIW, I reported this upstream back when this problem originally appeared, and the documentation of ZGESDD was recently fixed (although the fix won't appear in an LAPACK release until sometime this summer).

@kmsquire referenced this issue from a commit
@JeffBezanson: fix #3966
the size of the RWORK array in zgesdd was wrong.