Test suite hangs indefinitely when it should be failing #14374

Closed
petercolberg opened this issue Dec 12, 2015 · 11 comments
Labels
test: This change adds or pertains to unit tests

Comments

@petercolberg
Contributor

The julia 0.4.2 test suite hangs indefinitely with this error:

https://buildd.debian.org/status/fetch.php?pkg=julia&arch=i386&ver=0.4.2-2&stamp=1449841497

env JULIA_CPU_CORES=4 JULIA_TEST_MAXRSS_MB=500 make -C test USE_SYSTEM_LIBUNWIND=1 USE_SYSTEM_PCRE=1 USE_SYSTEM_BLAS=1 USE_SYSTEM_LAPACK=1 USE_BLAS64=0 USE_SYSTEM_FFTW=1 USE_SYSTEM_GMP=1 USE_SYSTEM_ARPACK=1 USE_SYSTEM_MPFR=1 USE_SYSTEM_SUI
make[2]: Entering directory '/«PKGBUILDDIR»/test'
Warning: git information unavailable; versioning information limited
 /«PKGBUILDDIR»/usr/bin/julia --check-bounds=yes --startup-file=no ./runtests.jl all
        From worker 5:       * linalg/matmul         in  22.43 seconds, maxrss  155.79 MB
        From worker 5:       * linalg/schur          in  23.92 seconds, maxrss  176.69 MB
        From worker 5:       * linalg/special        in  10.63 seconds, maxrss  187.79 MB
        From worker 5:       * linalg/eigen          in  27.96 seconds, maxrss  212.90 MB
        From worker 5:       * linalg/bunchkaufman   in  19.42 seconds, maxrss  224.61 MB
        From worker 4:       * linalg/dense          in 123.99 seconds, maxrss  254.21 MB
        From worker 5:       * linalg/svd            in  21.08 seconds, maxrss  234.63 MB
        From worker 5:       * linalg/tridiag        in  71.47 seconds, maxrss  265.65 MB
        From worker 4:       * linalg/lapack         in 143.34 seconds, maxrss  282.14 MB
        From worker 5:       * linalg/bidiag         in  77.87 seconds, maxrss  305.19 MB
        From worker 3:       * linalg/qr             in 277.37 seconds, maxrss  280.61 MB
        From worker 5:       * linalg/pinv           in  17.37 seconds, maxrss  368.57 MB
        From worker 5:       * linalg/cholesky       in  70.23 seconds, maxrss  368.57 MB
        From worker 4:       * linalg/diagonal       in 162.91 seconds, maxrss  337.10 MB
        From worker 3:       * linalg/givens         in 176.16 seconds, maxrss  290.84 MB
        From worker 5:       * linalg/lu             in  96.66 seconds, maxrss  385.29 MB
        From worker 3:       * linalg/generic        in  22.59 seconds, maxrss  290.84 MB
        From worker 5:       * linalg/uniformscaling in  24.45 seconds, maxrss  387.89 MB
        From worker 4:       * linalg/symmetric      in  56.18 seconds, maxrss  353.91 MB
        From worker 4:       * keywordargs           in  16.81 seconds, maxrss  356.78 MB
        From worker 3:       * linalg/arnoldi        in  30.15 seconds, maxrss  314.45 MB
        From worker 3:       * printf                in  14.72 seconds, maxrss  319.54 MB
        From worker 3:       * char                  in   5.31 seconds, maxrss  321.15 MB
        From worker 5:       * core                  in 175.41 seconds, maxrss  426.35 MB
        From worker 5:       * triplequote           in   2.81 seconds, maxrss  426.35 MB
        From worker 3:       * string                in 150.91 seconds, maxrss  397.96 MB
        From worker 2:       * linalg/triangular     in 962.22 seconds, maxrss  592.21 MB
Error [connect: connection refused (ECONNREFUSED)] on 6 while connecting to peer 2. Exiting.
Worker 6 terminated.
ERROR (unhandled task failure): EOFError: read end of file
        From worker 4:       * numbers               in 514.34 seconds, maxrss  475.01 MB
        From worker 4:       * dict                  in  75.40 seconds, maxrss  498.85 MB
        From worker 3:       * dates                 in 461.62 seconds, maxrss  453.04 MB
        From worker 3:       * remote                in   1.51 seconds, maxrss  453.82 MB
        From worker 5:       * unicode               in 489.21 seconds, maxrss  444.96 MB
        From worker 5:       * staged                in  11.69 seconds, maxrss  446.47 MB
        From worker 3:       * iobuffer              in  26.04 seconds, maxrss  458.60 MB
        From worker 3:       * tuple                 in  31.44 seconds, maxrss  462.73 MB
        From worker 4:       * hashing               in 173.56 seconds, maxrss  515.33 MB
        From worker 5:       * arrayops              in 374.96 seconds, maxrss  609.52 MB
        From worker 3:       * subarray              in 2293.62 seconds, maxrss  865.47 MB

The test workers are configured to restart when exceeding a resident memory size of 500 MB.
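
To check which tests pushed a worker past that limit, the log lines can be filtered on their maxrss column. A minimal sketch, assuming the line format shown in the log above; the `log` sample holds four lines copied from that log, and the `limit_mb` and `over_limit` variable names are illustrative, not part of the Julia test harness:

```shell
# Four lines copied verbatim from the build log.
log='        From worker 5:       * triplequote           in   2.81 seconds, maxrss  426.35 MB
        From worker 3:       * string                in 150.91 seconds, maxrss  397.96 MB
        From worker 2:       * linalg/triangular     in 962.22 seconds, maxrss  592.21 MB
        From worker 4:       * hashing               in 173.56 seconds, maxrss  515.33 MB'

limit_mb=500

# The maxrss value is the second-to-last field; the worker id is field 3
# (with a trailing colon) and the test name is field 5.
over_limit=$(printf '%s\n' "$log" |
    awk -v limit="$limit_mb" '
        $(NF-1) + 0 > limit {
            sub(/:$/, "", $3)
            print "worker " $3 ": " $5 " (" $(NF-1) " MB)"
        }')

printf '%s\n' "$over_limit"
```

On this sample the filter reports linalg/triangular on worker 2 and hashing on worker 4.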

Maybe the timeout for waiting on spawned workers is too short?

@ViralBShah added the test label Dec 12, 2015
@petercolberg
Contributor Author

I could reproduce this failure in a local build chroot.

It seems that remotecall_fetch() hangs after ECONNREFUSED, instead of raising an exception.

As a result, the test suite hung until the build daemon terminated it (signal 15).
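
Until the hang itself is fixed, a build script could at least turn the stall into a prompt failure by putting a wall-clock limit on the test run. A minimal sketch, assuming GNU coreutils `timeout` is available in the build chroot; `run_with_limit` is a hypothetical helper, and `sleep 5` stands in for the real `make -C test` invocation:

```shell
# Hypothetical helper: run a command with a wall-clock limit in seconds.
# GNU timeout exits with status 124 when the limit is reached.
run_with_limit() {
    limit="$1"
    shift
    status=0
    timeout --signal=TERM "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Stand-in for something like: run_with_limit 7200 make -C test
# (the 7200-second figure is arbitrary).
run_with_limit 1 sleep 5 || echo "exit status: $?"
```

Adding `--kill-after` to the `timeout` call would also cover a process that ignores SIGTERM.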

@petercolberg changed the title from "[0.4.2] test suite aborts with connection refused (ECONNREFUSED)" to "[0.4.2] test suite hangs indefinitely after connection refused (ECONNREFUSED)" Dec 12, 2015
@nalimilan
Member

Do you think this is a new failure in 0.4.2?

@petercolberg
Contributor Author

Yes, this issue first appeared in 0.4.2.

I wonder if it is related to the use of JULIA_TEST_MAXRSS_MB. Limiting the memory usage became necessary with 0.4 since the test suite has grown significantly, to the point where it exceeds the memory provided by Debian (and Ubuntu) build machines. At some point it would be good to think about a garbage collector for generated machine code, but that is a different issue.

@nalimilan
Member

JULIA_TEST_MAXRSS_MB is new in 0.4.2, but if anything it should have limited the problem. Am I missing something?

@rekado

rekado commented Dec 21, 2015

I'm trying to update the Julia package for GNU Guix (which builds in a chroot) and I see that the tests stall at this same point since version 0.4.2. I do not see ECONNREFUSED, though. The builder terminated the process after 3600 seconds of silence.

@petercolberg changed the title from "[0.4.2] test suite hangs indefinitely after connection refused (ECONNREFUSED)" to "Test suite hangs indefinitely when it should be failing" Dec 21, 2015
@rekado

rekado commented Dec 23, 2015

In my case I can work around this by disabling the "repl" and "replcompletions" tests. It's possible that disabling the "repl" test is enough.

@tkelman
Contributor

tkelman commented Jan 3, 2016

Was this fixed by something? If so, by what?

@petercolberg
Contributor Author

It would be good to have feedback from a Julia developer with intimate knowledge of @async.

This issue is really two issues: the test suite hangs on failure, and a specific test fails. For now I am only interested in fixing the former, since it is probably not intended that the test suite hangs instead of failing. The hang renders the test suite summary useless, since the summary code is never reached.

@timholy reopened this Jan 3, 2016
@timholy
Sponsor Member

timholy commented Jan 3, 2016

You might try limiting the number of processes to 2. That fixed a lot of memory problems on Travis.
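
A minimal sketch of that suggestion, assuming the number of test workers follows JULIA_CPU_CORES as in the invocation from the build log above. The snippet only prints the command line rather than running it, and the `workers`, `maxrss_mb`, and `cmd` variable names are illustrative:

```shell
workers=2        # down from the 4 used in the failing build
maxrss_mb=500    # same memory limit as in the failing build

# Dry run: compose and print the command instead of executing it.
cmd="env JULIA_CPU_CORES=$workers JULIA_TEST_MAXRSS_MB=$maxrss_mb make -C test"
echo "$cmd"
```

If the assumption holds, running the printed command from the Julia source tree would start the suite with the reduced worker count.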

@petercolberg, one problem with getting developer attention to fix this is that (if I remember correctly) you are using a nonstandard build (not LLVM 3.3). This bug does not crop up on standard builds, as witnessed by the successful completion of dozens of Travis runs each day. As another symptom of trouble, your test times seem appallingly long; mine are a tenth of what you're seeing. Of course there might be hardware differences, but the machine I'm running on is 2010 era or so.

@petercolberg
Contributor Author

@timholy Yes, the build is indeed non-standard in every way (LLVM 3.7 on i386).

I had a glimpse of hope that the hanging could be fixed though. I would like to see the test suite crash instantly in such cases, which would significantly ease debugging of test failures in general.

@KristofferC
Sponsor Member

Please comment if this is still an issue.

7 participants