
Random CI failures rather intense right now #9763

Closed · IainNZ opened this issue Jan 14, 2015 · 56 comments

@IainNZ (Member) commented Jan 14, 2015

(spawned from #9679 to get more eyes)

In #9679 I merged some changes to the tests (splitting test/collections.jl into test/sets.jl and test/dicts.jl, and moving some lines around in the base/sets.jl file, mainly to group methods together). Travis CI Linux failed, but it worked locally, on AppVeyor, and on Travis CI OSX, so I merged anyway.

Since then, perhaps by coincidence, we've been getting seemingly random CI failures - many of them seem to be occurring on the workers running the sets tests but not all of them.

No one as far as I know has been able to reproduce locally, and I'm way out of my depth :D

@tkelman (Contributor) commented Jan 14, 2015

The sets failure is a new failure mode on top of #9544 (best guess), which has fairly often been causing assertion failures on OSX Travis and codegen segfaults on AppVeyor for a few weeks, and longer-standing pre-existing ones like #9176, #9501, and #7942-esque timeouts.

All of this is adding up to CI being actively detrimental instead of helpful right now, when maybe 50% of commits or PRs fail for completely unrelated reasons that we can't reproduce locally.

If we have a brave/confident volunteer, we can contact Travis and ask for ssh access to a worker VM for 24 hours to do as much debugging and information gathering as we can.

@vtjnash (Sponsor Member) commented Jan 14, 2015

I'm pretty sure we need to disable that raise(2) test for the time being (and eventually fix the signal handlers to use sigwait instead of sigaction, or something like that).

@eschnett (Contributor)

I believe the set failures may be caused by IntSet. The resolution of #8570 seems fishy: This supposedly adds the capability to add int(2^32) to an IntSet. This has two problems:

  • This IntSet then has 2^32+1 entries, requiring a lot of storage (see the rough math after the suggestions below). This isn't the use case IntSet was designed for, and it may make Travis sufficiently unhappy to abort randomly.
  • On a 32-bit platform, IntSets can have at most 2^32-1 entries, likely leading to memory corruption. (But I think we're saved by the wrap-around of 2^32==0 on 32-bit platforms.)

I suggest:

  • Use Int64 to count elements of IntSets; after all, their elements are explicitly Int64 anyway. Also, add explicit overflow checks.
  • Remove this test case, as it uses too much memory.
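
For reference, a rough sketch of the storage math (assuming IntSet keeps one bit per possible element, as its bit-vector implementation does):

nbits  = 2^32 + 1              # must cover elements 0 through 2^32
nbytes = ceil(Int, nbits / 8)  # ≈ 512 MB of bits, even though only one element is stored

So a single push!(s, 2^32) forces roughly half a gigabyte of allocation, independent of how many elements the set actually holds.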

@mbauman (Sponsor Member) commented Jan 14, 2015

This IntSet then has 2^32+1 entries, requiring a lot of storage.

Interesting. That ends up with a 768MB array for storage. Big, but not ridiculous. Are there other big allocations in the tests? A quick grep for large exponentiations shows that linalg1 creates a 300MB array, but there may be others I missed. How much RAM do typical Travis machines have? That could explain why nobody is reproducing this at home, if the Travis machines have much less RAM than our computers typically do.

I don't have a 32-bit system handy, but I'm pretty sure IntSets can handle elements up to Int64(2)^36 or so, since there are 2^5 bits per element of the array.

@vtjnash (Sponsor Member) commented Jan 14, 2015

didn't mean to push this directly to master, but oh well, lots of changes here:
69b84e9

including one pretty serious bug:
69b84e9#diff-669d4cc5c9c8f4573c5f8d57f5dcab20R1382
It's a pity libssp didn't catch that one (on AppVeyor, that is, since Ubuntu seems to have disabled it; it would be great if Travis could re-enable it for the system gcc there).

@tkelman (Contributor) commented Jan 14, 2015

if travis machines have much less RAM than our computers typically do

This is very likely. A while back I had some instrumentation code that ran on AppVeyor and was showing our tests taking up all the memory (and timing out) there, but that was before they rolled out the higher-performance Pro environment that we've been using. If anyone wants to make an experimental branch/test PR and do the same basic thing on Travis (print remaining memory after each test file), that could be interesting to look at.
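
A minimal sketch of that experiment (hypothetical loop; the real driver is test/runtests.jl, and the Linux workers have the free utility available):

for t in tests                 # `tests`: the list of test file names (hypothetical here)
    runtests(t)                # a per-file runner like the helper in test/testdefs.jl
    println("memory after $t:")
    run(`free -m`)             # remaining system memory, in MB
end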

@timholy (Sponsor Member) commented Jan 14, 2015

IIRC there's a big allocation somewhere in reduce or reducedim.

We went through a period where make testall would sometimes (10%?) force-reboot my laptop. Don't know if that's still a problem, because I've largely stopped running it locally on my laptop for that reason.

@tkelman (Contributor) commented Jan 14, 2015

Was the laptop maybe overheating from pegging all cores at 100%? Mine does that, so I have export JULIA_CPU_CORES=4 in my .bashrc

@timholy (Sponsor Member) commented Jan 14, 2015

I actually have JULIA_CPU_CORES=2. It's 2 physical cores with hyperthreading, so by default julia runs the tests with 4 workers, and that makes it impossible to get any other work done while the tests run. I've never tried to monitor the temperature while this is happening, though.

@tkelman (Contributor) commented Jan 14, 2015

core test still segfaulting https://travis-ci.org/JuliaLang/julia/jobs/46980567

@eschnett (Contributor)

With 9 workers, if several of them allocate 500 MByte each, the tests combined may require several GByte. Whether they all need that memory at the same time depends on timing coincidences.

Can we introduce an environment variable that specifies the maximum amount of memory that Julia should use? This can be checked in the allocator, and we'd get a nice error message (with backtrace) if Julia uses too much memory. If the operating system's memory limit is reached, then the Julia process may be aborted before it can output a backtrace.

@vtjnash (Sponsor Member) commented Jan 14, 2015

That's a very Java-like thought.

It's not very nice of the OS to accept our malloc requests but then send the OOM killer after our app when we try to use the memory. If that's really the problem, we might want to ask Travis to change the kernel's overcommit allowance.
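
(For reference, the relevant kernel knob is vm.overcommit_memory; a sketch of checking it from the test harness on a Linux worker, assuming changing it is even allowed there:)

run(`cat /proc/sys/vm/overcommit_memory`)   # 0 = heuristic overcommit (the default)
# `sysctl vm.overcommit_memory=2` switches to strict accounting, so malloc fails
# up front instead of the OOM killer striking after the memory is touched.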

@tkelman: I opened a new issue specifically for that assertion failure. I currently suspect it may be the convert-pocalypse.

@aviks (Member) commented Jan 14, 2015

@vtjnash I've often seen crashes running julia in low-memory situations when trying to allocate a large array; just a Killed on the console. It usually happens when I've made a mistake configuring the amount of memory in a VM.

And yes, I am a Java programmer, but I don't think the kernel knows that.. yet. :)

@staticfloat (Sponsor Member)

Interesting. I usually get MemoryError() when I have a buildbot with too little memory. (That usually happens because an LLVM SVN test hangs and then I get multiple julia processes just sitting there eating up memory.)


@aviks (Member) commented Jan 14, 2015

I investigated a little more where my errors were coming from, and should probably clarify. Using the plain array constructor does throw a MemoryError correctly, but reading a large CSV file into memory kills the process. The following is on a recent 0.4-dev build, on a 2 GB Linux VM. The file I'm trying to read contains 63K floating point values, one per row, weighing in at around 1 GB.

vagrant@vagrant-ubuntu-trusty-64:~$ julia/usr/bin/julia 
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+2558 (2015-01-08 07:21 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 2bb647a* (6 days old master)
|__/                   |  x86_64-linux-gnu

julia> a=Array(Float64,1000,1000,1000)
ERROR: MemoryError()
 in call at base.jl:260

julia> readcsv("input_SPECFUNC_BASELINE.csv")
Killed
vagrant@vagrant-ubuntu-trusty-64:~$

readcsv uses mmap, which sounds like a likely culprit, and sure enough:

julia> readcsv("input_SPECFUNC_BASELINE.csv", use_mmap=false)
ERROR: MemoryError()
 in call at datafmt.jl:148
 in readdlm_string at datafmt.jl:249
 in readdlm_string at datafmt.jl:273
 in readdlm_auto at datafmt.jl:57
 in readdlm at datafmt.jl:47
 in readdlm at datafmt.jl:45
 in readcsv at datafmt.jl:485

@eschnett (Contributor)

@vtjnash Once the operating system decides that it doesn't want to give us any more memory, the Julia process is in a bad state. It's not clear at all that it can still generate and output a backtrace at this point. It doesn't matter whether the OS tells the Julia process via a segfault, or via returning NULL from malloc.

To avoid this, the Julia process needs to abort itself before it runs out of memory. Hence an environment variable. Alternatively, call getrlimit and getrusage, and abort when "not much" is left.
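
A minimal sketch of that idea in Julia (assumptions: a 64-bit Linux layout for struct rusage, and the environment variable name JULIA_TEST_MAXRSS_MB is hypothetical):

function maxrss_mb()
    ru = zeros(Int64, 18)                  # 2 timevals (4 words) + 14 longs on 64-bit Linux
    ccall(:getrusage, Cint, (Cint, Ptr{Int64}), 0, ru)  # 0 == RUSAGE_SELF
    return ru[5] ÷ 1024                    # ru_maxrss is the 5th word, reported in KB on Linux
end

function check_memory_budget()
    limit = get(ENV, "JULIA_TEST_MAXRSS_MB", "")
    isempty(limit) && return
    used = maxrss_mb()
    used > parse(Int, limit) && error("memory budget exceeded: $(used) MB resident")
end

Called periodically from the test driver, this would fail loudly (with a backtrace) while the process is still healthy, instead of being killed by the OS.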

@vtjnash (Sponsor Member) commented Jan 15, 2015

why does it help to abort yourself a random amount of time before you would be notified that the system does not want to honor your malloc request?

@eschnett (Contributor)

@vtjnash It helps for two reasons:

First, at this point you can still get a meaningful backtrace. (If you could get a good backtrace at the moment the OS or malloc complains, that would of course be better.)

It also helps to find out who the culprit is. If there are 10 workers running simultaneously and one of them requires much more memory than the others, then with an explicit check in Julia one can catch that worker and track down where the memory allocation occurs. Otherwise, the OS will abort the first process that allocates memory once the memory is exhausted, and that may not be the one using too much memory.

If we can get a process map from the OS that tells us which processes were running and how much memory each had allocated when it aborts one of the workers, then that's better.

@ihnorton (Member)

If the OOM killer is killing the process, shouldn't we see a SIGKILL?

@vtjnash (Sponsor Member) commented Jan 15, 2015

We print a process exited exception, but we don't print the error code. Perhaps we should? With SIGKILL, the child process doesn't get time to clean up, but the parent process here could be more informative.
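
A minimal sketch of how the parent could report this (the Process fields shown exist in Base; proc is a hypothetical handle to the dead worker's process):

if proc.termsignal != 0
    warn("worker killed by signal $(proc.termsignal)")  # 9 (SIGKILL) would point at the OOM killer
else
    warn("worker exited with code $(proc.exitcode)")
end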

Otherwise, the OS will abort the first process that allocates memory once the memory is exhausted, and that may not be the one using too much memory.

That's not how the Linux OOM killer works, nor how malloc is supposed to work (but you need to change a kernel flag to make it behave the way the POSIX spec says it should).

@eschnett (Contributor)

If we are running p workers on a system with M amount of memory, and want to ensure that each worker uses no more than M/p memory, why should we wait for the OS or libc to tell us something went wrong?

But maybe your argument isn't with the mechanism, but rather with the choice whether one wants to impose a memory limit at all. I think we should impose one: self-tests should run with a "reasonable" amount of memory, and we need to define what "reasonable" means, and we need to catch those tests that accidentally use more memory.

I don't care about the mechanism one uses for this. But running two large-memory tests that may or may not run simultaneously, and that may or may not lead to OOM, is a situation that leads to random failure and is difficult to debug.

My use case here is a test that used 540 MByte of memory, to store a bit set that had 1 bit set. Maybe this was on purpose, maybe it was an accident because the user was not aware of how the bit set was stored.

@ihnorton (Member)

We print a process exited exception, but we don't print the error code. Perhaps we should? With SIGKILL, the child process doesn't get time to clean up, but the parent process here could be more informative.

Yes, and/or we could have our travis script dump OOM messages from the log files:
http://unix.stackexchange.com/questions/128642/debug-out-of-memory-with-var-log-messages
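
A hedged sketch of that idea in the test driver itself (the exact dmesg wording varies by kernel, so the grep strings are only a guess):

log = readall(`dmesg`)          # Julia 0.4-era API; newer Julia would use read(`dmesg`, String)
for line in split(log, '\n')
    if contains(lowercase(line), "out of memory") || contains(lowercase(line), "killed process")
        println("possible OOM event: ", line)
    end
end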

@tkelman (Contributor) commented Jan 16, 2015

Things are looking a little more stable now since 3f4be47 and #9766, though both of those disabled some tests that we should figure out how to put back, either by fixing deeper underlying bugs or by moving them to a stress-test target in test/perf.

@tkelman (Contributor) commented Feb 15, 2015

I think we still have at least one outstanding bug and a commented-out test that has yet to be moved to stress tests in perf.

And we're starting to see a new-seeming intermittent failure more often than usual; see https://s3.amazonaws.com/archive.travis-ci.org/jobs/50792847/log.txt for an example. The parallel test fails with ERROR: write: operation canceled (ECANCELED)? Workers 2, 3, and 9 were terminated there: 2 was running serialize, 3 was running examples, and 9 was waiting in parallel. There have been some changes to parallel functionality and/or testing recently, and there's an open PR that changes serialization, but it hasn't been merged yet.

@tkelman (Contributor) commented Feb 25, 2015

Another odd parallel failure on Win64, ENOBUFS in https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2773/job/kstlvdmelslggo3p

@amitmurthy (Contributor)

It is probably due to these lines https://github.com/JuliaLang/julia/blob/master/test/parallel.jl#L217-L233

It will require an extra 240 MB (80 local, 80 remote, 80 for the returned result) at a minimum; it could be more. Do you think that could be a problem?
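
Roughly what those lines amount to (hypothetical sizes, using the 0.4-era remotecall_fetch argument order and assuming a worker with id 2 exists):

a = rand(10^7)                        # ~80 MB of Float64 on the caller
b = remotecall_fetch(2, identity, a)  # another ~80 MB on worker 2, plus ~80 MB for the fetched copy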

@tkelman (Contributor) commented Feb 25, 2015

Yeah, especially toward the end of the test run the CI VMs could easily be hitting the OOM killer - we saw that earlier with a few tests that were commented out last month. Hugely memory-intensive tests should move out of the regular CI suite and into a not-run-as-often (but hopefully still once in a while, nightly or a few times a week?) stress test in perf.

@amitmurthy (Contributor)

Any chance of getting beefier machines? I would really like to keep them in the regular CI, at least until we have a stress/perf suite running on a regular schedule.

I'll tweak those numbers in parallel.jl a bit lower anyway.

@amitmurthy (Contributor)

Good idea!

Maybe, more like

 - addprocs(N)
 - julia runtests-in-parallel (including CPU intensive tests)
 - rmprocs()
 - julia run-memory-intensive-tests-serially

But we are already near the time limit on Travis OSX; serial execution may just push it over.

@ivarne (Sponsor Member) commented Feb 25, 2015

Time limits are a different problem, one that can be fixed by improving performance, writing fewer tests, or prioritizing which tests to run in different situations. Most commits have a very small probability of breaking the high-memory tests anyway.

We should at least not leave the decision to the OOM killer.

@nalimilan (Member)

Maybe workers should be killed and restarted for each test? IIUC, memory usage grows because the tests exercise many different code paths, which generates a lot of compiled code that stays cached. If we started from scratch for each test instead of accumulating all of that, memory usage would probably be much lower.
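
A minimal sketch of that approach (hypothetical driver loop; runtests is the per-file helper that test/testdefs.jl defines, and tests is the list of test files):

for t in tests
    p = addprocs(1)[1]                  # fresh worker: no accumulated compiled code
    @everywhere include("testdefs.jl")  # make the runtests helper available on the new process
    remotecall_fetch(p, runtests, t)    # 0.4-era argument order: (worker, f, args...)
    rmprocs(p)
end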

@tkelman (Contributor) commented Apr 19, 2015

Aside from the OSX timeouts, things had been sorta stable for a while here. But the tuple change looks like it's introduced 3 new intermittent failures on CI:

replcompletions #10875
markdown #10380 (comment)
dict #10380 (comment)

@mbauman (Sponsor Member) commented Apr 20, 2015

The easiest and most obvious nondeterminism in our CI systems is the way in which the different tests get split between workers. Has anyone looked to see if some of these failure modes always happen with a certain combination of tests on the same worker?

@tkelman (Contributor) commented Apr 20, 2015

I'm getting the dict failure to happen reliably with a win32 source build via make testall1 at 2a81411. Different commits would pass dict and fail at replcompletions. Anybody have a list-of-tests bisecting script handy?
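
Not a ready-made script, but a quick-and-dirty bisection sketch (hypothetical; suspects is whatever list of test files precedes the failing dict run, and the trailing 1 forces a single worker as in make testall1):

suspects = ["linalg/cholesky", "core"]    # placeholder candidates
needed = []
for i in 1:length(suspects)
    subset = [suspects[1:i-1]; suspects[i+1:end]; "dict"]
    passed = try Base.runtests(subset, 1); true catch; false end
    passed && push!(needed, suspects[i])  # dropping this test made dict pass again
end
println("tests needed to reproduce the failure: ", needed)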

@timholy (Sponsor Member) commented Apr 20, 2015

Linux 64-bit, make testall1 passed.

Sorry @tkelman, I don't have such a script, but choosetests.jl will at least make that a lot easier than it would have been before.

@tkelman (Contributor) commented Apr 20, 2015

I've somewhat reduced the dict failure: removing any one of the following tests causes it to pass.

julia> Base.runtests(["linalg/pinv", "linalg/givens", "linalg/cholesky", "linalg/lu", "linalg/symmetric", "linalg/arnoldi", "core", "dict"], 1)
     * linalg/pinv          in  13.81 seconds
     * linalg/givens        in   2.65 seconds
     * linalg/cholesky      in  34.06 seconds
     * linalg/lu            in   7.95 seconds
     * linalg/symmetric     in   4.42 seconds
     * linalg/arnoldi       in  12.37 seconds
     * core                 in  11.90 seconds
     * dict                exception on 1: ERROR: LoadError: test failed: isa([k for k = filter(x->begin  # dict.jl, line 84:
                    length(x) == 1
                end,collect(keys(_d)))],Vector{Any})
 in expression: isa([k for k = filter(x->begin  # dict.jl, line 84:
                    length(x) == 1
                end,collect(keys(_d)))],Vector{Any})
 in error at error.jl:19
 in default_handler at test.jl:27
 in do_test at test.jl:50
 in runtests at D:\cygwin64\home\Tony\julia32\usr\share\julia\test\testdefs.jl:77
 in anonymous at multi.jl:626
 in run_work_thunk at multi.jl:587
 in remotecall_fetch at multi.jl:675
 in anonymous at task.jl:1386
while loading dict.jl, in expression starting on line 84
ERROR: LoadError: LoadError: test failed: isa([k for k = filter(x->begin  # dict.jl, line 84:
                    length(x) == 1
                end,collect(keys(_d)))],Vector{Any})
 in expression: isa([k for k = filter(x->begin  # dict.jl, line 84:
                    length(x) == 1
                end,collect(keys(_d)))],Vector{Any})
 in error at error.jl:19
 in default_handler at test.jl:27
 in do_test at test.jl:50
 in runtests at D:\cygwin64\home\Tony\julia32\usr\share\julia\test\testdefs.jl:77
 in anonymous at multi.jl:626
 in run_work_thunk at multi.jl:587
 in remotecall_fetch at multi.jl:675
 in anonymous at task.jl:1386
while loading dict.jl, in expression starting on line 84
while loading D:\cygwin64\home\Tony\julia32\usr\share\julia\test\runtests.jl, in expression starting on line 3

ERROR: A test has failed. Please submit a bug report (https://github.com/JuliaLang/julia/issues)
including error messages above and the output of versioninfo():
Julia Version 0.4.0-dev+4388
Commit 2a81411* (2015-04-20 15:23 UTC)
Platform Info:
  System: Windows (i686-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

 in error at error.jl:19
 in runtests at interactiveutil.jl:400

@tkelman (Contributor) commented Apr 20, 2015

This might be a stupid question, but where is isa defined? Neither methods(isa) nor @which isa([k for k in filter(x->length(x)==1, collect(keys(_d)))], Vector{Any}) works.

@JeffBezanson (Sponsor Member)

In builtins.c.

@simonster (Member)

@tkelman (Contributor) commented Apr 20, 2015

Ah, thanks. Do we have a "reflection doesn't work on builtins" issue?

So if I add a few lines to test/dict.jl before the offending test

_d = Dict("a"=>0)
println(typeof([k for k in filter(x->length(x)==1, collect(keys(_d)))]))
println(typeof([k for k in filter(x->length(x)==1, collect(keys(_d)))]) <: Vector{Any})
@test isa([k for k in filter(x->length(x)==1, collect(keys(_d)))], Vector{Any})

and rerun the same combination of preceding tests I get

     * dict                Array{Any,1}
true
exception on 1: ERROR: LoadError: test failed:

I hope someone else finds a platform/commit combination where they can reliably reproduce this, because I'm stumped. Will see if I can get subsets of tests that cause the other failures.

@timholy (Sponsor Member) commented Apr 20, 2015

That sequence passes for me on 64-bit Linux (unfortunately).

If the tests are failing on Travis, I know from experience that one can get 24-hour direct access.

@tkelman (Contributor) commented Apr 20, 2015

Aha - using this Dockerfile (#9153 (comment)) to build a 32-bit Julia on 64-bit Ubuntu Precise (similar to what we do on Travis), I can reproduce the dict failure, so others should be able to as well. Since it uses system packages for everything, it should only take 10-15 minutes to build.

Edit: it fails with Base.runtests("all", 1), but the smaller set of tests from the Windows run doesn't show a problem.

@timholy (Sponsor Member) commented Apr 20, 2015

Got the same filter failure here: https://travis-ci.org/JuliaLang/julia/jobs/59263844. But I'm going to restart it, so for the record: that worker executed the following tests:

  • linalg/lapack
  • linalg/diagonal
  • linalg/symmetric
  • core
  • dict (failure)

@JeffBezanson (Sponsor Member)

@tkelman I'd love to get your dockerfile working. I've hit a point in the build where it's saying Temporary failure resolving 'archive.ubuntu.com' repeatedly. Any advice? I'm using ubuntu 14.04.

@tkelman (Contributor) commented Apr 20, 2015

Is it right off the bat on the first apt-get update line? What version of docker? You might need to do some of the forwarding/networking setup from https://docs.docker.com/installation/ubuntulinux/ depending on how your network is configured...

@tkelman (Contributor) commented Apr 20, 2015

Once docker is installed the actual steps I run are:

curl -LO https://gist.githubusercontent.com/tkelman/63c2ce16dd2863cae17a/raw/ae7cc7ef82789d19eeb127db70fa2ff198aa694f/Dockerfile
sudo docker build .
# copy-paste the container code from "Successfully built ....."
sudo docker run -t -i $CONTAINER /bin/bash
usr/bin/julia
julia> Base.runtests("all", 1)

@JeffBezanson (Sponsor Member)

Excellent, thanks for the pointer. Giving docker a DNS server fixed it.

@JeffBezanson (Sponsor Member)

Hmm, now I get this:

Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/libg/libgcrypt11/libgcrypt11_1.5.0-3ubuntu0.3_amd64.deb  404  Not Found [IP: 91.189.91.15 80]
Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/libt/libtasn1-3/libtasn1-3_2.10-1ubuntu1.2_amd64.deb  404  Not Found [IP: 91.189.91.15 80]
Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/g/gnutls26/libgnutls26_2.12.14-5ubuntu3.8_amd64.deb  404  Not Found [IP: 91.189.91.15 80]
Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/libx/libx11/libx11-data_1.4.99.1-0ubuntu2.2_all.deb  404  Not Found [IP: 91.189.91.15 80]
Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/libx/libx11/libx11-6_1.4.99.1-0ubuntu2.2_amd64.deb  404  Not Found [IP: 91.189.91.15 80]
Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/libx/libxext/libxext6_1.3.0-3ubuntu0.1_amd64.deb  404  Not Found [IP: 91.189.91.15 80]
Unable to correct missing packages.

I'm guessing this is due to running a different version of Ubuntu?

@tkelman (Contributor) commented Apr 20, 2015

The version of Ubuntu that you run docker from shouldn't matter much (and FWIW I'm also using 14.04 as the host); the version that runs inside the container comes from the first line of the Dockerfile, so it should always be 12.04 here.

@staticfloat (Sponsor Member)

You might try running apt-get update so that it's not looking for old versions of packages to download? Shot in the dark.


@tkelman (Contributor) commented Apr 20, 2015

The Dockerfile already does apt-get update twice: once first thing, then again after adding the julia-deps PPA.

@JeffBezanson (Sponsor Member)

Running the container build with --no-cache fixed it!

@tkelman (Contributor) commented Apr 23, 2015

Oh, right: the Docker union filesystem was caching layers from broken previous runs.

@JeffBezanson (Sponsor Member)

We have an evolving set of issues on CI. I'll close this in favor of more timely issues.
