OOM killer causing the CI EOFError on linux recently #11553

Closed
yuyichao opened this issue Jun 3, 2015 · 86 comments

Labels
status:priority This should be addressed urgently test This change adds or pertains to unit tests

Comments

@yuyichao
Contributor

yuyichao commented Jun 3, 2015

The real reason is the OOM killer (see the updated title and the comments below).

Original post:

With the patch below (the /dev/tty part might not be necessary), I got the following backtrace on a machine that can reproduce the EOFError issue (or what looks like it):

signal (8): Floating point exception
__gmp_exception at /usr/lib/libgmp.so (unknown line)
unknown function (ip: 1705129166)
__gmpz_powm at /usr/lib/libgmp.so (unknown line)
powermod at ./gmp.jl:424
powermod at ./gmp.jl:430
unknown function (ip: 1399079124)
jl_apply_generic at /home/yuyichao/project/julia/src/gf.c:1663
anonymous at ./test.jl:105
do_test_throws at ./test.jl:59
jl_apply_generic at /home/yuyichao/project/julia/src/gf.c:1663
do_call at /home/yuyichao/project/julia/src/interpreter.c:66
eval at /home/yuyichao/project/julia/src/interpreter.c:212
jl_toplevel_eval_flex at /home/yuyichao/project/julia/src/toplevel.c:539
jl_toplevel_eval_flex at /home/yuyichao/project/julia/src/toplevel.c:568
jl_load at /home/yuyichao/project/julia/src/toplevel.c:615
include at ./boot.jl:253
jl_apply_generic at /home/yuyichao/project/julia/src/gf.c:1663
runtests at /home/yuyichao/project/julia/test/testdefs.jl:198
unknown function (ip: 1711445219)
jl_apply_generic at /home/yuyichao/project/julia/src/gf.c:1663
jl_f_apply at /home/yuyichao/project/julia/src/builtins.c:473
anonymous at ./multi.jl:854
run_work_thunk at ./multi.jl:605
jl_apply_generic at /home/yuyichao/project/julia/src/gf.c:1663
anonymous at ./multi.jl:854
start_task at /home/yuyichao/project/julia/src/task.c:234
unknown function (ip: 0)
diff --git a/src/init.c b/src/init.c
index 06a58c7..a394b31 100644
--- a/src/init.c
+++ b/src/init.c
@@ -249,6 +249,12 @@ static int is_addr_on_stack(void *addr)

 void sigdie_handler(int sig, siginfo_t *info, void *context)
 {
+    int fd = open("/dev/tty", O_RDWR); /* send the crash report to the controlling terminal */
+    int oldfd1 = dup(1);               /* save the original stdout/stderr so they can be restored */
+    int oldfd2 = dup(2);
+    dup2(fd, 1);
+    dup2(fd, 2);
+    close(fd);
     if (sig != SIGINFO) {
         sigset_t sset;
         uv_tty_reset_mode();
@@ -269,6 +275,10 @@ void sigdie_handler(int sig, siginfo_t *info, void *context)
         sig != SIGINFO) {
         raise(sig);
     }
+    dup2(oldfd1, 1);                   /* restore the original stdout/stderr */
+    dup2(oldfd2, 2);
+    close(oldfd1);
+    close(oldfd2);
 }
 #endif

@@ -1201,6 +1211,9 @@ void jl_install_default_signal_handlers(void)
     sigemptyset(&act_die.sa_mask);
     act_die.sa_sigaction = sigdie_handler;
     act_die.sa_flags = SA_SIGINFO;
+    for (int i = 0; i < 32; i++) { /* try installing the die handler for signals 0-31 */
+        sigaction(i, &act_die, NULL);
+    }
     if (sigaction(SIGINFO, &act_die, NULL) < 0) {
         jl_errorf("fatal error: sigaction: %s\n", strerror(errno));
     }
@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

Although I wrote it in the title, I cannot guarantee that this is the issue behind those failures.

Also, is there a reason SIGFPE is only handled on Windows?

@tkelman
Contributor

tkelman commented Jun 3, 2015

Interesting. I've seen in gdb that a SIGFPE is raised during normal operation in the numbers test, but I think it's supposed to be caught and handled as a Julia exception.

@quinnj
Member

quinnj commented Jun 3, 2015

Possibly related to #11351? (a write error to read-only memory that is normally caught, except when run from include)

@tkelman
Contributor

tkelman commented Jun 3, 2015

If it's reproducible with that patch, can you try bisecting, applying the patch as needed at each step?

@tkelman tkelman added the status:priority This should be addressed urgently label Jun 3, 2015
@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

Sorry, I didn't notice that the signal handler was installed before my loop. Moving the loop to the front suppresses the SIGFPE printout, so I'm closing this for now.

The reason it is failing on my VPS is actually the OOM killer, which also sounds like a plausible candidate for the CI failures. What's the configuration of the Linux CI, and is there a memory log / dmesg output for it?

@yuyichao yuyichao closed this as completed Jun 3, 2015
@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

This is what I see in the output of dmesg.

[7464878.811504] Out of memory: Kill process 18793 (julia) score 156 or sacrifice child
[7464878.811509] Killed process 18793 (julia) total-vm:9373516kB, anon-rss:656052kB, file-rss:2452kB

It happens on a machine with 4 GB of memory (although ~1 GB is used by other processes).

@tkelman
Contributor

tkelman commented Jun 3, 2015

We have had situations where the cause of the CI failure was probably the OOM killer, but we were never absolutely positive about it. The resolution of some common-seeming failures at the time was to find particular tests that allocated a large amount of memory and comment them out (we should find those and move them to test/perf instead, or make a new test/torture directory).

Any suggestions for how to find out whether Travis is hitting the OOM killer, and/or how to make Julia give a better backtrace when it does? I suspect the VMs there don't have much memory, but I don't know enough details to work out how to debug it. We could open an issue at travis-ci/travis-ci and ask for help.

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

Well, the OOM killer should leave a message in the kernel log, and running dmesg at the end of the build should print it.

IIRC, the return code of the process also indicates whether it was killed.
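For example, here is a minimal sketch of checking that from Julia (current syntax; the command is hypothetical, and exitcode/termsignal are the fields of Base's Process type). A SIGKILL from the OOM killer shows up as termination signal 9, i.e. shell exit status 137.

p = run(ignorestatus(`julia test/runtests.jl`))  # don't throw on a non-zero status
if p.termsignal == 9
    println("child was SIGKILLed -- possibly by the OOM killer")
elseif p.termsignal != 0
    println("child died from signal $(p.termsignal)")
elseif p.exitcode != 0
    println("child exited with code $(p.exitcode)")
end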

@tkelman
Contributor

tkelman commented Jun 3, 2015

Well, it looks like instead of propagating the return code of the child processes we end up with an EOFError, so something isn't working right. I've never used dmesg before, but I'll make a test branch tk/dmesg that prints it at the end of the Travis run and see what happens.

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

Or, if the full output of dmesg is too much, dmesg | grep -i kill should be enough.

It should print something like

[7465405.462999] dhcpcd-run-hook invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[7465405.463109]  [<ffffffff81172cc6>] ? oom_kill_process+0xbe/0x380
[7465405.463839] Out of memory: Kill process 19094 (julia) score 155 or sacrifice child
[7465405.463845] Killed process 19094 (julia) total-vm:9401780kB, anon-rss:654728kB, file-rss:1636kB

@tkelman
Contributor

tkelman commented Jun 3, 2015

OOM ahoy https://travis-ci.org/JuliaLang/julia/jobs/65200677

Time to bust out --track-allocation?
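A rough sketch of what that could look like, assuming the tests are run once under julia --track-allocation=user (which writes per-line allocation byte counts into companion *.mem files next to each source file); the scanning helper below is hypothetical, not part of the test harness:

# Hypothetical helper: scan a directory tree for *.mem files produced by
# --track-allocation and return the source lines that allocated the most bytes.
function worst_alloc_lines(dir::AbstractString; top::Int = 10)
    hits = Tuple{Int,String}[]
    for (root, _, files) in walkdir(dir), f in files
        endswith(f, ".mem") || continue
        for (i, line) in enumerate(eachline(joinpath(root, f)))
            m = match(r"^\s*(\d+)", line)   # lines without a count start with "-"
            m === nothing && continue
            push!(hits, (parse(Int, m.captures[1]), "$f:$i"))
        end
    end
    sort!(hits; rev = true)                 # largest allocators first
    return hits[1:min(end, top)]
end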

@tkelman tkelman changed the title Floating point exceptions which seems to be causing the CI EOFError on linux recently OOM killer causing the CI EOFError on linux recently Jun 3, 2015
@tkelman tkelman reopened this Jun 3, 2015
@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

Apparently julia is not the only victim either. :P

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

So what's the roadmap for solving (or working around) this issue?

  • Get more RAM?
  • Run fewer processes?
  • Fix/move/remove tests that allocate a lot of memory?

And by "fix" I mean something like clearing the global variables and letting the GC free the memory.

@tkelman
Contributor

tkelman commented Jun 3, 2015

Get more RAM?

We could get slightly more resources by switching to Travis's Docker-based workers. But those don't allow you to use sudo, so we could no longer use the juliadeps PPA. We'd have to either download a generic nightly binary and extract the dependency .so files from there, or build the deps from source and cache them (which Travis lets you do on a Docker worker). I can help with this if someone wants to start experimenting in this direction.

Run fewer processes?

Worth trying. Would be a bit annoying to make Travis take twice as long as it does now, but generally we have to wait for the AppVeyor queue anyway so it might not be that bad.

Fix/move/remove tests that allocate a lot of memory?

This is probably the best option for now, assuming we can find which tests are allocating the most. There might be some egregious outliers. Dunno, I haven't profiled memory use in a while. You could print out memory remaining at the end of each test file as a coarse way of finding out which test is allocating the most. There might also be an underlying bug or regression in terms of memory consumption while running the tests, or maybe it's some newly added test[s] that push this over the edge to being killed more often.

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

I'll probably play with the memory consumption of the tests sometime later, unless someone gets annoyed enough to fix it first, which is extremely likely :).

I'm a little surprised that dmesg doesn't print all or most of the kernel log, as it does on Arch Linux now; probably a combination of an older version and not being root. In any case, I guess it's a good idea to add dmesg | grep -i kill || true (or just dmesg) to the master CI script as well, so that people know what's wrong (especially if the symptom changes again in the future).

@tkelman
Contributor

tkelman commented Jun 3, 2015

I'll add dmesg to Travis. We're not running on OS X Travis right now, but just to be sure, would that also work there?

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

Sorry, I have no idea about OS X... someone else probably knows it better (or could try it).

A quick Google search seems to suggest the command at least exists.

Edit: although I don't know whether there's an OOM killer on OS X or how it works.

tkelman added a commit that referenced this issue Jun 3, 2015
this may help at least identify OOM-killed failures, ref #11553

[av skip]
tkelman added a commit that referenced this issue Jun 3, 2015
this may help at least identify OOM-killed failures, ref #11553

[av skip]
(cherry picked from commit 93e672d)
@tkelman
Contributor

tkelman commented Jun 3, 2015

Apparently it needs sudo on OS X, and I'm guessing it might be much chattier with sudo on Linux. One more thing that'll need fixing whenever we try turning OS X Travis back on for master.

Thanks for editing the top post btw.

@amitmurthy
Contributor

Is this http://docs.travis-ci.com/user/speeding-up-the-build/ something to look into?

Specifically, split the test suite into 4 sets and run them concurrently on 4 Travis nodes, considering that the test suite will only grow over time.

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

I just did a simple test by printing out the resident set size (RSS) after each test finishes (patch below).

Clearing the globals does help a little, but with 4 workers the memory consumption of each one still inevitably grows to ~1 GB by the time the tests finish.

Is there an easy way to figure out what is taking that space? (e.g. is the GC not freeing some objects? is it the generated code? or is the GC not reusing space efficiently enough?)

diff --git a/test/testdefs.jl b/test/testdefs.jl
index e7ddcda..1cb9f11 100644
--- a/test/testdefs.jl
+++ b/test/testdefs.jl
@@ -2,10 +2,38 @@

 using Base.Test

+function isnotconst(m, name)
+    # taken from inference.jl
+    isdefined(m, name) && (ccall(:jl_is_const, Int32, (Any, Any), m, name) == 0)
+end
+
+function clear_globals()
+    m = current_module()
+    for name in names(m)
+        if isnotconst(m, name)
+            try
+                m.eval(:($name = nothing))
+            end
+        end
+    end
+end
+
+function print_mem_size()
+    pid = getpid()
+    open("/proc/$pid/statm") do fd
+        statm = split(readall(fd))
+        rss = parse(Int, statm[2]) * 4 # statm reports pages; assumes 4 KB pages
+        println("Ram ($pid): $rss KB")
+    end
+end
+
 function runtests(name)
     @printf("     \033[1m*\033[0m \033[31m%-20s\033[0m", name)
     tt = @elapsed Core.include(abspath("$name.jl"))
     @printf(" in %6.2f seconds\n", tt)
+    # clear_globals()
+    # gc()
+    print_mem_size()
     nothing
 end
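As a follow-up to the question above about what is holding on to the space, here is a rough sketch that ranks Main's non-constant globals by the bytes they keep alive. It relies on Base.summarysize, which only exists in Julia versions newer than the build discussed here, so it is illustrative rather than part of the patch:

# Illustrative only: rank the non-const globals of Main by retained bytes.
sizes = [(name, Base.summarysize(getfield(Main, name)))
         for name in names(Main; all = true)
         if isdefined(Main, name) && !isconst(Main, name)]
sort!(sizes; by = last, rev = true)
for (name, sz) in sizes[1:min(end, 20)]
    println(name, " => ", sz, " bytes")
end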

@tkelman
Contributor

tkelman commented Jun 3, 2015

@amitmurthy Yeah, we might want to try that, if we don't mind our queue time going up accordingly. That would also be a valid approach for bringing back OS X Travis and hopefully keeping it under the time limit. The downside is that compiling Julia's C code and sysimg from source takes up a significant portion of the CI runtime, and I think that would end up being repeated in each matrix build unless we tried some really clever caching logic.

@amitmurthy
Contributor

If the 1 GB per worker is due to compiled code, it may be worth trying out something like

for test_set in test_sets
  addprocs(n)
  run test_set
  rmprocs(workers)
end

Where each test_set is a subset of the current large number of tests.
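A minimal runnable sketch of that idea (the names test_sets, runtests, and n are hypothetical: a partition of the test list, the function from test/testdefs.jl, and the worker count). Recycling the worker pool between batches means compiled code and leftover globals cannot accumulate across the whole run; on current Julia these functions live in the Distributed stdlib, while on the Julia of this thread they were in Base:

for test_set in test_sets
    pids = addprocs(n)                  # fresh workers for this batch
    @everywhere include("testdefs.jl")  # define runtests on the new workers
    pmap(runtests, test_set)            # run the batch in parallel
    rmprocs(pids)                       # drop the workers and free their memory
end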

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

@amitmurthy Yeah. I guess that should fix any kind of "memory leak" as well. It should be fine to do that for the CI, but I'm also interested in whether there are other problems that need to be fixed.

@timholy
Sponsor Member

timholy commented Aug 31, 2015

Any chance of using the "containers" infrastructure (http://docs.travis-ci.com/user/workers/container-based-infrastructure/)? There is more memory available there (4 GB rather than 3). Presumably our use of sudo makes this rather hard, however.

@stevengj
Member

stevengj commented Sep 1, 2015

Because this is an urgent issue (lots of PRs now have failing builds because of this), my feeling is that we should just set JULIA_CPU_CORES=2 (as in #12855) for now, and swallow the longer Travis times until there is a better solution.

@tkelman
Contributor

tkelman commented Sep 1, 2015

Any chance of using the "containers" infrastructure

Not easily: we do a lot of apt-get work from the juliadeps PPA, which we would need to get whitelisted first, and the set of i386 packages we install to make 32-bit Linux builds work would likely be a challenge, since many of them conflict with the normal 64-bit packages. I also don't think Travis lets you do a matrix build with the apt-get addon (which you need to use for package installation on the container setup). In short, I haven't tried this and haven't had time to look into it in detail.

@stevengj
Member

stevengj commented Sep 1, 2015

Okay, I merged JULIA_CPU_CORES=2 as a temporary fix. Slow Travis is better than continual failures.

@JeffBezanson JeffBezanson removed the status:priority This should be addressed urgently label Sep 17, 2015
@yuyichao
Contributor Author

This is starting to happen again... https://travis-ci.org/JuliaLang/julia/jobs/84749848 😢

@stevengj
Member

Re-adding the priority label since we are starting to see continual failures again... We can set JULIA_CPU_CORES=1, of course, but we clearly need a longer-term fix.

@timholy
Sponsor Member

timholy commented Oct 12, 2015

If JC can spare some cash, would it be worth considering a paid plan? (I note that more memory does not seem to be a feature offered in the plans here, but I imagine one could ask...) There's been enough time lost to this issue that ponying up might be expedient and cost-effective.

@StefanKarpinski
Sponsor Member

JC can fund this if Travis offers it and the price is reasonable.

@stevengj
Member

I asked the Travis support people, and got the following response:

We don't offer paid plans on travis-ci.org (public repositories/builds) at the moment. This is something we are looking into, but due to the requirements this might take a while.

In the meantime, have you considered running these builds on our GCE/Trusty beta infrastructure? There are 7.5 GB of memory available.

@yuyichao
Contributor Author

Closing since this should be fixed by #13577 now. Adding the benchmark label since I think we should also have a benchmark to keep track of memory usage. (Feel free to remove the label if there's already one for this purpose.)

@KristofferC KristofferC removed the kind:potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 8, 2018