
FreeBSD CI workers got ENOMEM randomly #23143

Closed
iblislin opened this issue Aug 5, 2017 · 16 comments · Fixed by #35040
Labels
system:freebsd Affects only FreeBSD

Comments

@iblislin
Member

iblislin commented Aug 5, 2017

ERROR (unhandled task failure): write: not enough memory (ENOMEM)
Stacktrace:
 [1] try_yieldto at ./event.jl:189 [inlined]
 [2] wait() at ./event.jl:248
 [3] uv_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:811
 [4] unsafe_write(::Base.PipeEndpoint, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:832
 [5] write(::Pipe, ::String) at ./strings/string.jl:71
 [6] write_cookie(::Base.Process) at ./distributed/cluster.jl:1120
 [7] launch(::Base.Distributed.LocalManager, ::Dict{Any,Any}, ::Array{WorkerConfig,1}, ::Condition) at ./distributed/managers.jl:328
 [8] (::Base.Distributed.##39#42{Base.Distributed.LocalManager,Dict{Any,Any}})() at ./event.jl:73

But the available memory observed from top on each worker looks sufficient.

Not sure whether this is a FreeBSD-only issue or not.
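
For context, the stack trace above is the local-worker launch path: LocalManager.launch spawns each worker over pipes and write_cookie then writes the cluster cookie to the worker's stdin, which is where uv_write fails. A minimal sketch that exercises the same path (on 0.6 this all lives in Base.Distributed, so the using line is only needed on newer versions):

using Distributed        # not needed on 0.6, where addprocs is in Base
addprocs(4)              # launch 4 local workers; each launch writes the cookie over a pipe
pmap(x -> x^2, 1:100)    # some parallel work, as in the parallel/distributed tests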

Ref builds:

  1. https://julia.iblis.cnmc.tw/#/builders/1/builds/903
  2. https://julia.iblis.cnmc.tw/#/builders/1/builds/887
  3. https://julia.iblis.cnmc.tw/#/builders/1/builds/891
  4. https://julia.iblis.cnmc.tw/#/builders/1/builds/772
    ...
@ararslan
Member

ararslan commented Aug 5, 2017

I believe @fredrikekre said he's seen this happen locally on Ubuntu as well. This is triggered by the parallel tests, right?

@fredrikekre
Member

I think it was the same, but I can't give any more info until I'm back at work in a couple of weeks, since I have only seen it there.

@iblislin
Member Author

iblislin commented Aug 7, 2017

I bumped JULIA_TEST_MAXRSS_MB to 1000 now.
Hope it helps. 🙏
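
For reference, a rough sketch of how the limit gets applied when kicking off the tests; the make invocation below is only a placeholder for whatever command the buildbot actually runs:

# Sketch: the test harness reads JULIA_TEST_MAXRSS_MB from the environment
# and uses it to decide when to recycle test workers.
withenv("JULIA_TEST_MAXRSS_MB" => "1000") do
    run(`make test`)   # placeholder for the actual CI test invocation
end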

@iblislin
Member Author

iblislin commented Aug 7, 2017

Some build reports don't contain the test suite statistics.

Is there any unreasonable memory usage in that test suite?

@ararslan
Member

ararslan commented Aug 7, 2017

This is odd; I haven't seen this locally on FreeBSD with JULIA_TEST_MAXRSS_MB=600 on a machine with fairly limited memory. It seems like setting it to 600 would be more likely to trigger ENOMEM than 1000?

@iamed2
Contributor

iamed2 commented Aug 7, 2017

I have seen this when SIGINTing while doing parallel work with Dispatcher. Maybe processes are being killed?

@iblislin
Member Author

iblislin commented Aug 8, 2017

dmesg keeps showing this:

Aug  8 06:17:17 ionic kernel: kern.ipc.maxpipekva exceeded; see tuning(7)
Aug  8 06:17:19 ionic last message repeated 2 times
Aug  8 06:17:20 ionic kernel: pid 82904 (julia), uid 1001: exited on signal 6 (core dumped)
Aug  8 06:17:21 ionic kernel: kern.ipc.maxpipekva exceeded; see tuning(7)
Aug  8 07:34:07 ionic kernel: kern.ipc.maxpipekva exceeded; see tuning(7)
Aug  8 07:34:10 ionic last message repeated 3 times
Aug  8 08:52:07 ionic kernel: kern.ipc.maxpipekva exceeded; see tuning(7)
Aug  8 08:52:10 ionic last message repeated 3 times
Aug  8 10:10:26 ionic kernel: kern.ipc.maxpipekva exceeded; see tuning(7)
Aug  8 10:10:29 ionic last message repeated 3 times

My local machine also has these messages, but does not raise any ENOMEM.

I have seen this when SIGINTing while doing parallel work with Dispatcher. Maybe processes are being killed?

Any suggestions or tools for observing that?
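
One low-tech way to observe it would be to poll the pipe-KVA sysctls that the dmesg messages point at; a rough sketch, with Julia shelling out to sysctl on FreeBSD:

# Sketch: periodically print pipe KVA usage against its limit.
# (On 0.6, use readstring(cmd) instead of read(cmd, String).)
getsysctl(name) = parse(Int, strip(read(`sysctl -n $name`, String)))
while true
    used  = getsysctl("kern.ipc.pipekva")
    limit = getsysctl("kern.ipc.maxpipekva")
    println("pipekva: $used / $limit bytes")
    sleep(5)
end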

@iblislin
Member Author

iblislin commented Aug 8, 2017

After bumping the MAXRSS limit up, the frequency of ENOMEM went down ⬇️, but freezing went up ⬆️.

Is this ENOMEM related to the freezing? 😱

@iblislin iblislin mentioned this issue May 25, 2018
@mbauman
Sponsor Member

mbauman commented Oct 2, 2019

I don't think we're seeing this anymore — is this still relevant with the new buildbot infrastructure?

@ararslan
Member

ararslan commented Oct 2, 2019

We still get freezes occasionally, but not as often as we used to, and I haven't seen the ENOMEM in a long time, probably since we switched to the new buildbots.

@ViralBShah
Member

Reopen if necessary?

@Keno
Member

Keno commented Mar 4, 2020

@Keno Keno reopened this Mar 4, 2020
@Keno
Member

Keno commented Mar 4, 2020

The FreeBSD kernel behavior here seems odd to me. I'll reach out to the kernel devs.

@Keno
Member

Keno commented Mar 4, 2020

https://lists.freebsd.org/pipermail/freebsd-hackers/2020-March/055714.html

@Keno
Member

Keno commented Mar 4, 2020

@staticfloat tells me our current kern.ipc.maxpipekva is 512MB on the test machine, which seems like quite a bit, but maybe we can bump it to 1GB if we start seeing this again with some regularity. I also wouldn't be surprised if we leak pipes somewhere, which would also explain the fd leak we have seen on other platforms. (I still think FreeBSD should handle this more gracefully, but we can work around this independently of any potential improvements to the FreeBSD kernel.)
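
On the possible pipe leak, a quick-and-dirty check would be to watch how many pipe descriptors a long-running julia process holds via procstat on FreeBSD. A sketch only; the PID is a placeholder and the column match is a rough heuristic, not a real parser:

# Sketch: count pipe entries in `procstat -f <pid>` output over time;
# steady growth would suggest we leak pipes somewhere.
function count_pipes(pid::Integer)
    out = read(`procstat -f $pid`, String)
    # crude: the 4th column of `procstat -f` is the descriptor type, "p" for pipe
    count(split(out, '\n')) do line
        cols = split(line)
        length(cols) >= 4 && cols[4] == "p"
    end
end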

@staticfloat
Sponsor Member

staticfloat commented Mar 5, 2020

I've started graphing kern.ipc.pipekva, so we can check to see if the ENOMEM errors actually correspond to us running out of memory, whether it's a slow leak or a quick exhaustion, etc... If someone sees a bunch of ENOMEM errors, post the error logs and I'll look up the corresponding graphs.

Keno added a commit that referenced this issue Mar 7, 2020
As noted in #35011, the `stress` test is likely causing ENOMEM
errors in unrelated processes on FreeBSD as it's causing kernel
resource exhaustion. This fixes that by running that test with
a low FD ulimit (100). Should fix #23143. Closes #30511.
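
A rough illustration of the mitigation in that commit message; the exact way #35040 wires the ulimit into the test runner may differ, and the paths below are placeholders:

# Sketch: run the stress test under a reduced file-descriptor limit so that
# fd/pipe exhaustion stays confined to that test instead of starving the
# machine's pipe KVA.
run(`sh -c "ulimit -n 100; exec ./julia test/runtests.jl stress"`)
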
Keno added a commit that referenced this issue Mar 8, 2020
As noted in #35011, the `stress` test is likely causing ENOMEM
errors in unrelated processes on FreeBSD as it's causing kernel
resource exhaustion. This fixes that by running that test with
a low FD ulimit (100). Should fix #23143. Closes #30511.
ravibitsgoa pushed a commit to ravibitsgoa/julia that referenced this issue Apr 9, 2020
As noted in JuliaLang#35011, the `stress` test is likely causing ENOMEM
errors in unrelated processes on FreeBSD as it's causing kernel
resource exhaustion. This fixes that by running that test with
a low FD ulimit (100). Should fix JuliaLang#23143. Closes JuliaLang#30511.
KristofferC pushed a commit that referenced this issue Apr 11, 2020
As noted in #35011, the `stress` test is likely causing ENOMEM
errors in unrelated processes on FreeBSD as it's causing kernel
resource exhaustion. This fixes that by running that test with
a low FD ulimit (100). Should fix #23143. Closes #30511.