mmap failure with address space quotas #10390

Closed
JonathanAnderson opened this issue Mar 3, 2015 · 58 comments
Assignees
Labels
GC (Garbage collector), kind:bug (Indicates an unexpected problem or unintended behavior)
Milestone

Comments

@JonathanAnderson
Contributor

I'm having a problem where, when I build from 3c7136e and run julia, I get the error could not allocate pools. If I run as a different user, Julia runs successfully.

I think there might be something specific to my user on this box, but I am happy to help identify what is happening here.

I think this is related to #8699

also, from the julia-users group: https://groups.google.com/forum/#!topic/julia-users/FSIC1E6aaXk

@pao pao added the domain:building Build system, or building Julia or its dependencies label Mar 3, 2015
@JeffBezanson JeffBezanson removed the domain:building Build system, or building Julia or its dependencies label Mar 3, 2015
@JeffBezanson
Sponsor Member

This error is from a failing mmap, where we try to reserve 8 GB of virtual address space up front. There might be a quota on virtual memory for some users.
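
For concreteness, here is a minimal sketch (not Julia's actual gc.c code, just an illustration of the failure mode) of how a large address-space reservation fails under such a quota on 64-bit Linux:

/* Hedged sketch, not Julia's gc.c: reserve a large anonymous mapping the
 * way the GC region allocation roughly does.  Under a ulimit -v
 * (RLIMIT_AS) quota smaller than the request, mmap returns MAP_FAILED
 * with errno == ENOMEM even though no physical memory is touched. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 8ULL * 1024 * 1024 * 1024;   /* 8 GiB of address space */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");                       /* e.g. "Cannot allocate memory" */
        return 1;                             /* the "could not allocate pools" case */
    }
    munmap(p, len);
    return 0;
}

Running this under something like ulimit -v 8000000 should make the mmap fail, while an unlimited address space lets it succeed.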

@pao
Member

pao commented Mar 3, 2015

Oops, misread the issue, sorry.

@ivarne
Sponsor Member

ivarne commented Mar 3, 2015

-v: address space (kb) 8000000 (from the julia-users thread) seems to indicate that an 8 GB allocation is guaranteed to cause trouble: 8,000,000 kB is roughly 7.6 GiB, which is below the 8 GiB region Julia tries to reserve.

@JeffBezanson
Sponsor Member

@carnaval Could we decrease this to, say, 4GB, to make this issue less likely?

@vtjnash vtjnash added the kind:bug Indicates an unexpected problem or unintended behavior label Mar 4, 2015
@vtjnash vtjnash added this to the 0.4.1 milestone Mar 4, 2015
@JeffBezanson JeffBezanson changed the title could not allocate pools mmap failure with address space quotas Mar 5, 2015
@tkelman
Contributor

tkelman commented Mar 16, 2015

An 8 GB array is also too large for MSVC to compile, FWIW.

@tkelman
Contributor

tkelman commented Mar 20, 2015

Can you try reducing the number in julia/src/gc.c (Line 88 in e1d6e56):

#define REGION_PG_COUNT 16*8*4096 // 8G because virtual memory is cheap

by a factor of 2 or 4, and see if it helps? I could also make a test branch with that change to have the buildbot make test binaries, if that would be easier.
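
For anyone trying this locally, the change being suggested would look roughly like the following. This is only an illustration, not an official patch, and it takes the "8G" in the original comment at face value, so a factor-of-4 reduction would reserve 2 GiB per region:

/* Illustration only: REGION_PG_COUNT reduced by a factor of 4.
 * If the original value reserved 8 GiB of address space per region,
 * this one reserves 2 GiB. */
#define REGION_PG_COUNT 4*8*4096 // 2G instead of 8G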

@carnaval
Contributor

We can certainly lower this. I set it that high under the reasoning that address space was essentially free on x64. As I understand it, operations are either O(f(number of memory mappings)) or O(f(size of resident portion)), so it should not hurt performance.
I didn't think of arbitrary quotas, but it's probably better to ask: does anyone know of any other drawback to allocating "unreasonable" amounts of virtual memory on a 64-bit arch?

@ScottPJones
Contributor

@carnaval Yes, indeed... lots of performance issues if you have very large amounts of memory mapped... which is why people use huge page support...

@carnaval
Contributor

Keep in mind I'm still talking about uncommitted memory. The advantage of huge pages is reducing TLB contention as far as I know, and uncommitted memory sure won't end up in the TLB.

Generally, as far as my understanding of the kernel VM system goes, "dense" data structures (such as the page table, for which the TLB acts as a cache) are only filled in for committed memory. The mapping itself stays in a "sparse" structure (like a list of mappings), so you only pay costs relative to the number of mappings. I may be wrong though, so I'll be happy to be corrected.
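
To make the reserved-versus-committed distinction concrete, here is a hedged, Linux-only sketch (not Julia's GC code) that reserves a large region but commits only a small slice of it:

/* Illustration of reserve vs. commit.  The big PROT_NONE reservation adds
 * only one entry to the kernel's list of mappings; page-table entries (and
 * hence TLB pressure) appear only for the small part that is made
 * accessible and actually touched. */
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = 8ULL * 1024 * 1024 * 1024;   /* reserve 8 GiB, commit nothing */
    size_t commit  = 1024 * 1024;                 /* commit just 1 MiB of it */

    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return 1;

    mprotect(base, commit, PROT_READ | PROT_WRITE); /* make 1 MiB usable */
    memset(base, 0, commit);                        /* touching it creates the PTEs */

    madvise(base, commit, MADV_DONTNEED);           /* give the pages back */
    mprotect(base, commit, PROT_NONE);              /* and block further faults */

    munmap(base, reserve);
    return 0;
}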

@ScottPJones
Contributor

I'm talking about memory that has actually been touched, i.e. committed.
The issue is that if you have an (opt-in, at least) limit in the language, instead of just relying on things like ulimit, you can (at least in my experience) better control things and keep them from getting to the point where the OS goes belly-up. Say you have 60,000 processes running, which you know only need, say, 128M (unless they somehow get out of control due to some bug): having the limit protects you.
You may also have different classes of processes that need more memory (say, loading a huge XML document), so it's important to be able to allow those to dynamically have a higher limit (based on user roles).

@carnaval
Contributor

That's not what my question was about though. We are already careful to decommit useless pages.

The limit is another issue; enforcing it strictly would probably require parsing /proc/self/smaps from time to time anyway, to be sure some C library is not sneaking around making mappings.
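
For reference, strict accounting of that kind would look roughly like the hypothetical helper below; this is not anything in Julia's codebase, and it is Linux-specific:

/* Hypothetical helper: total virtual address space currently mapped by
 * this process, obtained by summing the "Size:" fields in
 * /proc/self/smaps.  Only an approximation of what a strict in-language
 * limit would have to track. */
#include <stdio.h>

static unsigned long long total_mapped_kb(void)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f)
        return 0;
    char line[256];
    unsigned long long total = 0, kb;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "Size: %llu kB", &kb) == 1)  /* per-mapping size */
            total += kb;
    }
    fclose(f);
    return total;
}

int main(void)
{
    printf("mapped: %llu kB\n", total_mapped_kb());
    return 0;
}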

@ScottPJones
Contributor

Yes, but does the current system ever try to proactively cut down on caches, etc., so that it can free up some memory?

It doesn't really have to be done strictly to be useful, and without fancy approaches like parsing /proc/...
Also, for people embedding Julia, couldn't things be compiled so that at least malloc/calloc/realloc end up using a Julia version that does keep track?
Having some facility to try to increase stability is better than none, even if it can't handle external memory pressure.

@carnaval
Contributor

I'm not arguing that we should not do those things. But those are features. I was just trying to check whether someone knew of a kernel that would be slow with large mappings: that would be a regression, not a missing feature.

@mauro3
Contributor

mauro3 commented Aug 12, 2015

I'm running into a could not allocate pools issue on a new build on a new machine (0.3 works fine).
(Not sure whether this warrants a new issue or not, let me know.)

It builds fine, but it crashes when running the tests; the culprit is addprocs:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+6683 (2015-08-12 17:53 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 103f7a3* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> addprocs(3; exeflags=`--check-bounds=yes --depwarn=error`)
could not allocate pools

However, addprocs(2; exeflags=`--check-bounds=yes --depwarn=error`) works. Also starting more than three REPLs at once produces the error.

As far as I can tell there are no relevant ulimits:

 $ ulimit -a
-t: cpu time (seconds)         unlimited
-f: file size (blocks)         unlimited
-d: data seg size (kbytes)     unlimited
-s: stack size (kbytes)        8192
-c: core file size (blocks)    0
-m: resident set size (kbytes) unlimited
-u: processes                  63889
-n: file descriptors           1024
-l: locked-in-memory size (kb) 64
-v: address space (kb)         unlimited
-x: file locks                 unlimited
-i: pending signals            63889
-q: bytes in POSIX msg queues  819200
-e: max nice                   0
-r: max rt priority            0
-N 15:                         unlimited

On my normal machine the -l option is unlimited, but limiting it there to 64 does not reproduce this behavior.

@mauro3
Contributor

mauro3 commented Aug 13, 2015

The same problem arises using the Julia nightlies julia-0.4.0-24a92a9f5d-linux64.tar.gz.

Any ideas on how I could resolve this? Should I contact the admin of that machine to change some settings?

@carnaval
Contributor

Yes, you can remove the 16* here in julia/src/gc.c (Line 164 in f40da0f):

#define REGION_PG_COUNT 16*8*4096 // 8G because virtual memory is cheap

and recompile.

Maybe I should make that the default, but it feels so silly to me for admins to restrict address space; I don't get it, really.

@mauro3
Contributor

mauro3 commented Aug 13, 2015

Yes, that works, thanks! Just to clarify: my understanding from this thread is that a limit on -v: address space (kb) is what causes this. However, that is unlimited on my machine. So which limit is the culprit?

@waTeim
Contributor

waTeim commented May 5, 2016

As for me, this is happening on ARM for unclear reasons. The GC memory space is currently not expandable, I take it. If it were, I think minimal low-end hardware could afford a heap size of at least 64M without an issue, while expecting a size approaching 1G is ridiculous. Somewhere in between is the target.

Additionally, I request that this be configurable via the library (jl_init or similar), and not only controllable by running the julia executable.

@yuyichao
Contributor

yuyichao commented May 5, 2016

The ARM issue is completely different. This is only an issue for those who cannot control the virtual address space limit. The amount of physical memory is irrelevant here.

@waTeim
Contributor

waTeim commented May 5, 2016

Well, since the error message is the same, it at least seems related. Are you saying this happens not because of the size of the allocation but because of its location? Aren't those related? The previous discussion made it sound like people were having problems because the system prevented oversubscription, which seems to indicate a problem with size. If that's the case, then that kind of makes sense too; it is true that there is a lack of OS support for 64-bit virtual addresses.

@r-barnes

Attempting to compile on XSEDE's Comet raised this error. Removing 16* from gc.c allowed compilation to continue.

@eschnett
Contributor

Comet's front end has a severely restricted memory limit setting (ulimit). You can only allocate 2 GByte. The solution is to request a compute node interactively, and build there:

/share/apps/compute/interactive/qsubi.bash -p debug --nodes=1 --ntasks-per-node=24 -t 00:30:00 --export=ALL

yuyichao added a commit that referenced this issue May 16, 2016
* Set region sizes based on `ulimit`.
* Automatically shrink region size when allocation fails.

Fix #10390
tkelman pushed a commit to tkelman/julia that referenced this issue May 16, 2016
* Set region sizes based on `ulimit`.
* Automatically shrink region size when allocation fails.

Fix JuliaLang#10390
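
The two bullet points in those commit messages correspond roughly to a pattern like the sketch below. This is a hedged reconstruction with hypothetical names (pick_region_size, alloc_region), not the code that was actually merged:

/* Sketch of the idea: query the address-space limit and cap the region
 * size accordingly, then fall back to smaller regions if the mmap still
 * fails. */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

static size_t pick_region_size(void)
{
    size_t sz = (size_t)8 << 30;              /* default: 8 GiB per region */
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY) {
        while (sz > (1 << 20) && sz > rl.rlim_cur / 4)
            sz /= 2;                          /* stay well under the quota */
    }
    return sz;
}

static void *alloc_region(size_t *sz)
{
    for (;;) {
        void *p = mmap(NULL, *sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p != MAP_FAILED)
            return p;
        if (*sz <= (1 << 20))
            return NULL;                      /* give up below 1 MiB */
        *sz /= 2;                             /* shrink and retry */
    }
}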
@floswald

floswald commented Jun 8, 2016

Hi all,
is this going to be backported to 0.4.x at some point? I'm stuck with this problem on a cluster. thanks!

@tkelman
Contributor

tkelman commented Jun 8, 2016

#16385 was a pretty large change, I'm not sure whether it can be easily backported. Are you building from source or using binaries? If the former, just change the number in the code and recompile. If the latter, I guess we could trigger an unofficial build with a smaller value.

@floswald

floswald commented Jun 8, 2016

I was using binaries. Building is a nightmare on that system as well: I run into disk-space quota exceeded errors on the login node all the time, and I can't get the build to work on a compute node either. If you can trigger an unofficial 0.4.5 build, that would save my week. Thanks.

@tkelman
Contributor

tkelman commented Jun 8, 2016

might take a while to build, but check back at https://build.julialang.org/builders/package_tarball64/builds/435 and when it's done it should be available at https://julianightlies.s3.amazonaws.com/bin/linux/x64/0.4/julia-0.4.6-c7cd8171df-linux64.tar.gz (assuming you want 64 bit linux, and dropping by a factor of 8 will get you below your ulimit)

@floswald

floswald commented Jun 8, 2016

Awesome! Thanks.


@floswald

floswald commented Jun 8, 2016

@tkelman thanks so much, works out of the box like a charm! So much for "broken software". Outstanding support as usual. 👍 👍 👍

@mauro3
Contributor

mauro3 commented Sep 13, 2016

In case someone else stumbles over this: I was under the impression that this issue was resolved, but it still surfaced for me with the 0.5-rc4 binaries and a source build of rc4, see #18477.

The error now looks a bit different for me: either it just hangs in the tests when running make testall, or, when doing addprocs with a suitably high number, I get Master process (id 1) could not connect within 60.0 seconds. The fix is as before, but the define is now in src/gc-pages.c.

@StefanKarpinski StefanKarpinski added this to the 0.5.x milestone Sep 13, 2016
@StefanKarpinski
Sponsor Member

Reopened to be fixed in 0.5.x.

@yuyichao
Contributor

As mentioned in the related issue, this is really #17987. It no longer fails because we ask for a huge fixed size; what remains is better handled by allowing users with special memory constraints to specify them directly.

@floswald

Sorry to bother with this, but I am still looking for a solution to this problem. I am working on a cluster where I have to request the maximum amount of virtual and physical memory that I will be using, and I have to request very large amounts in order for my job to run at all. This puts me on a significantly longer queue, because I basically need an entire compute node all to myself. Julia v0.5-rc3.

My job has the following memory requirements when run on a single compute node on that same cluster.

Any advice on how to deal with this would be greatly appreciated.
