ROSS hangs when out of event memory #6

Closed
JohnPJenkins opened this issue Aug 8, 2014 · 7 comments

Comments

@JohnPJenkins

Hi all,

When out of event memory in optimistic mode, the message "WARNING: No free event buffers. Try increasing memory via the --extramem option" is printed to stdout and what looks like memory reclamation via a forced GVT update is attempted (tw-sched.c, lines 177-185). However, if no memory can be reclaimed, the program seems to enter an infinite loop of checking for free memory and running GVT. Is there any way to detect this behavior and terminate the program?
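
In pseudocode, the behavior I'm describing looks roughly like this (a paraphrase of what I observe, not the actual tw-sched.c source; the helper name is a placeholder and the tw_gvt_force_update() signature is approximated):

```c
/* Paraphrase of the loop described above (not the actual tw-sched.c
 * source; free_event_buffers() is a placeholder and the
 * tw_gvt_force_update() signature is approximated).  When no free event
 * buffers remain, a GVT round is forced in the hope that fossil
 * collection frees something; if nothing is ever freed, this spins
 * forever. */
while (free_event_buffers(pe) == 0) {
    printf("WARNING: No free event buffers. "
           "Try increasing memory via the --extramem option\n");
    tw_gvt_force_update(pe);  /* force a GVT round / fossil collection */
}
```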

@laprej
Member

laprej commented Aug 8, 2014

Not really. It's basically the Halting Problem. We can't guarantee that an event won't get freed by running tw_gvt_force_update() one more time, but after 10 iterations, it's looking doubtful... But you never know! At least we have the warning now. Back in my day we didn't even have that! :D I suppose instead of printing the warning you could call tw_exit()?

@mmubarak
Collaborator

mmubarak commented Aug 8, 2014

Actually, you don't always see a warning when the simulation runs out of event memory. Recently, I ran into cases where I was getting some unexpected simulation output without any warning message that it was running out of memory. Needless to say, I spent quite some time digging through my code to find what was wrong :-) Things got resolved once I increased the event memory.

So if there is a way we could ensure that the warning always appears, I think that would be useful in the debugging process.

@JohnPJenkins
Author

I wonder if there's a practical ceiling for no_free_event_buffers before ROSS should give up and tw_error out? 10? 20? 100?

W.r.t. halting, is the condition of all PEs making no progress or stalling between GVTs sufficient to give up / finalize? The two major possibilities here I can think of are: 1) no more events to process (not an error); 2) PE(s) with insufficient memory propagate inability to progress to all other PEs.
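
Concretely, the ceiling I'm imagining would look something like this (a sketch only -- the cap value, the counter, and the out-of-buffers condition are made up for illustration, and I'm assuming the usual tw_error(TW_LOC, ...) calling convention):

```c
/* Sketch of the proposed ceiling (illustrative only; the cap, the
 * counter, and the out-of-buffers condition are placeholders). */
#define MAX_FORCED_GVT_ATTEMPTS 16

static unsigned consecutive_no_free = 0;

if (out_of_event_buffers) {
    if (++consecutive_no_free > MAX_FORCED_GVT_ATTEMPTS)
        tw_error(TW_LOC, "no free event buffers after %u forced GVTs; "
                         "increase --extramem", consecutive_no_free);
    tw_gvt_force_update(pe);  /* one more attempt to reclaim memory */
} else {
    consecutive_no_free = 0;  /* buffers were freed; reset the counter */
}
```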

@carothersc-zz
Member

Thanks John;

So, memory management is a bit of a sticky issue in optimistic parallel event simulators. In particular, a model could have a small set of LPs that race ahead and effectively consume the available event memory, even all the available system memory. Additionally, a model developer could do some bad things like unknowingly broadcast events to a large group of LPs -- this causes a swell in the pending event population. If left unchecked, the model will exhaust all available event memory even if only running in serial mode.

The trick is that the model developer needs to have some sense of what their peak event memory needs are when executing on a single processor. For network models, you can get a sense of that based on what you think the average hop counts are, coupled with the arrival rate of new packets.

Then you want to add just enough memory for efficient optimistic execution. This is typically no more than 8K to 16K event memory buffers, assuming your batch and GVT interval values are 8 and 512, respectively -- i.e., on average, between successive GVTs each MPI rank will process about batch X GVT-interval events, so at a batch of 8 and a GVT interval of 512 you'll process about 4K events per GVT epoch.
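
For concreteness, the arithmetic works out roughly as follows (a back-of-the-envelope sketch, not ROSS code; the 2x-4x headroom factors are just a restatement of the 8K-16K guidance, not an exact formula):

```c
#include <stdio.h>

/* Back-of-the-envelope restatement of the sizing rule above (not ROSS
 * code).  Per MPI rank, roughly batch * gvt_interval events are
 * processed per GVT epoch; the 2x-4x headroom factors below are just a
 * reading of the "8K to 16K buffers" guidance. */
int main(void)
{
    unsigned batch = 8, gvt_interval = 512;
    unsigned per_epoch = batch * gvt_interval;   /* 4096, i.e. ~4K events */
    printf("events per GVT epoch: %u\n", per_epoch);
    printf("suggested event buffer range: %u-%u\n",
           2 * per_epoch, 4 * per_epoch);        /* ~8K to ~16K buffers */
    return 0;
}
```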

Hope that helps,
Chris

@mmubarak
Collaborator

mmubarak commented Aug 8, 2014

Not directly related, but one thing we could do is update the event memory allocation formula in codes-base (the codes-mapping API). Currently it's pretty static, i.e., it multiplies a constant value by the number of LPs per PE and allocates event memory accordingly. This works for some of the models but fails for others. Ideally, if we could dynamically calculate the event memory based on the model parameters, that could help us resolve some of the memory problems. Not something that has to be done right away, but we might want to do it at some point.
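
Roughly, the difference I have in mind looks like this (purely illustrative -- the function and parameter names are made up and are not the actual codes-mapping API):

```c
/* Purely illustrative; these are not the actual codes-mapping functions.
 * Today's scheme: a fixed multiplier times the number of LPs per PE. */
static size_t static_event_mem(size_t nlp_per_pe)
{
    const size_t mem_factor = 256;   /* fixed constant, model-agnostic */
    return mem_factor * nlp_per_pe;
}

/* The kind of model-aware estimate meant above: fold in parameters the
 * model already knows, e.g. expected hop count and packet arrival rate
 * (per Chris's suggestion for network models). */
static size_t dynamic_event_mem(size_t nlp_per_pe, double avg_hop_count,
                                double arrivals_per_lp_per_epoch)
{
    double per_lp = avg_hop_count * arrivals_per_lp_per_epoch;
    return (size_t)(per_lp * nlp_per_pe * 2.0);  /* 2x headroom, arbitrary */
}
```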

@JohnPJenkins
Author

Thanks Chris! A lot of useful information in there. Perhaps your response would make for good content in the wiki?

Misbah: that's been a backburner item I haven't quite gotten around to, but is relatively easy to achieve. The only question is whether to have it as a run-time argument (i.e. passed through argv and required in all tw_opts) or as a configuration parameter (processed through the codes-config code path). I'm leaning towards the latter.

As you mention, making the "mem_factor" multiplier configurable makes it easier to play with available memory but does not eliminate the core (undecidable?) problem of hanging on out-of-event-memory conditions.

@JohnPJenkins
Author

As discussed in the meeting today, the hanging problem isn't something that can be solved due to time warp semantics and ROSS memory upper bounds. It might not be a bad idea to kill the simulation after a large number of successive failures as Justin suggested, but I'll leave that up to you guys. Closing the issue...
