ROSS hangs when out of event memory #6

Closed
JohnPJenkins opened this issue Aug 8, 2014 · 7 comments

Comments

@JohnPJenkins

Hi all,

When out of event memory in optimistic mode, the message "WARNING: No free event buffers. Try increasing memory via the --extramem option" is printed to stdout and what looks like memory reclamation via a forced GVT update is attempted (tw-sched.c, lines 177-185). However, if no memory can be reclaimed, the program seems to enter an infinite loop of checking for free memory and running GVT. Is there any way to detect this behavior and terminate the program?
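
In pseudocode, the behavior I'm describing looks roughly like this (a paraphrase of what I observe, not the actual tw-sched.c source; the helper name is a placeholder and the tw_gvt_force_update() signature is approximated):

```c
/* Paraphrase of the loop described above (not the actual tw-sched.c
 * source; free_event_buffers() is a placeholder and the
 * tw_gvt_force_update() signature is approximated).  When no free event
 * buffers remain, a GVT round is forced in the hope that fossil
 * collection frees something; if nothing is ever freed, this spins
 * forever. */
while (free_event_buffers(pe) == 0) {
    printf("WARNING: No free event buffers. "
           "Try increasing memory via the --extramem option\n");
    tw_gvt_force_update(pe);  /* force a GVT round / fossil collection */
}
```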

@laprej
Member

laprej commented Aug 8, 2014

Not really. It's basically the Halting Problem. We can't guarantee that an event won't get freed by running tw_gvt_force_update() one more time, but after 10 iterations, it's looking doubtful... But you never know! At least we have the warning now. Back in my day we didn't even have that! :D I suppose instead of printing the warning you could call tw_exit()?

@mmubarak
Collaborator

mmubarak commented Aug 8, 2014

Actually, you don't always see a warning when the simulation runs out of event memory. Recently, I ran into cases where I was getting some unexpected simulation output without any warning message that it was running out of memory. Needless to say, I spent quite some time digging through my code to find what was wrong :-) Things got resolved once I increased the event memory.

So if there is a way we could ensure that the warning always appears, I think that would be useful in the debugging process.

@JohnPJenkins
Author

I wonder if there's a practical ceiling for no_free_event_buffers before ROSS should give up and tw_error out? 10? 20? 100?

W.r.t. halting, is the condition of all PEs making no progress or stalling between GVTs sufficient to give up / finalize? The two major possibilities here I can think of are: 1) no more events to process (not an error); 2) PE(s) with insufficient memory propagate inability to progress to all other PEs.
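
Concretely, the ceiling I'm imagining would look something like this (a sketch only -- the cap value, the counter, and the out-of-buffers condition are made up for illustration, and I'm assuming the usual tw_error(TW_LOC, ...) calling convention):

```c
/* Sketch of the proposed ceiling (illustrative only; the cap, the
 * counter, and the out-of-buffers condition are placeholders). */
#define MAX_FORCED_GVT_ATTEMPTS 16

static unsigned consecutive_no_free = 0;

if (out_of_event_buffers) {
    if (++consecutive_no_free > MAX_FORCED_GVT_ATTEMPTS)
        tw_error(TW_LOC, "no free event buffers after %u forced GVTs; "
                         "increase --extramem", consecutive_no_free);
    tw_gvt_force_update(pe);  /* one more attempt to reclaim memory */
} else {
    consecutive_no_free = 0;  /* buffers were freed; reset the counter */
}
```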

@carothersc-zz
Member

Thanks John;

So, memory management is a bit of a sticky issue in optimistic parallel event simulators. In particular, a model could have a small set of LPs that race ahead and effectively consume the available event memory, even all the available system memory. Additionally, a model developer could do some bad things like unknowingly broadcast events to a large group of LPs -- this causes a swell in the pending event population. If left unchecked, the model will exhaust all available event memory even if only running in serial mode.

The trick is that the model developer needs to have some sense of what their peak event memory needs are when executing on a single processor. For network models, you can get a sense of that based on what you think the average hop counts are, coupled with the arrival rate of new packets.

Then you want to add just enough memory for efficient optimistic execution. This is typically no more than 8K to 16K event memory buffers, assuming your batch and GVT interval values are 8 and 512, respectively -- i.e., on average, between successive GVTs each MPI rank will process about batch X GVT-interval events, so at a batch of 8 and a GVT interval of 512 you'll process about 4K events per GVT epoch.
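
For concreteness, the arithmetic works out roughly as follows (a back-of-the-envelope sketch, not ROSS code; the 2x-4x headroom factors are just a restatement of the 8K-16K guidance, not an exact formula):

```c
#include <stdio.h>

/* Back-of-the-envelope restatement of the sizing rule above (not ROSS
 * code).  Per MPI rank, roughly batch * gvt_interval events are
 * processed per GVT epoch; the 2x-4x headroom factors below are just a
 * reading of the "8K to 16K buffers" guidance. */
int main(void)
{
    unsigned batch = 8, gvt_interval = 512;
    unsigned per_epoch = batch * gvt_interval;   /* 4096, i.e. ~4K events */
    printf("events per GVT epoch: %u\n", per_epoch);
    printf("suggested event buffer range: %u-%u\n",
           2 * per_epoch, 4 * per_epoch);        /* ~8K to ~16K buffers */
    return 0;
}
```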

Hope that helps,
Chris

@mmubarak
Collaborator

mmubarak commented Aug 8, 2014

Not directly related, but one thing we could do is update the event memory allocation formula in codes-base (the codes-mapping API). Currently it's pretty static, i.e., it multiplies a constant value by the number of LPs per PE and allocates event memory accordingly. This works for some of the models but fails for others. Ideally, if we could dynamically calculate the event memory based on the model parameters, that could help us resolve some of the memory problems. Not something that has to be done right away, but we might want to do it at some point.
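
Roughly, the difference I have in mind looks like this (purely illustrative -- the function and parameter names are made up and are not the actual codes-mapping API):

```c
/* Purely illustrative; these are not the actual codes-mapping functions.
 * Today's scheme: a fixed multiplier times the number of LPs per PE. */
static size_t static_event_mem(size_t nlp_per_pe)
{
    const size_t mem_factor = 256;   /* fixed constant, model-agnostic */
    return mem_factor * nlp_per_pe;
}

/* The kind of model-aware estimate meant above: fold in parameters the
 * model already knows, e.g. expected hop count and packet arrival rate
 * (per Chris's suggestion for network models). */
static size_t dynamic_event_mem(size_t nlp_per_pe, double avg_hop_count,
                                double arrivals_per_lp_per_epoch)
{
    double per_lp = avg_hop_count * arrivals_per_lp_per_epoch;
    return (size_t)(per_lp * nlp_per_pe * 2.0);  /* 2x headroom, arbitrary */
}
```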

@JohnPJenkins
Author

Thanks Chris! A lot of useful information in there. Perhaps your response would make for good content in the wiki?

Misbah: that's been a backburner item I haven't quite gotten around to, but is relatively easy to achieve. The only question is whether to have it as a run-time argument (i.e. passed through argv and required in all tw_opts) or as a configuration parameter (processed through the codes-config code path). I'm leaning towards the latter.

As you mention, making the "mem_factor" multiplier configurable makes it easier to play with available memory but does not eliminate the core (undecidable?) problem of hanging on out-of-event-memory conditions.

@JohnPJenkins
Author

As discussed in the meeting today, the hanging problem isn't something that can be solved due to time warp semantics and ROSS memory upper bounds. It might not be a bad idea to kill the simulation after a large number of successive failures as Justin suggested, but I'll leave that up to you guys. Closing the issue...
