ROSS hangs when out of event memory #6
Comments
Not really. It's basically the Halting Problem. We can't guarantee that an event won't get freed by running tw_gvt_force_update() one more time, but after 10 iterations, it's looking doubtful... But you never know! At least we have the warning now. Back in my day we didn't even have that! :D I suppose instead of printing the warning you could call tw_exit()?
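A minimal sketch of the "give up after N tries" idea from the comment above, in plain C. Everything here is a stand-in for illustration: `event_pool`, `reclaim_fn`, `acquire_or_give_up`, and `MAX_FAILED_RECLAIMS` are hypothetical names, not ROSS structures or APIs, and the callback only plays the role that a forced GVT update plays in the real scheduler. The ceiling of 10 is borrowed from the comment and is a heuristic, not a proof that no event would ever be freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ceiling on consecutive failed reclaim attempts (not a ROSS constant). */
#define MAX_FAILED_RECLAIMS 10

/* Hypothetical stand-in for a PE's event pool; ROSS's real structures differ. */
typedef struct {
    size_t free_events;       /* buffers currently available */
    unsigned failed_reclaims; /* consecutive reclaim attempts that freed nothing */
} event_pool;

/* Stand-in for "force a GVT update and collect freed events".
 * Returns how many buffers the attempt freed. */
typedef size_t (*reclaim_fn)(void *ctx);

/* Returns true once a buffer has been taken from the pool, or false after
 * MAX_FAILED_RECLAIMS reclaim attempts in a row freed nothing. */
bool acquire_or_give_up(event_pool *pool, reclaim_fn reclaim, void *ctx)
{
    assert(pool != NULL && reclaim != NULL);

    while (pool->free_events == 0) {
        size_t freed = reclaim(ctx);
        if (freed > 0) {
            pool->free_events += freed;
            break;
        }
        if (++pool->failed_reclaims >= MAX_FAILED_RECLAIMS)
            return false; /* caller decides whether to abort the run */
    }

    pool->failed_reclaims = 0; /* progress was made, so reset the streak */
    pool->free_events--;
    return true;
}

/* Demo callbacks: one never frees anything, one frees *(size_t *)ctx buffers. */
size_t reclaim_nothing(void *ctx) { (void)ctx; return 0; }
size_t reclaim_some(void *ctx) { return *(size_t *)ctx; }
```

With `reclaim_nothing`, the call returns false after ten attempts instead of spinning forever; that is the point where a scheduler could stop printing the warning and terminate instead.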
Actually, you don't always see a warning when the simulation runs out of event memory. Recently, I ran into cases where I was getting unexpected simulation output without any warning message that it was running out of memory. Needless to say, I spent quite some time digging through my code to find what was wrong :-) Things got resolved once I increased the event memory. So if there is a way to ensure that the warning always appears, that would be useful for debugging.
I wonder if there's a practical ceiling for no_free_event_buffers before ROSS should give up and tw_error out? 10? 20? 100? W.r.t. halting, is the condition of all PEs making no progress or stalling between GVTs sufficient to give up / finalize? The two major possibilities here I can think of are: 1) no more events to process (not an error); 2) PE(s) with insufficient memory propagate inability to progress to all other PEs.
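The two cases listed above could be told apart with a check along these lines. This is only a sketch under stated assumptions: `pe_round_stats`, `run_verdict`, and `classify_round` are hypothetical names, and the per-PE snapshots are assumed to have already been gathered in one place (in a real parallel run that would take a collective operation, which is not shown). A single quiet round is not proof of a hang, so a caller would likely require several `RUN_STALLED` verdicts in a row before giving up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-PE snapshot taken between two GVT computations. */
typedef struct {
    unsigned long events_processed; /* events handled since the previous GVT */
    bool out_of_event_memory;       /* PE failed to obtain a free event buffer */
} pe_round_stats;

typedef enum {
    RUN_CONTINUE, /* at least one PE made progress */
    RUN_FINISHED, /* nobody progressed and nobody is starved: no more events */
    RUN_STALLED   /* nobody progressed and at least one PE is out of memory */
} run_verdict;

/* Classify one GVT round from the snapshots of all PEs. */
run_verdict classify_round(const pe_round_stats *pes, size_t n_pes)
{
    assert(pes != NULL || n_pes == 0);

    bool any_starved = false;
    for (size_t i = 0; i < n_pes; i++) {
        if (pes[i].events_processed > 0)
            return RUN_CONTINUE;
        if (pes[i].out_of_event_memory)
            any_starved = true;
    }
    return any_starved ? RUN_STALLED : RUN_FINISHED;
}
```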
Thanks John; So, memory management is a bit of a sticky issue in optimistic parallel [...] The trick is the model developer needs to have some sense of what there [...] Then you want to add just enough memory for efficient optimistic execution. Hope that helps, On Fri, Aug 8, 2014 at 11:42 AM, John Jenkins notifications@github.com [...]
Christopher D. Carothers Director, Center for Computational Innovations e-mail: chrisc@cs.rpi.edu fax: (518) 276-4033
Not directly related, but one thing we can do is update the event memory allocation formula in codes-base (the codes-mapping API). Currently it's pretty static, i.e. it multiplies a constant value by the number of LPs per PE and allocates event memory accordingly. This works for some models but fails for others. Ideally, if we could calculate the event memory dynamically from the model parameters, that could help us resolve some of the memory problems. Not something that has to be done right away, but we might want to do it at some point.
Thanks Chris! A lot of useful information in there. Perhaps your response would make for good content in the wiki? Misbah: that's been a backburner item I haven't quite gotten around to, but is relatively easy to achieve. The only question is whether to have it as a run-time argument (i.e. passed through argv and required in all tw_opts) or as a configuration parameter (processed through the codes-config code path). I'm leaning towards the latter. As you mention, making the "mem_factor" multiplier configurable makes it easier to play with available memory but does not eliminate the core (undecidable?) problem of hanging on out-of-event-memory conditions.
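To make the sizing discussion concrete, here is a rough sketch contrasting the static formula described above with a model-aware one. The function names and the parameters `max_events_in_flight_per_lp` and `optimistic_headroom` are hypothetical and are not part of codes-base or ROSS; the second formula is just one way to express "what the model needs, plus a margin for optimistic execution".

```c
#include <assert.h>
#include <stddef.h>

/* Static sizing as described in the thread: a fixed multiplier times the
 * number of LPs on a PE. The multiplier plays the role of "mem_factor". */
size_t event_mem_static(size_t lps_per_pe, size_t mem_factor)
{
    return lps_per_pe * mem_factor;
}

/* Hypothetical model-aware sizing: the model supplies an estimate of how
 * many events each LP can have outstanding, and extra buffers are added as
 * headroom for optimistic execution. */
size_t event_mem_from_model(size_t lps_per_pe,
                            size_t max_events_in_flight_per_lp,
                            size_t optimistic_headroom)
{
    assert(lps_per_pe > 0);
    return lps_per_pe * max_events_in_flight_per_lp + optimistic_headroom;
}
```

For example, `event_mem_static(64, 512)` reserves 32768 event buffers regardless of what the model does, while `event_mem_from_model(64, 100, 4096)` reserves 10496. Whether the multiplier arrives through argv or a configuration file does not change the arithmetic, and neither formula removes the possibility of running out.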
As discussed in the meeting today, the hanging problem isn't something that can be solved due to time warp semantics and ROSS memory upper bounds. It might not be a bad idea to kill the simulation after a large number of successive failures as Justin suggested, but I'll leave that up to you guys. Closing the issue...
Hi all,
When ROSS runs out of event memory in optimistic mode, the message "WARNING: No free event buffers. Try increasing memory via the --extramem option" is printed to stdout, and what looks like memory reclamation via a forced GVT update is attempted (tw-sched.c, lines 177-185). However, if no memory can be reclaimed, the program seems to enter an infinite loop of checking for free memory and running the GVT. Is there any way to detect this behavior and terminate the program?