legion: assertion failure in debug mode #1123
Comments
Pull and try again.
Now the program hangs (on a single node). It completes when I remove the future-tracking loop. Adding
Try again.
This has fixed the hang on a single node, but my multi-node runs are still hanging. Here is a command line to reproduce the hang:
within the
Still waiting for sapling to have all the GPU nodes available...
It seems to hang on 3 nodes as well with this command line:
Which GASNet conduit are you using? GASNet is dropping inter-process communication when the processes are on the same node, which has been observed before with MPI on sapling. There's something wrong with sapling's configuration of shared-memory inter-process communication: #605 (comment). Run with one process per node and see if it still reproduces.
Nevermind, you can ignore the previous comment, I see what is going wrong.
Hey Mike, did you get a chance to fix this?
I started work on the fix tonight. It's unclear when I'll be able to get it done.
I was looking into this a little bit today, and I think that there has been a regression / bug somewhere in some of the changes you made. I had done the future-tracking loop described above so that the tasks being launched in a loop would be serialized (they allocate a lot of memory, so they need to be serialized to avoid OOMs). However, on 8989cac, adding the future to the task launcher does not appear to serialize the tasks anymore -- a Legion Prof run with Spy shows that Legion doesn't think there are any dependencies between the individual launches of task_5 (http://sapling.stanford.edu/~rohany/lassen-mttkrp/mttkrp-space-4/?start=7311444.180602007&end=7964719.882849198&collapseAll=false&resolution=10). When I go back to 60091fd, the tasks are serialized again, as seen in this profile: http://sapling.stanford.edu/~rohany/lassen-mttkrp/mttkrp-space-7/?start=2646595.919384058&end=4385787.523550726&collapseAll=false&resolution=10. I'm not able to effectively bisect this, because your changes fixed different hangs in this code. However, some commits that could cause this are f9ddd84 and 8989cac. I'm not sure what the best way is to give you a repro for this (if you want one); it shows up either via manual inspection of profiles or as OOMs at large problem sizes. I can prepare one of these as a repro for you.
That's not a regression; it was actually a bug fix, and the runtime is working by design now. Adding data dependences in the machine-independent application code shouldn't be able to affect or slow down the process of mapping tasks or other operations; mappers should always be able to continue mapping into the future as long as possible in order to hide latencies. If you want to rate-limit how far you are mapping into the future to manage memory consumption, then the way to do that is with the
Data dependencies should affect when tasks start to execute though! If I add the result of a task t1 as a future dependence to task t2, how can t2 start executing before t1 is done? Does t2 need to explicitly wait on the future (that seems less efficient than just starting when all the futures have been triggered)? The problem I'm having is not that the regions being mapped are too large, etc. -- I want the mapper to run ahead and map all of these tasks. The problem is that each task allocates a large
I fell back to this data dependence strategy because I've tried using the mapper for this in the past and found it quite difficult. You have to manage queues for the tasks in question within the mapper, and then trigger those tasks to start executing when prior tasks finish -- all of which are things the runtime is already doing. It felt very error-prone and potentially inefficient to try and reimplement things that Legion/Realm already know how to do.
If you think there is a bug in future dependence tracking then you should be able to make a trivial reproducer. That code is dirt simple and has been working for years at this point.
I strongly disagree with this point. It's not in any way obvious what the right rate to map tasks is, because it should be tightly coupled to the machine on which the program is running, specifically the speeds of different processors on that machine and the amount of memory and other resources available to those processors. Realm and Legion only provide mechanisms for carrying out execution and have not the first clue what the right rate is to map tasks. If you think there is a universal algorithm that Legion and Realm could implement that would do exactly the right thing for mapping and scheduling tasks for every possible application on every possible machine ever devised, both now and in the future, then write it down so we can implement it.
To do this, I need to know what to expect. If I give a task
Let me clarify. I agree that Realm/Legion don't know the best way to map / schedule things for you. What I meant by "things that Legion/Realm already know how to do" was the implementations of all the concurrency and dependence tracking that go into scheduling these tasks so that they execute one after another (as well as the side point that they implement these things efficiently!). Having my mapper (at least with the current interface) try to do this involves making queues in the mapper for these tasks to wait in, communicating through
Depends. If the chosen task variant is marked as in
You should definitely try to do this. I very much doubt that there is a bug with this code. If there were, all sorts of tests in the CI would be failing.
Which they do, no exceptions. There's no way you can get incorrect results from your program by anything you do in the mapper.
You should only need to do this if you're trying to rate-limit resource utilization, never for program correctness. Resource utilization is very machine- and schedule-dependent and therefore is definitely within the purview of the mapper's responsibilities.
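To make the mapper-based rate limiting concrete, here is a rough sketch in the style of Legion's DefaultMapper, using select_tasks_to_map and a MapperEvent. The task ID, limit, class, and member names are hypothetical, and the bookkeeping that decrements the in-flight count and triggers the deferral event when a heavy task finishes is not shown:

```cpp
#include "legion.h"
#include "mappers/default_mapper.h"

using namespace Legion;
using namespace Legion::Mapping;

// Sketch of a mapper that limits how many memory-hungry tasks are mapped at
// once. HEAVY_TASK_ID and MAX_HEAVY_IN_FLIGHT are hypothetical.
class RateLimitingMapper : public DefaultMapper {
public:
  RateLimitingMapper(MapperRuntime *rt, Machine machine, Processor local)
    : DefaultMapper(rt, machine, local), heavy_in_flight(0) { }

  virtual void select_tasks_to_map(const MapperContext ctx,
                                   const SelectMappingInput& input,
                                   SelectMappingOutput& output)
  {
    unsigned selected = 0;
    for (std::list<const Task*>::const_iterator it =
           input.ready_tasks.begin(); it != input.ready_tasks.end(); ++it) {
      const Task *task = *it;
      if (task->task_id == HEAVY_TASK_ID) {
        // Leave additional heavy tasks in the ready queue for a later call.
        if (heavy_in_flight >= MAX_HEAVY_IN_FLIGHT)
          continue;
        heavy_in_flight++;
      }
      output.map_tasks.insert(task);
      selected++;
    }
    // If nothing was selected, hand the runtime an event so it knows when to
    // invoke this call again (triggered elsewhere when a heavy task finishes).
    if ((selected == 0) && !input.ready_tasks.empty()) {
      if (!defer_event.exists())
        defer_event = runtime->create_mapper_event(ctx);
      output.deferral_event = defer_event;
    }
  }

private:
  static const TaskID HEAVY_TASK_ID = 5;          // hypothetical task id
  static const unsigned MAX_HEAVY_IN_FLIGHT = 2;  // hypothetical limit
  unsigned heavy_in_flight;
  MapperEvent defer_event;
};
```

This is the queue-management burden being debated above: the mapper decides which ready tasks to hold back and must later trigger the deferral event itself.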
Talk to @manopapad and @streichler.
Will do. I'll post an issue if I get something working. The correctness I'm referring to here is lower level than the ordering of tasks, etc. -- mainly deadlocks / races / memory errors within the implementation of this scheduling in the mapper itself.
You buy into that with your choice of the mapper synchronization model. If you want atomicity from the runtime on mapper calls then you can get that and also choose whether preemption is allowed or not. Once you make that choice, then it's up to you to implement the mapper in a consistent way. |
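The synchronization-model choice is made by overriding get_mapper_sync_model; a minimal sketch, assuming a mapper derived from the DefaultMapper (the class name is hypothetical):

```cpp
#include "mappers/default_mapper.h"

using namespace Legion;
using namespace Legion::Mapping;

// Hypothetical mapper class; only the synchronization-model choice is shown.
class MyMapper : public DefaultMapper {
public:
  MyMapper(MapperRuntime *rt, Machine machine, Processor local)
    : DefaultMapper(rt, machine, local) { }

  // Ask the runtime to run mapper calls atomically with no preemption.
  // Alternatives: CONCURRENT_MAPPER_MODEL (no atomicity guarantees) and
  // SERIALIZED_REENTRANT_MAPPER_MODEL (atomic, but calls may be preempted
  // when they block in the mapper runtime).
  virtual MapperSyncModel get_mapper_sync_model(void) const
  {
    return SERIALIZED_NON_REENTRANT_MAPPER_MODEL;
  }
};
```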
Hey Mike, sorry to bug you, but I wanted to know what your timeline is on this bug -- I'm hoping to run the experiment that is blocked by this soon.
If you're referring to the deadlock that was occurring, then you can go ahead and try again.
The deadlock looks like it's gone (on 8 nodes at least). However, I get a segfault when I go up to 16 nodes. Here's a backtrace:
Can you make a frozen process on sapling?
I'm not having any luck reproducing the crash on Sapling. I can attach to the crashed process with GDB on Lassen and print out things that you are interested in seeing though. Let me know if that works / what you want to see.
Pull and try again.
It looks like it's working for me now, thanks! I'll open a separate issue if anything else comes up.
When running on a single node with debug mode, I see this assertion failure:
To reproduce the error, go to /home/rohany/taco-ctrl-rep-bug/build/ and run ./runner.sh. The binary is compiled against a debug build of Legion. In my code, there is a loop like this:
When I remove this tracking of the future (i.e., the loop just launches the tasks), the code succeeds. However, I don't think that this is incorrect, as I've done the same in a testing application with no errors, so I think that there is some interaction between this and other parts of the system that I don't understand.
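As an illustration of the future-tracking pattern being described (the original snippet is not shown above), here is a minimal sketch using Legion's C++ TaskLauncher API; the task ID and function name are hypothetical:

```cpp
#include "legion.h"

using namespace Legion;

// Hypothetical task ID for the memory-hungry task launched in the loop.
enum { TID_HEAVY_TASK = 5 };

void launch_serialized(Context ctx, Runtime *runtime, int num_launches)
{
  Future prev;  // no dependence for the first launch
  for (int i = 0; i < num_launches; i++) {
    TaskLauncher launcher(TID_HEAVY_TASK, TaskArgument(&i, sizeof(i)));
    // Adding the previous launch's future makes this launch depend on it,
    // so only one of these tasks holds its large allocation at a time.
    if (i > 0)
      launcher.add_future(prev);
    prev = runtime->execute_task(ctx, launcher);
  }
}
```

As discussed in the thread above, this expresses dependences between the launches but, by design, does not limit how far ahead of execution the mapper maps them.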