
legion: assertion failure in debug mode #1123

Closed
rohany opened this issue Jul 22, 2021 · 25 comments

@rohany
Contributor

rohany commented Jul 22, 2021

When running on a single node with debug mode, I see this assertion failure:

mttkrp-cuda: /home/rohany/taco-ctrl-rep-bug/legion/legion/runtime/legion/runtime.cc:2548: Legion::Internal::FutureInstance::FutureInstance(const void*, size_t, Legion::Memory, Legion::Internal::ApEvent, Legion::Internal::Runtime*, bool, bool, bool, Legion::Internal::PhysicalInstance, void (*)(void*, size_t), Legion::Processor, Legion::Internal::RtEvent): Assertion `size > 0' failed.
Signal 6 received by node 0, process 3943112 (thread 7fd73403a000) - obtaining backtrace
Signal 6 received by process 3943112 (thread 7fd73403a000) at: stack trace: 21 frames
  [0] = /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fd747cd63c0]
  [1] = /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7fd745f6418b]
  [2] = /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7fd745f43859]
  [3] = /lib/x86_64-linux-gnu/libc.so.6(+0x25729) [0x7fd745f43729]
  [4] = /lib/x86_64-linux-gnu/libc.so.6(+0x36f36) [0x7fd745f54f36]
  [5] = bin/mttkrp-cuda(Legion::Internal::FutureInstance::FutureInstance(void const*, unsigned long, Realm::Memory, Legion::Internal::ApEvent, Legion::Internal::Runtime*, bool, bool, bool, Realm::RegionInstance, void (*)(void*, unsigned long), Realm::Processor, Legion::Internal::RtEvent)+0x148) [0x56168c98ca90]
  [6] = bin/mttkrp-cuda(Legion::Internal::MemoryManager::create_future_instance(Legion::Internal::Operation*, unsigned long long, Legion::Internal::ApEvent, unsigned long, bool)+0x5a7) [0x56168c9b0b43]
  [7] = bin/mttkrp-cuda(Legion::Internal::FutureImpl::request_application_instance(Realm::Memory, Legion::Internal::SingleTask*, unsigned long long, unsigned int, Legion::Internal::ApUserEvent, unsigned long)+0x579) [0x56168c98677f]
  [8] = bin/mttkrp-cuda(Legion::Internal::SingleTask::finalize_map_task_output(Legion::Mapping::Mapper::MapTaskInput&, Legion::Mapping::Mapper::MapTaskOutput&, Legion::Internal::MustEpochOp*, std::vector<Legion::Internal::InstanceSet, std::allocator<Legion::Internal::InstanceSet> >&)+0x949) [0x56168c77e979]
  [9] = bin/mttkrp-cuda(Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*)+0x184) [0x56168c7831e8]
  [10] = bin/mttkrp-cuda(Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*)+0x367) [0x56168c784d5f]
  [11] = bin/mttkrp-cuda(Legion::Internal::IndividualTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*)+0x3f) [0x56168c790431]
  [12] = bin/mttkrp-cuda(Legion::Internal::SingleTask::trigger_mapping()+0x2d5) [0x56168c77d2ff]
  [13] = bin/mttkrp-cuda(Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x45a) [0x56168ca020e0]
  [14] = bin/mttkrp-cuda(+0x217ce4a) [0x56168d4a7e4a]
  [15] = bin/mttkrp-cuda(+0x21e3ae1) [0x56168d50eae1]
  [16] = bin/mttkrp-cuda(+0x21e8c64) [0x56168d513c64]
  [17] = bin/mttkrp-cuda(+0x21e6bf5) [0x56168d511bf5]
  [18] = bin/mttkrp-cuda(+0x21ee98e) [0x56168d51998e]
  [19] = bin/mttkrp-cuda(+0x21fcc6a) [0x56168d527c6a]
  [20] = /lib/x86_64-linux-gnu/libc.so.6(+0x5e660) [0x7fd745f7c660]

To reproduce the error, go to /home/rohany/taco-ctrl-rep-bug/build/ and run ./runner.sh. The binary is compiled against a debug build of Legion.

In my code, there is a loop like this:

Future f;
for (...) {
  TaskLauncher launcher(...);
  if (f.valid()) {
    launcher.add_future(f);  // chain on the previous iteration's result
  }
  f = runtime->execute_task(ctx, launcher);
}

When I remove this tracking of the future (i.e. the loop just launches the tasks), the code succeeds. However, I don't think the code above is incorrect, as I've done the same in a testing application with no errors, so I suspect there is some interaction between this pattern and other parts of the system that I don't understand.

@lightsighter
Contributor

Pull and try again

streichler pushed a commit that referenced this issue Jul 23, 2021
@rohany
Contributor Author

rohany commented Jul 23, 2021

Now the program hangs (on a single node). It completes when I remove the future-tracking loop. Adding -ll:defalloc 0 (I tried it anyway) doesn't seem to change anything. The reproducer is in the same location on sapling.

@lightsighter
Contributor

Try again

@rohany
Contributor Author

rohany commented Jul 25, 2021

This has fixed the hang on a single node, but my multi-node runs are still hanging. Here is a command line to reproduce the hang:

mpirun -H g0001:4,g0002:4,g0003:4,g0004:4 --bind-to none -npernode 4 ./runner.sh bin/mttkrp-cuda -n 512 -gx 4 -gy 4 -gz 4 -lg:eager_alloc_percentage 50 -ll:ocpu 1 -ll:othr 2 -ll:csize 20000 -ll:util 1 -dm:replicate 1 -ll:gpu 1 -ll:fsize 15000 -gex:bindcuda 0 -gex:batch 0 -tm:validate_cpu -ll:defalloc 0

within the /home/rohany/taco/build/ directory.

@lightsighter
Contributor

Still waiting for sapling to have all the GPU nodes available...

@rohany
Contributor Author

rohany commented Jul 27, 2021

It seems to hang on 3 nodes as well with this command line:

mpirun -H g0002:4,g0003:4,g0004:4 --bind-to none -npernode 4 ./runner.sh bin/mttkrp-cuda -n 512 -gx 4 -gy 4 -gz 4 -lg:eager_alloc_percentage 50 -ll:ocpu 1 -ll:othr 2 -ll:csize 20000 -ll:util 1 -dm:replicate 1 -ll:gpu 1 -ll:fsize 15000 -gex:bindcuda 0 -gex:batch 0 -tm:validate_cpu -ll:defalloc 0

@lightsighter
Contributor

Which GASNet conduit are you using? GASNet dropping inter-process communication when the processes are on the same node has been observed before with MPI on sapling; there's something wrong with sapling's configuration of shared-memory inter-process communication: #605 (comment). Run with one process per node and see if it still reproduces.

@lightsighter
Contributor

Never mind, you can ignore the previous comment; I see what is going wrong.

@rohany
Contributor Author

rohany commented Aug 6, 2021

Hey Mike, did you get a chance to fix this?

@lightsighter
Contributor

I started work on the fix tonight. It's unclear when I'll be able to get it done.

@rohany
Contributor Author

rohany commented Aug 16, 2021

I was looking into this a little bit today, and I think there has been a regression / bug somewhere in some of the changes you made. I had added the future-tracking loop described above so that the tasks launched in the loop would be serialized (they allocate a lot of memory, so they need to be serialized to avoid OOMs). However, on 8989cac, adding the future to the task launcher no longer appears to serialize the tasks -- a Legion Prof run with Legion Spy shows that Legion doesn't think there are any dependencies between the individual launches of task_5 (http://sapling.stanford.edu/~rohany/lassen-mttkrp/mttkrp-space-4/?start=7311444.180602007&end=7964719.882849198&collapseAll=false&resolution=10). When I go back to 60091fd, the tasks are serialized again, as seen in this profile: http://sapling.stanford.edu/~rohany/lassen-mttkrp/mttkrp-space-7/?start=2646595.919384058&end=4385787.523550726&collapseAll=false&resolution=10.

I'm not able to effectively bisect this, because your changes fixed different hangs in this code. However, some commits that could cause this are f9ddd84 and 8989cac.

I'm not sure of the best way to give you a repro for this (if you want one). It shows up either via manual inspection of profiles or as OOMs at large problem sizes. I can prepare one of these as a repro for you.

@lightsighter
Contributor

That's not a regression; it was actually a bug fix, and the runtime is working by design now. Adding data dependences in the machine-independent application code shouldn't be able to affect or slow down the process of mapping tasks or other operations; mappers should always be able to continue mapping into the future as far as possible in order to hide latencies. If you want to rate-limit how far you map into the future to manage memory consumption, then the way to do that is with the select_tasks_to_map mapper call. If you feel that is inadequate for your purposes, then you might talk to @manopapad and @streichler, who have been working on some ideas for extensions to the mapper interface to make it easier to do that. Adding data dependences in the application is not the right way to do it because it's not machine-independent (you might want to slow things down on one machine, but not on another one with more memory); that goes against Legion's philosophy, and therefore the runtime doesn't even allow you to get away with it.

@rohany
Contributor Author

rohany commented Aug 17, 2021

That's not a regression; it was actually a bug fix, and the runtime is working by design now. Adding data dependences in the machine-independent application code shouldn't be able to affect or slow down the process of mapping tasks or other operations; mappers should always be able to continue mapping into the future as far as possible in order to hide latencies.

Data dependencies should affect when tasks start to execute, though! If I add the result of a task t1 as a future dependence to task t2, how can t2 start executing before t1 is done? Does t2 need to explicitly wait on the future? (That seems less efficient than just starting once all the futures have triggered.)

The problem I'm having is not that the regions being mapped are too large -- I want the mapper to run ahead and map all of these tasks. The problem is that each task allocates a large DeferredBuffer, and I want each task to reuse that space instead of all of them running concurrently and allocating their own. Adding a data dependence to stop all of them from executing at the same time was a way of doing this.

If you want to rate limit how far you are mapping into the future to manage memory consumption then the way to do that is with the select_tasks_to_map mapper call. If you feel like that is inadequate for your purposes, then you might talk to @manopapad and @streichler who have been working on some ideas for extensions to the mapper interface to make it easier to do that.

I fell back to this data-dependence strategy because I've tried using the mapper for this in the past and found it quite difficult. You have to manage queues for the tasks in question within the mapper, and then trigger those tasks to start executing when prior tasks finish -- all of which are things the runtime is already doing. It felt very error-prone and potentially inefficient to try to reimplement things that Legion/Realm already know how to do.

@lightsighter
Contributor

Data dependencies should affect when tasks start to execute, though! If I add the result of a task t1 as a future dependence to task t2, how can t2 start executing before t1 is done?

If you think there is a bug in future dependence tracking then you should be able to make a trivial reproducer. That code is dirt simple and has been working for years at this point.

It felt very error prone and potentially inefficient to try and reimplement things that Legion/Realm already know how to do.

I strongly disagree with this point. It's not in any way obvious what the right rate to map tasks is because it should be tightly coupled to the machine on which the program is running, specifically the speeds of different processors on that machine and the amount of memory and other resources available to those processors. Realm and Legion only provide mechanisms for carrying out execution and have not the first clue what the right rate is to map tasks. If you think there is a universal algorithm that Legion and Realm could implement that would do exactly the right thing for mapping and scheduling tasks for every possible application on every possible machine ever devised both now and in the future then write it down so we can implement it.

@rohany
Contributor Author

rohany commented Aug 17, 2021

If you think there is a bug in future dependence tracking then you should be able to make a trivial reproducer. That code is dirt simple and has been working for years at this point.

To do this, I need to know what to expect. If I give a task t1 a future f via launcher.add_future(f), should t1 not begin execution until f is ready? Or does t1 need to just call f.wait() if it wants to have a dependence on f? If it's the latter then I was just using the system wrong. If it's the former then I can try to make a reproducer for you. After looking around the Legion repository I suspect it's the latter.

I strongly disagree with this point. It's not in any way obvious what the right rate to map tasks is because it should be tightly coupled to the machine on which the program is running, specifically the speeds of different processors on that machine and the amount of memory and other resources available to those processors. Realm and Legion only provide mechanisms for carrying out execution and have not the first clue what the right rate is to map tasks. If you think there is a universal algorithm that Legion and Realm could implement that would do exactly the right thing for mapping and scheduling tasks for every possible application on every possible machine ever devised both now and in the future then write it down so we can implement it.

Let me clarify. I agree that Realm/Legion don't know the best way to map / schedule things for you. What I meant by "things that Legion/Realm already know how to do" was the implementation of all the concurrency and dependence tracking that goes into scheduling these tasks so that they execute one after another (as well as the side point that they implement these things efficiently!). Having my mapper (at least with the current interface) try to do this involves making queues for these tasks to wait in within the mapper, communicating through reportProfiling when they finish, and then rescheduling the queued tasks as soon as one finishes. This was pretty difficult and felt like a breeding ground for errors -- not to mention brittle. Getting it right in the presence of multiple mappers (one per processor) and a larger application feels challenging. If there's an easier way to do this, then let me know! I agree that the application code is not the right place for this synchronization to occur, but it feels like the simplest and most bulletproof way of getting the behavior I want.

@lightsighter
Contributor

To do this, I need to know what to expect. If I give a task t1 a future f via launcher.add_future(f), should t1 not begin execution until f is ready? Or does t1 need to just call f.wait() if it wants to have a dependence on f?

It depends. If the chosen task variant is marked as an inner task then it is the latter, but if the chosen task variant is not inner then it is the former.

If it's the former then I can try to make a reproducer for you.

You should definitely try to do this. I very much doubt that there is a bug with this code. If there was, all sorts of tests in the CI would be failing.

was the implementation of all the concurrency and dependence tracking that goes into scheduling these tasks so that they execute one after another

Which they do, no exceptions. There's no way you can get incorrect results from your program by anything you do in the mapper.

Having my mapper (at least the with current interface) try to do this involves making queues for these tasks to wait in in the mapper, communicating through reportProfiling when they finish, and then rescheduling the queued tasks as soon as one finishes

You should only be needing to do this if you're trying to rate limit resource utilization, never for program correctness. Resource utilization is very machine- and schedule-dependent and therefore is definitely within the purview of the mapper's responsibilities.

If there's an easier way to do this, then let me know

Talk to @manopapad and @streichler

@rohany
Contributor Author

rohany commented Aug 17, 2021

You should definitely try to do this. I very much doubt that there is a bug with this code. If there was, all sorts of tests in the CI would be failing.

Will do. I'll post an issue if I get something working.

The correctness I'm referring to here is lower level than the ordering of tasks etc -- mainly deadlocks / races / memory errors within the implementation of this scheduling in the mapper itself.

@lightsighter
Contributor

The correctness I'm referring to here is lower level than the ordering of tasks etc -- mainly deadlocks / races / memory errors within the implementation of this scheduling in the mapper itself.

You buy into that with your choice of the mapper synchronization model. If you want atomicity from the runtime on mapper calls then you can get that and also choose whether preemption is allowed or not. Once you make that choice, then it's up to you to implement the mapper in a consistent way.

@rohany
Contributor Author

rohany commented Sep 3, 2021

Hey Mike, sorry to bug you, but I wanted to know your timeline on this bug -- I'm hoping to run the experiment that is blocked by it soon.

@lightsighter
Contributor

If you're referring to the deadlock that was occurring, then you can go ahead and try again.

@rohany
Contributor Author

rohany commented Sep 6, 2021

The deadlock looks like it's gone (on 8 nodes at least). However, I get a segfault when I go up to 16 nodes. Here's a backtrace:

[3] Thread 9 (Thread 0x20002ecdf8b0 (LWP 108089)):
[3] #0  0x00002000065de4e8 in waitpid () at ../sysdeps/unix/syscall-template.S:81
[3] #1  0x000020000654fd3c in do_system (line=0x13d559d0 <cmd.8118> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_2hPDEW '/g/g15/yadav2/taco/build/bin/mttkrp-cuda' 107851") at ../sysdeps/posix/system.c:148
[3] #2  0x0000200006550424 in __libc_system (line=0x13d559d0 <cmd.8118> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_2hPDEW '/g/g15/yadav2/taco/build/bin/mttkrp-cuda' 107851") at ../sysdeps/posix/system.c:189
[3] #3  0x0000200000105348 in system (line=<optimized out>) at pt-system.c:28
[3] #4  0x000000001303ca40 in gasneti_system_redirected ()
[3] #5  0x000000001303d218 in gasneti_bt_gdb ()
[3] #6  0x00000000130433f8 in gasneti_print_backtrace ()
[3] #7  0x0000000011112fb4 in gasneti_defaultSignalHandler ()
[3] #8  <signal handler called>
[3] #9  0x00002034d00006f8 in ?? ()
[3] #10 0x0000000011f2ba50 in Legion::Internal::EquivalenceSet::invalidate_trackers (this=0x2034d03b2550, mask=..., applied_events=std::set with 0 elements, origin_space=15, invalidate_mapping=0x0, filter_context=0x2034d09ab9c0) at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_analysis.cc:16447
[3] #11 0x0000000011f2c398 in Legion::Internal::EquivalenceSet::handle_invalidate_trackers (derez=..., runtime=0x16b25f80) at /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_analysis.cc:16516
[3] #12 0x000000001181e1cc in Legion::Internal::Runtime::handle_equivalence_set_invalidate_trackers (this=0x16b25f80, derez=...) at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:24510
[3] #13 0x00000000117e1fc4 in Legion::Internal::VirtualChannel::handle_messages (this=0x2034d01ec3d0, num_messages=1, runtime=0x16b25f80, remote_address_space=15, args=0x2034b0198510 "", arglen=116) at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:12631
[3] #14 0x00000000117e04f0 in Legion::Internal::VirtualChannel::process_message (this=0x2034d01ec3d0, args=0x2034b01984f4, arglen=136, runtime=0x16b25f80, remote_address_space=15) at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:11703
[3] #15 0x00000000117e2f58 in Legion::Internal::MessageManager::receive_message (this=0x2034d0173db0, args=0x2034b01984f0, arglen=144) at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:13133
[3] #16 0x0000000011822da0 in Legion::Internal::Runtime::process_message_task (this=0x16b25f80, args=0x2034b01984ec, arglen=148) at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:25578
[3] #17 0x000000001183dc44 in Legion::Internal::Runtime::legion_runtime_task (args=0x2034b01984e0, arglen=152, userdata=0x194dd7a0, userlen=8, p=...) at /g/g15/yadav2/taco/legion/legion/runtime/legion/runtime.cc:31003
[3] #18 0x00000000126d67e0 in Realm::LocalTaskProcessor::execute_task (this=0x16c37370, func_id=4, task_args=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/proc_impl.cc:1131
[3] #19 0x0000000012766bac in Realm::Task::execute_on_processor (this=0x2034b01987d0, p=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:306
[3] #20 0x000000001276cf9c in Realm::UserThreadTaskScheduler::execute_task (this=0x19497750, task=0x2034b01987d0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1646
[3] #21 0x000000001276a7dc in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x19497750) at /g/g15/yadav2/taco/legion/legion/runtime/realm/tasks.cc:1127
[3] #22 0x00000000127761cc in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x19497750) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:97
[3] #23 0x000000001278b394 in Realm::UserThread::uthread_entry () at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:1337
[3] #24 0x000020000655388c in makecontext () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/makecontext.S:137
[3] #25 0x0000000000000000 in ?? ()

@lightsighter
Contributor

Can you make a frozen process on sapling?

@rohany
Contributor Author

rohany commented Sep 6, 2021

I'm not having any luck reproducing the crash on Sapling. I can attach to the crashed process with GDB on Lassen and print out things that you are interested in seeing though. Let me know if that works / what you want to see.

@lightsighter
Contributor

Pull and try again.

@rohany
Contributor Author

rohany commented Sep 7, 2021

It looks like it's working for me now, thanks! I'll open a separate issue if anything else comes up.

@rohany rohany closed this as completed Sep 7, 2021