Add instant submission mode #1128
Conversation
With 3056a88, the following code hangs:

```cpp
#include <sycl/sycl.hpp>
#include <numeric> // for std::iota

using T = char;

int main()
{
    for (const auto& dev : sycl::device::get_devices())
    {
        sycl::queue q{ dev, { sycl::property::queue::in_order() } };
        T* arr_d = sycl::malloc_device<T>(123, q);
        T* arr_h = sycl::malloc_host<T>(123, q);
        std::iota(arr_h, arr_h + 123, 42);
        q.submit(sycl::property_list{ sycl::property::command_group::hipSYCL_coarse_grained_events() },
                 [&](sycl::handler& cgh) { cgh.memcpy(arr_d, arr_h, 123 * sizeof(T)); });
        q.submit(sycl::property_list{ sycl::property::command_group::hipSYCL_coarse_grained_events() },
                 [&](sycl::handler& cgh) { cgh.memcpy(arr_h, arr_d, 123 * sizeof(T)); });
        q.wait_and_throw();
        sycl::free(arr_d, q);
        sycl::free(arr_h, q);
    }
}
```

EDIT: fixed the code and the log.
Thanks, I've already seen this issue in CI. It seems to only happen for some backends. CUDA seems to be fine.
Seems like for some reason it incorrectly generates memcopies involving the CPU device while dispatching the work to the GPU, which cannot go well...
The CPU copies are only there because the test is first executed on the CPU device. If I only run on a GPU, no CPU->CPU copies are executed.
@al42and This is not what I meant; the issue I observed was that in certain cases a CPU->CPU copy was dispatched to the GPU queue. However, I have since fixed this issue. It was limited to OpenCL and caused by a bug in the OpenCL backend's USM pointer query implementation, causing it to not correctly recognize the device attached to a device pointer (and thus to assume the pointer is a non-USM host pointer).
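For context, a hedged sketch of the kind of USM pointer query involved (using the cl_intel_unified_shared_memory extension; the helper below is illustrative, and in real code the extension entry point is fetched via clGetExtensionFunctionAddressForPlatform):

```cpp
#include <CL/cl_ext.h>

// If this query reports CL_MEM_TYPE_UNKNOWN_INTEL for a device allocation,
// the runtime would wrongly treat the pointer as a plain host pointer,
// which matches the bug described above.
cl_unified_shared_memory_type_intel
query_usm_type(clGetMemAllocInfoINTEL_fn get_mem_alloc_info,
               cl_context ctx, const void* ptr) {
  cl_unified_shared_memory_type_intel type = CL_MEM_TYPE_UNKNOWN_INTEL;
  get_mem_alloc_info(ctx, ptr, CL_MEM_ALLOC_TYPE_INTEL,
                     sizeof(type), &type, nullptr);
  return type;  // CL_MEM_TYPE_{HOST,DEVICE,SHARED}_INTEL for USM pointers
}
```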
@al42and I think I found it; I was being stupid and did not correctly configure the nodes for submission. This has caused the …
With this PR, for Lulesh with C++ standard parallelism offloading to NVIDIA in the CPU-bound regime, runtime goes down from 86s to 70s. It's now only 13s slower than nvc++ in this case, and most of these 13s can likely be attributed to SSCP JIT and runtime startup.
Whether instant submission is allowed is now controlled via the `HIPSYCL_ALLOW_INSTANT_SUBMISSION` macro. In SYCL mode, instant submission is now opt-in, requiring that the macro be defined to a non-zero value. Once things are more mature and tested, we should consider also enabling it by default in SYCL.
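For illustration, a sketch of the opt-in (the macro and queue properties are taken from the test program later in this thread; treat the exact shape of the condition as an assumption):

```cpp
// Define before including the SYCL header; 0 disables, nonzero enables
// (per the test program below).
#define HIPSYCL_ALLOW_INSTANT_SUBMISSION 1
#include <sycl/sycl.hpp>

int main() {
  // An in-order queue with coarse-grained events, as in the benchmark below.
  sycl::queue q{{sycl::property::queue::in_order{},
                 sycl::property::queue::hipSYCL_coarse_grained_events{}}};
  int* p = sycl::malloc_device<int>(1, q);
  q.single_task([=] { *p = 42; });  // eligible for the instant submission path
  q.wait();
  sycl::free(p, q);
}
```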
The preliminary results look good. All tests pass, and for a small system (24k atoms) in GPU-resident mode we get a noticeable speed-up on my workstation (RTX 3060): 383 (374) -> 397 ns/day. We usually set `HIPSYCL_RT_MAX_CACHED_NODES=0`.
@al42and Excellent, thanks for the feedback! How do we compare to HIP?
Looking at a more pathological case (6k atoms), the picture gets weirder. For example, we have three kernels launched almost back-to-back (they are in different modules, so there is some host activity in between, but not too much).

Here's how native HIP looks (the first hipLaunchKernel corresponds to the yellow kernel in the lower right corner); it takes ~13µs to launch all kernels:

If we look at the old code, the version of hipSYCL without this patch with node caching disabled:

Now, if we look at the instant submission case, things get much worse, 38µs:

Not sure what's causing the slowdown, but it also lowers the overall application performance: 1073.321 ns/day with HIP, 781.733 with instant submission, 833.722 with the old submission logic and no caching in this case (running 5 GROMACS threads bound to an 8c/16t CCX, no reserved cores, …).

When the CPU is oversubscribed, though, this patch significantly improves performance (although that's an inadvisable way to run things to begin with).
Can you check with the following program? With this program, instant submission is always substantially faster than the old mechanism for me (30-50%), either with or without node caching:

```cpp
// Change to enable / disable instant submission
#define HIPSYCL_ALLOW_INSTANT_SUBMISSION 0

#include <sycl/sycl.hpp>

#include <chrono>
#include <cstddef>  // std::size_t
#include <cstdint>  // uint64_t
#include <iostream>

class profiler_clock {
public:
  using rep = uint64_t;
  using period = std::nano;
  using duration = std::chrono::duration<rep, period>;
  using time_point = std::chrono::time_point<profiler_clock>;

  constexpr static bool is_steady = true;

  static time_point now() {
    return time_point{
        duration{std::chrono::steady_clock::now().time_since_epoch()}};
  }

  static std::size_t ns_ticks(const time_point& tp) {
    return tp.time_since_epoch().count();
  }

  static double seconds(const time_point& tp) {
    auto ticks = ns_ticks(tp);
    return static_cast<double>(ticks) / 1.e9;
  }
};

int main() {
  sycl::queue q{
      {sycl::property::queue::in_order{},
       sycl::property::queue::hipSYCL_coarse_grained_events{}}};

  int* data = sycl::malloc_device<int>(1, q);
  // Run once to trigger any potential JIT
  q.single_task([=](){ *data += 1; });

  double start = profiler_clock::seconds(profiler_clock::now());
  for(int i = 0; i < 100000; ++i)
    q.single_task([=](){ *data += 1; });
  q.wait();
  double stop = profiler_clock::seconds(profiler_clock::now());

  std::cout << stop - start << " seconds" << std::endl;
  sycl::free(data, q);
}
```
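For context, one plausible way to build the two binaries compared below (assuming hipSYCL's syclcc driver and a gfx90a target; the exact flags depend on the installation, and the "off" variant is produced by flipping the macro at the top of the program):

```
$ syclcc -O3 --hipsycl-targets=hip:gfx90a bench.cpp -o 1128_instant_on
```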
The only case where I can see the old mechanism being faster is if you can benefit from it submitting asynchronously (either because it JITs, which is expensive, or because you can overlap with other work on the host).
Normally, the "new" (instant) behavior is indeed faster:

```
$ ROCR_VISIBLE_DEVICES=4 ./1128_instant_on
0.32079 seconds
$ ROCR_VISIBLE_DEVICES=4 ./1128_instant_off
0.395244 seconds
```

But if we bind to the cores of the CCX corresponding to the GPU (EDIT: or any single CCX), the "old" case gets much faster:

```
$ ROCR_VISIBLE_DEVICES=4 hwloc-bind --cpubind core:1-7 ./1128_instant_on
0.318388 seconds
$ ROCR_VISIBLE_DEVICES=4 hwloc-bind --cpubind core:1-7 ./1128_instant_off
0.27572 seconds
```

The "old" case also sometimes crashes when run with `HIPSYCL_RT_MAX_CACHED_NODES=0`:

```
$ HIPSYCL_RT_MAX_CACHED_NODES=0 ROCR_VISIBLE_DEVICES=4 ./1128_instant_off
double free or corruption (fasttop)
free(): invalid pointer
Aborted
andreyal@nid002910:~$ HIPSYCL_RT_MAX_CACHED_NODES=0 ROCR_VISIBLE_DEVICES=4 gdb --args ./1128_instant_off
Reading symbols from ./1128_instant_off...
(gdb) r
(gdb) bt
#0  0x00000000006d51c0 in ?? ()
#1  0x00007ffff7fdebc9 in hipsycl::rt::dag_direct_scheduler::submit(std::shared_ptr<hipsycl::rt::dag_node>) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#2  0x00007ffff7fe5725 in hipsycl::rt::dag_unbound_scheduler::submit(std::shared_ptr<hipsycl::rt::dag_node>) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#3  0x00007ffff7fe6f65 in std::_Function_handler<void (), hipsycl::rt::dag_manager::flush_async()::$_0>::_M_invoke(std::_Any_data const&) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#4  0x00007ffff7fecbd6 in hipsycl::rt::worker_thread::work() () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#5  0x00007ffff62773b4 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x00007ffff54266ea in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff595aa8f in clone () from /lib64/libc.so.6
```
Ok... I cannot reproduce this behavior, even with core binding. I also cannot reproduce the crash so far. Obviously, if some UB is going on there, then it might be difficult to compare performance. Is ROCm LLVM involved here? If so, I'd suggest trying stock LLVM. Wouldn't be the first time ROCm LLVM miscompiles code.
AMD EPYC 7A53
Yes... Also, it's ROCm 5.3.3, which has some runtime latency issues that the newer versions don't have.
I can try Cray LLVM, but last time I checked (~1 year ago), it had trouble compiling hipSYCL, so I wouldn't hold it up as a gold standard. I can take a stab at building stock LLVM from scratch in the coming days.
Interesting, I can't find information on that one; is that a special LUMI CPU or did you mean 7453?
Is there perhaps any container solution available (e.g. singularity or charliecloud)? That would help with reproducibility.
A quick-and-dirty annotation with rocTX was not particularly enlightening, mostly due to the 1µs resolution and, apparently, the fact that the overhead is spread between multiple locations. But perhaps @illuhad can gain some insights? The "Submit GPU Work" markers are right around the …
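For reference, the kind of quick-and-dirty annotation meant here (assuming ROCm's rocTX API from roctx.h, linked with -lroctx64; the wrapped function is a stand-in):

```cpp
#include <roctx.h>

void submit_gpu_work() {
  roctxRangePushA("Submit GPU Work");  // marker shows up in the roctracer timeline
  // ... the queue submission call being measured ...
  roctxRangePop();
}
```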
Interesting... The gap between the end of … There's also memory allocation before the launch (to construct the DAG node); that might play a role in what is going on prior to submission. The queue will also copy the queue configuration into the handler. Apart from that, there's really not much that is happening. For most cases this might not play a large role, but it might if the comparison is direct calls to the HIP API.
One "tick" along the X axis is 1µs = profiler resolution, so we should be careful when interpreting such single events. But there's clearly a lot happening before the HIP API is called. Here's the full trace:
It will also be an issue if we have CPU work we want to overlap with the GPU. When we have a lot of cores, having a separate launcher thread would be a good solution, perhaps even better than native HIP. But for cases like LUMI, where we have 8c/8t per GPU with one core reserved by the scheduler (presumably for the HSA thread), the CPU compute tasks only have 7 cores, and yielding one of them to the worker thread would not be great; similarly, delaying the compute tasks with an inefficient same-thread submit is bad, and delaying the communication can be even worse. So, I agree that most applications might not notice a difference. But it's not limited to 1µs-long kernels either.
Agreed. It sounds like we may need two different modes:

- a low-latency mode that submits synchronously on the calling thread, and
- an asynchronous mode that offloads submission to a worker thread so it can overlap with other host work.
Depending on the use case, one or the other might be chosen by the client application and/or the runtime.
That happens even without this PR, so it's not a blocker here. I'll submit a separate issue once I get a better idea of what happens there. Valgrind points at a long stack of recursive …
Thanks for the confirmation, I had expected this. This PR does not change the non-instant submission code paths at all, so I would have been very surprised if this had changed things.
The interesting question is what the CPU/GPU latency tradeoff is with both schemes fully optimized: e.g., by how much GPU task launches are delayed by async instant submission, whether/when this is outweighed by the benefit of overlapping some GPU API overhead, and whether this has an impact on the critical path of an application. I think making an automated choice will be very hard without knowing what the application does, but there may be cases that are easy to identify (e.g. pathological cases of high GPU API overhead).
I agree. I guess for simple cases we could have some autotuning code path where we measure the latency of offloading to another thread versus just submitting synchronously, but there should also be some mechanism to let expert users enforce a particular behavior.
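A minimal sketch (not hipSYCL code; all names are made up) of the kind of measurement such an autotuning path could make: compare the cost of calling a stand-in submission function directly on this thread versus handing it to a worker thread.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class worker {
  std::queue<std::function<void()>> tasks_;
  std::mutex m_;
  std::condition_variable cv_;
  bool done_ = false;
  std::thread t_{[this] { run(); }};

  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock lk{m_};
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // done_ set and queue drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

public:
  void submit(std::function<void()> f) {
    { std::lock_guard lk{m_}; tasks_.push(std::move(f)); }
    cv_.notify_one();
  }
  ~worker() {
    { std::lock_guard lk{m_}; done_ = true; }
    cv_.notify_one();
    t_.join();
  }
};

int main() {
  constexpr int n = 100000;
  auto fake_submit = [] { /* stand-in for the backend submission call */ };

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) fake_submit();
  auto t1 = std::chrono::steady_clock::now();

  worker w;
  auto t2 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) w.submit(fake_submit);
  // Only the enqueue cost is timed; the worker may still be draining tasks.
  auto t3 = std::chrono::steady_clock::now();

  std::cout << "direct submit:  "
            << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            << "worker offload: "
            << std::chrono::duration<double>(t3 - t2).count() << " s\n";
}
```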
I haven't analyzed it in depth, but what if, instead of analyzing all the dependencies of a task, we track the possibility of instant submission via a flag on the queue? A task can be instantly submitted iff: …

Conditions on a queue being "instant": …

This way, we spare ourselves the need to walk through all the dependencies before submission, hopefully cutting away some overhead (a rough sketch follows below).
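A rough sketch of the proposed bookkeeping (hypothetical names, not hipSYCL code): the queue carries the flags, so per-task dependency walks can be skipped.

```cpp
// Hypothetical sketch of the proposal above: track instant-submission
// eligibility on the queue itself instead of walking each task's dependencies.
struct queue_state {
  bool is_in_order = true;            // e.g. created with the in_order property
  bool coarse_grained_events = true;  // e.g. hipSYCL_coarse_grained_events
  bool all_prior_ops_instant = true;  // updated on every submission

  // A task on this queue may take the instant path iff the queue qualifies
  // and everything previously submitted to it also went down that path.
  bool can_submit_instantly() const {
    return is_in_order && coarse_grained_events && all_prior_ops_instant;
  }

  void record_submission(bool was_instant) {
    all_prior_ops_instant &= was_instant;
  }
};
```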
I suspect that would not help much. For an in-order queue, the number of dependencies is typically 0 or 1, so there won't be much difference from what you suggest. I really don't think that walking the dependencies itself is what is costly here.
Perfetto is to blame; the JSONs generated by roctracer have proper nanosecond resolution (note: the profiler overhead is unknown): 8390_instant_on.json.gz

Below is not a proper analysis, but an average over three randomly-selected submission events. "Delay at start" is the delay between the start of this function and the start of the function below it, in nanoseconds. "Delay at end" is the delay between the end of the function below and the end of this one. For "submit_kernel", the "function below" is the trio of HIP API calls. Times are measured by manually placed markers (see the code linked above). A lot of the variability in total submission times comes from …
So the conclusion still holds: the overhead is annoyingly spread out between multiple places :(
Some things I'd like to try: …

I'm officially on vacation until Oct 4, so my ability to do this at the moment is limited.
Perhaps using PMR with (un)synchronized_pool_resource would be easier? It should still cut down the overhead of memory allocations significantly without needing to change the code much beyond passing the allocator around.
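A minimal sketch of that suggestion (the dag_node type here is a stand-in, not the runtime's actual class):

```cpp
#include <memory_resource>
#include <vector>

struct dag_node { /* stand-in for the runtime's DAG node type */ };

int main() {
  // unsynchronized_pool_resource assumes single-threaded use; the
  // synchronized_pool_resource variant is thread-safe.
  std::pmr::unsynchronized_pool_resource pool;
  std::pmr::polymorphic_allocator<dag_node> alloc{&pool};

  // Repeated small allocations are served from the pool, not the global heap.
  dag_node* n = alloc.allocate(1);
  alloc.construct(n);
  // ... use the node ...
  n->~dag_node();
  alloc.deallocate(n, 1);

  // Containers can draw from the same pool by just passing the resource:
  std::pmr::vector<dag_node> nodes{&pool};
  nodes.resize(100);
}
```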
Now depends on #1178. On my system with gfx906, the instant submission test program above is now ~10% faster (it was, however, always substantially faster than non-instant submission for me, so YMMV). One thing I noticed is that if Gromacs sets the …
Thanks for the suggestion, we'll look into that. BTW, do you know of any reason there is no …
Intel is working on adding support for compile-time properties to SYCL.
Given that this is opt-in for now, I think we should merge, since for some use cases it will provide a performance-boosting option.
This adds a new task submission model: instant submission. In instant submission, the entire existing scheduling and DAG construction logic is bypassed, and the node is instead immediately submitted to an inorder_executor. This also allows us to circumvent the existing garbage collection and submission worker thread usage, which can make the behavior more predictable in scenarios with high CPU utilization due to user code.

The hope is that instant submission can reduce latencies in strong-scaling scenarios.

Instant submission is automatically activated when appropriate conditions are met; among them, the operation must target a backend_executor, which happens e.g. when using a SYCL in-order queue. Note that stdpar code always satisfies the conditions for instant submission.
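For illustration, "stdpar code" here means standard C++ parallel algorithms such as the following, which the stdpar offloading mode intercepts; per the above, such submissions always qualify for the instant path:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<float> v(1 << 20, 1.0f);
  // A parallel algorithm that stdpar offloading can map to a GPU kernel.
  std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                [](float& x) { x *= 2.0f; });
}
```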
CC @al42and @pszi1ard