Add instant submission mode #1128

Merged
illuhad merged 22 commits into develop on Oct 18, 2023

Conversation

@illuhad
Collaborator

illuhad commented Sep 8, 2023

This adds a new task submission model: Instant submission. In instant submission, the entire existing scheduling and DAG construction logic is bypassed, and the node is instead immediately submitted to an inorder_executor.
This also allows us to circumvent the existing garbage collection and submission worker thread usage, which can make the behavior more predictable in scenarios with high CPU utilization due to user code.

The hope is that instant submission can reduce latencies in strong-scaling scenarios.

Instant submission is automatically activated when appropriate conditions are met:

  • task uses no buffers (USM-only)
  • task uses no SYCL 2020 reductions
  • task has no running dependencies on non-instant tasks
  • task is bound to a device
  • task is hard-wired to an in-order backend_executor, which happens e.g. when using a SYCL in-order queue.

Note that stdpar code always satisfies the conditions for instant submission.
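For illustration, here is a minimal sketch (not taken from this PR) of a submission pattern that meets these conditions: a USM-only task on a SYCL in-order queue bound to a device.

#include <sycl/sycl.hpp>

int main() {
  // In-order queue -> the task is hard-wired to an in-order backend_executor.
  sycl::queue q{{sycl::property::queue::in_order{}}};

  // USM only: no buffers, no SYCL 2020 reductions, no non-instant dependencies.
  int* data = sycl::malloc_device<int>(1024, q);
  q.parallel_for(sycl::range<1>{1024},
                 [=](sycl::id<1> i) { data[i] = static_cast<int>(i[0]); });
  q.wait();
  sycl::free(data, q);
}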

  • The current implementation is NOT THREAD-SAFE, as in instant mode, multiple user threads can in principle call directly into runtime internals, which were developed under the assumption that only the runtime worker thread would be using them. Before merging, we need to add all the mutexes.
  • coarse-grained events / queue completion events are not yet thread-safe when decaying to fine-grained events.
  • We should also consider whether we want to enable this automatically by default, as it is now, or whether it should - at least initially - be opt-in. In that case, the user would need to set a macro to enable this optimization for eligible tasks. Introducing an entirely new submission model makes it likely that some issues will pop up, so for stability reasons, it might be a good idea not to expose this to all code immediately.

CC @al42and @pszi1ard

@al42and
Contributor

al42and commented Sep 8, 2023

With 3056a88, the following code hangs in hipStreamSynchronize(); log attached below; ROCm 5.3.3.

#include <sycl/sycl.hpp>
#include <numeric> // for std::iota

using T = char;

int main()
{
    for (const auto& dev : sycl::device::get_devices())
    {
        sycl::queue q{ dev, { sycl::property::queue::in_order() } };
        char*       arr_d = sycl::malloc_device<T>(123, q);
        char*       arr_h = sycl::malloc_host<T>(123, q);

        std::iota(arr_h, arr_h + 123, 42);

        q.submit(sycl::property_list{ sycl::property::command_group::hipSYCL_coarse_grained_events() },
                 [&](sycl::handler& cgh) { cgh.memcpy(arr_d, arr_h, 123 * sizeof(T)); });

        q.submit(sycl::property_list{ sycl::property::command_group::hipSYCL_coarse_grained_events() },
                 [&](sycl::handler& cgh) { cgh.memcpy(arr_h, arr_d, 123 * sizeof(T)); });


        q.wait_and_throw();

        sycl::free(arr_d, q);
        sycl::free(arr_h, q);
    }
}

verbose.log

EDIT: fixed the code and the log.

@illuhad
Collaborator Author

illuhad commented Sep 8, 2023

Thanks, I've already seen this issue in CI. It seems to only happen for some backends. CUDA seems to be fine.

@illuhad
Collaborator Author

illuhad commented Sep 8, 2023

Seems like it incorrectly generates memcopies involving the CPU device for some reason while dispatching the work to the GPU, which cannot go well...

@al42and
Contributor

al42and commented Sep 9, 2023

The CPU copies are only there because the test is first executed on the CPU device. If I only run on a GPU, no CPU->CPU copies are executed.

@illuhad
Collaborator Author

illuhad commented Sep 11, 2023

@al42and This is not what I meant; the issue I observed was that in certain cases a CPU->CPU copy was dispatched to the GPU queue. However, I have since fixed this issue. It was limited to OpenCL and caused by a bug in the OpenCL backend's USM pointer query implementation, causing it to not correctly recognize the device attached to the device pointer (and thus assume the pointer is a non-USM host pointer).
The hang is independent of this, and unfortunately I don't know what causes it yet.

@illuhad
Collaborator Author

illuhad commented Sep 11, 2023

@al42and I think I found it; I was being stupid and did not correctly configure the nodes for submission. This caused the inorder_executor to assume that every task is from some unknown backend, and thus needs special synchronization...
Please try again :)

@illuhad
Collaborator Author

illuhad commented Sep 11, 2023

With this PR, for Lulesh with C++ standard parallelism offloading to NVIDIA in the CPU-bound regime, runtime goes down from 86s to 70s. It's now only 13s slower than nvc++ in this case, and most of these 13s can likely be attributed to SSCP JIT and runtime startup.

@illuhad
Collaborator Author

illuhad commented Sep 12, 2023

Whether instant submission is allowed is now controlled via the HIPSYCL_ALLOW_INSTANT_SUBMISSION macro.

In SYCL mode, instant submission is now opt-in, requiring that #define HIPSYCL_ALLOW_INSTANT_SUBMISSION 1 is present prior to including sycl.hpp. In stdpar mode, it is active by default.
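For example, an opted-in SYCL translation unit would start roughly like this (sketch; the macro must come before the include):

#define HIPSYCL_ALLOW_INSTANT_SUBMISSION 1 // opt-in, must precede sycl.hpp
#include <sycl/sycl.hpp>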

Once things are more mature and tested, we should consider also enabling it by default in SYCL.

@al42and
Contributor

al42and commented Sep 13, 2023

The preliminary results look good. All tests pass, and, for a small system (24k atoms), in GPU-resident mode we get a noticeable speed-up:

On my workstation, RTX3060: 383 (374) -> 397 ns/day.
On Dardel, MI250X: 542 (472) -> 585 ns/day.

We usually set HIPSYCL_RT_MAX_CACHED_NODES=0, but the numbers in () show the results without it, using the default caching behavior. So, in the "default" mode without any environment variables, the difference is much more dramatic. The CPU usage is also reduced.

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

@al42and Excellent, thanks for the feedback! How do we compare to HIP?

@al42and
Contributor

al42and commented Sep 14, 2023

Looking at a more pathological case (6k atoms), the picture gets weirder.

For example, we have three kernels launched almost back-to-back (they are in different modules, so there is some host activity in-between, but not too much).

Here's how native HIP looks (the first hipLaunchKernel corresponds to the yellow kernel in the lower right corner); it takes ~13µs to launch all kernels:

[screenshot: native HIP trace; ~13µs to launch all three kernels]

If we look at the old code (the version of hipSYCL without this patch, with HIPSYCL_RT_MAX_CACHED_NODES=0 set), it takes slightly longer, ~21µs:

[screenshot: trace without this patch, HIPSYCL_RT_MAX_CACHED_NODES=0; ~21µs]

Now, if we look at the instant submission case, things get much worse, 38µs:

[screenshot: trace with instant submission; ~38µs]

Not sure what's causing the slowdown, but it also lowers the overall application performance: 1073.321 ns/day with HIP, 781.733 with instant submission, 833.722 with the old submission logic and no caching in this case (running 5 GROMACS threads bound to an 8c/16t CCX, no reserved cores, HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0).

When the CPU is oversubscribed, though, this patch significantly improves performance (although that's an inadvisable way to run things to begin with).

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

Can you check with the following program? With it, instant submission is always substantially faster than the old mechanism for me (30-50%), with or without node caching:

// Change to enable / disable instant submission
#define HIPSYCL_ALLOW_INSTANT_SUBMISSION 0
#include <sycl/sycl.hpp>
#include <iostream>
#include <chrono>
#include <cstdint> // uint64_t
#include <cstddef> // std::size_t

class profiler_clock {
public:
  using rep = uint64_t;
  using period = std::nano;
  using duration = std::chrono::duration<rep, period>;
  using time_point = std::chrono::time_point<profiler_clock>;

  constexpr static bool is_steady = true;

  static time_point now() {
    return time_point{
        duration{std::chrono::steady_clock::now().time_since_epoch()}};
  }

  static std::size_t ns_ticks(const time_point& tp) {
    return tp.time_since_epoch().count();
  }

  static double seconds(const time_point& tp) {
    auto ticks = ns_ticks(tp);
    return static_cast<double>(ticks) /
           1.e9;
  }
};


int main() {
  sycl::queue q{
    {sycl::property::queue::in_order{},
     sycl::property::queue::hipSYCL_coarse_grained_events{}}
  };
  int* data = sycl::malloc_device<int>(1, q);
  // Run once to trigger any potential JIT
  q.single_task([=](){*data += 1;});
  
  double start = profiler_clock::seconds(profiler_clock::now());
  for(int i = 0; i < 100000; ++i)
    q.single_task([=](){*data += 1;});
  q.wait();
  double stop = profiler_clock::seconds(profiler_clock::now());
  
  std::cout << stop-start << " seconds" << std::endl;
  
  sycl::free(data, q);
}

The only case where I can see the old mechanism being faster is if you can benefit from it submitting asynchronously (either because it JITs, which is expensive, or because you can overlap with other work on the host).

@al42and
Contributor

al42and commented Sep 14, 2023

Normally, the "new" (instant) behavior is indeed faster:

$ ROCR_VISIBLE_DEVICES=4 ./1128_instant_on
0.32079 seconds
$ ROCR_VISIBLE_DEVICES=4 ./1128_instant_off
0.395244 seconds

But if we bind the process to the cores of the CCX corresponding to the GPU (EDIT: or any single CCX), the "old" case gets much faster:

$ ROCR_VISIBLE_DEVICES=4 hwloc-bind --cpubind core:1-7 ./1128_instant_on
0.318388 seconds
$ ROCR_VISIBLE_DEVICES=4 hwloc-bind --cpubind core:1-7 ./1128_instant_off
0.27572 seconds

The "old" case also sometimes crashes when run with HIPSYCL_RT_MAX_CACHED_NODES=0, but that's besides the point:

$ HIPSYCL_RT_MAX_CACHED_NODES=0 ROCR_VISIBLE_DEVICES=4 ./1128_instant_off
double free or corruption (fasttop)
free(): invalid pointer

Aborted
andreyal@nid002910:~$ HIPSYCL_RT_MAX_CACHED_NODES=0 ROCR_VISIBLE_DEVICES=4 gdb --args ./1128_instant_off
Reading symbols from ./1128_instant_off...
(gdb) r
(gdb) bt
#0  0x00000000006d51c0 in ?? ()
#1  0x00007ffff7fdebc9 in hipsycl::rt::dag_direct_scheduler::submit(std::shared_ptr<hipsycl::rt::dag_node>) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#2  0x00007ffff7fe5725 in hipsycl::rt::dag_unbound_scheduler::submit(std::shared_ptr<hipsycl::rt::dag_node>) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#3  0x00007ffff7fe6f65 in std::_Function_handler<void (), hipsycl::rt::dag_manager::flush_async()::$_0>::_M_invoke(std::_Any_data const&) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#4  0x00007ffff7fecbd6 in hipsycl::rt::worker_thread::work() () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#5  0x00007ffff62773b4 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x00007ffff54266ea in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff595aa8f in clone () from /lib64/libc.so.6

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

Ok... I cannot reproduce this behavior, even with hwloc-bind. What is the host CPU?

I also cannot reproduce the crash so far. Obviously, if some UB is going on there, it might be difficult to compare performance. Is ROCm LLVM involved here? If so, I'd suggest trying stock LLVM. It wouldn't be the first time ROCm LLVM miscompiles code.

@al42and
Contributor

al42and commented Sep 14, 2023

What is the host CPU?

AMD EPYC 7A53

Is ROCm LLVM involved here?

Yes... Also, it's ROCm 5.3.3, which has some runtime latency issues that the newer versions do not have.

If so, I'd suggest trying stock LLVM. It wouldn't be the first time ROCm LLVM miscompiles code.

I can try Cray LLVM, but last time I checked (~1 year ago), it had trouble compiling hipSYCL, so I wouldn't hold it up as a gold standard.

I can take a stab at building stock LLVM from scratch in the next few days.

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

AMD EPYC 7A53

Interesting, I can't find information on that one; is that a special LUMI CPU or did you mean 7453?
In any case, I don't have any Milan CPUs here for testing, but I can try to get access to some.

I can try Cray LLVM, but last time I checked (~1 year ago), it had trouble compiling hipSYCL, so I wouldn't hold it up as a gold standard.
I can take a stab at building stock LLVM from scratch in the next few days.

Is there perhaps any container solution available (e.g. singularity or charliecloud)? That would help with reproducibility.

@al42and
Contributor

al42and commented Sep 15, 2023

A quick-and-dirty annotation with rocTX was not particularly enlightening, mostly due to 1µs resolution and, apparently, the fact that the overhead is spread between multiple locations. But perhaps @illuhad can gain some insights?

"Submit GPU Work" markers are right around the queue.submit(...) call, the others correspond to hipSYCL functions:

[screenshot: rocTX-annotated timeline around the queue.submit(...) calls]

@illuhad
Collaborator Author

illuhad commented Sep 15, 2023

Interesting...

The gap between the end of hip_queue::submit_kernel and the end of inorder_executor::submit_directly is likely the coarse-grained event construction as really nothing else happens after the kernel submission. Due to SYCL's automatic memory management semantics this has to be a shared_ptr, so my guess is that the memory allocation on the heap is the culprit.

There's also memory allocation before the launch (to construct the DAG node). That might play a role in what is going on prior to submission inside create_task().

The queue will also copy the queue configuration into the handler (which is an std::vector, which then gets processed and copied into the DAG node). I've wanted to experiment with small vector optimizations here for a long time...

Apart from that there's really not much that is happening.

For most cases this might not play a large role, but if the comparison is direct calls to hipLaunchKernel with trivial kernels, you might see all that.

@al42and
Contributor

al42and commented Sep 16, 2023

The gap between the end of hip_queue::submit_kernel and the end of inorder_executor::submit_directly is likely the coarse-grained event construction as really nothing else happens after the kernel submission.

One "tick" along the X axis is 1µs = profiler resolution, so we should be careful when interpreting such single events. But there's clearly a lot happening before the HIP API is called. Here's the full trace:
8390_instant_on.json.gz

For most cases this might not play a large role, but if the comparison is direct calls to hipLaunchKernel with trivial kernels, you might see all that.

It will also be an issue if we have CPU work we want to overlap with the GPU. When we have a lot of cores, having a separate launcher thread would be a good solution, perhaps even better than native HIP. But for cases like LUMI, where we have 8c/8t per GPU, with one core reserved by the scheduler (presumably for the HSA thread), the CPU compute tasks only have 7 cores, and yielding one of them to the worker thread would not be great; similarly, delaying the compute tasks with an inefficient same-thread submit is bad, and delaying the communication can be even worse.

So, I agree that most applications might not notice a difference. But it's not limited to 1µs-long kernels either.

@illuhad
Collaborator Author

illuhad commented Sep 16, 2023

It will also be an issue if we have CPU work we want to overlap with the GPU. When we have a lot of cores, having a separate launcher thread would be a good solution, perhaps even better than native HIP. But for cases like LUMI, where we have 8c/8t per GPU, with one core reserved by the scheduler (presumably for the HSA thread), the CPU compute tasks only have 7 cores, and yielding one of them to the worker thread would not be great; similarly, delaying the compute tasks with an inefficient same-thread submit is bad, and delaying the communication can be even worse.

Agreed. It sounds like we may need two different modes:

  • Synchronous instant submission, which requires optimizing the submission path (likely including moving to small vector classes instead of std::vector and similar optimizations)
  • Asynchronous instant submission where a worker thread is involved (Or at least we should investigate the benefit of this - maybe not much is to be gained once the submission path is sufficiently optimized)

Depending on the use case, one or the other might be chosen by the client application and/or the runtime.
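A rough sketch of what such a mode choice could look like (all names hypothetical, not actual runtime API; std::async stands in for a dedicated worker thread):

#include <functional>
#include <future>

enum class instant_mode { synchronous, asynchronous };

std::future<void> dispatch(instant_mode mode, std::function<void()> submit) {
  if (mode == instant_mode::synchronous) {
    submit();                 // lowest launch latency, runs on the calling thread
    std::promise<void> done;
    done.set_value();
    return done.get_future();
  }
  // Asynchronous: a worker performs the submission so the caller can overlap
  // host work; std::async stands in for the runtime's worker thread here.
  return std::async(std::launch::async, std::move(submit));
}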

@al42and
Contributor

al42and commented Sep 17, 2023

The "old" case also sometimes crashes when run with HIPSYCL_RT_MAX_CACHED_NODES=0, but that's besides the point:

That happens even without this PR, so it's not a blocker here. I'll submit another issue once I get a better idea of what happens there. Valgrind points at a long stack of recursive void hipsycl::rt::(anonymous namespace)::descend_requirement_tree calls, and the randomness hints at a data race.

@illuhad
Collaborator Author

illuhad commented Sep 18, 2023

That happens even without this PR, so it's not a blocker here. I'll submit another issue once I get a better idea of what happens there. Valgrind points at a long stack of recursive void hipsycl::rt::(anonymous namespace)::descend_requirement_tree calls, and the randomness hints at a data race.

Thanks for the confirmation, I had expected this. This PR does not change the non-instant submission code paths at all, so I would have been very surprised if this had changed things.
The randomness could also be due to the DAG descent ignoring nodes that have been marked as completed. So it could also depend on whether an operation has already finished and the GC has noticed this.

@pszi1ard

pszi1ard commented Sep 18, 2023

* Synchronous instant submission, which requires optimizing the submission path (likely including moving to small vector classes instead of `std::vector` and similar optimizations)

* Asynchronous instant submission where a worker thread is involved (Or at least we should investigate the benefit of this - maybe not much is to be gained once the submission path is sufficiently optimized)

Depending on the use case, one or the other might be chosen by the client application and/or the runtime.

The interesting question is what the CPU/GPU latency tradeoff is with both schemes fully optimized, e.g. by how much GPU task launches are delayed by async instant submission, whether/when this is outweighed by the benefit of overlapping some GPU API overhead, and whether this has an impact on the critical path of an application. I think making an automated choice will be very hard without knowing what the application does, but there may be cases that are easy to identify (e.g. pathological cases of high GPU API overhead).

@illuhad
Collaborator Author

illuhad commented Sep 19, 2023

I agree. I guess for simple cases we can have some autotuning code path where we make some measurements of the latency to offload to another thread versus just submitting synchronously, but there should also be some mechanism to let expert users enforce a particular behavior.
As I said, in any case I think we need to first examine how much performance we can extract from the synchronous submission code path to check whether it's even worth the effort to offload to another thread.

@al42and
Contributor

al42and commented Sep 26, 2023

I haven't analyzed it in depth, but what if, instead of analyzing all the dependencies of a task, we track the possibility of instant submission via a flag on the queue?

A task can be instantly submitted iff:

  • task uses no buffers (USM-only)
  • task uses no SYCL 2020 reductions
  • task has no running dependencies on non-instant tasks
  • task is bound to a device
  • task is hard-wired to an in-order backend_executor, which happens e.g. when using a SYCL in-order queue.
  • task has no explicit dependencies (handler::depends_on directly or via shortcuts)
  • task is submitted to an "instant" queue.

Conditions on a queue being "instant":

  • An out-of-order queue is always "non-instant";
  • An in-order queue starts "instant";
  • An in-order queue becomes "non-instant" if a non-instant task is submitted into it (i.e., one with buffers, reductions, or dependencies)
  • An in-order queue becomes "instant" again after queue::wait() is called (there might be some edge-cases here, so perhaps it can never revert back to being instant).

This way, we spare ourselves the need to walk through all the dependencies before submission, hopefully cutting away some overhead.
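A rough sketch of the queue-side tracking described above (all names hypothetical, not hipSYCL runtime API):

// Tracks whether submissions to this queue may take the instant path.
struct queue_instant_state {
  bool in_order = false;
  bool instant = false;

  explicit queue_instant_state(bool is_in_order)
      : in_order{is_in_order}, instant{is_in_order} {} // out-of-order: never instant

  void on_submit(bool task_is_instant_eligible) {
    if (!task_is_instant_eligible)
      instant = false; // buffers, reductions or explicit deps demote the queue
  }

  void on_wait() {
    if (in_order)
      instant = true; // optional: re-promote after queue::wait()
  }
};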

@illuhad
Collaborator Author

illuhad commented Sep 26, 2023

I suspect that would not help much. For an in-order queue, the number of dependencies is typically 0 or 1, so there won't be much difference compared to what you suggest. I really don't think that walking the dependencies itself is what costs here.

@al42and
Contributor

al42and commented Sep 26, 2023

A quick-and-dirty annotation with rocTX was not particularly enlightening, mostly due to 1µs resolution and, apparently, the fact that the overhead is spread between multiple locations.

Perfetto is to blame; the JSONs generated by roctrace have proper nanosecond resolution (note: profiler overhead is unknown): 8390_instant_on.json.gz

Below is not a proper analysis, but an average of three randomly-selected submission events. "Delay at start" is the delay between the start of this function and the start of the function below, in nanoseconds. "Delay at end" is the delay between the end of the function below and the start of this one. For "submit_kernel", the "function below" is the trio of HIP API calls. Times are measured by manually placed markers (see code linked above). A lot of variability in total submission times comes from hipLaunchKernel (3-5 µs), but the overheads are more consistent.

Region                       Delay at start (ns)   Delay at end (ns)
Submit GPU work              150                   1112
queue::submit                778                   201
queue::execute_submission    1216                  137
handler::create_task         1757                  93
submit_directly              547                   969
submit_kernel                1149                  284

So the conclusion still holds: the overhead is annoyingly spread out across multiple places :(

@illuhad
Collaborator Author

illuhad commented Sep 26, 2023

Some things I'd like to try:

  • Use bitmasks/bitmap in execution_hints for very efficiently querying whether a specific hint is there (very easy)
  • Use a small vector implementation instead of std::vector for execution hints and requirement_list and dag node dependencies (easy, but a bit more involved)
  • Spawn dag nodes and operations from object pools. (This is probably a larger change, so might be a project for later)
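For the first bullet, a minimal sketch of what a bitmask-based hint query could look like (hint names hypothetical):

#include <cstdint>

enum class hint_bit : uint32_t {
  bind_to_device      = 1u << 0,
  prefer_executor     = 1u << 1,
  coarse_grained_sync = 1u << 2,
};

struct execution_hint_mask {
  uint32_t bits = 0;
  void set(hint_bit b) { bits |= static_cast<uint32_t>(b); }
  bool has(hint_bit b) const { return (bits & static_cast<uint32_t>(b)) != 0u; }
};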

I'm officially on vacation until Oct 4, so my ability to do this at the moment is limited.

@al42and
Contributor

al42and commented Oct 4, 2023

Spawn dag nodes and operations from object pools. (This is probably a larger change, so might be a project for later)

Perhaps using PMR with (un)synchronized_pool_resource would be easier? It should still cut down the overhead of memory allocations significantly without needing to change the code much beyond passing the allocator around.
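A minimal sketch of that idea, assuming a single-threaded (synchronous instant) submission path so the unsynchronized pool is safe:

#include <memory_resource>
#include <vector>

int main() {
  // Allocations are carved out of the pool instead of hitting the global heap.
  std::pmr::unsynchronized_pool_resource pool;
  std::pmr::vector<int> requirements{&pool}; // stand-in for per-node storage
  for (int i = 0; i < 4; ++i)
    requirements.push_back(i);
}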

@illuhad
Collaborator Author

illuhad commented Oct 9, 2023

Now depends on #1178. On my system with gfx906, the instant submission test program above is now ~10% faster (it was, however, always substantially faster than non-instant submission for me, so YMMV).

One thing I noticed is that if Gromacs sets the hipSYCL_coarse_grained_events property per submission, I would recommend constructing the property list once and then reusing it, as constructing a property_list also requires dynamic memory allocation. This is hard to avoid, because the interface that SYCL has for property_list requires some form of type erasure. Once you have constructed it, though, it can be passed to other submissions with little cost.
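A sketch of the reuse pattern (constructing the list once; the property name and q.submit(property_list, cgf) overload are taken from the example code earlier in this thread):

#include <sycl/sycl.hpp>

// Constructed once, reused for every submission.
static const sycl::property_list submission_props{
    sycl::property::command_group::hipSYCL_coarse_grained_events{}};

int main() {
  sycl::queue q{{sycl::property::queue::in_order{}}};
  int* data = sycl::malloc_device<int>(1, q);
  q.submit(submission_props,
           [&](sycl::handler& cgh) { cgh.single_task([=]() { *data = 1; }); });
  q.wait();
  sycl::free(data, q);
}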

@al42and
Contributor

al42and commented Oct 9, 2023

One thing I noticed is that if Gromacs sets the hipSYCL_coarse_grained_events property per submission, I would recommend constructing the property list once and then reusing it, as constructing a property_list also requires dynamic memory allocation. This is hard to avoid, because the interface that SYCL has for property_list requires some form of type erasure. Once you have constructed it, though, it can be passed to other submissions with little cost.

Thanks for the suggestion, we'll look into that.

BTW, do you know of any reason there is no constexpr constructor for property_list in the standard? Some properties, of course, might require runtime info, but currently the standard forbids constexpr construction of even an empty property list. Here, we can live with static const, so I'm just curious.

@illuhad
Collaborator Author

illuhad commented Oct 9, 2023

BTW, do you know of any reason there is no constexpr constructor for property_list in the standard? Some properties, of course, might require runtime info, but currently the standard forbids constexpr construction of even an empty property list. Here, we can live with static const, so I'm just curious.

property_list was already there in SYCL 1.2.1, so I suspect it was originally designed before constexpr was a priority.
Performance-wise, an empty property list should not perform any operations either (or at least not more than default-constructing std::vector), but of course there are some additional semantic implications of constexpr.

Intel is working on adding support for compile-time properties to property_list, I assume constexpr support would also come as part of this.

illuhad marked this pull request as ready for review October 13, 2023 20:39
@illuhad
Collaborator Author

illuhad commented Oct 18, 2023

Given that this is opt-in for now, I think we should merge, since for some use cases it will provide a performance-boosting option.

illuhad merged commit c744650 into develop Oct 18, 2023
38 of 40 checks passed
illuhad deleted the feature/instant-submission branch October 18, 2023 22:31