Add instant submission mode #1128

Merged
illuhad merged 22 commits into develop on Oct 18, 2023

Conversation

@illuhad
Collaborator

illuhad commented Sep 8, 2023

This adds a new task submission model: Instant submission. In instant submission, the entire existing scheduling and DAG construction logic is bypassed, and the node is instead immediately submitted to an inorder_executor.
This also allows us to circumvent the existing garbage collection and submission worker thread usage, which can make the behavior more predictable in scenarios with high CPU utilization due to user code.

The hope is that instant submission can reduce latencies in strong-scaling scenarios.

Instant submission is automatically activated when appropriate conditions are met:

  • task uses no buffers (USM-only)
  • task uses no SYCL 2020 reductions
  • task has no running dependencies on non-instant tasks
  • task is bound to a device
  • task is hard-wired to an in-order backend_executor, which happens e.g. when using a SYCL in-order queue.

Note that stdpar code always satisfies the conditions for instant submission.
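For illustration, here is a minimal sketch (not taken from this PR) of a submission pattern that meets these conditions: a USM-only task on a SYCL in-order queue bound to a device.

#include <sycl/sycl.hpp>

int main() {
  // In-order queue -> the task is hard-wired to an in-order backend_executor.
  sycl::queue q{{sycl::property::queue::in_order{}}};

  // USM only: no buffers, no SYCL 2020 reductions, no non-instant dependencies.
  int* data = sycl::malloc_device<int>(1024, q);
  q.parallel_for(sycl::range<1>{1024},
                 [=](sycl::id<1> i) { data[i] = static_cast<int>(i[0]); });
  q.wait();
  sycl::free(data, q);
}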

  • The current implementation is NOT THREAD-SAFE, as in instant mode, multiple user threads can in principle call directly into runtime internals, which were developed under the assumption that only the runtime worker thread would be using them. Before merging, we need to add all the mutexes.
  • coarse-grained events / queue completion events are not yet thread-safe when decaying to fine-grained events.
  • We should also consider whether we want to enable this automatically by default, as it is now, or whether it should - at least initially - be opt-in. In that case, the user would need to set a macro to enable this optimization for eligible tasks. Introducing an entirely new submission model makes it likely that some issues will pop up, so for stability reasons, it might be a good idea not to expose this to all code immediately.

CC @al42and @pszi1ard

@al42and
Contributor

al42and commented Sep 8, 2023

With 3056a88, the following code hangs in hipStreamSynchronize(); log attached below; ROCm 5.3.3.

#include <sycl/sycl.hpp>
#include <numeric> // for std::iota

using T = char;

int main()
{
    for (const auto& dev : sycl::device::get_devices())
    {
        sycl::queue q{ dev, { sycl::property::queue::in_order() } };
        char*       arr_d = sycl::malloc_device<T>(123, q);
        char*       arr_h = sycl::malloc_host<T>(123, q);

        std::iota(arr_h, arr_h + 123, 42);

        q.submit(sycl::property_list{ sycl::property::command_group::hipSYCL_coarse_grained_events() },
                 [&](sycl::handler& cgh) { cgh.memcpy(arr_d, arr_h, 123 * sizeof(T)); });

        q.submit(sycl::property_list{ sycl::property::command_group::hipSYCL_coarse_grained_events() },
                 [&](sycl::handler& cgh) { cgh.memcpy(arr_h, arr_d, 123 * sizeof(T)); });


        q.wait_and_throw();

        sycl::free(arr_d, q);
        sycl::free(arr_h, q);
    }
}

verbose.log

EDIT: fixed the code and the log.

@illuhad
Collaborator Author

illuhad commented Sep 8, 2023

Thanks, I've already seen this issue in CI. It seems to only happen for some backends. CUDA seems to be fine.

@illuhad
Collaborator Author

illuhad commented Sep 8, 2023

Seems like it incorrectly generates memcopies involving the CPU device for some reason while dispatching the work to the GPU, which cannot go well...

@al42and
Contributor

al42and commented Sep 9, 2023

The CPU copies are only there because the test is first executed on the CPU device. If I only run on a GPU, no CPU->CPU copies are executed.

@illuhad
Collaborator Author

illuhad commented Sep 11, 2023

@al42and This is not what I meant; the issue I observed was that in certain cases a CPU->CPU copy was dispatched to the GPU queue. However, I have since fixed this issue. It was limited to OpenCL and caused by a bug in the OpenCL backend's USM pointer query implementation, causing it to not correctly recognize the device attached to the device pointer (and thus assume the pointer is a non-USM host pointer).
The hang is independent of this, and unfortunately I don't know what causes it yet.

@illuhad
Collaborator Author

illuhad commented Sep 11, 2023

@al42and I think I found it; I was being stupid and did not correctly configure the nodes for submission. This caused the inorder_executor to assume that every task is from some unknown backend, and thus needs special synchronization...
Please try again :)

@illuhad
Collaborator Author

illuhad commented Sep 11, 2023

With this PR, for Lulesh with C++ standard parallelism offloading to NVIDIA in the CPU-bound regime, runtime goes down from 86s to 70s. It's now only 13s slower than nvc++ in this case, and most of these 13s can likely be attributed to SSCP JIT and runtime startup.

@illuhad
Collaborator Author

illuhad commented Sep 12, 2023

Whether instant submission is allowed is now controlled via the HIPSYCL_ALLOW_INSTANT_SUBMISSION macro.

In SYCL mode, instant submission is now opt-in, requiring that #define HIPSYCL_ALLOW_INSTANT_SUBMISSION 1 is present prior to including sycl.hpp. In stdpar mode, it is active by default.
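For example, an opted-in SYCL translation unit would start roughly like this (sketch; the macro must come before the include):

#define HIPSYCL_ALLOW_INSTANT_SUBMISSION 1 // opt-in, must precede sycl.hpp
#include <sycl/sycl.hpp>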

Once things are more mature and tested, we should consider also enabling it by default in SYCL.

@al42and
Contributor

al42and commented Sep 13, 2023

The preliminary results look good. All tests pass, and, for a small system (24k atoms), in GPU-resident mode we get a noticeable speed-up:

On my workstation, RTX3060: 383 (374) -> 397 ns/day.
On Dardel, MI250X: 542 (472) -> 585 ns/day.

We usually set HIPSYCL_RT_MAX_CACHED_NODES=0, but the numbers in () show the results without it, using the default caching behavior. So, in the "default" mode without any environment variables, the difference is much more dramatic. The CPU usage is also reduced.

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

@al42and Excellent, thanks for the feedback! How do we compare to HIP?

@al42and
Contributor

al42and commented Sep 14, 2023

Looking at a more pathological case (6k atoms), the picture gets weirder.

For example, we have three kernels launched almost back-to-back (they are in different modules, so there is some host activity in-between, but not too much).

Here's how native HIP looks (the first hipLaunchKernel corresponds to the yellow kernel in the lower right corner); it takes ~13µs to launch all kernels:

[screenshot: native HIP trace; ~13µs to launch all three kernels]

If we look at the old code (the version of hipSYCL without this patch, with HIPSYCL_RT_MAX_CACHED_NODES=0 set), it takes slightly longer, ~21µs:

[screenshot: trace without this patch, HIPSYCL_RT_MAX_CACHED_NODES=0; ~21µs]

Now, if we look at the instant submission case, things get much worse, 38µs:

[screenshot: trace with instant submission; ~38µs]

Not sure what's causing the slowdown, but it also lowers the overall application performance: 1073.321 ns/day with HIP, 781.733 with instant submission, 833.722 with the old submission logic and no caching in this case (running 5 GROMACS threads bound to an 8c/16t CCX, no reserved cores, HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0).

When the CPU is oversubscribed, though, this patch significantly improves performance (although that's an inadvisable way to run things to begin with).

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

Can you check with the following program? With it, instant submission is always substantially faster than the old mechanism for me (30-50%), with or without node caching:

// Change to enable / disable instant submission
#define HIPSYCL_ALLOW_INSTANT_SUBMISSION 0
#include <sycl/sycl.hpp>
#include <iostream>
#include <chrono>
#include <cstdint> // uint64_t
#include <cstddef> // std::size_t

class profiler_clock {
public:
  using rep = uint64_t;
  using period = std::nano;
  using duration = std::chrono::duration<rep, period>;
  using time_point = std::chrono::time_point<profiler_clock>;

  constexpr static bool is_steady = true;

  static time_point now() {
    return time_point{
        duration{std::chrono::steady_clock::now().time_since_epoch()}};
  }

  static std::size_t ns_ticks(const time_point& tp) {
    return tp.time_since_epoch().count();
  }

  static double seconds(const time_point& tp) {
    auto ticks = ns_ticks(tp);
    return static_cast<double>(ticks) /
           1.e9;
  }
};


int main() {
  sycl::queue q{
    {sycl::property::queue::in_order{},
     sycl::property::queue::hipSYCL_coarse_grained_events{}}
  };
  int* data = sycl::malloc_device<int>(1, q);
  // Run once to trigger any potential JIT
  q.single_task([=](){*data += 1;});
  
  double start = profiler_clock::seconds(profiler_clock::now());
  for(int i = 0; i < 100000; ++i)
    q.single_task([=](){*data += 1;});
  q.wait();
  double stop = profiler_clock::seconds(profiler_clock::now());
  
  std::cout << stop-start << " seconds" << std::endl;
  
  sycl::free(data, q);
}

The only case where I can see the old mechanism being faster is if you can benefit from it submitting asynchronously (either because it JITs, which is expensive, or because you can overlap with other work on the host).

@al42and
Contributor

al42and commented Sep 14, 2023

Normally, the "new" (instant) behavior is indeed faster:

$ ROCR_VISIBLE_DEVICES=4 ./1128_instant_on
0.32079 seconds
$ ROCR_VISIBLE_DEVICES=4 ./1128_instant_off
0.395244 seconds

But if we bind the process to the cores of the CCX corresponding to the GPU (EDIT: or any single CCX), the "old" case gets much faster:

$ ROCR_VISIBLE_DEVICES=4 hwloc-bind --cpubind core:1-7 ./1128_instant_on
0.318388 seconds
$ ROCR_VISIBLE_DEVICES=4 hwloc-bind --cpubind core:1-7 ./1128_instant_off
0.27572 seconds

The "old" case also sometimes crashes when run with HIPSYCL_RT_MAX_CACHED_NODES=0, but that's besides the point:

$ HIPSYCL_RT_MAX_CACHED_NODES=0 ROCR_VISIBLE_DEVICES=4 ./1128_instant_off
double free or corruption (fasttop)
free(): invalid pointer

Aborted
andreyal@nid002910:~$ HIPSYCL_RT_MAX_CACHED_NODES=0 ROCR_VISIBLE_DEVICES=4 gdb --args ./1128_instant_off
Reading symbols from ./1128_instant_off...
(gdb) r
(gdb) bt
#0  0x00000000006d51c0 in ?? ()
#1  0x00007ffff7fdebc9 in hipsycl::rt::dag_direct_scheduler::submit(std::shared_ptr<hipsycl::rt::dag_node>) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#2  0x00007ffff7fe5725 in hipsycl::rt::dag_unbound_scheduler::submit(std::shared_ptr<hipsycl::rt::dag_node>) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#3  0x00007ffff7fe6f65 in std::_Function_handler<void (), hipsycl::rt::dag_manager::flush_async()::$_0>::_M_invoke(std::_Any_data const&) () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#4  0x00007ffff7fecbd6 in hipsycl::rt::worker_thread::work() () from /cfs/klemming/home/a/andreyal/OpenSYCL/build-5.3.3-instant-submission/install/lib/libhipSYCL-rt.so
#5  0x00007ffff62773b4 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x00007ffff54266ea in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff595aa8f in clone () from /lib64/libc.so.6

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

Ok... I cannot reproduce this behavior, even with hwloc-bind. What is the host CPU?

I also cannot reproduce the crash so far. Obviously, if some UB is going on there, it might be difficult to compare performance. Is ROCm LLVM involved here? If so, I'd suggest trying stock LLVM. It wouldn't be the first time ROCm LLVM miscompiles code.

@al42and
Contributor

al42and commented Sep 14, 2023

What is the host CPU?

AMD EPYC 7A53

Is ROCm LLVM involved here?

Yes... Also, it's ROCm 5.3.3, which has some runtime latency issues that the newer versions do not have.

If so, I'd suggest trying stock LLVM. It wouldn't be the first time ROCm LLVM miscompiles code.

I can try Cray LLVM, but last time I checked (~1 year ago), it had trouble compiling hipSYCL, so I wouldn't hold it up as a gold standard.

I can take a stab at building stock LLVM from scratch in the next few days.

@illuhad
Collaborator Author

illuhad commented Sep 14, 2023

AMD EPYC 7A53

Interesting, I can't find information on that one; is that a special LUMI CPU or did you mean 7453?
In any case, I don't have any Milan CPUs here for testing, but I can try to get access to some.

I can try Cray LLVM, but last time I checked (~1 year ago), it had trouble compiling hipSYCL, so I wouldn't hold it up as a gold standard.
I can take a stab at building stock LLVM from scratch in the next few days.

Is there perhaps any container solution available (e.g. singularity or charliecloud)? That would help with reproducibility.

@al42and
Contributor

al42and commented Sep 15, 2023

A quick-and-dirty annotation with rocTX was not particularly enlightening, mostly due to 1µs resolution and, apparently, the fact that the overhead is spread between multiple locations. But perhaps @illuhad can gain some insights?

"Submit GPU Work" markers are right around the queue.submit(...) call, the others correspond to hipSYCL functions:

[screenshot: rocTX-annotated timeline around the queue.submit(...) calls]

@illuhad
Collaborator Author

illuhad commented Sep 15, 2023

Interesting...

The gap between the end of hip_queue::submit_kernel and the end of inorder_executor::submit_directly is likely the coarse-grained event construction as really nothing else happens after the kernel submission. Due to SYCL's automatic memory management semantics this has to be a shared_ptr, so my guess is that the memory allocation on the heap is the culprit.

There's also memory allocation before the launch (to construct the DAG node). That might play a role in what is going on prior to submission inside create_task().

The queue will also copy the queue configuration into the handler (which is an std::vector, which then gets processed and copied into the DAG node). I've wanted to experiment with small vector optimizations here for a long time...

Apart from that there's really not much that is happening.

For most cases this might not play a large role, but if the comparison is direct calls to hipLaunchKernel with trivial kernels, you might see all that.

@al42and
Contributor

al42and commented Sep 16, 2023

The gap between the end of hip_queue::submit_kernel and the end of inorder_executor::submit_directly is likely the coarse-grained event construction as really nothing else happens after the kernel submission.

One "tick" along the X axis is 1µs = profiler resolution, so we should be careful when interpreting such single events. But there's clearly a lot happening before the HIP API is called. Here's the full trace:
8390_instant_on.json.gz

For most cases this might not play a large role, but if the comparison is direct calls to hipLaunchKernel with trivial kernels, you might see all that.

It will also be an issue if we have CPU work we want to overlap with the GPU. When we have a lot of cores, having a separate launcher thread would be a good solution, perhaps even better than native HIP. But for cases like LUMI, where we have 8c/8t per GPU, with one core reserved by the scheduler (presumably for the HSA thread), the CPU compute tasks only have 7 cores, and yielding one of them to the worker thread would not be great; similarly, delaying the compute tasks with an inefficient same-thread submit is bad, and delaying the communication can be even worse.

So, I agree that most applications might not notice a difference. But it's not limited to 1µs-long kernels either.

@illuhad
Collaborator Author

illuhad commented Sep 16, 2023

It will also be an issue if we have CPU work we want to overlap with the GPU. When we have a lot of cores, having a separate launcher thread would be a good solution, perhaps even better than native HIP. But for cases like LUMI, where we have 8c/8t per GPU, with one core reserved by the scheduler (presumably for the HSA thread), the CPU compute tasks only have 7 cores, and yielding one of them to the worker thread would not be great; similarly, delaying the compute tasks with an inefficient same-thread submit is bad, and delaying the communication can be even worse.

Agreed. It sounds like we may need two different modes:

  • Synchronous instant submission, which requires optimizing the submission path (likely including moving to small vector classes instead of std::vector and similar optimizations)
  • Asynchronous instant submission where a worker thread is involved (Or at least we should investigate the benefit of this - maybe not much is to be gained once the submission path is sufficiently optimized)

Depending on the use case, one or the other might be chosen by the client application and/or the runtime.
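A rough sketch of what such a mode choice could look like (all names hypothetical, not actual runtime API; std::async stands in for a dedicated worker thread):

#include <functional>
#include <future>

enum class instant_mode { synchronous, asynchronous };

std::future<void> dispatch(instant_mode mode, std::function<void()> submit) {
  if (mode == instant_mode::synchronous) {
    submit();                 // lowest launch latency, runs on the calling thread
    std::promise<void> done;
    done.set_value();
    return done.get_future();
  }
  // Asynchronous: a worker performs the submission so the caller can overlap
  // host work; std::async stands in for the runtime's worker thread here.
  return std::async(std::launch::async, std::move(submit));
}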

@al42and
Contributor

al42and commented Sep 17, 2023

The "old" case also sometimes crashes when run with HIPSYCL_RT_MAX_CACHED_NODES=0, but that's besides the point:

That happens even without this PR, so it's not a blocker here. I'll submit another issue once I get a better idea of what happens there. Valgrind points at a long stack of recursive void hipsycl::rt::(anonymous namespace)::descend_requirement_tree calls, and the randomness hints at a data race.

@illuhad
Collaborator Author

illuhad commented Sep 18, 2023

That happens even without this PR, so it's not a blocker here. I'll submit another issue once I get a better idea of what happens there. Valgrind points at a long stack of recursive void hipsycl::rt::(anonymous namespace)::descend_requirement_tree calls, and the randomness hints at a data race.

Thanks for the confirmation, I had expected this. This PR does not change the non-instant submission code paths at all, so I would have been very surprised if this had changed things.
The randomness could also be due to the DAG descent ignoring nodes that have been marked as completed. So it could also depend on whether an operation has already finished and the GC has noticed this.

@pszi1ard

pszi1ard commented Sep 18, 2023

* Synchronous instant submission, which requires optimizing the submission path (likely including moving to small vector classes instead of `std::vector` and similar optimizations)

* Asynchronous instant submission where a worker thread is involved (Or at least we should investigate the benefit of this - maybe not much is to be gained once the submission path is sufficiently optimized)

Depending on the use case, one or the other might be chosen by the client application and/or the runtime.

The interesting question is what the CPU/GPU latency tradeoff is with both schemes fully optimized, e.g. by how much GPU task launches are delayed by async instant submission, whether/when this is outweighed by the benefit of overlapping some GPU API overhead, and whether this has an impact on the critical path of an application. I think making an automated choice will be very hard without knowing what the application does, but there may be cases that are easy to identify (e.g. pathological cases of high GPU API overhead).

@illuhad
Collaborator Author

illuhad commented Sep 19, 2023

I agree. I guess for simple cases we can have some autotuning code path where we make some measurements of the latency to offload to another thread versus just submitting synchronously, but there should also be some mechanism to let expert users enforce a particular behavior.
As I said, in any case I think we need to first examine how much performance we can extract from the synchronous submission code path to check whether it's even worth the effort to offload to another thread.

@al42and
Contributor

al42and commented Sep 26, 2023

I haven't analyzed it in depth, but what if, instead of analyzing all the dependencies of a task, we track the possibility of instant submission via a flag on the queue?

A task can be instantly submitted iff:

  • task uses no buffers (USM-only)
  • task uses no SYCL 2020 reductions
  • task has no running dependencies on non-instant tasks
  • task is bound to a device
  • task is hard-wired to an in-order backend_executor, which happens e.g. when using a SYCL in-order queue.
  • task has no explicit dependencies (handler::depends_on directly or via shortcuts)
  • task is submitted to an "instant" queue.

Conditions on a queue being "instant":

  • An out-of-order queue is always "non-instant";
  • An in-order queue starts "instant";
  • An in-order queue becomes "non-instant" if a non-instant task is submitted into it (i.e., one with buffers, reductions, or dependencies)
  • An in-order queue becomes "instant" again after queue::wait() is called (there might be some edge-cases here, so perhaps it can never revert back to being instant).

This way, we spare ourselves the need to walk through all the dependencies before submission, hopefully cutting away some overhead.
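A rough sketch of the queue-side tracking described above (all names hypothetical, not hipSYCL runtime API):

// Tracks whether submissions to this queue may take the instant path.
struct queue_instant_state {
  bool in_order = false;
  bool instant = false;

  explicit queue_instant_state(bool is_in_order)
      : in_order{is_in_order}, instant{is_in_order} {} // out-of-order: never instant

  void on_submit(bool task_is_instant_eligible) {
    if (!task_is_instant_eligible)
      instant = false; // buffers, reductions or explicit deps demote the queue
  }

  void on_wait() {
    if (in_order)
      instant = true; // optional: re-promote after queue::wait()
  }
};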

@illuhad
Collaborator Author

illuhad commented Sep 26, 2023

I suspect that would not help much. For an in-order queue, the number of dependencies is typically 0 or 1, so there won't be much difference compared to what you suggest. I really don't think that walking the dependencies itself is what costs here.

@al42and
Contributor

al42and commented Sep 26, 2023

A quick-and-dirty annotation with rocTX was not particularly enlightening, mostly due to 1µs resolution and, apparently, the fact that the overhead is spread between multiple locations.

Perfetto is to blame; the JSONs generated by roctrace have proper nanosecond resolution (note: profiler overhead is unknown): 8390_instant_on.json.gz

Below is not a proper analysis, but an average of three randomly-selected submission events. "Delay at start" is the delay between the start of this function and the start of the function below, in nanoseconds. "Delay at end" is the delay between the end of the function below and the start of this one. For "submit_kernel", the "function below" is the trio of HIP API calls. Times are measured by manually placed markers (see code linked above). A lot of variability in total submission times comes from hipLaunchKernel (3-5 µs), but the overheads are more consistent.

Region                       Delay at start (ns)   Delay at end (ns)
Submit GPU work              150                   1112
queue::submit                778                   201
queue::execute_submission    1216                  137
handler::create_task         1757                  93
submit_directly              547                   969
submit_kernel                1149                  284

So the conclusion still holds: the overhead is annoyingly spread out across multiple places :(

@illuhad
Collaborator Author

illuhad commented Sep 26, 2023

Some things I'd like to try:

  • Use bitmasks/bitmap in execution_hints for very efficiently querying whether a specific hint is there (very easy)
  • Use a small vector implementation instead of std::vector for execution hints and requirement_list and dag node dependencies (easy, but a bit more involved)
  • Spawn dag nodes and operations from object pools. (This is probably a larger change, so might be a project for later)
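For the first bullet, a minimal sketch of what a bitmask-based hint query could look like (hint names hypothetical):

#include <cstdint>

enum class hint_bit : uint32_t {
  bind_to_device      = 1u << 0,
  prefer_executor     = 1u << 1,
  coarse_grained_sync = 1u << 2,
};

struct execution_hint_mask {
  uint32_t bits = 0;
  void set(hint_bit b) { bits |= static_cast<uint32_t>(b); }
  bool has(hint_bit b) const { return (bits & static_cast<uint32_t>(b)) != 0u; }
};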

I'm officially on vacation until Oct 4, so my ability to do this at the moment is limited.

@al42and
Contributor

al42and commented Oct 4, 2023

Spawn dag nodes and operations from object pools. (This is probably a larger change, so might be a project for later)

Perhaps using PMR with (un)synchronized_pool_resource would be easier? It should still cut down the overhead of memory allocations significantly without needing to change the code much beyond passing the allocator around.
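A minimal sketch of that idea, assuming a single-threaded (synchronous instant) submission path so the unsynchronized pool is safe:

#include <memory_resource>
#include <vector>

int main() {
  // Allocations are carved out of the pool instead of hitting the global heap.
  std::pmr::unsynchronized_pool_resource pool;
  std::pmr::vector<int> requirements{&pool}; // stand-in for per-node storage
  for (int i = 0; i < 4; ++i)
    requirements.push_back(i);
}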

@illuhad
Collaborator Author

illuhad commented Oct 9, 2023

Now depends on #1178. On my system with gfx906, the instant submission test program above is now ~10% faster (it was, however, always substantially faster than non-instant submission for me, so YMMV).

One thing I noticed is that if Gromacs sets the hipSYCL_coarse_grained_events property per submission, I would recommend constructing the property list once and then reusing it, as constructing a property_list also requires dynamic memory allocation. This is hard to avoid, because the interface that SYCL has for property_list requires some form of type erasure. Once you have constructed it, though, it can be passed to other submissions with little cost.
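A sketch of the reuse pattern (constructing the list once; the property name and q.submit(property_list, cgf) overload are taken from the example code earlier in this thread):

#include <sycl/sycl.hpp>

// Constructed once, reused for every submission.
static const sycl::property_list submission_props{
    sycl::property::command_group::hipSYCL_coarse_grained_events{}};

int main() {
  sycl::queue q{{sycl::property::queue::in_order{}}};
  int* data = sycl::malloc_device<int>(1, q);
  q.submit(submission_props,
           [&](sycl::handler& cgh) { cgh.single_task([=]() { *data = 1; }); });
  q.wait();
  sycl::free(data, q);
}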

@al42and
Contributor

al42and commented Oct 9, 2023

One thing I noticed is that if Gromacs sets the hipSYCL_coarse_grained_events property per submission, I would recommend constructing the property list once and then reusing it, as constructing a property_list also requires dynamic memory allocation. This is hard to avoid, because the interface that SYCL has for property_list requires some form of type erasure. Once you have constructed it, though, it can be passed to other submissions with little cost.

Thanks for the suggestion, we'll look into that.

BTW, do you know of any reason there is no constexpr constructor for property_list in the standard? Some properties, of course, might require runtime info, but currently the standard forbids constexpr construction of even an empty property list. Here, we can live with static const, so I'm just curious.

@illuhad
Collaborator Author

illuhad commented Oct 9, 2023

BTW, do you know of any reason there is no constexpr constructor for property_list in the standard? Some properties, of course, might require runtime info, but currently the standard forbids constexpr construction of even an empty property list. Here, we can live with static const, so I'm just curious.

property_list was already there in SYCL 1.2.1, so I suspect it was originally designed before constexpr was a priority.
Performance-wise, an empty property list should not perform any operations either (or at least not more than default-constructing std::vector), but of course there are some additional semantic implications of constexpr.

Intel is working on adding support for compile-time properties to property_list, I assume constexpr support would also come as part of this.

illuhad marked this pull request as ready for review October 13, 2023 20:39
@illuhad
Collaborator Author

illuhad commented Oct 18, 2023

Given that this is opt-in for now, I think we should merge, since for some use cases it will provide a performance-boosting option.

illuhad merged commit c744650 into develop Oct 18, 2023
38 of 40 checks passed
illuhad deleted the feature/instant-submission branch October 18, 2023 22:31