@JackAKirk
Collaborator

This maps to the UR `cluster_launch` device info; see oneapi-src/unified-runtime#1792.

this maps to the ur cluster_launch device info.
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk changed the title Cuda cluster launch aspect Cuda cluster group aspect Jun 26, 2024
@AD2605
Owner

AD2605 commented Jun 27, 2024

Thanks a lot @JackAKirk !

@AD2605 AD2605 merged commit 0380732 into AD2605:atharva/thread_block_cluster_launch Jun 27, 2024
AD2605 pushed a commit that referenced this pull request Jul 30, 2024
#3) (#93315)

The ThreadLocalCache implementation is used by the MLIRContext (among
other things) to try to manage thread contention in the StorageUniquers.
It relies on a combination of shared pointers and weak pointers to keep
everything alive across threads for as long as needed, but a huge
bottleneck is the `weak_ptr::lock` call inside the `::get` method.

This is because the `lock` method has to hit the atomic refcount several
times, and this is bottlenecking performance across many threads.
However, all this is doing is checking whether the storage is
initialized. Importantly, when the `PerThreadInstance` goes out of
scope, it does not remove all of its associated entries from the
thread-local hash map (it contains dangling `PerThreadInstance *` keys).
The `weak_ptr` also allows the thread local cache to synchronize with
the `PerThreadInstance`'s destruction:

1. if `ThreadLocalCache` destructs, the `weak_ptr`s that reference its
contained values are immediately invalidated
2. if `CacheType` destructs within a thread, any entries still live are
removed from the owning `PerThreadInstance`, and it locks the `weak_ptr`
first to ensure it's kept alive long enough for the removal.
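
The locking behaviour that point (2) relies on can be illustrated with a minimal standalone sketch. It uses only standard `std::shared_ptr`/`std::weak_ptr` calls; the function name `lockTracksOwnerLifetime` and the `int` payload are hypothetical and are not part of the MLIR implementation.

```cpp
#include <cassert>
#include <memory>

// Hypothetical helper: returns true when weak_ptr::lock succeeds while the
// owning shared_ptr is alive and yields an empty pointer after it is reset.
bool lockTracksOwnerLifetime() {
  auto owner = std::make_shared<int>(7);
  std::weak_ptr<int> observer = owner;

  bool aliveWhileOwned = false;
  // lock() pins the object for the duration of this scope. It also touches
  // the atomic reference count, which is the cost described above.
  if (std::shared_ptr<int> pinned = observer.lock())
    aliveWhileOwned = (*pinned == 7);

  owner.reset(); // last strong reference is gone, so the int is destroyed
  bool expiredAfterReset = (observer.lock() == nullptr);

  return aliveWhileOwned && expiredAfterReset;
}
```

Calling `lockTracksOwnerLifetime()` is expected to return `true`: the lock succeeds only while a strong reference still exists.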

This PR changes the TLC entries to contain a `shared_ptr<ValueT*>` and a
`weak_ptr<PerInstanceState>`. It gives the `PerInstanceState` entries a
`weak_ptr<ValueT*>` on top of the `unique_ptr<ValueT>`. This enables
`ThreadLocalCache::get` to check if the value is initialized by
dereferencing the `shared_ptr<ValueT*>` and checking whether the contained
pointer is null. When `PerInstanceState` destructs, the values inside
the TLC are written to nullptr. The TLC uses the
`weak_ptr<PerInstanceState>` to satisfy (2).
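
A rough, single-threaded sketch of the pointer-cell idea described above, under stated assumptions: `Owner` stands in for a `PerInstanceState` entry and `CacheEntry` for a TLC entry. Both are hypothetical simplifications with `int` as the value type, and the synchronization the real code needs across threads is deliberately left out.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a TLC entry: holds a shared cell containing a
// raw pointer to the value, or nullptr when there is no live value.
struct CacheEntry {
  std::shared_ptr<int *> cell = std::make_shared<int *>(nullptr);

  // Fast path: a plain dereference and null check, no weak_ptr::lock.
  bool isInitialized() const { return *cell != nullptr; }
  int *get() const { return *cell; }
};

// Hypothetical stand-in for a PerInstanceState entry: owns the value and
// keeps a weak reference to the cache-side cell.
struct Owner {
  std::unique_ptr<int> value;
  std::weak_ptr<int *> slot;

  Owner(CacheEntry &entry, int v)
      : value(std::make_unique<int>(v)), slot(entry.cell) {
    *entry.cell = value.get(); // publish the value to the cache entry
  }

  ~Owner() {
    // On destruction, write nullptr into the cell if the cache still exists.
    if (std::shared_ptr<int *> cell = slot.lock())
      *cell = nullptr;
  }
};

// Returns true if the cell reflects the owner's lifetime as expected.
bool runSketch() {
  CacheEntry entry;
  if (entry.isInitialized())
    return false;
  {
    Owner owner(entry, 42);
    if (!entry.isInitialized() || *entry.get() != 42)
      return false;
  }
  // The owner is gone, so the cell must read as uninitialized again.
  return !entry.isInitialized();
}
```

The design point this is meant to show is that the `weak_ptr::lock` call moves out of the hot `get` path and into the owner's destructor, which runs far less often.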

(1) is no longer the case. When `ThreadLocalCache` begins destruction,
the `weak_ptr<PerInstanceState>`s are invalidated, but the
`shared_ptr<ValueT*>`s are not. This is OK for two reasons. First,
because the overall object is being destroyed, `::get` cannot be called.
Second, because the `shared_ptr<PerInstanceState>` finishes destruction
before freeing the pointer, the pointer cannot be reallocated to another
`ThreadLocalCache` during destruction. I.e. the values inside the TLC
associated with a `PerInstanceState` cannot be read during destruction.
The most important thing is to make sure destruction of the TLC doesn't
race with the destructor of `PerInstanceState`. Because
`PerInstanceState` carries `weak_ptr` references into the TLC, we
guarantee that there are no use-after-frees.