Fix removal state#1981
Conversation
534ebcf to
ef2d582
Compare
ef2d582 to
6fb8e35
Compare
6fb8e35 to
54eb32c
Compare
54eb32c to
90f335b
Compare
|
For context, please review this: #1947 |
90f335b to
8fc64c1
Compare
8fc64c1 to
da48ea5
Compare
da48ea5 to
cdacbd8
Compare
|
@chrisstaite Could you show the performance data you collected or the tests you ran so we can repro? |
MarcusSorealheis
left a comment
There was a problem hiding this comment.
Eliminating async operations inside locks (to avoid blocking under contention), and deferring callback executions outside locks (to prevent deadlocks) should be performance improvements. The only thing is that we need test. I trust, but we need to verify.
| callback: Arc<dyn RemoveItemCallback>, | ||
| ) -> Result<(), Error> { | ||
| self.ac_store.register_remove_callback(callback)?; | ||
| self.ac_store.register_remove_callback(callback.clone())?; |
There was a problem hiding this comment.
I'm not convinced that this is a performance improvement. It may be, but we need evidence. cc @palfrey
There was a problem hiding this comment.
Box types are not ideal
The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch. This trashes the performance of the EvictionMap. Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention.
cdacbd8 to
c1f385e
Compare
chrisstaite-menlo
left a comment
There was a problem hiding this comment.
In initial testing with a Chromium build I was seeing a decent performance bump of 27 actions/sec to 50. However, once the workers are warm there's not much use of the maps so that didn't really last for the whole build. The most important thing here is to ensure that the state is not held across await points as this can cause deadlocks, especially since it's calling arbitrary code outside of the module.
Reviewable status: 0 of 1 LGTMs obtained, and 0 of 25 files reviewed, and pending CI: macos-15, ubuntu-24.04, NativeLink.com Cloud / Remote Cache / macos-15, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, integration-tests, ubuntu-24.04 / stable, windows-2022 / stable, pre-commit-checks, Analyze (javascript-typescript), Analyze (python), Coverage, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Bazel Dev / macos-15, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Installation / macos-15, Installation / ubuntu-24.04, buildstream, mongo, asan / ubuntu-24.04, Local / bazel / ubuntu-24.04, vale, and 1 discussions need to be resolved
nativelink-store/src/ontap_s3_store.rs line 249 at r4 (raw file):
if last_modified.secs() + self.consider_expired_after_s <= now_s { let remove_callbacks = self.remove_callbacks.lock().clone();
No need to hold the lock over the async callback, a simple efficient clone of the Arcs is all that's required. This means there's no possibility of deadlock here.
nativelink-store/src/existence_cache_store.rs line 125 at r4 (raw file):
inner_store, existence_cache: EvictingMap::new(eviction_policy, anchor_time), pause_remove_callbacks: Mutex::new(None),
No need to Arc this as it shouldn't be required to be held across an await point.
nativelink-store/src/existence_cache_store.rs line 269 at r4 (raw file):
let maybe_keys = self.pause_remove_callbacks.lock().take(); if let Some(keys) = maybe_keys { let mut callbacks: FuturesUnordered<_> = keys
Using FuturesUnordered for all these Future loops is a much more efficient way to perform multiple tasks.
nativelink-store/src/completeness_checking_store.rs line 397 at r4 (raw file):
callback: Arc<dyn RemoveItemCallback>, ) -> Result<(), Error> { self.ac_store.register_remove_callback(callback.clone())?;
Previously this did the clone at the bottom level and now it doesn't. The Arc<Box> is a code smell and never necessary.
nativelink-store/src/callback_utils.rs line 26 at r4 (raw file):
#[derive(Debug)] pub struct RemoveItemCallbackHolder { callback: Arc<dyn RemoveItemCallback>,
There's never any need to wrap Arc<Box> the Arc type is already boxed.
nativelink-store/src/callback_utils.rs line 30 at r4 (raw file):
impl RemoveItemCallbackHolder { pub fn new(callback: Arc<dyn RemoveItemCallback>) -> Self {
Pass by move means far fewer atomic actions need to be performed as there is no Drop called. This keeps all of the CPU caches in place.
nativelink-store/src/callback_utils.rs line 35 at r4 (raw file):
} impl<'a, Q> evicting_map::RemoveItemCallback<Q> for RemoveItemCallbackHolder
This is a bit of a monstrosity to keep the nativelink-util separated from nativelink-store but essentially it allows us to move the StoreKey from one callback to the other maintaining the lifetime.
nativelink-store/src/ref_store.rs line 28 at r3 (raw file):
RemoveItemCallback, Store, StoreDriver, StoreKey, StoreLike, UploadSizeInfo, }; use parking_lot::Mutex;
The rest of the system specifically makes use of parking_lot due to its efficiency.
nativelink-store/src/ref_store.rs line 90 at r3 (raw file):
.err_tip(|| "Store manager is gone")?; if let Some(store) = store_manager.get_store(&self.name) { let remove_callbacks = self.remove_callbacks.lock().clone();
Note the clone is the only thing that happens under the lock now which is much more efficient.
nativelink-util/src/evicting_map.rs line 159 at r3 (raw file):
} let callbacks = self
Holding two locks while performing these callbacks was dangerous and inefficient. This now collects and handles later.
nativelink-util/src/evicting_map.rs line 210 at r3 (raw file):
> { #[metric] state: Mutex<State<K, Q, T, C>>,
This has always specifically been not an Arc to ensure that there is no possibility of a deadlock due to re-entrance. Currently that is absolutely possible.
nativelink-util/src/evicting_map.rs line 319 at r3 (raw file):
#[must_use] fn evict_items(&self, state: &mut State<K, Q, T, C>) -> (Vec<T>, Vec<RemoveFuture>) {
This should never have been async while state was mutable. The lock being held over an await point is a code smell and indicates a possible deadlock.
nativelink-util/src/evicting_map.rs line 407 at r3 (raw file):
info!(?key, "Item expired, evicting"); let (data, futures) = state.remove(key.borrow(), &eviction_item, false); // Store data for later unref - we can't drop state here as we're still iterating
This comment was specifically ignored. Now it's handled correctly as it is elsewhere in the file now.
nativelink-util/src/evicting_map.rs line 452 at r4 (raw file):
let (items_to_unref, removal_futures) = { let mut state = self.state.lock(); self.evict_items(&mut *state)
The state should not be held while unrefing to avoid deadlock.
Yes, very much so. |
MarcusSorealheis
left a comment
There was a problem hiding this comment.
I forgot we need to run a few tests with a few different projects
| fn register_remove_callback( | ||
| self: Arc<Self>, | ||
| _callback: &Arc<Box<dyn RemoveItemCallback>>, | ||
| _callback: Arc<dyn RemoveItemCallback>, |
There was a problem hiding this comment.
Something is wrong here
There was a problem hiding this comment.
What do you mean? It's simply updating to the trait change.
There was a problem hiding this comment.
It's worth noting that this should probably return an error in the same way that the gRPC store does because the Redis LRU does not notify on object removal.
| _callback: &Arc<Box<dyn RemoveItemCallback>>, | ||
| _callback: Arc<dyn RemoveItemCallback>, | ||
| ) -> Result<(), Error> { | ||
| // As redis doesn't drop stuff, we can just ignore this |
There was a problem hiding this comment.
@palfrey this isn't true, Redis runs with a global LRU which automatically clears keys, probably should return an error in the same way as the gRPC store. That's a breaking change though, so should be separate.
The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch. This trashes the performance of the EvictionMap. Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention. Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
Description
The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch. This trashes the performance of the EvictionMap.
Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention.
Type of change
Please delete options that aren't relevant.
How Has This Been Tested?
Tests still passing.
Checklist
bazel test //...passes locallygit amendsee some docsThis change is