Skip to content

Fix removal state#1981

Merged
chrisstaite-menlo merged 2 commits intoTraceMachina:mainfrom
chrisstaite:bugfix/Fix-eviction-map
Oct 16, 2025
Merged

Fix removal state#1981
chrisstaite-menlo merged 2 commits intoTraceMachina:mainfrom
chrisstaite:bugfix/Fix-eviction-map

Conversation

@chrisstaite
Copy link
Contributor

@chrisstaite chrisstaite commented Oct 15, 2025

Description

The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch. This trashes the performance of the EvictionMap.

Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention.

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Tests still passing.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@chrisstaite chrisstaite force-pushed the bugfix/Fix-eviction-map branch from 534ebcf to ef2d582 Compare October 15, 2025 14:31
@chrisstaite chrisstaite force-pushed the bugfix/Fix-eviction-map branch from ef2d582 to 6fb8e35 Compare October 15, 2025 14:37
@chrisstaite chrisstaite force-pushed the bugfix/Fix-eviction-map branch from 6fb8e35 to 54eb32c Compare October 15, 2025 14:51
@chrisstaite chrisstaite force-pushed the bugfix/Fix-eviction-map branch from 54eb32c to 90f335b Compare October 15, 2025 14:53
@MarcusSorealheis MarcusSorealheis requested review from MarcusSorealheis and palfrey and removed request for palfrey October 15, 2025 15:09
@MarcusSorealheis
Copy link
Collaborator

For context, please review this: #1947

@MarcusSorealheis
Copy link
Collaborator

@chrisstaite Could you show the performance data you collected or the tests you ran so we can repro?

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eliminating async operations inside locks (to avoid blocking under contention), and deferring callback executions outside locks (to prevent deadlocks) should be performance improvements. The only thing is that we need test. I trust, but we need to verify.

callback: Arc<dyn RemoveItemCallback>,
) -> Result<(), Error> {
self.ac_store.register_remove_callback(callback)?;
self.ac_store.register_remove_callback(callback.clone())?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that this is a performance improvement. It may be, but we need evidence. cc @palfrey

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Box types are not ideal

The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch.  This trashes the performance of the EvictionMap.

Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention.
Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In initial testing with a Chromium build I was seeing a decent performance bump of 27 actions/sec to 50. However, once the workers are warm there's not much use of the maps so that didn't really last for the whole build. The most important thing here is to ensure that the state is not held across await points as this can cause deadlocks, especially since it's calling arbitrary code outside of the module.

Reviewable status: 0 of 1 LGTMs obtained, and 0 of 25 files reviewed, and pending CI: macos-15, ubuntu-24.04, NativeLink.com Cloud / Remote Cache / macos-15, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, integration-tests, ubuntu-24.04 / stable, windows-2022 / stable, pre-commit-checks, Analyze (javascript-typescript), Analyze (python), Coverage, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Bazel Dev / macos-15, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Installation / macos-15, Installation / ubuntu-24.04, buildstream, mongo, asan / ubuntu-24.04, Local / bazel / ubuntu-24.04, vale, and 1 discussions need to be resolved


nativelink-store/src/ontap_s3_store.rs line 249 at r4 (raw file):

                                    if last_modified.secs() + self.consider_expired_after_s <= now_s
                                    {
                                        let remove_callbacks = self.remove_callbacks.lock().clone();

No need to hold the lock over the async callback, a simple efficient clone of the Arcs is all that's required. This means there's no possibility of deadlock here.


nativelink-store/src/existence_cache_store.rs line 125 at r4 (raw file):

            inner_store,
            existence_cache: EvictingMap::new(eviction_policy, anchor_time),
            pause_remove_callbacks: Mutex::new(None),

No need to Arc this as it shouldn't be required to be held across an await point.


nativelink-store/src/existence_cache_store.rs line 269 at r4 (raw file):

            let maybe_keys = self.pause_remove_callbacks.lock().take();
            if let Some(keys) = maybe_keys {
                let mut callbacks: FuturesUnordered<_> = keys

Using FuturesUnordered for all these Future loops is a much more efficient way to perform multiple tasks.


nativelink-store/src/completeness_checking_store.rs line 397 at r4 (raw file):

        callback: Arc<dyn RemoveItemCallback>,
    ) -> Result<(), Error> {
        self.ac_store.register_remove_callback(callback.clone())?;

Previously this did the clone at the bottom level and now it doesn't. The Arc<Box> is a code smell and never necessary.


nativelink-store/src/callback_utils.rs line 26 at r4 (raw file):

#[derive(Debug)]
pub struct RemoveItemCallbackHolder {
    callback: Arc<dyn RemoveItemCallback>,

There's never any need to wrap Arc<Box> the Arc type is already boxed.


nativelink-store/src/callback_utils.rs line 30 at r4 (raw file):

impl RemoveItemCallbackHolder {
    pub fn new(callback: Arc<dyn RemoveItemCallback>) -> Self {

Pass by move means far fewer atomic actions need to be performed as there is no Drop called. This keeps all of the CPU caches in place.


nativelink-store/src/callback_utils.rs line 35 at r4 (raw file):

}

impl<'a, Q> evicting_map::RemoveItemCallback<Q> for RemoveItemCallbackHolder

This is a bit of a monstrosity to keep the nativelink-util separated from nativelink-store but essentially it allows us to move the StoreKey from one callback to the other maintaining the lifetime.


nativelink-store/src/ref_store.rs line 28 at r3 (raw file):

    RemoveItemCallback, Store, StoreDriver, StoreKey, StoreLike, UploadSizeInfo,
};
use parking_lot::Mutex;

The rest of the system specifically makes use of parking_lot due to its efficiency.


nativelink-store/src/ref_store.rs line 90 at r3 (raw file):

            .err_tip(|| "Store manager is gone")?;
        if let Some(store) = store_manager.get_store(&self.name) {
            let remove_callbacks = self.remove_callbacks.lock().clone();

Note the clone is the only thing that happens under the lock now which is much more efficient.


nativelink-util/src/evicting_map.rs line 159 at r3 (raw file):

        }

        let callbacks = self

Holding two locks while performing these callbacks was dangerous and inefficient. This now collects and handles later.


nativelink-util/src/evicting_map.rs line 210 at r3 (raw file):

> {
    #[metric]
    state: Mutex<State<K, Q, T, C>>,

This has always specifically been not an Arc to ensure that there is no possibility of a deadlock due to re-entrance. Currently that is absolutely possible.


nativelink-util/src/evicting_map.rs line 319 at r3 (raw file):

    #[must_use]
    fn evict_items(&self, state: &mut State<K, Q, T, C>) -> (Vec<T>, Vec<RemoveFuture>) {

This should never have been async while state was mutable. The lock being held over an await point is a code smell and indicates a possible deadlock.


nativelink-util/src/evicting_map.rs line 407 at r3 (raw file):

                            info!(?key, "Item expired, evicting");
                            let (data, futures) = state.remove(key.borrow(), &eviction_item, false);
                            // Store data for later unref - we can't drop state here as we're still iterating

This comment was specifically ignored. Now it's handled correctly as it is elsewhere in the file now.


nativelink-util/src/evicting_map.rs line 452 at r4 (raw file):

            let (items_to_unref, removal_futures) = {
                let mut state = self.state.lock();
                self.evict_items(&mut *state)

The state should not be held while unrefing to avoid deadlock.

@MarcusSorealheis
Copy link
Collaborator

Using FuturesUnordered for all these Future loops is a much more efficient way to perform multiple tasks.

Yes, very much so.

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot we need to run a few tests with a few different projects

fn register_remove_callback(
self: Arc<Self>,
_callback: &Arc<Box<dyn RemoveItemCallback>>,
_callback: Arc<dyn RemoveItemCallback>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is wrong here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? It's simply updating to the trait change.

Copy link
Contributor Author

@chrisstaite chrisstaite Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that this should probably return an error in the same way that the gRPC store does because the Redis LRU does not notify on object removal.

_callback: &Arc<Box<dyn RemoveItemCallback>>,
_callback: Arc<dyn RemoveItemCallback>,
) -> Result<(), Error> {
// As redis doesn't drop stuff, we can just ignore this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@palfrey this isn't true, Redis runs with a global LRU which automatically clears keys, probably should return an error in the same way as the gRPC store. That's a breaking change though, so should be separate.

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for me

@chrisstaite-menlo chrisstaite-menlo merged commit d85e491 into TraceMachina:main Oct 16, 2025
27 of 28 checks passed
MarcusSorealheis added a commit to MarcusSorealheis/nativelink that referenced this pull request Nov 3, 2025
The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch.  This trashes the performance of the EvictionMap.

Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention.

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants