Fixes hybrid cache consistency issues #433

Merged
niemyjski merged 1 commit into main from bugfix/hybrid-cache-client-invalidation on Jan 8, 2026

@niemyjski (Member)

Ensures data consistency between local and distributed caches.

The local cache is now only updated when the distributed cache operations succeed. If a distributed cache operation fails, the corresponding key is removed from the local cache to force a re-fetch, preventing stale data.

Specifically addresses scenarios where Set, SetAll, Replace, ReplaceIfEqual, Increment, ListAdd, ListRemove, SetIfHigher, SetIfLower may result in inconsistent state.
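
The write ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ISketchCache`, `DictionaryCache`, and `TwoTierCache` are hypothetical names invented here, values are plain strings, and the message bus publish that the real client performs is left out.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical minimal cache abstraction for this sketch; not Foundatio's ICacheClient.
public interface ISketchCache
{
    Task<bool> SetAsync(string key, string value);
    Task RemoveAsync(string key);
    bool Contains(string key);
}

// Dictionary-backed cache whose writes can be forced to fail, standing in for an L2 outage.
public sealed class DictionaryCache : ISketchCache
{
    private readonly Dictionary<string, string> _items = new();

    public bool FailWrites { get; set; }

    public Task<bool> SetAsync(string key, string value)
    {
        if (FailWrites)
            return Task.FromResult(false);

        _items[key] = value;
        return Task.FromResult(true);
    }

    public Task RemoveAsync(string key)
    {
        _items.Remove(key);
        return Task.CompletedTask;
    }

    public bool Contains(string key) => _items.ContainsKey(key);
}

public sealed class TwoTierCache
{
    private readonly ISketchCache _local;        // L1
    private readonly ISketchCache _distributed;  // L2

    public TwoTierCache(ISketchCache local, ISketchCache distributed)
    {
        _local = local;
        _distributed = distributed;
    }

    public async Task<bool> SetAsync(string key, string value)
    {
        // Write to L2 first so that L1 never holds a value L2 rejected.
        bool updated = await _distributed.SetAsync(key, value);

        if (updated)
            await _local.SetAsync(key, value);
        else
            await _local.RemoveAsync(key); // force the next read to go back to L2

        return updated;
    }
}
```

With `FailWrites` set on the L2 instance, a `SetAsync` call returns false and leaves the key absent from L1, so a later read has to consult L2 rather than serve a value that was never stored there.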

Also refactors the local cache initialization and removes the local cache expired event handler, relying solely on the message bus for invalidation.

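Remove L1 Expiration Notification

Current code (HybridCacheClient.cs, lines 61-68):

// Handler for the local (L1) cache's item-expired event; this PR removes it.
private Task OnLocalCacheItemExpiredAsync(object sender, ItemExpiredEventArgs args) {
    // Ignore expirations that were not flagged for notification.
    if (!args.SendNotification)
        return Task.CompletedTask;
    _logger.LogTrace("Local cache expired event: key={Key}", args.Key);
    // Broadcast the expired key over the message bus to all instances.
    return _messageBus.PublishAsync(new InvalidateCache { CacheId = _cacheId, Keys = [args.Key], Expired = true });
}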

Why Remove It

  1. L1 expiration is a local concern - When an L1 entry expires, L2 has either already expired it (same TTL) or still holds the value (TTL skew). Either way, notifying other instances is unnecessary.
  2. Common practice - Microsoft HybridCache (.NET 9+) and EasyCaching do NOT publish L1 expiration events. Where they publish invalidations at all, they do so on write operations.
  3. Unnecessary traffic - Every L1 expiration generates a message to ALL instances, even though their own copies will expire at about the same time anyway.
  4. Potential harm - If L1 expires before L2 (clock drift), the notification forces other instances to re-fetch even though L2 still has valid data.

Analysis: Why Tests Didn't Catch These Issues

Based on my review of the test coverage and the changes made, here's the analysis:

The Changes Made (Summary)

  1. Removed L1 expiration notification - OnLocalCacheItemExpiredAsync was publishing invalidation messages when local cache items expired
  2. Changed write order - Now writes to L2 (distributed) first, then L1 (local) only on success
  3. Added failure handling - When L2 operations fail, L1 is now cleared to prevent stale data
  4. IncrementAsync TTL handling - When expiresIn is null, the key is now removed from L1 instead of being stored there with no expiration
  5. SetIfHigher/SetIfLower - The key is now always removed from L1, since the value actually stored in L2 is not known
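
Items 4 and 5 can be sketched as below, under stated assumptions. `ICounterStore`, `InMemoryCounterStore`, and `TwoTierCounter` are hypothetical names made up for this illustration, the stand-in store ignores TTLs entirely, and the assumption that SetIfHigher reports a difference rather than the stored value is part of the sketch rather than something this PR states.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the distributed cache (L2); not a Foundatio type.
public interface ICounterStore
{
    // Returns the counter's new value.
    Task<long> IncrementAsync(string key, long amount, TimeSpan? expiresIn);

    // Assumed for this sketch: returns the difference applied, 0 if nothing changed.
    Task<long> SetIfHigherAsync(string key, long value);
}

// In-memory stand-in that ignores TTLs; missing keys count as 0.
public sealed class InMemoryCounterStore : ICounterStore
{
    private readonly Dictionary<string, long> _values = new();

    public Task<long> IncrementAsync(string key, long amount, TimeSpan? expiresIn)
    {
        _values.TryGetValue(key, out long current);
        return Task.FromResult(_values[key] = current + amount);
    }

    public Task<long> SetIfHigherAsync(string key, long value)
    {
        _values.TryGetValue(key, out long current);
        if (value <= current)
            return Task.FromResult(0L);

        _values[key] = value;
        return Task.FromResult(value - current);
    }
}

public sealed class TwoTierCounter
{
    // L1 entries always carry an expiry in this sketch.
    private readonly Dictionary<string, (long Value, DateTime ExpiresAtUtc)> _local = new();
    private readonly ICounterStore _distributed;

    public TwoTierCounter(ICounterStore distributed) => _distributed = distributed;

    public bool IsCachedLocally(string key) => _local.ContainsKey(key);

    public async Task<long> IncrementAsync(string key, long amount, TimeSpan? expiresIn = null)
    {
        long newValue = await _distributed.IncrementAsync(key, amount, expiresIn);

        if (expiresIn.HasValue)
            _local[key] = (newValue, DateTime.UtcNow + expiresIn.Value);
        else
            _local.Remove(key); // L2 may hold a TTL we cannot see; do not pin the value in L1

        return newValue;
    }

    public async Task<long> SetIfHigherAsync(string key, long value)
    {
        long difference = await _distributed.SetIfHigherAsync(key, value);
        _local.Remove(key); // the stored value is not known for certain, so read it from L2 next time
        return difference;
    }
}
```

The trade-off is an extra L2 read after these operations in exchange for never holding an L1 entry that outlives its L2 counterpart.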

Why Tests Didn't Catch These

1. These weren't "bugs" in the traditional sense - they were design decisions that could cause subtle consistency issues

The old code "worked" - it just had potential for stale data in edge cases:

  • L1 could have data that L2 didn't (if L2 write failed)
  • L1 could have permanent entries while L2 had TTL (IncrementAsync with null expiration)
  • L1 expiration notifications were unnecessary traffic, not incorrect behavior

2. The test infrastructure uses InMemoryCacheClient for both L1 and L2

Looking at HybridCacheClientTestBase.cs:

_distributedCache = new InMemoryCacheClient(o => o.CloneValues(true)...);
_messageBus = new InMemoryMessageBus(o => o.LoggerFactory(Log));

With in-memory implementations:

  • L2 never fails - SetAsync, ReplaceAsync, etc. always succeed
  • No network latency - No timing issues between L1 and L2
  • No partial failures - SetAllAsync always sets all items

3. The skipped test reveals the limitation

[Fact(Skip = "Skip because cache invalidation loops on this with 2 in memory cache client instances")]
public override Task AddAsync_WithExpiration_ExpiresRemoteItems()

This test was skipped because using two in-memory caches sharing the same message bus creates invalidation loops - the exact scenario where the L1 expiration notification was problematic.

4. No tests for failure scenarios

The test suite doesn't have:

  • Tests where L2 SetAsync returns false
  • Tests where L2 SetAllAsync partially succeeds
  • Tests where L2 ReplaceAsync fails because key doesn't exist
  • Tests for IncrementAsync with null expiration on a key that has existing TTL
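
One way to cover the partial SetAllAsync case is a test double whose L2 rejects selected keys. The sketch below is an assumption about how such a test could look, not part of the existing suite: `FlakyDistributedCache` and `HybridSketch` are hypothetical types, and the hybrid logic is reduced to the one rule under test.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical test double: an L2 that silently rejects selected keys.
public sealed class FlakyDistributedCache
{
    private readonly Dictionary<string, int> _items = new();

    public HashSet<string> RejectedKeys { get; } = new();

    // Returns how many entries were actually stored, which may be fewer than requested.
    public Task<int> SetAllAsync(IDictionary<string, int> values)
    {
        int stored = 0;
        foreach (var pair in values)
        {
            if (RejectedKeys.Contains(pair.Key))
                continue;

            _items[pair.Key] = pair.Value;
            stored++;
        }

        return Task.FromResult(stored);
    }
}

// Reduced hybrid cache: only the SetAll rule under test is modeled.
public sealed class HybridSketch
{
    private readonly Dictionary<string, int> _local = new();
    private readonly FlakyDistributedCache _distributed;

    public HybridSketch(FlakyDistributedCache distributed) => _distributed = distributed;

    public void SeedLocal(string key, int value) => _local[key] = value;

    public bool HasLocal(string key) => _local.ContainsKey(key);

    public async Task<int> SetAllAsync(IDictionary<string, int> values)
    {
        int stored = await _distributed.SetAllAsync(values);

        if (stored == values.Count)
        {
            foreach (var pair in values)
                _local[pair.Key] = pair.Value;
        }
        else
        {
            // The count does not say which keys failed, so drop all of them from L1.
            foreach (var key in values.Keys)
                _local.Remove(key);
        }

        return stored;
    }
}
```

A test would add a key to `RejectedKeys`, call `SetAllAsync` with that key and one other, and assert that neither key remains in L1 afterwards.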

5. The consistency issues are timing-dependent

The stale data scenarios require:

  • L2 to fail while L1 succeeds (can't happen with in-memory)
  • L2 TTL to expire before L1 (requires real time delays and different TTL handling)
  • Multiple instances with clock drift (can't simulate with shared in-memory)

Bottom Line

These issues weren't caught because:

  1. In-memory implementations don't fail - The test infrastructure can't simulate L2 failures
  2. The issues are consistency/design problems, not functional bugs - The code "worked" but could serve stale data
  3. Testing distributed cache consistency is hard - Requires either:
    • Mock/fake implementations that can simulate failures
    • Real distributed infrastructure (Redis) with failure injection
    • Complex multi-instance timing tests

The changes we made are defensive improvements to ensure L1 never has data that L2 doesn't have, and to eliminate unnecessary message bus traffic. They're not fixing test failures - they're preventing potential production issues that the test suite architecture couldn't detect.

Copilot AI (Contributor) left a comment

Pull request overview

This pull request addresses critical cache consistency issues in the HybridCacheClient by ensuring the local cache (L1) only contains data that successfully exists in the distributed cache (L2). The changes implement a "write-to-L2-first, then-L1-on-success" pattern to prevent stale data.

Key Changes:

  • Removed L1 expiration event handler that unnecessarily notified other instances when local cache entries expired
  • Modified all write operations (Set, Replace, Increment, ListAdd/Remove, etc.) to update L2 first, then L1 only on success
  • Added defensive local cache removal when L2 operations fail or partially succeed

niemyjski merged commit 7169944 into main on Jan 8, 2026
10 checks passed
niemyjski deleted the bugfix/hybrid-cache-client-invalidation branch on January 8, 2026 at 21:55