fix: add logging and retry for cache population failures #736
chrisburr merged 1 commit into DIRACGrid:main from
Conversation
Force-pushed from 8e3c1c5 to f421620 (compare)
Pull request overview
This PR improves error handling in the TwoLevelCache class by adding comprehensive logging for cache population failures and implementing retry logic. Previously, failed cache population attempts would silently fail and subsequent requests would get stuck waiting for stale failed futures.
Changes:
- Added error logging with full traceback when cache population fails, using `logger.error()` with `exc_info=True`
- Moved future cleanup to a `finally` block to ensure failed futures are always removed, enabling retries
- Changed from `wait()` to `future.result()` to properly propagate exceptions to callers
- Added comprehensive unit tests for error handling scenarios, including retry behavior
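The pattern these changes describe can be sketched in miniature as below. This is a simplified stand-in, not the real diracx code: plain dicts replace the TTL-backed soft/hard caches, and the class name `MiniTwoLevelCache` is hypothetical; only the try/except/finally shape, the logging, and the `future.result()` call reflect the PR description.

```python
import logging
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)


class MiniTwoLevelCache:
    """Minimal sketch of the error-handling pattern described in this PR."""

    def __init__(self):
        self.hard_cache = {}  # stand-in for the real TTL-backed cache
        self.futures = {}
        self.locks = defaultdict(threading.Lock)
        self.pool = ThreadPoolExecutor(max_workers=2)

    def _work(self, key, populate_func):
        try:
            self.hard_cache[key] = populate_func()
        except Exception:
            # Log with full traceback so failures are no longer silent
            logger.error(
                "Failed to populate cache key %r, will retry on next request",
                key,
                exc_info=True,
            )
            raise
        finally:
            # Always drop the future so the next request can retry
            with self.locks[key]:
                self.futures.pop(key, None)

    def get(self, key, populate_func):
        if key in self.hard_cache:
            return self.hard_cache[key]
        with self.locks[key]:
            future = self.futures.get(key)
            if future is None:
                future = self.pool.submit(self._work, key, populate_func)
                self.futures[key] = future
        # result() re-raises any exception from _work, unlike wait()
        future.result()
        return self.hard_cache[key]
```

Because the future is popped in `finally`, a failed population no longer leaves a stale future behind: the next `get()` for that key submits a fresh attempt.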
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| diracx-core/src/diracx/core/utils.py | Refactored _work() method with try-except-finally for robust error handling, added logging import and logger initialization, changed get() to use future.result() for exception propagation, added debug logging for non-blocking cache misses |
| diracx-core/tests/test_utils.py | Added comprehensive test suite for TwoLevelCache covering successful population, failed population with logging verification, retry after failure, failure-then-success scenarios, and non-blocking behavior |
diracx-core/src/diracx/core/utils.py
Outdated
```python
result = None
try:
    result = populate_func()
except Exception:
    logger.error(
        "Failed to populate cache key %r, will retry on next request",
        key,
        exc_info=True,
    )
    raise
finally:
    # Always remove the future so the next request can retry on failure
    # or submit a new refresh task on success
    with self.locks[key]:
        self.futures.pop(key, None)
if result is not None:
```
The check if result is not None cannot properly distinguish between populate_func() succeeding with a None return value versus populate_func() raising an exception. In both cases, result would be None. This breaks the original behavior where None values could be cached. Use a success flag or sentinel value instead. For example: initialize success = False before the try block, set success = True after line 263, and check if success: on line 276 instead of if result is not None:.
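The success-flag fix this comment proposes could look like the sketch below. It is a simplified, self-contained version (a free function with the caches passed in, rather than the real method), so the names and signature are illustrative only; the flag logic is the point.

```python
import logging

logger = logging.getLogger(__name__)


def work(key, populate_func, hard_cache, futures, lock):
    """Sketch of the reviewer's success-flag suggestion (simplified signature)."""
    success = False
    try:
        result = populate_func()
        success = True  # set only if populate_func did not raise
    except Exception:
        logger.error(
            "Failed to populate cache key %r, will retry on next request",
            key,
            exc_info=True,
        )
        raise
    finally:
        with lock:
            futures.pop(key, None)
    # Unlike `result is not None`, this caches falsy results, including None
    if success:
        hard_cache[key] = result
```

The flag distinguishes "returned None" from "raised", so a legitimate `None` result is still cached while a failure leaves the key absent for retry.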
```diff
-wait([future])
+# Use result() instead of wait() to propagate any exceptions
+future.result()
 return self.hard_cache[key]
```
After calling future.result() on line 234, the code assumes the key exists in self.hard_cache and directly accesses it on line 235. However, the new finally block logic at line 276 only caches the result if result is not None. This means if populate_func() returns None (or the check fails), the key won't be in hard_cache, causing a KeyError. This is a behavior change from the original code which cached all return values including None. Either remove the if result is not None condition on line 276, or add error handling on line 235 to handle missing keys.
Suggested change:

```diff
-return self.hard_cache[key]
+try:
+    return self.hard_cache[key]
+except KeyError:
+    # If the background worker did not cache a value (for example,
+    # because it skipped caching when the result was None), fall
+    # back to computing and caching the value here to avoid a
+    # KeyError and to restore the original behavior of caching
+    # all return values, including None.
+    result = populate_func()
+    self.hard_cache[key] = result
+    self.soft_cache[key] = result
+    return result
```
diracx-core/tests/test_utils.py
Outdated
```python
cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)

def slow_populate():
    time.sleep(0.5)
    return "value"

# Start population in background by acquiring lock then releasing
# Use non-blocking mode which should raise NotReadyError
with pytest.raises(NotReadyError, match="not ready yet"):
    # First call starts background population, second call (non-blocking)
    # should raise NotReadyError since cache isn't populated yet
    import threading

    def start_slow():
        cache.get("key", slow_populate, blocking=True)

    thread = threading.Thread(target=start_slow)
    thread.start()
    time.sleep(0.1)  # Give thread time to start
    cache.get("key", slow_populate, blocking=False)
    thread.join()
```
This test has structural issues: the pytest.raises context contains not just the call that should raise NotReadyError, but also thread management code. If thread.start() or time.sleep() somehow raised NotReadyError, the test would incorrectly pass. More critically, thread.join() is inside the context, so if the background thread raises an exception during its execution, that exception will be silently ignored. The test should be restructured to start the thread and verify it completes successfully outside the pytest.raises block. Additionally, there's a race condition: the background thread might complete within the 0.1 second sleep, causing the test to fail intermittently.
Suggested change (replacing the snippet above):

```python
import threading

cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)
started = threading.Event()

def slow_populate():
    # Signal that population has started, then sleep to simulate work
    started.set()
    time.sleep(0.5)
    return "value"

def start_slow():
    cache.get("key", slow_populate, blocking=True)

# Start population in background
thread = threading.Thread(target=start_slow)
thread.start()
# Wait until slow_populate has started to avoid race conditions
assert started.wait(timeout=1.0)
# Use non-blocking mode which should raise NotReadyError
with pytest.raises(NotReadyError, match="not ready yet"):
    # Non-blocking call should raise NotReadyError since cache isn't populated yet
    cache.get("key", slow_populate, blocking=False)
# Ensure background thread completes and surface any exceptions it raised
thread.join()
```
```python
def test_failed_population_then_success(self):
    """Test that after a failure, subsequent successful call works."""
    cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)
    should_fail = True

    def populate():
        if should_fail:
            raise ValueError("Test error")
        return "success_value"

    # First call fails
    with pytest.raises(ValueError):
        cache.get("key", populate, blocking=True)

    # Second call succeeds
    should_fail = False
    result = cache.get("key", populate, blocking=True)
    assert result == "success_value"
```
The tests don't cover the case where populate_func returns None or other falsy values (False, 0, empty string, empty list). Given the bug at line 276 in utils.py where if result is not None prevents caching of None values, a test should be added to document the expected behavior for falsy return values. This would help catch the critical bug where line 235 in utils.py assumes the key exists in hard_cache after future.result() completes.
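A test along these lines would pin down that behavior. The snippet below is a hypothetical sketch: `FakeCache` is a deliberately simplified, synchronous stand-in defined here so the pattern is self-contained; the real test would import `TwoLevelCache` from `diracx.core.utils` and call it the same way.

```python
class FakeCache:
    """Synchronous stand-in for TwoLevelCache (hypothetical), used here
    so the test pattern is runnable without diracx installed."""

    def __init__(self):
        self._store = {}

    def get(self, key, populate_func, blocking=True):
        # Note: membership check, not truthiness, so falsy values stay cached
        if key not in self._store:
            self._store[key] = populate_func()
        return self._store[key]


def check_falsy_value_is_cached(value):
    cache = FakeCache()
    calls = []

    def populate():
        calls.append(1)
        return value

    # Falsy results should be cached and returned like any other value
    assert cache.get("key", populate, blocking=True) == value
    # Second call must hit the cache, not re-run populate
    assert cache.get("key", populate, blocking=True) == value
    assert len(calls) == 1


for value in (None, False, 0, "", []):
    check_falsy_value_is_cached(value)
```

Run against the real `TwoLevelCache`, this would fail wherever an `if result is not None` guard (or a truthiness check) skips caching, documenting the expected behavior for all falsy returns.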
```python
# Second call should use cached value
result = cache.get("key", populate)
assert result == "test_value"
assert call_count == 1
```
This assertion is always true because of this condition, so the test cannot fail.
Previously when `TwoLevelCache.populate_func()` failed:
- Exceptions were silently swallowed by ThreadPoolExecutor
- Failed futures were never removed from the dict
- Subsequent requests would see the stale failed future and never retry

Changes:
- Add error logging with full traceback when population fails
- Move future cleanup to finally block so it always runs
- Use future.result() instead of wait() to propagate exceptions
- Add unit tests for TwoLevelCache error handling
Force-pushed from f421620 to 7fe225c (compare)
Summary
- Add logging when `TwoLevelCache.populate_func()` fails (e.g., config loading)
- Move future cleanup to a `finally` block so failed futures are always removed
- Use `future.result()` instead of `wait()` to propagate exceptions to callers
- Add unit tests for `TwoLevelCache` error handling

Previously, when cache population failed, the exception was silently swallowed and subsequent requests would wait on the stale failed future and never retry.
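The `wait()` vs `result()` distinction the summary relies on can be seen in isolation with nothing but the standard library: `wait()` only blocks until the future finishes, while `result()` re-raises the worker's exception in the caller.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def boom():
    raise RuntimeError("population failed")


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(boom)

    # wait() returns once the future is done; the exception stays
    # inside the future and the caller never sees it.
    done, _ = wait([future])
    assert future in done

    # result() re-raises the worker's exception in the caller.
    try:
        future.result()
        propagated = False
    except RuntimeError:
        propagated = True
    assert propagated
```

This is exactly why the old `wait([future])` call let failures pass silently, and why switching the cache's `get()` to `future.result()` makes them visible to callers.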