fix: add logging and retry for cache population failures #736
chrisburr merged 1 commit into DIRACGrid:main from
Conversation
Force-pushed from 8e3c1c5 to f421620 (compare)
Pull request overview
This PR improves error handling in the TwoLevelCache class by adding comprehensive logging for cache population failures and implementing retry logic. Previously, failed cache population attempts would silently fail and subsequent requests would get stuck waiting for stale failed futures.
Changes:
- Added error logging with full traceback when cache population fails, using `logger.error()` with `exc_info=True`
- Moved future cleanup to a `finally` block to ensure failed futures are always removed, enabling retries
- Changed from `wait()` to `future.result()` to properly propagate exceptions to callers
- Added comprehensive unit tests for error handling scenarios, including retry behavior
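The pattern these changes describe can be sketched in miniature as below. This is a simplified stand-in, not the real diracx code: plain dicts replace the TTL-backed soft/hard caches, and the class name `MiniTwoLevelCache` is hypothetical; only the try/except/finally shape, the logging, and the `future.result()` call reflect the PR description.

```python
import logging
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)


class MiniTwoLevelCache:
    """Minimal sketch of the error-handling pattern described in this PR."""

    def __init__(self):
        self.hard_cache = {}  # stand-in for the real TTL-backed cache
        self.futures = {}
        self.locks = defaultdict(threading.Lock)
        self.pool = ThreadPoolExecutor(max_workers=2)

    def _work(self, key, populate_func):
        try:
            self.hard_cache[key] = populate_func()
        except Exception:
            # Log with full traceback so failures are no longer silent
            logger.error(
                "Failed to populate cache key %r, will retry on next request",
                key,
                exc_info=True,
            )
            raise
        finally:
            # Always drop the future so the next request can retry
            with self.locks[key]:
                self.futures.pop(key, None)

    def get(self, key, populate_func):
        if key in self.hard_cache:
            return self.hard_cache[key]
        with self.locks[key]:
            future = self.futures.get(key)
            if future is None:
                future = self.pool.submit(self._work, key, populate_func)
                self.futures[key] = future
        # result() re-raises any exception from _work, unlike wait()
        future.result()
        return self.hard_cache[key]
```

Because the future is popped in `finally`, a failed population no longer leaves a stale future behind: the next `get()` for that key submits a fresh attempt.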
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| diracx-core/src/diracx/core/utils.py | Refactored _work() method with try-except-finally for robust error handling, added logging import and logger initialization, changed get() to use future.result() for exception propagation, added debug logging for non-blocking cache misses |
| diracx-core/tests/test_utils.py | Added comprehensive test suite for TwoLevelCache covering successful population, failed population with logging verification, retry after failure, failure-then-success scenarios, and non-blocking behavior |
diracx-core/src/diracx/core/utils.py
Outdated
```python
result = None
try:
    result = populate_func()
except Exception:
    logger.error(
        "Failed to populate cache key %r, will retry on next request",
        key,
        exc_info=True,
    )
    raise
finally:
    # Always remove the future so the next request can retry on failure
    # or submit a new refresh task on success
    with self.locks[key]:
        self.futures.pop(key, None)
if result is not None:
```
The check if result is not None cannot properly distinguish between populate_func() succeeding with a None return value versus populate_func() raising an exception. In both cases, result would be None. This breaks the original behavior where None values could be cached. Use a success flag or sentinel value instead. For example: initialize success = False before the try block, set success = True after line 263, and check if success: on line 276 instead of if result is not None:.
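The success-flag fix this comment proposes could look like the sketch below. It is a simplified, self-contained version (a free function with the caches passed in, rather than the real method), so the names and signature are illustrative only; the flag logic is the point.

```python
import logging

logger = logging.getLogger(__name__)


def work(key, populate_func, hard_cache, futures, lock):
    """Sketch of the reviewer's success-flag suggestion (simplified signature)."""
    success = False
    try:
        result = populate_func()
        success = True  # set only if populate_func did not raise
    except Exception:
        logger.error(
            "Failed to populate cache key %r, will retry on next request",
            key,
            exc_info=True,
        )
        raise
    finally:
        with lock:
            futures.pop(key, None)
    # Unlike `result is not None`, this caches falsy results, including None
    if success:
        hard_cache[key] = result
```

The flag distinguishes "returned None" from "raised", so a legitimate `None` result is still cached while a failure leaves the key absent for retry.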
```diff
-wait([future])
+# Use result() instead of wait() to propagate any exceptions
+future.result()
 return self.hard_cache[key]
```
After calling future.result() on line 234, the code assumes the key exists in self.hard_cache and directly accesses it on line 235. However, the new finally block logic at line 276 only caches the result if result is not None. This means if populate_func() returns None (or the check fails), the key won't be in hard_cache, causing a KeyError. This is a behavior change from the original code which cached all return values including None. Either remove the if result is not None condition on line 276, or add error handling on line 235 to handle missing keys.
Suggested change:

```diff
-return self.hard_cache[key]
+try:
+    return self.hard_cache[key]
+except KeyError:
+    # If the background worker did not cache a value (for example,
+    # because it skipped caching when the result was None), fall
+    # back to computing and caching the value here to avoid a
+    # KeyError and to restore the original behavior of caching
+    # all return values, including None.
+    result = populate_func()
+    self.hard_cache[key] = result
+    self.soft_cache[key] = result
+    return result
```
diracx-core/tests/test_utils.py
Outdated
```python
cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)

def slow_populate():
    time.sleep(0.5)
    return "value"

# Start population in background by acquiring lock then releasing
# Use non-blocking mode which should raise NotReadyError
with pytest.raises(NotReadyError, match="not ready yet"):
    # First call starts background population, second call (non-blocking)
    # should raise NotReadyError since cache isn't populated yet
    import threading

    def start_slow():
        cache.get("key", slow_populate, blocking=True)

    thread = threading.Thread(target=start_slow)
    thread.start()
    time.sleep(0.1)  # Give thread time to start
    cache.get("key", slow_populate, blocking=False)
    thread.join()
```
This test has structural issues: the pytest.raises context contains not just the call that should raise NotReadyError, but also thread management code. If thread.start() or time.sleep() somehow raised NotReadyError, the test would incorrectly pass. More critically, thread.join() is inside the context, so if the background thread raises an exception during its execution, that exception will be silently ignored. The test should be restructured to start the thread and verify it completes successfully outside the pytest.raises block. Additionally, there's a race condition: the background thread might complete within the 0.1 second sleep, causing the test to fail intermittently.
Suggested change (replacing the snippet above):

```python
import threading

cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)
started = threading.Event()

def slow_populate():
    # Signal that population has started, then sleep to simulate work
    started.set()
    time.sleep(0.5)
    return "value"

def start_slow():
    cache.get("key", slow_populate, blocking=True)

# Start population in background
thread = threading.Thread(target=start_slow)
thread.start()
# Wait until slow_populate has started to avoid race conditions
assert started.wait(timeout=1.0)
# Use non-blocking mode which should raise NotReadyError
with pytest.raises(NotReadyError, match="not ready yet"):
    # Non-blocking call should raise NotReadyError since cache isn't populated yet
    cache.get("key", slow_populate, blocking=False)
# Ensure background thread completes and surface any exceptions it raised
thread.join()
```
```python
def test_failed_population_then_success(self):
    """Test that after a failure, subsequent successful call works."""
    cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)
    should_fail = True

    def populate():
        if should_fail:
            raise ValueError("Test error")
        return "success_value"

    # First call fails
    with pytest.raises(ValueError):
        cache.get("key", populate, blocking=True)

    # Second call succeeds
    should_fail = False
    result = cache.get("key", populate, blocking=True)
    assert result == "success_value"
```
The tests don't cover the case where populate_func returns None or other falsy values (False, 0, empty string, empty list). Given the bug at line 276 in utils.py where if result is not None prevents caching of None values, a test should be added to document the expected behavior for falsy return values. This would help catch the critical bug where line 235 in utils.py assumes the key exists in hard_cache after future.result() completes.
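A test along these lines would pin down that behavior. The snippet below is a hypothetical sketch: `FakeCache` is a deliberately simplified, synchronous stand-in defined here so the pattern is self-contained; the real test would import `TwoLevelCache` from `diracx.core.utils` and call it the same way.

```python
class FakeCache:
    """Synchronous stand-in for TwoLevelCache (hypothetical), used here
    so the test pattern is runnable without diracx installed."""

    def __init__(self):
        self._store = {}

    def get(self, key, populate_func, blocking=True):
        # Note: membership check, not truthiness, so falsy values stay cached
        if key not in self._store:
            self._store[key] = populate_func()
        return self._store[key]


def check_falsy_value_is_cached(value):
    cache = FakeCache()
    calls = []

    def populate():
        calls.append(1)
        return value

    # Falsy results should be cached and returned like any other value
    assert cache.get("key", populate, blocking=True) == value
    # Second call must hit the cache, not re-run populate
    assert cache.get("key", populate, blocking=True) == value
    assert len(calls) == 1


for value in (None, False, 0, "", []):
    check_falsy_value_is_cached(value)
```

Run against the real `TwoLevelCache`, this would fail wherever an `if result is not None` guard (or a truthiness check) skips caching, documenting the expected behavior for all falsy returns.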
```python
# Second call should use cached value
result = cache.get("key", populate)
assert result == "test_value"
assert call_count == 1
```
This assertion is always true because of this condition, so the test cannot fail.
Previously when `TwoLevelCache.populate_func()` failed:
- Exceptions were silently swallowed by ThreadPoolExecutor
- Failed futures were never removed from the dict
- Subsequent requests would see the stale failed future and never retry

Changes:
- Add error logging with full traceback when population fails
- Move future cleanup to finally block so it always runs
- Use future.result() instead of wait() to propagate exceptions
- Add unit tests for TwoLevelCache error handling
Force-pushed from f421620 to 7fe225c (compare)
Summary
- Add logging when `TwoLevelCache.populate_func()` fails (e.g., config loading)
- Move future cleanup to a `finally` block so failed futures are always removed
- Use `future.result()` instead of `wait()` to propagate exceptions to callers
- Add unit tests for `TwoLevelCache` error handling

Previously, when cache population failed, the exception was silently swallowed and subsequent requests would wait on the stale failed future and never retry.
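The `wait()` vs `result()` distinction the summary relies on can be seen in isolation with nothing but the standard library: `wait()` only blocks until the future finishes, while `result()` re-raises the worker's exception in the caller.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def boom():
    raise RuntimeError("population failed")


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(boom)

    # wait() returns once the future is done; the exception stays
    # inside the future and the caller never sees it.
    done, _ = wait([future])
    assert future in done

    # result() re-raises the worker's exception in the caller.
    try:
        future.result()
        propagated = False
    except RuntimeError:
        propagated = True
    assert propagated
```

This is exactly why the old `wait([future])` call let failures pass silently, and why switching the cache's `get()` to `future.result()` makes them visible to callers.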