
fix: add logging and retry for cache population failures #736

Merged
chrisburr merged 1 commit into DIRACGrid:main from chrisburr:fix/cache-population-logging
Jan 20, 2026

Conversation

@chrisburr
Member

chrisburr commented Jan 20, 2026

Summary

  • Add error logging with full traceback when TwoLevelCache.populate_func() fails (e.g., config loading)
  • Move future cleanup to a finally block so failed futures are always removed; subsequent requests can then retry instead of being stuck with a stale failed future
  • Use future.result() instead of wait() to propagate exceptions to callers
  • Add unit tests for TwoLevelCache error handling

Previously, when cache population failed:

  • Exceptions were silently swallowed by ThreadPoolExecutor
  • Failed futures were never removed from the dict
  • Subsequent requests would see the stale failed future and never retry
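The silent-swallowing behaviour described above can be reproduced with a minimal stand-alone sketch (plain `concurrent.futures`, not diracx code): `wait()` returns once the future settles regardless of outcome, while `result()` re-raises the stored exception.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def populate():
    # Stand-in for a populate_func that fails (e.g. config loading)
    raise ValueError("config loading failed")


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(populate)
    # wait() returns normally even though the task raised; the exception
    # is stored on the future and silently ignored unless inspected.
    wait([future])
    print(future.exception())  # prints "config loading failed"
    try:
        # result() re-raises the stored exception, propagating it to the caller
        future.result()
    except ValueError as exc:
        print(f"propagated: {exc}")
```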

@read-the-docs-community

read-the-docs-community bot commented Jan 20, 2026

Documentation build overview

📚 diracx | 🛠️ Build #31070786 | 📁 Comparing 7fe225c against latest (581517d)


🔍 Preview build

Files changed (1 file in total): 📝 1 modified | ➕ 0 added | ➖ 0 deleted

| File | Status |
| --- | --- |
| admin/reference/values/index.html | 📝 modified |

Copilot AI left a comment

Pull request overview

This PR improves error handling in the TwoLevelCache class by adding comprehensive logging for cache population failures and implementing retry logic. Previously, failed cache population attempts would silently fail and subsequent requests would get stuck waiting for stale failed futures.

Changes:

  • Added error logging with full traceback when cache population fails using logger.error() with exc_info=True
  • Moved future cleanup to a finally block to ensure failed futures are always removed, enabling retries
  • Changed from wait() to future.result() to properly propagate exceptions to callers
  • Added comprehensive unit tests for error handling scenarios including retry behavior

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| diracx-core/src/diracx/core/utils.py | Refactored `_work()` method with try-except-finally for robust error handling, added logging import and logger initialization, changed `get()` to use `future.result()` for exception propagation, added debug logging for non-blocking cache misses |
| diracx-core/tests/test_utils.py | Added comprehensive test suite for TwoLevelCache covering successful population, failed population with logging verification, retry after failure, failure-then-success scenarios, and non-blocking behavior |


Comment on lines +261 to +276

```python
        result = None
        try:
            result = populate_func()
        except Exception:
            logger.error(
                "Failed to populate cache key %r, will retry on next request",
                key,
                exc_info=True,
            )
            raise
        finally:
            # Always remove the future so the next request can retry on failure
            # or submit a new refresh task on success
            with self.locks[key]:
                self.futures.pop(key, None)
        if result is not None:
```

Copilot AI Jan 20, 2026


The check if result is not None cannot properly distinguish between populate_func() succeeding with a None return value versus populate_func() raising an exception. In both cases, result would be None. This breaks the original behavior where None values could be cached. Use a success flag or sentinel value instead. For example: initialize success = False before the try block, set success = True after line 263, and check if success: on line 276 instead of if result is not None:.

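The success-flag fix the reviewer proposes can be sketched as a self-contained function (the dict-based `cache` and the `cleanup` callback are illustrative stand-ins, not diracx's actual TwoLevelCache internals):

```python
import logging

logger = logging.getLogger(__name__)


def work(key, populate_func, cache, cleanup):
    """Populate cache[key], distinguishing 'returned None' from 'raised'."""
    success = False
    result = None
    try:
        result = populate_func()
        success = True
    except Exception:
        logger.error(
            "Failed to populate cache key %r, will retry on next request",
            key,
            exc_info=True,
        )
        raise
    finally:
        # Always run cleanup (e.g. pop the pending future) so retries work
        cleanup(key)
    if success:
        # A success flag lets a legitimate None result be cached too
        cache[key] = result


cache: dict = {}
work("k", lambda: None, cache, lambda key: None)
print("k" in cache, cache["k"])  # True None
```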
```diff
-        wait([future])
+        # Use result() instead of wait() to propagate any exceptions
+        future.result()
         return self.hard_cache[key]
```

Copilot AI Jan 20, 2026


After calling future.result() on line 234, the code assumes the key exists in self.hard_cache and directly accesses it on line 235. However, the new finally block logic at line 276 only caches the result if result is not None. This means if populate_func() returns None (or the check fails), the key won't be in hard_cache, causing a KeyError. This is a behavior change from the original code which cached all return values including None. Either remove the if result is not None condition on line 276, or add error handling on line 235 to handle missing keys.

Suggested change

```diff
-        return self.hard_cache[key]
+        try:
+            return self.hard_cache[key]
+        except KeyError:
+            # If the background worker did not cache a value (for example,
+            # because it skipped caching when the result was None), fall
+            # back to computing and caching the value here to avoid a
+            # KeyError and to restore the original behavior of caching
+            # all return values, including None.
+            result = populate_func()
+            self.hard_cache[key] = result
+            self.soft_cache[key] = result
+            return result
```

Comment on lines +257 to +278

```python
        cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)

        def slow_populate():
            time.sleep(0.5)
            return "value"

        # Start population in background by acquiring lock then releasing
        # Use non-blocking mode which should raise NotReadyError
        with pytest.raises(NotReadyError, match="not ready yet"):
            # First call starts background population, second call (non-blocking)
            # should raise NotReadyError since cache isn't populated yet
            import threading

            def start_slow():
                cache.get("key", slow_populate, blocking=True)

            thread = threading.Thread(target=start_slow)
            thread.start()
            time.sleep(0.1)  # Give thread time to start
            cache.get("key", slow_populate, blocking=False)
            thread.join()
```

Copilot AI Jan 20, 2026


This test has structural issues: the pytest.raises context contains not just the call that should raise NotReadyError, but also thread management code. If thread.start() or time.sleep() somehow raised NotReadyError, the test would incorrectly pass. More critically, thread.join() is inside the context, so if the background thread raises an exception during its execution, that exception will be silently ignored. The test should be restructured to start the thread and verify it completes successfully outside the pytest.raises block. Additionally, there's a race condition: the background thread might complete within the 0.1 second sleep, causing the test to fail intermittently.

Suggested change (replacing the test body quoted above):

```python
        import threading

        cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)
        started = threading.Event()

        def slow_populate():
            # Signal that population has started, then sleep to simulate work
            started.set()
            time.sleep(0.5)
            return "value"

        def start_slow():
            cache.get("key", slow_populate, blocking=True)

        # Start population in background
        thread = threading.Thread(target=start_slow)
        thread.start()
        # Wait until slow_populate has started to avoid race conditions
        assert started.wait(timeout=1.0)
        # Use non-blocking mode which should raise NotReadyError
        with pytest.raises(NotReadyError, match="not ready yet"):
            # Non-blocking call should raise NotReadyError since cache
            # isn't populated yet
            cache.get("key", slow_populate, blocking=False)
        # Ensure background thread completes and surface any exceptions it raised
        thread.join()
```
Comment on lines +235 to +253

```python
    def test_failed_population_then_success(self):
        """Test that after a failure, subsequent successful call works."""
        cache = TwoLevelCache(soft_ttl=10, hard_ttl=60)
        should_fail = True

        def populate():
            if should_fail:
                raise ValueError("Test error")
            return "success_value"

        # First call fails
        with pytest.raises(ValueError):
            cache.get("key", populate, blocking=True)

        # Second call succeeds
        should_fail = False
        result = cache.get("key", populate, blocking=True)
        assert result == "success_value"
```

Copilot AI Jan 20, 2026


The tests don't cover the case where populate_func returns None or other falsy values (False, 0, empty string, empty list). Given the bug at line 276 in utils.py where if result is not None prevents caching of None values, a test should be added to document the expected behavior for falsy return values. This would help catch the critical bug where line 235 in utils.py assumes the key exists in hard_cache after future.result() completes.

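The missing falsy-value coverage can be pinned down with a tiny stand-alone demo of the guard in question (a plain dict in place of the real TwoLevelCache):

```python
def populate_and_cache(cache, key, populate_func):
    # Reproduces the guard flagged above: `if result is not None`
    result = populate_func()
    if result is not None:  # buggy: a legitimate None result is never cached
        cache[key] = result
    return result


cache: dict = {}
populate_and_cache(cache, "none_key", lambda: None)
populate_and_cache(cache, "falsy_key", lambda: 0)  # other falsy values are fine
print("none_key" in cache, "falsy_key" in cache)  # False True
```

A test asserting that a second `get()` after a None-returning `populate_func` does not raise KeyError would document the intended behaviour either way.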
```python
        # Second call should use cached value
        result = cache.get("key", populate)
        assert result == "test_value"
        assert call_count == 1
```

Copilot AI Jan 20, 2026


This assertion is always true, given the preceding condition.

Previously when TwoLevelCache.populate_func() failed:
- Exceptions were silently swallowed by ThreadPoolExecutor
- Failed futures were never removed from the dict
- Subsequent requests would see the stale failed future and never retry

Changes:
- Add error logging with full traceback when population fails
- Move future cleanup to finally block so it always runs
- Use future.result() instead of wait() to propagate exceptions
- Add unit tests for TwoLevelCache error handling
@chrisburr chrisburr force-pushed the fix/cache-population-logging branch from f421620 to 7fe225c Compare January 20, 2026 08:16
@chrisburr chrisburr merged commit b899a8d into DIRACGrid:main Jan 20, 2026
25 checks passed