PKRange Cache Warm Up #47066
@sdkReviewAgent-2
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Improves Azure Cosmos DB async routing-map (PKRange) cache warm-up reliability under aggressive caller timeouts/cancellation by introducing a shared “single-flight” in-flight fetch task per (event loop, collection), so cache publication can complete even if the originating awaiter is cancelled.
Changes:
- Add shared in-flight fetch task tracking in async `PartitionKeyRangeCache` and await via `asyncio.shield` to decouple cache publication from caller cancellation.
- Extend async/sync routing-map provider tests to cover timeout-kwarg forwarding, cancellation survival, single-flight behavior, and in-flight cleanup.
- Update shared-cache lifecycle tests to reset/validate the new shared in-flight state.
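The single-flight and shield pattern described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `SingleFlightCache`, `slow_fetch`, and the key `"coll"` are hypothetical names, and the real cache keys its in-flight tasks by (event loop, collection).

```python
import asyncio


class SingleFlightCache:
    """Hypothetical stand-in for the cache in this PR, for illustration only."""

    def __init__(self, fetch):
        self._fetch = fetch      # coroutine function: key -> value
        self._cache = {}         # published results
        self._in_flight = {}     # key -> shared asyncio.Task
        self.fetch_count = 0

    async def _fetch_and_publish(self, key):
        try:
            self.fetch_count += 1
            value = await self._fetch(key)
            self._cache[key] = value          # publish even if no caller is waiting
            return value
        finally:
            self._in_flight.pop(key, None)    # free the slot on success or failure

    async def get(self, key):
        if key in self._cache:
            return self._cache[key]
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch_and_publish(key))
            self._in_flight[key] = task
        # shield: cancelling this awaiter does not cancel the shared task
        return await asyncio.shield(task)


async def demo():
    async def slow_fetch(key):
        await asyncio.sleep(0.2)
        return f"ranges-for-{key}"

    cache = SingleFlightCache(slow_fetch)
    try:
        await asyncio.wait_for(cache.get("coll"), timeout=0.01)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    value = await cache.get("coll")   # joins the fetch that is still running
    await cache.get("coll")           # served from the published cache entry
    return timed_out, value, cache.fetch_count


timed_out, value, fetch_count = asyncio.run(demo())
print(timed_out, value, fetch_count)
```

In this sketch the first caller still sees its timeout, but the fetch runs only once and the retry reuses it.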
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py` | Clears new shared in-flight state between tests; adds lifecycle test for releasing while a fetch is in flight. |
| `sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider.py` | Adds sync coverage ensuring tight `timeout=` kwargs are forwarded and caching still populates on success; expands concurrency test commentary. |
| `sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py` | Adds async coverage for cancellation survival, single-flight coalescing, in-flight cleanup on success/failure, and concurrency invariants. |
| `sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py` | Implements shared in-flight fetch-and-publish tasks and shields awaiters so cache warm-up can complete despite caller cancellation. |
✅ Review complete (48:46) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
@sdkReviewAgent-2
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
✅ Review complete (42:20) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
@sdkReviewAgent-2
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
✅ Review complete (44:47) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
simorenoh left a comment
LGTM overall, can approve after DR drilling.
simorenoh left a comment
Approving pre-emptively in case we need it merged for proper testing.
Stripping the customer's deadline at the cache layer is deliberate.
Most cache call sites already drop ``**kwargs`` two layers above the
fetch, but a small set of paths -- ``read_feed_ranges`` (sync and

`read_feed_ranges` should honor the timeout, because there the timeout is explicitly passed in for the pkranges call. For the rest, the pkrange call only happens as part of the lifecycle of another operation, and in that case we shouldn't honor the timeout override.
Recommendations
@tvaron3 - all the above are very valuable feedback. Thank you.
3. When a caller's request is cancelled or times out while a shared lookup is still running, Python's async runtime later logs a noisy ERROR ("Task exception was never retrieved") if that shared lookup ends up failing. - Fixed
4. When the last CosmosClient for an endpoint shuts down, background partition lookups that were still running were left dangling. - Fixed
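One plausible shape for these two fixes is sketched below. It is an assumption about the approach, not the PR's actual code: `retrieve_exception` and `cancel_in_flight` are hypothetical helper names. A done-callback marks a failed shared task's exception as retrieved, and a shutdown helper cancels and drains whatever is still running.

```python
import asyncio


def retrieve_exception(task):
    # Hypothetical done-callback: reading .exception() marks it as retrieved,
    # so asyncio has nothing to report when the orphaned task is collected.
    if not task.cancelled():
        task.exception()


async def cancel_in_flight(tasks):
    # Hypothetical shutdown helper: cancel tasks that are still running and
    # wait for them to finish unwinding.
    pending = [t for t in tasks if not t.done()]
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)


async def demo():
    async def failing_fetch():
        await asyncio.sleep(0.05)
        raise RuntimeError("lookup failed")

    async def slow_fetch():
        await asyncio.sleep(10)

    failing = asyncio.ensure_future(failing_fetch())
    failing.add_done_callback(retrieve_exception)
    try:
        # The caller gives up first; the shared task later fails on its own.
        await asyncio.wait_for(asyncio.shield(failing), timeout=0.01)
    except asyncio.TimeoutError:
        pass
    await asyncio.sleep(0.1)

    dangling = asyncio.ensure_future(slow_fetch())
    await asyncio.sleep(0)  # let the task start before shutting down
    await cancel_in_flight([failing, dangling])
    return failing.done(), dangling.cancelled()


failing_done, dangling_cancelled = asyncio.run(demo())
print(failing_done, dangling_cancelled)
```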
/azp run python - cosmos - tests
@sdkReviewAgent-2
Azure Pipelines successfully started running 1 pipeline(s).
    if existing_entry is not None:
        if not existing_entry.task.done():
            # Join in-flight fetch and register joiner hook if present.
            joiner_hook = fetch_kwargs.get("raw_response_hook")
🟢 Suggestion — Concurrency: Joiner's timeout semantics silently discarded during coalescing
When a caller joins an existing in-flight fetch, only `raw_response_hook` is extracted from its kwargs; all other kwargs (including `_honor_customer_timeout`, `timeout`, `read_timeout`) are silently discarded. The originator's kwargs drive the shared fetch:

    joiner_hook = fetch_kwargs.get("raw_response_hook")
    if joiner_hook is not None:
        existing_entry.joined_hooks.append(joiner_hook)
    return existing_entry.task

Concrete scenario: A normal query starts a cold-cache fetch (timeout stripped by default). `read_feed_ranges(timeout=5, _honor_customer_timeout=True)` arrives as a joiner; its timeout opt-in is silently dropped and the fetch runs without a customer timeout. The customer observes no timeout enforcement on that call. Conversely, if `read_feed_ranges(timeout=5)` is the originator, a normal query joiner would be subject to the originator's 5-second HTTP timeout, and if the fetch exceeds 5s the joiner gets a spurious timeout error.
Impact: Very narrow window — requires concurrent cold-cache access from different call types to the same collection. Self-heals on retry (the finally block frees the slot). Data correctness is never affected.
Consider: Either documenting this as a known trade-off in the docstring, or skipping coalescing when _honor_customer_timeout differs between originator and joiner.
Documented. Skipping coalescing won't work and would make it very complicated with not much benefit.
✅ Review complete (47:39) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
This PR fixes a bug where if a customer set a short deadline on a request and that deadline ran out while the address-list lookup was still in progress, the lookup was thrown away. The customer's next call would start the same lookup again from scratch, hit the same deadline, and fail the same way. A short deadline on a slow network could keep a customer stuck in this loop indefinitely.
The fix lets the lookup survive the customer's deadline. Two improvements ship together, each addressing a different way the deadline could reach the lookup:
Improvement 1 (applies to the async client). The lookup now keeps running in the background even if the customer's wait runs out. The customer still sees their timeout, but the work the SDK started isn't lost — it finishes a moment later and the result is saved. The customer's retry finds the answer already there and proceeds immediately, with no extra round-trip to the service.
Improvement 2 (applies to both sync and async clients). A few SDK methods (most visibly read_feed_ranges) pass the customer's timeout all the way down to the internal lookup. That meant the customer's "2-second budget for this call" was also bounding the internal address-list lookup, which could itself time out before the customer's actual work even started. The SDK now keeps the customer's deadline scoped to the work the customer actually asked about; the internal lookup runs under the SDK's own retry rules.
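A rough sketch of the deadline scoping in Improvement 2, assuming the kwarg names mentioned in the review thread (`timeout`, `read_timeout`, and the `_honor_customer_timeout` opt-in). The helper `kwargs_for_internal_fetch` is hypothetical and only illustrates the idea; the SDK's actual filtering may differ.

```python
# Assumed timeout-related kwarg names, taken from the review discussion.
_TIMEOUT_KEYS = ("timeout", "read_timeout")


def kwargs_for_internal_fetch(kwargs):
    """Hypothetical helper: build the kwargs for the internal lookup.

    The customer's timeout is dropped unless the caller explicitly opted in
    with ``_honor_customer_timeout=True``.
    """
    honor = kwargs.get("_honor_customer_timeout", False)
    # The opt-in flag itself is internal and never forwarded.
    forwarded = {k: v for k, v in kwargs.items() if k != "_honor_customer_timeout"}
    if not honor:
        for key in _TIMEOUT_KEYS:
            forwarded.pop(key, None)
    return forwarded


# A lookup triggered as part of another operation: the deadline is stripped.
default_kwargs = kwargs_for_internal_fetch({"timeout": 2, "read_timeout": 1, "raw_response_hook": None})

# A call that opted in: the timeout is kept.
opt_in_kwargs = kwargs_for_internal_fetch({"timeout": 5, "_honor_customer_timeout": True})

print(default_kwargs, opt_in_kwargs)
```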
Scope (intentional)
This change only affects the internal address-list cache for container partitions. Other internal caches the SDK keeps (such as container-properties) are not changed: they are populated by a single small request that doesn't have the same failure pattern, so changing them would add risk without a customer-visible benefit.