PKRange Cache Warm Up by dibahlfi · Pull Request #47066 · Azure/azure-sdk-for-python

dibahlfi · 2026-05-22T00:53:32Z

This PR fixes a bug where if a customer set a short deadline on a request and that deadline ran out while the address-list lookup was still in progress, the lookup was thrown away. The customer's next call would start the same lookup again from scratch, hit the same deadline, and fail the same way. A short deadline on a slow network could keep a customer stuck in this loop indefinitely.

This PR fixes the lookup so it survives the customer's deadline. Two improvements ship together, each addressing a different way the deadline could reach the lookup:

Improvement 1 (applies to the async client). The lookup now keeps running in the background even if the customer's wait runs out. The customer still sees their timeout, but the work the SDK started isn't lost — it finishes a moment later and the result is saved. The customer's retry finds the answer already there and proceeds immediately, with no extra round-trip to the service.

Improvement 2 (applies to both sync and async clients). A few SDK methods (most visibly read_feed_ranges) pass the customer's timeout all the way down to the internal lookup. That meant the customer's "2-second budget for this call" was also bounding the internal address-list lookup, which could itself time out before the customer's actual work even started. The SDK now keeps the customer's deadline scoped to the work the customer actually asked about; the internal lookup runs under the SDK's own retry rules.

Scope (intentional)
This change only affects the internal address-list cache for container partitions. Other internal caches the SDK keeps (such as container-properties) are not changed: they are populated by a single small request that doesn't have the same failure pattern, so changing them would add risk without a customer-visible benefit.

dibahlfi · 2026-05-22T00:53:54Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T00:54:14Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T00:54:29Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Improves Azure Cosmos DB async routing-map (PKRange) cache warm-up reliability under aggressive caller timeouts/cancellation by introducing a shared “single-flight” in-flight fetch task per (event loop, collection), so cache publication can complete even if the originating awaiter is cancelled.

Changes:

Add shared in-flight fetch task tracking in async PartitionKeyRangeCache and await via asyncio.shield to decouple cache publication from caller cancellation.
Extend async/sync routing-map provider tests to cover timeout-kwarg forwarding, cancellation survival, single-flight behavior, and in-flight cleanup.
Update shared-cache lifecycle tests to reset/validate the new shared in-flight state.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py`	Clears new shared in-flight state between tests; adds lifecycle test for releasing while a fetch is in flight.
`sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider.py`	Adds sync coverage ensuring tight `timeout=` kwargs are forwarded and caching still populates on success; expands concurrency test commentary.
`sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py`	Adds async coverage for cancellation survival, single-flight coalescing, in-flight cleanup on success/failure, and concurrency invariants.
`sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py`	Implements shared in-flight fetch-and-publish tasks and shields awaiters so cache warm-up can complete despite caller cancellation.

xinlian12 · 2026-05-22T01:43:56Z

✅ Review complete (48:46)

Posted 4 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-22T03:23:29Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T03:23:44Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T03:23:59Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-22T04:06:32Z

✅ Review complete (42:20)

Posted 3 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-22T05:20:50Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T06:06:49Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T06:07:05Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-22T15:06:06Z

✅ Review complete (44:47)

Posted 1 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

simorenoh

LGTM overall, can approve after DR drilling

simorenoh

Approving pre-emptively in case we need it merged for proper testing.

tvaron3 · 2026-05-27T18:50:40Z

+
+    Stripping the customer's deadline at the cache layer is deliberate.
+    Most cache call sites already drop ``**kwargs`` two layers above the
+    fetch, but a small set of paths -- ``read_feed_ranges`` (sync and


read_feed_ranges should honor the timeout as the timeout is explicitly passed in for pkranges for the rest the pkrange call only happens as part of the lifecycle of another operation and in that case we shouldn't honor the timeout override

tvaron3 · 2026-05-27T18:57:09Z

Recommmendations

Force-refresh callers lose staleness sensitivity. In-flight-task check runs before determine_refresh_action. A force_refresh=True caller at T3 joins a fetch that began at T2 — sampling pre-split server state. Weakens the documented "at least as fresh as now" invariant.
Joining callers' kwargs silently dropped. Only the originator's kwargs go into the task closure. Joiners' raw_response_hook, correlation IDs, and custom headers vanish. Pre-fix, every caller's hook fired for their own fetch. Now coalesced away non-obviously.
asyncio.shield + cancellation = noisy "Task exception was never retrieved" logs. Customer's wait_for(timeout=2) fires → shield raises CancelledError to awaiter → fetch task still running raises → nobody awaits it → asyncio logs ERROR at GC. Monitoring will flag this as a regression. Fix: task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None).
release() orphans in-flight tasks during shutdown. Drops the dict entry but doesn't cancel tasks; they then hit a torn-down transport → IOError → unretrieved exception. Steady drumbeat of false-positive errors for per-request client patterns. Fix: iterate and task.cancel() before popping.

dibahlfi · 2026-05-28T00:27:14Z

@tvaron3 - all the above are very valuable feedback. Thank you.

When a caller asks for a forced fresh read and another caller is already mid-fetch, the forced caller may receive stale data
Deferred to a follow-up issue
Real but narrow timing window, and same behavior as before this PR (not a new bug). A correct fix needs the cache to compare what's in flight against the caller's "what I last saw" marker, which is a big and separate change will do as a follow up not part of this PR.
When several callers ask for the same partition info at the same time, only the first caller's response callback fires — the other callers' callbacks are silently dropped - Fixed
Joining callers' request headers and correlation IDs don't go out on the wire — only the first caller's do

documented as a known limit, no code change. Cannot be fixed without making one separate network call per caller, which would undo the whole memory and traffic saving this PR is built on. Called out so it's explicit, not hidden.

3 When a caller's request is cancelled or times out while a shared lookup is still running, Python's async runtime later logs a noisy ERROR ("Task exception was never retrieved") if that shared lookup ends up failing - Fixed

4 When the last CosmosClient for an endpoint shuts down, background partition lookups that were still running were left dangling - Fixed

dibahlfi · 2026-05-28T00:29:28Z

/azp run python - cosmos - tests

dibahlfi · 2026-05-28T00:29:41Z

@sdkReviewAgent-2

azure-pipelines · 2026-05-28T00:29:43Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-28T01:17:28Z

+            if existing_entry is not None:
+                if not existing_entry.task.done():
+                    # Join in-flight fetch and register joiner hook if present.
+                    joiner_hook = fetch_kwargs.get("raw_response_hook")


🟢 Suggestion — Concurrency: Joiner's timeout semantics silently discarded during coalescing

When a caller joins an existing in-flight fetch, only raw_response_hook is extracted from its kwargs — all other kwargs (including _honor_customer_timeout, timeout, read_timeout) are silently discarded. The originator's kwargs drive the shared fetch:

joiner_hook = fetch_kwargs.get("raw_response_hook") if joiner_hook is not None: existing_entry.joined_hooks.append(joiner_hook) return existing_entry.task

Concrete scenario: A normal query starts a cold-cache fetch (timeout stripped by default). read_feed_ranges(timeout=5, _honor_customer_timeout=True) arrives as a joiner — its timeout opt-in is silently dropped and the fetch runs without a customer timeout. The customer observes no timeout enforcement on that call. Conversely, if read_feed_ranges(timeout=5) is the originator, a normal query joiner would be subject to the originator's 5-second HTTP timeout, and if the fetch exceeds 5s the joiner gets a spurious timeout error.

Impact: Very narrow window — requires concurrent cold-cache access from different call types to the same collection. Self-heals on retry (the finally block frees the slot). Data correctness is never affected.

Consider: Either documenting this as a known trade-off in the docstring, or skipping coalescing when _honor_customer_timeout differs between originator and joiner.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

documented skipping wont work and make it very complicated with not much benfit

xinlian12 · 2026-05-28T01:17:45Z

✅ Review complete (47:39)

Posted 2 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-28T02:29:04Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-28T02:29:20Z

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3

LGTM

PKRange Cache fix

904b3e1

dibahlfi requested a review from a team as a code owner May 22, 2026 00:53

Copilot AI review requested due to automatic review settings May 22, 2026 00:53

Copilot started reviewing on behalf of dibahlfi May 22, 2026 00:53 View session

github-actions Bot added the Cosmos label May 22, 2026

github-project-automation Bot added this to CosmosDB Python Eco-System May 22, 2026

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated

Comment thread sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py Outdated