Skip to content

PKRange Cache Warm Up#47066

Open
dibahlfi wants to merge 9 commits into
mainfrom
users/dibahl/caching-foreground-bufg
Open

PKRange Cache Warm Up#47066
dibahlfi wants to merge 9 commits into
mainfrom
users/dibahl/caching-foreground-bufg

Conversation

@dibahlfi
Copy link
Copy Markdown
Member

@dibahlfi dibahlfi commented May 22, 2026

This PR fixes a bug where if a customer set a short deadline on a request and that deadline ran out while the address-list lookup was still in progress, the lookup was thrown away. The customer's next call would start the same lookup again from scratch, hit the same deadline, and fail the same way. A short deadline on a slow network could keep a customer stuck in this loop indefinitely.

This PR fixes the lookup so it survives the customer's deadline. Two improvements ship together, each addressing a different way the deadline could reach the lookup:

Improvement 1 (applies to the async client). The lookup now keeps running in the background even if the customer's wait runs out. The customer still sees their timeout, but the work the SDK started isn't lost — it finishes a moment later and the result is saved. The customer's retry finds the answer already there and proceeds immediately, with no extra round-trip to the service.

Improvement 2 (applies to both sync and async clients). A few SDK methods (most visibly read_feed_ranges) pass the customer's timeout all the way down to the internal lookup. That meant the customer's "2-second budget for this call" was also bounding the internal address-list lookup, which could itself time out before the customer's actual work even started. The SDK now keeps the customer's deadline scoped to the work the customer actually asked about; the internal lookup runs under the SDK's own retry rules.

Scope (intentional)
This change only affects the internal address-list cache for container partitions. Other internal caches the SDK keeps (such as container-properties) are not changed: they are populated by a single small request that doesn't have the same failure pattern, so changing them would add risk without a customer-visible benefit.

@dibahlfi dibahlfi requested a review from a team as a code owner May 22, 2026 00:53
Copilot AI review requested due to automatic review settings May 22, 2026 00:53
@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves Azure Cosmos DB async routing-map (PKRange) cache warm-up reliability under aggressive caller timeouts/cancellation by introducing a shared “single-flight” in-flight fetch task per (event loop, collection), so cache publication can complete even if the originating awaiter is cancelled.

Changes:

  • Add shared in-flight fetch task tracking in async PartitionKeyRangeCache and await via asyncio.shield to decouple cache publication from caller cancellation.
  • Extend async/sync routing-map provider tests to cover timeout-kwarg forwarding, cancellation survival, single-flight behavior, and in-flight cleanup.
  • Update shared-cache lifecycle tests to reset/validate the new shared in-flight state.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py Clears new shared in-flight state between tests; adds lifecycle test for releasing while a fetch is in flight.
sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider.py Adds sync coverage ensuring tight timeout= kwargs are forwarded and caching still populates on success; expands concurrency test commentary.
sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py Adds async coverage for cancellation survival, single-flight coalescing, in-flight cleanup on success/failure, and concurrency invariants.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Implements shared in-flight fetch-and-publish tasks and shields awaiters so cache warm-up can complete despite caller cancellation.

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py Outdated
@xinlian12
Copy link
Copy Markdown
Member

Review complete (48:46)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated
@xinlian12
Copy link
Copy Markdown
Member

Review complete (42:20)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Copy Markdown
Member

Review complete (44:47)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, can approve after DR drilling

Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving pre-emptively in case we need it merged for proper testing.


Stripping the customer's deadline at the cache layer is deliberate.
Most cache call sites already drop ``**kwargs`` two layers above the
fetch, but a small set of paths -- ``read_feed_ranges`` (sync and
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_feed_ranges should honor the timeout as the timeout is explicitly passed in for pkranges for the rest the pkrange call only happens as part of the lifecycle of another operation and in that case we shouldn't honor the timeout override

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@tvaron3
Copy link
Copy Markdown
Member

tvaron3 commented May 27, 2026

Recommmendations

  1. Force-refresh callers lose staleness sensitivity. In-flight-task check runs before determine_refresh_action. A force_refresh=True caller at T3 joins a fetch that began at T2 — sampling pre-split server state. Weakens the documented "at least as fresh as now" invariant.

  2. Joining callers' kwargs silently dropped. Only the originator's kwargs go into the task closure. Joiners' raw_response_hook, correlation IDs, and custom headers vanish. Pre-fix, every caller's hook fired for their own fetch. Now coalesced away non-obviously.

  3. asyncio.shield + cancellation = noisy "Task exception was never retrieved" logs. Customer's wait_for(timeout=2) fires → shield raises CancelledError to awaiter → fetch task still running raises → nobody awaits it → asyncio logs ERROR at GC. Monitoring will flag this as a regression. Fix: task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None).

  4. release() orphans in-flight tasks during shutdown. Drops the dict entry but doesn't cancel tasks; they then hit a torn-down transport → IOError → unretrieved exception. Steady drumbeat of false-positive errors for per-request client patterns. Fix: iterate and task.cancel() before popping.

@dibahlfi
Copy link
Copy Markdown
Member Author

dibahlfi commented May 28, 2026

@tvaron3 - all the above are very valuable feedback. Thank you.

  1. When a caller asks for a forced fresh read and another caller is already mid-fetch, the forced caller may receive stale data
    Deferred to a follow-up issue
    Real but narrow timing window, and same behavior as before this PR (not a new bug). A correct fix needs the cache to compare what's in flight against the caller's "what I last saw" marker, which is a big and separate change will do as a follow up not part of this PR.

  2. When several callers ask for the same partition info at the same time, only the first caller's response callback fires — the other callers' callbacks are silently dropped - Fixed
    Joining callers' request headers and correlation IDs don't go out on the wire — only the first caller's do

  • documented as a known limit, no code change. Cannot be fixed without making one separate network call per caller, which would undo the whole memory and traffic saving this PR is built on. Called out so it's explicit, not hidden.

3 When a caller's request is cancelled or times out while a shared lookup is still running, Python's async runtime later logs a noisy ERROR ("Task exception was never retrieved") if that shared lookup ends up failing - Fixed

4 When the last CosmosClient for an endpoint shuts down, background partition lookups that were still running were left dangling - Fixed

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

if existing_entry is not None:
if not existing_entry.task.done():
# Join in-flight fetch and register joiner hook if present.
joiner_hook = fetch_kwargs.get("raw_response_hook")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Concurrency: Joiner's timeout semantics silently discarded during coalescing

When a caller joins an existing in-flight fetch, only raw_response_hook is extracted from its kwargs — all other kwargs (including _honor_customer_timeout, timeout, read_timeout) are silently discarded. The originator's kwargs drive the shared fetch:

joiner_hook = fetch_kwargs.get("raw_response_hook")
if joiner_hook is not None:
    existing_entry.joined_hooks.append(joiner_hook)
return existing_entry.task

Concrete scenario: A normal query starts a cold-cache fetch (timeout stripped by default). read_feed_ranges(timeout=5, _honor_customer_timeout=True) arrives as a joiner — its timeout opt-in is silently dropped and the fetch runs without a customer timeout. The customer observes no timeout enforcement on that call. Conversely, if read_feed_ranges(timeout=5) is the originator, a normal query joiner would be subject to the originator's 5-second HTTP timeout, and if the fetch exceeds 5s the joiner gets a spurious timeout error.

Impact: Very narrow window — requires concurrent cold-cache access from different call types to the same collection. Self-heals on retry (the finally block frees the slot). Data correctness is never affected.

Consider: Either documenting this as a known trade-off in the docstring, or skipping coalescing when _honor_customer_timeout differs between originator and joiner.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documented skipping wont work and make it very complicated with not much benfit

@xinlian12
Copy link
Copy Markdown
Member

Review complete (47:39)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

5 participants