[GATEWAY V2]: Fix availability strategy flows. #48432
Merged
jeet1995 merged 42 commits into Azure:main on Mar 20, 2026
a2c92e2 to e8efae4
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
e8efae4 to fbf0b6f
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…edged PK-scoped queries
RxDocumentServiceRequest.clone() (used by hedging in executeFeedOperationWithAvailabilityStrategy) was not copying partitionKeyDefinition. When a PK-scoped query is hedged to a secondary region, the cloned request has partitionKeyInternal (PK value) but null partitionKeyDefinition (PK schema). ThinClientStoreModel.wrapInHttpRequest needs both for client-side EPK computation, causing an NPE:
Cannot invoke PartitionKeyDefinition.getPaths() because partitionKeyDefinition is null
at PartitionKeyInternalHelper.getEffectivePartitionKeyBytes
at ThinClientStoreModel.wrapInHttpRequest
Root cause: RxDocumentServiceRequest.clone() did not copy partitionKeyDefinition. This only manifests in GW V2 (thin client) because GW V1 never reads partitionKeyDefinition from the request - it serializes the PK as a JSON header and lets the server resolve EPK. Point operations are unaffected because they clone RequestOptions (not the request) for hedging.
Fix (defense in depth):
- Primary: ThinClientStoreModel overrides performRequestInternal to reactively resolve partitionKeyDefinition from the collection cache when missing on the request. Includes logger.warn for production observability when the fallback triggers.
- Secondary: RxDocumentServiceRequest.clone() now copies partitionKeyDefinition.
Test coverage:
- Unit test: cloneShouldPreservePartitionKeyDefinition (verified: fails without fix, passes with fix)
- E2E test: PK-scoped query with fault-injected response delay triggering hedging on GW V2 (verified: NPE without fix, passes with fix)
- Wired FaultInjectionWithAvailabilityStrategy test suite (FITests_read/write/query/readMany/readAll) into fi-thinclient-multi-region and fi-thinclient-multi-master test groups for holistic coverage.
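The failure mode in this commit message can be illustrated with a minimal sketch. The classes below are hypothetical, simplified stand-ins (CloneSketch, Request, PkDefinition are not SDK types); they only show how a copy that carries the partition key value but drops the partition key schema leads to an NPE in a consumer that needs both, and how copying the extra field avoids it.

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; not the SDK's real classes.
final class CloneSketch {
    static final class PkDefinition {
        final List<String> paths;
        PkDefinition(List<String> paths) { this.paths = paths; }
    }

    static final class Request {
        String partitionKeyValue;   // plays the role of partitionKeyInternal
        PkDefinition pkDefinition;  // plays the role of partitionKeyDefinition

        // Copy that forgets the schema, as the pre-fix clone() is described to do.
        Request cloneWithoutDefinition() {
            Request copy = new Request();
            copy.partitionKeyValue = this.partitionKeyValue;
            return copy;
        }

        // Copy that also carries the schema, mirroring the described fix.
        Request cloneWithDefinition() {
            Request copy = cloneWithoutDefinition();
            copy.pkDefinition = this.pkDefinition;
            return copy;
        }
    }

    // Models a consumer that dereferences the schema (throws NPE if it is null).
    static int pathCount(Request request) {
        return request.pkDefinition.paths.size();
    }

    public static void main(String[] args) {
        Request original = new Request();
        original.partitionKeyValue = "tenant-1";
        original.pkDefinition = new PkDefinition(List.of("/tenantId"));
        System.out.println(pathCount(original.cloneWithDefinition()));
    }
}
```

Calling pathCount on the result of cloneWithoutDefinition() throws a NullPointerException, which is the shape of the bug the unit test named above guards against.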
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
fbf0b6f to 12ada5e
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…ness TenantWorkloadConfig.getConsistencyLevel() called valueOf(toUpperCase()) which fails for wire-format values like 'BoundedStaleness' -> 'BOUNDEDSTALENESS' (enum name is BOUNDED_STALENESS). Now falls back to matching the display name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
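A minimal sketch of the fallback described in this commit, under stated assumptions: Level is a hypothetical enum standing in for the SDK's consistency level type, and the display names other than BoundedStaleness are assumed for illustration. The parser first tries the enum constant name and, if that fails, matches the display name case-insensitively.

```java
import java.util.Locale;

// Hypothetical sketch; Level is not the SDK's ConsistencyLevel type.
final class ConsistencyParseSketch {
    enum Level {
        STRONG("Strong"),
        BOUNDED_STALENESS("BoundedStaleness"),
        SESSION("Session"),
        CONSISTENT_PREFIX("ConsistentPrefix"),
        EVENTUAL("Eventual");

        final String displayName;

        Level(String displayName) {
            this.displayName = displayName;
        }
    }

    static Level parse(String raw) {
        try {
            // Works for inputs such as "session" or "BOUNDED_STALENESS".
            return Level.valueOf(raw.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // "BoundedStaleness" upper-cases to "BOUNDEDSTALENESS", which is not
            // a constant name, so fall back to the display name.
            for (Level level : Level.values()) {
                if (level.displayName.equalsIgnoreCase(raw)) {
                    return level;
                }
            }
            throw e; // still unknown: surface the original failure
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("BoundedStaleness"));
    }
}
```

Trying the constant name first keeps existing inputs working unchanged; the display-name pass only runs for values that previously failed.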
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
FITests_queryAfterCreation test configs use ONE_SECOND_DURATION timeouts designed for DIRECT mode. Thin client forces GATEWAY mode which has higher latency through the proxy, causing 408 timeouts. Skip DIRECT configs when thin client is enabled until proper GATEWAY variants are added. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests
/azp run java - cosmos - ci
Azure Pipelines successfully started running 1 pipeline(s).
Azure Pipelines successfully started running 1 pipeline(s).
…strategy tests running FI tests with availability strategy (hedging) must still run under thin client since query/readAll hedging is core functionality being validated. Only skip DIRECT mode test configs that have no availability strategy (baseline AllGood tests with tight timeouts). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction modes Availability strategy (hedging) is connection-mode agnostic. Tests should run under thin client without skipping, regardless of whether the config specifies DIRECT or GATEWAY. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…test Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ression test" This reverts commit da8ea88.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sdkReviewAgent
sdkReviewAgent | Status: ⏳ Queued. Review requested by @xinlian12. I'll start shortly.
sdkReviewAgent | Status: 🔍 Reviewing. I'm reviewing this PR now. I'll post my findings as comments when done.
…e region bounce Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When resolveCollectionAsync returns null, throw NullPointerException with a descriptive message instead of logging a warning and falling through. There is no recovery path -- proceeding would NPE in getEffectivePartitionKeyBytes with a cryptic stack trace. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
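The fail-fast choice described here can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names (FailFastSketch, resolvePkPath, and a plain Map standing in for the collection cache); the real code path is reactive and is not reproduced here. It uses java.util.Objects.requireNonNull, which throws a NullPointerException carrying the supplied message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; the Map stands in for a collection cache lookup.
final class FailFastSketch {
    static String resolvePkPath(Map<String, String> cache, String collection) {
        String path = cache.get(collection); // null when the collection is unknown
        // Fail at the point of resolution with a descriptive message instead of
        // letting a later dereference produce a cryptic stack trace.
        return Objects.requireNonNull(
            path,
            "partition key definition could not be resolved for collection '" + collection + "'");
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("orders", "/customerId");
        System.out.println(resolvePkPath(cache, "orders"));
    }
}
```

The exception type is the same as what the later dereference would have thrown; what changes is where it is raised and what the message says.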
@sdkReviewAgent-2
⏳ PR Review Agent: Starting review...
xinlian12 reviewed Mar 20, 2026: sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java
xinlian12 reviewed Mar 20, 2026: ...ure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionalRoutingContext.java (outdated)
xinlian12 reviewed Mar 20, 2026: sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java
…ion bug CaseInsensitiveMap.convertKey() uses toString() as the map key, not equals()/hashCode(). The richer toString() that included thinclientEndpoint caused keys to change after setThinclientRegionalEndpoint was called post-insertion, breaking map lookups in LocationCache. Revert to gateway-only toString(). Add unit test proving toString() is stable before and after setThinclientRegionalEndpoint. Found by: xinlian12 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
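To see why a mutable field in toString() breaks lookups when a map derives its key from toString(), consider the sketch below. Everything in it is hypothetical: RoutingContext is a stand-in for RegionalRoutingContext, the key() helper only imitates a map that lower-cases toString() to build its key, and the endpoints are made-up example URLs.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the SDK's RegionalRoutingContext or LocationCache.
final class StableKeySketch {
    static final class RoutingContext {
        final String gatewayEndpoint;
        final boolean includeThinClientInToString;
        String thinClientEndpoint; // assigned after construction

        RoutingContext(String gatewayEndpoint, boolean includeThinClientInToString) {
            this.gatewayEndpoint = gatewayEndpoint;
            this.includeThinClientInToString = includeThinClientInToString;
        }

        @Override
        public String toString() {
            return includeThinClientInToString
                ? gatewayEndpoint + "|" + thinClientEndpoint
                : gatewayEndpoint;
        }
    }

    // Imitates a map whose key is derived from toString(), lower-cased.
    static String key(Object value) {
        return value.toString().toLowerCase(Locale.ROOT);
    }

    // Inserts the context, mutates it, then looks it up again.
    static boolean foundAfterMutation(boolean includeThinClientInToString) {
        Map<String, String> regionByContext = new HashMap<>();
        RoutingContext context =
            new RoutingContext("https://acct-eastus2.example", includeThinClientInToString);
        regionByContext.put(key(context), "East US 2");
        context.thinClientEndpoint = "https://acct-eastus2-thin.example";
        return regionByContext.containsKey(key(context));
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation(true));
        System.out.println(foundAfterMutation(false));
    }
}
```

With the mutable field in the string form, the key computed after the setter call no longer matches the key used at insertion; with a gateway-only string form, the key is the same before and after. The PR description applies the same reasoning to equals() and hashCode().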
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
/check-enforcer override
Problem
1. Availability strategy broken for Gateway V2 (thin client)
Availability strategy is broken for Gateway V2 (thin client). Requests that should be hedged to a secondary region instead time out or return errors from the primary region.
Root cause: Availability strategy resolves the target region by looking up a RegionalRoutingContext using only the gateway regional endpoint. However, equals(), hashCode(), and toString() on RegionalRoutingContext depended on both gatewayRegionalEndpoint and thinclientRegionalEndpoint. When thin client is enabled, the stored RegionalRoutingContext entries have thinclientRegionalEndpoint set, but the lookup key constructed during hedging only has gatewayRegionalEndpoint -- so the lookup fails to match any entry. This causes getRegionName() to return just the primary region as a fallback, and hedging targets the same (failing) region twice instead of failing over.
2. NPE in hedged PK-scoped query on Gateway V2
When a partition-key-scoped query is hedged to a secondary region on Gateway V2, the hedged request fails with NullPointerException in ThinClientStoreModel.wrapInHttpRequest because partitionKeyDefinition is null.
Root cause: RxDocumentServiceRequest.clone() copies partitionKeyInternal but not partitionKeyDefinition. The cloned hedged request has hasFeedRangeFilteringBeenApplied = true, which prevents downstream re-resolution, so ThinClientStoreModel NPEs when computing EPK bytes.
Fix
RegionalRoutingContext identity fix
equals(), hashCode(), and toString() now only depend on gatewayRegionalEndpoint. Since thinclientRegionalEndpoint is set after construction via a mutable setter, it must not participate in identity -- lookups using just the gateway endpoint now correctly resolve to the stored RegionalRoutingContext.
Request clone fix (defense in depth)
- ThinClientStoreModel.performRequestInternal -- resolves partitionKeyDefinition from collection cache when missing.
- RxDocumentServiceRequest.clone() -- copies partitionKeyDefinition alongside partitionKeyInternal.
Test coverage for thin client availability strategy flows
- … (PerPartitionAutomaticFailoverE2ETests) -- added fi-thinclient-multi-region group, DIRECT skip when thin client + HTTP/2 enabled, thinProxy to gatewayProxy replacement for HttpClient mock compatibility.
- … (PerPartitionCircuitBreakerE2ETests) -- added fi-thinclient-multi-master group, DIRECT skip, assertThinClientEndpointUsed on success responses.
- assertThinClientEndpointUsed shared utility in TestSuiteBase.
- … (FaultInjectionServerErrorRuleOnGatewayV2Tests).
Failover Regression Test (DR Drill)
Date: 2026-03-20 02:49-04:02 UTC | Account: thin-client-multi-region-ci (Single Writer, Session consistency, Auto-failover) | Regions: East US 2 (Write), South Central US (Read) | Branch: AzCosmos_FixCloneMissingPkDef@461303cfb69
All 8 workloads (Direct + Gateway, Write + Read, 2 rounds) completed with zero user-visible errors. Both rounds show clean failover and full write restore.
Workloads
Each workload is uniquely identifiable in Kusto via its UserAgent suffix: dr-direct-write, dr-direct-read, dr-gw-write, dr-gw-read, dr-off-direct-write, dr-off-direct-read, dr-off-gw-write, dr-off-gw-read.
MgmtDatabaseAccountTrace -- Write Region Transitions
Clean transitions with no write region bounce:
Round 1: Switch Write Region
az cosmosdb failover-priority-change triggered
Write workloads -- both Direct and Gateway writes shifted cleanly to South Central US within one 5-min bucket:
Read workloads -- reads stayed entirely on East US 2 (preferred region unaffected by write region change):
Round 2: Offline Write Region
POST offlineRegion triggered
All workloads -- reads and writes moved to SCUS when EUS2 went offline. After restore at 03:56, writes and reads returned to EUS2:
Error Analysis
Backend success rates (Direct mode only -- Gateway retries internally before reaching BackendEndRequest5M):
[Table: success rates for dr-direct-read, dr-direct-write, dr-off-direct-read, dr-off-direct-write; values not recoverable]
Exact error counts:
[Table: error counts per workload; values not recoverable. One surviving row note: dr-direct-read, ReadSessionNotAvailable. Session token from write replica not yet on read replica. Auto-retried.]
Thin Client (Gateway V2) Failover Test
Account: thin-client-failover-test (fe50, IsThinClientEnabled=True) | Date: 2026-03-20 18:00-18:20 UTC | JVM flags: -DCOSMOS.THINCLIENT_ENABLED=true -DCOSMOS.HTTP2_ENABLED=true
Ran thin client workloads (dr-tc-write, dr-tc-read) while performing switch-write-region (EUS2 to SCUS) followed by offline EUS2 (read region).
ThinClientProxyRequest5M -- region distribution:
Thin client errors (ThinClientProxyRequest5M):
Total: 42 errors out of ~700K requests. Zero user-visible failures.
Verdict: Thin client (Gateway V2) failover works correctly. The RegionalRoutingContext identity fix enables proper region routing for thin client -- writes shift to SCUS within one 5-min bucket, reads stay on preferred region (EUS2) until that region goes offline.
Kusto query (ThinClientProxyRequest5M)
Note: ThinClientProxyRequest5M uses lowercase column names (globalDatabaseAccountName, region, operationType, statusCode, userAgent).
Verdict
No regression from the RegionalRoutingContext identity fix or RxDocumentServiceRequest.clone() fix. All three connection modes (Direct, Gateway, Thin Client) handled failover correctly with zero user-visible errors. Clean MgmtDatabaseAccountTrace confirms no write region bounce.
Kusto verification queries (cluster: cdbsupport.kusto.windows.net, database: Support)
Direct mode region distribution (BackendEndRequest5M):
Gateway mode region distribution (ComputeRequest5M):
Error breakdown:
MgmtDatabaseAccountTrace (control plane verification):
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines