
[GATEWAY V2]: Fix availability strategy flows. #48432

Merged: jeet1995 merged 42 commits into Azure:main from jeet1995:AzCosmos_FixCloneMissingPkDef on Mar 20, 2026


@jeet1995 jeet1995 commented Mar 16, 2026

Problem

1. Availability strategy broken for Gateway V2 (thin client)

Requests that should be hedged to a secondary region instead time out or return errors from the primary region.

Root cause: Availability strategy resolves the target region by looking up a RegionalRoutingContext using only the gateway regional endpoint. However, equals(), hashCode(), and toString() on RegionalRoutingContext depended on both gatewayRegionalEndpoint and thinclientRegionalEndpoint. When thin client is enabled, the stored RegionalRoutingContext entries have thinclientRegionalEndpoint set, but the lookup key constructed during hedging only has gatewayRegionalEndpoint -- so the lookup fails to match any entry. This causes getRegionName() to return just the primary region as a fallback, and hedging targets the same (failing) region twice instead of failing over.

2. NPE in hedged PK-scoped query on Gateway V2

When a partition-key-scoped query is hedged to a secondary region on Gateway V2, the hedged request fails with NullPointerException in ThinClientStoreModel.wrapInHttpRequest because partitionKeyDefinition is null.

Root cause: RxDocumentServiceRequest.clone() copies partitionKeyInternal but not partitionKeyDefinition. The cloned hedged request has hasFeedRangeFilteringBeenApplied = true which prevents downstream re-resolution, so ThinClientStoreModel NPEs when computing EPK bytes.

Fix

RegionalRoutingContext identity fix

equals(), hashCode(), and toString() now only depend on gatewayRegionalEndpoint. Since thinclientRegionalEndpoint is set after construction via a mutable setter, it must not participate in identity -- lookups using just the gateway endpoint now correctly resolve to the stored RegionalRoutingContext.
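
The identity rule above can be illustrated with a minimal sketch. `RoutingContextSketch` and `RegionLookupSketch` are hypothetical stand-ins written for this note, not the SDK's `RegionalRoutingContext` or `LocationCache`; the endpoint URLs and the region map are assumptions chosen only to show why a key built from the gateway endpoint alone must still match a stored entry whose thin client endpoint was set later.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for RegionalRoutingContext; not the SDK class.
final class RoutingContextSketch {
    private final URI gatewayRegionalEndpoint;
    private URI thinclientRegionalEndpoint; // mutable: set after construction

    RoutingContextSketch(URI gatewayRegionalEndpoint) {
        this.gatewayRegionalEndpoint = Objects.requireNonNull(gatewayRegionalEndpoint);
    }

    void setThinclientRegionalEndpoint(URI endpoint) {
        this.thinclientRegionalEndpoint = endpoint;
    }

    URI getThinclientRegionalEndpoint() {
        return thinclientRegionalEndpoint;
    }

    // Identity depends only on the immutable gateway endpoint.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RoutingContextSketch)) {
            return false;
        }
        return gatewayRegionalEndpoint.equals(((RoutingContextSketch) o).gatewayRegionalEndpoint);
    }

    @Override
    public int hashCode() {
        return gatewayRegionalEndpoint.hashCode();
    }

    @Override
    public String toString() {
        return gatewayRegionalEndpoint.toString();
    }
}

final class RegionLookupSketch {
    // Stores an entry, mutates it, then looks it up with a gateway-only key.
    static String lookupRegion() {
        Map<RoutingContextSketch, String> regionByContext = new HashMap<>();
        RoutingContextSketch stored =
            new RoutingContextSketch(URI.create("https://acct-eastus2.example.com:443/"));
        regionByContext.put(stored, "East US 2");
        stored.setThinclientRegionalEndpoint(URI.create("https://acct-eastus2-thin.example.com:10250/"));

        // The hedging path only knows the gateway endpoint.
        RoutingContextSketch lookupKey =
            new RoutingContextSketch(URI.create("https://acct-eastus2.example.com:443/"));
        return regionByContext.get(lookupKey);
    }
}
```

If `equals()` and `hashCode()` in this sketch also compared `thinclientRegionalEndpoint`, the lookup key would no longer equal the stored entry and `lookupRegion()` would return null, which is the shape of the failure described under Problem 1.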

Request clone fix (defense in depth)

  1. ThinClientStoreModel.performRequestInternal -- resolves partitionKeyDefinition from collection cache when missing.
  2. RxDocumentServiceRequest.clone() -- copies partitionKeyDefinition alongside partitionKeyInternal.
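
The clone change in item 2 can be sketched as follows. `RequestSketch` is a hypothetical, heavily simplified request type (the partition key value is a `String` and the definition is a list of paths); it is not `RxDocumentServiceRequest`, and the method is named `copy()` here to avoid implying the real `clone()` signature.

```java
import java.util.List;

// Hypothetical, simplified request; the real request type carries many more fields.
final class RequestSketch {
    String partitionKeyInternal;             // partition key value (simplified)
    List<String> partitionKeyDefinition;     // partition key paths (simplified schema)
    boolean hasFeedRangeFilteringBeenApplied;

    RequestSketch copy() {
        RequestSketch copied = new RequestSketch();
        copied.partitionKeyInternal = this.partitionKeyInternal;
        // The field the hedged request was missing: without this line the copy
        // has a PK value but no PK schema to compute the EPK from.
        copied.partitionKeyDefinition = this.partitionKeyDefinition;
        copied.hasFeedRangeFilteringBeenApplied = this.hasFeedRangeFilteringBeenApplied;
        return copied;
    }
}
```

Because `hasFeedRangeFilteringBeenApplied` is carried over as `true`, nothing downstream would repopulate the definition in this sketch, so the copy has to carry it.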

Test coverage for thin client availability strategy flows

  • PPAF tests (PerPartitionAutomaticFailoverE2ETests) -- added fi-thinclient-multi-region group, DIRECT skip when thin client + HTTP/2 enabled, thinProxy to gatewayProxy replacement for HttpClient mock compatibility.
  • Circuit breaker tests (PerPartitionCircuitBreakerE2ETests) -- added fi-thinclient-multi-master group, DIRECT skip, assertThinClientEndpointUsed on success responses.
  • FI availability strategy tests -- wired into thin client CI groups, added thin client endpoint validation, increased e2e timeout by 500ms for thin client to account for RNTBD proxy cache lookup overhead.
  • Added assertThinClientEndpointUsed shared utility in TestSuiteBase.
  • PK-scoped query hedging regression test (FaultInjectionServerErrorRuleOnGatewayV2Tests).

Failover Regression Test (DR Drill)

Date: 2026-03-20 02:49-04:02 UTC | Account: thin-client-multi-region-ci (Single Writer, Session consistency, Auto-failover) | Regions: East US 2 (Write), South Central US (Read) | Branch: AzCosmos_FixCloneMissingPkDef @ 461303cfb69

All 8 workloads (Direct + Gateway, Write + Read, 2 rounds) completed with zero user-visible errors. Both rounds show clean failover and full write restore.

Workloads

Each workload is uniquely identifiable in Kusto via its UserAgent suffix.

| User Agent | Mode | Operation | Round |
|---|---|---|---|
| dr-direct-write | Direct | WriteThroughput | 1: Switch |
| dr-direct-read | Direct | ReadThroughput | 1: Switch |
| dr-gw-write | Gateway | WriteThroughput | 1: Switch |
| dr-gw-read | Gateway | ReadThroughput | 1: Switch |
| dr-off-direct-write | Direct | WriteThroughput | 2: Offline |
| dr-off-direct-read | Direct | ReadThroughput | 2: Offline |
| dr-off-gw-write | Gateway | WriteThroughput | 2: Offline |
| dr-off-gw-read | Gateway | ReadThroughput | 2: Offline |

MgmtDatabaseAccountTrace -- Write Region Transitions

Clean transitions with no write region bounce:

| Time (UTC) | East US 2 | South Central US | Event |
|---|---|---|---|
| 02:55:37 | ReadLocation | WriteReadLocation | R1: Failover -- SCUS is write |
| 03:12:19 | WriteReadLocation | ReadLocation | R1: Restore -- EUS2 is write |
| 03:23:59 | Offline | WriteReadLocation | R2: EUS2 offline -- SCUS takes over |
| 03:54:35 | ReadLocation (Online) | WriteReadLocation | R2: EUS2 re-added online |
| 03:56:03 | WriteReadLocation | ReadLocation | R2: Restore -- EUS2 is write |

Round 1: Switch Write Region

  • 02:49:51 -- R1 benchmark start (4 workloads)
  • 02:54:52 -- az cosmosdb failover-priority-change triggered
  • 02:55:57 -- Failover completed (65s)
  • 03:11:31 -- Restore triggered
  • 03:12:36 -- Restore completed (65s)
  • 03:18:21 -- R1 benchmark stop

Write workloads -- both Direct and Gateway writes shifted cleanly to South Central US within one 5-min bucket:

[Chart: Round 1 Writes]

Read workloads -- reads stayed entirely on East US 2 (preferred region unaffected by write region change):

[Chart: Round 1 Reads]


Round 2: Offline Write Region

  • 03:18:23 -- R2 benchmark start (4 workloads)
  • 03:23:27 -- POST offlineRegion triggered
  • 03:23:59 -- EUS2 confirmed offline (32s)
  • 03:39:33 -- 3-step restore started (remove + re-add + failover)
  • 03:56:20 -- Restore completed
  • 04:01:53 -- R2 benchmark stop

All workloads -- reads and writes moved to SCUS when EUS2 went offline. After restore at 03:56, writes and reads returned to EUS2:

[Chart: Round 2 All Workloads]


Error Analysis

Error Codes by Time Window

Backend success rates (Direct mode only -- Gateway retries internally before reaching BackendEndRequest5M):

| Workload | Total Requests | Errors | Success Rate |
|---|---|---|---|
| dr-direct-read | 1,134,910 | 34,857 | 96.9% |
| dr-direct-write | 450,298 | 72 | 99.98% |
| dr-off-direct-read | 755,465 | 34,793 | 95.4% |
| dr-off-direct-write | 634,619 | 69 | 99.99% |

Important: The 404/1002 (ReadSessionNotAvailable) errors on reads are steady-state behavior for Session consistency -- they appear consistently across ALL time windows, not just during failover. The SDK retries these automatically on another replica. The application-level success rate was 100% (zero user-visible failures).
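
For anyone re-deriving the Success Rate column, the arithmetic is (total - errors) / total. The helper below is a hypothetical sketch written for this note and is not part of the test tooling; it takes the request and error counts from the table as inputs.

```java
// Hypothetical helper for re-checking the backend success-rate column.
final class SuccessRateSketch {
    // Returns the success rate as a percentage in the range 0..100.
    static double successRatePercent(long totalRequests, long errors) {
        if (totalRequests <= 0) {
            throw new IllegalArgumentException("totalRequests must be positive");
        }
        return 100.0 * (totalRequests - errors) / totalRequests;
    }
}
```

Calling `SuccessRateSketch.successRatePercent(1_134_910L, 34_857L)` gives a value that rounds to the 96.9% shown for dr-direct-read. These are backend (per-replica) rates; the application-level rate stays at 100% because the SDK retries the errors.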

Exact error counts:

| StatusCode/Sub | Workload | Round | Time (UTC) | Count | Explanation |
|---|---|---|---|---|---|
| 404/1002 | dr-direct-read | 1 | 02:50 | 14,952 | Steady-state ReadSessionNotAvailable. Session token from write replica not yet on read replica. Auto-retried. |
| 404/1002 | dr-direct-read | 1 | 02:55 | 997 | Same steady-state. |
| 403/3 | dr-direct-write | 1 | 02:55 | 32 | Write to read-only region during cache refresh. Auto-retried. |
| 404/1002 | dr-direct-read | 1 | 03:10 | 8,547 | Spike during restore -- replicas rebalancing. |
| 403/3 | dr-direct-write | 1 | 03:10 | 40 | Write to read-only region during restore. Auto-retried. |
| 404/1002 | dr-direct-read | 1 | 03:15 | 10,358 | Steady-state resumes. |
| 404/1002 | dr-off-direct-read | 2 | 03:15 | 4,610 | Steady-state on EUS2 before offline. |
| 404/1002 | dr-off-direct-read | 2 | 03:20 | 19,624 | Spike during offline transition. |
| 429/3200 | dr-off-direct-read | 2 | 03:20 | 37 | Standard throughput throttle. Auto-retried. |
| 403/3 | dr-off-direct-write | 2 | 03:25 | 5 | Brief write to read-only during transition. |
| 403/3 | dr-off-direct-write | 2 | 03:50 | 34 | Write during restore transition. |
| 404/1002 | dr-off-direct-read | 2 | 03:55 | 4,724 | Session tokens catching up after EUS2 re-added. |
| 403/3 | dr-off-direct-write | 2 | 03:55 | 30 | Write during failover-priority-change. |
| 404/1002 | dr-off-direct-read | 2 | 04:00 | 5,794 | Session tokens settling. |

Errors only appear on Direct mode workloads in BackendEndRequest5M because Direct mode exposes per-replica-level errors. Gateway mode retries internally before reaching BackendEndRequest5M.


Thin Client (Gateway V2) Failover Test

Account: thin-client-failover-test (fe50, IsThinClientEnabled=True) | Date: 2026-03-20 18:00-18:20 UTC | JVM flags: -DCOSMOS.THINCLIENT_ENABLED=true -DCOSMOS.HTTP2_ENABLED=true

Ran thin client workloads (dr-tc-write, dr-tc-read) while performing switch-write-region (EUS2 to SCUS) followed by offline EUS2 (read region).

[Chart: Thin Client Failover]

ThinClientProxyRequest5M -- region distribution:

| Time (UTC) | EUS2 Creates | SCUS Creates | EUS2 Reads | SCUS Reads | Event |
|---|---|---|---|---|---|
| 18:00 | 72,782 | -- | 120,238 | -- | Baseline (EUS2 write) |
| 18:05 | 55,686 | 15,075 | 125,211 | 4 | Failover at 18:07 (mixed bucket) |
| 18:10 | -- | 36,023 | 149,245 | -- | Writes fully on SCUS, reads on EUS2 |
| 18:15 | -- | 32,115 | 34,516 | 16,168 | EUS2 offlined at 18:14 -- reads shifting to SCUS |

Thin client errors (ThinClientProxyRequest5M):

| Time (UTC) | Region | StatusCode/Sub | Count | Explanation |
|---|---|---|---|---|
| 18:05 | East US 2 | 403/3 | 28 | TC write to read-only region during cache refresh. Auto-retried. |
| 18:05 | South Central US | 403/3 | 8 | Brief TC write during transition. Auto-retried. |
| 18:05 | East US 2 | 404/1002 | 6 | ReadSessionNotAvailable during failover. Auto-retried. |

Total: 42 errors out of ~700K requests. Zero user-visible failures.

Verdict: Thin client (Gateway V2) failover works correctly. The RegionalRoutingContext identity fix enables proper region routing for thin client -- writes shift to SCUS within one 5-min bucket, reads stay on preferred region (EUS2) until that region goes offline.

Kusto query (ThinClientProxyRequest5M)
ThinClientProxyRequest5M
| where TIMESTAMP between (datetime(2026-03-20 18:00) .. datetime(2026-03-20 18:25))
| where globalDatabaseAccountName == 'thin-client-failover-test'
| where userAgent has 'dr-tc'
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), region, operationType, statusCode
| order by TIMESTAMP asc, operationType asc

Note: ThinClientProxyRequest5M uses lowercase column names (globalDatabaseAccountName, region, operationType, statusCode, userAgent).

---

Verdict

| Scenario | Direct Write | Direct Read | Gateway Write | Gateway Read | TC Write | TC Read | Result |
|---|---|---|---|---|---|---|---|
| Switch Write Region | Failover < 5m | No disruption | Failover < 5m | No disruption | Failover < 5m | No disruption | PASS |
| Offline Write Region | Failover < 5m | Failover < 5m | Failover < 5m | Failover < 5m | -- | -- | PASS |
| Offline Read Region | -- | -- | -- | -- | No disruption | Failover < 5m | PASS |
| Restore (Round 1) | < 5m | No disruption | < 5m | No disruption | -- | -- | PASS |
| Restore (Round 2) | Writes to EUS2 < 5m | Reads to EUS2 < 5m | Writes to EUS2 < 5m | Reads to EUS2 < 5m | -- | -- | PASS |

TC (Thin Client / Gateway V2) tested on separate account (thin-client-failover-test, fe50) with -DCOSMOS.THINCLIENT_ENABLED=true. Switch Write Region + Offline Read Region tested. Direct/Gateway tested on thin-client-multi-region-ci with full switch + offline write + restore cycle.

No regression from the RegionalRoutingContext identity fix or RxDocumentServiceRequest.clone() fix. All three connection modes (Direct, Gateway, Thin Client) handled failover correctly with zero user-visible errors. Clean MgmtDatabaseAccountTrace confirms no write region bounce.

Kusto verification queries (cluster: cdbsupport.kusto.windows.net, database: Support)

Direct mode region distribution (BackendEndRequest5M):

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccountName == 'thin-client-multi-region-ci'
| where UserAgent has 'dr-direct' or UserAgent has 'dr-off-direct'
| where ResourceType == 2
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, Workload
| extend Series = strcat(Workload, " [", Region, "]")
| project TIMESTAMP, Series, Requests
| render timechart
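
To sanity-check what the `extract` pattern in the query above captures, the same expression can be tried offline. The sketch below uses `java.util.regex`; it assumes the pattern behaves the same way in both engines (it only uses character classes and greedy `+`), and the user agent strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical offline check of the workload-suffix pattern used in the queries.
final class WorkloadExtractSketch {
    private static final Pattern WORKLOAD = Pattern.compile("(dr-[a-z-]+-[a-z]+)");

    // Returns the first captured workload name, or an empty string when nothing matches.
    static String extractWorkload(String userAgent) {
        Matcher matcher = WORKLOAD.matcher(userAgent);
        return matcher.find() ? matcher.group(1) : "";
    }
}
```

With a made-up user agent such as `"azsdk-java-cosmos dr-off-direct-write"`, the greedy `[a-z-]+` backtracks just far enough to leave `-write` for the final group, so the whole suffix `dr-off-direct-write` is captured and the Round 1 and Round 2 workloads stay distinct in the `Workload` column.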

Gateway mode region distribution (ComputeRequest5M):

ComputeRequest5M
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccountName == 'thin-client-multi-region-ci'
| where UserAgent has 'dr-gw' or UserAgent has 'dr-off-gw'
| where OperationName in ('Create', 'Read')
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, Workload
| extend Series = strcat(Workload, " [", Region, "]")
| project TIMESTAMP, Series, Requests
| render timechart

Error breakdown:

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccountName == 'thin-client-multi-region-ci'
| where UserAgent has 'dr-' and StatusCode >= 400
| where ResourceType == 2
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize ErrorCount=sum(SampleCount) by bin(TIMESTAMP, 5m), StatusCode, SubStatusCode, Workload
| order by TIMESTAMP asc

MgmtDatabaseAccountTrace (control plane verification):

MgmtDatabaseAccountTrace
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccount == 'thin-client-multi-region-ci'
| project TIMESTAMP, Location, LocationType, FederationId, Status
| order by TIMESTAMP asc

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jeet1995 jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch 2 times, most recently from a2c92e2 to e8efae4 Compare March 16, 2026 22:59
@jeet1995 jeet1995 changed the title [GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for h… [GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries. Mar 16, 2026
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch from e8efae4 to fbf0b6f Compare March 17, 2026 00:02
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…edged PK-scoped queries

RxDocumentServiceRequest.clone() (used by hedging in executeFeedOperationWithAvailabilityStrategy)
was not copying partitionKeyDefinition. When a PK-scoped query is hedged to a secondary region,
the cloned request has partitionKeyInternal (PK value) but null partitionKeyDefinition (PK schema).
ThinClientStoreModel.wrapInHttpRequest needs both for client-side EPK computation, causing NPE:

  Cannot invoke PartitionKeyDefinition.getPaths() because partitionKeyDefinition is null
  at PartitionKeyInternalHelper.getEffectivePartitionKeyBytes
  at ThinClientStoreModel.wrapInHttpRequest

Root cause: RxDocumentServiceRequest.clone() did not copy partitionKeyDefinition.
This only manifests in GW V2 (thin client) because GW V1 never reads partitionKeyDefinition
from the request - it serializes the PK as a JSON header and lets the server resolve EPK.
Point operations are unaffected because they clone RequestOptions (not the request) for hedging.

Fix (defense in depth):
- Primary: ThinClientStoreModel overrides performRequestInternal to reactively resolve
  partitionKeyDefinition from the collection cache when missing on the request. Includes
  logger.warn for production observability when the fallback triggers.
- Secondary: RxDocumentServiceRequest.clone() now copies partitionKeyDefinition.

Test coverage:
- Unit test: cloneShouldPreservePartitionKeyDefinition (verified: fails without fix, passes with)
- E2E test: PK-scoped query with fault-injected response delay triggering hedging on GW V2
  (verified: NPE without fix, passes with fix)
- Wired FaultInjectionWithAvailabilityStrategy test suite (FITests_read/write/query/readMany/readAll)
  into fi-thinclient-multi-region and fi-thinclient-multi-master test groups for holistic coverage.
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch from fbf0b6f to 12ada5e Compare March 17, 2026 01:01
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment


LGTM

…ness

TenantWorkloadConfig.getConsistencyLevel() called valueOf(toUpperCase()) which
fails for wire-format values like 'BoundedStaleness' -> 'BOUNDEDSTALENESS'
(enum name is BOUNDED_STALENESS). Now falls back to matching the display name.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
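
The parsing fallback described in this commit message can be sketched as below. `ConsistencyLevelSketch` is a hypothetical enum written for this note; its constants and display names are assumptions modeled on the 'BoundedStaleness' / BOUNDED_STALENESS example above, and the method is not the actual `TenantWorkloadConfig.getConsistencyLevel()` implementation.

```java
import java.util.Locale;

// Hypothetical enum with both an enum name and a wire-format display name.
enum ConsistencyLevelSketch {
    STRONG("Strong"),
    BOUNDED_STALENESS("BoundedStaleness"),
    SESSION("Session"),
    EVENTUAL("Eventual"),
    CONSISTENT_PREFIX("ConsistentPrefix");

    private final String displayName;

    ConsistencyLevelSketch(String displayName) {
        this.displayName = displayName;
    }

    // Tries the enum name first, then falls back to the display name.
    static ConsistencyLevelSketch parse(String raw) {
        String trimmed = raw.trim();
        try {
            // "BoundedStaleness" becomes "BOUNDEDSTALENESS", which is not an enum name.
            return valueOf(trimmed.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException notAnEnumName) {
            for (ConsistencyLevelSketch level : values()) {
                if (level.displayName.equalsIgnoreCase(trimmed)) {
                    return level;
                }
            }
            throw new IllegalArgumentException("Unknown consistency level: " + raw, notAnEnumName);
        }
    }
}
```

In this sketch `parse("BoundedStaleness")` and `parse("BOUNDED_STALENESS")` both resolve to `BOUNDED_STALENESS`, while an unrecognized value still fails with an `IllegalArgumentException` rather than being silently defaulted.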
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

FITests_queryAfterCreation test configs use ONE_SECOND_DURATION timeouts
designed for DIRECT mode. Thin client forces GATEWAY mode which has higher
latency through the proxy, causing 408 timeouts. Skip DIRECT configs when
thin client is enabled until proper GATEWAY variants are added.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

/azp run java - cosmos - tests

@jeet1995
Member Author

/azp run java - cosmos - ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


jeet1995 and others added 2 commits March 17, 2026 08:03
…strategy tests running

FI tests with availability strategy (hedging) must still run under thin client since
query/readAll hedging is core functionality being validated. Only skip DIRECT mode test
configs that have no availability strategy (baseline AllGood tests with tight timeouts).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction modes

Availability strategy (hedging) is connection-mode agnostic. Tests should run under
thin client without skipping, regardless of whether the config specifies DIRECT or GATEWAY.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 3 commits March 19, 2026 18:19
…test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member

@sdkReviewAgent

@xinlian12
Member

sdkReviewAgent | Status: ⏳ Queued

Review requested by @xinlian12. I'll start shortly.

@xinlian12
Member

sdkReviewAgent | Status: 🔍 Reviewing

I'm reviewing this PR now. I'll post my findings as comments when done.

jeet1995 and others added 8 commits March 19, 2026 18:53
…e region bounce

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When resolveCollectionAsync returns null, throw NullPointerException with a
descriptive message instead of logging a warning and falling through.
There is no recovery path -- proceeding would NPE in
getEffectivePartitionKeyBytes with a cryptic stack trace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
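
The fail-fast behavior described in this commit message can be sketched with `Objects.requireNonNull`. `PkDefinitionGuardSketch` and `CollectionSketch` are hypothetical names for this note, the message wording is an assumption, and the real code resolves the collection asynchronously, which this sketch leaves out.

```java
import java.util.Objects;

// Hypothetical guard showing a descriptive NPE instead of a later cryptic one.
final class PkDefinitionGuardSketch {
    // Hypothetical stand-in for the value resolved from the collection cache.
    static final class CollectionSketch {
        final String resourceId;

        CollectionSketch(String resourceId) {
            this.resourceId = resourceId;
        }
    }

    // Throws immediately, naming the collection, when resolution produced nothing.
    static CollectionSketch requireResolved(CollectionSketch resolved, String collectionLink) {
        return Objects.requireNonNull(
            resolved,
            () -> "Could not resolve collection '" + collectionLink
                + "' to obtain partitionKeyDefinition");
    }
}
```

The exception type is the same as what would surface later from EPK computation, but the message points at the unresolved collection instead of at `getPaths()` on a null definition.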
@xinlian12
Member

@sdkReviewAgent-2

Member

@xinlian12 xinlian12 left a comment


LGTM, thanks

@xinlian12
Member

PR Review Agent — Starting review...

jeet1995 and others added 3 commits March 20, 2026 14:45
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ion bug

CaseInsensitiveMap.convertKey() uses toString() as the map key, not
equals()/hashCode(). The richer toString() that included thinclientEndpoint
caused keys to change after setThinclientRegionalEndpoint was called
post-insertion, breaking map lookups in LocationCache.

Revert to gateway-only toString(). Add unit test proving toString() is
stable before and after setThinclientRegionalEndpoint.

Found by: xinlian12

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
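
The key-stability problem in this commit message can be reproduced with a small stand-in. `ToStringKeyedMapSketch` is a hypothetical map that, like the `convertKey()` behavior described above, derives its key from `toString()`; `EndpointKeySketch` is a hypothetical key whose `toString()` can optionally include a field that is set after insertion. Neither is the SDK's or the library's actual class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical map that keys entries by the lower-cased toString() of the key object.
final class ToStringKeyedMapSketch<V> {
    private final Map<String, V> entries = new HashMap<>();

    private static String convertKey(Object key) {
        return key.toString().toLowerCase(Locale.ROOT);
    }

    void put(Object key, V value) {
        entries.put(convertKey(key), value);
    }

    V get(Object key) {
        return entries.get(convertKey(key));
    }
}

// Hypothetical key with a field that is assigned after the key was inserted.
final class EndpointKeySketch {
    private final String gatewayEndpoint;
    private final boolean includeThinclientInToString;
    private String thinclientEndpoint;

    EndpointKeySketch(String gatewayEndpoint, boolean includeThinclientInToString) {
        this.gatewayEndpoint = gatewayEndpoint;
        this.includeThinclientInToString = includeThinclientInToString;
    }

    void setThinclientEndpoint(String thinclientEndpoint) {
        this.thinclientEndpoint = thinclientEndpoint;
    }

    @Override
    public String toString() {
        // The "richer" form changes once the setter runs; the gateway-only form does not.
        return includeThinclientInToString
            ? gatewayEndpoint + "|" + thinclientEndpoint
            : gatewayEndpoint;
    }
}
```

With `includeThinclientInToString` set to true, an entry inserted before `setThinclientEndpoint` is called can no longer be found afterwards, even with the very same key object. With it set to false the lookup keeps working, which is the property the added unit test is described as checking.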
Member

@xinlian12 xinlian12 left a comment


LGTM, thanks

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 enabled auto-merge (squash) March 20, 2026 21:45
@jeet1995
Member Author

/check-enforcer override

@jeet1995 jeet1995 merged commit e3c9c34 into Azure:main Mar 20, 2026
88 of 92 checks passed
4 participants