[GATEWAY V2]: Fix availability strategy flows. by jeet1995 · Pull Request #48432 · Azure/azure-sdk-for-java

jeet1995 · 2026-03-16T22:43:29Z

Problem

1. Availability strategy broken for Gateway V2 (thin client)

Availability strategy is broken for Gateway V2 (thin client). Requests that should be hedged to a secondary region instead time out or return errors from the primary region.

Root cause: Availability strategy resolves the target region by looking up a RegionalRoutingContext using only the gateway regional endpoint. However, equals(), hashCode(), and toString() on RegionalRoutingContext depended on both gatewayRegionalEndpoint and thinclientRegionalEndpoint. When thin client is enabled, the stored RegionalRoutingContext entries have thinclientRegionalEndpoint set, but the lookup key constructed during hedging only has gatewayRegionalEndpoint -- so the lookup fails to match any entry. This causes getRegionName() to return just the primary region as a fallback, and hedging targets the same (failing) region twice instead of failing over.

2. NPE in hedged PK-scoped query on Gateway V2

When a partition-key-scoped query is hedged to a secondary region on Gateway V2, the hedged request fails with a NullPointerException in ThinClientStoreModel.wrapInHttpRequest because partitionKeyDefinition is null.

Root cause: RxDocumentServiceRequest.clone() copies partitionKeyInternal but not partitionKeyDefinition. The cloned hedged request has hasFeedRangeFilteringBeenApplied = true which prevents downstream re-resolution, so ThinClientStoreModel NPEs when computing EPK bytes.

Fix

RegionalRoutingContext identity fix

equals(), hashCode(), and toString() now only depend on gatewayRegionalEndpoint. Since thinclientRegionalEndpoint is set after construction via a mutable setter, it must not participate in identity -- lookups using just the gateway endpoint now correctly resolve to the stored RegionalRoutingContext.
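A minimal sketch of the identity rule described above, assuming a simplified stand-in class. `RoutingContextSketch`, `lookupRegion`, and the example.com endpoints are hypothetical and are not the SDK's actual RegionalRoutingContext or its URLs. It only illustrates why a key built from the gateway endpoint alone still matches a stored entry whose thin client endpoint was set later.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified stand-in for RegionalRoutingContext (not the SDK class).
final class RoutingContextSketch {
    private final URI gatewayRegionalEndpoint;
    private URI thinclientRegionalEndpoint; // mutable, set after construction

    RoutingContextSketch(URI gatewayRegionalEndpoint) {
        this.gatewayRegionalEndpoint = Objects.requireNonNull(gatewayRegionalEndpoint);
    }

    void setThinclientRegionalEndpoint(URI endpoint) {
        this.thinclientRegionalEndpoint = endpoint;
    }

    // Identity depends only on the immutable gateway endpoint.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RoutingContextSketch)) {
            return false;
        }
        return gatewayRegionalEndpoint.equals(((RoutingContextSketch) o).gatewayRegionalEndpoint);
    }

    @Override
    public int hashCode() {
        return gatewayRegionalEndpoint.hashCode();
    }

    @Override
    public String toString() {
        return gatewayRegionalEndpoint.toString();
    }

    // Stores an entry, mutates it, then looks it up with a gateway-only key.
    static String lookupRegion() {
        Map<RoutingContextSketch, String> regionByContext = new HashMap<>();
        RoutingContextSketch stored =
            new RoutingContextSketch(URI.create("https://acct-scus.example.com:443/"));
        regionByContext.put(stored, "South Central US");
        stored.setThinclientRegionalEndpoint(URI.create("https://acct-scus-thin.example.com:443/"));

        RoutingContextSketch lookupKey =
            new RoutingContextSketch(URI.create("https://acct-scus.example.com:443/"));
        return regionByContext.get(lookupKey);
    }

    public static void main(String[] args) {
        System.out.println(lookupRegion());
    }
}
```

If the thin client endpoint took part in equals() and hashCode(), the same lookup in this sketch would return null, which corresponds to the fallback-to-primary behavior described in the root cause.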

Request clone fix (defense in depth)

ThinClientStoreModel.performRequestInternal -- resolves partitionKeyDefinition from collection cache when missing.
RxDocumentServiceRequest.clone() -- copies partitionKeyDefinition alongside partitionKeyInternal.
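The two layers above can be sketched roughly as follows. This assumes a heavily simplified request type: `HedgedRequestSketch`, `copyForHedging`, and `resolvePartitionKeyPaths` are hypothetical names, the partition key is reduced to a String, and the partition key definition to a list of paths. The real RxDocumentServiceRequest and collection cache are considerably richer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical, simplified stand-in for RxDocumentServiceRequest (not the SDK class).
final class HedgedRequestSketch {
    String partitionKeyInternal;             // PK value (simplified)
    List<String> partitionKeyPaths;          // PK definition (simplified to its paths)
    boolean hasFeedRangeFilteringBeenApplied;

    // Layer 2: the copy used for hedging carries the definition as well as the value.
    HedgedRequestSketch copyForHedging() {
        HedgedRequestSketch copy = new HedgedRequestSketch();
        copy.partitionKeyInternal = this.partitionKeyInternal;
        copy.partitionKeyPaths = this.partitionKeyPaths;
        copy.hasFeedRangeFilteringBeenApplied = this.hasFeedRangeFilteringBeenApplied;
        return copy;
    }

    // Layer 1: if the definition is still missing, resolve it from a cache lookup.
    static List<String> resolvePartitionKeyPaths(HedgedRequestSketch request,
                                                 Supplier<List<String>> collectionCacheLookup) {
        if (request.partitionKeyPaths == null) {
            request.partitionKeyPaths = collectionCacheLookup.get();
        }
        return request.partitionKeyPaths;
    }

    public static void main(String[] args) {
        HedgedRequestSketch original = new HedgedRequestSketch();
        original.partitionKeyInternal = "tenant-42";
        original.partitionKeyPaths = Arrays.asList("/tenantId");
        original.hasFeedRangeFilteringBeenApplied = true;

        HedgedRequestSketch hedged = original.copyForHedging();
        System.out.println(hedged.partitionKeyPaths);
    }
}
```

Either layer alone would avoid the null definition in this sketch; keeping both means a future caller that builds a request without the definition still has a recovery path.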

Test coverage for thin client availability strategy flows

PPAF tests (PerPartitionAutomaticFailoverE2ETests) -- added fi-thinclient-multi-region group, DIRECT skip when thin client + HTTP/2 enabled, thinProxy to gatewayProxy replacement for HttpClient mock compatibility.
Circuit breaker tests (PerPartitionCircuitBreakerE2ETests) -- added fi-thinclient-multi-master group, DIRECT skip, assertThinClientEndpointUsed on success responses.
FI availability strategy tests -- wired into thin client CI groups, added thin client endpoint validation, increased e2e timeout by 500ms for thin client to account for RNTBD proxy cache lookup overhead.
Added assertThinClientEndpointUsed shared utility in TestSuiteBase.
PK-scoped query hedging regression test (FaultInjectionServerErrorRuleOnGatewayV2Tests).

Failover Regression Test (DR Drill)

Date: 2026-03-20 02:49-04:02 UTC | Account: thin-client-multi-region-ci (Single Writer, Session consistency, Auto-failover) | Regions: East US 2 (Write), South Central US (Read) | Branch: AzCosmos_FixCloneMissingPkDef @ 461303cfb69

All 8 workloads (Direct + Gateway, Write + Read, 2 rounds) completed with zero user-visible errors. Both rounds show clean failover and full write restore.

Workloads

Each workload is uniquely identifiable in Kusto via its UserAgent suffix.

User Agent	Mode	Operation	Round
`dr-direct-write`	Direct	WriteThroughput	1: Switch
`dr-direct-read`	Direct	ReadThroughput	1: Switch
`dr-gw-write`	Gateway	WriteThroughput	1: Switch
`dr-gw-read`	Gateway	ReadThroughput	1: Switch
`dr-off-direct-write`	Direct	WriteThroughput	2: Offline
`dr-off-direct-read`	Direct	ReadThroughput	2: Offline
`dr-off-gw-write`	Gateway	WriteThroughput	2: Offline
`dr-off-gw-read`	Gateway	ReadThroughput	2: Offline

MgmtDatabaseAccountTrace -- Write Region Transitions

Clean transitions with no write region bounce:

Time (UTC)	East US 2	South Central US	Event
02:55:37	ReadLocation	WriteReadLocation	R1: Failover -- SCUS is write
03:12:19	WriteReadLocation	ReadLocation	R1: Restore -- EUS2 is write
03:23:59	Offline	WriteReadLocation	R2: EUS2 offline -- SCUS takes over
03:54:35	ReadLocation (Online)	WriteReadLocation	R2: EUS2 re-added online
03:56:03	WriteReadLocation	ReadLocation	R2: Restore -- EUS2 is write

Round 1: Switch Write Region

02:49:51 -- R1 benchmark start (4 workloads)
02:54:52 -- az cosmosdb failover-priority-change triggered
02:55:57 -- Failover completed (65s)
03:11:31 -- Restore triggered
03:12:36 -- Restore completed (65s)
03:18:21 -- R1 benchmark stop

Write workloads -- both Direct and Gateway writes shifted cleanly to South Central US within one 5-min bucket.

Read workloads -- reads stayed entirely on East US 2 (preferred region unaffected by write region change).

Round 2: Offline Write Region

03:18:23 -- R2 benchmark start (4 workloads)
03:23:27 -- POST offlineRegion triggered
03:23:59 -- EUS2 confirmed offline (32s)
03:39:33 -- 3-step restore started (remove + re-add + failover)
03:56:20 -- Restore completed
04:01:53 -- R2 benchmark stop

All workloads -- reads and writes moved to SCUS when EUS2 went offline. After restore at 03:56, writes and reads returned to EUS2.

Error Analysis

Backend success rates (Direct mode only -- Gateway retries internally before reaching BackendEndRequest5M):

Workload	Total Requests	Errors	Success Rate
`dr-direct-read`	1,134,910	34,857	96.9%
`dr-direct-write`	450,298	72	99.98%
`dr-off-direct-read`	755,465	34,793	95.4%
`dr-off-direct-write`	634,619	69	99.99%
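For reference, the Success Rate column can be reproduced from the Total Requests and Errors columns as 100 * (total - errors) / total. The helper below is a small sketch of that arithmetic; `SuccessRateSketch` and `successRatePercent` are hypothetical names, and the inputs are the first two rows of the table.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the Success Rate column.
final class SuccessRateSketch {
    // Percentage of requests that did not return an error status.
    static double successRatePercent(long totalRequests, long errors) {
        return 100.0 * (totalRequests - errors) / totalRequests;
    }

    public static void main(String[] args) {
        // dr-direct-read: 1,134,910 requests, 34,857 errors
        System.out.println(String.format(Locale.ROOT, "%.2f%%", successRatePercent(1_134_910L, 34_857L)));
        // dr-direct-write: 450,298 requests, 72 errors
        System.out.println(String.format(Locale.ROOT, "%.2f%%", successRatePercent(450_298L, 72L)));
    }
}
```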

Important: The 404/1002 (ReadSessionNotAvailable) errors on reads are steady-state behavior for Session consistency -- they appear consistently across ALL time windows, not just during failover. The SDK retries these automatically on another replica. The application-level success rate was 100% (zero user-visible failures).

Exact error counts:

StatusCode/Sub	Workload	Round	Time (UTC)	Count	Explanation
404/1002	`dr-direct-read`	1	02:50	14,952	Steady-state `ReadSessionNotAvailable`. Session token from write replica not yet on read replica. Auto-retried.
404/1002	`dr-direct-read`	1	02:55	997	Same steady-state.
403/3	`dr-direct-write`	1	02:55	32	Write to read-only region during cache refresh. Auto-retried.
404/1002	`dr-direct-read`	1	03:10	8,547	Spike during restore -- replicas rebalancing.
403/3	`dr-direct-write`	1	03:10	40	Write to read-only region during restore. Auto-retried.
404/1002	`dr-direct-read`	1	03:15	10,358	Steady-state resumes.
404/1002	`dr-off-direct-read`	2	03:15	4,610	Steady-state on EUS2 before offline.
404/1002	`dr-off-direct-read`	2	03:20	19,624	Spike during offline transition.
429/3200	`dr-off-direct-read`	2	03:20	37	Standard throughput throttle. Auto-retried.
403/3	`dr-off-direct-write`	2	03:25	5	Brief write to read-only during transition.
403/3	`dr-off-direct-write`	2	03:50	34	Write during restore transition.
404/1002	`dr-off-direct-read`	2	03:55	4,724	Session tokens catching up after EUS2 re-added.
403/3	`dr-off-direct-write`	2	03:55	30	Write during failover-priority-change.
404/1002	`dr-off-direct-read`	2	04:00	5,794	Session tokens settling.

Errors only appear on Direct mode workloads in BackendEndRequest5M because Direct mode exposes per-replica-level errors. Gateway mode retries internally before reaching BackendEndRequest5M.

Thin Client (Gateway V2) Failover Test

Account: thin-client-failover-test (fe50, IsThinClientEnabled=True) | Date: 2026-03-20 18:00-18:20 UTC | JVM flags: -DCOSMOS.THINCLIENT_ENABLED=true -DCOSMOS.HTTP2_ENABLED=true

Ran thin client workloads (dr-tc-write, dr-tc-read) while performing switch-write-region (EUS2 to SCUS) followed by offline EUS2 (read region).

ThinClientProxyRequest5M -- region distribution:

Time (UTC)	EUS2 Creates	SCUS Creates	EUS2 Reads	SCUS Reads	Event
18:00	72,782	--	120,238	--	Baseline (EUS2 write)
18:05	55,686	15,075	125,211	4	Failover at 18:07 (mixed bucket)
18:10	--	36,023	149,245	--	Writes fully on SCUS, reads on EUS2
18:15	--	32,115	34,516	16,168	EUS2 offlined at 18:14 -- reads shifting to SCUS

Thin client errors (ThinClientProxyRequest5M):

Time (UTC)	Region	StatusCode/Sub	Count	Explanation
18:05	East US 2	403/3	28	TC write to read-only region during cache refresh. Auto-retried.
18:05	South Central US	403/3	8	Brief TC write during transition. Auto-retried.
18:05	East US 2	404/1002	6	ReadSessionNotAvailable during failover. Auto-retried.

Total: 42 errors out of ~700K requests. Zero user-visible failures.

Verdict: Thin client (Gateway V2) failover works correctly. The RegionalRoutingContext identity fix enables proper region routing for thin client -- writes shift to SCUS within one 5-min bucket, reads stay on preferred region (EUS2) until that region goes offline.

Kusto query (ThinClientProxyRequest5M)

ThinClientProxyRequest5M
| where TIMESTAMP between (datetime(2026-03-20 18:00) .. datetime(2026-03-20 18:25))
| where globalDatabaseAccountName == 'thin-client-failover-test'
| where userAgent has 'dr-tc'
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), region, operationType, statusCode
| order by TIMESTAMP asc, operationType asc

Note: ThinClientProxyRequest5M uses lowercase column names (globalDatabaseAccountName, region, operationType, statusCode, userAgent).

---

Verdict

Scenario	Direct Write	Direct Read	Gateway Write	Gateway Read	TC Write	TC Read	Result
Switch Write Region	Failover < 5m	No disruption	Failover < 5m	No disruption	Failover < 5m	No disruption	PASS
Offline Write Region	Failover < 5m	Failover < 5m	Failover < 5m	Failover < 5m	--	--	PASS
Offline Read Region	--	--	--	--	No disruption	Failover < 5m	PASS
Restore (Round 1)	< 5m	No disruption	< 5m	No disruption	--	--	PASS
Restore (Round 2)	Writes to EUS2 < 5m	Reads to EUS2 < 5m	Writes to EUS2 < 5m	Reads to EUS2 < 5m	--	--	PASS

TC (Thin Client / Gateway V2) tested on separate account (thin-client-failover-test, fe50) with -DCOSMOS.THINCLIENT_ENABLED=true. Switch Write Region + Offline Read Region tested. Direct/Gateway tested on thin-client-multi-region-ci with full switch + offline write + restore cycle.

No regression from the RegionalRoutingContext identity fix or RxDocumentServiceRequest.clone() fix. All three connection modes (Direct, Gateway, Thin Client) handled failover correctly with zero user-visible errors. Clean MgmtDatabaseAccountTrace confirms no write region bounce.

Kusto verification queries (cluster: cdbsupport.kusto.windows.net, database: Support)

Direct mode region distribution (BackendEndRequest5M):

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccountName == 'thin-client-multi-region-ci'
| where UserAgent has 'dr-direct' or UserAgent has 'dr-off-direct'
| where ResourceType == 2
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, Workload
| extend Series = strcat(Workload, " [", Region, "]")
| project TIMESTAMP, Series, Requests
| render timechart

Gateway mode region distribution (ComputeRequest5M):

ComputeRequest5M
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccountName == 'thin-client-multi-region-ci'
| where UserAgent has 'dr-gw' or UserAgent has 'dr-off-gw'
| where OperationName in ('Create', 'Read')
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, Workload
| extend Series = strcat(Workload, " [", Region, "]")
| project TIMESTAMP, Series, Requests
| render timechart

Error breakdown:

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccountName == 'thin-client-multi-region-ci'
| where UserAgent has 'dr-' and StatusCode >= 400
| where ResourceType == 2
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize ErrorCount=sum(SampleCount) by bin(TIMESTAMP, 5m), StatusCode, SubStatusCode, Workload
| order by TIMESTAMP asc

MgmtDatabaseAccountTrace (control plane verification):

MgmtDatabaseAccountTrace
| where TIMESTAMP between (datetime(2026-03-20 02:45) .. datetime(2026-03-20 04:05))
| where GlobalDatabaseAccount == 'thin-client-multi-region-ci'
| project TIMESTAMP, Location, LocationType, FederationId, Status
| order by TIMESTAMP asc

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message.

Testing Guidelines

Pull request includes test coverage for the included changes.

jeet1995 · 2026-03-16T23:10:04Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-16T23:10:31Z

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 · 2026-03-17T00:03:01Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-17T00:03:33Z

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 · 2026-03-17T00:08:10Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-17T00:08:32Z

Azure Pipelines successfully started running 1 pipeline(s).

…edged PK-scoped queries

RxDocumentServiceRequest.clone() (used by hedging in executeFeedOperationWithAvailabilityStrategy) was not copying partitionKeyDefinition. When a PK-scoped query is hedged to a secondary region, the cloned request has partitionKeyInternal (PK value) but null partitionKeyDefinition (PK schema). ThinClientStoreModel.wrapInHttpRequest needs both for client-side EPK computation, causing NPE:

Cannot invoke PartitionKeyDefinition.getPaths() because partitionKeyDefinition is null
    at PartitionKeyInternalHelper.getEffectivePartitionKeyBytes
    at ThinClientStoreModel.wrapInHttpRequest

Root cause: RxDocumentServiceRequest.clone() did not copy partitionKeyDefinition. This only manifests in GW V2 (thin client) because GW V1 never reads partitionKeyDefinition from the request - it serializes the PK as a JSON header and lets the server resolve EPK. Point operations are unaffected because they clone RequestOptions (not the request) for hedging.

Fix (defense in depth):
- Primary: ThinClientStoreModel overrides performRequestInternal to reactively resolve partitionKeyDefinition from the collection cache when missing on the request. Includes logger.warn for production observability when the fallback triggers.
- Secondary: RxDocumentServiceRequest.clone() now copies partitionKeyDefinition.

Test coverage:
- Unit test: cloneShouldPreservePartitionKeyDefinition (verified: fails without fix, passes with)
- E2E test: PK-scoped query with fault-injected response delay triggering hedging on GW V2 (verified: NPE without fix, passes with fix)
- Wired FaultInjectionWithAvailabilityStrategy test suite (FITests_read/write/query/readMany/readAll) into fi-thinclient-multi-region and fi-thinclient-multi-master test groups for holistic coverage.

jeet1995 · 2026-03-17T00:29:00Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-17T00:29:25Z

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 · 2026-03-17T01:04:13Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-17T01:04:38Z

Azure Pipelines successfully started running 1 pipeline(s).

FabianMeiswinkel

LGTM

…ness TenantWorkloadConfig.getConsistencyLevel() called valueOf(toUpperCase()) which fails for wire-format values like 'BoundedStaleness' -> 'BOUNDEDSTALENESS' (enum name is BOUNDED_STALENESS). Now falls back to matching the display name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
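The fallback described in this commit message might look roughly like the sketch below. It assumes a hypothetical enum `Level` with a display name per constant, and a hypothetical `parse` helper. It does not use the SDK's ConsistencyLevel type or the actual TenantWorkloadConfig code.

```java
import java.util.Locale;

// Hypothetical sketch of an enum lookup that falls back to the display name.
final class ConsistencyLevelParserSketch {
    enum Level {
        STRONG("Strong"),
        BOUNDED_STALENESS("BoundedStaleness"),
        SESSION("Session"),
        CONSISTENT_PREFIX("ConsistentPrefix"),
        EVENTUAL("Eventual");

        private final String displayName;

        Level(String displayName) {
            this.displayName = displayName;
        }

        String displayName() {
            return displayName;
        }
    }

    static Level parse(String value) {
        try {
            // Works for "SESSION" or "session", but not for "BoundedStaleness",
            // which upper-cases to "BOUNDEDSTALENESS" (no underscore).
            return Level.valueOf(value.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Fallback: match the wire-format display name, ignoring case.
            for (Level level : Level.values()) {
                if (level.displayName().equalsIgnoreCase(value)) {
                    return level;
                }
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("BoundedStaleness"));
        System.out.println(parse("session"));
    }
}
```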

jeet1995 · 2026-03-17T01:32:30Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-17T01:32:54Z

Azure Pipelines successfully started running 1 pipeline(s).

FITests_queryAfterCreation test configs use ONE_SECOND_DURATION timeouts designed for DIRECT mode. Thin client forces GATEWAY mode which has higher latency through the proxy, causing 408 timeouts. Skip DIRECT configs when thin client is enabled until proper GATEWAY variants are added. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 · 2026-03-17T03:35:00Z

/azp run java - cosmos - tests

jeet1995 · 2026-03-17T03:35:01Z

/azp run java - cosmos - ci

azure-pipelines · 2026-03-17T03:35:25Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines · 2026-03-17T03:35:26Z

Azure Pipelines successfully started running 1 pipeline(s).

…strategy tests running FI tests with availability strategy (hedging) must still run under thin client since query/readAll hedging is core functionality being validated. Only skip DIRECT mode test configs that have no availability strategy (baseline AllGood tests with tight timeouts). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ction modes Availability strategy (hedging) is connection-mode agnostic. Tests should run under thin client without skipping, regardless of whether the config specifies DIRECT or GATEWAY. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 · 2026-03-17T12:08:00Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-17T12:08:29Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-03-19T22:42:28Z

@sdkReviewAgent

xinlian12 · 2026-03-19T22:42:58Z

sdkReviewAgent | Status: ⏳ Queued

Review requested by @xinlian12. I'll start shortly.

xinlian12 · 2026-03-19T22:43:00Z

sdkReviewAgent | Status: 🔍 Reviewing

I'm reviewing this PR now. I'll post my findings as comments when done.

When resolveCollectionAsync returns null, throw NullPointerException with a descriptive message instead of logging a warning and falling through. There is no recovery path -- proceeding would NPE in getEffectivePartitionKeyBytes with a cryptic stack trace. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
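A rough illustration of the fail-fast choice in this commit message, using java.util.Objects.requireNonNull with a descriptive message. `FailFastSketch`, `requireResolved`, the message wording, and the collection link format are assumptions for the example, not the SDK's code.

```java
import java.util.Objects;

// Hypothetical sketch: fail with a descriptive NPE instead of a cryptic one later.
final class FailFastSketch {
    static <T> T requireResolved(T collection, String collectionLink) {
        return Objects.requireNonNull(
            collection,
            "Collection could not be resolved for '" + collectionLink
                + "'; cannot compute the effective partition key");
    }

    public static void main(String[] args) {
        try {
            requireResolved(null, "dbs/db1/colls/c1");
        } catch (NullPointerException e) {
            // The message names the collection, which a downstream NPE would not.
            System.out.println(e.getMessage());
        }
    }
}
```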

xinlian12 · 2026-03-20T17:22:10Z

@sdkReviewAgent-2

xinlian12

LGTM, thanks

xinlian12 · 2026-03-20T17:53:01Z

⏳ PR Review Agent — Starting review...

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java

...ure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionalRoutingContext.java

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java

…ion bug CaseInsensitiveMap.convertKey() uses toString() as the map key, not equals()/hashCode(). The richer toString() that included thinclientEndpoint caused keys to change after setThinclientRegionalEndpoint was called post-insertion, breaking map lookups in LocationCache. Revert to gateway-only toString(). Add unit test proving toString() is stable before and after setThinclientRegionalEndpoint. Found by: xinlian12 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
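The failure mode in this commit message can be reproduced in isolation. The sketch below imitates a map that derives its key from toString() using a plain HashMap and a hypothetical `keyOf` helper; it does not use the real CaseInsensitiveMap or LocationCache. `Context` and its flag are hypothetical and exist only to compare the two toString() variants.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a toString()-keyed map breaking when toString() is unstable.
final class ToStringKeySketch {
    static final class Context {
        final String gatewayEndpoint;
        final boolean includeThinclientInToString;
        String thinclientEndpoint; // set after insertion, like the mutable setter in the PR

        Context(String gatewayEndpoint, boolean includeThinclientInToString) {
            this.gatewayEndpoint = gatewayEndpoint;
            this.includeThinclientInToString = includeThinclientInToString;
        }

        @Override
        public String toString() {
            if (includeThinclientInToString && thinclientEndpoint != null) {
                return gatewayEndpoint + "|" + thinclientEndpoint;
            }
            return gatewayEndpoint;
        }
    }

    // Imitates a map that converts every key via toString() and lower-cases it.
    static String keyOf(Object key) {
        return key.toString().toLowerCase(Locale.ROOT);
    }

    // Inserts a context, mutates it, then checks whether it can still be found.
    static boolean foundAfterMutation(boolean includeThinclientInToString) {
        Map<String, String> regionByKey = new HashMap<>();
        Context context = new Context("https://acct-eus2.example.com", includeThinclientInToString);
        regionByKey.put(keyOf(context), "East US 2");
        context.thinclientEndpoint = "https://acct-eus2-thin.example.com";
        return regionByKey.containsKey(keyOf(context));
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation(true));  // false: the key changed after insertion
        System.out.println(foundAfterMutation(false)); // true: gateway-only toString() is stable
    }
}
```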

xinlian12

LGTM, thanks

jeet1995 · 2026-03-20T19:06:38Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-20T19:07:07Z

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 · 2026-03-20T23:02:01Z

/check-enforcer override

github-actions bot added the Cosmos label Mar 16, 2026

jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch 2 times, most recently from a2c92e2 to e8efae4 Compare March 16, 2026 22:59

jeet1995 changed the title ~~[GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for h…~~ [GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries. Mar 16, 2026

jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch from e8efae4 to fbf0b6f Compare March 17, 2026 00:02

jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch from fbf0b6f to 12ada5e Compare March 17, 2026 01:01

FabianMeiswinkel approved these changes Mar 17, 2026

jeet1995 and others added 2 commits March 17, 2026 08:03

jeet1995 and others added 3 commits March 19, 2026 18:19

Add DR drill timechart images for PR Azure#48432 failover regression …

da8ea88

…test Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert "Add DR drill timechart images for PR Azure#48432 failover reg…

fda428c

…ression test" This reverts commit da8ea88.

Add DR drill timechart images for PR Azure#48432

43ae2a4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 and others added 8 commits March 19, 2026 18:53

Update DR drill charts: scope to Document ResourceType, annotate writ…

12aac15

…e region bounce Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add success timechart for DR drill report

a2a27e9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update DR drill charts with skill-test results

1abccf4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update DR drill charts with v2 skill-test results (full completion)

e34dbae

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add success timechart for v2 DR drill

46f68b8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add separate write and read success timecharts for v2 DR drill

584907a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Regenerate success charts with verified Kusto data

461303c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 mentioned this pull request Mar 20, 2026

[Cosmos] Audit RxDocumentServiceRequest.clone() for missing field copies #48496

Open

xinlian12 approved these changes Mar 20, 2026

xinlian12 reviewed Mar 20, 2026

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java

xinlian12 reviewed Mar 20, 2026

...ure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionalRoutingContext.java (Outdated)

xinlian12 reviewed Mar 20, 2026

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java

jeet1995 and others added 3 commits March 20, 2026 14:45

Add thin client failover timechart (ThinClientProxyRequest5M)

f897209

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

TC chart: clarify titles show successes only (201/200)

88928f7

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 approved these changes Mar 20, 2026

jeet1995 enabled auto-merge (squash) March 20, 2026 21:45

jeet1995 merged commit e3c9c34 into Azure:main Mar 20, 2026
88 of 92 checks passed
