Session consistency improvement with bloom filter approach #38003
…ssionConsistencyImprovement # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
…ssionConsistencyImprovement # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/SessionContainer.java
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosClientBuilderTest.java
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
...cosmos/src/main/java/com/azure/cosmos/implementation/PartitionScopedRegionLevelProgress.java
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
LGTM except for few questions/comments - thank you, great work!
…ssionConsistencyImprovementWithBloomFilterApproach
Hey @jeet1995, 2 questions wrt Bloom Filters:
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
We have a unit test
Clearing out the bloom filter is not a use case for this feature. Without knowing which regions an EPK was seen in, it is impossible to guarantee read-your-create / read-your-read behavior for this EPK. If we clear out the bloom filter, reads of the document associated with a given EPK are effectively eventually consistent.
Failures in the live test pipeline come from partition-split tests timing out; these tests pass locally. The same behavior occurs for this PR as well - #38740
Merging PR as is.
Background
For multi-write accounts, clients can route write requests to any region (the hub region or a satellite region) based on the preferred regions list. At a high level, this can create a backlog of changes that the hub region has to “consolidate” and conflict-resolve. The hub region also forwards conflict-resolved changes to the satellite regions, so cross-regional replication latency coupled with a regional failover can make it seem as though either the hub region or one or more satellite regions are “lagging”.
Session consistency, which is a read-your-write and read-your-read (monotonic read) guarantee at the client level, is enforced through a construct called the session token. The session token represents the progress made by a particular replica of a physical partition. The SDK sends a request to a replica with a requested session token, and the replica validates whether it has made progress up to that token. If it has, the replica responds with the requested data; otherwise it responds with 404:1002 (404 Read/Write Session Not Available). The SDK uses this as a signal to retry on a different replica. Depending on the configuration set in SessionRetryOptions, the SDK decides whether to cycle through the replicas in the current region or to switch to a replica in a different region for the same physical partition.
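The validate-then-retry loop described above can be sketched roughly as follows. This is a simplified illustration, not SDK code: `SessionReadSketch`, `Replica` and `firstRegionSatisfying` are hypothetical names, and the session token is reduced to a single LSN, whereas real session tokens carry more structure.

```java
import java.util.List;

// Hypothetical model, not the SDK's real types: a replica is reduced to the
// highest LSN it has applied, and a session token to the LSN the client needs.
final class SessionReadSketch {

    static final class Replica {
        final String region;
        final long appliedLsn;

        Replica(String region, long appliedLsn) {
            this.region = region;
            this.appliedLsn = appliedLsn;
        }

        // A replica can serve the read only if it has caught up to the requested LSN.
        boolean hasCaughtUpTo(long requestedLsn) {
            return appliedLsn >= requestedLsn;
        }
    }

    // Tries replicas in the given retry order and returns the region of the first
    // one that has caught up. A replica that has not caught up stands in for a
    // 404:1002 response. Returns null when every replica is behind.
    static String firstRegionSatisfying(List<Replica> replicasInRetryOrder, long requestedLsn) {
        for (Replica replica : replicasInRetryOrder) {
            if (replica.hasCaughtUpTo(requestedLsn)) {
                return replica.region;
            }
        }
        return null;
    }
}
```

In this model, the further down the list the first caught-up replica sits, the more retries (and possibly cross-region hops) the request pays for.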
404:1002s can occur from time to time in steady state due to in-region replication lag, which the SDK can recover from by cycling once through the replicas within the same region. The major focus of this document is cross-regional replication lag. In that case the SDK has to retry replicas in other regions until it finds the most up-to-date replica (one that satisfies the requested session token). The SDK may even have to cycle through remote-region replicas multiple times. Every retry to a different region incurs cross-regional latency, and this, along with the number of retries, can increase CPU utilization on client-side application pods.
PR details
Overview
This PR moves away from maintaining a single "global" representation of the session token to maintaining region-scoped progress for each physical partition of a container. When a session token has to be resolved for a request that targets a logical partition, the session information can then represent only the regions that logical partition saw requests routed to. This lowers the replication target for a replica in whichever region receives the request, which increases the chances of session guarantees being met without retries, both locally and cross-regionally.
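To make the difference between global and region-scoped progress concrete, here is a minimal sketch. It assumes progress can be modelled as one LSN per region and that resolution takes the maximum over the relevant regions; both are simplifications for illustration, and `RegionScopedProgressSketch` with its methods is a hypothetical name rather than a class from this PR.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: progress is tracked per physical partition and per region
// instead of as one global value per physical partition.
final class RegionScopedProgressSketch {

    // partitionKeyRangeId -> (region -> highest LSN observed from that region)
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, Long>> progress =
        new ConcurrentHashMap<>();

    // Records progress seen in a response, keeping the maximum per region.
    void record(String partitionKeyRangeId, String region, long lsn) {
        progress.computeIfAbsent(partitionKeyRangeId, id -> new ConcurrentHashMap<>())
                .merge(region, lsn, Long::max);
    }

    // Resolves the LSN to request while considering only the given regions.
    // Returns -1 when none of those regions has recorded progress.
    long resolve(String partitionKeyRangeId, Collection<String> regions) {
        ConcurrentHashMap<String, Long> byRegion = progress.get(partitionKeyRangeId);
        if (byRegion == null) {
            return -1L;
        }
        long resolved = -1L;
        for (String region : regions) {
            Long lsn = byRegion.get(region);
            if (lsn != null) {
                resolved = Math.max(resolved, lsn);
            }
        }
        return resolved;
    }
}
```

If a logical partition only ever saw requests in West US 2, resolving over that single region yields a lower target than resolving over every region, so a West US 2 replica is more likely to satisfy it on the first attempt.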
Major classes introduced
PartitionScopedRegionLevelProgress
This class maintains a nested ConcurrentHashMap of type ConcurrentHashMap<String, ConcurrentHashMap<String, RegionLevelProgress>> which maintains mappings between partitionKeyRangeId and region-level progress. Region-level progress is another ConcurrentHashMap which maps the progress scope / region to the respective localLsn / globalLsn / session token. This class has the main logic for resolving the session token for a given request.
PartitionKeyBasedBloomFilter
This class encapsulates a bloom filter which stores hashed tuples where the tuples represent the effective partition key string (hashed representation of the logical partition), collection resource id and the region. This class is responsible for enumerating which regions a particular logical partition saw requests in.
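The idea of a bloom filter over (effective partition key, collection resource id, region) tuples can be sketched as below. This is a from-scratch illustration under stated assumptions, not the PR's implementation: `RegionBloomFilterSketch`, the `|` delimiter, the FNV-1a hashing and the double-hashing scheme are all choices made for this example. Sizing uses the standard bloom filter formulas for bit count and hash count. Since a bloom filter can only answer membership queries, "enumerating" regions here means probing each candidate region in turn.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of a bloom filter keyed on (EPK, collectionRid, region).
final class RegionBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RegionBloomFilterSketch(long expectedInsertions, double falsePositiveRate) {
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        double ln2 = Math.log(2);
        double idealBits = -expectedInsertions * Math.log(falsePositiveRate) / (ln2 * ln2);
        this.numBits = (int) Math.max(64, Math.ceil(idealBits));
        this.numHashes = Math.max(1, (int) Math.round(ln2 * numBits / expectedInsertions));
        this.bits = new BitSet(numBits);
    }

    // Records that the logical partition saw a request in the given region.
    void put(String epk, String collectionRid, String region) {
        for (int index : indexes(epk, collectionRid, region)) {
            bits.set(index);
        }
    }

    // False means "definitely not seen"; true means "possibly seen".
    boolean mightContain(String epk, String collectionRid, String region) {
        for (int index : indexes(epk, collectionRid, region)) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    // Probes every candidate region and keeps those the filter cannot rule out.
    List<String> regionsPossiblySeen(String epk, String collectionRid, List<String> candidates) {
        List<String> seen = new ArrayList<>();
        for (String region : candidates) {
            if (mightContain(epk, collectionRid, region)) {
                seen.add(region);
            }
        }
        return seen;
    }

    // Derives numHashes bit positions from two base hashes (double hashing).
    private int[] indexes(String epk, String collectionRid, String region) {
        byte[] key = (epk + "|" + collectionRid + "|" + region).getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(key, 0x811C9DC5);
        int h2 = fnv1a(key, 0x01000193) | 1; // forced odd so the stride is never zero
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            result[i] = Math.floorMod(h1 + i * h2, numBits);
        }
        return result;
    }

    // 32-bit FNV-1a with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}
```

A false positive only adds an extra region to the resolved set, which makes the requested progress more conservative; a bloom filter never produces false negatives, so a region that was actually seen is never dropped.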
RegionScopedSessionContainer
RegionScopedSessionContainer is an implementation of the interface ISessionContainer. This class maintains an instance of PartitionKeyBasedBloomFilter and mappings between the collectionRid and PartitionScopedRegionLevelProgress (collection-level mappings). RegionScopedSessionContainer performs checks to validate whether a request is targeted to a logical partition (thus determining whether the bloom filter is needed or not). It also fronts invocations from upstream classes wishing to set a session token or resolve the session token for a request.
Setting the session token in the RegionScopedSessionContainer
Sequence diagram
Flow of setting a session token / progress within PartitionScopedRegionLevelProgress.
NOTES:
localLsn of a subset of regionIds where a logical partition was resolved to.
Resolving the session token for a request in the RegionScopedSessionContainer
Sequence diagram
Flow of resolving a session token within PartitionScopedRegionLevelProgress
Configuration options
Notes:
Opting in to region-scoped session capturing
Configuring the expected insertions and expected false-positivity rate of the bloom filter
Benchmarking results
The benchmarking done focuses on two areas - performance-regression benchmarking and 404/1002 retry reduction benchmarking.
404/1002 retry reduction benchmarking
Benchmark setup
The fundamental idea is to simulate cross-region replication lag. The benchmark account is a multi-write account with 3 write regions: West US 2, South Central US and East US. Two client instances were created, referred to here as the slow client and the fast client. The slow client has West US 2, South Central US and East US as its preferred regions and uses 2 threads, 1 each for point reads and point creates. The fast client uses N = 30 threads whose requests are routed directly to East US. This simulates cross-region replication lag between East US and West US 2 (geographically the two furthest-apart regions of the three).
From the slow client's perspective, every 5 minutes for K = 2 minutes (configurable), a fraction of creates and reads do cross-region retries, either to South Central US or East US. This forces subsequent creates or reads to capture the session progress from the faster-progressing East US, thereby increasing the chances of requests hitting 404/1002s in either West US 2 or South Central US.
Run type - the slow client is configured with threshold-based availability strategy and end-to-end timeout of 3s
Run type - the slow client is configured with no end-to-end operation timeout
Interpreting the results
Performance regression benchmarks (throughput or latency or both)
Diagnostic changes
bloomFilterInsertionCountSnapshot - a snapshot of the insertion count into the bloom filter.
regionScopedSessionCfg - settings for the expected insertion count and false positive rate of the bloom filter.
sessionTokenEvaluationResults - a list of evaluation results describing how a session token got resolved for a request and how it got recorded given the response.
When does region-scoping of session tokens help?
When does the SDK-internal session container help?
Open questions
Memory implications
RegionScopedSessionContainer with varying expected insertions.
Follow up items