Session consistency improvement with bloom filter approach #38003

jeet1995 · 2023-12-09T22:23:03Z

Background

For multi-write accounts, clients can route write requests to any region (the hub region or a satellite region) based on the preferred regions list. At a high level, this can create a backlog of changes that the hub region must consolidate and conflict-resolve. The hub region also forwards conflict-resolved changes to the satellite regions, so cross-regional replication latency coupled with regional failover can make it seem as though either the hub region or one or more satellite regions are "lagging".

Session consistency, which provides read-your-writes and read-your-reads (monotonic read) guarantees at the client level, is enforced through a construct called the session token. The session token represents the progress made by a particular replica of a physical partition. The SDK sends a request to a replica along with a requested session token, and the replica checks whether it has progressed at least as far as that token. If it has, it responds with the requested data; otherwise it responds with 404:1002 (404 Read/Write Session Not Available). The SDK treats this as a signal to retry on a different replica. Depending on the configuration set in SessionRetryOptions, the SDK either cycles through the replicas in the current region or switches to a replica in a different region for the same physical partition.

404:1002s can occur in steady state from time to time due to in-region replication lag, which the SDK can recover from by cycling once through the replicas within the same region. The main focus of this document is cross-regional replication lag. In that case the SDK has to retry on replicas in other regions until it finds a replica that is up to date enough to satisfy the requested session token. The SDK may even have to cycle through remote-region replicas multiple times. Every retry to a different region incurs cross-regional latency, and this, together with the number of retries, can increase CPU utilization on client-side application pods.

PR details

Overview

This PR moves away from maintaining a single "global" representation of the session token to maintaining region-scoped progress for each physical partition of a container. When a session token has to be resolved for a request that targets a logical partition, the session information can then represent only the regions to which requests for that logical partition were routed. This reduces the replication target for a replica in whichever region receives the request, and therefore increases the chances of session guarantees being met without retries, both locally and cross-regionally.

Major classes introduced

`PartitionScopedRegionLevelProgress`

This class maintains a nested map of type ConcurrentHashMap<String, ConcurrentHashMap<String, RegionLevelProgress>> that maps each partitionKeyRangeId to its region-level progress. Region-level progress is itself a ConcurrentHashMap that maps the progress scope / region to the respective localLsn / globalLsn / session token. This class holds the main logic for resolving the session token for a given request.
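
A minimal sketch of the nested-map shape described above, assuming a simplified progress record that only tracks the maximum localLsn and globalLsn seen. `RegionProgressSketch`, `Progress`, `record` and `maxLocalLsn` are hypothetical names for illustration; the actual class also retains session tokens and is not reproduced here.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: not the SDK's PartitionScopedRegionLevelProgress.
final class RegionProgressSketch {

    // Hypothetical, simplified stand-in for RegionLevelProgress.
    static final class Progress {
        final long maxLocalLsn;
        final long maxGlobalLsn;

        Progress(long maxLocalLsn, long maxGlobalLsn) {
            this.maxLocalLsn = maxLocalLsn;
            this.maxGlobalLsn = maxGlobalLsn;
        }

        // Progress only moves forward: keep the larger value of each LSN.
        Progress mergeWith(Progress other) {
            return new Progress(
                Math.max(maxLocalLsn, other.maxLocalLsn),
                Math.max(maxGlobalLsn, other.maxGlobalLsn));
        }
    }

    // partitionKeyRangeId -> (region -> progress)
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, Progress>> progressByPartition =
        new ConcurrentHashMap<>();

    // Atomically records progress for a region of a physical partition.
    void record(String partitionKeyRangeId, String region, long localLsn, long globalLsn) {
        progressByPartition
            .computeIfAbsent(partitionKeyRangeId, id -> new ConcurrentHashMap<>())
            .merge(region, new Progress(localLsn, globalLsn), Progress::mergeWith);
    }

    // Returns -1 when nothing has been recorded for the partition/region pair.
    long maxLocalLsn(String partitionKeyRangeId, String region) {
        ConcurrentHashMap<String, Progress> byRegion = progressByPartition.get(partitionKeyRangeId);
        if (byRegion == null) {
            return -1L;
        }
        Progress progress = byRegion.get(region);
        return progress == null ? -1L : progress.maxLocalLsn;
    }
}
```

Using `computeIfAbsent` followed by `merge` keeps each update atomic per key, so concurrent responses from the same region cannot move the recorded LSNs backwards.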

`PartitionKeyBasedBloomFilter`

This class encapsulates a bloom filter which stores hashed tuples of the effective partition key string (the hashed representation of the logical partition), the collection resource id and the region. This class is responsible for enumerating the regions in which a particular logical partition saw requests.
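
To illustrate how region enumeration can work on top of a bloom filter, here is a self-contained sketch. It uses a hand-rolled `BitSet` filter with FNV-1a double hashing purely as a stand-in (the commit list of this PR mentions shading Guava's BloomFilter); the class name, the key layout and the `possibleRegions` method are assumptions, not the PR's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative only: a hand-rolled Bloom filter standing in for the real one.
final class RegionBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RegionBloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Hypothetical key layout: the three tuple fields joined with a separator.
    private static String key(String epk, String collectionRid, String region) {
        return epk + '|' + collectionRid + '|' + region;
    }

    // FNV-1a 64-bit hash; its two 32-bit halves seed the double hashing below.
    private static long fnv1a(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // i-th bit index derived from one 64-bit hash (double hashing).
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void record(String epk, String collectionRid, String region) {
        long hash = fnv1a(key(epk, collectionRid, region));
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(hash, i));
        }
    }

    boolean mightContain(String epk, String collectionRid, String region) {
        long hash = fnv1a(key(epk, collectionRid, region));
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(hash, i))) {
                return false;
            }
        }
        return true;
    }

    // Probes each candidate region; may include false positives, never false negatives.
    List<String> possibleRegions(String epk, String collectionRid, List<String> candidateRegions) {
        List<String> result = new ArrayList<>();
        for (String region : candidateRegions) {
            if (mightContain(epk, collectionRid, region)) {
                result.add(region);
            }
        }
        return result;
    }
}
```

Because a bloom filter cannot list its contents, enumeration in this sketch works by probing every known region for the given logical partition. A false positive only adds an extra region to the result; a region that was actually recorded is never missed.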

`RegionScopedSessionContainer`

RegionScopedSessionContainer is an implementation of the interface ISessionContainer. This class maintains an instance of PartitionKeyBasedBloomFilter and mappings between the collectionRid and PartitionScopedRegionLevelProgress (collection-level mappings). RegionScopedSessionContainer performs checks to validate whether a request is targeted to a logical partition (thus determining whether the bloom filter is needed or not). It also fronts invocations from upstream classes wishing to set a session token or resolve the session token for a request.

Setting the session token in the `RegionScopedSessionContainer`

Sequence diagram

sequenceDiagram
    autonumber
    participant RXDL as RxDocumentClientImpl
    participant RXGSM as RxGatewayStoreModel
    participant SCL as StoreClient
    participant RSSC as RegionScopedSessionContainer
    participant PKBF as PartitionKeyBasedBloomFilter
    participant PSRP as PartitionScopedRegionLevelProgress
    
    
    RXDL-->>RSSC: setSessionToken()
    RXGSM-->>RSSC: setSessionToken()
    SCL-->>RSSC: setSessionToken()
    RSSC->>PSRP: tryRecordSessionToken()
    RSSC-->>PKBF: tryRecordPartitionKey()

Flow of setting a session token / progress within `PartitionScopedRegionLevelProgress`.

flowchart TD
    A[tryRecordSessionToken - invoked after the collectionRid and physical partition is determined from the response] -->B[Store the mapping - global: merged session token]
    B --> B1{Does the session token have regionId to localLsn mappings?}
    B1 --> |Yes|C1{Has the targeted partition already seen operations without logical partition scope?}
    C1 --> |No|C[Capture the region the request got a response from]
    C --> D[Obtain the regionId for the region]
    D --> E{Is regionId to region mapping known to the SDK}
    E --> |Yes|F{Does regionId map to a satellite region or specifically if regionId is not present in session token?}
    F --> |Yes|G[Store the mapping - region: max localLsn seen so far, max globalLsn seen so far, prior retained session token if any]
    E --> |Yes|H{Does regionId map to a hub region or specifically if regionId is not present in session token?}
    H --> |Yes|I[Store the mapping - region: max globalLsn seen so far, prior retained session token if any]
    G --> J{Is region the first preferred readable region?}
    J --> |Yes|K[Store the mapping - region: merged result of session token from the response and retained session token]
    I --> J

NOTES:

The motivation behind storing the "global" representation of the session token is to resolve the session token as is for requests which don't have logical partition scoping.
The motivation behind storing the session token as is from the first preferred read region is so it is available as is to be merged with any region-scoped session token. A region-scoped session token will mainly have localLsn of a subset of regionIds where a logical partition was resolved to.
Whenever a physical partition sees a request with no logical partition scope, such as a cross-partition query, change feed with non-logical-partition scope or bulk operations, any subsequent point operations fall back to using the global session token for that partition. This is done to ensure read-your-reads guarantees.

Resolving the session token for a request in the `RegionScopedSessionContainer`

Sequence diagram

sequenceDiagram
    participant RXGSM as RxGatewayStoreModel
    participant STH as SessionTokenHelper
    participant RSSC as RegionScopedSessionContainer
    participant PKBF as PartitionKeyBasedBloomFilter
    participant PSRP as PartitionScopedRegionLevelProgress
    participant SR as StoreReader
    participant CW as ConsistencyWriter
   
    RXGSM-->>STH: 1. setPartitionLocalSessionToken()
    SR-->>STH: 1. setPartitionLocalSessionToken()
    CW-->>STH: 1. setPartitionLocalSessionToken()
    STH-->>RSSC: 2. resolvePartitionLocalSessionToken()
    RSSC-->>STH: 3. resolvePartitionLocalSessionToken()
    STH-->>PKBF: 4. tryGetPossibleRegionsLogicalPartitionResolvedTo()
    STH-->>PSRP: 5. tryResolveSessionToken()

Flow of resolving a session token within `PartitionScopedRegionLevelProgress`

flowchart TD
    A[tryResolveSessionToken - kicks in when the session token for the request is to be resolved] --> B{Can region scoped session token be used?}
    B --> |No| C[Return global session token]
    B --> |Yes|D[Obtain session token for first preferred read region aka base session token]
    D --> E[Iterate through regionIds in the global session token for construction of region scoped session token string]
    subgraph Loop to construct a region scoped session token string
    E --> EStart[Start iteration]
    EStart --> F{If regionId has corresponding region known to the SDK}
    F --> |No|G[Return global session token]
    F --> |Yes|H{If region has been resolved for logical partition}
    H --> |Yes|I[Append regionId=localLsn to a session token string and store max globalLsn seen so far]
    H --> |No|J[Append regionId=-1 to a session token string]
    I --> EEnd[End iteration]
    J --> EEnd
    end
    EEnd --> K[Prepend version obtained from global session token to session token string]
    K --> L[Prepend max globalLsn seen so far to session token string]
    L --> M{If merge of base session token and session token string succeeds}
    M --> |Yes|N[Return merge of base session token and session token string]
    M --> |No|O[Return global session token]
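
The loop in the flowchart above can be sketched as follows, assuming a session token body of the form `version#globalLsn#regionId=localLsn#...` (the shape implied by the prepend/append steps). The final merge with the base session token is omitted, and `RegionScopedTokenSketch`, `Progress`, the regionId-to-region mapping and all parameter names are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: mirrors the flowchart's loop, without the final merge step.
final class RegionScopedTokenSketch {

    // Hypothetical per-region progress holder.
    static final class Progress {
        final long localLsn;
        final long globalLsn;

        Progress(long localLsn, long globalLsn) {
            this.localLsn = localLsn;
            this.globalLsn = globalLsn;
        }
    }

    // Assumes a well-formed token body: version#globalLsn#regionId=localLsn#...
    static String resolve(
            String globalToken,
            Map<Integer, String> regionIdToRegion,
            Set<String> regionsResolvedForLogicalPartition,
            Map<String, Progress> progressByRegion) {

        String[] parts = globalToken.split("#");
        String version = parts[0];
        long maxGlobalLsn = -1L;
        StringBuilder regionPart = new StringBuilder();

        // parts[1] is the global LSN; region entries start at index 2.
        for (int i = 2; i < parts.length; i++) {
            int regionId = Integer.parseInt(parts[i].substring(0, parts[i].indexOf('=')));
            String region = regionIdToRegion.get(regionId);
            if (region == null) {
                return globalToken; // unknown regionId: fall back to the global token
            }
            Progress progress = progressByRegion.get(region);
            if (regionsResolvedForLogicalPartition.contains(region) && progress != null) {
                regionPart.append('#').append(regionId).append('=').append(progress.localLsn);
                maxGlobalLsn = Math.max(maxGlobalLsn, progress.globalLsn);
            } else {
                regionPart.append('#').append(regionId).append("=-1");
            }
        }
        // Prepend the version and the max globalLsn seen so far.
        return version + "#" + maxGlobalLsn + regionPart;
    }
}
```

In this sketch, a global token `0#10#1=7#5=9` with only the region behind regionId 1 resolved for the logical partition (recorded progress localLsn 7, globalLsn 4) yields `0#4#1=7#5=-1`: the unresolved region no longer contributes to the replication target.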

Configuration options

Notes:

The below settings can be applied as an environment variable or a JVM system property.

Opting into region-scoped session capturing

System.setProperty("COSMOS.SESSION_CAPTURING_TYPE", "REGION_SCOPED");

Configuring the expected insertions and expected false positive rate of the bloom filter

Region-scoped session capturing must first be opted into with the previous system property.

System.setProperty("COSMOS.PK_BASED_BLOOM_FILTER_EXPECTED_INSERTION_COUNT", "5000000");
System.setProperty("COSMOS.PK_BASED_BLOOM_FILTER_EXPECTED_FFP_RATE", "0.001");
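
Since the settings can come from either a JVM system property or an environment variable, a lookup helper along the following lines is one way to read them. This is a sketch, not the SDK's `Configs` class: the precedence order (system property first), the environment variable names and the default values are all assumptions.

```java
// Illustrative only: not the SDK's Configs class.
final class SessionConfigSketch {

    // Assumed precedence: JVM system property, then environment variable, then default.
    static String lookup(String propertyName, String envName, String defaultValue) {
        String value = System.getProperty(propertyName);
        if (value == null || value.isEmpty()) {
            value = System.getenv(envName);
        }
        return (value == null || value.isEmpty()) ? defaultValue : value;
    }

    public static void main(String[] args) {
        System.setProperty("COSMOS.SESSION_CAPTURING_TYPE", "REGION_SCOPED");
        System.setProperty("COSMOS.PK_BASED_BLOOM_FILTER_EXPECTED_INSERTION_COUNT", "5000000");

        // The environment variable names and defaults below are placeholders.
        String capturingType = lookup(
            "COSMOS.SESSION_CAPTURING_TYPE", "HYPOTHETICAL_ENV_CAPTURING_TYPE", "PLACEHOLDER_DEFAULT");
        long expectedInsertions = Long.parseLong(lookup(
            "COSMOS.PK_BASED_BLOOM_FILTER_EXPECTED_INSERTION_COUNT", "HYPOTHETICAL_ENV_INSERTION_COUNT", "1"));

        System.out.println(capturingType + " " + expectedInsertions);
    }
}
```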

Benchmarking results

The benchmarking done focuses on two areas - performance-regression benchmarking and 404/1002 retry reduction benchmarking.

404/1002 retry reduction benchmarking

Benchmark setup

The fundamental idea is to simulate cross-region replication lag. The benchmark account is a multi-write account with 3 write regions: West US 2, South Central US and East US. Two client instances were created, referred to here as the slow client and the fast client. The slow client has West US 2, South Central US and East US as its preferred regions and uses 2 threads, 1 each for point reads and point creates. The fast client uses N = 30 threads to run creates routed directly to East US. This simulates cross-region replication lag between East US and West US 2 (the two regions are geographically the furthest apart among the three).

From the slow client's perspective, every 5 minutes for K = 2 minutes (can be configured), a fraction of creates and reads do cross-region retries - either to South Central US or East US. This has the effect of forcing following creates or reads to capture the session progress from the quicker progressing East US thereby increasing chances of requests hitting 404/1002s in either West US 2 or South Central US.

Run type - the slow client is configured with threshold-based availability strategy and end-to-end timeout of 3s

Configuration type	Configuration value
Connectivity mode	Direct
Is threshold-based availability strategy enabled?	TRUE
End-to-end operation timeout	3s
Threads allocated to run creates in East US region (remote region) through secondary client	30
Cross-region retry rate every 5 minutes for creates and reads through primary client	10%
404:1002 retry hint	REMOTE_REGION_PREFERRED
In-region retry time for 404:1002s	10 thousand ms
Run duration	4 hours
Container's manual provisioned throughput	100 thousand
Bloom filter expected insertions	800 million
Bloom filter expected false positive rate	0.001

Run type - the slow client is configured with no end-to-end operation timeout

Configuration type	Configuration value
Connectivity mode	Direct
Threads allocated to run creates in East US region (remote region) through secondary client	30
Cross-region retry rate every 5 minutes for creates and reads through primary client	10%
404:1002 retry hint	REMOTE_REGION_PREFERRED
In-region retry time for 404:1002s	10 thousand ms
Run duration	4 hours
Container's manual provisioned throughput	100 thousand
Bloom filter expected insertions	800 million
Bloom filter expected false positive rate	0.001

Interpreting the results

When end-to-end operation timeout and availability strategy are configured:
- When region-scoped session capturing is enabled, 9.7% more PKs had simulated cross-region retries, with a 0.6% increase in 404/1002 cross-region retries for reads. For creates, 404/1002 retries drop by roughly 68% with region-scoped session capturing enabled.
When end-to-end operation timeout and availability strategy are not configured:
- When region-scoped session capturing is enabled, ~27% more PKs had simulated cross-region retries, with a 1.1% decrease in 404/1002 cross-region retries for reads. For creates, 404/1002 retries drop by roughly 59% with region-scoped session capturing enabled.

Performance regression benchmarks (throughput or latency or both)

Diagnostic changes

Properties added
- bloomFilterInsertionCountSnapshot - a snapshot of the insertion count into the bloom filter.
- regionScopedSessionCfg - the region-scoped session settings: whether it is enabled, plus the expected insertion count and false positive rate of the bloom filter.
- sessionTokenEvaluationResults - a list of evaluation results describing how a session token got resolved for a request and how it got recorded given the response.

{
    "userAgent": "azsdk-java-cosmos/4.60.0-beta.1 Windows11/10.0 JRE/18.0.2.1",
    "activityId": "a1a07046-aafe-4568-8e31-0f389fa0e57f",
    "requestLatencyInMs": 1551,
    "requestStartTimeUTC": "2024-05-08T18:08:23.912067800Z",
    "requestEndTimeUTC": "2024-05-08T18:08:25.463844200Z",
    "responseStatisticsList": [
        {
            "storeResult": {
                "storePhysicalAddress": "rntbd://cdb-ms-prod-southcentralus1-be11.documents.azure.com:14352/apps/331f5c74-6380-425f-bc61-ccc87c1ecc3c/services/fa4f1f2f-3f2f-4818-9622-f6be8c5d4552/partitions/a669c23c-583f-4709-8202-6fd50742dc3d/replicas/133596162039633782s/",
                "lsn": 5,
                "globalCommittedLsn": 1,
                "partitionKeyRangeId": "0",
                "isValid": true,
                "statusCode": 200,
                "subStatusCode": 0,
                "isGone": false,
                "isNotFound": false,
                "isInvalidPartition": false,
                "isThroughputControlRequestRateTooLarge": false,
                "requestCharge": 1.0,
                "itemLSN": 5,
                "sessionToken": "0:0#2#1=-1#5=-1",
                "backendLatencyInMs": 0.262,
                "retryAfterInMs": null,
                "replicaStatusList": [
                    "14352:Unknown",
                    "14086:Unknown",
                    "14371:Unknown",
                    "14077:Unknown"
                ],
                "transportRequestTimeline": [
                    {
                        "eventName": "created",
                        "startTimeUTC": "2024-05-08T18:08:24.422093800Z",
                        "durationInMilliSecs": 2.5119
                    },
                    {
                        "eventName": "queued",
                        "startTimeUTC": "2024-05-08T18:08:24.424605700Z",
                        "durationInMilliSecs": 0.0
                    },
                    {
                        "eventName": "channelAcquisitionStarted",
                        "startTimeUTC": "2024-05-08T18:08:24.424605700Z",
                        "durationInMilliSecs": 1007.4226
                    },
                    {
                        "eventName": "pipelined",
                        "startTimeUTC": "2024-05-08T18:08:25.432028300Z",
                        "durationInMilliSecs": 2.9939
                    },
                    {
                        "eventName": "transitTime",
                        "startTimeUTC": "2024-05-08T18:08:25.435022200Z",
                        "durationInMilliSecs": 26.3154
                    },
                    {
                        "eventName": "decodeTime",
                        "startTimeUTC": "2024-05-08T18:08:25.461337600Z",
                        "durationInMilliSecs": 0.0
                    },
                    {
                        "eventName": "received",
                        "startTimeUTC": "2024-05-08T18:08:25.461337600Z",
                        "durationInMilliSecs": 1.5029
                    },
                    {
                        "eventName": "completed",
                        "startTimeUTC": "2024-05-08T18:08:25.462840500Z",
                        "durationInMilliSecs": 0.0
                    }
                ],
                "transportRequestChannelAcquisitionContext": {
                    "events": [
                        {
                            "poll": "2024-05-08T18:08:24.427755300Z",
                            "durationInMilliSecs": 0.0
                        },
                        {
                            "startNew": "2024-05-08T18:08:24.427755300Z",
                            "durationInMilliSecs": 872.2768
                        },
                        {
                            "completeNew": "2024-05-08T18:08:25.300032100Z"
                        }
                    ],
                    "waitForChannelInit": true
                },
                "rntbdRequestLengthInBytes": 583,
                "rntbdResponseLengthInBytes": 894,
                "requestPayloadLengthInBytes": 0,
                "responsePayloadLengthInBytes": 369,
                "channelStatistics": {
                    "channelId": "f270d909",
                    "channelTaskQueueSize": 0,
                    "pendingRequestsCount": 0,
                    "lastReadTime": "2024-05-08T18:08:25.431024700Z",
                    "waitForConnectionInit": true
                },
                "serviceEndpointStatistics": {
                    "availableChannels": 0,
                    "acquiredChannels": 0,
                    "executorTaskQueueSize": 0,
                    "inflightRequests": 1,
                    "lastSuccessfulRequestTime": "2024-05-08T18:08:24.423Z",
                    "lastRequestTime": "2024-05-08T18:08:24.423Z",
                    "createdTime": "2024-05-08T18:08:24.423598700Z",
                    "isClosed": false,
                    "cerMetrics": {}
                }
            },
            "requestResponseTimeUTC": "2024-05-08T18:08:25.463844200Z",
            "requestStartTimeUTC": "2024-05-08T18:08:24.422093800Z",
            "requestResourceType": "Document",
            "requestOperationType": "Read",
            "requestSessionToken": "0:0#2#1=-1#5=-1",
            "e2ePolicyCfg": null,
            "excludedRegions": "West US 2",
            "sessionTokenEvaluationResults": [
                "Recording region specific progress of region : southcentralus.",
                "Resolving to the session token corresponding to the first preferred readable region since the requested logical partition has not been resolved to other regions."
            ]
        }
    ],
    "supplementalResponseStatisticsList": [],
    "addressResolutionStatistics": {
        "2c4a785c-f1b8-4897-ab04-680d065ddb01": {
            "startTimeUTC": "2024-05-08T18:08:23.975076300Z",
            "endTimeUTC": "2024-05-08T18:08:24.301765600Z",
            "targetEndpoint": "https://xxxxxxx.documents.azure.com:443/addresses/?$resolveFor=dbs%2Fbq9PAA%3D%3D%2Fcolls%2Fbq9PALlDn7M%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=0",
            "exceptionMessage": null,
            "forceRefresh": false,
            "forceCollectionRoutingMapRefresh": false,
            "inflightRequest": false
        }
    },
    "regionsContacted": [
        "south central us"
    ],
    "retryContext": {
        "statusAndSubStatusCodes": null,
        "retryLatency": 0,
        "retryCount": 0
    },
    "metadataDiagnosticsContext": {
        "metadataDiagnosticList": [
            {
                "metaDataName": "SERVER_ADDRESS_LOOKUP",
                "startTimeUTC": "2024-05-08T18:08:23.975076300Z",
                "endTimeUTC": "2024-05-08T18:08:24.301765600Z",
                "durationinMS": 326
            }
        ]
    },
    "serializationDiagnosticsContext": {
        "serializationDiagnosticsList": null
    },
    "gatewayStatisticsList": [],
    "samplingRateSnapshot": 1.0,
    "bloomFilterInsertionCountSnapshot": 0,
    "systemInformation": {
        "usedMemory": "43442 KB",
        "availableMemory": "4150862 KB",
        "systemCpuLoad": "(2024-05-08T18:08:00.003443600Z 45.5%), (2024-05-08T18:08:05.019003200Z 34.8%), (2024-05-08T18:08:10.010209300Z 12.0%), (2024-05-08T18:08:15.018912300Z 10.0%), (2024-05-08T18:08:20.012022400Z 13.5%), (2024-05-08T18:08:25.012157100Z 9.6%)",
        "availableProcessors": 8
    },
    "clientCfgs": {
        "id": 3,
        "machineId": "uuid:406f4553-6e18-4080-82d0-15aaa9cfef24",
        "connectionMode": "DIRECT",
        "numberOfClients": 1,
        "excrgns": "[]",
        "clientEndpoints": {
            "https://xxxxx.documents.azure.com:443/": 3
        },
        "connCfg": {
            "rntbd": "(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:true)",
            "gw": "(cps:1000, nrto:PT1M, icto:PT1M, p:false)",
            "other": "(ed: true, cs: false, rv: true)"
        },
        "consistencyCfg": "(consistency: Session, mm: true, prgns: [westus2,southcentralus,eastus])",
        "proactiveInitCfg": "",
        "e2ePolicyCfg": "",
        "sessionRetryCfg": "",
        "regionScopedSessionCfg": "(rssc: true, expins: 5000000, ffprate: 0.001)"
    }
}

When does region-scoping of session tokens help?

The targeted account is a multi-write region account.
This feature will help reduce the replication target for requests targeting those logical partitions which haven't seen any transient cross-region retries.
The proportion of logical partitions which see transient cross-region retries should be small compared to total no. of logical partitions seen by the application or the application itself should have a high cardinality of logical partitions.
The application has a regular cadence of restarts - this helps clear out the bloom filter. The more records added to the bloom filter, the higher the chances of false positives being returned.
The application's workload primarily consists of point operations through a given client instance. Any operation with non-logical partition scope would mean following point operations have to use the global session token of that partition. An example - say a cross-partition query reads version 10 of a document, then the following read to that document also has to read at least version 10 of the same document which can be guaranteed when using the global session token for the partition in which the document exists.

When does the SDK-internal session container help?

An SDK-internal session container helps primarily in single-region write accounts where only 1 region sees all the writes, hence ensuring reading the latest committed write for a logical partition with a retry on the write region (worst-case).
It could help when all clients have the same preferred region order with multi-region write accounts. Any service-side issue would cause failovers in a similar manner on all clients, so write traffic increase would be isolated to a particular region. If not, Strong consistency would be a better choice if read your write guarantees are required albeit with latency increase for write operations to be globally committed. If read your write guarantees are indeed not required, then moving to Eventual consistency could help.
If Session guarantees are still required with multi-write accounts with clients biased to different regions, then capturing session tokens on a per-logical partition at an application level can help.

Open questions

The hub region could change, or the account-level region configuration could change.
When availability strategy is configured for the client, this could potentially lead to requests also being routed to other preferred regions leading to the bloom filter being filled up.

Memory implications

Below is the retained size (size of the object and whatever it depends on) of RegionScopedSessionContainer with varying expected insertions.

Expected Insertions	False Positive Rate	Retained Size
10,000	0.001	21 KB
100,000	0.001	183 KB
1 million	0.001	1.8 MB
10 million	0.001	17.9 MB
100 million	0.001	179 MB
1 billion	0.001	1.8 GB
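
The retained sizes above track the standard bloom filter sizing formula m = -n ln(p) / (ln 2)^2 fairly closely (about 14.4 bits per expected insertion at p = 0.001, so roughly 1.8 MB for 1 million insertions). The sketch below evaluates that formula for the table's rows. It is a back-of-the-envelope check rather than code from this PR, and it covers only the bit array, not any other object overhead.

```java
// Illustrative only: standard Bloom filter sizing, not code from this PR.
final class BloomFilterSizingSketch {

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long expectedInsertions, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedInsertions * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2, at least 1
    static int optimalHashes(long expectedInsertions, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedInsertions * Math.log(2)));
    }

    public static void main(String[] args) {
        long[] sizes = {10_000L, 100_000L, 1_000_000L, 10_000_000L, 100_000_000L, 1_000_000_000L};
        for (long n : sizes) {
            long bits = optimalBits(n, 0.001);
            System.out.printf("n=%d bits=%d bytes=%d hashes=%d%n",
                n, bits, bits / 8, optimalHashes(n, bits));
        }
    }
}
```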

Follow up items

Extend support for bulk operations.

…ssionConsistencyImprovement # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

…ssionConsistencyImprovement # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/SessionContainer.java

jeet1995 · 2024-05-04T23:29:12Z

/azp run java - cosmos - tests

azure-pipelines · 2024-05-04T23:29:32Z

Azure Pipelines successfully started running 1 pipeline(s).

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosClientBuilderTest.java

sdk/cosmos/azure-cosmos/CHANGELOG.md

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ChangeFeedQueryImpl.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

...cosmos/src/main/java/com/azure/cosmos/implementation/PartitionScopedRegionLevelProgress.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java

FabianMeiswinkel

LGTM except for few questions/comments - thank you, great work!

…ssionConsistencyImprovementWithBloomFilterApproach

nitinitt · 2024-05-13T19:46:38Z

Hey @jeet1995, 2 questions wrt Bloom Filters:

What was the actual error rate, in the experiment where 800 million insertions were made in Bloom Filter with expected error rate: 0.001. (code https://github.com/Azure/azure-sdk-for-java/pull/38003/files#diff-3ebb7622a9d10ed5f6e06103b7eda616c0c9ce4e1b72e88708ddf069cb5fcff8R120), how many occurrences of EPK the Bloom Filter deduced it has, but were a false positive? Does this error rate increase as the initial capacity of Bloom Filter is lowered(to conserve memory) keeping the expected error rate same, with same no of insertions: 800million? What is the impact of these false positives?
Is it possible to keep Bloom Filter in Guava Cache for a TTL amount of time i.e. 1 day/week. eg Bloom Filter 1 in Guava: Key=20240513 Value: Actual Bloom Filter1, TTL: 1 day. Bloom Filter 2 Key=20240514 and Value=Actual Bloom Filter2, TTL=1 day. In this case the old Bloom Filter can get TTL'd out and new one gets created each week/day. Would this help in reducing false positives and/or help reduce the memory footprint by lowering the capacity of Bloom Filter, as we are refreshing Bloom Filter periodically?

jeet1995 · 2024-05-14T21:37:31Z

/azp run java - cosmos - tests

azure-pipelines · 2024-05-14T21:37:53Z

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 · 2024-05-15T01:22:38Z

> Hey @jeet1995, 2 questions wrt Bloom Filters:
>
> What was the actual error rate, in the experiment where 800 million insertions were made in Bloom Filter with expected error rate: 0.001. (code https://github.com/Azure/azure-sdk-for-java/pull/38003/files#diff-3ebb7622a9d10ed5f6e06103b7eda616c0c9ce4e1b72e88708ddf069cb5fcff8R120), how many occurrences of EPK the Bloom Filter deduced it has, but were a false positive? Does this error rate increase as the initial capacity of Bloom Filter is lowered(to conserve memory) keeping the expected error rate same, with same no of insertions: 800million? What is the impact of these false positives?

We have a unit test, SessionConsistencyWithRegionScopingTests#testFppRate. The idea is to insert random numbers into both a shadow Set and a BloomFilter and then cross-reference the two. When expected insertions is 10 million, the false positive rate hovers around 0.001 as long as the actual insertion count stays below 10 million - 9.8K and 10.1K false positives are some of the values I have seen. If 10 million more insertions are made, then all bets are off - in the unit test the false positive count increased to 110K - so a false positive rate of ~0.05. The impact of false positives is that the bloom filter will conclude that an EPK has been seen in more regions than is actually the case, so the session token constructed for session consistency guarantees will include progress information from the extra regions. The more progress is scoped into a session token, the more retries the SDK needs to hit a replica which has caught up to this session token. (The SDK behavior as of today is to use progress from all regions, so this is still an optimization, as only 1 / 1000 requests will hit the old behavior.) I can double check the 800 million scenario - it is hard to fit the shadow Set into memory for such an insertion count, but let me think about how to test this.
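
As a rough cross-check of the rates mentioned above, the textbook estimate for a bloom filter's false positive rate after n insertions into m bits with k hash functions is (1 - e^(-kn/m))^k. The sketch below evaluates it for a filter sized for 10 million insertions at 0.001; it is a theoretical estimate under standard sizing assumptions, not a reproduction of the unit test.

```java
// Illustrative only: textbook false positive estimate, not the unit test itself.
final class BloomFilterOverfillSketch {

    // Expected false positive rate after `inserted` elements,
    // for a filter with `bits` bits and `hashes` hash functions.
    static double expectedFpp(long inserted, long bits, int hashes) {
        return Math.pow(1.0 - Math.exp(-(double) hashes * inserted / bits), hashes);
    }

    public static void main(String[] args) {
        long sizedFor = 10_000_000L;
        double ln2 = Math.log(2);
        // Standard sizing for 10 million expected insertions at p = 0.001.
        long bits = (long) Math.ceil(-sizedFor * Math.log(0.001) / (ln2 * ln2));
        int hashes = (int) Math.round((double) bits / sizedFor * ln2);

        System.out.println("at capacity:    " + expectedFpp(sizedFor, bits, hashes));
        System.out.println("at 2x capacity: " + expectedFpp(2 * sizedFor, bits, hashes));
    }
}
```

Under these assumptions the estimate stays near 0.001 at the configured capacity and lands in the 0.05 to 0.06 range at twice the capacity, which is in line with the ~0.05 observed above.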

> Is it possible to keep Bloom Filter in Guava Cache for a TTL amount of time i.e. 1 day/week. eg Bloom Filter 1 in Guava: Key=20240513 Value: Actual Bloom Filter1, TTL: 1 day. Bloom Filter 2 Key=20240514 and Value=Actual Bloom Filter2, TTL=1 day. In this case the old Bloom Filter can get TTL'd out and new one gets created each week/day. Would this help in reducing false positives and/or help reduce the memory footprint by lowering the capacity of Bloom Filter, as we are refreshing Bloom Filter periodically?

Clearing out the bloom filter is not a use case for this feature - without knowing what regions an EPK was seen in it is impossible to guarantee read your create / read your read behavior for this EPK. If we clear out the bloom filter it is effectively eventual consistency when reading the document associated with a given EPK.

jeet1995 · 2024-05-15T13:23:24Z

Failures in live test pipeline w.r.t partition-split tests timing out which pass locally. Same behavior for this PR as well - #38740

jeet1995 · 2024-05-15T13:23:39Z

Merging PR as is.

jeet1995 added 20 commits November 1, 2023 09:26

Adding tests.

b8fd151

Adding tests.

12522c8

Adding PK-scoped session token map.

dc62739

Refactoring.

eb09a5e

Addind client-level options for session consistency.

3ab9d36

Merge branch 'main' of github.com:jeet1995/azure-sdk-for-java into Se…

7b88d04

…ssionConsistencyImprovement # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

Adding client-level options for session consistency.

bc74be3

Adding client-level options for session consistency.

ed3ae6f

Added read my writes test.

2a164dd

Added PartitionKeyMetadata.

52c246d

Added SessionTokenMetadata.

a374d05

Fixing bugs.

c3f02fc

Added a registry for session tokens.

38a3b52

Added a registry for session tokens.

dd36f8f

Refactorings.

c4e7c87

Adding LRU-based eviction.

a712c76

Refactorings.

23cd35e

Shade Guava BloomFilter and its dependencies.

ca0a36e

Adding bloom filter based PK tracking.

0568f58

github-actions bot added the Cosmos label Dec 9, 2023

jeet1995 added 9 commits December 9, 2023 21:42

Adding bloom filter based PK tracking.

e16750f

Fixing custom type for bloom filter key.

baf00d2

Refactoring.

1d92f44

Refactoring.

70d3dc6

Refactoring.

b5ee08d

Refactoring.

4c66567

Fixing SessionContainerTest.java.

85b57cc

Reverting changes.

f5dfcdb

Refactoring and bug fixes.

733dd8f

jeet1995 added 2 commits May 4, 2024 18:42

Fixing tests.

9924830

Force RegionScopedSessionContainer for live tests.

8155d58