
Session consistency improvement with bloom filter approach #38003

Merged

Conversation

@jeet1995 (Member) commented Dec 9, 2023

Background

For multi-write accounts, clients can route write requests to any region (the hub region or satellite region(s)) based on the preferred regions list. At a high level, this can create a backlog of changes for the hub region to “consolidate” and conflict-resolve. The hub region also forwards conflict-resolved changes to the satellite region(s), hence any cross-regional replication latency coupled with a regional failover can make it seem as though either the hub region or one or more satellite regions are “lagging”.

Session consistency, which provides read-your-write and read-your-read (monotonic read) guarantees at the client level, is enforced through a construct called the session token. The session token represents the progress made by a particular replica of a physical partition. The SDK sends the requested session token with the request, and the replica validates whether it has made progress up to that session token. If it has, it responds with the requested data; otherwise it responds with 404:1002 (404 Read/Write Session Not Available). The SDK uses this as a signal to retry on a different replica. Depending on the configuration set in SessionRetryOptions, the SDK decides whether to cycle through the replicas in the current region or to switch to a replica of the same physical partition in a different region.
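For context, here is a minimal sketch of how SessionRetryOptions is typically wired into the client builder; the endpoint, key, and region list are placeholders, and the builder method names assume recent azure-cosmos releases:

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosRegionSwitchHint;
import com.azure.cosmos.SessionRetryOptionsBuilder;

import java.util.Arrays;

public class SessionRetryOptionsExample {
    public static void main(String[] args) {
        // On 404:1002, prefer moving to a replica in a remote region rather than
        // exhausting retries against replicas in the current region.
        CosmosClient client = new CosmosClientBuilder()
            .endpoint("https://<account>.documents.azure.com:443/")   // placeholder
            .key("<account-key>")                                      // placeholder
            .preferredRegions(Arrays.asList("West US 2", "South Central US", "East US"))
            .sessionRetryOptions(new SessionRetryOptionsBuilder()
                .regionSwitchHint(CosmosRegionSwitchHint.REMOTE_REGION_PREFERRED)
                .build())
            .buildClient();
    }
}
```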

404:1002s can occur in steady state from time to time due to in-region replication lag, which the SDK can usually recover from by cycling through the replicas within the same region once. The major focus of this document is cross-regional replication lag. In that case the SDK has to retry replicas in other regions until it finds the most up-to-date replica (one which satisfies the requested session token), and it may even have to cycle through remote-region replicas multiple times. Every retry to a different region incurs cross-regional latency, and this, together with the number of retries, can increase CPU utilization on client-side application pods.

PR details

Overview

This PR moves away from maintaining a single "global" representation of the session token and instead maintains region-scoped progress for each physical partition of a container. When a session token has to be resolved for a request that targets a logical partition, the session information can then be limited to the regions that the logical partition's requests were actually routed to. This reduces the replication target for the replica in whichever region receives the request, and therefore increases the chances of session guarantees being met without retries, both locally and cross-regionally.

Major classes introduced

PartitionScopedRegionLevelProgress

This class maintains a nested map of type ConcurrentHashMap<String, ConcurrentHashMap<String, RegionLevelProgress>>, which maps a partitionKeyRangeId to region-level progress. Region-level progress is itself a ConcurrentHashMap which maps the progress scope / region to the respective localLsn / globalLsn / session token. This class contains the main logic for resolving the session token for a given request.
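As an aside, here is a much-simplified sketch of the nested structure described above; field and class names are illustrative, not the SDK's actual implementation, and the concurrency handling is deliberately naive:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: simplified stand-in for the SDK's region-level progress bookkeeping.
final class RegionLevelProgress {
    volatile long maxLocalLsn;            // max localLsn observed for this region
    volatile long maxGlobalLsn;           // max globalLsn observed for this region
    volatile String retainedSessionToken; // session token retained for merging (may be null)
}

final class PartitionScopedRegionLevelProgressSketch {
    // partitionKeyRangeId -> (progress scope / region -> progress);
    // the "global" scope holds the merged session token used as a fallback.
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, RegionLevelProgress>> progress =
        new ConcurrentHashMap<>();

    void record(String partitionKeyRangeId, String regionOrGlobalScope, long localLsn, long globalLsn) {
        progress
            .computeIfAbsent(partitionKeyRangeId, k -> new ConcurrentHashMap<>())
            .compute(regionOrGlobalScope, (k, existing) -> {
                // Not strictly atomic; the real implementation handles concurrency more carefully.
                RegionLevelProgress p = existing == null ? new RegionLevelProgress() : existing;
                p.maxLocalLsn = Math.max(p.maxLocalLsn, localLsn);
                p.maxGlobalLsn = Math.max(p.maxGlobalLsn, globalLsn);
                return p;
            });
    }
}
```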

PartitionKeyBasedBloomFilter

This class encapsulates a bloom filter which stores hashed tuples, where each tuple consists of the effective partition key string (a hashed representation of the logical partition), the collection resource id, and the region. This class is responsible for enumerating the regions in which a particular logical partition saw requests.
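A hedged sketch of the idea using Guava's BloomFilter; the flattening of the tuple into a single string and the method shapes are illustrative assumptions, not the actual SDK code:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the real PartitionKeyBasedBloomFilter lives inside the SDK.
final class PartitionKeyBasedBloomFilterSketch {

    private final BloomFilter<CharSequence> bloomFilter;

    PartitionKeyBasedBloomFilterSketch(long expectedInsertions, double expectedFppRate) {
        this.bloomFilter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, expectedFppRate);
    }

    // The tuple (effective partition key, collectionRid, region) is flattened into one string before insertion.
    void tryRecordPartitionKey(String effectivePartitionKey, String collectionRid, String region) {
        bloomFilter.put(effectivePartitionKey + "|" + collectionRid + "|" + region);
    }

    // Enumerate candidate regions the logical partition may have been routed to (false positives possible).
    List<String> tryGetPossibleRegionsLogicalPartitionResolvedTo(
        String effectivePartitionKey, String collectionRid, List<String> knownRegions) {
        List<String> possibleRegions = new ArrayList<>();
        for (String region : knownRegions) {
            if (bloomFilter.mightContain(effectivePartitionKey + "|" + collectionRid + "|" + region)) {
                possibleRegions.add(region);
            }
        }
        return possibleRegions;
    }
}
```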

RegionScopedSessionContainer

RegionScopedSessionContainer is an implementation of the ISessionContainer interface. This class maintains an instance of PartitionKeyBasedBloomFilter and mappings between a collectionRid and its PartitionScopedRegionLevelProgress (collection-level mappings). RegionScopedSessionContainer checks whether a request is targeted at a logical partition (thus determining whether the bloom filter is needed) and fronts invocations from upstream classes wishing to set a session token or resolve the session token for a request.
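Building on the two sketches above, here is an illustrative view of how these pieces could fit together at the container level; again, this is a sketch, not the actual implementation:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the real RegionScopedSessionContainer implements ISessionContainer
// and has many more responsibilities than shown here.
final class RegionScopedSessionContainerSketch {

    // collectionRid -> per-physical-partition, per-region progress
    private final ConcurrentHashMap<String, PartitionScopedRegionLevelProgressSketch> collectionToProgress =
        new ConcurrentHashMap<>();
    private final PartitionKeyBasedBloomFilterSketch bloomFilter =
        new PartitionKeyBasedBloomFilterSketch(5_000_000L, 0.001);

    void onResponse(String collectionRid, String partitionKeyRangeId, String region,
                    String effectivePartitionKey, long localLsn, long globalLsn) {
        collectionToProgress
            .computeIfAbsent(collectionRid, k -> new PartitionScopedRegionLevelProgressSketch())
            .record(partitionKeyRangeId, region, localLsn, globalLsn);

        // The bloom filter only matters for requests scoped to a logical partition;
        // operations without logical partition scope fall back to the "global" session token.
        if (effectivePartitionKey != null) {
            bloomFilter.tryRecordPartitionKey(effectivePartitionKey, collectionRid, region);
        }
    }
}
```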

Setting the session token in the RegionScopedSessionContainer

Sequence diagram

sequenceDiagram
    autonumber
    participant RXDL as RxDocumentClientImpl
    participant RXGSM as RxGatewayStoreModel
    participant SCL as StoreClient
    participant RSSC as RegionScopedSessionContainer
    participant PKBF as PartitionKeyBasedBloomFilter
    participant PSRP as PartitionScopedRegionLevelProgress
    
    
    RXDL-->>RSSC: setSessionToken()
    RXGSM-->>RSSC: setSessionToken()
    SCL-->>RSSC: setSessionToken()
    RSSC->>PSRP: tryRecordSessionToken()
    RSSC-->>PKBF: tryRecordPartitionKey()

Flow of setting a session token / progress within PartitionScopedRegionLevelProgress.

flowchart TD
    A[tryRecordSessionToken - invoked after the collectionRid and physical partition is determined from the response] -->B[Store the mapping - global: merged session token]
    B --> B1{Does the session token have regionId to localLsn mappings?}
    B1 --> |Yes|C1{Has the targeted partition already seen operations without logical partition scope?}
    C1 --> |No|C[Capture the region the request got a response from]
    C --> D[Obtain the regionId for the region]
    D --> E{Is regionId to region mapping known to the SDK}
    E --> |Yes|F{Does regionId map to a satellite region, i.e. is regionId present in the session token?}
    F --> |Yes|G[Store the mapping - region: max localLsn seen so far, max globalLsn seen so far, prior retained session token if any]
    E --> |Yes|H{Does regionId map to the hub region, i.e. is regionId not present in the session token?}
    H --> |Yes|I[Store the mapping - region: max globalLsn seen so far, prior retained session token if any]
    G --> J{Is region the first preferred readable region?}
    J --> |Yes|K[Store the mapping - region: merged result of session token from the response and retained session token]
    I --> J

NOTES:

  • The motivation behind storing the "global" representation of the session token is to resolve the session token as is for requests which don't have logical partition scoping.
  • The motivation behind storing the session token as is from the first preferred read region is so that it is available to be merged with any region-scoped session token. A region-scoped session token will mainly have the localLsn of the subset of regionIds that a logical partition was resolved to.
  • Whenever a physical partition sees a request with no logical partition scope (such as a cross-partition query, a change feed request with non-logical partition scope, or bulk operations), any following point operations will resort to using the global session token of that partition. This is done to ensure read-your-read guarantees.

Resolving the session token for a request in the RegionScopedSessionContainer

Sequence diagram

sequenceDiagram
    participant RXGSM as RxGatewayStoreModel
    participant STH as SessionTokenHelper
    participant RSSC as RegionScopedSessionContainer
    participant PKBF as PartitionKeyBasedBloomFilter
    participant PSRP as PartitionScopedRegionLevelProgress
    participant SR as StoreReader
    participant CW as ConsistencyWriter
   
    RXGSM-->>STH: 1. setPartitionLocalSessionToken()
    SR-->>STH: 1. setPartitionLocalSessionToken()
    CW-->>STH: 1. setPartitionLocalSessionToken()
    STH-->>RSSC: 2. resolvePartitionLocalSessionToken()
    RSSC-->>STH: 3. resolvePartitionLocalSessionToken()
    STH-->>PKBF: 4. tryGetPossibleRegionsLogicalPartitionResolvedTo()
    STH-->>PSRP: 5. tryResolveSessionToken()

Flow of resolving a session token within PartitionScopedRegionLevelProgress

flowchart TD
    A[tryResolveSessionToken - kicks in when the session token for the request is to be resolved] --> B{Can region scoped session token be used?}
    B --> |No| C[Return global session token]
    B --> |Yes|D[Obtain session token for first preferred read region aka base session token]
    D --> E[Iterate through regionIds in the global session token for construction of region scoped session token string]
    subgraph Loop to construct a region scoped session token string
    E --> EStart[Start iteration]
    EStart --> F{If regionId has a corresponding region known to the SDK}
    F --> |No|G[Return global session token]
    F --> |Yes|H{If region has been resolved for logical partition}
    H --> |Yes|I[Append regionId=localLsn to a session token string and store max globalLsn seen so far]
    H --> |No|J[Append regionId=-1 to a session token string]
    I --> EEnd[End iteration]
    J --> EEnd
    end
    EEnd --> K[Prepend version obtained from global session token to session token string]
    K --> L[Prepend max globalLsn seen so far to session token string]
    L --> M{If merge of base session token and session token string succeeds}
    M --> |Yes|N[Return merge of base session token and session token string]
    M --> |No|O[Return global session token]
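To make the resulting token shape concrete, below is a small hypothetical sketch of the string-construction step in the loop above. It only mirrors the "version#globalLsn#regionId=localLsn" portion of the token shape visible in the diagnostics sample later in this PR and omits the merge with the base session token:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Illustrative only: builds a region-scoped session token string; the actual SDK
// logic also merges the result with the base session token.
final class RegionScopedSessionTokenSketch {

    static String buildRegionScopedTokenString(
        long version,
        Map<Integer, Long> regionIdToLocalLsnFromGlobalToken, // regionIds present in the global session token
        Set<Integer> regionIdsLogicalPartitionResolvedTo,     // regions returned by the bloom filter
        long maxGlobalLsnSeen) {

        StringJoiner token = new StringJoiner("#");
        token.add(Long.toString(version));
        token.add(Long.toString(maxGlobalLsnSeen));
        for (Map.Entry<Integer, Long> entry : regionIdToLocalLsnFromGlobalToken.entrySet()) {
            long localLsn = regionIdsLogicalPartitionResolvedTo.contains(entry.getKey())
                ? entry.getValue() // region resolved for this logical partition: keep its localLsn
                : -1L;             // region not resolved: -1 relaxes the requirement for that region
            token.add(entry.getKey() + "=" + localLsn);
        }
        return token.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Long> globalTokenLsns = new LinkedHashMap<>();
        globalTokenLsns.put(1, 30L);
        globalTokenLsns.put(5, 12L);
        // The logical partition was only ever resolved to regionId 1, so regionId 5 is relaxed to -1.
        System.out.println(buildRegionScopedTokenString(0L, globalTokenLsns, Set.of(1), 105L));
        // prints: 0#105#1=30#5=-1
    }
}
```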

Configuration options

Notes:

  • The settings below can be applied as environment variables or JVM system properties.

Opting in to region-scoped session capturing

System.setProperty("COSMOS.SESSION_CAPTURING_TYPE", "REGION_SCOPED");

Configuring the expected insertions and expected false-positive rate of the bloom filter

  • Region-scoped session capturing first needs to be opted into with the previous system property.
System.setProperty("COSMOS.PK_BASED_BLOOM_FILTER_EXPECTED_INSERTION_COUNT", "5000000");
System.setProperty("COSMOS.PK_BASED_BLOOM_FILTER_EXPECTED_FFP_RATE", "0.001");

Benchmarking results

The benchmarking done focuses on two areas - performance-regression benchmarking and 404/1002 retry reduction benchmarking.

404/1002 retry reduction benchmarking

Benchmark setup

The fundamental idea is to simulate cross-region replication lag. The account used for benchmarking is a multi-write account with 3 write regions - West US 2, South Central US and East US. Two client instances were created - let's call them the slow client and the fast client. The slow client has preferred regions West US 2, South Central US and East US and uses 2 threads - 1 thread each for point reads and point creates. The fast client uses N = 30 threads with requests directed to East US. This is to simulate cross-region replication lag between East US and West US 2 (the two regions that are geographically the furthest apart among the three).

From the slow client's perspective, every 5 minutes for K = 2 minutes (configurable), a fraction of creates and reads perform cross-region retries - either to South Central US or East US. This forces the following creates or reads to capture the session progress from the quicker-progressing East US, thereby increasing the chances of requests hitting 404/1002s in either West US 2 or South Central US.

Run type - the slow client is configured with threshold-based availability strategy and end-to-end timeout of 3s

| Configuration type | Configuration value |
| --- | --- |
| Connectivity mode | Direct |
| Is threshold-based availability strategy enabled? | TRUE |
| End-to-end operation timeout | 3s |
| Threads allocated to run creates in East US region (remote region) through secondary client | 30 |
| Cross-region retry rate every 5 minutes for creates and reads through primary client | 10% |
| 404:1002 retry hint | REMOTE_REGION_PREFERRED |
| In-region retry time for 404:1002s | 10,000 ms |
| Run duration | 4 hours |
| Container's manual provisioned throughput | 100,000 RU/s |
| Bloom filter expected insertions | 800 million |
| Bloom filter expected false positive rate | 0.001 |
(benchmark results image)

Run type - the slow client is configured with no end-to-end operation timeout

| Configuration type | Configuration value |
| --- | --- |
| Connectivity mode | Direct |
| Threads allocated to run creates in East US region (remote region) through secondary client | 30 |
| Cross-region retry rate every 5 minutes for creates and reads through primary client | 10% |
| 404:1002 retry hint | REMOTE_REGION_PREFERRED |
| In-region retry time for 404:1002s | 10,000 ms |
| Run duration | 4 hours |
| Container's manual provisioned throughput | 100,000 RU/s |
| Bloom filter expected insertions | 800 million |
| Bloom filter expected false positive rate | 0.001 |
(benchmark results image)

Interpreting the results

  • When end-to-end operation timeout and availability strategy are configured:
    • With region-scoped session capturing enabled, 9.7% more PKs had simulated cross-region retries, with a 0.6% increase in 404/1002 cross-region retries for reads. For creates, 404/1002 retries drop by roughly 68% with region-scoped session capturing enabled.
  • When end-to-end operation timeout and availability strategy are not configured:
    • With region-scoped session capturing enabled, ~27% more PKs had simulated cross-region retries, with a 1.1% decrease in 404/1002 cross-region retries for reads. For creates, 404/1002 retries drop by roughly 59% with region-scoped session capturing enabled.

Performance regression benchmarks (throughput or latency or both)

(performance benchmark result images)

Diagnostic changes

  • Properties added
    • bloomFilterInsertionCountSnapshot - a snapshot of the insertion count into the bloom filter.
    • regionScopedSessionCfg - the configured settings for region-scoped session capturing, such as the expected insertion count and false positive rate of the bloom filter.
    • sessionTokenEvaluationResults - a list of evaluation results describing how a session token got resolved for a request and how it got recorded given the response.
{
    "userAgent": "azsdk-java-cosmos/4.60.0-beta.1 Windows11/10.0 JRE/18.0.2.1",
    "activityId": "a1a07046-aafe-4568-8e31-0f389fa0e57f",
    "requestLatencyInMs": 1551,
    "requestStartTimeUTC": "2024-05-08T18:08:23.912067800Z",
    "requestEndTimeUTC": "2024-05-08T18:08:25.463844200Z",
    "responseStatisticsList": [
        {
            "storeResult": {
                "storePhysicalAddress": "rntbd://cdb-ms-prod-southcentralus1-be11.documents.azure.com:14352/apps/331f5c74-6380-425f-bc61-ccc87c1ecc3c/services/fa4f1f2f-3f2f-4818-9622-f6be8c5d4552/partitions/a669c23c-583f-4709-8202-6fd50742dc3d/replicas/133596162039633782s/",
                "lsn": 5,
                "globalCommittedLsn": 1,
                "partitionKeyRangeId": "0",
                "isValid": true,
                "statusCode": 200,
                "subStatusCode": 0,
                "isGone": false,
                "isNotFound": false,
                "isInvalidPartition": false,
                "isThroughputControlRequestRateTooLarge": false,
                "requestCharge": 1.0,
                "itemLSN": 5,
                "sessionToken": "0:0#2#1=-1#5=-1",
                "backendLatencyInMs": 0.262,
                "retryAfterInMs": null,
                "replicaStatusList": [
                    "14352:Unknown",
                    "14086:Unknown",
                    "14371:Unknown",
                    "14077:Unknown"
                ],
                "transportRequestTimeline": [
                    {
                        "eventName": "created",
                        "startTimeUTC": "2024-05-08T18:08:24.422093800Z",
                        "durationInMilliSecs": 2.5119
                    },
                    {
                        "eventName": "queued",
                        "startTimeUTC": "2024-05-08T18:08:24.424605700Z",
                        "durationInMilliSecs": 0.0
                    },
                    {
                        "eventName": "channelAcquisitionStarted",
                        "startTimeUTC": "2024-05-08T18:08:24.424605700Z",
                        "durationInMilliSecs": 1007.4226
                    },
                    {
                        "eventName": "pipelined",
                        "startTimeUTC": "2024-05-08T18:08:25.432028300Z",
                        "durationInMilliSecs": 2.9939
                    },
                    {
                        "eventName": "transitTime",
                        "startTimeUTC": "2024-05-08T18:08:25.435022200Z",
                        "durationInMilliSecs": 26.3154
                    },
                    {
                        "eventName": "decodeTime",
                        "startTimeUTC": "2024-05-08T18:08:25.461337600Z",
                        "durationInMilliSecs": 0.0
                    },
                    {
                        "eventName": "received",
                        "startTimeUTC": "2024-05-08T18:08:25.461337600Z",
                        "durationInMilliSecs": 1.5029
                    },
                    {
                        "eventName": "completed",
                        "startTimeUTC": "2024-05-08T18:08:25.462840500Z",
                        "durationInMilliSecs": 0.0
                    }
                ],
                "transportRequestChannelAcquisitionContext": {
                    "events": [
                        {
                            "poll": "2024-05-08T18:08:24.427755300Z",
                            "durationInMilliSecs": 0.0
                        },
                        {
                            "startNew": "2024-05-08T18:08:24.427755300Z",
                            "durationInMilliSecs": 872.2768
                        },
                        {
                            "completeNew": "2024-05-08T18:08:25.300032100Z"
                        }
                    ],
                    "waitForChannelInit": true
                },
                "rntbdRequestLengthInBytes": 583,
                "rntbdResponseLengthInBytes": 894,
                "requestPayloadLengthInBytes": 0,
                "responsePayloadLengthInBytes": 369,
                "channelStatistics": {
                    "channelId": "f270d909",
                    "channelTaskQueueSize": 0,
                    "pendingRequestsCount": 0,
                    "lastReadTime": "2024-05-08T18:08:25.431024700Z",
                    "waitForConnectionInit": true
                },
                "serviceEndpointStatistics": {
                    "availableChannels": 0,
                    "acquiredChannels": 0,
                    "executorTaskQueueSize": 0,
                    "inflightRequests": 1,
                    "lastSuccessfulRequestTime": "2024-05-08T18:08:24.423Z",
                    "lastRequestTime": "2024-05-08T18:08:24.423Z",
                    "createdTime": "2024-05-08T18:08:24.423598700Z",
                    "isClosed": false,
                    "cerMetrics": {}
                }
            },
            "requestResponseTimeUTC": "2024-05-08T18:08:25.463844200Z",
            "requestStartTimeUTC": "2024-05-08T18:08:24.422093800Z",
            "requestResourceType": "Document",
            "requestOperationType": "Read",
            "requestSessionToken": "0:0#2#1=-1#5=-1",
            "e2ePolicyCfg": null,
            "excludedRegions": "West US 2",
            "sessionTokenEvaluationResults": [
                "Recording region specific progress of region : southcentralus.",
                "Resolving to the session token corresponding to the first preferred readable region since the requested logical partition has not been resolved to other regions."
            ]
        }
    ],
    "supplementalResponseStatisticsList": [],
    "addressResolutionStatistics": {
        "2c4a785c-f1b8-4897-ab04-680d065ddb01": {
            "startTimeUTC": "2024-05-08T18:08:23.975076300Z",
            "endTimeUTC": "2024-05-08T18:08:24.301765600Z",
            "targetEndpoint": "https://xxxxxxx.documents.azure.com:443/addresses/?$resolveFor=dbs%2Fbq9PAA%3D%3D%2Fcolls%2Fbq9PALlDn7M%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=0",
            "exceptionMessage": null,
            "forceRefresh": false,
            "forceCollectionRoutingMapRefresh": false,
            "inflightRequest": false
        }
    },
    "regionsContacted": [
        "south central us"
    ],
    "retryContext": {
        "statusAndSubStatusCodes": null,
        "retryLatency": 0,
        "retryCount": 0
    },
    "metadataDiagnosticsContext": {
        "metadataDiagnosticList": [
            {
                "metaDataName": "SERVER_ADDRESS_LOOKUP",
                "startTimeUTC": "2024-05-08T18:08:23.975076300Z",
                "endTimeUTC": "2024-05-08T18:08:24.301765600Z",
                "durationinMS": 326
            }
        ]
    },
    "serializationDiagnosticsContext": {
        "serializationDiagnosticsList": null
    },
    "gatewayStatisticsList": [],
    "samplingRateSnapshot": 1.0,
    "bloomFilterInsertionCountSnapshot": 0,
    "systemInformation": {
        "usedMemory": "43442 KB",
        "availableMemory": "4150862 KB",
        "systemCpuLoad": "(2024-05-08T18:08:00.003443600Z 45.5%), (2024-05-08T18:08:05.019003200Z 34.8%), (2024-05-08T18:08:10.010209300Z 12.0%), (2024-05-08T18:08:15.018912300Z 10.0%), (2024-05-08T18:08:20.012022400Z 13.5%), (2024-05-08T18:08:25.012157100Z 9.6%)",
        "availableProcessors": 8
    },
    "clientCfgs": {
        "id": 3,
        "machineId": "uuid:406f4553-6e18-4080-82d0-15aaa9cfef24",
        "connectionMode": "DIRECT",
        "numberOfClients": 1,
        "excrgns": "[]",
        "clientEndpoints": {
            "https://xxxxx.documents.azure.com:443/": 3
        },
        "connCfg": {
            "rntbd": "(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:true)",
            "gw": "(cps:1000, nrto:PT1M, icto:PT1M, p:false)",
            "other": "(ed: true, cs: false, rv: true)"
        },
        "consistencyCfg": "(consistency: Session, mm: true, prgns: [westus2,southcentralus,eastus])",
        "proactiveInitCfg": "",
        "e2ePolicyCfg": "",
        "sessionRetryCfg": "",
        "regionScopedSessionCfg": "(rssc: true, expins: 5000000, ffprate: 0.001)"
    }
}

When does region-scoping of session tokens help?

  • The targeted account is a multi-write region account.
  • This feature will help reduce the replication target for requests targeting those logical partitions which haven't seen any transient cross-region retries.
  • The proportion of logical partitions which see transient cross-region retries should be small compared to the total number of logical partitions seen by the application, or the application itself should have a high cardinality of logical partitions.
  • The application has a regular cadence of restarts - this helps clear out the bloom filter. The more records added to the bloom filter, the higher the chances of false positives being returned.
  • The application's workload primarily consists of point operations through a given client instance. Any operation with non-logical partition scope means that following point operations have to use the global session token of that partition. For example, if a cross-partition query reads version 10 of a document, then a following read of that document also has to read at least version 10 of the same document, which can only be guaranteed by using the global session token for the partition in which the document exists.

When does the SDK-internal session container help?

  • An SDK-internal session container helps primarily in single-region write accounts where only one region sees all the writes, ensuring that the latest committed write for a logical partition can be read with, at worst, a retry against the write region.
  • It can also help with multi-region write accounts when all clients have the same preferred region order. Any service-side issue would then cause failovers in a similar manner on all clients, so a write traffic increase would be isolated to a particular region. If not, Strong consistency is a better choice when read-your-write guarantees are required, albeit with a latency increase for write operations to be globally committed. If read-your-write guarantees are not required, then moving to Eventual consistency could help.
  • If Session guarantees are still required with multi-write accounts whose clients are biased to different regions, then capturing session tokens per logical partition at the application level can help.

Open questions

  • The hub region could change, or the account-level region configuration could change.
  • When availability strategy is configured for the client, requests could also be routed to other preferred regions, which fills up the bloom filter faster.

Memory implications

  • Below is the retained size (size of the object and whatever it depends on) of RegionScopedSessionContainer with varying expected insertions.
| Expected insertions | False positive rate | Retained size |
| --- | --- | --- |
| 10,000 | 0.001 | 21 KB |
| 100,000 | 0.001 | 183 KB |
| 1 million | 0.001 | 1.8 MB |
| 10 million | 0.001 | 17.9 MB |
| 100 million | 0.001 | 179 MB |
| 1 billion | 0.001 | 1.8 GB |
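These retained sizes track the standard bloom filter sizing formula m = -n * ln(p) / (ln 2)^2. Below is a quick sketch that estimates the bit-array footprint for a given expected insertion count and false positive rate; the full RegionScopedSessionContainer retains somewhat more than the raw bit array:

```java
public final class BloomFilterSizing {

    // Optimal bloom filter bit count for n expected insertions at false positive probability p:
    // m = -n * ln(p) / (ln 2)^2
    static long optimalNumOfBits(long expectedInsertions, double fpp) {
        return (long) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        long[] insertionCounts = {10_000L, 100_000L, 1_000_000L, 10_000_000L, 100_000_000L, 1_000_000_000L};
        for (long n : insertionCounts) {
            long bits = optimalNumOfBits(n, 0.001);
            // Prints roughly 1.8 MB for 1 million insertions and roughly 1800 MB for 1 billion,
            // in line with the retained sizes in the table above.
            System.out.printf("%,d insertions -> ~%.1f MB (bit array only)%n", n, bits / 8.0 / 1_000_000.0);
        }
    }
}
```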

Follow up items

  • Extend support for bulk operations.

@github-actions github-actions bot added the Cosmos label Dec 9, 2023
@jeet1995 commented May 4, 2024

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel (Member) left a comment:

LGTM except for a few questions/comments - thank you, great work!

@nitinitt

Hey @jeet1995, 2 questions wrt Bloom Filters:

  1. What was the actual error rate, in the experiment where 800 million insertions were made in Bloom Filter with expected error rate: 0.001. (code https://github.com/Azure/azure-sdk-for-java/pull/38003/files#diff-3ebb7622a9d10ed5f6e06103b7eda616c0c9ce4e1b72e88708ddf069cb5fcff8R120), how many occurrences of EPK the Bloom Filter deduced it has, but were a false positive? Does this error rate increase as the initial capacity of Bloom Filter is lowered(to conserve memory) keeping the expected error rate same, with same no of insertions: 800million? What is the impact of these false positives?
  2. Is it possible to keep Bloom Filter in Guava Cache for a TTL amount of time i.e. 1 day/week. eg Bloom Filter 1 in Guava: Key=20240513 Value: Actual Bloom Filter1, TTL: 1 day. Bloom Filter 2 Key=20240514 and Value=Actual Bloom Filter2, TTL=1 day. In this case the old Bloom Filter can get TTL'd out and new one gets created each week/day. Would this help in reducing false positives and/or help reduce the memory footprint by lowering the capacity of Bloom Filter, as we are refreshing Bloom Filter periodically?

@jeet1995

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995

jeet1995 commented May 15, 2024

Hey @jeet1995, 2 questions wrt Bloom Filters:

  1. What was the actual error rate, in the experiment where 800 million insertions were made in Bloom Filter with expected error rate: 0.001. (code https://github.com/Azure/azure-sdk-for-java/pull/38003/files#diff-3ebb7622a9d10ed5f6e06103b7eda616c0c9ce4e1b72e88708ddf069cb5fcff8R120), how many occurrences of EPK the Bloom Filter deduced it has, but were a false positive? Does this error rate increase as the initial capacity of Bloom Filter is lowered(to conserve memory) keeping the expected error rate same, with same no of insertions: 800million? What is the impact of these false positives?

We have a unit test, SessionConsistencyWithRegionScopingTests#testFppRate - the idea is to use a shadow Set and a BloomFilter, insert numbers at random into both structures, and cross-reference them. When expected insertions is 10 million, as long as the actual insertion count stays below 10 million, the false positive rate hovers around 0.001 - 9.8K or 10.1K false positives are some of the values I have seen. If 10 million more insertions are made, then all bets are off - in the unit test the false positive count increased to 110K, a false positive rate of ~0.05. The impact of false positives is that the bloom filter will conclude that an EPK has been seen in more regions than is actually the case, so the constructed session token will include progress information from the extra regions. The more progress scoped into a session token, the more additional retries the SDK needs before it hits a replica which has caught up to that session token. (The SDK behavior as of today is to use progress from all regions, so this is an optimization: roughly 1 in 1000 requests will hit the old behavior.) I can double-check the 800 million scenario - it is hard to fit the shadow Set into memory for such an insertion count, but let me think about how to test this.
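For illustration, here is a minimal sketch of the shadow-set approach described above (not the actual SessionConsistencyWithRegionScopingTests#testFppRate code), using a smaller insertion count so the shadow Set fits comfortably in memory:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: cross-reference the bloom filter against an exact shadow Set to measure
// the observed false positive rate for a given expected insertion count and expected FPP.
public class FppRateSketch {
    public static void main(String[] args) {
        long expectedInsertions = 1_000_000L;
        double expectedFpp = 0.001;

        BloomFilter<Long> bloomFilter =
            BloomFilter.create(Funnels.longFunnel(), expectedInsertions, expectedFpp);
        Set<Long> shadowSet = new HashSet<>();

        // Insert random values into both structures up to the expected insertion count.
        while (shadowSet.size() < expectedInsertions) {
            long value = ThreadLocalRandom.current().nextLong();
            shadowSet.add(value);
            bloomFilter.put(value);
        }

        // Probe with fresh random values; count ones the shadow Set has never seen
        // but the bloom filter claims to contain - those are false positives.
        long probes = 1_000_000L;
        long falsePositives = 0L;
        for (long i = 0; i < probes; i++) {
            long candidate = ThreadLocalRandom.current().nextLong();
            if (!shadowSet.contains(candidate) && bloomFilter.mightContain(candidate)) {
                falsePositives++;
            }
        }
        System.out.printf("observed false positive rate: %.5f%n", (double) falsePositives / probes);
    }
}
```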

  1. Is it possible to keep Bloom Filter in Guava Cache for a TTL amount of time i.e. 1 day/week. eg Bloom Filter 1 in Guava: Key=20240513 Value: Actual Bloom Filter1, TTL: 1 day. Bloom Filter 2 Key=20240514 and Value=Actual Bloom Filter2, TTL=1 day. In this case the old Bloom Filter can get TTL'd out and new one gets created each week/day. Would this help in reducing false positives and/or help reduce the memory footprint by lowering the capacity of Bloom Filter, as we are refreshing Bloom Filter periodically?

Clearing out the bloom filter is not a use case for this feature - without knowing which regions an EPK was seen in, it is impossible to guarantee read-your-create / read-your-read behavior for that EPK. If we clear out the bloom filter, we effectively get eventual consistency when reading the document associated with a given EPK.

@jeet1995

Failures in the live test pipeline are partition-split tests timing out, which pass locally. Same behavior for this PR as well - #38740

@jeet1995

Merging PR as is.

@jeet1995 jeet1995 merged commit 042dfda into Azure:main May 15, 2024
34 checks passed