faultInjectionOnGateway #35378

Merged 20 commits on Jul 12, 2023

@xinlian12 xinlian12 commented Jun 8, 2023

Description

Adds fault injection support in the gateway layer, plus staled-address injection (STALED_ADDRESSES_SERVER_GONE) in the RNTBD layer.
Fault injection in the RNTBD layer was introduced in #33329.

Design

High level (FaultInjection on gateway)

From a public API perspective, there are no major changes. Customers configure fault injection behavior by creating a FaultInjectionRule. Each rule has three main components: condition, result, and id/lifecycle. See #33329 for more details. However, the gateway supports a narrower set of error scenarios, as described below.

result: Type of FaultInjectionServerErrorResult

FaultInjectionConnectionErrorResult cannot be configured on the gateway. The Java SDK internally uses the Netty HTTP client, which manages the lifecycle of all connections, so connections are treated as a black box.

FaultInjectionResultBuilders
    .getResultBuilder(FaultInjectionServerErrorType.TOO_MANY_REQUEST)
    .times(1)
    .build()

Supported server error type:

RESPONSE_DELAY, // can be used to simulate transient timeout/broken connections
CONNECTION_DELAY, // can be used to simulate network issue/connection timeout
TOO_MANY_REQUEST,
READ_SESSION_NOT_AVAILABLE,
TIMEOUT,
RETRY_WITH,
INTERNAL_SERVER_ERROR,
SERVICE_UNAVAILABLE

Note: GONE is not supported on the gateway. The gateway retries GONE itself, so the SDK receives a 503 instead.

Supported operation type:

    READ_ITEM,
    QUERY_ITEM,
    CREATE_ITEM,
    UPSERT_ITEM,
    REPLACE_ITEM,
    DELETE_ITEM,
    PATCH_ITEM,
    METADATA_REQUEST_CONTAINER,
    METADATA_REQUEST_DATABASE_ACCOUNT,
    METADATA_REQUEST_PARTITION_KEY_RANGES,
    METADATA_REQUEST_REFRESH_ADDRESSES,
    METADATA_REQUEST_QUERY_PLAN

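| OperationType | GONE | RESPONSE_DELAY | CONNECTION_DELAY | TOO_MANY_REQUEST | READ_SESSION_NOT_AVAILABLE | TIMEOUT | RETRY_WITH | INTERNAL_SERVER_ERROR | SERVICE_UNAVAILABLE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `*_ITEM` | ✖️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| `METADATA_REQUEST_*` | ✖️ | ✔️ | ✔️ | ✔️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
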
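The support matrix above can be read as a simple lookup. The following is a minimal sketch that encodes it in plain Java; `GatewaySupportMatrix` and its nested `ErrorType` enum are hypothetical names used only for illustration and are not part of the SDK.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the gateway support matrix above; not SDK code.
final class GatewaySupportMatrix {
    enum ErrorType {
        GONE, RESPONSE_DELAY, CONNECTION_DELAY, TOO_MANY_REQUEST,
        READ_SESSION_NOT_AVAILABLE, TIMEOUT, RETRY_WITH,
        INTERNAL_SERVER_ERROR, SERVICE_UNAVAILABLE
    }

    // METADATA_REQUEST_* row: only the two delay types and TOO_MANY_REQUEST.
    private static final Set<ErrorType> METADATA_SUPPORTED = EnumSet.of(
        ErrorType.RESPONSE_DELAY, ErrorType.CONNECTION_DELAY, ErrorType.TOO_MANY_REQUEST);

    // *_ITEM row: everything except GONE.
    private static final Set<ErrorType> ITEM_SUPPORTED =
        EnumSet.complementOf(EnumSet.of(ErrorType.GONE));

    // operationType is the enum constant name, e.g. "READ_ITEM".
    static boolean isSupported(String operationType, ErrorType errorType) {
        boolean isMetadata = operationType.startsWith("METADATA_REQUEST_");
        return (isMetadata ? METADATA_SUPPORTED : ITEM_SUPPORTED).contains(errorType);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("READ_ITEM", ErrorType.TIMEOUT));
        System.out.println(isSupported("METADATA_REQUEST_QUERY_PLAN", ErrorType.TIMEOUT));
    }
}
```

A check like this could be useful in test helpers to fail fast on an unsupported combination before a rule is built.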
| Name | Notes |
| --- | --- |
| times | Optional. Limits how many times the same rule can be applied to an operation. For a metadata request rule, if the metadata request is initiated by or attached to a data plane operation, this is effectively how many times that data plane operation observes the metadata-related error. |
| delay | Required for RESPONSE_DELAY and CONNECTION_DELAY. |
| suppressServiceRequests | Used only for RESPONSE_DELAY. When the injected delay exceeds the network request timeout, it controls whether the request is still sent to the server. Useful for testing retriable writes. |
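To make the suppressServiceRequests note concrete, here is a minimal sketch of the decision it describes. `ResponseDelaySketch` and the `Outcome` names are hypothetical, and the behavior when the delay does not exceed the timeout is an assumption rather than something stated above.

```java
import java.time.Duration;

// Hypothetical model of the suppressServiceRequests decision; not SDK code.
final class ResponseDelaySketch {
    enum Outcome {
        SENT_RESPONSE_DELAYED,      // assumption: delay fits within the timeout
        SENT_THEN_CLIENT_TIMEOUT,   // server still receives the request
        NOT_SENT_CLIENT_TIMEOUT     // request is suppressed before reaching the server
    }

    static Outcome decide(Duration injectedDelay,
                          Duration networkRequestTimeout,
                          boolean suppressServiceRequests) {
        // The flag only matters when the injected delay exceeds the timeout.
        if (injectedDelay.compareTo(networkRequestTimeout) <= 0) {
            return Outcome.SENT_RESPONSE_DELAYED;
        }
        return suppressServiceRequests
            ? Outcome.NOT_SENT_CLIENT_TIMEOUT
            : Outcome.SENT_THEN_CLIENT_TIMEOUT;
    }
}
```

The two timeout outcomes are what matter for retriable-write tests: in one the write may already have been committed by the server when the client retries, in the other it never was.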

condition: Type of FaultInjectionCondition. A rule is applied only if all of its configured conditions match.

new FaultInjectionConditionBuilder()
     .connectionType(FaultInjectionConnectionType.GATEWAY)
     .region("East US 2")
     .operationType(FaultInjectionOperationType.METADATA_REQUEST_ADDRESS_REFRESH)
     .endpoints(
          new FaultInjectionEndpointBuilder(FeedRange.forLogicalPartition(new PartitionKey("Test"))).build()
     )
     .build();
| Name | Default | Notes |
| --- | --- | --- |
| OperationType | Null | Optional. Ignored for SERVER_CONNECTION_DELAY. |
| ConnectionType | FaultInjectionConnectionType.DIRECT | For a rule configured with a metadata operationType, the connection type defaults to FaultInjectionConnectionType.GATEWAY. For rules on the gateway, use FaultInjectionConnectionType.GATEWAY. |
| Region | Null | Optional. If not defined, the rule applies in all available regions. |
| Endpoints | Null | Optional. Type of FaultInjectionEndpoint. Use it to filter down to a specific partition. Ignored for metadata operation types except METADATA_REQUEST_REFRESH_ADDRESSES. Ignored for error type CONNECTION_DELAY. |
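The "all conditions must match" rule, with unset optional fields matching anything, can be sketched as follows. `ConditionMatchSketch` is a hypothetical stand-in that uses plain strings, and the case-insensitive region comparison is an assumption made for this illustration.

```java
// Hypothetical model of condition matching; not SDK code.
final class ConditionMatchSketch {
    static final class Condition {
        final String connectionType; // required in this sketch
        final String region;         // null = all regions
        final String operationType;  // null = any operation

        Condition(String connectionType, String region, String operationType) {
            this.connectionType = connectionType;
            this.region = region;
            this.operationType = operationType;
        }
    }

    // An unset (null) optional field matches any value.
    private static boolean fieldMatches(String configured, String actual) {
        return configured == null || configured.equalsIgnoreCase(actual);
    }

    // The rule applies only when every configured field matches the request.
    static boolean matches(Condition c, String connectionType, String region, String operationType) {
        return c.connectionType.equalsIgnoreCase(connectionType)
            && fieldMatches(c.region, region)
            && fieldMatches(c.operationType, operationType);
    }
}
```

For example, a condition with only the connection type set matches every gateway request in every region, while adding a region narrows it to that region.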

id/duration/hitLimit/startDelay: The unique identifier of the rule and the effective life span of the rule.
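As a rough illustration of how these lifecycle fields could bound a rule, the sketch below assumes the rule becomes effective at creation time plus startDelay, stays effective for duration, and stops once hitLimit applications have occurred. These semantics, and the `RuleLifecycleSketch` class itself, are assumptions for illustration; #33329 is the authoritative description.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a rule's effective life span; not SDK code.
// Not thread-safe: a real implementation would need an atomic counter.
final class RuleLifecycleSketch {
    private final String id;
    private final Instant effectiveFrom;
    private final Instant effectiveUntil;
    private final int hitLimit;
    private int hitCount;

    RuleLifecycleSketch(String id, Instant createdAt, Duration startDelay,
                        Duration duration, int hitLimit) {
        this.id = id;
        // Assumption: the duration window starts after startDelay has elapsed.
        this.effectiveFrom = createdAt.plus(startDelay);
        this.effectiveUntil = this.effectiveFrom.plus(duration);
        this.hitLimit = hitLimit;
    }

    String getId() {
        return id;
    }

    // Returns true and records a hit if the rule may be applied at `now`.
    boolean tryApply(Instant now) {
        boolean inWindow = !now.isBefore(effectiveFrom) && now.isBefore(effectiveUntil);
        if (!inWindow || hitCount >= hitLimit) {
            return false;
        }
        hitCount++;
        return true;
    }
}
```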

Diagnostics

faultInjectionRuleId and faultInjectionEvaluationResults are included in CosmosDiagnostics for easier debugging.

Testing scenario example

Network issue/Connection timeout for address refresh requests
FaultInjectionRule faultInjectionRule = new FaultInjectionRuleBuilder("addressRefreshDelay")
            .condition(
                new FaultInjectionConditionBuilder()
                    .region("EAST US 2")
                    .connectionType(FaultInjectionConnectionType.GATEWAY)
                    .operationType(FaultInjectionOperationType.METADATA_REQUEST_REFRESH_ADDRESSES)
                    .build())
            .result(
                FaultInjectionResultBuilders.getResultBuilder(FaultInjectionServerErrorType.CONNECTION_DELAY)
                    .delay(Duration.ofSeconds(60))
                    .build()
            )
            .build();
Broken connections
FaultInjectionRule faultInjectionRule = new FaultInjectionRuleBuilder("addressRefreshDelay")
            .condition(
                new FaultInjectionConditionBuilder()
                    .region("EAST US 2")
                    .connectionType(FaultInjectionConnectionType.GATEWAY)
                    .operationType(FaultInjectionOperationType.METADATA_REQUEST_REFRESH_ADDRESSES)
                    .build())
            .result(
                FaultInjectionResultBuilders.getResultBuilder(FaultInjectionServerErrorType.RESPONSE_DELAY)
                    .delay(Duration.ofSeconds(5))
                    .build()
            )
            .build();
Server returns TOO_MANY_REQUEST for all partitions
FaultInjectionRule faultInjectionRule = new FaultInjectionRuleBuilder("addressRefreshDelay")
            .condition(
                new FaultInjectionConditionBuilder()
                    .region("EAST US 2")
                    .connectionType(FaultInjectionConnectionType.GATEWAY)
                    .operationType(FaultInjectionOperationType.METADATA_REQUEST_REFRESH_ADDRESSES)
                    .build())
            .result(
                FaultInjectionResultBuilders.getResultBuilder(FaultInjectionServerErrorType.TOO_MANY_REQUEST)
                    .times(3)
                    .build()
            )
            .build();
Server returns TOO_MANY_REQUEST for one partition
FaultInjectionRule faultInjectionRule = new FaultInjectionRuleBuilder("addressRefreshDelay")
            .condition(
                new FaultInjectionConditionBuilder()
                    .region("EAST US 2")
                    .connectionType(FaultInjectionConnectionType.GATEWAY)
                    .operationType(FaultInjectionOperationType.METADATA_REQUEST_REFRESH_ADDRESSES)
                    .endpoints(
                        new FaultInjectionEndpointBuilder(FeedRange.forLogicalPartition(new PartitionKey("Test"))).build()
                    )
                    .build())
            .result(
                FaultInjectionResultBuilders.getResultBuilder(FaultInjectionServerErrorType.TOO_MANY_REQUEST)
                    .times(3)
                    .build()
            )
            .build();

High level (STALED_ADDRESSES_SERVER_GONE)

This error type simulates requests failing with a GONE error because the SDK used staled addresses (which can happen for many reasons, such as a partition split, merge, or migration). Unlike other error types, where you control how many times the error is injected, STALED_ADDRESSES_SERVER_GONE is cleared only after an address refresh with forceRefresh has happened.

While the rule is in effect, each operation goes through the following cycle:

  graph LR
  A[Request] --> B{Match fault injection rule}
  B -- Yes --> C[Inject GONE exception]
  B -- No -->  D[Original flow]
  C --> E[Retry]
  E --> F{Has forceRefresh address happened}
  F -- Yes --> D
  F -- No --> C
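The cycle in the graph above can be sketched as a small state machine: GONE keeps being injected on every matching attempt until a forced address refresh is observed, after which the request follows the original flow. `StaledAddressesSketch` is a hypothetical model for illustration, and keeping the state per operation is an assumption.

```java
// Hypothetical model of the STALED_ADDRESSES_SERVER_GONE cycle; not SDK code.
final class StaledAddressesSketch {
    private boolean forceRefreshHappened = false;

    // Called when an address refresh with forceRefresh completes.
    void onForceAddressRefresh() {
        forceRefreshHappened = true;
    }

    // GONE is injected for matching requests until a forced refresh has happened.
    boolean shouldInjectGone(boolean matchesRule) {
        return matchesRule && !forceRefreshHappened;
    }

    // Counts attempts until the original flow is reached, assuming the forced
    // refresh happens after `refreshAfterFailures` injected GONE errors.
    static int attemptsUntilOriginalFlow(int refreshAfterFailures) {
        if (refreshAfterFailures < 1) {
            throw new IllegalArgumentException("refreshAfterFailures must be >= 1");
        }
        StaledAddressesSketch rule = new StaledAddressesSketch();
        int attempts = 0;
        int failures = 0;
        while (true) {
            attempts++;
            if (!rule.shouldInjectGone(true)) {
                return attempts; // original flow reached
            }
            failures++;
            if (failures >= refreshAfterFailures) {
                rule.onForceAddressRefresh();
            }
        }
    }
}
```

Note that there is no counter comparable to `times` here: the only exit from the retry loop is the forced refresh.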

Other changes included

  • Currently, gatewayDiagnostics tracks diagnostics only for the last attempt (try or retry). This PR changes it to use gatewayDiagnosticsList, which tracks the details of every attempt.
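
The difference between keeping only the last attempt and keeping a list can be sketched as below. `GatewayDiagnosticsSketch` and its string entries are hypothetical simplifications; the real diagnostics objects carry far more detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of per-attempt gateway diagnostics; not SDK code.
final class GatewayDiagnosticsSketch {
    private final List<String> attempts = new ArrayList<>();

    // Append one entry per try/retry instead of overwriting a single field.
    void recordAttempt(int statusCode, String faultInjectionRuleId) {
        attempts.add("statusCode=" + statusCode + ", faultInjectionRuleId=" + faultInjectionRuleId);
    }

    // What a list-based design exposes: every attempt, in order.
    List<String> all() {
        return Collections.unmodifiableList(attempts);
    }

    // What a last-only design would expose; null if nothing was recorded.
    String last() {
        return attempts.isEmpty() ? null : attempts.get(attempts.size() - 1);
    }
}
```

With the list, an injected 429 on the first attempt stays visible even when a later retry succeeds.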


@github-actions github-actions bot added the Cosmos label Jun 8, 2023
azure-sdk commented Jun 8, 2023

API change check

APIView has identified API-level changes in this PR and created the following API reviews.

azure-cosmos

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12 xinlian12 requested a review from Pilchie as a code owner July 7, 2023 17:23

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM - Thanks

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

No commit pushedDate could be found for PR 35378 in repo Azure/azure-sdk-for-java

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kushagraThapar kushagraThapar left a comment

LGTM, thanks @xinlian12

sdk/cosmos/azure-cosmos-test/CHANGELOG.md (outdated, resolved)
sdk/cosmos/azure-cosmos/CHANGELOG.md (outdated, resolved)
}

this.gatewayProxy.configureFaultInjectorProvider(injectorProvider, this.configs);

This will always be configured now, even if the fault injector is in direct mode, right?

@xinlian12

Yes. Regardless of whether the client runs in gateway or direct mode, some requests always go to the gateway.

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

No commit pushedDate could be found for PR 35378 in repo Azure/azure-sdk-for-java

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12 xinlian12 merged commit 40e0e3f into Azure:main Jul 12, 2023
50 checks passed
4 participants