FaultInjection-Direct #33329

Merged
merged 32 commits into from
Mar 8, 2023

@xinlian12 xinlian12 commented Feb 6, 2023

Description

Adds fault injection capability in the RNTBD layer.

Design

High level

Customers configure fault injection behavior by creating a FaultInjectionRule. Each rule has three major components:

result: Type of FaultInjectionServerErrorResult or FaultInjectionConnectionErrorResult

  • FaultInjectionServerErrorResult:
FaultInjectionResultBuilders
    .getResultBuilder(FaultInjectionServerErrorType.SERVER_GONE)
    .times(1)
    .build()
FaultInjectionResultBuilders
    .getResultBuilder(FaultInjectionServerErrorType.SERVER_CONNECTION_DELAY)
    .delay(Duration.ofSeconds(2))
    .times(1)
    .build()

Supported server error type:

    GONE,
    RETRY_WITH,
    INTERNAL_SERVER_ERROR,
    TOO_MANY_REQUEST,
    READ_SESSION_NOT_AVAILABLE,
    TIMEOUT,
    PARTITION_IS_MIGRATING,
    RESPONSE_DELAY,  // can be used to simulate transient timeout/broken connections
    CONNECTION_DELAY, // can be used to simulate high channel acquisition

| Name | Notes |
| --- | --- |
| times | Optional. Limits how many times the same rule can be applied to an operation. |
| delay | Required for SERVER_RESPONSE_DELAY and SERVER_CONNECTION_DELAY. |

  • FaultInjectionConnectionErrorResult:
         FaultInjectionResultBuilders
                        .getResultBuilder(errorType)
                        .interval(Duration.ofSeconds(1))
                        .threshold(1.0)
                        .build()

Supported connection error type:

    CONNECTION_CLOSE,
    CONNECTION_RESET

| Name | Notes |
| --- | --- |
| interval | Required. |
| threshold | Optional. By default, the rule acts on all established connections. |
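
One way to read the threshold parameter is as the fraction of currently established connections that each interval acts on. The sketch below illustrates that reading only; it is an assumption, not the SDK's implementation. `ThresholdSketch` and `pickChannels` are hypothetical names, and connections are represented by plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of azure-cosmos-test.
final class ThresholdSketch {
    // Picks ceil(threshold * n) connections at random.
    // A threshold of 1.0 selects every established connection.
    static List<String> pickChannels(List<String> established, double threshold, Random random) {
        if (threshold <= 0.0 || threshold > 1.0) {
            // Assumed valid range for this sketch.
            throw new IllegalArgumentException("threshold must be in (0, 1]");
        }
        List<String> shuffled = new ArrayList<>(established);
        Collections.shuffle(shuffled, random);
        int count = (int) Math.ceil(threshold * shuffled.size());
        return new ArrayList<>(shuffled.subList(0, count));
    }
}
```

With four established connections, a threshold of 1.0 selects all four and a threshold of 0.5 selects two.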

condition: Type of FaultInjectionCondition. Acts as a rule filter: a rule is applied only if all configured conditions match. When a rule is configured, the SDK pre-processes it and resolves the physical addresses if endpoints are configured.

    new FaultInjectionConditionBuilder()
                        .operationType(FaultInjectionOperationType.CREATE_ITEM)
                        .region("West US")
                        .connectionType(FaultInjectionConnectionType.DIRECT)
                        .endpoints(
                                new FaultInjectionEndpointBuilder(FeedRange.forLogicalPartition(new PartitionKey("Test")))
                                    .replicaCount(2)
                                    .includePrimary(false)
                                    .build())
                        .build()

| Name | Default | Notes |
| --- | --- | --- |
| OperationType | Null | Optional. For FaultInjectionServerErrorResult with SERVER_GONE or SERVER_CONNECTION_DELAY, ignored after the addresses are resolved. For FaultInjectionConnectionErrorResult, ignored after the addresses are resolved. |
| ConnectionType | FaultInjectionConnectionType.Direct | This PR only covers the direct connection type; gateway will be added in the next PR. |
| Region | Null | Optional. If not defined, the rule applies in all available regions. |
| Endpoints | Null | Optional. Type of FaultInjectionEndpoint. Use it to filter down to a subset of physical addresses. |
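
The "all configured conditions must match" behavior can be sketched as a filter in which an unset (null) field acts as a wildcard. This is a simplified illustration under that assumption; `ConditionSketch` is a hypothetical stand-in that only models operation type and region as strings, not the SDK's FaultInjectionCondition.

```java
// Hypothetical stand-in for a rule condition; not an SDK class.
final class ConditionSketch {
    private final String operationType; // null means "any operation"
    private final String region;        // null means "any region"

    ConditionSketch(String operationType, String region) {
        this.operationType = operationType;
        this.region = region;
    }

    // The rule applies only when every configured (non-null) field matches the request.
    boolean matches(String requestOperationType, String requestRegion) {
        boolean operationMatches = operationType == null || operationType.equals(requestOperationType);
        boolean regionMatches = region == null || region.equals(requestRegion);
        return operationMatches && regionMatches;
    }
}
```

For example, a condition with operation type CREATE_ITEM and no region matches a create request in any region, while adding region "West US" restricts it to that region.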

id/duration/hitLimit/startDelay: The unique identifier of the rule and the effective life span of the rule.

new FaultInjectionRuleBuilder(ruleId)
                .condition(condition)
                .result(result)
                .duration(Duration.ofSeconds(2))
                .startDelay(Duration.ofSeconds(2))
                .hitLimit(100)
                .build();

| Name | Notes |
| --- | --- |
| id | Required. Length 64 characters. |
| duration | Optional. Defines how long the rule is effective. |
| startDelay | Optional. If not defined, the rule is effective right away. |
| hitLimit | Optional. |
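
How startDelay, duration, and hitLimit might combine into an "is this rule effective right now" check is sketched below. The class `RuleLifeSpanSketch` is hypothetical, and the sketch assumes that duration starts counting once startDelay has elapsed; the PR text does not state that detail.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a rule's effective life span; not an SDK class.
final class RuleLifeSpanSketch {
    private final Instant effectiveFrom;
    private final Instant effectiveUntil; // null when no duration is set
    private final Long hitLimit;          // null when no hit limit is set
    private long hitCount;

    RuleLifeSpanSketch(Instant createdAt, Duration startDelay, Duration duration, Long hitLimit) {
        this.effectiveFrom = startDelay == null ? createdAt : createdAt.plus(startDelay);
        // Assumption: duration is measured from the end of startDelay.
        this.effectiveUntil = duration == null ? null : this.effectiveFrom.plus(duration);
        this.hitLimit = hitLimit;
    }

    // Returns true and counts a hit if the rule is effective at the given time.
    boolean tryApply(Instant now) {
        boolean started = !now.isBefore(effectiveFrom);
        boolean notExpired = effectiveUntil == null || now.isBefore(effectiveUntil);
        boolean underLimit = hitLimit == null || hitCount < hitLimit;
        if (started && notExpired && underLimit) {
            hitCount++;
            return true;
        }
        return false;
    }

    long getHitCount() {
        return hitCount;
    }
}
```

Passing the clock value in as a parameter keeps the check deterministic, which is convenient when the rule itself is used inside tests.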

Add fault injection rules

         CosmosFaultInjectionHelper.configFaultInjectionRules(container, Arrays.asList(serverErrorRule, connectionErrorRule)).block();

After the fault injection rules are configured successfully, the region endpoints and address details can be obtained with:

 serverErrorInjectionRule.getRegionEndpoints()
 serverErrorInjectionRule.getAddresses()

Disable fault injection rule

 serverErrorInjectionRule.disable()

Get hit count of the fault injection rule

    serverErrorRule.getHitCount();

CosmosDiagnostics

  • New faultInjectionRuleId field in the diagnostics output.

  • New lastFaultInjectionRuleId and lastFaultInjectionTimestamp fields in serviceEndpointStatistics.

Testing scenario examples

High channel acquisition/Connection timeout scenario

       FaultInjectionRule serverConnectionDelayRule =
            new FaultInjectionRuleBuilder("ServerError-ConnectionTimeout")
                .condition(
                    new FaultInjectionConditionBuilder()
                        .operationType(FaultInjectionOperationType.CREATE_ITEM)
                        .build()
                )
                .result(
                    FaultInjectionResultBuilders
                        .getResultBuilder(FaultInjectionServerErrorType.SERVER_CONNECTION_DELAY)
                        .delay(Duration.ofSeconds(6)) // default connection timeout is 5s
                        .times(1)
                        .build()
                )
                .duration(Duration.ofMinutes(5))
                .build();

Broken connections scenario

    FaultInjectionRule timeoutRule =
            new FaultInjectionRuleBuilder(timeoutRuleId)
                .condition(
                    new FaultInjectionConditionBuilder()
                        .operationType(FaultInjectionOperationType.READ_ITEM)
                        .build()
                )
                .result(
                    FaultInjectionResultBuilders
                        .getResultBuilder(FaultInjectionServerErrorType.SERVER_RESPONSE_DELAY)
                        .times(1)
                        .delay(Duration.ofSeconds(6)) // the default time out is 5s
                        .build()
                )
                .duration(Duration.ofMinutes(5))
                .build();

Server return gone exception scenario

 FaultInjectionRule serverErrorRule =
            new FaultInjectionRuleBuilder(ruleId)
                .condition(
                    new FaultInjectionConditionBuilder()
                        .operationType(FaultInjectionOperationType.READ_ITEM)
                        .build()
                )
                .result(
                    FaultInjectionResultBuilders
                        .getResultBuilder(FaultInjectionServerErrorType.SERVER_GONE)
                        .times(1)
                        .build()
                )
                .duration(Duration.ofMinutes(5))
                .build();

Random connection closing/reset scenario

FaultInjectionRule connectionErrorRule =
            new FaultInjectionRuleBuilder(ruleId)
                .condition(
                    new FaultInjectionConditionBuilder()
                        .operationType(FaultInjectionOperationType.CREATE_ITEM)
                        .endpoints(new FaultInjectionEndpointBuilder(FeedRange.forLogicalPartition(new PartitionKey(createdItem.getMypk()))).build())
                        .build()
                )
                .result(
                    FaultInjectionResultBuilders
                        .getResultBuilder(errorType)
                        .interval(Duration.ofSeconds(1))
                        .threshold(1.0)
                        .build()
                )
                .duration(Duration.ofSeconds(2))
                .build();

Implementation

azure-cosmos-test module

All the public APIs and new models mentioned above will be added to their own azure-cosmos-test module. Customers who want to use fault injection need to add the following to their pom file:

<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-cosmos-test</artifactId>
  <version>1.0.0-beta.1</version>
</dependency>

How ServerError is injected

RntbdTransportClient-> RntbdServiceEndpoint.Provider -> [0, *) RntbdServiceEndpoint -> [0, maxChannelsPerEndpoint] Connections/Channels -> RntbdRequestManager channelHandler -> RntbdServerErrorInjector.

  • SERVER_CONNECTION_DELAY - Inject during new connection establishment stage. Instead of opening connections right away, add delay and then reduce connectionTimeout based on the delay
  • SERVER_RESPONSE_DELAY - Inject delay after getting server responses - TBD
  • Other Server errors - Inject before sending request to server, returning injected error right away
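
The SERVER_CONNECTION_DELAY arithmetic described above (add the delay, then shrink the connection timeout by the same amount) can be sketched as follows. `ConnectionDelaySketch` and its method names are hypothetical, and clamping the remaining budget at zero is an assumption of this sketch rather than confirmed SDK behavior.

```java
import java.time.Duration;

// Hypothetical illustration of the delay/timeout arithmetic; not SDK code.
final class ConnectionDelaySketch {
    // Time left to establish the connection once the injected delay has passed.
    static Duration remainingConnectTimeout(Duration connectionTimeout, Duration injectedDelay) {
        Duration remaining = connectionTimeout.minus(injectedDelay);
        return remaining.isNegative() ? Duration.ZERO : remaining;
    }

    // True when the injected delay uses up the whole connection timeout.
    static boolean timesOutBeforeConnecting(Duration connectionTimeout, Duration injectedDelay) {
        return remainingConnectTimeout(connectionTimeout, injectedDelay).isZero();
    }
}
```

With the 5 second default connection timeout mentioned in the scenarios, a 2 second delay leaves 3 seconds to connect, while a 6 second delay leaves nothing, which is why that value is used to simulate a connection timeout.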

How ConnectionError is injected

RntbdTransportClient -> RntbdConnectionErrorInjector, which schedules a side task to create chaos (close/reset) based on the interval and threshold defined in the rule.

// Injection side: fire a user event on each selected channel.
for (Channel channel : channelsToBeClosedList) {
    switch (faultInjectionResult.getErrorTypes()) {
        case CONNECTION_CLOSE:
            channel.pipeline().context(RntbdRequestManager.class)
                .fireUserEventTriggered(new RntbdFaultInjectionConnectionCloseEvent());
            break;
        case CONNECTION_RESET:
            channel.pipeline().firstContext()
                .fireUserEventTriggered(new RntbdFaultInjectionConnectionResetEvent());
            break;
        default:
            throw new IllegalStateException("ConnectionErrorType " + faultInjectionResult.getErrorTypes() + " is not supported");
    }
}

// Handling side: react to the fault injection events.
if (event instanceof RntbdFaultInjectionConnectionResetEvent) {
    this.exceptionCaught(context, new IOException("Fault Injection Connection Reset"));
    return;
}

if (event instanceof RntbdFaultInjectionConnectionCloseEvent) {
    context.close(); // TODO: how to add a meaningful fault injection message
}

To Be Discussed

  • When a rule filters down to physical addresses, the addresses are a snapshot taken when the rule is created; a way to monitor address changes is still needed.
  • If multiple rules are defined and overlap, the SDK only picks the first applicable rule.
  • A server error rule and a connection error rule can be defined at the same time.
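
The "first applicable rule wins" behavior for overlapping rules can be sketched as an ordered scan. This is an illustration only: `FirstMatchSketch`, `Rule`, and `pickRuleId` are hypothetical names, and each rule is reduced to an id plus a predicate over the request's operation type.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration of first-match rule selection; not SDK code.
final class FirstMatchSketch {
    static final class Rule {
        final String id;
        final Predicate<String> appliesTo; // tests the operation type of a request

        Rule(String id, Predicate<String> appliesTo) {
            this.id = id;
            this.appliesTo = appliesTo;
        }
    }

    // Scans the rules in the order they were configured and returns the first match.
    static Optional<String> pickRuleId(List<Rule> rulesInConfiguredOrder, String operationType) {
        return rulesInConfiguredOrder.stream()
            .filter(rule -> rule.appliesTo.test(operationType))
            .map(rule -> rule.id)
            .findFirst();
    }
}
```

Under this model, the order in which overlapping rules are configured decides which one is hit, so a narrow rule placed after a broad one would never apply.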

Parent feature #33425

@ghost ghost added the Cosmos label Feb 6, 2023
@xinlian12 xinlian12 marked this pull request as ready for review February 28, 2023 17:16
@xinlian12 xinlian12 changed the title faultInjection[Draft] - NO REVIEW FaultInjection-Direct Feb 28, 2023
@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


@FabianMeiswinkel FabianMeiswinkel left a comment

Nice - looks amazing now - thanks!

@xinlian12

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

clientTelemetryWithStageJunoEndpoint: Known failed test case
before_ReadFeedDocumentsTest: Tested locally and succeeded

@xinlian12

/check-enforcer override

@xinlian12 xinlian12 merged commit b1f0f9a into Azure:main Mar 8, 2023