Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cosmos: Diagnostic improvements #33209

Merged

Conversation

FabianMeiswinkel
Copy link
Member

@FabianMeiswinkel FabianMeiswinkel commented Jan 30, 2023

Description

We are planning to make a few improvements to the diagnostics capabilities in the Cosmos DB SDK in this PR

  • Adding an interface CosmosDiagnosticsHandler that customers can use to build their own diagnostics handlers
  • Adding a class CosmosDiagnosticsContext to represent metdata of an SDK operation that is relevant for diagnostics handlers - it could span multiple CosmosDiagnostics instances - for example in the case of cross-regional retries or speculative processing
  • Adding an implementation of the CosmosDiagnosticsHandler interface that will emit logs to log4j when operations hit certain conditions (certain status codes, latency exceeding certain threshold) - the configuration of the conditions can be done via CosmosDiagnosticsThresholds. This logger CosmosDiagnosticsLogger is exposed via CosmosDiagnosticsHandler.DEFAULT_LOGGING_HANDLER
  • Change the OpenTelemetry tracing to emit eventshonoring the semantic profile documentd here - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md#microsoft-azure-cosmos-db - also adding an option to keep today's traces to allow customers to keep the current traces if needed (via system property).
  • Some refactoring to unify different diagnostic code paths - mainly to ensure that other diagnostics like Client Metrics (emitted via micrometer Meter registry) and Client Telemetry (statistics of end-to-end behavior sent to the Cosmos DB service to allow more proactive alerting/support) are also using implementations of the CosmosDiagnosticsHandler interface.

Code snippets showing the usage of the new API

Enabling default logging to log4j

CosmosClientBuilder builder = this
    .getClientBuilder()
    .clientTelemetryConfig(
        new CosmosClientTelemetryConfig()
            .diagnosticsHandler(CosmosDiagnosticsHandler.DEFAULT_LOGGING_HANDLER)
    );

Enabling logging to log4j customizing conditions when to log

CosmosClientBuilder builder = this
    .getClientBuilder()
    .clientTelemetryConfig(
        new CosmosClientTelemetryConfig()
            .diagnosticsThresholds(
                        new CosmosDiagnosticsThresholds()
                            .setPointOperationLatencyThreshold(Duration.ofMillis(100))
                            .setNonPointOperationLatencyThreshold(Duration.ofMillis(2000))
                            .setRequestChargeThreshold(100)
                    )
           .diagnosticsHandler(CosmosDiagnosticsHandler.DEFAULT_LOGGING_HANDLER)
    );

Enabling a custom diagnostics handler

CapturingDiagnosticsHandler capturingHandler = new CapturingDiagnosticsHandler();

CosmosClientBuilder builder = this
    .getClientBuilder()
    .clientTelemetryConfig(
        new CosmosClientTelemetryConfig()
            .diagnosticsHandler(capturingHandler)
    );

private class CapturingDiagnosticsHandler implements CosmosDiagnosticsHandler {

    private final ArrayList<CosmosDiagnosticsContext> diagnosticsContexts = new ArrayList<>();

    @Override
    public void handleDiagnostics(Context traceContext, CosmosDiagnosticsContext diagnosticsContext) {
        diagnosticsContexts.add(diagnosticsContext);
    }

    public List<CosmosDiagnosticsContext> getDiagnosticsContexts() {
        return this.diagnosticsContexts;
    }
}

Enabling a custom diagnostics handler (in this case custom logger)

CosmosClientBuilder builder = this
    .getClientBuilder()
    .clientTelemetryConfig(
        new CosmosClientTelemetryConfig()
            .diagnosticsHandler(new ConsoleOutLogger())
            .diagnosticsThresholds(
                        new CosmosDiagnosticsThresholds()
                            .setPointOperationLatencyThreshold(Duration.ofMillis(100))
                            .setNonPointOperationLatencyThreshold(Duration.ofMillis(2000))
                            .setRequestChargeThreshold(100)
                    )
    );

private static class ConsoleOutLogger implements CosmosDiagnosticsHandler {
        private final List<String> loggedMessages = new ArrayList<>();
        public ConsoleOutLogger() {
            super();
        }

        @Override
        public void handleDiagnostics(CosmosDiagnosticsContext ctx, Context traceContext) {
            logger.info("--> log - ctx: {}", ctx);
            String msg = String.format(
                "Account: %s -> DB: %s, Col:%s, StatusCode: %d:%d Diagnostics: %s",
                ctx.getAccountName(),
                ctx.getDatabaseName(),
                ctx.getContainerName(),
                ctx.getStatusCode(),
                ctx.getSubStatusCode(),
                ctx);

            this.loggedMessages.add(msg);

            System.out.println(msg);
        }

        public List<String> getLoggedMessages() {
            return this.loggedMessages;
        }
    }

Capturing CosmosDiagnosticsContext from CosmosDiagnostics (for example to get the CosmosDiagnosticsContext from one of the FeedResponse instances all mapping to the same query operation/CosmosDiagnosticsContext instance

FeedResponse<ObjectNode> feedItemResponse = cosmosAsyncContainer
            .queryItems(query, queryRequestOptionsWithCustomOpsId, ObjectNode.class)
            .byPage()
            .blockFirst();
CosmosDiagnosticsContext ctx = feedItemResponse.getCosmosDiagnostics().getDiagnosticsContext();

Other design considerations

  • The current plan is by default to break the OpenTelemtry tracing (using the new one, which will be easier to use and will be publicly documented) - but of course there are some concerns about the fact that it would be a breaking change. Please share feedback, if you think more conversations are needed to decide whether to move ahead with this or make the new OpenTelemetry traces an opt-in and stick to the legacy ones by default. --> Feedback here was that changes to the structure of OpenTelemetry events/attributes at this point is not uncommon and moving to the new semantic profile is the right approach. Fallback to previous event structure can be enabled via system property "-DCOSMOS.USE_LEGACY_TRACING=true"
  • For now I intentionally decided against pushing the diagnostics processing to a separate thread (off the hot path) because that would increase cost for the happy path. There is some risk if customers inject inefficient diagnostic handlers - but at least for now I think it is better to optimize for steady-state vs. some self-injected problems when using inefficient custom handlers.

Sample OpenTelemetry traces (captured from custom Tracer serializing in json)

Sample meta-data operation (read container in this case)

{
    "name": "dependency",
    "spanName": "readContainer.2eb2d602-9af9-4630-a87a-bebbad5b2166",
    "kind": "INTERNAL",
    "startTime": "2023-03-27T08:45:35.656536300Z",
    "endTime": "2023-03-27T08:45:35.660539100Z",
    "duration": "PT0.0040028S",
    "statusMessage": null,
    "user_agent.original": "azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18",
    "db.cosmosdb.sub_status_code": "0",
    "db.cosmosdb.status_code": "200",
    "db.cosmosdb.resource_type": "DocumentCollection",
    "db.cosmosdb.max_response_content_length_bytes": "0",
    "db.operation": "readContainer.2eb2d602-9af9-4630-a87a-bebbad5b2166",
    "db.cosmosdb.connection_mode": "direct",
    "db.cosmosdb.container": "2eb2d602-9af9-4630-a87a-bebbad5b2166",
    "db.cosmosdb.regions_contacted": "south central us",
    "net.peer.name": "localhost",
    "db.cosmosdb.request_content_length": "0",
    "db.system": "cosmosdb",
    "db.cosmosdb.retry_count": "0",
    "db.cosmosdb.operation_type": "Read",
    "db.cosmosdb.client_id": "vmId_6cd5db0c-b198-464b-b8aa-72be1b419b6a_00004",
    "db.cosmosdb.request_charge": "1.0",
    "db.name": "RxJava.SDKTest.SharedDatabase_20230327T084528_fmf"
}

Sample point operation (create item in this case) with request tracing enabled

{
    "name": "dependency",
    "spanName": "createItem.2eb2d602-9af9-4630-a87a-bebbad5b2166",
    "kind": "INTERNAL",
    "startTime": "2023-03-27T08:45:35.853501400Z",
    "endTime": "2023-03-27T08:45:35.875498600Z",
    "duration": "PT0.0219972S",
    "statusMessage": null,
    "user_agent.original": "azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18",
    "db.cosmosdb.sub_status_code": "0",
    "db.cosmosdb.status_code": "201",
    "db.cosmosdb.resource_type": "Document",
    "db.cosmosdb.max_response_content_length_bytes": "230",
    "db.operation": "createItem.2eb2d602-9af9-4630-a87a-bebbad5b2166",
    "db.cosmosdb.connection_mode": "direct",
    "db.cosmosdb.container": "2eb2d602-9af9-4630-a87a-bebbad5b2166",
    "db.cosmosdb.regions_contacted": "south central us",
    "net.peer.name": "localhost",
    "db.cosmosdb.request_content_length": "18",
    "db.system": "cosmosdb",
    "db.cosmosdb.retry_count": "0",
    "db.cosmosdb.operation_type": "Create",
    "db.cosmosdb.client_id": "vmId_6cd5db0c-b198-464b-b8aa-72be1b419b6a_00004",
    "db.cosmosdb.request_charge": "5.52",
    "db.name": "RxJava.SDKTest.SharedDatabase_20230327T084528_fmf",
    "events": [
        {
            "name": "rntbd.request",
            "timestamp": "2023-03-27T08:45:35.874498800Z",
            "rntbd.operation_type": "Create",
            "rntbd.request_charge": "5.52",
            "rntbd.resource_type": "Document",
            "rntbd.latency_channelacquisitionstarted": "PT0.001001S",
            "rntbd.lsn": "958",
            "rntbd.latency_transittime": "PT0.0199964S",
            "rntbd.backend_latency": "19.575",
            "rntbd.is_new_channel": false,
            "rntbd.request_size_bytes": 18,
            "rntbd.url": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer0/partitions/a4cb494c-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
            "rntbd.partition_key_range_id": "0",
            "rntbd.latency": "PT0.0219996S",
            "rntbd.region": "south central us",
            "rntbd.status_code": "201",
            "rntbd.response_size_bytes": 230
        }
    ]
}

Sample query with request traces - and also violating latency threshold (emitting diagnostics)

{
    "name": "dependency",
    "spanName": "queryItems.2d99aa85-ddf2-4cab-9286-367ca1d23b00",
    "kind": "INTERNAL",
    "startTime": "2023-03-27T15:18:04.407681400Z",
    "endTime": "2023-03-27T15:18:04.429718400Z",
    "duration": "PT0.022037S",
    "statusMessage": null,
    "user_agent.original": "azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18",
    "db.cosmosdb.sub_status_code": "0",
    "db.cosmosdb.status_code": "200",
    "db.cosmosdb.resource_type": "Document",
    "db.cosmosdb.max_response_content_length_bytes": "64285",
    "db.operation": "queryItems.2d99aa85-ddf2-4cab-9286-367ca1d23b00",
    "db.cosmosdb.connection_mode": "direct",
    "db.cosmosdb.container": "2d99aa85-ddf2-4cab-9286-367ca1d23b00",
    "db.cosmosdb.regions_contacted": "south central us",
    "net.peer.name": "localhost",
    "db.cosmosdb.request_content_length": "0",
    "db.system": "cosmosdb",
    "db.cosmosdb.retry_count": "0",
    "db.cosmosdb.operation_type": "Query",
    "db.cosmosdb.client_id": "vmId_6cd5db0c-b198-464b-b8aa-72be1b419b6a_00004",
    "db.cosmosdb.request_charge": "12.05",
    "db.name": "RxJava.SDKTest.SharedDatabase_20230327T151747_BwZ",
    "events": [
        {
            "name": "threshold_violation",
            "timestamp": "2023-03-27T15:18:04.428714100Z",
            "Diagnostics": "{\"spanName\":\"queryItems.2d99aa85-ddf2-4cab-9286-367ca1d23b00\",\"account\":\"localhost\",\"db\":\"RxJava.SDKTest.SharedDatabase_20230327T151747_BwZ\",\"container\":\"2d99aa85-ddf2-4cab-9286-367ca1d23b00\",\"resource\":\"Document\",\"operation\":\"Query\",\"operationId\":\"queryItems.2d99aa85-ddf2-4cab-9286-367ca1d23b00\",\"consistency\":\"Session\",\"status\":200,\"RUs\":12.05,\"maxRequestSizeInBytes\":0,\"maxResponseSizeInBytes\":64285,\"maxItems\":1000,\"actualItems\":540,\"diagnostics\":[{\"userAgent\":\"azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18\",\"cosmosDiagnostics\":{\"queryMetricsMap\":{\"{\\\"min\\\":\\\"\\\",\\\"max\\\":\\\"05C1DFFFFFFFFC\\\"},pkrId:0\":{\"retrievedDocumentCount\":268,\"retrievedDocumentSize\":63027,\"outputDocumentCount\":268,\"outputDocumentSize\":63345,\"indexHitDocumentCount\":268,\"totalQueryExecutionTime\":0.000760000,\"queryPreparationTimes\":{\"queryCompilationTime\":0.000030000,\"logicalPlanBuildTime\":0.0,\"physicalPlanBuildTime\":0.000010000,\"queryOptimizationTime\":0.0},\"indexLookupTime\":0.0,\"documentLoadTime\":0.000510000,\"runtimeExecutionTimes\":{\"queryEngineExecutionTime\":0.000100000,\"systemFunctionExecutionTime\":0.0,\"userDefinedFunctionExecutionTime\":0.0},\"documentWriteTime\":0.000020000,\"clientSideMetrics\":{\"retries\":0,\"requestCharge\":6.0,\"fetchExecutionRanges\":[{\"startTime\":1679930284.409684400,\"endTime\":1679930284.424684400,\"partitionId\":\"{\\\"min\\\":\\\"\\\",\\\"max\\\":\\\"05C1DFFFFFFFFC\\\"}\",\"numberOfDocuments\":268,\"retryCount\":0,\"activityId\":\"9200477f-ccb2-11ed-9477-67e786871a39\"}],\"partitionSchedulingTimeSpans\":[{\"{\\\"min\\\":\\\"\\\",\\\"max\\\":\\\"05C1DFFFFFFFFC\\\"}\":{\"turnaroundTime\":15,\"responseTime\":0,\"runTime\":15,\"waitTime\":0,\"numPreemptions\":1}}]},\"indexUtilizationInfo\":null,\"indexHitRatio\":1.0,\"retries\":0,\"vmexecutionTime\":0.000630000}},\"clientSideRequestStatistics\":[{\"userAgent\":\"azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18\",\"activityId\":\"9200477f-ccb2-11ed-9477-67e786871a39\",\"requestLatencyInMs\":15,\"requestStartTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"requestEndTimeUTC\":\"2023-03-27T15:18:04.424710800Z\",\"responseStatisticsList\":[{\"storeResult\":{\"storePhysicalAddress\":\"rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer19/partitions/a4cb495f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/\",\"lsn\":984,\"globalCommittedLsn\":-1,\"partitionKeyRangeId\":\"0\",\"isValid\":true,\"statusCode\":200,\"subStatusCode\":0,\"isGone\":false,\"isNotFound\":false,\"isInvalidPartition\":false,\"isThroughputControlRequestRateTooLarge\":false,\"requestCharge\":6.0,\"itemLSN\":-1,\"sessionToken\":\"0:-1#984\",\"backendLatencyInMs\":1.05,\"retryAfterInMs\":null,\"replicaStatusList\":[\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Unknown\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\"],\"transportRequestTimeline\":[{\"eventName\":\"created\",\"startTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"queued\",\"startTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"channelAcquisitionStarted\",\"startTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"pipelined\",\"startTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"durationInMilliSecs\":13.0229},{\"eventName\":\"transitTime\",\"startTimeUTC\":\"2023-03-27T15:18:04.422707300Z\",\"durationInMilliSecs\":1.0083},{\"eventName\":\"decodeTime\",\"startTimeUTC\":\"2023-03-27T15:18:04.423715600Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"received\",\"startTimeUTC\":\"2023-03-27T15:18:04.423715600Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"completed\",\"startTimeUTC\":\"2023-03-27T15:18:04.423715600Z\",\"durationInMilliSecs\":0.0}],\"rntbdRequestLengthInBytes\":474,\"rntbdResponseLengthInBytes\":64384,\"requestPayloadLengthInBytes\":27,\"responsePayloadLengthInBytes\":63345,\"channelStatistics\":{\"channelId\":\"f5819f1b\",\"channelTaskQueueSize\":0,\"pendingRequestsCount\":0,\"lastReadTime\":\"2023-03-27T15:18:04.403671700Z\",\"waitForConnectionInit\":false},\"serviceEndpointStatistics\":{\"availableChannels\":1,\"acquiredChannels\":0,\"executorTaskQueueSize\":0,\"inflightRequests\":1,\"lastSuccessfulRequestTime\":\"2023-03-27T15:18:04.407Z\",\"lastRequestTime\":\"2023-03-27T15:18:04.402Z\",\"createdTime\":\"2023-03-27T15:18:00.609140100Z\",\"isClosed\":false,\"cerMetrics\":{}}},\"requestResponseTimeUTC\":\"2023-03-27T15:18:04.424710800Z\",\"requestStartTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"requestResourceType\":\"Document\",\"requestOperationType\":\"Query\",\"requestSessionToken\":\"0:-1#984\"}],\"supplementalResponseStatisticsList\":[],\"addressResolutionStatistics\":{},\"regionsContacted\":[\"south central us\"],\"retryContext\":{\"statusAndSubStatusCodes\":null,\"retryCount\":0,\"retryLatency\":0},\"metadataDiagnosticsContext\":{\"metadataDiagnosticList\":null},\"serializationDiagnosticsContext\":{\"serializationDiagnosticsList\":null},\"gatewayStatistics\":null,\"systemInformation\":{\"usedMemory\":\"148060 KB\",\"availableMemory\":\"3022244 KB\",\"systemCpuLoad\":\"(2023-03-27T15:17:49.539367300Z 12.7%), (2023-03-27T15:17:54.541559100Z 8.6%), (2023-03-27T15:17:59.552473300Z 4.8%)\",\"availableProcessors\":8},\"clientCfgs\":{\"id\":4,\"machineId\":\"vmId_6cd5db0c-b198-464b-b8aa-72be1b419b6a\",\"connectionMode\":\"DIRECT\",\"numberOfClients\":3,\"clientEndpoints\":{\"https://localhost:8903/\":1,\"https://localhost:8081/\":3},\"connCfg\":{\"rntbd\":\"(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:true)\",\"gw\":\"(cps:1000, nrto:PT1M, icto:PT1M, p:false)\",\"other\":\"(ed: true, cs: false, rv: true)\"},\"consistencyCfg\":\"(consistency: Session, mm: true, prgns: [])\",\"proactiveInit\":null}}],\"queryPlanDiagnosticsContext\":{\"startTimeUTC\":\"2023-03-27T15:18:04.407681400Z\",\"endTimeUTC\":\"2023-03-27T15:18:04.409684400Z\",\"requestTimeline\":[{\"eventName\":\"connectionAcquired\",\"startTimeUTC\":\"2023-03-27T15:18:04.407681400Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"connectionConfigured\",\"startTimeUTC\":\"2023-03-27T15:18:04.407681400Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"requestSent\",\"startTimeUTC\":\"2023-03-27T15:18:04.407681400Z\",\"durationInMilliSecs\":1.002},{\"eventName\":\"transitTime\",\"startTimeUTC\":\"2023-03-27T15:18:04.408683400Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"received\",\"startTimeUTC\":\"2023-03-27T15:18:04.408683400Z\",\"durationInMilliSecs\":0.0}]}}},{\"userAgent\":\"azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18\",\"cosmosDiagnostics\":{\"queryMetricsMap\":{\"{\\\"min\\\":\\\"05C1DFFFFFFFFC\\\",\\\"max\\\":\\\"FF\\\"},pkrId:1\":{\"retrievedDocumentCount\":272,\"retrievedDocumentSize\":63963,\"outputDocumentCount\":272,\"outputDocumentSize\":64285,\"indexHitDocumentCount\":272,\"totalQueryExecutionTime\":0.000760000,\"queryPreparationTimes\":{\"queryCompilationTime\":0.000020000,\"logicalPlanBuildTime\":0.0,\"physicalPlanBuildTime\":0.000010000,\"queryOptimizationTime\":0.0},\"indexLookupTime\":0.0,\"documentLoadTime\":0.000500000,\"runtimeExecutionTimes\":{\"queryEngineExecutionTime\":0.000090000,\"systemFunctionExecutionTime\":0.0,\"userDefinedFunctionExecutionTime\":0.0},\"documentWriteTime\":0.000040000,\"clientSideMetrics\":{\"retries\":0,\"requestCharge\":6.05,\"fetchExecutionRanges\":[{\"startTime\":1679930284.424684400,\"endTime\":1679930284.427684400,\"partitionId\":\"{\\\"min\\\":\\\"05C1DFFFFFFFFC\\\",\\\"max\\\":\\\"FF\\\"}\",\"numberOfDocuments\":272,\"retryCount\":0,\"activityId\":\"92029170-ccb2-11ed-9477-67e786871a39\"}],\"partitionSchedulingTimeSpans\":[{\"{\\\"min\\\":\\\"05C1DFFFFFFFFC\\\",\\\"max\\\":\\\"FF\\\"}\":{\"turnaroundTime\":18,\"responseTime\":15,\"runTime\":3,\"waitTime\":15,\"numPreemptions\":1}}]},\"indexUtilizationInfo\":null,\"indexHitRatio\":1.0,\"retries\":0,\"vmexecutionTime\":0.000630000}},\"clientSideRequestStatistics\":[{\"userAgent\":\"azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18\",\"activityId\":\"92029170-ccb2-11ed-9477-67e786871a39\",\"requestLatencyInMs\":3,\"requestStartTimeUTC\":\"2023-03-27T15:18:04.424710800Z\",\"requestEndTimeUTC\":\"2023-03-27T15:18:04.427713900Z\",\"responseStatisticsList\":[{\"storeResult\":{\"storePhysicalAddress\":\"rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer41/partitions/a4cb4975-38c8-11e6-8106-8cdcd42c33be/replicas/1p/\",\"lsn\":1460,\"globalCommittedLsn\":-1,\"partitionKeyRangeId\":\"1\",\"isValid\":true,\"statusCode\":200,\"subStatusCode\":0,\"isGone\":false,\"isNotFound\":false,\"isInvalidPartition\":false,\"isThroughputControlRequestRateTooLarge\":false,\"requestCharge\":6.05,\"itemLSN\":-1,\"sessionToken\":\"1:-1#1460\",\"backendLatencyInMs\":1.045,\"retryAfterInMs\":null,\"replicaStatusList\":[\"10253:Unknown\",\"10253:Unknown\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Unknown\",\"10253:Connected\",\"10253:Unknown\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\",\"10253:Connected\"],\"transportRequestTimeline\":[{\"eventName\":\"created\",\"startTimeUTC\":\"2023-03-27T15:18:04.425710900Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"queued\",\"startTimeUTC\":\"2023-03-27T15:18:04.425710900Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"channelAcquisitionStarted\",\"startTimeUTC\":\"2023-03-27T15:18:04.425710900Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"pipelined\",\"startTimeUTC\":\"2023-03-27T15:18:04.425710900Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"transitTime\",\"startTimeUTC\":\"2023-03-27T15:18:04.425710900Z\",\"durationInMilliSecs\":0.9979},{\"eventName\":\"decodeTime\",\"startTimeUTC\":\"2023-03-27T15:18:04.426708800Z\",\"durationInMilliSecs\":1.0051},{\"eventName\":\"received\",\"startTimeUTC\":\"2023-03-27T15:18:04.427713900Z\",\"durationInMilliSecs\":0.0},{\"eventName\":\"completed\",\"startTimeUTC\":\"2023-03-27T15:18:04.427713900Z\",\"durationInMilliSecs\":0.0}],\"rntbdRequestLengthInBytes\":475,\"rntbdResponseLengthInBytes\":65325,\"requestPayloadLengthInBytes\":27,\"responsePayloadLengthInBytes\":64285,\"channelStatistics\":{\"channelId\":\"f5819f1b\",\"channelTaskQueueSize\":1,\"pendingRequestsCount\":0,\"lastReadTime\":\"2023-03-27T15:18:04.423715600Z\",\"waitForConnectionInit\":false},\"serviceEndpointStatistics\":{\"availableChannels\":1,\"acquiredChannels\":0,\"executorTaskQueueSize\":0,\"inflightRequests\":2,\"lastSuccessfulRequestTime\":\"2023-03-27T15:18:04.407Z\",\"lastRequestTime\":\"2023-03-27T15:18:04.409Z\",\"createdTime\":\"2023-03-27T15:18:00.609140100Z\",\"isClosed\":false,\"cerMetrics\":{}}},\"requestResponseTimeUTC\":\"2023-03-27T15:18:04.427713900Z\",\"requestStartTimeUTC\":\"2023-03-27T15:18:04.424710800Z\",\"requestResourceType\":\"Document\",\"requestOperationType\":\"Query\",\"requestSessionToken\":\"1:-1#1460\"}],\"supplementalResponseStatisticsList\":[],\"addressResolutionStatistics\":{},\"regionsContacted\":[\"south central us\"],\"retryContext\":{\"statusAndSubStatusCodes\":null,\"retryCount\":0,\"retryLatency\":0},\"metadataDiagnosticsContext\":{\"metadataDiagnosticList\":null},\"serializationDiagnosticsContext\":{\"serializationDiagnosticsList\":null},\"gatewayStatistics\":null,\"systemInformation\":{\"usedMemory\":\"148060 KB\",\"availableMemory\":\"3022244 KB\",\"systemCpuLoad\":\"(2023-03-27T15:17:49.539367300Z 12.7%), (2023-03-27T15:17:54.541559100Z 8.6%), (2023-03-27T15:17:59.552473300Z 4.8%)\",\"availableProcessors\":8},\"clientCfgs\":{\"id\":4,\"machineId\":\"vmId_6cd5db0c-b198-464b-b8aa-72be1b419b6a\",\"connectionMode\":\"DIRECT\",\"numberOfClients\":3,\"clientEndpoints\":{\"https://localhost:8903/\":1,\"https://localhost:8081/\":3},\"connCfg\":{\"rntbd\":\"(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:true)\",\"gw\":\"(cps:1000, nrto:PT1M, icto:PT1M, p:false)\",\"other\":\"(ed: true, cs: false, rv: true)\"},\"consistencyCfg\":\"(consistency: Session, mm: true, prgns: [])\",\"proactiveInit\":null}}],\"queryPlanDiagnosticsContext\":null}}]}"
        },
        {
            "name": "rntbd.request",
            "timestamp": "2023-03-27T15:18:04.423715600Z",
            "rntbd.operation_type": "Query",
            "rntbd.activity_id": "9200477f-ccb2-11ed-9477-67e786871a39",
            "rntbd.request_charge": "6.0",
            "rntbd.resource_type": "Document",
            "rntbd.latency_pipelined": "PT0.0130229S",
            "rntbd.request_session_token": "0:-1#984",
            "rntbd.session_token": "0:-1#984",
            "rntbd.lsn": "984",
            "rntbd.latency_transittime": "PT0.0010083S",
            "rntbd.backend_latency": "1.05",
            "rntbd.is_new_channel": false,
            "rntbd.request_size_bytes": 27,
            "rntbd.url": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer19/partitions/a4cb495f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
            "rntbd.partition_key_range_id": "0",
            "rntbd.latency": "PT0.0150264S",
            "rntbd.region": "south central us",
            "rntbd.status_code": "200",
            "rntbd.response_size_bytes": 63345
        },
        {
            "name": "rntbd.request",
            "timestamp": "2023-03-27T15:18:04.427713900Z",
            "rntbd.operation_type": "Query",
            "rntbd.activity_id": "92029170-ccb2-11ed-9477-67e786871a39",
            "rntbd.request_charge": "6.05",
            "rntbd.resource_type": "Document",
            "rntbd.latency_decodetime": "PT0.0010051S",
            "rntbd.request_session_token": "1:-1#1460",
            "rntbd.session_token": "1:-1#1460",
            "rntbd.lsn": "1460",
            "rntbd.latency_transittime": "PT0.0009979S",
            "rntbd.backend_latency": "1.045",
            "rntbd.is_new_channel": false,
            "rntbd.request_size_bytes": 27,
            "rntbd.url": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer41/partitions/a4cb4975-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
            "rntbd.partition_key_range_id": "1",
            "rntbd.latency": "PT0.0030031S",
            "rntbd.region": "south central us",
            "rntbd.status_code": "200",
            "rntbd.response_size_bytes": 64285
        }
    ]
}

Sample point read with request traces - and also having retries for 429

{
    "name": "dependency",
    "spanName": "readItem.92f049a9-3eda-4312-8a7c-8d085afb5faf",
    "kind": "INTERNAL",
    "startTime": "2023-03-28T00:53:26.160121600Z",
    "endTime": "2023-03-28T00:53:27.188184200Z",
    "duration": "PT1.0280626S",
    "statusMessage": null,
    "user_agent.original": "azsdk-java-cosmos/4.43.0-beta.1 Windows11/10.0 JRE/11.0.18",
    "db.cosmosdb.sub_status_code": "0",
    "db.cosmosdb.status_code": "200",
    "db.cosmosdb.resource_type": "Document",
    "db.cosmosdb.max_response_content_length_bytes": "233",
    "db.operation": "readItem.92f049a9-3eda-4312-8a7c-8d085afb5faf",
    "db.cosmosdb.connection_mode": "direct",
    "db.cosmosdb.container": "92f049a9-3eda-4312-8a7c-8d085afb5faf",
    "db.cosmosdb.regions_contacted": "south central us",
    "net.peer.name": "localhost",
    "db.cosmosdb.request_content_length": "0",
    "db.system": "cosmosdb",
    "db.cosmosdb.retry_count": "2",
    "db.cosmosdb.operation_type": "Read",
    "db.cosmosdb.client_id": "vmId_6cd5db0c-b198-464b-b8aa-72be1b419b6a_00002",
    "db.cosmosdb.request_charge": "1.0",
    "db.name": "RxJava.SDKTest.SharedDatabase_20230328T005306_Dzp",
    "events": [
        {
            "name": "rntbd.request",
            "timestamp": "2023-03-28T00:53:26.161123400Z",
            "rntbd.operation_type": "Read",
            "rntbd.activity_id": "f2917141-cd02-11ed-bf0f-ddabff0f227e",
            "rntbd.request_charge": "0.0",
            "rntbd.resource_type": "Document",
            "rntbd.latency_channelacquisitionstarted": "PT0.0010018S",
            "rntbd.request_session_token": "1:-1#2010",
            "rntbd.session_token": "1:-1#2010",
            "rntbd.fault_injection_rule_id": "Injected410bc2802a6-1cf9-4724-b7e6-e8a180f4bf56",
            "rntbd.is_new_channel": false,
            "rntbd.request_size_bytes": 0,
            "rntbd.url": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer0/partitions/a4cb494c-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
            "rntbd.latency": "PT0.0010018S",
            "rntbd.region": "south central us",
            "rntbd.status_code": "429",
            "rntbd.response_size_bytes": -1
        },
        {
            "name": "rntbd.request",
            "timestamp": "2023-03-28T00:53:26.675821300Z",
            "rntbd.operation_type": "Read",
            "rntbd.activity_id": "f2dfdf62-cd02-11ed-bf0f-ddabff0f227e",
            "rntbd.request_charge": "0.0",
            "rntbd.resource_type": "Document",
            "rntbd.latency_channelacquisitionstarted": "PT0.0010038S",
            "rntbd.request_session_token": "1:-1#2010",
            "rntbd.session_token": "1:-1#2010",
            "rntbd.fault_injection_rule_id": "Injected410bc2802a6-1cf9-4724-b7e6-e8a180f4bf56",
            "rntbd.is_new_channel": false,
            "rntbd.request_size_bytes": 0,
            "rntbd.url": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer0/partitions/a4cb494c-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
            "rntbd.latency": "PT0.5156997S",
            "rntbd.region": "south central us",
            "rntbd.status_code": "429",
            "rntbd.response_size_bytes": -1
        },
        {
            "name": "rntbd.request",
            "timestamp": "2023-03-28T00:53:27.186157Z",
            "rntbd.operation_type": "Read",
            "rntbd.activity_id": "f32db143-cd02-11ed-bf0f-ddabff0f227e",
            "rntbd.request_charge": "1.0",
            "rntbd.resource_type": "Document",
            "rntbd.latency_pipelined": "PT0.0009996S",
            "rntbd.request_session_token": "1:-1#2010",
            "rntbd.session_token": "1:-1#2010",
            "rntbd.lsn": "2010",
            "rntbd.backend_latency": "0.297",
            "rntbd.is_new_channel": false,
            "rntbd.request_size_bytes": 0,
            "rntbd.url": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer0/partitions/a4cb494c-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
            "rntbd.partition_key_range_id": "1",
            "rntbd.latency": "PT1.0260354S",
            "rntbd.region": "south central us",
            "rntbd.status_code": "200",
            "rntbd.response_size_bytes": 233
        }
    ]
}

Perf regression tests

Summary

Results of baseline (main branch)

image

Results of candidate (this PR)

image

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Parent feature: #33425

Initial DRAFT for public API review
@azure-sdk
Copy link
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @FabianMeiswinkel , all changes look good to me except for few minor questions and one revert ask on getDuration() API on CosmosResponse. Thanks for the awesome work.

@@ -127,6 +156,10 @@ public CosmosDiagnostics getDiagnostics() {
* @return end-to-end request latency for the current request to Azure Cosmos DB service.
*/
public Duration getDuration() {
if (resourceResponseWrapper == null) {
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. Currently if the response is null, we return Duration.ZERO. Can we please revert this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss on Monday - I wont' change it right now because I don't agree that it is a breaking change - it would have thrown a NullPointerException here - I only added this to avoid the NullPointerException. If you still think this is considered breaking I can add an internal property - but seems ok to do that in a separate PR or within the retriable write PR after we discussed on Monday.

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel FabianMeiswinkel enabled auto-merge (squash) April 2, 2023 00:17
@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

Test failure readFeedDocumentsAfterSplit has known flakiness - will override and merge the PR

@FabianMeiswinkel
Copy link
Member Author

/check-enforcer override

@FabianMeiswinkel FabianMeiswinkel merged commit 68fdcf4 into Azure:main Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Turn off CosmosDiagnostics when logging CosmosException
6 participants