Cosmos Metrics: allow enabling only metrics of certain categories #33436

Merged

@FabianMeiswinkel FabianMeiswinkel commented Feb 10, 2023

Description

This PR changes which Micrometer metrics are collected by default when a MeterRegistry is registered, and allows enabling or disabling metrics by category - even at runtime in response to configuration changes. The change is motivated by feedback from a few customers who have enabled metrics.

New public API

Filtering by metric category

The CosmosMicrometerMetricsOptions gets three new methods to modify the metric categories of to-be-emitted metrics.

  • setMetricCategories - allows setting the metric categories to a specific set of categories
  • addMetricCategories - allows adding metric categories to the set of categories already being emitted
  • removeMetricCategories - allows removing a specific set of metric categories, except for the MINIMUM categories (OperationSummary and System), which are required

All three methods can be called on the CosmosMicrometerMetricsOptions instance even after building a Cosmos(Async)Client - and changes to the categories will be reflected at runtime. This allows changing the to-be-collected metrics via configuration at runtime without the need to restart CosmosClients or the app.

Sample

// Start from the minimum categories, add two, then remove two again
CosmosMicrometerMetricsOptions metricOptions = new CosmosMicrometerMetricsOptions()
    .meterRegistry(myMeterRegistry)
    .setMetricCategories(CosmosMetricCategory.MINIMUM)
    .addMetricCategories(CosmosMetricCategory.OPERATION_DETAILS, CosmosMetricCategory.REQUEST_DETAILS)
    .removeMetricCategories(CosmosMetricCategory.DIRECT_CHANNELS, CosmosMetricCategory.REQUEST_DETAILS);

CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(metricOptions);

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();
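
The combined effect of the three calls in the sample can be modeled with a plain EnumSet. This is only a sketch and not the SDK's implementation: SketchCategory and CategoryFilterSketch are hypothetical stand-ins, and treating the minimum set as exactly OPERATION_SUMMARY plus SYSTEM follows the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for CosmosMetricCategory; not the SDK type.
enum SketchCategory { OPERATION_SUMMARY, SYSTEM, OPERATION_DETAILS, REQUEST_DETAILS, DIRECT_CHANNELS }

// Hypothetical model of set/add/remove semantics with a protected minimum set.
final class CategoryFilterSketch {
    static final Set<SketchCategory> MINIMUM = Collections.unmodifiableSet(
        EnumSet.of(SketchCategory.OPERATION_SUMMARY, SketchCategory.SYSTEM));

    // Replaced as a whole on every change, so readers never see a half-updated set.
    private volatile Set<SketchCategory> enabled = MINIMUM;

    synchronized CategoryFilterSketch setCategories(SketchCategory... categories) {
        EnumSet<SketchCategory> next = EnumSet.copyOf(MINIMUM); // required categories stay on
        next.addAll(Arrays.asList(categories));
        enabled = next;
        return this;
    }

    synchronized CategoryFilterSketch addCategories(SketchCategory... categories) {
        EnumSet<SketchCategory> next = EnumSet.copyOf(enabled);
        next.addAll(Arrays.asList(categories));
        enabled = next;
        return this;
    }

    synchronized CategoryFilterSketch removeCategories(SketchCategory... categories) {
        EnumSet<SketchCategory> next = EnumSet.copyOf(enabled);
        next.removeAll(Arrays.asList(categories));
        next.addAll(MINIMUM); // attempts to remove required categories are ignored
        enabled = next;
        return this;
    }

    boolean isEnabled(SketchCategory category) {
        return enabled.contains(category);
    }

    public static void main(String[] args) {
        // Mirrors the sample: minimum, plus two categories, minus two categories.
        CategoryFilterSketch filter = new CategoryFilterSketch()
            .setCategories()
            .addCategories(SketchCategory.OPERATION_DETAILS, SketchCategory.REQUEST_DETAILS)
            .removeCategories(SketchCategory.DIRECT_CHANNELS, SketchCategory.REQUEST_DETAILS);
        for (SketchCategory category : SketchCategory.values()) {
            System.out.println(category + " enabled: " + filter.isEnabled(category));
        }
    }
}
```

In this model the sample ends up with the minimum categories plus OPERATION_DETAILS. Because every change swaps in a new set, a later call (for example after a configuration reload) takes effect for the next metric that checks the filter.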

Adjusting percentiles/histogram capturing

For some of the metrics collected in the Cosmos SDK, percentiles (0.95 and 0.99) and histograms (which allow percentiles to be calculated across multiple client machines) are published by default. Collecting percentiles, and especially histograms, has some memory and CPU overhead - so there is a new API that allows finer-grained configuration of which percentiles to collect and whether histograms should be published.
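
To illustrate why histograms matter across machines: a percentile computed on one client cannot be combined with a percentile from another client, but bucket counts can be added and a percentile estimated from the merged counts. The sketch below uses hypothetical bucket bounds and counts and the class name HistogramMergeSketch is made up; it does not reflect how Micrometer or the SDK lay out their buckets.

```java
// Hypothetical illustration of merging histograms from two client machines.
final class HistogramMergeSketch {
    // Assumed upper bucket bounds in milliseconds, shared by all clients.
    static final double[] UPPER_BOUNDS_MS = {1, 5, 10, 50, 100};

    // Adds per-bucket counts from two clients that use the same bounds.
    static long[] merge(long[] a, long[] b) {
        long[] merged = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = a[i] + b[i];
        }
        return merged;
    }

    // Returns the upper bound of the first bucket whose cumulative count
    // reaches the requested percentile (a coarse, bucket-level estimate).
    static double percentileUpperBound(long[] counts, double percentile) {
        long total = 0;
        for (long count : counts) {
            total += count;
        }
        long rank = (long) Math.ceil(percentile * total);
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            cumulative += counts[i];
            if (cumulative >= rank) {
                return UPPER_BOUNDS_MS[i];
            }
        }
        return UPPER_BOUNDS_MS[UPPER_BOUNDS_MS.length - 1];
    }

    public static void main(String[] args) {
        long[] busyClient = {900, 80, 20, 0, 0}; // 1000 fast requests
        long[] idleClient = {0, 0, 1, 8, 1};     // 10 slow requests
        System.out.println("busy p95 <= " + percentileUpperBound(busyClient, 0.95) + " ms");
        System.out.println("idle p95 <= " + percentileUpperBound(idleClient, 0.95) + " ms");
        System.out.println("merged p95 <= "
            + percentileUpperBound(merge(busyClient, idleClient), 0.95) + " ms");
    }
}
```

With these made-up counts, the per-client estimates are 5 ms and 100 ms, while the estimate from the merged buckets is 5 ms, because the busy client dominates the request volume. Averaging the two per-client percentiles would give a misleading 52.5 ms.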

Default settings

The default settings can be overridden on the CosmosMicrometerMetricsOptions. The defaults apply whenever a meter is registered in Micrometer for the first time, unless there is a meter-specific override. Some of the settings (such as whether a meter is enabled) can still be changed afterwards at runtime.

Sample

CosmosMicrometerMetricsOptions metricOptions = new CosmosMicrometerMetricsOptions()
    .meterRegistry(myMeterRegistry)
    .configureDefaultPercentiles(0.9) // only collects 90th percentile
    .enableHistogramsByDefault(false); // disables histograms

CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(metricOptions);

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();

Overriding meter-specific settings

The sample below adds a meter-specific override of the settings (in this case for percentiles and whether to publish histograms) for the cosmos.client.op.RUs meter.

Sample

CosmosMicrometerMetricsOptions metricOptions = new CosmosMicrometerMetricsOptions()
    .meterRegistry(myMeterRegistry)
    .configureDefaultPercentiles(0.9) // only collects 90th percentile
    .enableHistogramsByDefault(false); // disables histograms

CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(metricOptions);

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();

// The override is applied on the same options instance after the client has been built
metricOptions
    .configureMeter(
        CosmosMetricName.OPERATION_SUMMARY_REQUEST_CHARGE,
        new CosmosMicrometerMeterOptions()
            .configurePercentiles(0.75, 0.9) // only collect 75th and 90th percentile instead of default 95th and 99th
            .enableHistograms(false) // disable histograms
    );
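
One way to picture how defaults and meter-specific overrides interact is a lookup that prefers the override and falls back to the default. The sketch below is a hypothetical model, not the SDK's code: MeterSettingsSketch and MeterOverride are invented names, and the per-setting fallback (an override that leaves a setting unset inherits the default for that setting) is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "default settings plus meter-specific override".
final class MeterSettingsSketch {
    // Assumption: a null field means "not set, inherit the default".
    static final class MeterOverride {
        final double[] percentiles;
        final Boolean histograms;

        MeterOverride(double[] percentiles, Boolean histograms) {
            this.percentiles = percentiles;
            this.histograms = histograms;
        }
    }

    // Defaults taken from the description above: 0.95 and 0.99, histograms on.
    private volatile double[] defaultPercentiles = {0.95, 0.99};
    private volatile boolean defaultHistograms = true;
    private final Map<String, MeterOverride> overrides = new ConcurrentHashMap<>();

    MeterSettingsSketch configureDefaultPercentiles(double... percentiles) {
        defaultPercentiles = percentiles.clone();
        return this;
    }

    MeterSettingsSketch enableHistogramsByDefault(boolean enabled) {
        defaultHistograms = enabled;
        return this;
    }

    MeterSettingsSketch configureMeter(String meterName, MeterOverride override) {
        overrides.put(meterName, override);
        return this;
    }

    // Override wins when present and set; otherwise the default applies.
    double[] percentilesFor(String meterName) {
        MeterOverride override = overrides.get(meterName);
        return (override != null && override.percentiles != null)
            ? override.percentiles.clone()
            : defaultPercentiles.clone();
    }

    boolean histogramsFor(String meterName) {
        MeterOverride override = overrides.get(meterName);
        return (override != null && override.histograms != null)
            ? override.histograms
            : defaultHistograms;
    }
}
```

In this model, configuring 0.9 as the default and 0.75/0.9 for cosmos.client.op.RUs yields 0.75/0.9 for that meter and 0.9 for every other meter.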

Configuring tags

For each of the meters emitted by the Cosmos DB SDK we associate a certain set of tags/dimensions. Tags have a certain overhead, especially when they have a high cardinality (for example, the cardinality of ServiceEndpoint, ServiceAddress and PartitionKeyRangeId varies with the number of physical partitions). These tags/dimensions are very useful when trying to triage why latency is higher than expected, or whether certain errors only come from a certain client machine or backend endpoint, so whether or not to collect them is a trade-off. The following APIs allow overriding the default behavior for which tags to use.

Defaults

CosmosMicrometerMetricsOptions metricOptions = new CosmosMicrometerMetricsOptions()
    .meterRegistry(myMeterRegistry)
    .configureDefaultTagNames(CosmosMetricTagName.MINIMUM, CosmosMetricTagName.PARTITION_KEY_RANGE_ID); // reduces tags to a minimum but adds PKRangeId

CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(metricOptions);

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();

Meter specific override

CosmosMicrometerMetricsOptions metricOptions = new CosmosMicrometerMetricsOptions()
    .meterRegistry(myMeterRegistry)
    .configureDefaultTagNames(CosmosMetricTagName.ALL); // Enables all tags

CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(metricOptions);

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();

metricOptions
    .configureMeter(
        CosmosMetricName.OPERATION_SUMMARY_REQUEST_CHARGE,
        new CosmosMicrometerMeterOptions()
            .suppressTagNames(CosmosMetricTagName.SERVICE_ENDPOINT, CosmosMetricTagName.SERVICE_ADDRESS, CosmosMetricTagName.PARTITION_KEY_RANGE_ID) // suppresses the high-cardinality tags ServiceEndpoint, ServiceAddress and PKRangeId for RU charge meter only
    );
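
The cost of high-cardinality tags can be made concrete with a rough upper bound: the number of distinct time series for a meter is at most the product of the distinct values of each tag attached to it. The sketch below is hypothetical (SketchTag, TagCardinalitySketch and all value counts are made up for illustration) and only models "default tags minus suppressed tags".

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a few tag names; not the SDK's CosmosMetricTagName.
enum SketchTag { CONTAINER, OPERATION, SERVICE_ENDPOINT, SERVICE_ADDRESS, PARTITION_KEY_RANGE_ID }

final class TagCardinalitySketch {
    // Effective tags for one meter: the defaults minus the meter-specific suppressions.
    static EnumSet<SketchTag> effectiveTags(Set<SketchTag> defaults, Set<SketchTag> suppressed) {
        EnumSet<SketchTag> result = EnumSet.noneOf(SketchTag.class);
        result.addAll(defaults);
        result.removeAll(suppressed);
        return result;
    }

    // Upper bound on the number of time series: product of distinct values per tag.
    static long maxSeries(Set<SketchTag> tags, Map<SketchTag, Integer> distinctValues) {
        long series = 1;
        for (SketchTag tag : tags) {
            series *= distinctValues.getOrDefault(tag, 1);
        }
        return series;
    }

    public static void main(String[] args) {
        // Made-up cardinalities for a container with 50 physical partitions.
        Map<SketchTag, Integer> distinctValues = new EnumMap<>(SketchTag.class);
        distinctValues.put(SketchTag.CONTAINER, 2);
        distinctValues.put(SketchTag.OPERATION, 5);
        distinctValues.put(SketchTag.SERVICE_ENDPOINT, 50);
        distinctValues.put(SketchTag.SERVICE_ADDRESS, 200);
        distinctValues.put(SketchTag.PARTITION_KEY_RANGE_ID, 50);

        Set<SketchTag> all = EnumSet.allOf(SketchTag.class);
        Set<SketchTag> reduced = effectiveTags(all, EnumSet.of(
            SketchTag.SERVICE_ENDPOINT, SketchTag.SERVICE_ADDRESS, SketchTag.PARTITION_KEY_RANGE_ID));

        System.out.println("all tags:     up to " + maxSeries(all, distinctValues) + " series");
        System.out.println("reduced tags: up to " + maxSeries(reduced, distinctValues) + " series");
    }
}
```

In practice not every tag combination occurs, so the real number of series is lower than this bound, but the sketch shows why suppressing a few partition-dependent tags shrinks the worst case by orders of magnitude.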

Disabling individual meters

As described above, filtering of which meters to collect would usually be done by meter category. If there is still a need to control on a per-meter basis whether a meter is emitted, the CosmosMicrometerMeterOptions.setEnabled(boolean) API can be used.

CosmosMicrometerMetricsOptions metricOptions = new CosmosMicrometerMetricsOptions()
    .meterRegistry(myMeterRegistry);

CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(metricOptions);

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();

metricOptions
    .configureMeter(
        CosmosMetricName.OPERATION_SUMMARY_REQUEST_CHARGE,
        new CosmosMicrometerMeterOptions()
            .setEnabled(false) // disables collection of this meter even when the category-based filtering would have included it
    );
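
The precedence described here (an explicitly disabled meter is not emitted even if its category is enabled) can be sketched as a simple check. MeterSwitchSketch is a hypothetical name, and the sketch deliberately models only the case stated above; whether setEnabled(true) can re-enable a meter whose category is filtered out is not covered by this description and is not modeled.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: a per-meter "disabled" switch layered on top of category filtering.
final class MeterSwitchSketch {
    // Concurrent set so the switch can be flipped at runtime from another thread.
    private final Set<String> disabledMeters = ConcurrentHashMap.newKeySet();

    void setEnabled(String meterName, boolean enabled) {
        if (enabled) {
            disabledMeters.remove(meterName);
        } else {
            disabledMeters.add(meterName);
        }
    }

    // A meter is emitted only if its category is enabled and it was not disabled explicitly.
    boolean shouldEmit(String meterName, boolean categoryEnabled) {
        return categoryEnabled && !disabledMeters.contains(meterName);
    }
}
```

For example, after setEnabled("cosmos.client.op.RUs", false), shouldEmit returns false for that meter even when its category is enabled, while other meters in the same category are unaffected.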

Behavioral breaking change

  • NOTE: Technically, changing which metrics are collected by default is a breaking change. We strongly believe that the new default metrics provide the most critical metrics and reduce the overhead of collecting metrics - especially when a container has a high number of partitions. It is still possible to fall back to the behavior of versions up to 4.40.0 (all metrics enabled) by calling .setMetricCategories(CosmosMetricCategory.ALL) on the CosmosMicrometerMetricsOptions passed to the CosmosClientTelemetryConfig - see below for an example
CosmosClientTelemetryConfig inputClientTelemetryConfig = new CosmosClientTelemetryConfig()
    .metricsOptions(
        new CosmosMicrometerMetricsOptions()
            .meterRegistry(myMeterRegistry)
            .setMetricCategories(CosmosMetricCategory.ALL));

this.client = getClientBuilder()
    .clientTelemetryConfig(inputClientTelemetryConfig)
    .buildClient();
  • In addition, this PR deprecates the API CosmosClientTelemetryConfig.metricTagNames - as described above, the APIs CosmosMicrometerMetricsOptions.configureDefaultTagNames or CosmosMicrometerMeterOptions.suppressTagNames can be used instead.

Considered alternatives

  • One way to avoid the breaking behavior change would be to keep all metrics enabled by default, but still offer the new metricCategories API to make it easier to filter which metrics are emitted (and to change the filtering at runtime).
  • In theory the Micrometer MeterRegistry API allows filtering out unwanted metrics as well. One alternative to introducing the .metricCategories API in the Cosmos DB SDK at all would be to rely completely on the Micrometer MeterRegistry API to filter out metrics. The drawbacks are that this wouldn't be possible at runtime, and that it wouldn't avoid the overhead of recording the unwanted metrics - so it would still incur part of the overhead for these metrics.
  • To add a few more specifics on why I don't think filtering out metrics at runtime with Micrometer APIs is realistic: whenever a CompositeMeterRegistry is used, the parent (the composite registry) is intentionally very hesitant to filter out meters, because the goal is to allow adding a new child registry at runtime. Filtering is done conservatively to allow for scenarios where all child registries initially filter out a metric, but a new registry is added later that accepts the meter, so it is supposed to be emitted. More details here - MeterFilter do not work for composite meter registres spring-projects/spring-boot#28188 - the discussion in that GitHub issue also shows that other libraries, like WebClient, allow more granular control over whether/which metrics to emit.

Internal implementation considerations

  • It should be possible to change the metricCategories at runtime, even after the client has been built.
  • For simplicity, the metric name definitions were reorganized into the same categories that customers can use to enable/disable metrics of a certain category.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes] - see above notes regarding the breaking change.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@azure-sdk (Collaborator)

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel (Member, Author)

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel FabianMeiswinkel merged commit 0c0625d into Azure:main Feb 17, 2023