Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide Micrometer Metrics that capture aspects of GraphQL query latency. #52

Merged
merged 1 commit into from Mar 4, 2021

Conversation

berngp
Copy link
Contributor

@berngp berngp commented Feb 9, 2021

Provide metrics that capture the graphql endpoint latency.

This PR introduces the graphql-dgs-spring-boot-micrometer module. It has an auto-configuration that enables the capture of GraphQL query metrics, via micrometer. The GraphQLMetricsInstrumentation will provide the following metrics:

Name Kind Description
gql.query Timer Captures the time it takes to execute the query.
gql.error Counter Captures number of GraphQL errors, note that one GraphQL request can have multiple.
gql.resolver Timer Tagged with the resolver's GraphQL field via gql.field, it captures the resolver's execution time.
gql.dataLoader Timer Captures the elapse time for a data loader invocation for batch of queries. This is useful if you want to find data loaders that might be responsible for poor query performance.

All metrics are tagged with the tags provided by the DefaultGraphQLTagsProvider. Developers can override the DefaultGraphQLTagsProvider by defining their own implementation of the GraphQLTagsProvider interface.

The DefaultGraphQLTagsProvider defines the following tags.

Tag Description
outcome "SUCCESS" if the ExecutionResult has data, else an "ERROR".
gql.path (optional) ValidationError's queryPath or the GraphQLError path.1
gql.errorCode (optional) The ErrorType's name if the error is a ValidationError, else it will attempt to extract its value from the Error's extensions via the errorType key.1
gql.errorDetail (optional) it will attempt to extract its value from the GgraphQLError's extensions via the errorDetail key. For details on the cardinality of the value please see refer to the typederror interface.
gql.loaderName (specific to gql.Loader) Name of the data loader.
gql.loaderBatchSize (specific to gql.Loader) The number of queries executed in the batch.

Examples


Footnotes

  1. Remember that one request can have multiple errors. 2

@berngp berngp force-pushed the feature/metrics branch 2 times, most recently from c95dce7 to 8787ac4 Compare February 9, 2021 20:21
@berngp berngp added the enhancement New feature or request label Feb 9, 2021
@berngp berngp self-assigned this Feb 9, 2021
@berngp berngp force-pushed the feature/metrics branch 3 times, most recently from c4a93f7 to af26e4f Compare February 10, 2021 19:52
@berngp berngp changed the title Provide metrics that capture the graphql endpoint latency, via Micrometer Provide Micrometer Metrics that capture aspects of GraphQL query latency. Feb 10, 2021
@berngp berngp force-pushed the feature/metrics branch 2 times, most recently from 0a45cbc to 95bb8a7 Compare February 11, 2021 05:33
@jkschneider
Copy link

+1 for this support. Look forward to it :)


private fun recordDataFetcherTime(timerSampler: Timer.Sample, tags: Tags, registry: MeterRegistry) {
timerSampler.stop(registry,
Timer.builder("gql.resolver.time").tags(tags).publishPercentileHistogram())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest striking publishPercentileHistogram as it it can be turned on later by an application developer if they want histograms for this timer in application.properties with management.metrics.distribution.percentiles-histogram.gql=true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had no idea, thanks for the pointer.

exception: Throwable?): Iterable<Tag> {

return if (result.errors.isNotEmpty()) {
InstrumentationTags.sanitizeErrorPathsForTags(result).map {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to add the tags here vs in the instrumentation class as part of instrumentExecutionResult?


override fun instrumentExecutionResult(executionResult: ExecutionResult, parameters: InstrumentationExecutionParameters): CompletableFuture<ExecutionResult> {
sanitizeErrorPathsForTags(executionResult).forEach { _ ->
registry.counter("gql.error",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this counter? Timers always ship a count as well as other distribution statistics like max, sum, percentiles, etc. So if I understand this correctly, this data already is retrievable by filtering gql.query{outcome='ERROR'} and looking at the count statistic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please review the latest code. In the latest implementation a qgl.error is emitted per error in the Grql Request. Per the spec a request can have multiple errors.

@berngp berngp force-pushed the feature/metrics branch 2 times, most recently from c16d862 to 2894c3f Compare March 3, 2021 02:41
@berngp
Copy link
Contributor Author

berngp commented Mar 3, 2021

There is plenty of documentation to write for this feature. In the meantime the MicrometerServletSmokeTest.kt offers a hint on the metrics, and the tags, according to different scenarios.

@berngp
Copy link
Contributor Author

berngp commented Mar 4, 2021

Mining for decent. Most likely will merge tomorrow and have a soft release of this feature since I'm planning on adapting our internal implementation to it.

h2. Context

*Work in Progress*

h3. Provide metrics that capture the `graphql` endpoint latency.
This PR introduces the `graphql-dgs-spring-boot-micrometer` module. It has an _auto-configuration_ that enables
the capture of GraphQL query metrics, via [micrometer].
The `GraphQLMetricsInstrumentation` will provide the following metrics:

| Name           | Kind   | Description |
| ------------   | ------ | ----------- |
| gql.query      | Timer  | Captures the time it takes to execute the query.  |
| gql.error      | Counter| Captures number of GraphQL errors, note that one GraphQL request can have multiple.|
| gql.resolver   | Timer  | Tagged with the resolver's GraphQL field via `gql.field`, it captures the resolver's execution time.|
| gql.dataLoader | Timer  |  Captures the elapse time for a data loader invocation for batch of queries. This is useful if you want to find data loaders that might be responsible for poor query performance. |

All metrics are tagged with the tags provided by the `DefaultGraphQLTagsProvider`.
Developers can override the `DefaultGraphQLTagsProvider` by defining their own implementation
of the `GraphQLTagsProvider` interface.

The `DefaultGraphQLTagsProvider` defines the following tags.
| Tag             | Description |
| ------------    | ----------- |
| outcome         | `"SUCCESS"` if the [ExecutionResult] has data, else an `"ERROR"`. |
| gql.path        | (optional) [ValidationError]'s `queryPath` or the [GraphQLError] `path`.[^1]|
| gql.errorCode   | (optional) The [ErrorType]'s name if the error is a [ValidationError], else it will attempt to extract its value from the Error's extensions via the `errorType` key.[^1]|
| gql.errorDetail | (optional)  it will attempt to extract its value from the GgraphQLError's extensions via the `errorDetail` key. For details on the cardinality of the value please see refer to the [typederror interface](https://netflix.github.io/dgs/error-handling/#the-typederror-interface). |
| gql.loaderName | (specific to gql.Loader) Name of the data loader. |
| gql.loaderBatchSize | (specific to gql.Loader) The number of queries executed in the batch.  |

[^1]: Remember that one request can have multiple errors.

h3. Examples

* Included as part of this PR we have the [MicrometerServletSmokeTest.kt] test that depicts the metrics emitted on particular scenarios.

----

[micrometer]: https://micrometer.io/
[ErrorType]: https://github.com/graphql-java/graphql-java/blob/master/src/main/java/graphql/ErrorType.java
[ExecutionResult]: https://github.com/graphql-java/graphql-java/blob/master/src/main/java/graphql/ExecutionResult.java
[GraphQLError]:https://github.com/graphql-java/graphql-java/blob/master/src/main/java/graphql/GraphQLError.java
[ValidationError]: https://github.com/graphql-java/graphql-java/blob/master/src/main/java/graphql/validation/ValidationError.java
[MicrometerServletSmokeTest.kt]: https://github.com/Netflix/dgs-framework/blob/2894c3fc96388f0e21e4954740a558dda6626f3a/graphql-dgs-spring-boot-micrometer/src/test/kotlin/com/netflix/graphql/dgs/metrics/micrometer/MicrometerServletSmokeTest.kt

----

h2. DGS Resolver Metrics

Initially we had the a `dgs.resolver.timer` and a `dgs.resolver.counter`.
These were removed in favor of a `dgs.resolver` timer.

As mentioned in Micrometer's [Counters](https://micrometer.io/docs/concepts#_counters) documentation...
> Never count something you can time with a Timer or summarize with a DistributionSummary!
> Both Timer and DistributionSummary always publish a count of events in addition to other measurements.

Flags to enable/disable features, Smoke Test metrics for DataLoaders
@berngp berngp removed the request for review from rpalcolea March 4, 2021 02:59
@berngp berngp merged commit 7aae070 into master Mar 4, 2021
@berngp berngp deleted the feature/metrics branch April 9, 2021 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants