Export Common Duration Metrics #34168

Open
ldb opened this issue May 15, 2024 · 4 comments
Labels
area/stats enhancement Feature requests. Not bugs or questions.

Comments

@ldb

ldb commented May 15, 2024

Title: Export Common Duration Metrics

Description:
With #33240 we got the ability to export various commonly used durations via access logs (thank you!!!). However, it would be great if there were a way to also export these as metrics so that Prometheus can ingest them.

I don't have a specific design in mind right now, but anything that pre-aggregates these durations would help immensely in making it easy to alert on the performance of downstream, upstream, and Envoy itself.

ldb added the enhancement and triage labels on May 15, 2024
ravenblackx added the area/stats label and removed the triage label on May 16, 2024
@ravenblackx
Contributor

cc @wbpcode about the specific data, and @jmarantz as a stats expert.

@wbpcode
Member

wbpcode commented May 17, 2024

It's not easy to provide a feature like this in our core stats system. Dynamic and flexible stats mean additional memory and additional complexity (and I think it's complex enough already).

But the good news is that our stats system is extensible. I am okay with doing this in an optional filter (or logger? @kyessenov).

@kyessenov
Contributor

I filed #30619, which replicates the Istio design with high-cardinality metrics so you can easily do breakdowns by upstream/downstream paths. The general problem is that doing all of this in Envoy would push its stats subsystem beyond its design capabilities, so you still need to run a collector or some stats engine to hold the aggregate data. I'd also recommend using delta aggregation temporality to flush metrics, which Envoy doesn't directly support.
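To illustrate what delta temporality means here, a minimal Python sketch follows. It is not Envoy or Istio code; `DeltaCounter` and `Collector` are hypothetical names, and the label tuples are made up. The idea is that the proxy side only keeps what accumulated since the last flush, while an external collector holds the running totals.

```python
from collections import defaultdict


class DeltaCounter:
    """Hypothetical proxy-side counter that reports only the change since the last flush."""

    def __init__(self):
        self._pending = defaultdict(float)

    def add(self, labels, value):
        self._pending[labels] += value

    def flush(self):
        # Hand out what accumulated since the previous flush, then forget it,
        # so no long-lived per-label state stays on the proxy side.
        delta = dict(self._pending)
        self._pending.clear()
        return delta


class Collector:
    """Hypothetical stand-in for the external stats engine holding cumulative totals."""

    def __init__(self):
        self.totals = defaultdict(float)

    def ingest(self, delta):
        for labels, value in delta.items():
            self.totals[labels] += value


counter = DeltaCounter()
collector = Collector()

counter.add(("cluster_a", "GET"), 12.0)
counter.add(("cluster_a", "GET"), 8.0)
collector.ingest(counter.flush())  # first flush carries 20.0

counter.add(("cluster_a", "GET"), 5.0)
collector.ingest(counter.flush())  # second flush carries only 5.0
```

With cumulative temporality the sender would have to keep every label set alive indefinitely; with deltas, label sets that go quiet simply stop appearing in flushes.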

@ldb
Author

ldb commented May 21, 2024

To be clear, what I am mostly looking for is to have specific metrics available for the kind of deltas that #33240 enables, for example:

ds_rx_duration: '%COMMON_DURATION(DS_RX_BEG:DS_RX_END:ms)%',  // Duration in milliseconds from the start of the request to the last byte of the request received from the downstream.
routing_duration: '%COMMON_DURATION(DS_RX_END:US_TX_BEG:ms)%',  // Duration in milliseconds from the last byte of the request received from the downstream to the first byte of the request sent to the upstream.
us_tx_duration: '%COMMON_DURATION(US_TX_BEG:US_TX_END:ms)%',  // Duration in milliseconds from the first byte of the request sent to the upstream to the last byte of the request sent to the upstream.
us_rx_duration: '%COMMON_DURATION(US_RX_BEG:US_RX_END:ms)%',  // Duration in milliseconds from the first byte of the response received from the upstream to the last byte of the response received from the upstream.
ds_tx_duration: '%COMMON_DURATION(DS_TX_BEG:DS_TX_END:ms)%',  // Duration in milliseconds from the first byte of the response sent to the downstream to the last byte of the response sent to the downstream.

We currently expose these in access logs, but aggregating them into metrics is quite an expensive process if all we are after is some aggregates per cluster / method / status.

These metrics do not need the same granularity (read: cardinality) as the access logs; an aggregation by upstream cluster, HTTP method, and HTTP status would already be a very useful start.
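As a rough sketch of the kind of aggregation I mean (plain Python, not an Envoy API; `DurationHistograms` and the bucket bounds are hypothetical), each duration sample is folded into a fixed-bucket histogram keyed only by metric name, upstream cluster, HTTP method, and HTTP status:

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds; not taken from Envoy.
BUCKETS_MS = [1, 5, 10, 50, 100, 500, 1000]


class DurationHistograms:
    """Hypothetical aggregator: one histogram per (metric, cluster, method, status)."""

    def __init__(self, buckets=BUCKETS_MS):
        self.buckets = list(buckets)
        # One extra slot counts samples above the largest bound (the +Inf bucket).
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, metric, cluster, method, status, duration_ms):
        key = (metric, cluster, method, status)
        # bisect_left returns the index of the first bound >= duration_ms.
        idx = bisect.bisect_left(self.buckets, duration_ms)
        self.counts[key][idx] += 1
        self.sums[key] += duration_ms

    def cumulative(self, key):
        # Prometheus-style histograms report cumulative bucket counts.
        total, out = 0, []
        for count in self.counts[key]:
            total += count
            out.append(total)
        return out


hist = DurationHistograms()
key = ("routing_duration", "cluster_a", "GET", 200)
for sample_ms in (3, 5, 2000):
    hist.observe(*key, sample_ms)
```

The per-request fields that make access logs expensive to aggregate (paths, request IDs, and so on) never enter the key, so memory grows with the number of label combinations rather than with traffic.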

I do like the idea of this being added as an optional filter, too. The metrics could be created dynamically and if the set of potential attributes is limited, cardinality should not be a big problem.
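For a back-of-the-envelope feel for the cardinality, here is a small Python estimate. All input numbers are hypothetical and only illustrate the arithmetic; the only outside fact used is that a Prometheus histogram exposes one series per bucket (including `+Inf`) plus `_sum` and `_count`.

```python
def series_count(metrics, clusters, methods, statuses, explicit_buckets):
    """Estimate the number of Prometheus time series for histogram metrics."""
    # Per label combination: one series per explicit bucket, one for +Inf,
    # plus the _sum and _count series.
    per_histogram = explicit_buckets + 1 + 2
    return metrics * clusters * methods * statuses * per_histogram


# Hypothetical deployment: 5 duration metrics, 50 upstream clusters,
# 5 HTTP methods, 5 status classes (1xx..5xx), 7 explicit buckets.
estimate = series_count(5, 50, 5, 5, 7)
print(estimate)
```

Grouping statuses into classes rather than individual codes, or dropping the method label, shrinks the product quickly if this turns out to be too many series.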
