Export Common Duration Metrics #34168

ldb · 2024-05-15T10:01:33Z

Title: Export Common Duration Metrics

Description:
With #33240 we got the ability to export various commonly used durations via access logs (thank you!!!). However, it would be great if there was a way to also export these as metrics so they can be ingested by Prometheus.

I don't have a specific design in mind right now, but anything that pre-aggregates these durations would help immensely in ensuring we can easily alert on the performance of the downstream, the upstream, and Envoy itself.

ravenblackx · 2024-05-16T15:07:17Z

@wbpcode about the specific data, @jmarantz as a stats expert

wbpcode · 2024-05-17T13:58:22Z

It's not easy to provide a feature like this in our core stats system. Dynamic and flexible stats mean additional memory and additional complexity. (And I think it's complex enough already.)

But the good news is that our stats system is extensible. I am okay with doing it in an optional filter (or logger? @kyessenov)

kyessenov · 2024-05-20T20:37:48Z

I filed #30619, which replicates the Istio design with high-cardinality metrics so you can easily do breakdowns by upstream/downstream paths. The general problem is that doing all of this in Envoy would push its stats subsystem beyond its design capabilities, so you still need to run a collector or some stats engine to hold the aggregate data. I'd also recommend using delta aggregation temporality to flush metrics, which Envoy doesn't directly support.
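For readers unfamiliar with the term, delta temporality means each flush reports only what accumulated since the previous flush, and the sender then forgets it; the collector is the component that holds the running totals. A minimal Python sketch of that idea follows. The `DeltaCounter` class and its key format are hypothetical and are not part of Envoy or of #30619.

```python
from collections import Counter


class DeltaCounter:
    """Hypothetical counter that reports deltas instead of cumulative totals."""

    def __init__(self):
        # Counts accumulated since the last flush, keyed by label tuple.
        self._pending = Counter()

    def add(self, key, amount=1):
        self._pending[key] += amount

    def flush(self):
        # Hand the accumulated counts to the caller and reset local state,
        # so memory held by the sender stays bounded by one flush interval.
        delta = dict(self._pending)
        self._pending.clear()
        return delta


def apply_delta(totals, delta):
    # What a collector might do: fold each delta into its own running totals.
    for key, amount in delta.items():
        totals[key] = totals.get(key, 0) + amount
    return totals
```

With this shape, a label combination that stops receiving traffic simply stops appearing in flushes, which is one reason delta reporting suits high-cardinality data.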

ldb · 2024-05-21T07:50:15Z

To be clear, what I am mostly looking for is to have specific metrics available for the kind of deltas that #33240 enables, for example:

ds_rx_duration: '%COMMON_DURATION(DS_RX_BEG:DS_RX_END:ms)%',  // Total duration in milliseconds of the request from the start time to the last byte of the request received from the downstream.
routing_duration: '%COMMON_DURATION(DS_RX_END:US_TX_BEG:ms)%',  // Total duration in milliseconds of the request from the last byte of the request received from the downstream to the first byte of the request sent to the upstream.
us_tx_duration: '%COMMON_DURATION(US_TX_BEG:US_TX_END:ms)%',  // Total duration in milliseconds of the request from the first byte of the request sent to the upstream to the last byte of the request sent to the upstream.
us_rx_duration: '%COMMON_DURATION(US_RX_BEG:US_RX_END:ms)%',  // Total duration in milliseconds of the request from the first byte of the response received from the upstream to the last byte of the response received from the upstream.
ds_tx_duration: '%COMMON_DURATION(DS_TX_BEG:DS_TX_END:ms)%',  // Total duration in milliseconds of the request from the first byte of the response sent to the downstream to the last byte of the response sent to the downstream.

We currently expose these in access logs, but aggregating these into metrics is quite an expensive process if all we are after is some aggregates per cluster / method / status.

These metrics do not need the same granularity (read: cardinality) as the access logs; an aggregation by upstream cluster, HTTP method and HTTP status would already be a very useful start.
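To make the requested aggregation concrete, here is a rough Python sketch of what has to happen outside Envoy today: parse JSON access log lines and fold one duration field into a histogram per (cluster, method, status). The log keys `cluster`, `method` and `status`, the bucket bounds, and the function names are all assumptions for illustration; only `routing_duration` comes from the format above.

```python
import json
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds (inclusive); not from Envoy.
BUCKETS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000]


class DurationHistogram:
    def __init__(self):
        # One slot per finite bound plus a final overflow (+Inf) slot.
        self.bucket_counts = [0] * (len(BUCKETS_MS) + 1)
        self.total_ms = 0
        self.count = 0

    def observe(self, value_ms):
        # bisect_left picks the first bound >= value, i.e. an inclusive upper bound.
        self.bucket_counts[bisect_left(BUCKETS_MS, value_ms)] += 1
        self.total_ms += value_ms
        self.count += 1


def aggregate(log_lines, field="routing_duration"):
    """Fold one duration field into a histogram per (cluster, method, status)."""
    histograms = defaultdict(DurationHistogram)
    for line in log_lines:
        entry = json.loads(line)
        value = entry.get(field)
        # Skip entries where the duration is missing or not numeric.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Assumed log keys; adjust to whatever the access log format emits.
        key = (entry.get("cluster"), entry.get("method"), entry.get("status"))
        histograms[key].observe(value)
    return histograms
```

Every request has to be serialized, shipped and parsed just to increment a handful of counters, which is the cost that doing the aggregation inside the proxy would avoid.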

I do like the idea of this being added as an optional filter, too. The metrics could be created dynamically and if the set of potential attributes is limited, cardinality should not be a big problem.
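As a back-of-the-envelope check on the cardinality claim, the sketch below estimates an upper bound on the number of time series if each duration were exported as a Prometheus-style histogram. All input numbers are made up, and grouping statuses into classes (2xx, 3xx, ...) is an assumption that goes beyond what is proposed above.

```python
def series_upper_bound(clusters, methods, status_classes, finite_buckets, metrics):
    # A Prometheus histogram exposes one series per finite bucket,
    # one for the +Inf bucket, plus the _sum and _count series.
    series_per_histogram = finite_buckets + 1 + 2
    label_combinations = clusters * methods * status_classes
    return label_combinations * metrics * series_per_histogram


# Hypothetical deployment: 50 clusters, 5 methods, 5 status classes,
# 10 finite buckets, and the 5 durations listed above.
estimate = series_upper_bound(50, 5, 5, 10, 5)
```

This is a worst case that assumes every label combination actually occurs; in practice most clusters see only a few methods and status classes, so the real number would be lower.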

ldb added the enhancement and triage labels · May 15, 2024

ravenblackx added the area/stats label and removed the triage label · May 16, 2024
