
[metrics] Add a more precise histogram implementation #4601

Merged: 6 commits into main from andrey-23, Sep 16, 2022
Conversation

@andll (Contributor) commented Sep 13, 2022

The Prometheus histogram implementation requires specifying buckets in advance, so the reported histogram is not very precise.

The implementation in this PR takes a different approach and calculates a 'true' histogram by sorting the data points.

This approach is more expensive, but it provides better data and does not require specifying buckets in advance. For things like aggregating network latencies this seems like a better trade-off: it shows outliers better (e.g. there is no 'cut-off' on the histogram) and calculates more exact values.

There are more details on caveats and potential usages in the documentation for the HistogramVec struct.

This PR also replaces the histogram in only one place (SafeClient) for now; hopefully we can see some better data from that.

The result of this PR is that the latency histogram becomes much more granular and trustworthy.
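To make the approach concrete, here is a minimal sketch (an illustration of the idea only, not this PR's actual code; the function name is made up) of computing an exact percentile by sorting the recorded points:

```rust
/// Returns the exact value at the given percentile (0..=100) by sorting
/// all recorded points, i.e. the 'true histogram' idea described above.
fn exact_percentile(mut data: Vec<u64>, pct: u32) -> Option<u64> {
    if data.is_empty() {
        return None;
    }
    data.sort_unstable();
    // Index of the requested percentile, clamped to the last element.
    let idx = ((data.len() * pct as usize) / 100).min(data.len() - 1);
    data.get(idx).copied()
}
```

The answer is exact for the captured samples, at the cost of buffering and sorting every point; this is the trade-off discussed in the review below.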

Previously we've seen a problem where the real p90 was largely skewed by an unfortunate bucket choice in the Prometheus histogram:

[image: latency graph with the old bucketed Prometheus histogram]

Note how all latencies on this graph gravitate to constant values.

With the adapted histograms (this PR):

[image: the same latency graph, now much more granular]

@sadhansood (Contributor) left a comment:

Looks great, thanks!

@andll disabled auto-merge September 15, 2022 22:27
@andll enabled auto-merge (squash) September 15, 2022 22:27
@velvia (Contributor) left a comment:

Hey @andll, I like the innovation you show by having a solution that dynamically calculates the quantiles. The obvious disadvantage is that fixed percentiles cannot be aggregated accurately, which means we can't get meaningful aggregated percentiles across nodes.

I don't like that all the labels are cloned per point, which makes this solution very, very expensive.

To see how viable this is, I'm wondering how long the cycle interval is, and the reporting interval.

It's quite common to measure latencies over minutes. So for example, at 10k TPS (yeah, I know, I'm thinking of the future) with a 5-minute sampling interval, you are looking at 5 × 60 × 10000 = 3 million points. A 100k cap would hold only 1/30th of the samples in the above scenario, making this approach not viable.

What I suggest is that, instead of keeping all the samples, we use a digest structure that is explicitly designed to measure distributions. There are several of these, but the t-digest is a widely used one that I can recommend.

I also suggest using Prometheus summaries instead of disparate gauges, since this is the exact use case for summaries.
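A tiny, self-contained illustration of the aggregation problem mentioned above (the numbers are made up for the example):

```rust
// p90 of a single node's samples, by sorting (panics on empty input).
fn p90(mut v: Vec<u64>) -> u64 {
    v.sort_unstable();
    v[(v.len() * 90 / 100).min(v.len() - 1)]
}

fn main() {
    let node_a = vec![1, 1, 1, 1, 1, 1, 1, 1, 1, 100];
    let node_b = vec![50; 10];
    let avg_of_p90s = (p90(node_a.clone()) + p90(node_b.clone())) / 2;
    // The true p90 of the pooled samples differs from any simple
    // combination of the per-node p90s.
    let pooled: Vec<u64> = node_a.into_iter().chain(node_b).collect();
    println!("avg of p90s = {}, pooled p90 = {}", avg_of_p90s, p90(pooled));
    // Prints: avg of p90s = 75, pooled p90 = 50
}
```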

```rust
/// Unlike the histogram from the prometheus crate, this histogram does not require specifying buckets.
/// It works by calculating a 'true' histogram, aggregating and sorting the values.
///
/// The values are reported into a Prometheus gauge with the requested labels and an additional dimension
```
Reviewer comment:

We should use the Prometheus summary, which is designed for this exact purpose: reporting percentiles:
https://www.robustperception.io/how-does-a-prometheus-summary-work/

The major downsides are that percentiles cannot be properly aggregated (for example, we cannot aggregate these percentiles from individual nodes into one for all nodes; there isn't a good way to combine percentiles like that) and that the percentiles are fixed.

```rust
///
/// On the bright side, this histogram exports less data to Prometheus compared to prometheus::Histogram:
/// it exports each requested percentile into a separate Prometheus gauge, while the original
/// implementation creates a gauge per bucket.
```
Reviewer comment:

More accurately, Prom histograms are a set of counters in two dimensions. This is an important difference: Prom counters are designed to be resilient to restarts.
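To make the "set of counters" point concrete, here is a minimal sketch (a simplified model for intuition, not the prometheus crate's actual API) of a bucketed histogram as monotonically increasing per-bucket counters:

```rust
/// Simplified model of a Prometheus-style histogram: one cumulative
/// counter per `le` bucket boundary, plus a running sum and count.
struct BucketHistogram {
    upper_bounds: Vec<f64>, // bucket boundaries, sorted ascending
    counts: Vec<u64>,       // per-bucket counts; they only ever increase
    sum: f64,
    count: u64,
}

impl BucketHistogram {
    fn observe(&mut self, v: f64) {
        self.sum += v;
        self.count += 1;
        // Buckets are cumulative: an observation increments every bucket
        // whose upper bound it falls under.
        for (bound, c) in self.upper_bounds.iter().zip(self.counts.iter_mut()) {
            if v <= *bound {
                *c += 1;
            }
        }
    }
}
```

Because every field is monotonic between restarts, Prometheus can apply rate() and sum these across scrapes and restarts, which plain gauges do not allow.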


```rust
impl Histogram {
    pub fn report(&self, v: Point) {
        match self.channel.try_send((self.labels.clone(), v)) {
```
Reviewer comment:

Can we avoid cloning the labels on each report and instead send a u64 ID or something? Cloning means several allocations totalling 500 bytes to 1 KB per point, which is slow and also memory-hungry.

@andll (Author) replied:

Label is an Arc, so this is essentially sending a u64.
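For context, a small generic-Rust illustration (not this PR's types; the label names are made up) of why cloning an Arc is cheap: it bumps an atomic reference count and copies a pointer rather than deep-copying the labels:

```rust
use std::sync::Arc;

fn main() {
    let labels: Arc<Vec<String>> = Arc::new(vec!["method".into(), "address".into()]);
    // `clone` only increments the refcount; the strings themselves are shared.
    let sent = labels.clone();
    assert!(Arc::ptr_eq(&labels, &sent));
}
```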

```rust
        }
    }
}
if count > MAX_POINTS {
```
Reviewer comment:

We should consider using a rotating/ring buffer here. Newer data automatically overwrites older data, which makes for a nice sliding window. There is no need to error here, because in the long run we will always overflow, unless there is some periodic reset, which I didn't see.
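A minimal sketch of the ring-buffer idea suggested here (illustrative only; the type and method names are made up):

```rust
/// Fixed-capacity buffer where new samples overwrite the oldest,
/// giving a sliding window that never needs to error on overflow.
struct SampleRing {
    buf: Vec<u64>,
    next: usize,
    filled: bool,
}

impl SampleRing {
    fn with_capacity(cap: usize) -> Self {
        SampleRing { buf: vec![0; cap], next: 0, filled: false }
    }

    fn push(&mut self, v: u64) {
        self.buf[self.next] = v; // overwrite the oldest slot
        self.next = (self.next + 1) % self.buf.len();
        if self.next == 0 {
            self.filled = true;
        }
    }

    /// Samples currently in the window (not ordered by arrival time).
    fn samples(&self) -> &[u64] {
        if self.filled { &self.buf } else { &self.buf[..self.next] }
    }
}
```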

Reviewer follow-up:

Oh never mind, it resets every cycle.

```rust
data.sort_unstable();
for pct1000 in self.percentiles.iter() {
    let index = Self::pct1000_index(data.len(), *pct1000);
    let point = *data.get(index).unwrap();
```
Reviewer comment:

Isn't data a Vec? I don't see a get method for data.
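(For reference, `Vec<T>` auto-dereferences to the slice type `[T]`, which is where `get` comes from, so the call does compile:)

```rust
fn main() {
    let data: Vec<u64> = vec![10, 20, 30];
    // `get` is a slice method reached through Deref, returning Option<&u64>.
    assert_eq!(data.get(1), Some(&20));
    assert_eq!(data.get(9), None);
}
```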

```rust
self.known_labels.insert(label.clone());
reset_labels.remove(&label);
assert!(!data.is_empty());
data.sort_unstable();
```
Reviewer comment:

How often will report get called?

@andll (Author) commented Sep 16, 2022:

> I don't like that all the labels are cloned per point, which makes this solution very, very expensive.

As I replied in other comments, labels are Arc<>, so I don't think this is expensive.

For the time series format: yes, the format changes, as mentioned in the PR description. To be honest, I don't see how this is a practical problem; we can add other percentiles to the defaults if needed, and adding additional percentiles is not expensive.

@velvia I understand that there are probably endless ways to improve histograms.

I believe that what I use here for safe_client latencies is better than what we currently have with fixed Prometheus buckets: it provides more precise results, and with the load we have on those counters the solution works just fine. Do you see any practical issues with this implementation for this particular use case?

Perhaps it is not a universal solution, and regular Prometheus histograms are still available for other use cases where we have a lot of data points or other reasons preventing usage of the histogram from this PR. People can choose what they want; I don't even mind if Prometheus histograms remain the default choice.

Since this seems to be a working solution that is better than what we had before and solves a practical problem we encountered (misunderstanding latencies on Prometheus graphs), I propose to just move forward with it. When someone has time to implement better histograms we can do that, and I don't mind if this code gets deleted in a month. Working beats perfect.

@velvia (Contributor) left a comment:

I approve it for now, but we should acknowledge and document the drawbacks of the approach. Namely:

  1. percentiles cannot be accurately aggregated across nodes
  2. accuracy will degrade once more than the sample limit is accumulated

@changeset-bot commented Sep 16, 2022:

⚠️ No Changeset found

Latest commit: dda9f45

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@andll merged commit 279bfe6 into main Sep 16, 2022
@andll deleted the andrey-23 branch September 16, 2022 17:22