data model proto #126

Merged: 6 commits merged into master on Nov 5, 2020

Conversation

@RichiH RichiH commented Jan 15, 2019

No description provided.

@brian-brazil brian-brazil left a comment

There's probably more, but here's what I spotted that needs to be made consistent with the text format:

// Required.
int64 seconds = 1; // Seconds since the epoch. Negative values are permitted

uint32 nanoseconds = 2;

What are the semantics of this when seconds is negative?

@robskillington robskillington Jan 19, 2019

nanoseconds_offset is perhaps a better concept? And if it's signed instead, perhaps that's more permissive and allows you to always just blindly add these two together:

unix_nanos = (timestamp.seconds * SECOND_NANOSECONDS) + timestamp.nanoseconds_offset

Thoughts?
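
For illustration (numbers are made up, not from the PR): with a signed offset, an instant 0.25s before the epoch could be encoded as seconds = 0, nanoseconds_offset = -250000000, and the formula above gives unix_nanos = -250000000. With an unsigned nanoseconds field the same instant has to be written as seconds = -1, nanoseconds = 750000000 under the convention that nanoseconds always counts forward, which is exactly the ambiguity raised above.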

Don't quite understand. We can add even when nanoseconds is unsigned

As discussed we'll just use the google protobuf timestamp type here.
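
A minimal sketch of what using the well-known type looks like (the field name and number here are illustrative, not taken from this PR):

import "google/protobuf/timestamp.proto";

message Point {
  // ... other fields ...
  google.protobuf.Timestamp timestamp = 4; // illustrative field number
}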

message Point {
  oneof value {
    double double_value = 1;
    int64 int_value = 2;

Hmm, did we agree uint64 here?

Would it be a little strange that double value can be negative but int value cannot be?

Any reason to restrict it to unsigned? The proto doc has int64

Any reason this cannot be negative for a gauge?

double sum = 1;

// Required.
int64 count = 2;

uint64?

done

message BucketCount {
  // Required.
  // Count is the number of values for a bucket of the histogram.
  int64 count = 1;

uint64?

done

repeated Quantile quantile = 3;
message Quantile {
  // Required.
  // Must be in the interval (0.0, 1.0].

I think this should be fully closed, why allow 1 but not 0?

How do you calculate the 0th percentile from a sample of values? I don't think subtracting a delta from the lowest value in the sample is mathematically sound.

Same question for 100th percentile. It's either both or neither.

// A bucket has a 0 lower bound and an inclusive upper bound for the
// values that are counted for that bucket. That is, buckets are
// cumulative. The upper bound of successive buckets must be increasing.
// There is an implicit overflow bucket that extends up to +infinity.

and must match the value of _count

these are just the bucket boundaries. I've added a comment later about the sum of bucket counts.
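
As a worked illustration of the cumulative semantics in the quoted comment (illustrative values, not from the proto): with upper bounds 1, 5 and 10 and observed values 0.5, 3, 7 and 20, the bucket counts would be 1 (<= 1), 2 (<= 5) and 3 (<= 10), and the implicit +infinity overflow bucket would hold 4, matching _count.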


// Value for CUMULATIVE_HISTOGRAM or GAUGE_HISTOGRAM point.
message HistogramValue {
  // Required.

Neither of these should be required

why not?

Some systems don't have them (MySQL)

double sum = 1;
int64 count = 2;

repeated Quantile quantile = 3;

It should be clear that this is optional

Are you saying the quantiles are optional?

Yes

InfoValue info_value = 5;
SummaryValue summary_value = 6;
}
// Required for COUNTER, SUMMARY or CUMULATIVE_HISTOGRAM type.

Optional

We had agreed this is required from a model perspective.
Are you saying optional just for backward compatibility with the old text format? The proto model does not deal with that compatibility. It defines the baseline OpenMetrics model, and a client that does not send start_timestamp is not OpenMetrics compliant.

No, we had not agreed this. It's optional, as many many systems out there do not support this.

// interval.
Timestamp start_timestamp = 7;

// If not specified, the timestamp will be decided by the backend.

Make clear this is optional, and discouraged

I think we need to resolve the pull/push debate before we say "discouraged". IMO the same data model suffices for push, in which case timestamp should be encouraged.
For now I am fine with saying "discouraged for pull and encouraged for push".

I wouldn't say encouraged for push; it depends on what's on the other side. The other side can always add its own.


// TODO: Format?
// STATE_SETs and INFO types must not have a unit.
string unit = 2;

Should this perhaps instead be an enum?

@brian-brazil brian-brazil Jan 19, 2019

There are too many potential values; it's a string to allow for that.

Do we want any restriction on this string, or just UTF-8?

It'd have to be the same limits as metric names.

message Linear {
  // Required.
  // Must be greater than 0.
  int32 num_explicit_buckets = 1;

uint32 perhaps since there's no such thing as negative buckets?

Done

message Exponential {
  // Required.
  // Must be greater than 0.
  int32 num_exponential_buckets = 1;

uint32 perhaps here too since there's no such thing as negative buckets?

Done

// The quantiles can be reset at arbitrary unknown times.

double sum = 1;
int64 count = 2;

uint64 since there can't be negative sampled values?

done

string value = 2;
}

message Timestamp {

Why not use the default timestamp from proto3? "google/protobuf/timestamp.proto"

I think it'd be better to keep it self-contained.

At least we can copy the definition from there and avoid any arguments about nanoseconds being int vs uint.

Do you know why it is int32 in google/protobuf/timestamp.proto even though it says "Non-negative fractions of a second at nanosecond resolution."?

Two reasons that I know of:

  1. This needs to be compatible with languages that do not have unsigned (like Java).
  2. It makes diff calculation easier (you do the diff initially then normalize the nanos).
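
A quick worked example of point 2 (illustrative numbers): diffing {seconds: 5, nanos: 200000000} minus {seconds: 3, nanos: 700000000} field by field gives {seconds: 2, nanos: -500000000}, which a signed nanos field can hold as an intermediate result before normalizing to {seconds: 1, nanos: 500000000}; with an unsigned field you have to borrow from seconds before you can even store it.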

I don't think either is a problem in practice; it's not going to have values where that'll be an issue.


// One or more timeseries for a single metric, where each timeseries has
// one or more points.
message MetricPoints {

Why not just Metric?

message Point {
  oneof value {
    double double_value = 1;
    int64 int_value = 2;

Any reason this cannot be negative for a gauge?

// (and indicated by comments).

// The top-level message sent on the wire.
message MetricSet {

Probably a better name is MetricList. Set semantics are a bit unclear here.

It's a set; you aren't allowed to have duplicates.

Then probably document that, to make it clear why it is a set and what the "hash" function is.


enum Type {
  UNKNOWN = 0; // double or int valued point
  GAUGE = 1; // double or int valued point

In order to avoid confusion and to let backends support both double/int for the same metric, what about making the type include the type of the valued point? Something like GAUGE_DOUBLE?

We discussed this at length; this is the way we're going.

You mean the current state or what I proposed?

}

// A single timeseries.
message Timeseries {

I don't see anything like MonitoredResource in Stackdriver or __meta__ labels (in Prometheus; hope I used the correct name). Is this intentional? How are these reported?

A couple of use cases where these are necessary and cannot be determined, even if this data model is focused on a pull mechanism that would in general allow the backend to annotate these meta labels:

  1. Task (metric producer) -> Proxy -> Metric Backend. In this case the meta labels should be included in the metric because the backend does not know about the metric producer and cannot associate the right meta labels.

This is what Info metrics are for.

Nice, but I have some problems understanding some of the things here; maybe it's worth some clarification:

  1. I don't understand how the correlation between Info metrics and the other metrics happens. Should we have one info metric per MetricSet? Can I have metrics from multiple producers in the same MetricSet (in the case of a proxy, if I monitor multiple tasks)?

I think we are allowing multiple info metrics, e.g. build_info. It is up to the receiver to decide what to do with them. But they will all apply to the whole MetricSet, so one shouldn't mix multiple producers in the same MetricSet.

Yes, there's no limit on how many info metrics you can have.
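
To make the correlation concrete (an illustrative example, not taken from the PR): a MetricSet for one producer might contain an INFO metric such as build_info whose value carries labels like version="1.2.3" and revision="abc123", alongside ordinary metrics such as an http_requests_total counter; because they share the same MetricSet, the receiver can associate the build_info labels with every other timeseries in the set.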

// timeseries, value of INFO metrics, and exemplars in Histograms.
message Label {
// Required.
string name = 1;

Do we want to support a label description? Human readable description of this metric?

We discussed this previously, and no.

roger that.

This comment should be marked as resolved.


// Additional information about the example value
// (e.g. trace id).
// The sum of lengths of all the strings in all the labels must not
// exceed 64 UTF-8 characters.

64 characters for all the labels here seems unreasonable. trace-id has 32 hex characters and span-id has 16 hex characters; these values alone represent almost the entire limit.

See https://github.com/w3c/trace-context

There is some concern about allowing this to be a dumping ground for stuff.
With UTF-8, the current limit lets someone use 64*4 = 256 bytes.
How about we state the limit as the sum of bytes needed for names and values, so that it doesn't penalize users who restrict themselves to ASCII?
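
For example (illustrative numbers): a trace_id label name (8 ASCII characters, 8 bytes) plus a 32-hex-digit value is 40 characters and 40 bytes, while 40 four-byte UTF-8 characters would also count as 40 characters but occupy 160 bytes.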

The spec in general is UTF-8; I'd rather not switch to bytes here.

It sounds to me like you've 12 characters left over for the label name, so you're inside the limit.

@sumeer I am not suggesting that we not have a limit, but the limit is too small.

@brian-brazil Correct, but if I use "trace-id" and "span-id" (32 + 8 + 16 + 7 = 63) as keys then I have one character left, so no other exemplar. Also we would like to have a trace-options which will indicate whether the trace is sampled or not (2 hex characters).

I know that the solution can be to modify the w3c standard (impossible) or to use "shorter" keys like "tid", "sid" and "to", but those are not that clear. I really think that the limit should be more realistic; we can go with 256 characters for this.


// A name-value pair. These are used in multiple places: identifying
// timeseries, value of INFO metrics, and exemplars in Histograms.
message Label {

Because a label is a key/value pair, you can represent Labels as a map<String,String>.

This comment should be ignored and marked as resolved.

bool enabled = 1;

// Required.
string name = 2;

Why is this not a regular label? The recommended key could be state_name, the label value would be the state, and then this becomes a single boolean.

Then it'd not be a first-class type.

I am fine with not having this as a first-class type. Probably supporting boolean on the gauge is enough.

// exponential sequence, or each bucket can be specified explicitly.
// `BucketOptions` does not include the number of values in each bucket.
//
// A bucket has a 0 lower bound and an inclusive upper bound for the

This should say "inclusive 0 lower bound" for clarity

Buckets for values <0 don't seem possible with this data model. Is that intended?

Per our discussions it is; no one could come up with a realistic use case.

My usual tech use-case is SpamAssassin scores.

However, OpenMetrics is supposed to be generally applicable. Observing negative values is a very normal thing for histograms in general. It is almost zero cost to support it. Kicking the support out would be a serious blow to the applicability of the format.

Current handling of negative bounds by Prometheus is defined here: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile

…ents and add explicit advice (#142)

* Use google.protobuf.Timestamp for timestamps and add clarity to certain comments
* Address that unknown, gauge and counter should all use double point values
* Add optional/required to fields missing it
* Mark summary quantiles as optional
* Make linear buckets able to use negative observations
* Call out that points are optional
* Remove mention of 0 lower bound to allow negative buckets
* Update protobuf types to match latest OpenMetrics RFC data model

// A name-value pair. These are used in multiple places: identifying
// timeseries, value of INFO metrics, and exemplars in Histograms.
message Label {

This comment should be ignored and marked as resolved.

Comment on lines 78 to 83
// Optional.
// If there is more than one point, all must have a timestamp field
// and they must be in increasing order of timestamp.
// If there is only one point and has no timestamp field, the receiver
// will assign a timestamp.
repeated Point points = 2;

The number of cases where you have multiple points vs one point for the same labels is extremely small. I would suggest optimizing for the case where only one point per label set is sent and removing repeated (this avoids an allocation or two, depending on the language), with the downside of repeating the labels for the case where multiple points are needed.

// timeseries, value of INFO metrics, and exemplars in Histograms.
message Label {
// Required.
string name = 1;

This comment should be marked as resolved.

@robskillington robskillington left a comment

LGTM; after merging the latest changes, now merging this to master.

@robskillington robskillington merged commit 48e3d1c into master Nov 5, 2020

RichiH commented Nov 5, 2020 via email
