Require consumers to normalize floats represented as strings #129

Closed
beorn7 opened this issue Feb 4, 2019 · 14 comments

Comments

@beorn7
Collaborator

beorn7 commented Feb 4, 2019

If I understood correctly, the text representation of OpenMetrics will inherit Prometheus's “type dissonance” of representing the floating-point bucket boundaries of a histogram and the φ values of pre-calculated quantiles in summaries as strings. Concretely, the former are represented as an le label, the latter as a quantile label, and as per #56, all label values are strings.

Historical note: When the Prometheus text format was designed, this “type dissonance” was considered a pragmatic solution in cases where you could not use the “proper” solution of using the protobuf format (where neither bucket boundaries nor φ values are labels in the first place). As we know, the roles of text vs protobuf changed over time. But back then, there was certainly no intention to promote the somewhat dirty pragmatic solution to an industry standard.

Not having taken part in the discussion so far (for reasons…), it is not my intention to doubt the decisions made. However, I would like to suggest a means to limit the damage: Please consider requiring consumers to normalize float values represented as strings.

For example, the wide-spread Prometheus Go library creates the following summary for the Go garbage collector:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000220391
go_gc_duration_seconds{quantile="0.25"} 0.000270805
go_gc_duration_seconds{quantile="0.5"} 0.000287239
go_gc_duration_seconds{quantile="0.75"} 0.000310831
go_gc_duration_seconds{quantile="1"} 0.005429449
go_gc_duration_seconds_sum 16.791648668
go_gc_duration_seconds_count 56134

Depending on one's preferences for formatting floats as strings, {quantile="1"} could just as well be rendered as

  • {quantile="+1"}
  • {quantile="1."}
  • {quantile="1.0"}
  • {quantile="1e0"}
  • {quantile="1E0"}

{quantile="0.5"} could be rendered as:

  • {quantile="+0.5"}
  • {quantile=".5"}
  • {quantile="0.50"}
  • {quantile="5e-1"}
  • {quantile="5E-1"}

Similar considerations apply to the le label of histogram buckets. The practical impact is perhaps even more severe there, as the numerical value is actually needed as such to calculate quantiles from a histogram (see recent Prometheus code changes to work around the resulting problems). For example, {le="10000"} could reasonably be written as:

  • {le="10000"}
  • {le="10000."}
  • {le="10000.0"}
  • {le="1e4"}
  • {le="1E4"}

If the label is naively ingested in the same way as other labels, each of the above would create a different time series. My suggestion implies that consumers interpret those special labels as the string rendering of a float and evaluate the actual float value during ingestion. They then save it in their preferred form, which could again be a string (as in the case of Prometheus), but normalized to the ingester's preferred string formatting.
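Such a normalization step could look like the following sketch in Go. normalizeFloatLabel is a hypothetical helper, not part of any Prometheus library; it takes Go's shortest round-trip rendering as the ingester's preferred form for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// normalizeFloatLabel parses a label value as a float64 and re-renders it in
// the ingester's preferred form (here: Go's shortest round-trip formatting).
// Hypothetical helper for illustration only.
func normalizeFloatLabel(v string) (string, error) {
	f, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return "", err
	}
	return strconv.FormatFloat(f, 'g', -1, 64), nil
}

func main() {
	// All of these render the same float64 and thus normalize identically.
	for _, v := range []string{"1", "1.", "1.0", "1e0", "1E0", "+1"} {
		n, _ := normalizeFloatLabel(v)
		fmt.Printf("%q -> %q\n", v, n) // every input normalizes to "1"
	}
}
```

With this in place, {quantile="1e0"} and {quantile="1.0"} would land in the same time series instead of two different ones.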

The alternative would be to precisely specify, as part of the OpenMetrics standard, how floats are formatted as strings. Such a specification would involve a whole bunch of corner cases, would be hard to get complete and correct, and would put a huge burden on any implementation generating the format, as it could no longer just use its platform's preferred formatter.

Another historical note: Prometheus 1.x was doing exactly the normalization suggested here. Prometheus 2.x paradoxically dropped both the normalization and our excuse for representing floats as strings, namely the option to use protobuf and thus avoid the problematic representation altogether. The decision to do so was not widely discussed; it was pushed through by very few developers, at a time when no formalized decision-making process was established yet. Prometheus is not a good role model here.

@beorn7
Collaborator Author

beorn7 commented Feb 4, 2019

For context: This issue became more relevant as I'm laying out the roadmap for the Prometheus Go client and was thinking about when and how to introduce potentially necessary breaking changes for OpenMetrics adoption. The fallout of the change in float formatting has already proven to be way more serious than anticipated. See prometheus/common#168 for illustration. By now, I have received personal communication from another large-scale Prometheus user who was affected by the same issue, with a similar amount of confusion and productivity lost to troubleshooting.

@beorn7
Collaborator Author

beorn7 commented Feb 26, 2019

Any thoughts on this? I would really appreciate any insight into how the OpenMetrics folks think about this, so that I can act accordingly for the next steps in the Prometheus Go exposition library.

@RichiH
Collaborator

RichiH commented Mar 26, 2019

Sorry for taking so long.

We talked about this in our bi-weekly call and this is the relevant part:

  • Float spec for quantile/le labels
    • Beorn opened Require consumers to normalize floats represented as strings #129
    • Parser vs exposition handling
    • Brian: SHOULD is workable, but still will cause confusion
    • Brian: Not all language versions (e.g. Python 2.6) will support this easily, due to general floating-point accuracy issues
    • RichiH: Want to avoid SNMP-like variance over time; what will happen in a decade?
    • Rob: What about SHOULD/MAY for parsers to be able to handle non-canonical input?
      • Ends up de facto as a SHOULD
    • Brian: Could suggest that clients with issues could allow users to specify the right value
    • Rob: Only a problem if you're not using OM client libraries
    • Rob: Is a problem for other floats?
      • Brian: No, want to allow the standard way of rendering floats in various languages to Just Work as much as practical
    • Ben: Choose a canonical representation, at a minimum a SHOULD, also fine with MUST
    • RichiH: If it's a MUST, people will still get it wrong. SHOULD makes this worse. Need to have it very clearly in the spec so we can point out incorrect implementations are against spec
    • Consensus: MUST
      • Prometheus itself should verify this
    • RichiH: Suggestion: “Parsers MAY do sufficient parsing to detect this, if so they SHOULD reject it. If they do deep parsing and accept it, they MUST offer an option to reject”
    • Rob: Supportive of parsers SHOULD do deep parsing and if that surfaces an error then they MUST reject
      • Brian: Per Rob, but parsers MAY do deep parsing. Keep things easy for simple parsing use cases.
      • Rob: True, probably better (to allow for catch-all parsers)
    • Brian: Let's think on this subpoint more

@beorn7
Collaborator Author

beorn7 commented Mar 27, 2019

I don't quite get the frame of reference:

Consensus: MUST

Is this "consumers MUST normalize floats" or "producers MUST use a canonical text format for floats"?

@brian-brazil
Collaborator

It's "producers MUST use a canonical text format for floats" for le/quantile label.

@beorn7
Collaborator Author

beorn7 commented Mar 27, 2019

Oops. I guess you have discussed all of that at length, and this issue is probably not the right place to shout from the sidelines. Still, I want to state for the record that a good part of the Prometheus text format's success story has been that it is easy to produce. There has been a worrying tendency in OpenMetrics to favor ease of consumption over ease of production. But now things get really tough, as you are essentially requiring each producer implementation to include a float formatter that adheres exactly to a (still to be formulated) standard.

@brian-brazil
Collaborator

We are still discussing how lenient a parser can be in relation to that.

@RichiH
Collaborator

RichiH commented Mar 27, 2019

The underlying problem is that as we gain traction outside of cloud native, aka modern tech, implementers will become less skilled and more confused. Long-term, I am more worried about fragmentation and people doing things wrong than about immediate adoption.

We can always relax requirements later, but never tighten them.

@beorn7
Collaborator Author

beorn7 commented Mar 28, 2019

If I'm not mistaken, OpenMetrics is about production and consumption. In this particular case, tightening the requirements for producers means relaxing the requirements for consumers. I filed this issue because I think it's the most reasonable way here to tighten the requirements for consumers (normalize float values even if they happen to be represented as strings) rather than for producers (implement a custom float formatter to create standardized string representation of floats).

Note that Prometheus 1 was fulfilling the requirement for consumers, and then Prometheus 2 stopped doing so, which wasn't really an informed decision but more or less an accident. On the other hand, no existing text format producer has ever included a custom float formatter. (It is as of now impossible anyway to produce a conforming formatter as OpenMetrics hasn't yet provided a complete spec about how to format floats. Which will be quite an effort.)

@beorn7
Collaborator Author

beorn7 commented Mar 28, 2019

Or in other words:

There is already collective experience with normalizing floats, so tightening the requirements for consumers is a proven concept and also defuses a problematic design decision (representing floats as strings in the exposition format).

On the other hand, tightening the requirements for producers has no precedent and makes a problematic design decision (representing floats as strings in the exposition format) even worse (representing floats as strings using a delicately (to be) defined canonical format).

@beorn7
Collaborator Author

beorn7 commented Jun 18, 2019

To reiterate about “Go's %g formatting with added .0 for numbers with neither e nor . in it”:

Besides the general badness of taking the behavior of a particular implementation as the spec (rather than actually specifying the behavior), I don't think we can rely on that being stable for our purposes.

Both the documentation for the fmt package and that of strconv.FormatFloat specify (only) two things about %g:

  1. “%e for large exponents, %f otherwise.”
  2. “Precision […] for %g it is the smallest number of digits necessary to identify the value uniquely.”

In my understanding, any implementation of the %g format in the fmt package is conforming to the Go standard if it fulfills those two.

Note that in (1) the switching point between %e and %f is not precisely defined (“large” is apparently “greater than 5 or smaller than -4” in practice, which is hardcoded in strconv/ftoa.go with no hint in the code that this has been standardized anywhere).

Item (2) is much harder to tackle, though. Example (see code):

  • A number written as 1.234567890123456789 results in a float64 that Go's %g formats as 1.2345678901234567. Both numbers result in the same bit pattern if parsed as a float64.
  • However, a number written as 1.2345678901234568 will also result in the same float64.
  • A Go implementation formatting that same float64 as 1.2345678901234568 would be as correct as the current one that formats it as 1.2345678901234567. In fact, I wouldn't be surprised if the exact behavior depended on the hardware the code is running on.

Looking at the code in strconv/ftoa.go that creates the exact representation, you can see that it tries the relatively recent Grisu3 algorithm first and falls back to a more expensive algorithm if Grisu3 is unsure. Grisu3 is younger than Go itself (although Go 1.0 was released after the Grisu3 paper), so I'm fairly sure that the exact algorithm used is not part of the Go stability guarantees.

Even ignoring Go's stability guarantees, and coming back to the actual issue at hand: If OpenMetrics included an exact specification of how a float64 has to be represented unambiguously by a unique string, it would need to include the exact algorithm for doing so, and it would need to guarantee somehow that this algorithm yields the same results on all hardware platforms. That, I assume, was never a concern of those devising efficient algorithms for formatting floating-point numbers: the goal was to create any one of the shortest strings that parse back into the exact same float, not a particular one selected reproducibly among several possible shortest strings.

Finally, even assuming that you can, in fact, specify an algorithm that would guarantee a unique string representation on all hardware platforms, it is unlikely that this algorithm will be very efficient. As the generation of the text format involves a lot of float formatting, this would be a significant added cost to generating the format (not to speak of the significant cost of implementing said algorithm in each language supporting OpenMetrics).

My apologies for the long comment. I wish my previous comments had been enough to make my point, but apparently they weren't. I hope it is now overwhelmingly obvious that requiring a unique text representation for floats opens a can of worms, and that it is much saner to require consumers to normalize float values so that different string representations of the very same float64 are treated the same.

@brian-brazil
Collaborator

Finally, even assuming that you can, in fact, specify an algorithm that would guarantee a unique string representation on all hardware platforms, it is unlikely that this algorithm will be very efficient.

Grisu3 (with fallback) is that algorithm, and what everything bar Python uses from my research thus far. Python produces the same result for your example input.

@beorn7
Collaborator Author

beorn7 commented Jun 18, 2019

Grisu3 advertises being “correct: any printed number will evaluate to the same number, when read again”. That's not equivalent to “any given float64 will always result in the same printed number”. I don't know where you gain the confidence that the latter holds in general, and on different hardware in particular.

But even if Grisu3 had these properties (which needs to be shown), the fallback algorithm also needs to have the same properties.

And even if all of that is the case, most OpenMetrics implementations would need to come with their own Grisu3-plus-fallback implementation, as there will be hardly any language whose standard library guarantees to always use Grisu3 and that exact fallback algorithm.

@RichiH
Collaborator

RichiH commented Dec 1, 2020

Per https://github.com/OpenObservability/OpenMetrics/blob/master/specification/OpenMetrics.md#considerations-canonical-numbers producers SHOULD produce canonical numbers for a small set of common values, and it is encouraged for other values. The format does not specify what consumers should do with these values, beyond not rejecting non-canonical numbers other than +Inf for Histogram buckets.
