
fix(metrics): unblock OTLP/JSON histogram, expo, summary ingestion #60261

Merged
DanielVisca merged 1 commit into master from 05-27-fix_metrics_unblock_otlp_json_histogram_expo_summary_ingestion on May 27, 2026

Conversation

@jonmcwest

@jonmcwest jonmcwest commented May 27, 2026

Problem

The opentelemetry-proto Rust crate has several upstream deserialization gaps that cause OTLP/JSON metric payloads to silently drop data rather than returning errors:

  1. [upstream-1253] Empty value: {} AnyValue objects fail to deserialize.
  2. [upstream-3328] Several fixed64/uint64/sfixed64 fields (count, zeroCount, asInt, bucketCounts, and timestamp fields) lack the deserialize_string_to_u64 annotation, so the OTLP/JSON spec-canonical string encoding silently produces data: None instead of a parsed value.
  3. [upstream-unreported] ExponentialHistogram, ExponentialHistogramDataPoint, SummaryDataPoint, Buckets, and Exemplar all lack #[serde(default)], so any missing non-Option proto field hard-errors and trips the upstream silencing pattern.

The silencing pattern itself — Metric.data being #[serde(flatten)] Option<Data> on a #[serde(default)] struct — means any inner deserialization failure is swallowed and the metric is dropped with no log line and no error returned to the client.
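The effect of the silencing pattern can be modelled without serde. The sketch below is a dependency-free illustration, not the crate's code: `Data`, `Metric`, `parse_count`, `parse_metric_lossy`, and `parse_metric_strict` are hypothetical names, and the strict inner parser stands in for an unpatched field that only accepts a bare JSON number.

```rust
use std::num::ParseIntError;

// Hypothetical stand-ins for the real proto types; only the shape matters.
#[derive(Debug, PartialEq)]
enum Data {
    Histogram { count: u64 },
}

#[derive(Debug, PartialEq)]
struct Metric {
    name: String,
    data: Option<Data>,
}

/// Strict inner parser: accepts bare digits only, so the quoted form `"5"` fails.
fn parse_count(raw: &str) -> Result<u64, ParseIntError> {
    raw.parse::<u64>()
}

/// Models the flattened `Option<Data>`: an inner failure becomes `None`,
/// and the caller never sees an error.
fn parse_metric_lossy(name: &str, raw_count: &str) -> Metric {
    Metric {
        name: name.to_string(),
        data: parse_count(raw_count)
            .ok()
            .map(|count| Data::Histogram { count }),
    }
}

/// The behaviour a client would expect instead: the inner error propagates.
fn parse_metric_strict(name: &str, raw_count: &str) -> Result<Metric, ParseIntError> {
    let count = parse_count(raw_count)?;
    Ok(Metric {
        name: name.to_string(),
        data: Some(Data::Histogram { count }),
    })
}
```

In the lossy variant the metric keeps its name but loses its data, which matches the "dropped with no log line" symptom described above.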

Changes

patch_otel_json is extended with three independently-removable workaround layers, each tagged with a FIXME referencing its upstream issue:

  • String→integer coercion for count, zeroCount, asInt, and bucketCounts elements via coerce_string_to_integer, which tries i64 first then u64 to handle both signed and unsigned spec-valid values.
  • Timestamp coercion for timeUnixNano and startTimeUnixNano descendants inside exponentialHistogram and summary variants via coerce_unix_nano_descendants.
  • Default injection for all required-by-serde fields in ExponentialHistogram, ExponentialHistogramDataPoint, SummaryDataPoint, Buckets, and Exemplar via fill_*_defaults functions, preventing hard-errors on minimal but spec-valid payloads.
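The i64-then-u64 order described in the first bullet can be sketched as follows. This is a minimal illustration under assumptions: the PR's actual `coerce_string_to_integer` is not shown in this conversation and presumably rewrites JSON values in place, so the signature and the `Coerced` enum here are hypothetical.

```rust
/// Result of coercing a string-encoded 64-bit integer (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Coerced {
    Signed(i64),
    Unsigned(u64),
}

/// Tries i64 first so negative `asInt` values parse, then falls back to u64
/// so counts above i64::MAX still parse. Non-numeric input yields None.
fn coerce_string_to_integer(s: &str) -> Option<Coerced> {
    if let Ok(v) = s.parse::<i64>() {
        return Some(Coerced::Signed(v));
    }
    s.parse::<u64>().ok().map(Coerced::Unsigned)
}
```

Trying the signed type first means every value that fits in i64 takes the same path, and only values in the range above `i64::MAX` reach the unsigned fallback.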

How did you test this code?

A new integration test file tests/metrics_test.rs covers:

  • Histogram with string-encoded u64 fields alongside a sum counter (the primary regression case).
  • Histogram with unquoted u64 fields (baseline sanity check).
  • Exponential histogram with string-encoded u64 and timestamp fields.
  • Summary with string-encoded u64 fields.
  • NumberDataPoint.asInt as a JSON string.
  • u64::MAX round-trip via the u64 fallback path in coerce_string_to_integer.
  • Signed boundary values (i64::MAX, i64::MIN, 0, ±1) for asInt.
  • Mixed string/number encoding in the same bucketCounts array.
  • Minimal (field-sparse) exponential histogram and summary payloads that would previously silently drop.
  • Empty exponentialHistogram: {} variant.
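For the minimal-payload cases, the default-injection idea behind the `fill_*_defaults` functions can be sketched with a tiny stand-in for a JSON object. Everything here is an assumption made for illustration: the `Value` enum, `fill_defaults`, `fill_expo_data_point_defaults`, and the particular set of keys. The PR's real functions and the exact fields they inject are not shown in this conversation.

```rust
use std::collections::BTreeMap;

// Tiny stand-in for a JSON value; the real code presumably uses a JSON library.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Value>),
}

/// Inserts each default only when the key is absent; existing values are kept.
fn fill_defaults(obj: &mut BTreeMap<String, Value>, defaults: &[(&str, Value)]) {
    for (key, value) in defaults {
        obj.entry((*key).to_string())
            .or_insert_with(|| value.clone());
    }
}

/// Illustrative defaults for an exponential histogram data point.
/// The key list is an assumption, not the PR's actual list.
fn fill_expo_data_point_defaults(dp: &mut BTreeMap<String, Value>) {
    fill_defaults(
        dp,
        &[
            ("count", Value::Int(0)),
            ("scale", Value::Int(0)),
            ("zeroCount", Value::Int(0)),
            ("flags", Value::Int(0)),
            ("zeroThreshold", Value::Float(0.0)),
            ("attributes", Value::Array(Vec::new())),
            ("exemplars", Value::Array(Vec::new())),
        ],
    );
}
```

Because defaults are only inserted for missing keys, a data point that carries just `timeUnixNano` gains the remaining fields while any value the client did send is left untouched.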

Two tests (edge_negative_value_in_u64_field_should_error, edge_non_numeric_string_in_u64_field_should_error) are marked #[ignore] because the upstream silencing pattern currently prevents them from returning errors. They should be un-ignored once the upstream structure changes.

Publish to changelog?

No


@jonmcwest jonmcwest force-pushed the 05-27-fix_metrics_unblock_otlp_json_histogram_expo_summary_ingestion branch from 7606959 to b46e116 Compare May 27, 2026 15:07
@jonmcwest jonmcwest marked this pull request as ready for review May 27, 2026 15:14
@greptile-apps

greptile-apps Bot commented May 27, 2026

Comments Outside Diff (1)

  1. rust/capture-logs/src/service.rs, line 80-114 (link)

    P2 New coercions apply to log and trace payloads too

    patch_otel_json is shared by parse_otel_message (logs) and parse_otel_traces_message (traces) as well as parse_otel_metrics_message. The new object-level checks for keys count, zeroCount, asInt, and bucketCounts now fire whenever those keys appear anywhere in any OTLP payload type. In practice the risk is very low (these keys don't appear in log/trace schemas), but any future field named count in a log or trace proto would be silently coerced. Worth a comment noting the scope, or alternatively narrowing the coercions to a separate patch_otel_metrics_json function that is only called from parse_otel_metrics_message.
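One way the narrowing suggested here could look, sketched with a stand-in JSON type. This is an illustration under assumptions, not the code in service.rs: `Value`, `Signal`, `patch_for_signal`, and `coerce_in_place` are hypothetical, the shared `patch_otel_json` pass is reduced to a placeholder, and only the i64 path is shown for brevity.

```rust
use std::collections::BTreeMap;

// Stand-in for a JSON value; the real code presumably uses a JSON library.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Int(i64),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

enum Signal {
    Logs,
    Traces,
    Metrics,
}

const METRIC_INTEGER_KEYS: [&str; 3] = ["count", "zeroCount", "asInt"];

/// Replaces a numeric string with an integer; anything else is left alone.
fn coerce_in_place(value: &mut Value) {
    let parsed = match &*value {
        Value::Str(s) => s.parse::<i64>().ok(),
        _ => None,
    };
    if let Some(n) = parsed {
        *value = Value::Int(n);
    }
}

/// Placeholder for the shared pass applied to every signal type.
fn patch_otel_json(_value: &mut Value) {}

/// Metrics-only pass: the key-based coercions live here and nowhere else.
fn patch_otel_metrics_json(value: &mut Value) {
    match value {
        Value::Object(map) => {
            for (key, child) in map.iter_mut() {
                if METRIC_INTEGER_KEYS.contains(&key.as_str()) {
                    coerce_in_place(child);
                } else if key.as_str() == "bucketCounts" {
                    if let Value::Array(items) = child {
                        items.iter_mut().for_each(coerce_in_place);
                    }
                } else {
                    patch_otel_metrics_json(child);
                }
            }
        }
        Value::Array(items) => items.iter_mut().for_each(patch_otel_metrics_json),
        _ => {}
    }
}

/// Entry point: only metrics payloads receive the metrics-specific pass.
fn patch_for_signal(signal: Signal, value: &mut Value) {
    patch_otel_json(value);
    if let Signal::Metrics = signal {
        patch_otel_metrics_json(value);
    }
}
```

With this split, a `count` key that later appears in a log or trace payload would pass through unchanged, at the cost of a second tree walk for metrics.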

Reviews (1): Last reviewed commit: "cargo fmt"

Comment on lines +46 to +59 of rust/capture-logs/tests/metrics_test.rs
    let counter = &metrics[0];
    let histogram = &metrics[1];

    eprintln!("counter.data is_some = {}", counter.data.is_some());
    eprintln!("histogram.data is_some = {}", histogram.data.is_some());
    if let Some(Data::Histogram(h)) = &histogram.data {
        eprintln!("histogram.data_points.len = {}", h.data_points.len());
        if let Some(dp) = h.data_points.first() {
            eprintln!(
                "dp.count={} sum={:?} bucket_counts={:?} explicit_bounds={:?}",
                dp.count, dp.sum, dp.bucket_counts, dp.explicit_bounds
            );
        }
    }

P2 Superfluous eprintln! debug prints in tests

Several tests contain eprintln! calls (e.g. counter.data is_some, histogram.data is_some, dp.count=...) that were clearly left in from development. They produce noise on --nocapture runs and add no assertion value since the asserts below them already capture the failure message. Per the project simplicity rule "has no superfluous parts," these can be removed. The same pattern appears in exponential_histogram_with_string_u64s, summary_with_string_u64s, number_data_point_as_int_string, edge_u64_above_i64_max_round_trips, edge_as_int_signed_boundaries, and histogram_with_unquoted_u64_works.


Comment on lines +395 to +430 of rust/capture-logs/tests/metrics_test.rs
                            "bucketCounts":["1"],
                            "explicitBounds":[]
                        }]
                    }}
                ]
            }]
        }]
    }"#;

    let result = parse_otel_metrics_message(&Bytes::from(json));
    assert!(
        result.is_err(),
        "spec-violating count=\"not-a-number\" must be rejected, not silently dropped"
    );
}

/// Regression for the gap that necessitated the expanded EXPONENTIAL_HISTOGRAM
/// defaults: client sends a minimal spec-valid expo without any of the upstream-
/// undeclared-default fields. Without our defaults, this hard-errors and the
/// metric silently drops.
#[test]
fn edge_expo_with_minimal_fields_only() {
    let json = r#"{
        "resourceMetrics":[{
            "resource":{"attributes":[]},
            "scopeMetrics":[{
                "scope":{"name":"x"},
                "metrics":[
                    {"name":"minimal.expo","exponentialHistogram":{
                        "aggregationTemporality":2,
                        "dataPoints":[{"timeUnixNano":"1700000000000000000"}]
                    }}
                ]
            }]
        }]
    }"#;

P2 Repeated test structure for minimal fields scenario

edge_expo_with_minimal_fields_only and edge_summary_with_minimal_fields_only are structurally identical: send a payload with a single metric variant containing one data point with only timeUnixNano, then assert data is not None. The team preference is parameterised tests. These two (and potentially the three *_with_string_u64s variants) could be expressed as a single table-driven test, keeping the payloads in a &[(&str, fn(&Metric) -> bool)] slice and removing the repeated boilerplate.
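A table-driven shape along these lines might look like the sketch below. It is only an illustration: `Data`, `Metric`, `toy_parse`, `failing_cases`, and `minimal_field_cases` are hypothetical, the summary payload and its name are assumptions, and `toy_parse` merely stands in for `parse_otel_metrics_message` so that the sketch has no dependencies.

```rust
// Hypothetical stand-ins for the crate's types; only the shape matters.
#[derive(Debug)]
enum Data {
    ExponentialHistogram,
    Summary,
}

#[derive(Debug)]
struct Metric {
    data: Option<Data>,
}

type Check = fn(&Metric) -> bool;

const EXPO: &str = r#"{"name":"minimal.expo","exponentialHistogram":{"aggregationTemporality":2,"dataPoints":[{"timeUnixNano":"1700000000000000000"}]}}"#;
const SUMMARY: &str = r#"{"name":"minimal.summary","summary":{"dataPoints":[{"timeUnixNano":"1700000000000000000"}]}}"#;

fn is_expo(m: &Metric) -> bool {
    matches!(m.data, Some(Data::ExponentialHistogram))
}

fn is_summary(m: &Metric) -> bool {
    matches!(m.data, Some(Data::Summary))
}

/// Toy parser: detects the variant key by substring. A real test would call
/// the crate's parser instead.
fn toy_parse(payload: &str) -> Option<Metric> {
    let data = if payload.contains("\"exponentialHistogram\"") {
        Some(Data::ExponentialHistogram)
    } else if payload.contains("\"summary\"") {
        Some(Data::Summary)
    } else {
        None
    };
    Some(Metric { data })
}

/// The table: one row per payload, paired with the predicate it must satisfy.
fn minimal_field_cases() -> Vec<(&'static str, Check)> {
    vec![(EXPO, is_expo as Check), (SUMMARY, is_summary as Check)]
}

/// Runs every row through `parse` and returns the payloads that failed.
fn failing_cases(cases: &[(&str, Check)], parse: fn(&str) -> Option<Metric>) -> Vec<String> {
    let mut failed = Vec::new();
    for (payload, check) in cases {
        let ok = match parse(*payload) {
            Some(metric) => (*check)(&metric),
            None => false,
        };
        if !ok {
            failed.push((*payload).to_string());
        }
    }
    failed
}
```

A single `#[test]` could then assert that `failing_cases(&minimal_field_cases(), parser)` is empty, so adding a new minimal-payload scenario becomes one extra row rather than a new function.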


@DanielVisca DanielVisca requested review from a team and DanielVisca May 27, 2026 20:53

@DanielVisca DanielVisca left a comment


Looks good. When one of these upstream gaps closes, the failing canary will be a good signal. The boundary tests seem thorough too.

Greptile's comments are probably fine to ignore for now. The eprintln! calls only print on test failure (cargo captures stderr by default), and the inner-state diagnostics on failure are worthwhile at this point.

@DanielVisca

Validated locally and it works, so merging!

@DanielVisca DanielVisca merged commit e086842 into master May 27, 2026
194 checks passed
@DanielVisca DanielVisca deleted the 05-27-fix_metrics_unblock_otlp_json_histogram_expo_summary_ingestion branch May 27, 2026 21:40
@deployment-status-posthog

deployment-status-posthog Bot commented May 27, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-05-27 23:50 UTC | Run |
| prod-us | ✅ Deployed | 2026-05-28 00:02 UTC | Run |
| prod-eu | ✅ Deployed | 2026-05-28 00:04 UTC | Run |
