Summary
When one span in an ingest payload has a field that fails Pydantic validation, the entire batch is dropped silently and the endpoint returns 202 Accepted with {"count": 0, "links": []}. From the client's perspective the call succeeded; in reality nothing was persisted.
The existing design is supposed to preserve partial spans by moving invalid fields into ag.unsupported rather than rejecting the whole span. That pattern exists in initialize_ag_attributes but is defeated by two layers above it.
Why keep accepting — rather than validating hard
This endpoint is the OTel write path. We don't want strict validation errors to bounce customer telemetry — the product commitment is "best-effort ingest, keep what we can, put anything we don't understand under unsupported." The design is correct; the implementation has two gaps that turn best-effort ingest into silent total-loss.
Gap 1 — batch-level try/except swallows everything
At api/oss/src/core/tracing/utils/parsing.py:351-362, parse_spans_from_request wraps the whole per-span loop in a single try/except Exception. One bad span raises → span_dtos = [] → every good span in the same batch is lost with it. The router then returns count: 0.
try:
for span_group in spans.values():
...
for span in raw_span_dtos:
span_dtos.extend(_parse_span_from_request(span))
except Exception:
log.error(...)
span_dtos = [] # <-- drops the entire batch
Fix: move the try/except to wrap a single span's parse. Keep what succeeds, drop only what failed (and send the failing span's raw attributes to ag.unsupported where possible).
Gap 2 — _parse_span_from_request assumes ag["metrics"] exists after fallback
At parsing.py:285-290:
if duration_ms is not None:
ag["metrics"]["duration"] = {"cumulative": duration_ms}
And at parsing.py:292-302:
if raw_span.events:
errors = ag["metrics"]["errors"] = {"incremental": 0}
Both assume ag["metrics"] is a dict. But initialize_ag_attributes may have removed metrics entirely (via _sanitize_invalid_ag_fields, which does cleaned_ag.pop(top_key) to move the field to unsupported). When that happens, the two lines above raise KeyError: 'metrics'. That exception then bubbles to Gap 1 and drops the whole batch.
Reproduced locally (/tmp/agenta-test/trace_error.py): a span with metrics.duration.cumulative as a scalar triggers sanitize → metrics gone → KeyError on the next assignment.
Fix: use ag.setdefault("metrics", {}).setdefault("duration", {}) instead of chained index assignment.
Concrete example
Customer script round-trips traces between two Agenta instances via query + ingest. Every span comes back from query with metrics.duration.cumulative = <scalar> (that is a separate bug — see companion issue). On re-ingest: validation fails on that field → _sanitize_invalid_ag_fields moves metrics to unsupported → ag["metrics"]["duration"] = ... KeyError → outer except drops every span in the batch → count: 0. Customer sees "success" and loses all data.
Acceptance
- A span with one invalid field lands in the DB with that field under
ag.unsupported.* and everything else intact.
- A batch of 10 spans where 1 is invalid persists 9 spans and (ideally) reports the failure count in the response or in a structured log.
KeyError on ag["metrics"] is impossible regardless of what initialize_ag_attributes returned.
Out of scope
- Changing the response contract to surface failures to the client (worth doing but breaks OTel semantics of "accept the write"). Could be added as a separate optional
failures field on the response.
Summary
When one span in an ingest payload has a field that fails Pydantic validation, the entire batch is dropped silently and the endpoint returns
202 Acceptedwith{"count": 0, "links": []}. From the client's perspective the call succeeded; in reality nothing was persisted.The existing design is supposed to preserve partial spans by moving invalid fields into
ag.unsupportedrather than rejecting the whole span. That pattern exists ininitialize_ag_attributesbut is defeated by two layers above it.Why keep accepting — rather than validating hard
This endpoint is the OTel write path. We don't want strict validation errors to bounce customer telemetry — the product commitment is "best-effort ingest, keep what we can, put anything we don't understand under
unsupported." The design is correct; the implementation has two gaps that turn best-effort ingest into silent total-loss.Gap 1 — batch-level try/except swallows everything
At
api/oss/src/core/tracing/utils/parsing.py:351-362,parse_spans_from_requestwraps the whole per-span loop in a singletry/except Exception. One bad span raises →span_dtos = []→ every good span in the same batch is lost with it. The router then returnscount: 0.Fix: move the try/except to wrap a single span's parse. Keep what succeeds, drop only what failed (and send the failing span's raw attributes to
ag.unsupportedwhere possible).Gap 2 —
_parse_span_from_requestassumesag["metrics"]exists after fallbackAt
parsing.py:285-290:And at
parsing.py:292-302:Both assume
ag["metrics"]is a dict. Butinitialize_ag_attributesmay have removedmetricsentirely (via_sanitize_invalid_ag_fields, which doescleaned_ag.pop(top_key)to move the field tounsupported). When that happens, the two lines above raiseKeyError: 'metrics'. That exception then bubbles to Gap 1 and drops the whole batch.Reproduced locally (
/tmp/agenta-test/trace_error.py): a span withmetrics.duration.cumulativeas a scalar triggers sanitize → metrics gone → KeyError on the next assignment.Fix: use
ag.setdefault("metrics", {}).setdefault("duration", {})instead of chained index assignment.Concrete example
Customer script round-trips traces between two Agenta instances via query + ingest. Every span comes back from query with
metrics.duration.cumulative = <scalar>(that is a separate bug — see companion issue). On re-ingest: validation fails on that field →_sanitize_invalid_ag_fieldsmovesmetricstounsupported→ag["metrics"]["duration"] = ...KeyError → outer except drops every span in the batch →count: 0. Customer sees "success" and loses all data.Acceptance
ag.unsupported.*and everything else intact.KeyErroronag["metrics"]is impossible regardless of whatinitialize_ag_attributesreturned.Out of scope
failuresfield on the response.