Tracing ingest: per-span validation errors drop the whole batch instead of landing in unsupported #4173

@mmabrouk

Description

Summary

When one span in an ingest payload has a field that fails Pydantic validation, the entire batch is dropped silently and the endpoint returns 202 Accepted with {"count": 0, "links": []}. From the client's perspective the call succeeded; in reality nothing was persisted.

The existing design is supposed to preserve partial spans by moving invalid fields into ag.unsupported rather than rejecting the whole span. That pattern exists in initialize_ag_attributes but is defeated by two layers above it.
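As a sketch (the function and key names here are illustrative, not the actual `initialize_ag_attributes` internals), the intended best-effort pattern looks like:

```python
def quarantine_invalid_fields(ag: dict, invalid_keys: list) -> dict:
    """Move fields that failed validation under ag['unsupported'] so the
    rest of the span survives. Illustrative sketch, not the real helper."""
    unsupported = ag.setdefault("unsupported", {})
    for key in invalid_keys:
        if key in ag:
            unsupported[key] = ag.pop(key)
    return ag
```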

Why keep accepting — rather than validating hard

This endpoint is the OTel write path. We don't want strict validation errors to bounce customer telemetry — the product commitment is "best-effort ingest, keep what we can, put anything we don't understand under unsupported." The design is correct; the implementation has two gaps that turn best-effort ingest into silent total-loss.

Gap 1 — batch-level try/except swallows everything

At api/oss/src/core/tracing/utils/parsing.py:351-362, parse_spans_from_request wraps the whole per-span loop in a single try/except Exception. One bad span raises → span_dtos = [] → every good span in the same batch is lost with it. The router then returns count: 0.

try:
    for span_group in spans.values():
        ...
    for span in raw_span_dtos:
        span_dtos.extend(_parse_span_from_request(span))
except Exception:
    log.error(...)
    span_dtos = []  # <-- drops the entire batch

Fix: move the try/except to wrap a single span's parse. Keep what succeeds, drop only what failed (and send the failing span's raw attributes to ag.unsupported where possible).
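A minimal sketch of that per-span scoping (helper names are placeholders for the real parsing.py internals):

```python
import logging

log = logging.getLogger(__name__)

def parse_spans_best_effort(raw_span_dtos, parse_one):
    """Parse each span independently; one bad span no longer sinks the batch."""
    span_dtos, failed = [], 0
    for span in raw_span_dtos:
        try:
            span_dtos.extend(parse_one(span))
        except Exception:  # noqa: BLE001 -- best-effort ingest by design
            failed += 1
            log.error("dropping one unparseable span, keeping the rest")
    return span_dtos, failed
```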

Gap 2 — _parse_span_from_request assumes ag["metrics"] exists after fallback

At parsing.py:285-290:

if duration_ms is not None:
    ag["metrics"]["duration"] = {"cumulative": duration_ms}

And at parsing.py:292-302:

if raw_span.events:
    errors = ag["metrics"]["errors"] = {"incremental": 0}

Both assume ag["metrics"] is a dict. But initialize_ag_attributes may have removed metrics entirely (via _sanitize_invalid_ag_fields, which does cleaned_ag.pop(top_key) to move the field to unsupported). When that happens, the two lines above raise KeyError: 'metrics'. That exception then bubbles to Gap 1 and drops the whole batch.
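The failure mode can be shown in isolation (the sanitize step is paraphrased here; the real helper is `_sanitize_invalid_ag_fields`):

```python
# A span whose metrics failed validation: sanitize pops the whole key.
ag = {"metrics": "12ms"}  # scalar where a dict was expected
ag.setdefault("unsupported", {})["metrics"] = ag.pop("metrics")

# The later assignment then assumes ag["metrics"] still exists:
try:
    ag["metrics"]["duration"] = {"cumulative": 12.0}
except KeyError as exc:
    print(f"KeyError: {exc}")  # this is what bubbles up to Gap 1
```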

Reproduced locally (/tmp/agenta-test/trace_error.py): a span with metrics.duration.cumulative as a scalar triggers sanitize → metrics gone → KeyError on the next assignment.

Fix: write through ag.setdefault("metrics", {})["duration"] = {"cumulative": duration_ms} (and likewise for errors) instead of indexing into a key that sanitization may have removed.
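Applied to the two sites above, the defensive write could look like this (the helper name is hypothetical; both assignments are condensed into one sketch):

```python
def write_metrics(ag: dict, duration_ms, has_events: bool) -> None:
    """Recreate ag['metrics'] if sanitization removed it, then write into it."""
    if duration_ms is not None:
        ag.setdefault("metrics", {})["duration"] = {"cumulative": duration_ms}
    if has_events:
        ag.setdefault("metrics", {})["errors"] = {"incremental": 0}
```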

Concrete example

Customer script round-trips traces between two Agenta instances via query + ingest. Every span comes back from query with metrics.duration.cumulative = <scalar> (that is a separate bug; see the companion issue). On re-ingest: validation fails on that field → _sanitize_invalid_ag_fields moves metrics to unsupported → the assignment ag["metrics"]["duration"] = ... raises KeyError → the outer except drops every span in the batch → count: 0. The customer sees "success" and loses all data.

Acceptance

  • A span with one invalid field lands in the DB with that field under ag.unsupported.* and everything else intact.
  • A batch of 10 spans where 1 is invalid persists 9 spans and (ideally) reports the failure count in the response or in a structured log.
  • KeyError on ag["metrics"] is impossible regardless of what initialize_ag_attributes returned.

Out of scope

  • Changing the response contract to surface failures to the client (worth doing but breaks OTel semantics of "accept the write"). Could be added as a separate optional failures field on the response.
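If that optional field were ever added, the shape might look like this (hypothetical, not an agreed contract):

```python
# Hypothetical response for a 10-span batch with 1 unparseable span.
# Clients that ignore unknown fields see the same 202 contract as today.
response = {
    "count": 9,
    "links": [],    # elided; one link per persisted span
    "failures": 1,  # optional; omitted when every span parsed
}
```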
