Tracing ingest: per-span validation errors drop the whole batch instead of landing in unsupported #4173

@mmabrouk

Description

Summary

When one span in an ingest payload has a field that fails Pydantic validation, the entire batch is dropped silently and the endpoint returns 202 Accepted with {"count": 0, "links": []}. From the client's perspective the call succeeded; in reality nothing was persisted.

The existing design is supposed to preserve partial spans by moving invalid fields into ag.unsupported rather than rejecting the whole span. That pattern exists in initialize_ag_attributes but is defeated by two layers above it.
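As a sketch (the function and key names here are illustrative, not the actual `initialize_ag_attributes` internals), the intended best-effort pattern looks like:

```python
def quarantine_invalid_fields(ag: dict, invalid_keys: list) -> dict:
    """Move fields that failed validation under ag['unsupported'] so the
    rest of the span survives. Illustrative sketch, not the real helper."""
    unsupported = ag.setdefault("unsupported", {})
    for key in invalid_keys:
        if key in ag:
            unsupported[key] = ag.pop(key)
    return ag
```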

Why keep accepting — rather than validating hard

This endpoint is the OTel write path. We don't want strict validation errors to bounce customer telemetry — the product commitment is "best-effort ingest, keep what we can, put anything we don't understand under unsupported." The design is correct; the implementation has two gaps that turn best-effort ingest into silent total-loss.

Gap 1 — batch-level try/except swallows everything

At api/oss/src/core/tracing/utils/parsing.py:351-362, parse_spans_from_request wraps the whole per-span loop in a single try/except Exception. One bad span raises → span_dtos = [] → every good span in the same batch is lost with it. The router then returns count: 0.

try:
    for span_group in spans.values():
        ...
    for span in raw_span_dtos:
        span_dtos.extend(_parse_span_from_request(span))
except Exception:
    log.error(...)
    span_dtos = []  # <-- drops the entire batch

Fix: move the try/except to wrap a single span's parse. Keep what succeeds, drop only what failed (and send the failing span's raw attributes to ag.unsupported where possible).
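A minimal sketch of that per-span scoping (helper names are placeholders for the real parsing.py internals):

```python
import logging

log = logging.getLogger(__name__)

def parse_spans_best_effort(raw_span_dtos, parse_one):
    """Parse each span independently; one bad span no longer sinks the batch."""
    span_dtos, failed = [], 0
    for span in raw_span_dtos:
        try:
            span_dtos.extend(parse_one(span))
        except Exception:  # noqa: BLE001 -- best-effort ingest by design
            failed += 1
            log.error("dropping one unparseable span, keeping the rest")
    return span_dtos, failed
```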

Gap 2 — _parse_span_from_request assumes ag["metrics"] exists after fallback

At parsing.py:285-290:

if duration_ms is not None:
    ag["metrics"]["duration"] = {"cumulative": duration_ms}

And at parsing.py:292-302:

if raw_span.events:
    errors = ag["metrics"]["errors"] = {"incremental": 0}

Both assume ag["metrics"] is a dict. But initialize_ag_attributes may have removed metrics entirely (via _sanitize_invalid_ag_fields, which does cleaned_ag.pop(top_key) to move the field to unsupported). When that happens, the two lines above raise KeyError: 'metrics'. That exception then bubbles to Gap 1 and drops the whole batch.
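The failure mode can be shown in isolation (the sanitize step is paraphrased here; the real helper is `_sanitize_invalid_ag_fields`):

```python
# A span whose metrics failed validation: sanitize pops the whole key.
ag = {"metrics": "12ms"}  # scalar where a dict was expected
ag.setdefault("unsupported", {})["metrics"] = ag.pop("metrics")

# The later assignment then assumes ag["metrics"] still exists:
try:
    ag["metrics"]["duration"] = {"cumulative": 12.0}
except KeyError as exc:
    print(f"KeyError: {exc}")  # this is what bubbles up to Gap 1
```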

Reproduced locally (/tmp/agenta-test/trace_error.py): a span with metrics.duration.cumulative as a scalar triggers sanitize → metrics gone → KeyError on the next assignment.

Fix: write through ag.setdefault("metrics", {})["duration"] = {"cumulative": duration_ms} (and likewise for errors) instead of indexing into a key that sanitization may have removed.
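Applied to the two sites above, the defensive write could look like this (the helper name is hypothetical; both assignments are condensed into one sketch):

```python
def write_metrics(ag: dict, duration_ms, has_events: bool) -> None:
    """Recreate ag['metrics'] if sanitization removed it, then write into it."""
    if duration_ms is not None:
        ag.setdefault("metrics", {})["duration"] = {"cumulative": duration_ms}
    if has_events:
        ag.setdefault("metrics", {})["errors"] = {"incremental": 0}
```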

Concrete example

Customer script round-trips traces between two Agenta instances via query + ingest. Every span comes back from query with metrics.duration.cumulative = <scalar> (that is a separate bug; see the companion issue). On re-ingest: validation fails on that field → _sanitize_invalid_ag_fields moves metrics to unsupported → the assignment ag["metrics"]["duration"] = ... raises KeyError → the outer except drops every span in the batch → count: 0. The customer sees "success" and loses all data.

Acceptance

  • A span with one invalid field lands in the DB with that field under ag.unsupported.* and everything else intact.
  • A batch of 10 spans where 1 is invalid persists 9 spans and (ideally) reports the failure count in the response or in a structured log.
  • KeyError on ag["metrics"] is impossible regardless of what initialize_ag_attributes returned.

Out of scope

  • Changing the response contract to surface failures to the client (worth doing but breaks OTel semantics of "accept the write"). Could be added as a separate optional failures field on the response.
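If that optional field were ever added, the shape might look like this (hypothetical, not an agreed contract):

```python
# Hypothetical response for a 10-span batch with 1 unparseable span.
# Clients that ignore unknown fields see the same 202 contract as today.
response = {
    "count": 9,
    "links": [],    # elided; one link per persisted span
    "failures": 1,  # optional; omitted when every span parsed
}
```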
