Skip to content

[FE Fix] hide string-type evaluator metrics from scenario table & filters#4477

Merged
mmabrouk merged 9 commits into
release/v0.100.6from
fe-fix/eval-scenarios-table-metric-columns-fix
May 28, 2026
Merged

[FE Fix] hide string-type evaluator metrics from scenario table & filters#4477
mmabrouk merged 9 commits into
release/v0.100.6from
fe-fix/eval-scenarios-table-metric-columns-fix

Conversation

@ardaerzin
Copy link
Copy Markdown
Contributor

Summary

String-typed evaluator outputs (e.g. an LLM-judge reasoning field) resolve through the metric layer as {type: "string", count, ...} stats blobs. unwrapStatsForCompare only unwraps binary and numeric variants, so the raw stats object falls through to JSON.stringify in the cell renderer and is shown as {"type":"string","count":...} in every row. The same blob makes filter predicates impossible — equality checks would compare a user string against a stats object and never match.

Drop string-typed leaves at column-build time in useEtlColumns and from the filter dropdown in ScenarioFilterBar by reusing the existing buildColumnValueTypeResolver (the metricType-from-evaluator-schema lookup that already powers the filter bar's value-type detection). The focus drawer is unaffected — it reads columnResult.groups/columns directly via useFocusDrawerSections, bypassing both code paths, so string metrics still appear there.

Testing

Verified locally

reasoning column is not shown

QA follow-up

  • test different output metrics

Checklist

  • I have included a video or screen recording for UI changes, or marked Demo as N/A
  • Relevant tests pass locally
  • Relevant linting and formatting pass locally
  • I have signed the CLA, or I will sign it when the bot prompts me

Contributor Resources

… and filter

String-typed evaluator outputs (e.g. an LLM-judge `reasoning` field) resolve
through the metric layer as `{type: "string", count, ...}` stats blobs.
`unwrapStatsForCompare` only unwraps `binary` and `numeric` variants, so the
raw stats object falls through to `JSON.stringify` in the cell renderer and is
shown as `{"type":"string","count":...}` in every row. The same blob makes
filter predicates impossible — equality checks would compare a user string
against a stats object and never match.

Drop string-typed leaves at column-build time in `useEtlColumns` and from the
filter dropdown in `ScenarioFilterBar` by reusing the existing
`buildColumnValueTypeResolver` (the metricType-from-evaluator-schema lookup
that already powers the filter bar's value-type detection). The focus drawer
is unaffected — it reads `columnResult.groups/columns` directly via
`useFocusDrawerSections`, bypassing both code paths, so string metrics still
appear there.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment May 28, 2026 8:35pm

Request Review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Review Change Stack

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7b6e44b2-c4cc-4356-b134-0184576aa150

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

String-typed evaluator outputs in ETL evaluation tables are now detected as backend placeholders and resolved from trace annotations; EtlResolvedCell materializes all referenced traces and ScenarioFilterBar excludes string-typed evaluator fields from the filter UI.

Changes

String-type evaluator field handling

Layer / File(s) Summary
String-type metric placeholder resolver
web/packages/agenta-entities/src/evaluationRun/etl/resolveMappings.ts, web/packages/agenta-entities/src/evaluationRun/etl/__tests__/resolveMappings.test.ts
A new isStringTypeMetricPlaceholder helper recognizes {type: "string", count: N} placeholder shapes. resolveFromMetric returns null for such placeholders so composed resolvers fall back to trace-based annotation resolution. Two tests verify fallback when metric is a placeholder and non-fallback when metric includes distribution (freq).
All-trace materialization & clamping
web/oss/src/components/EvalRunDetails/etl/cells/EtlResolvedCell.tsx, web/oss/src/components/EvalRunDetails/etl/useEtlColumns.tsx
EtlResolvedCell now depends on the traces slice for evaluator columns, collects all trace IDs from cached results, requests missing trace entities, and keeps the cell loading until every required trace entity is present (or failed). Per-row clamping constants are retuned and column-level ellipsis is removed in favor of cell-level clamping.
Filter UI hiding for string-type evaluators
web/oss/src/components/EvalRunDetails/etl/ScenarioFilterBar.tsx
ScenarioFilterBar filters the field list to exclude evaluator fields with valueType: "string", preventing generation of UI filter rows for those evaluator outputs.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: hiding string-type evaluator metrics from both the scenario table and filters, which aligns with the primary objectives and changeset.
Description check ✅ Passed The description is well-related to the changeset, explaining the problem (string-typed evaluator outputs rendered as JSON blobs), the solution (drop them from columns and filters), and testing/verification steps.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fe-fix/eval-scenarios-table-metric-columns-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5ca5e55f-0f27-4e12-a419-219b62617f7e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9012d and 42ac683.

📒 Files selected for processing (3)
  • web/oss/src/components/EvalRunDetails/Table.tsx
  • web/oss/src/components/EvalRunDetails/etl/ScenarioFilterBar.tsx
  • web/oss/src/components/EvalRunDetails/etl/useEtlColumns.tsx

Comment thread web/oss/src/components/EvalRunDetails/etl/useEtlColumns.tsx Outdated
@ardaerzin ardaerzin marked this pull request as ready for review May 28, 2026 16:33
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. Frontend labels May 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

Railway Preview Environment

Status Destroyed (PR closed)

Updated at 2026-05-28T20:36:18.635Z

`buildColumnValueTypeResolver` falls back to a column-name-only lookup, so
without a kind guard a same-named testset / application / metrics column
could inherit an evaluator output's string `metricType` and be incorrectly
hidden from the scenario table and filter dropdown. Restrict the
string-suppression check to `evaluator` leaves in both `useEtlColumns` and
`ScenarioFilterBar`.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Copy link
Copy Markdown
Member

@mmabrouk mmabrouk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ardaerzin

The fix is not correct from a product level. We should not hide the evaluator reasoning column from the table, just show the string like we used to.

I am working on the other fixes, I will merge as is if no update until then, we can fix it later

Per product feedback: hiding string-type evaluator outputs from the scenario
table is the wrong move. The column should stay visible and render the
actual string (rendering fix lands separately). The filter dropdown should
keep these fields selectable for the same reason.

Reverts the table-column filter from `useEtlColumns` and the
filter-dropdown filter from `ScenarioFilterBar`, restoring both files to
their main-branch state. Drawer behavior is unchanged either way (it reads
`columnResult` directly).
@dosubot dosubot Bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 28, 2026
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels May 28, 2026
…able

String-typed evaluator outputs (e.g. an LLM-judge's `reasoning` field) only
land in the metric layer as a `{type: "string", count: N}` placeholder —
the backend can't build a distribution over arbitrary text, so it only
records that *some* string was emitted. The actual string lives on the
annotation trace; the focus drawer already resolves it from there. The
scenario table was hitting the placeholder via `resolveFromMetric` first,
short-circuiting the composed `resolveFromTrace` fallback, and rendering
the raw stats blob via `JSON.stringify`.

Three coordinated changes:

- `resolveFromMetric` (agenta-entities/evaluationRun/etl) detects the
  bare `{type: "string", count}` placeholder shape and returns `null` so
  the composed `resolveFromTrace` resolver picks the real string out of
  the annotation trace. Distribution-bearing `{type: "string", freq: …}`
  shapes are returned as-is (they carry real data). Covered by two new
  unit tests in `resolveMappings.test.ts`.
- `EtlResolvedCell.SLICES_BY_KIND.evaluator` now includes `"traces"` so
  the annotation trace gets materialized for evaluator cells. Both the
  cell-level materializer request and the slice-loading check now
  iterate every result's `trace_id`, not just the first — a scenario
  carries multiple traces (invocation, annotation, …) and the
  annotation isn't always result[0].
- Reverts the column-build hiding in `useEtlColumns` + the
  `columnResult` plumbing in `Table.tsx` introduced earlier. The column
  stays visible and renders the actual string now that the resolver
  chain works correctly.

Filter dropdown hiding in `ScenarioFilterBar` is intentionally kept —
filtering on free-form LLM-generated reasoning isn't a useful product
operation.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

…uncation

Two coordinated fixes so long strings (e.g. an evaluator's `reasoning`)
fill the cell across multiple lines and truncate cleanly at the bottom:

- Drop `ellipsis: true` from the `useEtlColumns` column spec. Antd
  applies `white-space: nowrap` when it is set, forcing all cell content
  onto a single line and overriding the `EtlResolvedCell` body's
  `-webkit-line-clamp`. The column header has its own ellipsis-ing
  inside the `Tooltip` span, so the header is unaffected.

- Re-tune `MAX_LINES_BY_HEIGHT` in `EtlResolvedCell` to match what
  actually fits inside `.scenario-table-cell` at each row-height
  variant (3 / 6 / 12 lines, down from 4 / 9 / 18). With the previous
  values the clamp point sat past the parent's `overflow: hidden` cut,
  so the ellipsis was never visible. Matching the line count to the
  visible area places the ellipsis on the last fully-rendered line.
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 28, 2026
@ardaerzin
Copy link
Copy Markdown
Contributor Author

Thanks @ardaerzin

The fix is not correct from a product level. We should not hide the evaluator reasoning column from the table, just show the string like we used to.

I am working on the other fixes, I will merge as is if no update until then, we can fix it later

done

Screenshot 2026-05-28 at 22 23 44

@ardaerzin ardaerzin requested a review from mmabrouk May 28, 2026 20:25
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

`computeColumnGroup`'s default label for an evaluator group is
`slugToTitle(evaluatorSlug)`, which leaks the slug shape into the header
(e.g. "with-reasoning-jifn" rendered as "With Reasoning Jifn") whenever a
real evaluator name is known. The testset and application headers were
already resolving real entity names via reference query atoms; the
evaluator branch fell through to the slug-derived fallback.

`EtlColumnHeader` now also subscribes to
`evaluationEvaluatorsByRunQueryAtomFamily(runId)` and, for evaluator-kind
groups, matches by `refs.evaluator.id` first and then by slug
(`refs.evaluator.slug` / `refs.evaluator_revision.slug`), surfacing the
evaluator's real `name`. Falls back to the existing slug-titled label when
no match is found (run not loaded yet, ad-hoc evaluator with no
definition, etc). `useEtlColumns` now threads `runId` into each header.
@mmabrouk mmabrouk changed the base branch from main to release/v0.100.6 May 28, 2026 20:30
Headers now read as self-describing pairs:
  "Testset: completion_testset"
  "Application: comp-1"
  "Evaluator: With Reasoning"

Each kind branch builds the label from `Kind: ` + the resolved entity
name. The slug-derived fallback for testset / application reads from
`group.slug` (not `group.label`) — the default `computeColumnGroup` label
already embeds the kind word, so reusing it here would render
"Testset: Testset completion-tst".
@mmabrouk mmabrouk merged commit b8f48ac into release/v0.100.6 May 28, 2026
20 of 21 checks passed
@mmabrouk
Copy link
Copy Markdown
Member

Thanks @ardaerzin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Frontend size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants