[FE Feat] Evaluation scenario filtering by ardaerzin · Pull Request #4405 · Agenta-AI/agenta

ardaerzin · 2026-05-22T16:23:21Z

Summary

Implements filtering for evaluation run scenarios.

Testing

Verified locally

filtering works for evaluator metrics

QA follow-up

testing filtering with:

large runs
comparisons

Demo

loom

Checklist

I have included a video or screen recording for UI changes, or marked Demo as N/A
Relevant tests pass locally
Relevant linting and formatting pass locally
I have signed the CLA, or I will sign it when the bot prompts me


vercel · 2026-05-22T16:23:27Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	May 26, 2026 4:50pm

coderabbitai · 2026-05-22T16:23:30Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b2b7eb6-dab0-46c3-99fe-69b804d5f854

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


…wed) Phased plan to productionize the EtlPocScenarios PoC into the real EvalRunDetails scenarios table: thin store, schema columns, self-hydrating cells, ETL filtering, comparison, live updates; retire the PoC. Eng review: 5 outside-voice gaps folded in (non-terminal rendering unbuilt, useEtlColumns drops "other" columns, perf premise unmeasured → perf gate, T5 comparison is a build, CSV export path missed). Design review: focused on interaction states + filter UX (5/10 → 9/10).

Reading evaluationPreviewTableStore.ts confirmed it is already a thin store: PreviewTableRow carries only identity + testcaseId + status + scenarioIndex + comparison fields (no column data), and already does per-eval-type window order. T1 ("promote a thin store") was lateral churn — re-implementing an existing store. Phase 1 collapses to the coupled T2+T3 column+cell swap against the existing store. Confirms the eng-review outside voice's finding.

Reading Table.tsx showed the CSV export path (exportResolveValue, columnLookupMap, loadAllPagesBeforeExport) is keyed off columnResult column ids, which differ from useEtlColumns keys. Phase 1 swaps display columns only; usePreviewColumns/columnResult stay alive for export until Phase 3 (T5). The "other"-column un-drop ripples into ColumnLeaf, EtlResolvedCell, and useCellMaterialization.

Phase 1 (T2+T3) of eval-scenarios-table-integration. The eval run scenarios table now derives its schema columns from the run graph and renders cells that self-hydrate from molecule caches, replacing the backend-metadata + per-cell-fetch display path.
- groupRunColumns: pure mapping-grouping in @agenta/entities etl; keeps "other"-kind columns (the PoC dropped them)
- EvalRunDetails/etl/: useEtlColumns, useHydrateScenarios, run-aware useCellMaterialization, useScopeChangeEviction, EtlColumnHeader, EtlResolvedCell
- EtlResolvedCell renders pending/running/failed scenarios distinctly and distinguishes slice-not-hydrated from scenario-not-run
- Table.tsx swaps only the display columns; usePreviewColumns / columnResult stay alive for the CSV export path (Phase 3)
- column-parity regression test for groupRunColumns

Phase 2 filtering ships multi-condition AND/OR from day 1, not the PoC's single predicate. Records the decision, generalises the planned predicate type to a flat condition group, and marks T2+T3 as shipped.

…rSchema

Pre-stages the Phase 2 (T4) filtering core: pure logic, decoupled from the D5 perf gate that gates T4 wiring.
- PredicateGroup (flat AND/OR, one nesting level) + RowFilter type; evaluateRowPredicate / evaluatePredicateGroup / evaluateRowFilter / matchesRowFilter row-level evaluators; makePredicateGroupFilter pipeline transform. makeRowPredicateFilter left intact.
- predicateToEntitySlices accepts a PredicateGroup; the slice set is the union of every condition's slices (AND vs OR does not change the fetch).
- buildFilterSchema: derives filterable fields + type-matched operators from the run schema, with a resolveValueType refinement seam.
- 25 unit tests (predicateGroup, filterSchema).
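The flat condition group described above could be modelled roughly as follows. This is a minimal sketch, not the PR's code: the field names (`op`, `conditions`, `column`, `operator`, `value`), the operator set, and the treatment of missing values are assumptions for illustration. Only the names `PredicateGroup`, `evaluateRowPredicate`, and `evaluatePredicateGroup` come from the commit message, and their real signatures may differ.

```typescript
// Hypothetical shapes for illustration; the real types in the PR may differ.
type Scalar = string | number | boolean;
type Operator = "eq" | "neq" | "gt" | "lt";

interface RowPredicate {
    column: string;
    operator: Operator;
    value: Scalar;
}

// Flat group: a single join operator shared by every condition, no nesting.
interface PredicateGroup {
    op: "and" | "or";
    conditions: RowPredicate[];
}

type Row = Record<string, Scalar | undefined>;

function evaluateRowPredicate(row: Row, predicate: RowPredicate): boolean {
    const cell = row[predicate.column];
    if (cell === undefined) return false; // assumption: an unknown value never matches
    switch (predicate.operator) {
        case "eq":
            return cell === predicate.value;
        case "neq":
            return cell !== predicate.value;
        case "gt":
            return typeof cell === "number" && typeof predicate.value === "number" && cell > predicate.value;
        case "lt":
            return typeof cell === "number" && typeof predicate.value === "number" && cell < predicate.value;
        default:
            return false;
    }
}

function evaluatePredicateGroup(row: Row, group: PredicateGroup): boolean {
    // Assumption: an empty group means "no filter", so every row matches.
    if (group.conditions.length === 0) return true;
    const check = (p: RowPredicate) => evaluateRowPredicate(row, p);
    return group.op === "and" ? group.conditions.every(check) : group.conditions.some(check);
}
```

Because the group has a single `op`, switching between AND and OR changes only how the per-condition results are combined, which is consistent with the note that the set of slices to fetch is the same either way.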

Two regressions from the Phase 1 ETL column swap, both because ETL columns derive from the run's raw mappings, not the curated backend column set:
- `testcase_dedup_id` (any `*_dedup_id` column) was rendered. These are internal dedup keys, not user-facing; `groupRunColumns` now drops them, matching the backend-metadata column path.
- the static invocation-metrics group (cost / duration / tokens) disappeared. `useEtlColumns` now skips metrics-kind groups and `Table.tsx` keeps the production metric group(s), rendered by the existing metric cell.
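The dedup-key rule from the first regression can be sketched as a simple name filter. The `RunColumn` shape and the helper names below are hypothetical; only the `*_dedup_id` naming pattern is taken from the commit message, and the real `groupRunColumns` may apply it differently.

```typescript
// Hypothetical column shape; only the `*_dedup_id` naming rule comes from the commit message.
interface RunColumn {
    name: string;
    kind: string;
}

// Internal dedup keys end in `_dedup_id` and are not user-facing.
const isDedupColumn = (column: RunColumn): boolean => column.name.endsWith("_dedup_id");

function dropInternalColumns(columns: RunColumn[]): RunColumn[] {
    return columns.filter((column) => !isDedupColumn(column));
}
```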

Wires Phase 2 filtering into the real scenarios table (multi-condition AND/OR, decision D8), not the /etl-poc page.
- ScenarioFilterBar: a filter bar in the table header, with column / operator / value per condition, AND/OR join, add / remove / clear. Columns and type-matched operators come from buildFilterSchema; a name heuristic refines value types for the common cases.
- scenarioFilterState: per-run PredicateGroup atom; half-built conditions are dropped from the evaluated filter.
- useScenarioFilter: filters the base rows via evaluateRowFilter against molecule-cache-resolved columns, with a viewport-fill loop so a strict filter still fills the viewport; unhydrated rows stay visible until known; the confirmed-match count gates the loop.
- Table.tsx: feeds filtered base rows into the merge, drives predicate-aware hydration, remounts the table on filter change, guards cell renders against the transient undefined record.
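The viewport-fill loop mentioned for useScenarioFilter can be illustrated with a small sketch. Everything here is hypothetical (`fillViewport`, `Page`, `FillResult`, and the synchronous `loadPage` callback); the real hook is presumably asynchronous and tied to hydration. The point is only the control flow: keep loading pages until enough confirmed matches exist or the data runs out.

```typescript
interface Page<T> {
    rows: T[];
    hasMore: boolean;
}

interface FillResult<T> {
    matched: T[];
    pagesLoaded: number;
    hasMore: boolean;
}

// Hypothetical helper: `loadPage` is synchronous here to keep the sketch short;
// in a real table it would be an async page fetch plus hydration.
function fillViewport<T>(
    loadPage: (pageIndex: number) => Page<T>,
    matches: (row: T) => boolean,
    targetMatches: number,
): FillResult<T> {
    const matched: T[] = [];
    let pagesLoaded = 0;
    let hasMore = true;
    // The confirmed-match count gates the loop: stop once the viewport is
    // full or the dataset is exhausted.
    while (hasMore && matched.length < targetMatches) {
        const page = loadPage(pagesLoaded);
        for (const row of page.rows) {
            if (matches(row)) matched.push(row);
        }
        hasMore = page.hasMore;
        pagesLoaded += 1;
    }
    return {matched, pagesLoaded, hasMore};
}
```

With a strict filter that matches one row per page, a target of three matches loads three pages and then stops, even though more pages remain.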

Only evaluator-output and metric columns are offered in the filter bar for v1 — testset (input) and application (output) columns are withheld behind a UI allowlist (FILTERABLE_COLUMN_KINDS). The filter engine and buildFilterSchema still support every column kind, so enabling the rest later is a one-line flip — no structural change.
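A UI allowlist of this kind might look like the sketch below. The kind names and the `FilterableColumn` shape are assumptions; only the constant name `FILTERABLE_COLUMN_KINDS` appears in the commit message.

```typescript
// Hypothetical kind names; only FILTERABLE_COLUMN_KINDS is named in the commit message.
type ColumnKind = "testset" | "application" | "evaluator" | "metric";

interface FilterableColumn {
    id: string;
    kind: ColumnKind;
}

// v1 allowlist: adding "testset" or "application" here would expose those columns.
const FILTERABLE_COLUMN_KINDS: ReadonlySet<ColumnKind> = new Set<ColumnKind>(["evaluator", "metric"]);

function columnsOfferedInFilterBar(columns: FilterableColumn[]): FilterableColumn[] {
    return columns.filter((column) => FILTERABLE_COLUMN_KINDS.has(column.kind));
}
```

Keeping the restriction in the UI layer, rather than in the filter engine, is what makes enabling the remaining kinds a small change.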

The filter bar typed columns with a name heuristic, so a boolean evaluator output (e.g. an LLM-judge field) was offered numeric comparators and a numeric value input. Column value types now come from the evaluator output schema: the columnResult column `metricType` — itself read from each output property's JSON-schema `type` by `extractMetrics` — maps to the filter value type via `buildColumnValueTypeResolver`. A boolean column now gets only equals / not-equals and a true/false input; a numeric one gets the comparators. The name heuristic is removed.
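The mapping from a schema-derived `metricType` to a filter value type and its operators can be sketched as below. The function names, the operator identifiers, and the string operators are assumptions; the commit message only states that boolean columns get equals / not-equals and numeric columns get the comparators. This is not the actual `buildColumnValueTypeResolver`.

```typescript
type FilterValueType = "boolean" | "number" | "string";
type FilterOperator = "eq" | "neq" | "gt" | "gte" | "lt" | "lte" | "contains";

// Assumption: metricType carries the JSON-schema `type` of the output property.
function toFilterValueType(metricType: string | undefined): FilterValueType {
    if (metricType === "boolean") return "boolean";
    if (metricType === "number" || metricType === "integer") return "number";
    return "string"; // fallback for unknown or missing types
}

function operatorsFor(valueType: FilterValueType): FilterOperator[] {
    switch (valueType) {
        case "boolean":
            return ["eq", "neq"];
        case "number":
            return ["eq", "neq", "gt", "gte", "lt", "lte"];
        default:
            return ["eq", "neq", "contains"]; // hypothetical string operators
    }
}
```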

extractMetrics typed an output property only when `schema.type` was a plain string, so a nullable field (`type: ["boolean", "null"]` or `anyOf: [{type: "boolean"}, {type: "null"}]`) fell back to "string" — which made the scenario filter bar offer a boolean field a text input instead of true/false. resolveSchemaType (new, dependency-free module) unwraps array and anyOf/oneOf nullable encodings to the first non-"null" type.
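The unwrapping described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the PR's `resolveSchemaType`: the `JsonSchemaLike` shape is hypothetical and the actual module may handle more cases.

```typescript
// Hypothetical minimal schema shape covering the encodings named in the commit message.
interface JsonSchemaLike {
    type?: string | string[];
    anyOf?: JsonSchemaLike[];
    oneOf?: JsonSchemaLike[];
}

function resolveSchemaType(schema: JsonSchemaLike | undefined): string | undefined {
    if (!schema) return undefined;
    const {type} = schema;
    // Plain string type: nothing to unwrap.
    if (typeof type === "string") return type;
    // Array encoding, e.g. ["boolean", "null"]: take the first non-"null" entry.
    if (Array.isArray(type)) return type.find((t) => t !== "null");
    // anyOf / oneOf encoding: take the first variant that resolves to a non-"null" type.
    const variants = [...(schema.anyOf ?? []), ...(schema.oneOf ?? [])];
    for (const variant of variants) {
        const resolved = resolveSchemaType(variant);
        if (resolved !== undefined && resolved !== "null") return resolved;
    }
    return undefined;
}
```

With this, `{type: ["boolean", "null"]}` and `{anyOf: [{type: "boolean"}, {type: "null"}]}` both resolve to `"boolean"`, so a nullable boolean field would no longer fall back to a text input.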

The filtered row list kept unhydrated rows "visible until known", so it grew and shrank as hydration revealed non-matches — flickering the table between rows and the empty state until a page finished materializing. filteredBaseRows now holds confirmed matches only: a row appears once it is hydrated AND matches, so the list only ever grows during a scan. Rows materialize as their data arrives, with no show-then-drop. The full empty + loading overlay shows only until the first match lands; after that a "N matches · scanning…" indicator in the filter bar covers the rest of the scan.
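The "confirmed matches only" rule can be sketched as a pure function over the rows hydrated so far. The name `confirmedMatches` and its parameters are hypothetical; the real `filteredBaseRows` is derived inside the hook.

```typescript
// Hypothetical helper: a row is listed only once it is hydrated AND matches.
function confirmedMatches<T>(
    rowIds: string[],
    hydrated: ReadonlyMap<string, T>,
    matches: (data: T) => boolean,
): string[] {
    return rowIds.filter((id) => {
        const data = hydrated.get(id);
        // Unhydrated rows are excluded rather than shown optimistically.
        return data !== undefined && matches(data);
    });
}
```

As more rows hydrate, the result can only gain entries for a fixed filter, which is what removes the show-then-drop flicker.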

The always-visible inline filter conditions wrapped and grew the header row, pushing the table down — worse with every condition added. Following the observability `Filters` pattern, the conditions now live in a popover behind a compact "Filters" button (with an active-condition count badge); the strip itself is a single fixed-height row. Edits are staged in a draft and committed on Apply — Cancel/outside-click discards them — so the table is not re-scanned on every keystroke.
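The draft-then-Apply behaviour can be sketched with a small holder class. `StagedDraft` and its method names are hypothetical; the PR presumably keeps this state in React component state or atoms.

```typescript
// Hypothetical holder: edits go to a draft and only reach the applied value on apply().
class StagedDraft<T> {
    private committed: T;
    private draft: T;

    constructor(initial: T) {
        this.committed = initial;
        this.draft = initial;
    }

    applied(): T {
        return this.committed;
    }

    pending(): T {
        return this.draft;
    }

    edit(next: T): void {
        this.draft = next; // no re-scan yet: the applied value is untouched
    }

    apply(): T {
        this.committed = this.draft;
        return this.committed;
    }

    cancel(): void {
        this.draft = this.committed; // Cancel or outside-click discards the draft
    }
}
```

Only `apply()` changes the value the table filters on, so typing in the popover does not trigger a re-scan.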

Replaces the top-level "Match All/Any" segmented control with a row-level connector: the first row reads "Where", every later row shows a borderless ghost-style And/Or select. The group has a single op (flat group, D8), so the connectors stay in sync — toggling any one sets the group op.

The "Where" label and the And/Or connector select sat directly in the flex row with their own width classes, which resolved to different widths — so the first row's Column select didn't line up with the rows below. Both connectors now sit inside one shared fixed-width slot, so every row's Column select starts at the same x.

The 64px connector slot truncated the default-size And/Or select to "A…". Widened to 80px so the connector shows in full.

The scanning indicator gated on `hasMore`, so it stayed on forever once the viewport-fill loop had stopped (it stops at the match target) — the dataset still "has more" pages, but the loop is idle. useScenarioFilter now exposes `isFilling` (the loop still intends to load pages). scanInProgress = isFilling OR a page fetch / hydrate batch in flight — so "scanning" turns off once enough matches are found and nothing is in flight, and reappears only while a (scroll-triggered) load is actually running.
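The corrected condition can be written out as a small predicate. The `ScanSignals` field names other than `isFilling` are assumptions made for this sketch.

```typescript
// Hypothetical signal bundle; only `isFilling` and `hasMore` are named in the commit message.
interface ScanSignals {
    hasMore: boolean; // the dataset still has unloaded pages
    isFilling: boolean; // the viewport-fill loop still intends to load pages
    pageFetchInFlight: boolean;
    hydrateInFlight: boolean;
}

// Note that `hasMore` is deliberately not part of the result.
function scanInProgress(signals: ScanSignals): boolean {
    return signals.isFilling || signals.pageFetchInFlight || signals.hydrateInFlight;
}
```

An idle loop with pages remaining (`hasMore: true`, everything else `false`) therefore reports no scan, which is the case the previous `hasMore` gate got wrong.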

The filter sat on its own strip below the run header, taking a second line. It now renders inline in the "Evaluations:" header row (Scenarios tab only) — one line for everything. The filter bar is now self-contained: given a runId it derives the run schema, column value types, and live scan status from atoms. The scenarios table publishes its scan status (match count + scanning) via scenarioFilterStatusAtomFamily; the header's filter bar reads it.

The filter is now a compact icon-only funnel button with a condition-count badge, placed in the header's right group just before "Compare" (was a text "Filters" button + inline match-count on the left). The match-count / scanning indicator moves into the popover header ("N matches · scanning…" next to the title), so the closed affordance stays minimal.

The /etl-poc test page proved the run-graph + molecule-cache strategy that Phase 1 + 2 shipped into the real scenarios table. Production has its own copies of the ported hooks, so the PoC is now dead test-page code.
- delete components/EtlPocScenarios/ and the /etl-poc routes (oss + ee)
- drop the /etl-poc branch from the Layout human-eval route check
- mark T4 + T7 done in the design doc; Phase 3 (T5/T6/T8) remains

The filter engine already supported in/nin; the popover now exposes them. A list operator shows a tag-style multi-value input (comma / Enter to add entries), numeric columns coerce entries to numbers, and a condition with an empty list counts as incomplete. Switching a condition between a scalar and a list operator resets its value to the right shape.
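Parsing and coercing the multi-value input could look like the sketch below. The helper names are hypothetical, and silently dropping non-numeric entries on numeric columns is an assumption of this sketch rather than documented behaviour.

```typescript
// Hypothetical parser for a comma-separated list value (used with in / nin).
function parseListEntries(raw: string, valueType: "string" | "number"): (string | number)[] {
    const entries = raw
        .split(",")
        .map((entry) => entry.trim())
        .filter((entry) => entry.length > 0);
    if (valueType === "string") return entries;
    // Numeric columns coerce entries to numbers; non-numeric entries are dropped (assumption).
    return entries.map((entry) => Number(entry)).filter((n) => !Number.isNaN(n));
}

// A list condition with no entries counts as incomplete and is not evaluated.
function isListConditionComplete(values: (string | number)[]): boolean {
    return values.length > 0;
}
```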

Comparison interleaves each matched base row with its compare-run counterparts joined by testcase_id (mergedRows). But with a filter active the viewport-fill loop only scans the base run, and a strict filter means the table may never scroll — so compare runs stayed on page 1 and the join missed most counterparts. While a filter is active, eagerly load every compare-run page so the testcase-id join can resolve every matched base row's counterpart.
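The testcase-id join, and why partially loaded compare runs lose counterparts, can be sketched as follows. The `ScenarioRow` shape and this `mergeRows` signature are assumptions for illustration; the real `mergedRows` derivation may differ.

```typescript
// Hypothetical row shape for the sketch.
interface ScenarioRow {
    runId: string;
    scenarioId: string;
    testcaseId: string;
}

function mergeRows(baseRows: ScenarioRow[], compareRuns: ScenarioRow[][]): ScenarioRow[] {
    // One testcaseId -> row index per compare run, built from the pages loaded so far.
    const indexes = compareRuns.map((rows) => {
        const byTestcase = new Map<string, ScenarioRow>();
        for (const row of rows) byTestcase.set(row.testcaseId, row);
        return byTestcase;
    });
    const merged: ScenarioRow[] = [];
    for (const base of baseRows) {
        merged.push(base);
        for (const index of indexes) {
            const counterpart = index.get(base.testcaseId);
            if (counterpart) merged.push(counterpart); // missing if its page is not loaded
        }
    }
    return merged;
}
```

If a matched base row's testcase sits on a compare-run page that has not been fetched, the lookup returns nothing and the row appears without its counterpart, which is the gap that eager loading of compare-run pages closes.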

While a run is still executing, periodically refetch the loaded scenario pages (to refresh row statuses) and evict + re-prefetch the results / metrics molecule caches of running or just-finished scenarios. Without this a scenario that completes after the table loaded keeps a stale empty cache and a stale "running" status, so its cells show the running indicator forever. The loop runs one final pass when the run reaches a terminal status, then stops.

The focus drawer and SingleScenarioViewer were traced through scenarioColumnValues.ts and SingleScenarioViewerPOC: both resolve their values via the old data path's independent atom families, so the T2+T3 cell swap does not regress them. Full ETL migration is deferred — useScenarioCellValue still backs the static invocation-metrics group kept in the table and the CSV export, so it cannot be deleted yet. Marks Phases 1-3 complete.

The focus drawer and SingleScenarioViewer resolve their values from the scenario-steps and evaluation-metric query atoms, which never polled — opening the drawer on a running scenario showed a frozen snapshot. Add a run-status-gated refetchInterval (5s while non-terminal, off once terminal) mirroring evaluationRunQueryAtomFamily. Since evaluation-metric also backs the table's static invocation-metric cells, those columns now refresh live during a run too.
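A run-status-gated polling interval can be sketched as a pure function. The status names and helper name below are hypothetical; only the 5s interval and the on/off rule come from the commit message.

```typescript
// Hypothetical status names; the real run-status values may differ.
const TERMINAL_STATUSES: ReadonlySet<string> = new Set(["success", "failure", "errors", "cancelled"]);
const LIVE_REFETCH_MS = 5000;

// Returns the polling interval in ms while the run is live, or false to stop polling.
function refetchIntervalFor(runStatus: string): number | false {
    return TERMINAL_STATUSES.has(runStatus) ? false : LIVE_REFETCH_MS;
}
```

A `number | false` result matches the shape that TanStack Query's `refetchInterval` option accepts, so a function like this could feed that option directly, assuming the query atoms are built on it.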


github-actions · 2026-05-25T09:48:36Z

Railway Preview Environment


Status	Destroyed (PR closed)

Updated at 2026-05-26T17:07:53.112Z


ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from be34a9e to 4cfdc24 Compare May 22, 2026 19:24

vercel Bot deployed to Preview May 22, 2026 19:25 View deployment

ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 4cfdc24 to 0300d73 Compare May 23, 2026 06:56

vercel Bot deployed to Preview May 23, 2026 06:57 View deployment

ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 0300d73 to 185fdc3 Compare May 23, 2026 12:11

vercel Bot deployed to Preview May 23, 2026 12:12 View deployment

ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 185fdc3 to b253edc Compare May 23, 2026 12:41

vercel Bot deployed to Preview May 23, 2026 12:42 View deployment

ardaerzin force-pushed the fe-experiment/etl-batch-add-traces branch from 086b175 to 1577fb1 Compare May 23, 2026 12:45

ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from b253edc to 899ef91 Compare May 23, 2026 12:46

vercel Bot deployed to Preview May 23, 2026 12:47 View deployment

ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 899ef91 to c9d6a3e Compare May 23, 2026 15:08

vercel Bot deployed to Preview May 23, 2026 15:09 View deployment

ardaerzin marked this pull request as ready for review May 24, 2026 08:18

dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. Feature Request New feature or request Frontend labels May 24, 2026

ardaerzin added 11 commits May 25, 2026 10:50

doc(eval): close T4 filter composition — multi-predicate AND/OR (D8)

9724541


ardaerzin added 14 commits May 25, 2026 10:50

fix(oss): widen scenario filter connector slot to fit And/Or

20a5274


ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from c9d6a3e to 8cb6bd1 Compare May 25, 2026 08:50

vercel Bot deployed to Preview May 25, 2026 08:51 View deployment

Merge branch 'fe-experiment/etl-batch-add-traces' into fe-experiment/…

54c80b5

…etl-eval-scenario-filtering

vercel Bot deployed to Preview May 25, 2026 09:40 View deployment

Merge branch 'fe-experiment/etl-batch-add-traces' into fe-experiment/…

0d36c9a

…etl-eval-scenario-filtering

vercel Bot deployed to Preview May 26, 2026 10:02 View deployment

ashrafchowdury approved these changes May 26, 2026


bekossy approved these changes May 26, 2026


dosubot Bot added the lgtm This PR has been approved by a maintainer label May 26, 2026

ardaerzin and others added 2 commits May 26, 2026 17:16

fix(frontend): prevent removing the last filter condition in Scenario…

3f86613

…s filter modal

Merge branch 'fe-experiment/etl-batch-add-traces' into fe-experiment/…

49a7324

…etl-eval-scenario-filtering

vercel Bot deployed to Preview May 26, 2026 16:37 View deployment

Merge branch 'fe-experiment/etl-batch-add-traces' into fe-experiment/…

d707f84

…etl-eval-scenario-filtering

vercel Bot deployed to Preview May 26, 2026 16:50 View deployment

bekossy merged commit be9ebb7 into fe-experiment/etl-batch-add-traces May 26, 2026
45 of 46 checks passed


bekossy merged 33 commits into fe-experiment/etl-batch-add-traces from fe-experiment/etl-eval-scenario-filtering


3 participants
