
[FE Feat] Evaluation scenario filtering#4405

Merged
bekossy merged 33 commits into
fe-experiment/etl-batch-add-traces from
fe-experiment/etl-eval-scenario-filtering
May 26, 2026

Conversation

@ardaerzin
Contributor

@ardaerzin ardaerzin commented May 22, 2026

Summary

Implements filtering for evaluation run scenarios.

Testing

Verified locally

  • filtering works for evaluator metrics

QA follow-up

Filtering still needs testing with:

  • large runs
  • comparisons

Demo

loom

Checklist

  • I have included a video or screen recording for UI changes, or marked Demo as N/A
  • Relevant tests pass locally
  • Relevant linting and formatting pass locally
  • I have signed the CLA, or I will sign it when the bot prompts me


@vercel

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agenta-documentation | Ready | Preview, Comment | May 26, 2026 4:50pm |


@coderabbitai

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b2b7eb6-dab0-46c3-99fe-69b804d5f854


@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from be34a9e to 4cfdc24 Compare May 22, 2026 19:24
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 4cfdc24 to 0300d73 Compare May 23, 2026 06:56
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 0300d73 to 185fdc3 Compare May 23, 2026 12:11
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 185fdc3 to b253edc Compare May 23, 2026 12:41
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-batch-add-traces branch from 086b175 to 1577fb1 Compare May 23, 2026 12:45
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from b253edc to 899ef91 Compare May 23, 2026 12:46
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from 899ef91 to c9d6a3e Compare May 23, 2026 15:08
@ardaerzin ardaerzin marked this pull request as ready for review May 24, 2026 08:18
@dosubot dosubot Bot added the size:XXL (this PR changes 1000+ lines, ignoring generated files), Feature Request (new feature or request), and Frontend labels May 24, 2026
ardaerzin added 11 commits May 25, 2026 10:50
…wed)

Phased plan to productionize the EtlPocScenarios PoC into the real
EvalRunDetails scenarios table: thin store, schema columns, self-hydrating
cells, ETL filtering, comparison, live updates; retire the PoC.

Eng review: 5 outside-voice gaps folded in (non-terminal rendering unbuilt,
useEtlColumns drops "other" columns, perf premise unmeasured → perf gate,
T5 comparison is a build, CSV export path missed). Design review: focused on
interaction states + filter UX (5/10 → 9/10).
Reading evaluationPreviewTableStore.ts confirmed it is already a thin store:
PreviewTableRow carries only identity + testcaseId + status + scenarioIndex
+ comparison fields (no column data), and already does per-eval-type window
order. T1 ("promote a thin store") was lateral churn — re-implementing an
existing store. Phase 1 collapses to the coupled T2+T3 column+cell swap
against the existing store. Confirms the eng-review outside voice's finding.
Reading Table.tsx showed the CSV export path (exportResolveValue,
columnLookupMap, loadAllPagesBeforeExport) is keyed off columnResult column
ids, which differ from useEtlColumns keys. Phase 1 swaps display columns
only; usePreviewColumns/columnResult stay alive for export until Phase 3
(T5). The "other"-column un-drop ripples into ColumnLeaf, EtlResolvedCell,
and useCellMaterialization.
Phase 1 (T2+T3) of eval-scenarios-table-integration. The eval run
scenarios table now derives its schema columns from the run graph and
renders cells that self-hydrate from molecule caches, replacing the
backend-metadata + per-cell-fetch display path.

- groupRunColumns: pure mapping-grouping in @agenta/entities etl;
  keeps "other"-kind columns (the PoC dropped them)
- EvalRunDetails/etl/: useEtlColumns, useHydrateScenarios,
  run-aware useCellMaterialization, useScopeChangeEviction,
  EtlColumnHeader, EtlResolvedCell
- EtlResolvedCell renders pending/running/failed scenarios distinctly
  and distinguishes slice-not-hydrated from scenario-not-run
- Table.tsx swaps only the display columns; usePreviewColumns /
  columnResult stay alive for the CSV export path (Phase 3)
- column-parity regression test for groupRunColumns
Phase 2 filtering ships multi-condition AND/OR from day 1, not the PoC's
single predicate. Records the decision, generalises the planned predicate
type to a flat condition group, and marks T2+T3 as shipped.
…rSchema

Pre-stages the Phase 2 (T4) filtering core — pure logic, decoupled from
the D5 perf gate that gates T4 wiring.

- PredicateGroup (flat AND/OR, one nesting level) + RowFilter type;
  evaluateRowPredicate / evaluatePredicateGroup / evaluateRowFilter /
  matchesRowFilter row-level evaluators; makePredicateGroupFilter
  pipeline transform. makeRowPredicateFilter left intact.
- predicateToEntitySlices accepts a PredicateGroup — slice set is the
  union of every condition's slices (AND vs OR does not change the fetch).
- buildFilterSchema: derives filterable fields + type-matched operators
  from the run schema, with a resolveValueType refinement seam.
- 25 unit tests (predicateGroup, filterSchema).
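
For readers skimming the commit log, the flat condition group can be pictured roughly as follows. This is a minimal sketch, not the code in `@agenta/entities`: the names `PredicateGroup`, `evaluateRowPredicate`, and `evaluatePredicateGroup` come from the commit message, while the field names (`op`, `conditions`, `column`, `operator`, `value`), the operator set, and the handling of missing values are assumptions made for illustration.

```typescript
// Hypothetical shapes: only the type and function names appear in the commit message.
type Scalar = string | number | boolean;
type Operator = "eq" | "neq" | "gt" | "lt" | "in" | "nin";

interface RowPredicate {
    column: string;
    operator: Operator;
    value: Scalar | Scalar[];
}

// Flat group: one AND/OR op over a list of conditions, with no nested groups.
interface PredicateGroup {
    op: "and" | "or";
    conditions: RowPredicate[];
}

type Row = Record<string, Scalar | null | undefined>;

function evaluateRowPredicate(row: Row, p: RowPredicate): boolean {
    const cell = row[p.column];
    // Assumption: a missing value never matches, whatever the operator.
    if (cell === null || cell === undefined) return false;
    const value = p.value;
    switch (p.operator) {
        case "eq":
            return cell === value;
        case "neq":
            return cell !== value;
        case "gt":
            return typeof cell === "number" && typeof value === "number" && cell > value;
        case "lt":
            return typeof cell === "number" && typeof value === "number" && cell < value;
        case "in":
            return Array.isArray(value) && value.some((v) => v === cell);
        case "nin":
            return Array.isArray(value) && !value.some((v) => v === cell);
    }
}

function evaluatePredicateGroup(row: Row, group: PredicateGroup): boolean {
    // Assumption: an empty group filters nothing.
    if (group.conditions.length === 0) return true;
    const test = (p: RowPredicate): boolean => evaluateRowPredicate(row, p);
    return group.op === "and" ? group.conditions.every(test) : group.conditions.some(test);
}
```

Because the group is flat, switching between AND and OR only changes how the per-condition results are combined; the set of columns that must be fetched is the same either way, which matches the note on `predicateToEntitySlices` above.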
Two regressions from the Phase 1 ETL column swap — both because ETL
columns derive from the run's raw mappings, not the curated backend
column set:

- `testcase_dedup_id` (any `*_dedup_id` column) was rendered. These are
  internal dedup keys, not user-facing — `groupRunColumns` now drops
  them, matching the backend-metadata column path.
- the static invocation-metrics group (cost / duration / tokens)
  disappeared. `useEtlColumns` now skips metrics-kind groups and
  `Table.tsx` keeps the production metric group(s), rendered by the
  existing metric cell.
Wires Phase 2 filtering into the real scenarios table — multi-condition
AND/OR (decision D8), not the /etl-poc page.

- ScenarioFilterBar: a filter bar in the table header — column / operator
  / value per condition, AND/OR join, add / remove / clear. Columns and
  type-matched operators come from buildFilterSchema; a name heuristic
  refines value types for the common cases.
- scenarioFilterState: per-run PredicateGroup atom; half-built conditions
  are dropped from the evaluated filter.
- useScenarioFilter: filters the base rows via evaluateRowFilter against
  molecule-cache-resolved columns, with a viewport-fill loop so a strict
  filter still fills the viewport; unhydrated rows stay visible until
  known; the confirmed-match count gates the loop.
- Table.tsx: feeds filtered base rows into the merge, drives predicate-
  aware hydration, remounts the table on filter change, guards cell
  renders against the transient undefined record.
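
The viewport-fill loop in `useScenarioFilter` can be sketched as below. This is a simplified, synchronous illustration under stated assumptions: the real hook is reactive and loads pages asynchronously, and the names `fillViewport`, `loadNextPage`, and `matchTarget` are hypothetical. The idea it shows is the one described above: keep loading pages until enough confirmed matches exist to fill the viewport, or until the pages run out.

```typescript
interface FillResult<T> {
    matches: T[];
    pagesLoaded: number;
    exhausted: boolean; // true when the loop stopped because no pages were left
}

function fillViewport<T>(
    loadNextPage: () => T[] | null, // hypothetical loader; returns null when there are no more pages
    matches: (row: T) => boolean,
    matchTarget: number,
): FillResult<T> {
    const found: T[] = [];
    let pagesLoaded = 0;
    let exhausted = false;
    // The confirmed-match count gates the loop: stop as soon as the target is met.
    while (found.length < matchTarget) {
        const page = loadNextPage();
        if (page === null) {
            exhausted = true;
            break;
        }
        pagesLoaded += 1;
        for (const row of page) {
            if (matches(row)) found.push(row);
        }
    }
    return { matches: found, pagesLoaded, exhausted };
}
```

With a strict filter, each page contributes few matches, so the loop loads more pages than an unfiltered table would before the viewport is full.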
Only evaluator-output and metric columns are offered in the filter bar
for v1 — testset (input) and application (output) columns are withheld
behind a UI allowlist (FILTERABLE_COLUMN_KINDS). The filter engine and
buildFilterSchema still support every column kind, so enabling the rest
later is a one-line flip — no structural change.
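
A rough sketch of how such an allowlist might gate the filter bar is shown below. Only the constant name `FILTERABLE_COLUMN_KINDS` comes from the commit message; the kind names, the `FilterField` shape, and `filterableFields` are assumptions for illustration.

```typescript
// Assumed kind names; the real column kinds may be spelled differently.
type ColumnKind = "testset" | "application" | "evaluator" | "metric";

// UI allowlist: adding "testset" or "application" here would expose those columns too.
const FILTERABLE_COLUMN_KINDS: readonly ColumnKind[] = ["evaluator", "metric"];

interface FilterField {
    key: string;
    kind: ColumnKind;
}

// The schema still contains every field; only the UI list is narrowed.
function filterableFields(fields: FilterField[]): FilterField[] {
    return fields.filter((field) => FILTERABLE_COLUMN_KINDS.indexOf(field.kind) !== -1);
}
```

Keeping the restriction in the UI layer, rather than in the engine, is what makes enabling the remaining kinds a small change.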
The filter bar typed columns with a name heuristic, so a boolean
evaluator output (e.g. an LLM-judge field) was offered numeric
comparators and a numeric value input.

Column value types now come from the evaluator output schema: the
columnResult column `metricType` — itself read from each output
property's JSON-schema `type` by `extractMetrics` — maps to the filter
value type via `buildColumnValueTypeResolver`. A boolean column now gets
only equals / not-equals and a true/false input; a numeric one gets the
comparators. The name heuristic is removed.
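
The type-to-operator mapping described above might look roughly like this. It is a sketch only: `buildColumnValueTypeResolver` is the real entry point named in the commit, whereas `toFilterValueType`, `operatorsForColumn`, the operator identifiers, and the operator set offered for string columns are assumptions.

```typescript
type FilterValueType = "boolean" | "number" | "string";
type FilterOperator = "eq" | "neq" | "gt" | "gte" | "lt" | "lte" | "contains";

// Assumed mapping from a column's metricType (a JSON-schema type name) to a filter value type.
function toFilterValueType(metricType: string | undefined): FilterValueType {
    switch (metricType) {
        case "boolean":
            return "boolean";
        case "number":
        case "integer":
            return "number";
        default:
            return "string"; // unknown or missing types fall back to text
    }
}

const OPERATORS_BY_TYPE: Record<FilterValueType, FilterOperator[]> = {
    boolean: ["eq", "neq"], // booleans get only equals / not-equals
    number: ["eq", "neq", "gt", "gte", "lt", "lte"], // numbers also get the comparators
    string: ["eq", "neq", "contains"], // assumption: the commit does not list string operators
};

function operatorsForColumn(metricType: string | undefined): FilterOperator[] {
    return OPERATORS_BY_TYPE[toFilterValueType(metricType)];
}
```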
extractMetrics typed an output property only when `schema.type` was a
plain string, so a nullable field (`type: ["boolean", "null"]` or
`anyOf: [{type: "boolean"}, {type: "null"}]`) fell back to "string" —
which made the scenario filter bar offer a boolean field a text input
instead of true/false.

resolveSchemaType (new, dependency-free module) unwraps array and
anyOf/oneOf nullable encodings to the first non-"null" type.
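
The unwrapping rule can be illustrated with a short sketch. The function name `resolveSchemaType` is taken from the commit message; the `SchemaLike` shape and the exact traversal order are assumptions, and the shipped module may handle more cases.

```typescript
// Hypothetical minimal JSON-schema shape used only for this sketch.
interface SchemaLike {
    type?: string | string[];
    anyOf?: SchemaLike[];
    oneOf?: SchemaLike[];
}

function resolveSchemaType(schema: SchemaLike | undefined): string | undefined {
    if (!schema) return undefined;
    // Plain string type: nothing to unwrap.
    if (typeof schema.type === "string") return schema.type;
    // Array encoding, e.g. ["boolean", "null"]: take the first non-"null" entry.
    if (Array.isArray(schema.type)) {
        return schema.type.find((t) => t !== "null");
    }
    // anyOf / oneOf encoding: take the first branch that resolves to a non-"null" type.
    const branches = (schema.anyOf ?? []).concat(schema.oneOf ?? []);
    for (const branch of branches) {
        const resolved = resolveSchemaType(branch);
        if (resolved !== undefined && resolved !== "null") return resolved;
    }
    return undefined;
}
```

With this in place, a nullable boolean output resolves to `"boolean"` instead of falling back to `"string"`, which is what lets the filter bar show a true/false input.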
ardaerzin added 14 commits May 25, 2026 10:50
The filtered row list kept unhydrated rows "visible until known", so it
grew and shrank as hydration revealed non-matches — flickering the table
between rows and the empty state until a page finished materializing.

filteredBaseRows now holds confirmed matches only: a row appears once it
is hydrated AND matches, so the list only ever grows during a scan. Rows
materialize as their data arrives, with no show-then-drop. The full
empty + loading overlay shows only until the first match lands; after
that a "N matches · scanning…" indicator in the filter bar covers the
rest of the scan.
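
The "confirmed matches only" rule amounts to a two-part test per row, sketched below. The names `confirmedMatches`, `BaseRow`, and `HydratedValues` are hypothetical, and a plain record stands in for the molecule cache.

```typescript
interface BaseRow {
    id: string;
}

type HydratedValues = Record<string, string | number | boolean>;

// A row is listed only when it is hydrated AND matches; unhydrated rows are simply absent.
function confirmedMatches(
    rows: BaseRow[],
    hydrated: Record<string, HydratedValues | undefined>, // stand-in for the molecule cache
    matches: (values: HydratedValues) => boolean,
): BaseRow[] {
    return rows.filter((row) => {
        const values = hydrated[row.id];
        return values !== undefined && matches(values);
    });
}
```

Since hydration only ever adds entries during a scan, the result can gain rows as data arrives but never loses one, which is what removes the flicker.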
The always-visible inline filter conditions wrapped and grew the header
row, pushing the table down — worse with every condition added.

Following the observability `Filters` pattern, the conditions now live
in a popover behind a compact "Filters" button (with an active-condition
count badge); the strip itself is a single fixed-height row. Edits are
staged in a draft and committed on Apply — Cancel/outside-click discards
them — so the table is not re-scanned on every keystroke.
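
The staged-draft behaviour can be modelled as a small state transition table. This is an illustrative sketch with hypothetical names (`StagedFilter`, `openPopover`, `editDraft`, `applyDraft`, `cancelDraft`), not the component's actual state code.

```typescript
interface StagedFilter<T> {
    applied: T; // what the table is currently filtered by
    draft: T; // what the popover is currently showing
}

// Opening the popover starts the draft from the applied filter.
function openPopover<T>(applied: T): StagedFilter<T> {
    return { applied, draft: applied };
}

// Edits touch only the draft, so the table is not re-scanned per keystroke.
function editDraft<T>(state: StagedFilter<T>, draft: T): StagedFilter<T> {
    return { applied: state.applied, draft };
}

// Apply commits the draft.
function applyDraft<T>(state: StagedFilter<T>): StagedFilter<T> {
    return { applied: state.draft, draft: state.draft };
}

// Cancel or outside-click discards the draft.
function cancelDraft<T>(state: StagedFilter<T>): StagedFilter<T> {
    return { applied: state.applied, draft: state.applied };
}
```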
Replaces the top-level "Match All/Any" segmented control with a
row-level connector: the first row reads "Where", every later row shows
a borderless ghost-style And/Or select. The group has a single op (flat
group, D8), so the connectors stay in sync — toggling any one sets the
group op.
The "Where" label and the And/Or connector select sat directly in the
flex row with their own width classes, which resolved to different
widths — so the first row's Column select didn't line up with the rows
below. Both connectors now sit inside one shared fixed-width slot, so
every row's Column select starts at the same x.
The 64px connector slot truncated the default-size And/Or select to
"A…". Widened to 80px so the connector shows in full.
The scanning indicator gated on `hasMore`, so it stayed on forever once
the viewport-fill loop had stopped (it stops at the match target) — the
dataset still "has more" pages, but the loop is idle.

useScenarioFilter now exposes `isFilling` (the loop still intends to
load pages). scanInProgress = isFilling OR a page fetch / hydrate batch
in flight — so "scanning" turns off once enough matches are found and
nothing is in flight, and reappears only while a (scroll-triggered)
load is actually running.
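
Expressed as code, the corrected condition is roughly the following. `isFilling` is the flag named in the commit; the `ScanSignals` shape and the other field names are assumptions.

```typescript
interface ScanSignals {
    isFilling: boolean; // the viewport-fill loop still intends to load pages
    pageFetchInFlight: boolean;
    hydrateInFlight: boolean;
    hasMore: boolean; // the dataset has unloaded pages; deliberately NOT part of the condition
}

function scanInProgress(signals: ScanSignals): boolean {
    return signals.isFilling || signals.pageFetchInFlight || signals.hydrateInFlight;
}
```

The `hasMore` field is kept in the sketch only to show the earlier bug: with the loop idle and nothing in flight, `hasMore` can still be true while no scan is happening.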
The filter sat on its own strip below the run header, taking a second
line. It now renders inline in the "Evaluations:" header row (Scenarios
tab only) — one line for everything.

The filter bar is now self-contained: given a runId it derives the run
schema, column value types, and live scan status from atoms. The
scenarios table publishes its scan status (match count + scanning) via
scenarioFilterStatusAtomFamily; the header's filter bar reads it.
The filter is now a compact icon-only funnel button with a condition-
count badge, placed in the header's right group just before "Compare"
(was a text "Filters" button + inline match-count on the left).

The match-count / scanning indicator moves into the popover header
("N matches · scanning…" next to the title), so the closed affordance
stays minimal.
The /etl-poc test page proved the run-graph + molecule-cache strategy
that Phase 1 + 2 shipped into the real scenarios table. Production has
its own copies of the ported hooks, so the PoC is now dead test-page
code.

- delete components/EtlPocScenarios/ and the /etl-poc routes (oss + ee)
- drop the /etl-poc branch from the Layout human-eval route check
- mark T4 + T7 done in the design doc; Phase 3 (T5/T6/T8) remains
The filter engine already supported in/nin; the popover now exposes them.
A list operator shows a tag-style multi-value input (comma / Enter to add
entries), numeric columns coerce entries to numbers, and a condition with
an empty list counts as incomplete. Switching a condition between a
scalar and a list operator resets its value to the right shape.
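
The list-operator handling might be sketched as follows. All names here (`parseListEntries`, `isListOperator`, `isComplete`, `withOperator`, `Condition`) are hypothetical, and dropping entries that do not parse as numbers is an assumption; the commit only says numeric columns coerce entries to numbers.

```typescript
type ListValue = Array<string | number>;
type ConditionOperator = "eq" | "neq" | "in" | "nin";

interface Condition {
    operator: ConditionOperator;
    value: string | number | boolean | ListValue | undefined;
}

const isListOperator = (op: ConditionOperator): boolean => op === "in" || op === "nin";

// Split comma-separated input into entries; numeric columns coerce entries to numbers.
function parseListEntries(raw: string, valueType: "number" | "string"): ListValue {
    const entries = raw
        .split(",")
        .map((entry) => entry.trim())
        .filter((entry) => entry.length > 0);
    if (valueType === "string") return entries;
    // Assumption: entries that are not valid numbers are dropped.
    return entries.map((entry) => Number(entry)).filter((n) => !isNaN(n));
}

// A list condition with an empty list counts as incomplete.
function isComplete(condition: Condition): boolean {
    if (isListOperator(condition.operator)) {
        return Array.isArray(condition.value) && condition.value.length > 0;
    }
    return condition.value !== undefined && condition.value !== "" && !Array.isArray(condition.value);
}

// Switching between scalar and list operators resets the value to the right shape.
function withOperator(condition: Condition, operator: ConditionOperator): Condition {
    if (isListOperator(condition.operator) === isListOperator(operator)) {
        return { operator, value: condition.value };
    }
    return { operator, value: isListOperator(operator) ? [] : undefined };
}
```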
Comparison interleaves each matched base row with its compare-run
counterparts joined by testcase_id (mergedRows). But with a filter
active the viewport-fill loop only scans the base run, and a strict
filter means the table may never scroll — so compare runs stayed on
page 1 and the join missed most counterparts.

While a filter is active, eagerly load every compare-run page so the
testcase-id join can resolve every matched base row's counterpart.
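
A small sketch shows why the compare-run pages must all be loaded for the join to work. The `ScenarioRow` shape and the `mergeRows` function are hypothetical stand-ins for the real `mergedRows` logic; the join key `testcaseId` mirrors the `testcase_id` named above.

```typescript
interface ScenarioRow {
    runId: string;
    scenarioId: string;
    testcaseId: string;
}

// Interleave each base row with the compare-run rows that share its testcase id.
function mergeRows(baseRows: ScenarioRow[], compareRows: ScenarioRow[]): ScenarioRow[] {
    const byTestcase: Record<string, ScenarioRow[]> = {};
    for (const row of compareRows) {
        const bucket = byTestcase[row.testcaseId] ?? [];
        bucket.push(row);
        byTestcase[row.testcaseId] = bucket;
    }
    const merged: ScenarioRow[] = [];
    for (const base of baseRows) {
        merged.push(base);
        // A counterpart on a page that was never loaded is simply missing here.
        for (const counterpart of byTestcase[base.testcaseId] ?? []) {
            merged.push(counterpart);
        }
    }
    return merged;
}
```

If a matched base row's counterpart sits on a compare-run page that has not been fetched, the lookup comes back empty and the row renders alone, which is the symptom the eager load fixes.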
While a run is still executing, periodically refetch the loaded
scenario pages (to refresh row statuses) and evict + re-prefetch the
results / metrics molecule caches of running or just-finished
scenarios. Without this a scenario that completes after the table
loaded keeps a stale empty cache and a stale "running" status, so its
cells show the running indicator forever. The loop runs one final pass
when the run reaches a terminal status, then stops.
The focus drawer and SingleScenarioViewer were traced through
scenarioColumnValues.ts and SingleScenarioViewerPOC: both resolve
their values via the old data path's independent atom families, so
the T2+T3 cell swap does not regress them. Full ETL migration is
deferred — useScenarioCellValue still backs the static
invocation-metrics group kept in the table and the CSV export, so it
cannot be deleted yet. Marks Phases 1-3 complete.
The focus drawer and SingleScenarioViewer resolve their values from
the scenario-steps and evaluation-metric query atoms, which never
polled — opening the drawer on a running scenario showed a frozen
snapshot. Add a run-status-gated refetchInterval (5s while
non-terminal, off once terminal) mirroring evaluationRunQueryAtomFamily.

Since evaluation-metric also backs the table's static invocation-metric
cells, those columns now refresh live during a run too.
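
The status-gated interval can be sketched as a pure function. The 5s value comes from the commit message; the terminal status names, the treatment of an unknown status as non-terminal, and the function names are assumptions. The return type `number | false` is chosen on the assumption that the query atoms are backed by TanStack Query, whose `refetchInterval` option accepts that shape.

```typescript
// Assumed status names; the real run-status enum may differ.
const TERMINAL_RUN_STATUSES: readonly string[] = ["success", "failure", "errors", "cancelled"];
const POLL_INTERVAL_MS = 5000; // 5s while the run is non-terminal

function isTerminalStatus(status: string | undefined): boolean {
    return status !== undefined && TERMINAL_RUN_STATUSES.indexOf(status) !== -1;
}

// Poll while the run is still executing; turn polling off once it is terminal.
// Assumption: a status that has not loaded yet is treated as non-terminal.
function scenarioRefetchInterval(status: string | undefined): number | false {
    return isTerminalStatus(status) ? false : POLL_INTERVAL_MS;
}
```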
@ardaerzin ardaerzin force-pushed the fe-experiment/etl-eval-scenario-filtering branch from c9d6a3e to 8cb6bd1 Compare May 25, 2026 08:50
@github-actions
Contributor

github-actions Bot commented May 25, 2026

Railway Preview Environment

Status Destroyed (PR closed)

Updated at 2026-05-26T17:07:53.112Z

@dosubot dosubot Bot added the lgtm label (this PR has been approved by a maintainer) May 26, 2026
@bekossy bekossy merged commit be9ebb7 into fe-experiment/etl-batch-add-traces May 26, 2026
45 of 46 checks passed

Labels

  • Feature Request: New feature or request
  • Frontend
  • lgtm: This PR has been approved by a maintainer
  • size:XXL: This PR changes 1000+ lines, ignoring generated files.

3 participants