feat(eval-overview): Plan B-backend — POST /api/get_evaluation_overview aggregate endpoint by yilu331 · Pull Request #86 · ReflexioAI/reflexio

yilu331 · 2026-05-23T21:01:34Z

Plan B-backend ships the aggregate endpoint that the redesigned /evaluations page (Plan B-frontend) reads from. All hero/tile/attribution/distribution computation happens server-side; the frontend renders, never computes.

Summary

POST /api/get_evaluation_overview — single round-trip returning hero state, four context tiles with week-over-week deltas, top rule attribution, and corrections distribution.
Config.shadow_mode_enabled: bool = False — new toggle that drives the hero state machine (drives full vs early vs shadow_off).
Pure-function building blocks, individually unit-tested:
- hero_state.compute_hero_state(...) — 4-state machine per spec §3.2
- distribution.bucket_corrections(...) — 6-bin corrections histogram (0/1/2/3/4/5+)
- rule_attribution.compute_net_sessions(...) — joins citations + success outcomes, top-N by net sessions
EvaluationOverviewService composes the four pieces.
Two default no-op storage methods added on BaseStorage (count_sessions_with_shadow_content, get_interactions_by_session) so backends can override without breaking the endpoint.

Two adjustments from spec

Score distribution uses correction counts, not a continuous score. The spec §4.5 assumed AgentSuccessEvaluationResult had a 0.0–1.0 score field; it doesn't (only is_success exists). Replaced with corrections-per-session bins — still shows the "shape of quality" and shifts visibly as sessions improve.
Rule attribution reuses PlaybookApplicationStat. The spec §4.4 proposed a new rule_application table; the existing Interaction.citations JSON column + /api/get_playbook_application_stats already cover this. No new table.

Test plan

Unit tests for the hero state machine (5 cases across all four states + boundaries)
Unit tests for distribution bucketing (boundary + empty)
Unit tests for rule attribution (join + top-N + missing session)
Schema round-trip test for request/response models
Service integration test through MagicMock storage (2 cases: shadow-on + empty)
Endpoint smoke test through TestClient (returns empty state on fresh app)
19 tests pass; ruff + pyright clean

Companion PR

ReflexioAI/reflexio-enterprise#<TBD> bumps the submodule pointer here.

Out of scope (out-of-scope follow-ups)

Shadow content count returns 0 until shadow-mode publishing lands (the hero falls back to SHADOW_OFF/EARLY accordingly).
Per-backend get_interactions_by_session defaults to []; rule attribution panel is empty until backends implement real queries.
Plan B-frontend — separate plan and PR.

References

Spec: docs/superpowers/specs/2026-05-21-evaluations-redesign-and-braintrust-integration-design.md (§3.2, §3.3, §4.1, §4.2, §4.4, §4.5)
Plan: docs/superpowers/plans/2026-05-23-evaluations-plan-b-backend.md

Summary by CodeRabbit

New Features
- New evaluation overview dashboard displaying performance trends and evaluation readiness status.
- Added rule attribution analysis revealing which rules drive success and failures.
- Introduced score distribution visualization showing correction patterns.
- Added configurable shadow mode to enable parallel evaluation runs for comparison.

…apper

coderabbitai · 2026-05-23T21:01:46Z

📝 Walkthrough

Walkthrough

This PR introduces a new /api/get_evaluation_overview endpoint that aggregates evaluation metrics into a dashboard summary. It defines request/response schemas, adds configuration and storage hooks, implements hero-state and rule-attribution logic, orchestrates computation via a new service, and integrates the endpoint with comprehensive test coverage.

Changes

Evaluations Overview Endpoint

Layer / File(s)	Summary
API Schema and Data Contracts `reflexio/models/api_schema/eval_overview_schema.py`	Defines typed literals for hero states and bucket granularity. Introduces Pydantic models for `HeroBucket`, `HeroBlock`, `NumberWithDelta`, `PercentWithDelta`, `ContextTile`, `RuleAttributionRow`, `ScoreDistribution`, and the request/response envelopes `GetEvaluationOverviewRequest` and `GetEvaluationOverviewResponse`.
Configuration and Storage Contracts `reflexio/models/config_schema.py`, `reflexio/server/services/storage/storage_base/_extras.py`	Adds `shadow_mode_enabled` boolean flag to `Config` (defaults `False`). Adds two default storage methods: `count_sessions_with_shadow_content` (returns `0`) and `get_interactions_by_session` (returns empty list) for backends to override.
Pure Logic Components `reflexio/server/services/evaluation_overview/hero_state.py`, `reflexio/server/services/evaluation_overview/distribution.py`, `reflexio/server/services/evaluation_overview/rule_attribution.py`	Implements three independent pure-logic modules: `HeroState` enum and `compute_hero_state` with precedence rules for four hero states (empty/shadow_off/early/full); `bucket_corrections` function binning correction counts into a fixed 6-bin histogram; and `compute_net_sessions` computing rule attribution via citation-based net-sessions ranking.
Service Orchestration `reflexio/server/services/evaluation_overview/service.py`	Core `EvaluationOverviewService` that orchestrates aggregation: fetches evaluation results, filters by time window, computes hero state via `_build_hero`, context-tile deltas via `_build_tiles`, rule attribution via `_build_attribution` (loading citations and resolving titles), and score distributions via `_build_distribution`. Helper functions compute success rates, escalation rates, means, and weekly bucket aggregations.
API Integration `reflexio/server/api.py`, `reflexio/lib/_dashboard.py`	Introduces new POST endpoint `/api/get_evaluation_overview` accepting `GetEvaluationOverviewRequest` and returning `GetEvaluationOverviewResponse`. Adds `DashboardMixin.get_evaluation_overview` method that calls the service or returns an early `_empty_overview_response` when storage is unconfigured.
Test Coverage `tests/models/test_config_shadow_mode.py`, `tests/models/test_eval_overview_schema.py`, `tests/server/services/evaluation_overview/test_*.py`, `tests/server/api_endpoints/test_evaluation_overview_api.py`	Unit tests for config flag defaulting, schema round-tripping, and each pure-logic component (hero state, distribution, rule attribution). Service integration test with mocked storage verifies full response with data and empty-state case. Endpoint integration test validates response structure, allowed hero states, and fixed score-distribution labels.

Sequence Diagram(s)

sequenceDiagram
  participant Request as POST /api/get_evaluation_overview
  participant API as FastAPI Endpoint
  participant Service as EvaluationOverviewService
  participant Storage as Storage Backend
  participant Response as JSON Response
  
  Request->>API: GetEvaluationOverviewRequest
  API->>Service: run(request)
  Service->>Storage: get_agent_success_evals(org_id)
  Storage-->>Service: evaluation results
  Service->>Service: compute hero state<br/>(days_since_first_eval, shadow_enabled)
  Service->>Service: compute context tiles<br/>(success/escalation/corrections deltas)
  Service->>Storage: count_sessions_with_shadow_content
  Storage-->>Service: shadow session count
  Service->>Storage: get_interactions_by_session(all sessions)
  Storage-->>Service: interactions with citations
  Service->>Service: compute rule attribution<br/>(net_sessions ranking)
  Service->>Storage: get_playbook_stats(rule ids)
  Storage-->>Service: rule titles
  Service->>Service: build score distribution<br/>(bucket correction counts)
  Service-->>API: GetEvaluationOverviewResponse
  API-->>Response: 200 JSON

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

ReflexioAI/reflexio#39: Introduced and persisted the per-session Interaction.citations schema and storage support that this PR's rule-attribution logic directly consumes to rank rules by net sessions.

Poem

🐰 A dashboard overview takes flight,
With hero states shining so bright,
Rules rank by their net,
Score bins—no regret,
Evaluations gleam in plain sight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 69.05% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately describes the main change: adding a new POST /api/get_evaluation_overview aggregate endpoint with backend support, which is the primary objective of this changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/evaluations-plan-b-backend

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 8

🧹 Nitpick comments (1)

tests/server/services/evaluation_overview/test_rule_attribution.py (1)
76-76: ⚡ Quick win

Remove confusing no-op statement.

The line _ = RuleAttribution serves no functional purpose. RuleAttribution is already imported at the module level (line 4), and assigning it to _ within a single test function has no effect on linting or exports. The comment acknowledges it's a "no-op," confirming it should be removed.
🧹 Suggested cleanup
 def test_session_missing_from_success_map_is_skipped() -> None:
     """If a citation references a session we have no AgentSuccessEvaluationResult
     for, treat it as unknown and don't count it on either side."""
-    _ = RuleAttribution  # imported for export sanity; no-op
     citations_by_session = {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/evaluation_overview/test_rule_attribution.py` at line
76, Remove the redundant no-op assignment `_ = RuleAttribution` (and its inline
comment) from the test; `RuleAttribution` is already imported at module level so
delete the assignment statement in the test to avoid confusion and keep the
import as-is.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@reflexio/models/api_schema/eval_overview_schema.py`:
- Around line 91-94: The model currently enforces non-negative from_ts and to_ts
but does not prevent to_ts < from_ts; add a Pydantic validation to the request
model (e.g., a root_validator or a validator on "to_ts") that checks from_ts <=
to_ts and raises a ValueError with a clear message when the window is invalid;
reference the fields by name ("from_ts" and "to_ts") in the validator and ensure
the error is surfaced as a validation error on the model.
- Around line 75-77: The schema currently allows variable-length lists for
current_bins, baseline_bins and labels; change those fields to enforce exactly 6
elements by using pydantic's constrained list type (e.g., replace
list[int]/list[str] with conlist(int, min_items=6, max_items=6) and conlist(str,
min_items=6, max_items=6)) or by using Field(..., min_items=6, max_items=6);
update the imports to include pydantic.conlist or Field and apply this to the
current_bins, baseline_bins and labels fields so malformed-length payloads are
rejected at the schema boundary.

In `@reflexio/models/config_schema.py`:
- Around line 617-621: The new boolean field shadow_mode_enabled should be made
backward-compatible with persisted nulls by adding its name to the
_migrate_field_names None-stripping list; update the _migrate_field_names logic
(where other nullable fields are stripped) to include "shadow_mode_enabled" so
that incoming configs with "shadow_mode_enabled": null are converted to missing
and will fall back to the default False declared on the shadow_mode_enabled
field.

In `@reflexio/server/api.py`:
- Around line 1520-1544: The get_evaluation_overview endpoint is unthrottled and
needs rate limiting; add the same rate-limiting/dependency used by other read
endpoints to throttle expensive aggregates. Modify the get_evaluation_overview
route to require the existing rate limiter (e.g., add a
Depends(rate_limit_dependency) or the shared rate_limit decorator) so requests
to core_router.post("/api/get_evaluation_overview") will be limited before
calling get_reflexio(org_id=...) and reflexio.get_evaluation_overview(request).
Ensure you reference the same rate limiter function/class used elsewhere so the
endpoint inherits consistent limits and behavior.

In `@reflexio/server/services/evaluation_overview/distribution.py`:
- Around line 28-30: The loop over corrections allows negative c to index bins
(Python negative indices), corrupting the histogram; update the computation of
idx in the loop that uses _BUCKET_COUNT, bins and c (the for c in corrections
block) to clamp c into the valid range before indexing (e.g., ensure c is at
least 0 and at most _BUCKET_COUNT-1 using max/min) or skip negative values, then
increment bins[idx].

In `@reflexio/server/services/evaluation_overview/rule_attribution.py`:
- Around line 80-83: The current sort key only uses (-r.net_sessions,
-(r.successes_with + r.failures_with)) so ties are broken by unpredictable set
iteration; update the ordering to include a deterministic tie-breaker (e.g.,
append r.rule_id or r.rule_name as the last element of the sort key) so
rows.sort(...) becomes deterministic before slicing to rows[:top_n]; if the row
objects lack a stable id/name, use a stable string representation (e.g., str of
a unique tuple of identifying fields) as the final tiebreaker.

In `@reflexio/server/services/evaluation_overview/service.py`:
- Around line 72-73: The code computes days_since_first_eval from
window-filtered results causing misclassification; change the logic in
_build_hero and _build_tiles to derive days_since_first_eval from the full org
history instead of the windowed `results` (e.g., compute earliest evaluation
date using `org_history`/`all_results` or a provided `results_all` collection),
update the places that currently use the windowed `results` (references:
days_since_first_eval, _build_hero, _build_tiles) to use that earliest date, and
propagate that value to the existing branching logic used in the code blocks
around the current uses (the blocks covering lines 86-95 and 99-104) so
`early`/`empty` classification is based on org-wide history not the selected
time window.
- Around line 62-64: The code is hard-capping results with limit=10_000 in the
call to storage.get_agent_success_evaluation_results which can silently truncate
org-level aggregates; change the call in service.py (the all_results assignment)
to either query by a time window (pass explicit start/end timestamps) or
implement paginated reads (loop calling get_agent_success_evaluation_results
with an offset/next_token until exhausted) so you compute aggregates over the
complete result set rather than a fixed 10_000 limit; ensure any new params
(time bounds or pagination token) are plumbed through and documented in the
calling flow.

---

Nitpick comments:
In `@tests/server/services/evaluation_overview/test_rule_attribution.py`:
- Line 76: Remove the redundant no-op assignment `_ = RuleAttribution` (and its
inline comment) from the test; `RuleAttribution` is already imported at module
level so delete the assignment statement in the test to avoid confusion and keep
the import as-is.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b288b680-4fa4-45cd-a409-ed5a2ddcb6ca

📥 Commits

Reviewing files that changed from the base of the PR and between 25f8034 and 2abf014.

📒 Files selected for processing (17)

reflexio/lib/_dashboard.py
reflexio/models/api_schema/eval_overview_schema.py
reflexio/models/config_schema.py
reflexio/server/api.py
reflexio/server/services/evaluation_overview/__init__.py
reflexio/server/services/evaluation_overview/distribution.py
reflexio/server/services/evaluation_overview/hero_state.py
reflexio/server/services/evaluation_overview/rule_attribution.py
reflexio/server/services/evaluation_overview/service.py
reflexio/server/services/storage/storage_base/_extras.py
tests/models/test_config_shadow_mode.py
tests/models/test_eval_overview_schema.py
tests/server/api_endpoints/test_evaluation_overview_api.py
tests/server/services/evaluation_overview/test_distribution.py
tests/server/services/evaluation_overview/test_hero_state.py
tests/server/services/evaluation_overview/test_rule_attribution.py
tests/server/services/evaluation_overview/test_service_integration.py

coderabbitai · 2026-05-23T21:07:57Z

+    current_bins: list[int]
+    baseline_bins: list[int]
+    labels: list[str]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the fixed 6-bin histogram shape at the schema boundary.

Line 75–77 currently accept any list lengths, but this endpoint contract is a fixed 6-bin distribution (0/1/2/3/4/5+). Without shape validation, malformed payloads can pass model validation and break consumers.

Proposed fix

class ScoreDistribution(BaseModel): """Corrections-per-session histogram, current window + baseline.""" - current_bins: list[int] - baseline_bins: list[int] - labels: list[str] + current_bins: list[int] = Field(min_length=6, max_length=6) + baseline_bins: list[int] = Field(min_length=6, max_length=6) + labels: list[str] = Field(min_length=6, max_length=6)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 75 - 77, The schema currently allows variable-length lists for current_bins, baseline_bins and labels; change those fields to enforce exactly 6 elements by using pydantic's constrained list type (e.g., replace list[int]/list[str] with conlist(int, min_items=6, max_items=6) and conlist(str, min_items=6, max_items=6)) or by using Field(..., min_items=6, max_items=6); update the imports to include pydantic.conlist or Field and apply this to the current_bins, baseline_bins and labels fields so malformed-length payloads are rejected at the schema boundary.

coderabbitai · 2026-05-23T21:07:57Z

+    from_ts: int = Field(ge=0)
+    to_ts: int = Field(ge=0)
+    bucket: BucketLiteral = "week"
+    include_shadow: bool = True


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate timestamp window ordering in the request model.

Line 91–94 validates non-negative values, but not window order. A request with to_ts < from_ts should be rejected early to avoid nonsensical aggregation windows.

Proposed fix

-from pydantic import BaseModel, Field +from pydantic import BaseModel, Field, model_validator @@ class GetEvaluationOverviewRequest(BaseModel): @@ from_ts: int = Field(ge=0) to_ts: int = Field(ge=0) bucket: BucketLiteral = "week" include_shadow: bool = True + + `@model_validator`(mode="after") + def validate_window(self) -> "GetEvaluationOverviewRequest": + if self.to_ts < self.from_ts: + raise ValueError("to_ts must be >= from_ts") + return self

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from_ts: int = Field(ge=0)

to_ts: int = Field(ge=0)

bucket: BucketLiteral = "week"

include_shadow: bool = True

from_ts: int = Field(ge=0)

to_ts: int = Field(ge=0)

bucket: BucketLiteral = "week"

include_shadow: bool = True

`@model_validator`(mode="after")

def validate_window(self) -> "GetEvaluationOverviewRequest":

if self.to_ts < self.from_ts:

raise ValueError("to_ts must be >= from_ts")

return self

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 91 - 94, The model currently enforces non-negative from_ts and to_ts but does not prevent to_ts < from_ts; add a Pydantic validation to the request model (e.g., a root_validator or a validator on "to_ts") that checks from_ts <= to_ts and raises a ValueError with a clear message when the window is invalid; reference the fields by name ("from_ts" and "to_ts") in the validator and ensure the error is surfaced as a validation error on the model.

coderabbitai · 2026-05-23T21:07:57Z

+    # Whether this org has opted into shadow-mode runs. Drives /healthz/eval
+    # liveness derivation and the /api/get_evaluation_overview hero state
+    # machine. When True, each publish optionally schedules a parallel
+    # "without Reflexio" generation for side-by-side comparison.
+    shadow_mode_enabled: bool = False


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden backward compatibility for nullable persisted values of the new field.

After adding shadow_mode_enabled on Line 621, please also include it in the _migrate_field_names None-stripping list (Line 656+ pattern). Otherwise stored configs with "shadow_mode_enabled": null will fail validation instead of falling back to False.

Proposed fix

if isinstance(data, dict): for key in ( "window_size", "stride_size", "extraction_backend", "search_backend", "reflection_config", "playbook_optimizer_config", + "shadow_mode_enabled", ): if key in data and data[key] is None: del data[key]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/models/config_schema.py` around lines 617 - 621, The new boolean field shadow_mode_enabled should be made backward-compatible with persisted nulls by adding its name to the _migrate_field_names None-stripping list; update the _migrate_field_names logic (where other nullable fields are stripped) to include "shadow_mode_enabled" so that incoming configs with "shadow_mode_enabled": null are converted to missing and will fall back to the default False declared on the shadow_mode_enabled field.

coderabbitai · 2026-05-23T21:07:57Z

+@core_router.post(
+    "/api/get_evaluation_overview",
+    response_model=GetEvaluationOverviewResponse,
+    response_model_exclude_none=True,
+)
+def get_evaluation_overview(
+    request: GetEvaluationOverviewRequest,
+    org_id: str = Depends(default_get_org_id),
+) -> GetEvaluationOverviewResponse:
+    """Return the redesigned /evaluations page payload.
+
+    Aggregates hero state, four context tiles with deltas, top rule
+    attribution, and a corrections-per-session distribution into a single
+    response shaped exactly as the frontend renders it.
+
+    Args:
+        request (GetEvaluationOverviewRequest): Window + bucket granularity.
+        org_id (str): Organization ID resolved by the auth dependency.
+
+    Returns:
+        GetEvaluationOverviewResponse: Full overview payload.
+    """
+    reflexio = get_reflexio(org_id=org_id)
+    return reflexio.get_evaluation_overview(request)
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add rate limiting to this expensive aggregate endpoint.

This route is currently unthrottled, unlike other read endpoints, and it fans out into multiple data reads/aggregations.

Suggested fix

`@core_router.post`( "/api/get_evaluation_overview", response_model=GetEvaluationOverviewResponse, response_model_exclude_none=True, ) +@limiter.limit("120/minute") # Rate limit for read operations def get_evaluation_overview( - request: GetEvaluationOverviewRequest, + http_request: Request, # required by limiter + payload: GetEvaluationOverviewRequest, org_id: str = Depends(default_get_org_id), ) -> GetEvaluationOverviewResponse: @@ reflexio = get_reflexio(org_id=org_id) - return reflexio.get_evaluation_overview(request) + return reflexio.get_evaluation_overview(payload)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/api.py` around lines 1520 - 1544, The get_evaluation_overview endpoint is unthrottled and needs rate limiting; add the same rate-limiting/dependency used by other read endpoints to throttle expensive aggregates. Modify the get_evaluation_overview route to require the existing rate limiter (e.g., add a Depends(rate_limit_dependency) or the shared rate_limit decorator) so requests to core_router.post("/api/get_evaluation_overview") will be limited before calling get_reflexio(org_id=...) and reflexio.get_evaluation_overview(request). Ensure you reference the same rate limiter function/class used elsewhere so the endpoint inherits consistent limits and behavior.

coderabbitai · 2026-05-23T21:07:57Z

+    for c in corrections:
+        idx = min(_BUCKET_COUNT - 1, c)
+        bins[idx] += 1


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard negative correction counts before bin indexing.

A negative c currently increments the last bin because Python accepts negative list indexes, which silently corrupts the histogram.

Suggested fix

for c in corrections: - idx = min(_BUCKET_COUNT - 1, c) + idx = min(_BUCKET_COUNT - 1, max(0, c)) bins[idx] += 1

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for c in corrections:

idx = min(_BUCKET_COUNT - 1, c)

bins[idx] += 1

for c in corrections:

idx = min(_BUCKET_COUNT - 1, max(0, c))

bins[idx] += 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/evaluation_overview/distribution.py` around lines 28 - 30, The loop over corrections allows negative c to index bins (Python negative indices), corrupting the histogram; update the computation of idx in the loop that uses _BUCKET_COUNT, bins and c (the for c in corrections block) to clamp c into the valid range before indexing (e.g., ensure c is at least 0 and at most _BUCKET_COUNT-1 using max/min) or skip negative values, then increment bins[idx].

coderabbitai · 2026-05-23T21:07:57Z

+    rows.sort(
+        key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)),
+    )
+    return rows[:top_n]


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make tie ordering deterministic for stable top-N output.

When net_sessions and total fires tie, order depends on set iteration, so equal rows can shuffle between runs.

Suggested fix

rows.sort( - key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)), + key=lambda r: ( + -r.net_sessions, + -(r.successes_with + r.failures_with), + r.kind, + r.rule_id, + ), )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

rows.sort(

key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)),

)

return rows[:top_n]

rows.sort(

key=lambda r: (

-r.net_sessions,

-(r.successes_with + r.failures_with),

r.kind,

r.rule_id,

),

)

return rows[:top_n]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/evaluation_overview/rule_attribution.py` around lines 80 - 83, The current sort key only uses (-r.net_sessions, -(r.successes_with + r.failures_with)) so ties are broken by unpredictable set iteration; update the ordering to include a deterministic tie-breaker (e.g., append r.rule_id or r.rule_name as the last element of the sort key) so rows.sort(...) becomes deterministic before slicing to rows[:top_n]; if the row objects lack a stable id/name, use a stable string representation (e.g., str of a unique tuple of identifying fields) as the final tiebreaker.

coderabbitai · 2026-05-23T21:07:57Z

+        all_results = self.storage.get_agent_success_evaluation_results(  # type: ignore[attr-defined]
+            agent_version=None, limit=10_000
+        )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid hard-capping aggregate source data at 10,000 rows.

This silently truncates large org datasets and can skew hero state, tiles, attribution, and distribution outputs.

A safer approach is a storage query scoped by time window (or paginated reads until exhaustion) so aggregates are computed on complete data.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/evaluation_overview/service.py` around lines 62 - 64, The code is hard-capping results with limit=10_000 in the call to storage.get_agent_success_evaluation_results which can silently truncate org-level aggregates; change the call in service.py (the all_results assignment) to either query by a time window (pass explicit start/end timestamps) or implement paginated reads (loop calling get_agent_success_evaluation_results with an offset/next_token until exhausted) so you compute aggregates over the complete result set rather than a fixed 10_000 limit; ensure any new params (time bounds or pagination token) are plumbed through and documented in the calling flow.

coderabbitai · 2026-05-23T21:07:57Z

+        hero = self._build_hero(request, results)
+        tiles = self._build_tiles(results, results_prev)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute onboarding age from org history, not just the current window.

days_since_first_eval is derived from window-filtered results, so older orgs can be incorrectly classified as early/empty when the selected window starts recently.

Suggested fix

- hero = self._build_hero(request, results) + earliest_all = min((r.created_at for r in all_results), default=None) + now_ts = int(datetime.now(UTC).timestamp()) + days_since_first_eval = ( + None if earliest_all is None else (now_ts - earliest_all) // 86_400 + ) + hero = self._build_hero(request, results, days_since_first_eval) @@ def _build_hero( self, request: GetEvaluationOverviewRequest, results: list[AgentSuccessEvaluationResult], + days_since_first_eval: int | None, ) -> HeroBlock: - if not results: - days_since = None - else: - earliest = min(r.created_at for r in results) - days_since = (int(datetime.now(UTC).timestamp()) - earliest) // 86_400 n_shadow = self.storage.count_sessions_with_shadow_content( # type: ignore[attr-defined] request.from_ts, request.to_ts ) state = compute_hero_state( shadow_enabled=self.config.shadow_mode_enabled, - days_since_first_eval=days_since, + days_since_first_eval=days_since_first_eval, n_shadow_in_window=n_shadow, total_results=len(results), )

Also applies to: 86-95, 99-104

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/evaluation_overview/service.py` around lines 72 - 73, The code computes days_since_first_eval from window-filtered results causing misclassification; change the logic in _build_hero and _build_tiles to derive days_since_first_eval from the full org history instead of the windowed `results` (e.g., compute earliest evaluation date using `org_history`/`all_results` or a provided `results_all` collection), update the places that currently use the windowed `results` (references: days_since_first_eval, _build_hero, _build_tiles) to use that earliest date, and propagate that value to the existing branching logic used in the code blocks around the current uses (the blocks covering lines 86-95 and 99-104) so `early`/`empty` classification is based on org-wide history not the selected time window.

yilu331 · 2026-05-26T01:31:07Z

Superseded by #91 — consolidated evaluations + Braintrust PR. Branch remains on remote for history.

…valuation_overview Surfaces imported_score data alongside the existing tiles in the overview response. - Adds `BraintrustTileRow` (scorer_name, current, n, delta) to the response model, exposed as a new `braintrust_tiles` list (defaults to []). - Adds a default no-op `get_imported_scores(org_id, from_ts, to_ts)` on BaseStorage. Real per-backend implementations are a follow-up. - Adds `EvaluationOverviewService._build_braintrust_tiles` that aggregates scores by scorer_name (mean + count) and computes a vs-prior-window delta. When no prior data exists, delta == current — matches the existing tile "no baseline" convention so the frontend can detect it. - Updates the storage-unconfigured default response to include `braintrust_tiles: []`. 3 new tests (aggregator, empty case, deltas with + without baseline). 53 tests pass; ruff + pyright clean on touched files. Depends on PR #86 (Plan B-backend — eval overview endpoint) AND PR #87 (Plan C-backend — Braintrust connector foundation).

…hods Combines items 7 and 8 — same file (sqlite_storage/_extras.py), same test fixture pattern, one PR. **Plan B-backend (count_sessions_with_shadow_content, get_interactions_by_session):** - Both methods JOIN interactions ↔ requests since session_id lives on the request. count_sessions_with_shadow_content filters by COALESCE of shadow_content != ''; get_interactions_by_session orders by i.created_at. **Plan C-backend + Plan C-overview (4 connection methods + get_imported_scores):** - New tables `braintrust_connection` (PK org_id) and `imported_score` (UNIQUE org_id+source+source_run_id+scorer_name for idempotent re-sync; indexes on session_id and ts). - Upsert via ON CONFLICT DO UPDATE on the existing PK / UNIQUE. - API keys stored encrypted (the service encrypts before save). **Tests (15 pass, all integration-marked, tempfile-isolated):** - Shadow-count: zero/distinct-sessions/window filter - get_interactions_by_session: correct session/empty/unknown - BraintrustConnection: roundtrip/upsert/get-unknown/delete-idempotent - ImportedScore: window/idempotent re-sync/org-filter/empty-noop Lint + pyright clean. Supabase + Postgres backends are explicit follow-ups. Depends on: PR #86 (Plan B-backend), PR #87 (Plan C-backend), PR #88 (Plan C-overview). All three must merge before this rebases cleanly.

yilu331 added 7 commits May 23, 2026 20:42

feat(config): add shadow_mode_enabled toggle to Config

3486bdf

feat(eval-overview): hero state machine (4 states from spec)

b15a4b5

feat(eval-overview): corrections-per-session histogram buckets

75016c0

feat(eval-overview): rule net-sessions attribution

cfbb7c8

feat(eval-overview): request/response Pydantic models

70bbf9d

feat(eval-overview): aggregator service + storage stubs

f861ce5

feat(eval-overview): POST /api/get_evaluation_overview route + lib wr…

2abf014

…apper

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

yilu331 mentioned this pull request May 26, 2026

feat(evaluations): consolidated — Plan A pipeline fix + Plan B/C evaluation overview + Braintrust #91

Open

5 tasks

yilu331 closed this May 26, 2026

		hero = self._build_hero(request, results)
		tiles = self._build_tiles(results, results_prev)

Conversation

yilu331 commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Two adjustments from spec

Test plan

Companion PR

Out of scope (out-of-scope follow-ups)

References

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

yilu331 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yilu331 commented May 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 23, 2026 •

edited

Loading