Skip to content

feat(eval-overview): Plan B-backend — POST /api/get_evaluation_overview aggregate endpoint#86

Closed
yilu331 wants to merge 7 commits into
mainfrom
feat/evaluations-plan-b-backend
Closed

feat(eval-overview): Plan B-backend — POST /api/get_evaluation_overview aggregate endpoint#86
yilu331 wants to merge 7 commits into
mainfrom
feat/evaluations-plan-b-backend

Conversation

@yilu331
Copy link
Copy Markdown
Collaborator

@yilu331 yilu331 commented May 23, 2026

Plan B-backend ships the aggregate endpoint that the redesigned /evaluations page (Plan B-frontend) reads from. All hero/tile/attribution/distribution computation happens server-side; the frontend renders, never computes.

Summary

  • POST /api/get_evaluation_overview — single round-trip returning hero state, four context tiles with week-over-week deltas, top rule attribution, and corrections distribution.
  • Config.shadow_mode_enabled: bool = False — new toggle that drives the hero state machine (drives full vs early vs shadow_off).
  • Pure-function building blocks, individually unit-tested:
    • hero_state.compute_hero_state(...) — 4-state machine per spec §3.2
    • distribution.bucket_corrections(...) — 6-bin corrections histogram (0/1/2/3/4/5+)
    • rule_attribution.compute_net_sessions(...) — joins citations + success outcomes, top-N by net sessions
  • EvaluationOverviewService composes the four pieces.
  • Two default no-op storage methods added on BaseStorage (count_sessions_with_shadow_content, get_interactions_by_session) so backends can override without breaking the endpoint.

Two adjustments from spec

  1. Score distribution uses correction counts, not a continuous score. The spec §4.5 assumed AgentSuccessEvaluationResult had a 0.0–1.0 score field; it doesn't (only is_success exists). Replaced with corrections-per-session bins — still shows the "shape of quality" and shifts visibly as sessions improve.
  2. Rule attribution reuses PlaybookApplicationStat. The spec §4.4 proposed a new rule_application table; the existing Interaction.citations JSON column + /api/get_playbook_application_stats already cover this. No new table.

Test plan

  • Unit tests for the hero state machine (5 cases across all four states + boundaries)
  • Unit tests for distribution bucketing (boundary + empty)
  • Unit tests for rule attribution (join + top-N + missing session)
  • Schema round-trip test for request/response models
  • Service integration test through MagicMock storage (2 cases: shadow-on + empty)
  • Endpoint smoke test through TestClient (returns empty state on fresh app)
  • 19 tests pass; ruff + pyright clean

Companion PR

ReflexioAI/reflexio-enterprise#<TBD> bumps the submodule pointer here.

Out of scope (out-of-scope follow-ups)

  • Shadow content count returns 0 until shadow-mode publishing lands (the hero falls back to SHADOW_OFF/EARLY accordingly).
  • Per-backend get_interactions_by_session defaults to []; rule attribution panel is empty until backends implement real queries.
  • Plan B-frontend — separate plan and PR.

References

  • Spec: docs/superpowers/specs/2026-05-21-evaluations-redesign-and-braintrust-integration-design.md (§3.2, §3.3, §4.1, §4.2, §4.4, §4.5)
  • Plan: docs/superpowers/plans/2026-05-23-evaluations-plan-b-backend.md

Summary by CodeRabbit

  • New Features
    • New evaluation overview dashboard displaying performance trends and evaluation readiness status.
    • Added rule attribution analysis revealing which rules drive success and failures.
    • Introduced score distribution visualization showing correction patterns.
    • Added configurable shadow mode to enable parallel evaluation runs for comparison.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 23, 2026

📝 Walkthrough

Walkthrough

This PR introduces a new /api/get_evaluation_overview endpoint that aggregates evaluation metrics into a dashboard summary. It defines request/response schemas, adds configuration and storage hooks, implements hero-state and rule-attribution logic, orchestrates computation via a new service, and integrates the endpoint with comprehensive test coverage.

Changes

Evaluations Overview Endpoint

Layer / File(s) Summary
API Schema and Data Contracts
reflexio/models/api_schema/eval_overview_schema.py
Defines typed literals for hero states and bucket granularity. Introduces Pydantic models for HeroBucket, HeroBlock, NumberWithDelta, PercentWithDelta, ContextTile, RuleAttributionRow, ScoreDistribution, and the request/response envelopes GetEvaluationOverviewRequest and GetEvaluationOverviewResponse.
Configuration and Storage Contracts
reflexio/models/config_schema.py, reflexio/server/services/storage/storage_base/_extras.py
Adds shadow_mode_enabled boolean flag to Config (defaults False). Adds two default storage methods: count_sessions_with_shadow_content (returns 0) and get_interactions_by_session (returns empty list) for backends to override.
Pure Logic Components
reflexio/server/services/evaluation_overview/hero_state.py, reflexio/server/services/evaluation_overview/distribution.py, reflexio/server/services/evaluation_overview/rule_attribution.py
Implements three independent pure-logic modules: HeroState enum and compute_hero_state with precedence rules for four hero states (empty/shadow_off/early/full); bucket_corrections function binning correction counts into a fixed 6-bin histogram; and compute_net_sessions computing rule attribution via citation-based net-sessions ranking.
Service Orchestration
reflexio/server/services/evaluation_overview/service.py
Core EvaluationOverviewService that orchestrates aggregation: fetches evaluation results, filters by time window, computes hero state via _build_hero, context-tile deltas via _build_tiles, rule attribution via _build_attribution (loading citations and resolving titles), and score distributions via _build_distribution. Helper functions compute success rates, escalation rates, means, and weekly bucket aggregations.
API Integration
reflexio/server/api.py, reflexio/lib/_dashboard.py
Introduces new POST endpoint /api/get_evaluation_overview accepting GetEvaluationOverviewRequest and returning GetEvaluationOverviewResponse. Adds DashboardMixin.get_evaluation_overview method that calls the service or returns an early _empty_overview_response when storage is unconfigured.
Test Coverage
tests/models/test_config_shadow_mode.py, tests/models/test_eval_overview_schema.py, tests/server/services/evaluation_overview/test_*.py, tests/server/api_endpoints/test_evaluation_overview_api.py
Unit tests for config flag defaulting, schema round-tripping, and each pure-logic component (hero state, distribution, rule attribution). Service integration test with mocked storage verifies full response with data and empty-state case. Endpoint integration test validates response structure, allowed hero states, and fixed score-distribution labels.

Sequence Diagram(s)

sequenceDiagram
  participant Request as POST /api/get_evaluation_overview
  participant API as FastAPI Endpoint
  participant Service as EvaluationOverviewService
  participant Storage as Storage Backend
  participant Response as JSON Response
  
  Request->>API: GetEvaluationOverviewRequest
  API->>Service: run(request)
  Service->>Storage: get_agent_success_evals(org_id)
  Storage-->>Service: evaluation results
  Service->>Service: compute hero state<br/>(days_since_first_eval, shadow_enabled)
  Service->>Service: compute context tiles<br/>(success/escalation/corrections deltas)
  Service->>Storage: count_sessions_with_shadow_content
  Storage-->>Service: shadow session count
  Service->>Storage: get_interactions_by_session(all sessions)
  Storage-->>Service: interactions with citations
  Service->>Service: compute rule attribution<br/>(net_sessions ranking)
  Service->>Storage: get_playbook_stats(rule ids)
  Storage-->>Service: rule titles
  Service->>Service: build score distribution<br/>(bucket correction counts)
  Service-->>API: GetEvaluationOverviewResponse
  API-->>Response: 200 JSON
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • ReflexioAI/reflexio#39: Introduced and persisted the per-session Interaction.citations schema and storage support that this PR's rule-attribution logic directly consumes to rank rules by net sessions.

Poem

🐰 A dashboard overview takes flight,
With hero states shining so bright,
Rules rank by their net,
Score bins—no regret,
Evaluations gleam in plain sight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately describes the main change: adding a new POST /api/get_evaluation_overview aggregate endpoint with backend support, which is the primary objective of this changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/evaluations-plan-b-backend

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
tests/server/services/evaluation_overview/test_rule_attribution.py (1)

76-76: ⚡ Quick win

Remove confusing no-op statement.

The line _ = RuleAttribution serves no functional purpose. RuleAttribution is already imported at the module level (line 4), and assigning it to _ within a single test function has no effect on linting or exports. The comment acknowledges it's a "no-op," confirming it should be removed.

🧹 Suggested cleanup
 def test_session_missing_from_success_map_is_skipped() -> None:
     """If a citation references a session we have no AgentSuccessEvaluationResult
     for, treat it as unknown and don't count it on either side."""
-    _ = RuleAttribution  # imported for export sanity; no-op
     citations_by_session = {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/evaluation_overview/test_rule_attribution.py` at line
76, Remove the redundant no-op assignment `_ = RuleAttribution` (and its inline
comment) from the test; `RuleAttribution` is already imported at module level so
delete the assignment statement in the test to avoid confusion and keep the
import as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@reflexio/models/api_schema/eval_overview_schema.py`:
- Around line 91-94: The model currently enforces non-negative from_ts and to_ts
but does not prevent to_ts < from_ts; add a Pydantic validation to the request
model (e.g., a root_validator or a validator on "to_ts") that checks from_ts <=
to_ts and raises a ValueError with a clear message when the window is invalid;
reference the fields by name ("from_ts" and "to_ts") in the validator and ensure
the error is surfaced as a validation error on the model.
- Around line 75-77: The schema currently allows variable-length lists for
current_bins, baseline_bins and labels; change those fields to enforce exactly 6
elements by using pydantic's constrained list type (e.g., replace
list[int]/list[str] with conlist(int, min_items=6, max_items=6) and conlist(str,
min_items=6, max_items=6)) or by using Field(..., min_items=6, max_items=6);
update the imports to include pydantic.conlist or Field and apply this to the
current_bins, baseline_bins and labels fields so malformed-length payloads are
rejected at the schema boundary.

In `@reflexio/models/config_schema.py`:
- Around line 617-621: The new boolean field shadow_mode_enabled should be made
backward-compatible with persisted nulls by adding its name to the
_migrate_field_names None-stripping list; update the _migrate_field_names logic
(where other nullable fields are stripped) to include "shadow_mode_enabled" so
that incoming configs with "shadow_mode_enabled": null are converted to missing
and will fall back to the default False declared on the shadow_mode_enabled
field.

In `@reflexio/server/api.py`:
- Around line 1520-1544: The get_evaluation_overview endpoint is unthrottled and
needs rate limiting; add the same rate-limiting/dependency used by other read
endpoints to throttle expensive aggregates. Modify the get_evaluation_overview
route to require the existing rate limiter (e.g., add a
Depends(rate_limit_dependency) or the shared rate_limit decorator) so requests
to core_router.post("/api/get_evaluation_overview") will be limited before
calling get_reflexio(org_id=...) and reflexio.get_evaluation_overview(request).
Ensure you reference the same rate limiter function/class used elsewhere so the
endpoint inherits consistent limits and behavior.

In `@reflexio/server/services/evaluation_overview/distribution.py`:
- Around line 28-30: The loop over corrections allows negative c to index bins
(Python negative indices), corrupting the histogram; update the computation of
idx in the loop that uses _BUCKET_COUNT, bins and c (the for c in corrections
block) to clamp c into the valid range before indexing (e.g., ensure c is at
least 0 and at most _BUCKET_COUNT-1 using max/min) or skip negative values, then
increment bins[idx].

In `@reflexio/server/services/evaluation_overview/rule_attribution.py`:
- Around line 80-83: The current sort key only uses (-r.net_sessions,
-(r.successes_with + r.failures_with)) so ties are broken by unpredictable set
iteration; update the ordering to include a deterministic tie-breaker (e.g.,
append r.rule_id or r.rule_name as the last element of the sort key) so
rows.sort(...) becomes deterministic before slicing to rows[:top_n]; if the row
objects lack a stable id/name, use a stable string representation (e.g., str of
a unique tuple of identifying fields) as the final tiebreaker.

In `@reflexio/server/services/evaluation_overview/service.py`:
- Around line 72-73: The code computes days_since_first_eval from
window-filtered results causing misclassification; change the logic in
_build_hero and _build_tiles to derive days_since_first_eval from the full org
history instead of the windowed `results` (e.g., compute earliest evaluation
date using `org_history`/`all_results` or a provided `results_all` collection),
update the places that currently use the windowed `results` (references:
days_since_first_eval, _build_hero, _build_tiles) to use that earliest date, and
propagate that value to the existing branching logic used in the code blocks
around the current uses (the blocks covering lines 86-95 and 99-104) so
`early`/`empty` classification is based on org-wide history not the selected
time window.
- Around line 62-64: The code is hard-capping results with limit=10_000 in the
call to storage.get_agent_success_evaluation_results which can silently truncate
org-level aggregates; change the call in service.py (the all_results assignment)
to either query by a time window (pass explicit start/end timestamps) or
implement paginated reads (loop calling get_agent_success_evaluation_results
with an offset/next_token until exhausted) so you compute aggregates over the
complete result set rather than a fixed 10_000 limit; ensure any new params
(time bounds or pagination token) are plumbed through and documented in the
calling flow.

---

Nitpick comments:
In `@tests/server/services/evaluation_overview/test_rule_attribution.py`:
- Line 76: Remove the redundant no-op assignment `_ = RuleAttribution` (and its
inline comment) from the test; `RuleAttribution` is already imported at module
level so delete the assignment statement in the test to avoid confusion and keep
the import as-is.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b288b680-4fa4-45cd-a409-ed5a2ddcb6ca

📥 Commits

Reviewing files that changed from the base of the PR and between 25f8034 and 2abf014.

📒 Files selected for processing (17)
  • reflexio/lib/_dashboard.py
  • reflexio/models/api_schema/eval_overview_schema.py
  • reflexio/models/config_schema.py
  • reflexio/server/api.py
  • reflexio/server/services/evaluation_overview/__init__.py
  • reflexio/server/services/evaluation_overview/distribution.py
  • reflexio/server/services/evaluation_overview/hero_state.py
  • reflexio/server/services/evaluation_overview/rule_attribution.py
  • reflexio/server/services/evaluation_overview/service.py
  • reflexio/server/services/storage/storage_base/_extras.py
  • tests/models/test_config_shadow_mode.py
  • tests/models/test_eval_overview_schema.py
  • tests/server/api_endpoints/test_evaluation_overview_api.py
  • tests/server/services/evaluation_overview/test_distribution.py
  • tests/server/services/evaluation_overview/test_hero_state.py
  • tests/server/services/evaluation_overview/test_rule_attribution.py
  • tests/server/services/evaluation_overview/test_service_integration.py

Comment on lines +75 to +77
current_bins: list[int]
baseline_bins: list[int]
labels: list[str]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the fixed 6-bin histogram shape at the schema boundary.

Line 75–77 currently accept any list lengths, but this endpoint contract is a fixed 6-bin distribution (0/1/2/3/4/5+). Without shape validation, malformed payloads can pass model validation and break consumers.

Proposed fix
 class ScoreDistribution(BaseModel):
     """Corrections-per-session histogram, current window + baseline."""

-    current_bins: list[int]
-    baseline_bins: list[int]
-    labels: list[str]
+    current_bins: list[int] = Field(min_length=6, max_length=6)
+    baseline_bins: list[int] = Field(min_length=6, max_length=6)
+    labels: list[str] = Field(min_length=6, max_length=6)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 75 - 77, The
schema currently allows variable-length lists for current_bins, baseline_bins
and labels; change those fields to enforce exactly 6 elements by using
pydantic's constrained list type (e.g., replace list[int]/list[str] with
conlist(int, min_items=6, max_items=6) and conlist(str, min_items=6,
max_items=6)) or by using Field(..., min_items=6, max_items=6); update the
imports to include pydantic.conlist or Field and apply this to the current_bins,
baseline_bins and labels fields so malformed-length payloads are rejected at the
schema boundary.

Comment on lines +91 to +94
from_ts: int = Field(ge=0)
to_ts: int = Field(ge=0)
bucket: BucketLiteral = "week"
include_shadow: bool = True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate timestamp window ordering in the request model.

Line 91–94 validates non-negative values, but not window order. A request with to_ts < from_ts should be rejected early to avoid nonsensical aggregation windows.

Proposed fix
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, model_validator
@@
 class GetEvaluationOverviewRequest(BaseModel):
@@
     from_ts: int = Field(ge=0)
     to_ts: int = Field(ge=0)
     bucket: BucketLiteral = "week"
     include_shadow: bool = True
+
+    `@model_validator`(mode="after")
+    def validate_window(self) -> "GetEvaluationOverviewRequest":
+        if self.to_ts < self.from_ts:
+            raise ValueError("to_ts must be >= from_ts")
+        return self
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from_ts: int = Field(ge=0)
to_ts: int = Field(ge=0)
bucket: BucketLiteral = "week"
include_shadow: bool = True
from_ts: int = Field(ge=0)
to_ts: int = Field(ge=0)
bucket: BucketLiteral = "week"
include_shadow: bool = True
`@model_validator`(mode="after")
def validate_window(self) -> "GetEvaluationOverviewRequest":
if self.to_ts < self.from_ts:
raise ValueError("to_ts must be >= from_ts")
return self
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 91 - 94, The
model currently enforces non-negative from_ts and to_ts but does not prevent
to_ts < from_ts; add a Pydantic validation to the request model (e.g., a
root_validator or a validator on "to_ts") that checks from_ts <= to_ts and
raises a ValueError with a clear message when the window is invalid; reference
the fields by name ("from_ts" and "to_ts") in the validator and ensure the error
is surfaced as a validation error on the model.

Comment on lines +617 to +621
# Whether this org has opted into shadow-mode runs. Drives /healthz/eval
# liveness derivation and the /api/get_evaluation_overview hero state
# machine. When True, each publish optionally schedules a parallel
# "without Reflexio" generation for side-by-side comparison.
shadow_mode_enabled: bool = False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden backward compatibility for nullable persisted values of the new field.

After adding shadow_mode_enabled on Line 621, please also include it in the _migrate_field_names None-stripping list (Line 656+ pattern). Otherwise stored configs with "shadow_mode_enabled": null will fail validation instead of falling back to False.

Proposed fix
         if isinstance(data, dict):
             for key in (
                 "window_size",
                 "stride_size",
                 "extraction_backend",
                 "search_backend",
                 "reflection_config",
                 "playbook_optimizer_config",
+                "shadow_mode_enabled",
             ):
                 if key in data and data[key] is None:
                     del data[key]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/models/config_schema.py` around lines 617 - 621, The new boolean
field shadow_mode_enabled should be made backward-compatible with persisted
nulls by adding its name to the _migrate_field_names None-stripping list; update
the _migrate_field_names logic (where other nullable fields are stripped) to
include "shadow_mode_enabled" so that incoming configs with
"shadow_mode_enabled": null are converted to missing and will fall back to the
default False declared on the shadow_mode_enabled field.

Comment thread reflexio/server/api.py
Comment on lines +1520 to +1544
@core_router.post(
"/api/get_evaluation_overview",
response_model=GetEvaluationOverviewResponse,
response_model_exclude_none=True,
)
def get_evaluation_overview(
request: GetEvaluationOverviewRequest,
org_id: str = Depends(default_get_org_id),
) -> GetEvaluationOverviewResponse:
"""Return the redesigned /evaluations page payload.

Aggregates hero state, four context tiles with deltas, top rule
attribution, and a corrections-per-session distribution into a single
response shaped exactly as the frontend renders it.

Args:
request (GetEvaluationOverviewRequest): Window + bucket granularity.
org_id (str): Organization ID resolved by the auth dependency.

Returns:
GetEvaluationOverviewResponse: Full overview payload.
"""
reflexio = get_reflexio(org_id=org_id)
return reflexio.get_evaluation_overview(request)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add rate limiting to this expensive aggregate endpoint.

This route is currently unthrottled, unlike other read endpoints, and it fans out into multiple data reads/aggregations.

Suggested fix
 `@core_router.post`(
     "/api/get_evaluation_overview",
     response_model=GetEvaluationOverviewResponse,
     response_model_exclude_none=True,
 )
+@limiter.limit("120/minute")  # Rate limit for read operations
 def get_evaluation_overview(
-    request: GetEvaluationOverviewRequest,
+    http_request: Request,  # required by limiter
+    payload: GetEvaluationOverviewRequest,
     org_id: str = Depends(default_get_org_id),
 ) -> GetEvaluationOverviewResponse:
@@
     reflexio = get_reflexio(org_id=org_id)
-    return reflexio.get_evaluation_overview(request)
+    return reflexio.get_evaluation_overview(payload)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/api.py` around lines 1520 - 1544, The get_evaluation_overview
endpoint is unthrottled and needs rate limiting; add the same
rate-limiting/dependency used by other read endpoints to throttle expensive
aggregates. Modify the get_evaluation_overview route to require the existing
rate limiter (e.g., add a Depends(rate_limit_dependency) or the shared
rate_limit decorator) so requests to
core_router.post("/api/get_evaluation_overview") will be limited before calling
get_reflexio(org_id=...) and reflexio.get_evaluation_overview(request). Ensure
you reference the same rate limiter function/class used elsewhere so the
endpoint inherits consistent limits and behavior.

Comment on lines +28 to +30
for c in corrections:
idx = min(_BUCKET_COUNT - 1, c)
bins[idx] += 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard negative correction counts before bin indexing.

A negative c currently increments the last bin because Python accepts negative list indexes, which silently corrupts the histogram.

Suggested fix
     for c in corrections:
-        idx = min(_BUCKET_COUNT - 1, c)
+        idx = min(_BUCKET_COUNT - 1, max(0, c))
         bins[idx] += 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for c in corrections:
idx = min(_BUCKET_COUNT - 1, c)
bins[idx] += 1
for c in corrections:
idx = min(_BUCKET_COUNT - 1, max(0, c))
bins[idx] += 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/evaluation_overview/distribution.py` around lines 28
- 30, The loop over corrections allows negative c to index bins (Python negative
indices), corrupting the histogram; update the computation of idx in the loop
that uses _BUCKET_COUNT, bins and c (the for c in corrections block) to clamp c
into the valid range before indexing (e.g., ensure c is at least 0 and at most
_BUCKET_COUNT-1 using max/min) or skip negative values, then increment
bins[idx].

Comment on lines +80 to +83
rows.sort(
key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)),
)
return rows[:top_n]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make tie ordering deterministic for stable top-N output.

When net_sessions and total fires tie, order depends on set iteration, so equal rows can shuffle between runs.

Suggested fix
     rows.sort(
-        key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)),
+        key=lambda r: (
+            -r.net_sessions,
+            -(r.successes_with + r.failures_with),
+            r.kind,
+            r.rule_id,
+        ),
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rows.sort(
key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)),
)
return rows[:top_n]
rows.sort(
key=lambda r: (
-r.net_sessions,
-(r.successes_with + r.failures_with),
r.kind,
r.rule_id,
),
)
return rows[:top_n]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/evaluation_overview/rule_attribution.py` around
lines 80 - 83, The current sort key only uses (-r.net_sessions,
-(r.successes_with + r.failures_with)) so ties are broken by unpredictable set
iteration; update the ordering to include a deterministic tie-breaker (e.g.,
append r.rule_id or r.rule_name as the last element of the sort key) so
rows.sort(...) becomes deterministic before slicing to rows[:top_n]; if the row
objects lack a stable id/name, use a stable string representation (e.g., str of
a unique tuple of identifying fields) as the final tiebreaker.

Comment on lines +62 to +64
all_results = self.storage.get_agent_success_evaluation_results( # type: ignore[attr-defined]
agent_version=None, limit=10_000
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid hard-capping aggregate source data at 10,000 rows.

This silently truncates large org datasets and can skew hero state, tiles, attribution, and distribution outputs.

A safer approach is a storage query scoped by time window (or paginated reads until exhaustion) so aggregates are computed on complete data.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/evaluation_overview/service.py` around lines 62 -
64, The code is hard-capping results with limit=10_000 in the call to
storage.get_agent_success_evaluation_results which can silently truncate
org-level aggregates; change the call in service.py (the all_results assignment)
to either query by a time window (pass explicit start/end timestamps) or
implement paginated reads (loop calling get_agent_success_evaluation_results
with an offset/next_token until exhausted) so you compute aggregates over the
complete result set rather than a fixed 10_000 limit; ensure any new params
(time bounds or pagination token) are plumbed through and documented in the
calling flow.

Comment on lines +72 to +73
hero = self._build_hero(request, results)
tiles = self._build_tiles(results, results_prev)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute onboarding age from org history, not just the current window.

days_since_first_eval is derived from window-filtered results, so older orgs can be incorrectly classified as early/empty when the selected window starts recently.

Suggested fix
-        hero = self._build_hero(request, results)
+        earliest_all = min((r.created_at for r in all_results), default=None)
+        now_ts = int(datetime.now(UTC).timestamp())
+        days_since_first_eval = (
+            None if earliest_all is None else (now_ts - earliest_all) // 86_400
+        )
+        hero = self._build_hero(request, results, days_since_first_eval)
@@
     def _build_hero(
         self,
         request: GetEvaluationOverviewRequest,
         results: list[AgentSuccessEvaluationResult],
+        days_since_first_eval: int | None,
     ) -> HeroBlock:
-        if not results:
-            days_since = None
-        else:
-            earliest = min(r.created_at for r in results)
-            days_since = (int(datetime.now(UTC).timestamp()) - earliest) // 86_400
         n_shadow = self.storage.count_sessions_with_shadow_content(  # type: ignore[attr-defined]
             request.from_ts, request.to_ts
         )
         state = compute_hero_state(
             shadow_enabled=self.config.shadow_mode_enabled,
-            days_since_first_eval=days_since,
+            days_since_first_eval=days_since_first_eval,
             n_shadow_in_window=n_shadow,
             total_results=len(results),
         )

Also applies to: 86-95, 99-104

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/evaluation_overview/service.py` around lines 72 -
73, The code computes days_since_first_eval from window-filtered results causing
misclassification; change the logic in _build_hero and _build_tiles to derive
days_since_first_eval from the full org history instead of the windowed
`results` (e.g., compute earliest evaluation date using
`org_history`/`all_results` or a provided `results_all` collection), update the
places that currently use the windowed `results` (references:
days_since_first_eval, _build_hero, _build_tiles) to use that earliest date, and
propagate that value to the existing branching logic used in the code blocks
around the current uses (the blocks covering lines 86-95 and 99-104) so
`early`/`empty` classification is based on org-wide history not the selected
time window.

@yilu331
Copy link
Copy Markdown
Collaborator Author

yilu331 commented May 26, 2026

Superseded by #91 — consolidated evaluations + Braintrust PR. Branch remains on remote for history.

@yilu331 yilu331 closed this May 26, 2026
yilu331 added a commit that referenced this pull request May 27, 2026
…valuation_overview

Surfaces imported_score data alongside the existing tiles in the
overview response.

- Adds `BraintrustTileRow` (scorer_name, current, n, delta) to the
  response model, exposed as a new `braintrust_tiles` list (defaults to
  []).
- Adds a default no-op `get_imported_scores(org_id, from_ts, to_ts)`
  on BaseStorage. Real per-backend implementations are a follow-up.
- Adds `EvaluationOverviewService._build_braintrust_tiles` that aggregates
  scores by scorer_name (mean + count) and computes a vs-prior-window
  delta. When no prior data exists, delta == current — matches the
  existing tile "no baseline" convention so the frontend can detect it.
- Updates the storage-unconfigured default response to include
  `braintrust_tiles: []`.

3 new tests (aggregator, empty case, deltas with + without baseline).
53 tests pass; ruff + pyright clean on touched files.

Depends on PR #86 (Plan B-backend — eval overview endpoint) AND
PR #87 (Plan C-backend — Braintrust connector foundation).
yilu331 added a commit that referenced this pull request May 27, 2026
…hods

Combines items 7 and 8 — same file (sqlite_storage/_extras.py), same
test fixture pattern, one PR.

**Plan B-backend (count_sessions_with_shadow_content, get_interactions_by_session):**

- Both methods JOIN interactions ↔ requests since session_id lives on
  the request. count_sessions_with_shadow_content filters by COALESCE
  of shadow_content != ''; get_interactions_by_session orders by
  i.created_at.

**Plan C-backend + Plan C-overview (4 connection methods + get_imported_scores):**

- New tables `braintrust_connection` (PK org_id) and `imported_score`
  (UNIQUE org_id+source+source_run_id+scorer_name for idempotent
  re-sync; indexes on session_id and ts).
- Upsert via ON CONFLICT DO UPDATE on the existing PK / UNIQUE.
- API keys stored encrypted (the service encrypts before save).

**Tests (15 pass, all integration-marked, tempfile-isolated):**

- Shadow-count: zero/distinct-sessions/window filter
- get_interactions_by_session: correct session/empty/unknown
- BraintrustConnection: roundtrip/upsert/get-unknown/delete-idempotent
- ImportedScore: window/idempotent re-sync/org-filter/empty-noop

Lint + pyright clean.

Supabase + Postgres backends are explicit follow-ups.

Depends on: PR #86 (Plan B-backend), PR #87 (Plan C-backend), PR #88
(Plan C-overview). All three must merge before this rebases cleanly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant