feat(eval-overview): Plan B-backend — POST /api/get_evaluation_overview aggregate endpoint#86
feat(eval-overview): Plan B-backend — POST /api/get_evaluation_overview aggregate endpoint#86yilu331 wants to merge 7 commits into
Conversation
📝 WalkthroughWalkthroughThis PR introduces a new ChangesEvaluations Overview Endpoint
Sequence Diagram(s)sequenceDiagram
participant Request as POST /api/get_evaluation_overview
participant API as FastAPI Endpoint
participant Service as EvaluationOverviewService
participant Storage as Storage Backend
participant Response as JSON Response
Request->>API: GetEvaluationOverviewRequest
API->>Service: run(request)
Service->>Storage: get_agent_success_evals(org_id)
Storage-->>Service: evaluation results
Service->>Service: compute hero state<br/>(days_since_first_eval, shadow_enabled)
Service->>Service: compute context tiles<br/>(success/escalation/corrections deltas)
Service->>Storage: count_sessions_with_shadow_content
Storage-->>Service: shadow session count
Service->>Storage: get_interactions_by_session(all sessions)
Storage-->>Service: interactions with citations
Service->>Service: compute rule attribution<br/>(net_sessions ranking)
Service->>Storage: get_playbook_stats(rule ids)
Storage-->>Service: rule titles
Service->>Service: build score distribution<br/>(bucket correction counts)
Service-->>API: GetEvaluationOverviewResponse
API-->>Response: 200 JSON
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (1)
tests/server/services/evaluation_overview/test_rule_attribution.py (1)
76-76: ⚡ Quick winRemove confusing no-op statement.
The line
_ = RuleAttributionserves no functional purpose.RuleAttributionis already imported at the module level (line 4), and assigning it to_within a single test function has no effect on linting or exports. The comment acknowledges it's a "no-op," confirming it should be removed.🧹 Suggested cleanup
def test_session_missing_from_success_map_is_skipped() -> None: """If a citation references a session we have no AgentSuccessEvaluationResult for, treat it as unknown and don't count it on either side.""" - _ = RuleAttribution # imported for export sanity; no-op citations_by_session = {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/server/services/evaluation_overview/test_rule_attribution.py` at line 76, Remove the redundant no-op assignment `_ = RuleAttribution` (and its inline comment) from the test; `RuleAttribution` is already imported at module level so delete the assignment statement in the test to avoid confusion and keep the import as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@reflexio/models/api_schema/eval_overview_schema.py`:
- Around line 91-94: The model currently enforces non-negative from_ts and to_ts
but does not prevent to_ts < from_ts; add a Pydantic validation to the request
model (e.g., a root_validator or a validator on "to_ts") that checks from_ts <=
to_ts and raises a ValueError with a clear message when the window is invalid;
reference the fields by name ("from_ts" and "to_ts") in the validator and ensure
the error is surfaced as a validation error on the model.
- Around line 75-77: The schema currently allows variable-length lists for
current_bins, baseline_bins and labels; change those fields to enforce exactly 6
elements by using pydantic's constrained list type (e.g., replace
list[int]/list[str] with conlist(int, min_items=6, max_items=6) and conlist(str,
min_items=6, max_items=6)) or by using Field(..., min_items=6, max_items=6);
update the imports to include pydantic.conlist or Field and apply this to the
current_bins, baseline_bins and labels fields so malformed-length payloads are
rejected at the schema boundary.
In `@reflexio/models/config_schema.py`:
- Around line 617-621: The new boolean field shadow_mode_enabled should be made
backward-compatible with persisted nulls by adding its name to the
_migrate_field_names None-stripping list; update the _migrate_field_names logic
(where other nullable fields are stripped) to include "shadow_mode_enabled" so
that incoming configs with "shadow_mode_enabled": null are converted to missing
and will fall back to the default False declared on the shadow_mode_enabled
field.
In `@reflexio/server/api.py`:
- Around line 1520-1544: The get_evaluation_overview endpoint is unthrottled and
needs rate limiting; add the same rate-limiting/dependency used by other read
endpoints to throttle expensive aggregates. Modify the get_evaluation_overview
route to require the existing rate limiter (e.g., add a
Depends(rate_limit_dependency) or the shared rate_limit decorator) so requests
to core_router.post("/api/get_evaluation_overview") will be limited before
calling get_reflexio(org_id=...) and reflexio.get_evaluation_overview(request).
Ensure you reference the same rate limiter function/class used elsewhere so the
endpoint inherits consistent limits and behavior.
In `@reflexio/server/services/evaluation_overview/distribution.py`:
- Around line 28-30: The loop over corrections allows negative c to index bins
(Python negative indices), corrupting the histogram; update the computation of
idx in the loop that uses _BUCKET_COUNT, bins and c (the for c in corrections
block) to clamp c into the valid range before indexing (e.g., ensure c is at
least 0 and at most _BUCKET_COUNT-1 using max/min) or skip negative values, then
increment bins[idx].
In `@reflexio/server/services/evaluation_overview/rule_attribution.py`:
- Around line 80-83: The current sort key only uses (-r.net_sessions,
-(r.successes_with + r.failures_with)) so ties are broken by unpredictable set
iteration; update the ordering to include a deterministic tie-breaker (e.g.,
append r.rule_id or r.rule_name as the last element of the sort key) so
rows.sort(...) becomes deterministic before slicing to rows[:top_n]; if the row
objects lack a stable id/name, use a stable string representation (e.g., str of
a unique tuple of identifying fields) as the final tiebreaker.
In `@reflexio/server/services/evaluation_overview/service.py`:
- Around line 72-73: The code computes days_since_first_eval from
window-filtered results causing misclassification; change the logic in
_build_hero and _build_tiles to derive days_since_first_eval from the full org
history instead of the windowed `results` (e.g., compute earliest evaluation
date using `org_history`/`all_results` or a provided `results_all` collection),
update the places that currently use the windowed `results` (references:
days_since_first_eval, _build_hero, _build_tiles) to use that earliest date, and
propagate that value to the existing branching logic used in the code blocks
around the current uses (the blocks covering lines 86-95 and 99-104) so
`early`/`empty` classification is based on org-wide history not the selected
time window.
- Around line 62-64: The code is hard-capping results with limit=10_000 in the
call to storage.get_agent_success_evaluation_results which can silently truncate
org-level aggregates; change the call in service.py (the all_results assignment)
to either query by a time window (pass explicit start/end timestamps) or
implement paginated reads (loop calling get_agent_success_evaluation_results
with an offset/next_token until exhausted) so you compute aggregates over the
complete result set rather than a fixed 10_000 limit; ensure any new params
(time bounds or pagination token) are plumbed through and documented in the
calling flow.
---
Nitpick comments:
In `@tests/server/services/evaluation_overview/test_rule_attribution.py`:
- Line 76: Remove the redundant no-op assignment `_ = RuleAttribution` (and its
inline comment) from the test; `RuleAttribution` is already imported at module
level so delete the assignment statement in the test to avoid confusion and keep
the import as-is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b288b680-4fa4-45cd-a409-ed5a2ddcb6ca
📒 Files selected for processing (17)
reflexio/lib/_dashboard.pyreflexio/models/api_schema/eval_overview_schema.pyreflexio/models/config_schema.pyreflexio/server/api.pyreflexio/server/services/evaluation_overview/__init__.pyreflexio/server/services/evaluation_overview/distribution.pyreflexio/server/services/evaluation_overview/hero_state.pyreflexio/server/services/evaluation_overview/rule_attribution.pyreflexio/server/services/evaluation_overview/service.pyreflexio/server/services/storage/storage_base/_extras.pytests/models/test_config_shadow_mode.pytests/models/test_eval_overview_schema.pytests/server/api_endpoints/test_evaluation_overview_api.pytests/server/services/evaluation_overview/test_distribution.pytests/server/services/evaluation_overview/test_hero_state.pytests/server/services/evaluation_overview/test_rule_attribution.pytests/server/services/evaluation_overview/test_service_integration.py
| current_bins: list[int] | ||
| baseline_bins: list[int] | ||
| labels: list[str] |
There was a problem hiding this comment.
Enforce the fixed 6-bin histogram shape at the schema boundary.
Line 75–77 currently accept any list lengths, but this endpoint contract is a fixed 6-bin distribution (0/1/2/3/4/5+). Without shape validation, malformed payloads can pass model validation and break consumers.
Proposed fix
class ScoreDistribution(BaseModel):
"""Corrections-per-session histogram, current window + baseline."""
- current_bins: list[int]
- baseline_bins: list[int]
- labels: list[str]
+ current_bins: list[int] = Field(min_length=6, max_length=6)
+ baseline_bins: list[int] = Field(min_length=6, max_length=6)
+ labels: list[str] = Field(min_length=6, max_length=6)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 75 - 77, The
schema currently allows variable-length lists for current_bins, baseline_bins
and labels; change those fields to enforce exactly 6 elements by using
pydantic's constrained list type (e.g., replace list[int]/list[str] with
conlist(int, min_items=6, max_items=6) and conlist(str, min_items=6,
max_items=6)) or by using Field(..., min_items=6, max_items=6); update the
imports to include pydantic.conlist or Field and apply this to the current_bins,
baseline_bins and labels fields so malformed-length payloads are rejected at the
schema boundary.
| from_ts: int = Field(ge=0) | ||
| to_ts: int = Field(ge=0) | ||
| bucket: BucketLiteral = "week" | ||
| include_shadow: bool = True |
There was a problem hiding this comment.
Validate timestamp window ordering in the request model.
Line 91–94 validates non-negative values, but not window order. A request with to_ts < from_ts should be rejected early to avoid nonsensical aggregation windows.
Proposed fix
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, model_validator
@@
class GetEvaluationOverviewRequest(BaseModel):
@@
from_ts: int = Field(ge=0)
to_ts: int = Field(ge=0)
bucket: BucketLiteral = "week"
include_shadow: bool = True
+
+ `@model_validator`(mode="after")
+ def validate_window(self) -> "GetEvaluationOverviewRequest":
+ if self.to_ts < self.from_ts:
+ raise ValueError("to_ts must be >= from_ts")
+ return self📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from_ts: int = Field(ge=0) | |
| to_ts: int = Field(ge=0) | |
| bucket: BucketLiteral = "week" | |
| include_shadow: bool = True | |
| from_ts: int = Field(ge=0) | |
| to_ts: int = Field(ge=0) | |
| bucket: BucketLiteral = "week" | |
| include_shadow: bool = True | |
| `@model_validator`(mode="after") | |
| def validate_window(self) -> "GetEvaluationOverviewRequest": | |
| if self.to_ts < self.from_ts: | |
| raise ValueError("to_ts must be >= from_ts") | |
| return self |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 91 - 94, The
model currently enforces non-negative from_ts and to_ts but does not prevent
to_ts < from_ts; add a Pydantic validation to the request model (e.g., a
root_validator or a validator on "to_ts") that checks from_ts <= to_ts and
raises a ValueError with a clear message when the window is invalid; reference
the fields by name ("from_ts" and "to_ts") in the validator and ensure the error
is surfaced as a validation error on the model.
| # Whether this org has opted into shadow-mode runs. Drives /healthz/eval | ||
| # liveness derivation and the /api/get_evaluation_overview hero state | ||
| # machine. When True, each publish optionally schedules a parallel | ||
| # "without Reflexio" generation for side-by-side comparison. | ||
| shadow_mode_enabled: bool = False |
There was a problem hiding this comment.
Harden backward compatibility for nullable persisted values of the new field.
After adding shadow_mode_enabled on Line 621, please also include it in the _migrate_field_names None-stripping list (Line 656+ pattern). Otherwise stored configs with "shadow_mode_enabled": null will fail validation instead of falling back to False.
Proposed fix
if isinstance(data, dict):
for key in (
"window_size",
"stride_size",
"extraction_backend",
"search_backend",
"reflection_config",
"playbook_optimizer_config",
+ "shadow_mode_enabled",
):
if key in data and data[key] is None:
del data[key]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/models/config_schema.py` around lines 617 - 621, The new boolean
field shadow_mode_enabled should be made backward-compatible with persisted
nulls by adding its name to the _migrate_field_names None-stripping list; update
the _migrate_field_names logic (where other nullable fields are stripped) to
include "shadow_mode_enabled" so that incoming configs with
"shadow_mode_enabled": null are converted to missing and will fall back to the
default False declared on the shadow_mode_enabled field.
| @core_router.post( | ||
| "/api/get_evaluation_overview", | ||
| response_model=GetEvaluationOverviewResponse, | ||
| response_model_exclude_none=True, | ||
| ) | ||
| def get_evaluation_overview( | ||
| request: GetEvaluationOverviewRequest, | ||
| org_id: str = Depends(default_get_org_id), | ||
| ) -> GetEvaluationOverviewResponse: | ||
| """Return the redesigned /evaluations page payload. | ||
|
|
||
| Aggregates hero state, four context tiles with deltas, top rule | ||
| attribution, and a corrections-per-session distribution into a single | ||
| response shaped exactly as the frontend renders it. | ||
|
|
||
| Args: | ||
| request (GetEvaluationOverviewRequest): Window + bucket granularity. | ||
| org_id (str): Organization ID resolved by the auth dependency. | ||
|
|
||
| Returns: | ||
| GetEvaluationOverviewResponse: Full overview payload. | ||
| """ | ||
| reflexio = get_reflexio(org_id=org_id) | ||
| return reflexio.get_evaluation_overview(request) | ||
|
|
There was a problem hiding this comment.
Add rate limiting to this expensive aggregate endpoint.
This route is currently unthrottled, unlike other read endpoints, and it fans out into multiple data reads/aggregations.
Suggested fix
`@core_router.post`(
"/api/get_evaluation_overview",
response_model=GetEvaluationOverviewResponse,
response_model_exclude_none=True,
)
+@limiter.limit("120/minute") # Rate limit for read operations
def get_evaluation_overview(
- request: GetEvaluationOverviewRequest,
+ http_request: Request, # required by limiter
+ payload: GetEvaluationOverviewRequest,
org_id: str = Depends(default_get_org_id),
) -> GetEvaluationOverviewResponse:
@@
reflexio = get_reflexio(org_id=org_id)
- return reflexio.get_evaluation_overview(request)
+ return reflexio.get_evaluation_overview(payload)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/api.py` around lines 1520 - 1544, The get_evaluation_overview
endpoint is unthrottled and needs rate limiting; add the same
rate-limiting/dependency used by other read endpoints to throttle expensive
aggregates. Modify the get_evaluation_overview route to require the existing
rate limiter (e.g., add a Depends(rate_limit_dependency) or the shared
rate_limit decorator) so requests to
core_router.post("/api/get_evaluation_overview") will be limited before calling
get_reflexio(org_id=...) and reflexio.get_evaluation_overview(request). Ensure
you reference the same rate limiter function/class used elsewhere so the
endpoint inherits consistent limits and behavior.
| for c in corrections: | ||
| idx = min(_BUCKET_COUNT - 1, c) | ||
| bins[idx] += 1 |
There was a problem hiding this comment.
Guard negative correction counts before bin indexing.
A negative c currently increments the last bin because Python accepts negative list indexes, which silently corrupts the histogram.
Suggested fix
for c in corrections:
- idx = min(_BUCKET_COUNT - 1, c)
+ idx = min(_BUCKET_COUNT - 1, max(0, c))
bins[idx] += 1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for c in corrections: | |
| idx = min(_BUCKET_COUNT - 1, c) | |
| bins[idx] += 1 | |
| for c in corrections: | |
| idx = min(_BUCKET_COUNT - 1, max(0, c)) | |
| bins[idx] += 1 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/evaluation_overview/distribution.py` around lines 28
- 30, The loop over corrections allows negative c to index bins (Python negative
indices), corrupting the histogram; update the computation of idx in the loop
that uses _BUCKET_COUNT, bins and c (the for c in corrections block) to clamp c
into the valid range before indexing (e.g., ensure c is at least 0 and at most
_BUCKET_COUNT-1 using max/min) or skip negative values, then increment
bins[idx].
| rows.sort( | ||
| key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)), | ||
| ) | ||
| return rows[:top_n] |
There was a problem hiding this comment.
Make tie ordering deterministic for stable top-N output.
When net_sessions and total fires tie, order depends on set iteration, so equal rows can shuffle between runs.
Suggested fix
rows.sort(
- key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)),
+ key=lambda r: (
+ -r.net_sessions,
+ -(r.successes_with + r.failures_with),
+ r.kind,
+ r.rule_id,
+ ),
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rows.sort( | |
| key=lambda r: (-r.net_sessions, -(r.successes_with + r.failures_with)), | |
| ) | |
| return rows[:top_n] | |
| rows.sort( | |
| key=lambda r: ( | |
| -r.net_sessions, | |
| -(r.successes_with + r.failures_with), | |
| r.kind, | |
| r.rule_id, | |
| ), | |
| ) | |
| return rows[:top_n] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/evaluation_overview/rule_attribution.py` around
lines 80 - 83, The current sort key only uses (-r.net_sessions,
-(r.successes_with + r.failures_with)) so ties are broken by unpredictable set
iteration; update the ordering to include a deterministic tie-breaker (e.g.,
append r.rule_id or r.rule_name as the last element of the sort key) so
rows.sort(...) becomes deterministic before slicing to rows[:top_n]; if the row
objects lack a stable id/name, use a stable string representation (e.g., str of
a unique tuple of identifying fields) as the final tiebreaker.
| all_results = self.storage.get_agent_success_evaluation_results( # type: ignore[attr-defined] | ||
| agent_version=None, limit=10_000 | ||
| ) |
There was a problem hiding this comment.
Avoid hard-capping aggregate source data at 10,000 rows.
This silently truncates large org datasets and can skew hero state, tiles, attribution, and distribution outputs.
A safer approach is a storage query scoped by time window (or paginated reads until exhaustion) so aggregates are computed on complete data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/evaluation_overview/service.py` around lines 62 -
64, The code is hard-capping results with limit=10_000 in the call to
storage.get_agent_success_evaluation_results which can silently truncate
org-level aggregates; change the call in service.py (the all_results assignment)
to either query by a time window (pass explicit start/end timestamps) or
implement paginated reads (loop calling get_agent_success_evaluation_results
with an offset/next_token until exhausted) so you compute aggregates over the
complete result set rather than a fixed 10_000 limit; ensure any new params
(time bounds or pagination token) are plumbed through and documented in the
calling flow.
| hero = self._build_hero(request, results) | ||
| tiles = self._build_tiles(results, results_prev) |
There was a problem hiding this comment.
Compute onboarding age from org history, not just the current window.
days_since_first_eval is derived from window-filtered results, so older orgs can be incorrectly classified as early/empty when the selected window starts recently.
Suggested fix
- hero = self._build_hero(request, results)
+ earliest_all = min((r.created_at for r in all_results), default=None)
+ now_ts = int(datetime.now(UTC).timestamp())
+ days_since_first_eval = (
+ None if earliest_all is None else (now_ts - earliest_all) // 86_400
+ )
+ hero = self._build_hero(request, results, days_since_first_eval)
@@
def _build_hero(
self,
request: GetEvaluationOverviewRequest,
results: list[AgentSuccessEvaluationResult],
+ days_since_first_eval: int | None,
) -> HeroBlock:
- if not results:
- days_since = None
- else:
- earliest = min(r.created_at for r in results)
- days_since = (int(datetime.now(UTC).timestamp()) - earliest) // 86_400
n_shadow = self.storage.count_sessions_with_shadow_content( # type: ignore[attr-defined]
request.from_ts, request.to_ts
)
state = compute_hero_state(
shadow_enabled=self.config.shadow_mode_enabled,
- days_since_first_eval=days_since,
+ days_since_first_eval=days_since_first_eval,
n_shadow_in_window=n_shadow,
total_results=len(results),
)Also applies to: 86-95, 99-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/evaluation_overview/service.py` around lines 72 -
73, The code computes days_since_first_eval from window-filtered results causing
misclassification; change the logic in _build_hero and _build_tiles to derive
days_since_first_eval from the full org history instead of the windowed
`results` (e.g., compute earliest evaluation date using
`org_history`/`all_results` or a provided `results_all` collection), update the
places that currently use the windowed `results` (references:
days_since_first_eval, _build_hero, _build_tiles) to use that earliest date, and
propagate that value to the existing branching logic used in the code blocks
around the current uses (the blocks covering lines 86-95 and 99-104) so
`early`/`empty` classification is based on org-wide history not the selected
time window.
|
Superseded by #91 — consolidated evaluations + Braintrust PR. Branch remains on remote for history. |
…valuation_overview Surfaces imported_score data alongside the existing tiles in the overview response. - Adds `BraintrustTileRow` (scorer_name, current, n, delta) to the response model, exposed as a new `braintrust_tiles` list (defaults to []). - Adds a default no-op `get_imported_scores(org_id, from_ts, to_ts)` on BaseStorage. Real per-backend implementations are a follow-up. - Adds `EvaluationOverviewService._build_braintrust_tiles` that aggregates scores by scorer_name (mean + count) and computes a vs-prior-window delta. When no prior data exists, delta == current — matches the existing tile "no baseline" convention so the frontend can detect it. - Updates the storage-unconfigured default response to include `braintrust_tiles: []`. 3 new tests (aggregator, empty case, deltas with + without baseline). 53 tests pass; ruff + pyright clean on touched files. Depends on PR #86 (Plan B-backend — eval overview endpoint) AND PR #87 (Plan C-backend — Braintrust connector foundation).
…hods Combines items 7 and 8 — same file (sqlite_storage/_extras.py), same test fixture pattern, one PR. **Plan B-backend (count_sessions_with_shadow_content, get_interactions_by_session):** - Both methods JOIN interactions ↔ requests since session_id lives on the request. count_sessions_with_shadow_content filters by COALESCE of shadow_content != ''; get_interactions_by_session orders by i.created_at. **Plan C-backend + Plan C-overview (4 connection methods + get_imported_scores):** - New tables `braintrust_connection` (PK org_id) and `imported_score` (UNIQUE org_id+source+source_run_id+scorer_name for idempotent re-sync; indexes on session_id and ts). - Upsert via ON CONFLICT DO UPDATE on the existing PK / UNIQUE. - API keys stored encrypted (the service encrypts before save). **Tests (15 pass, all integration-marked, tempfile-isolated):** - Shadow-count: zero/distinct-sessions/window filter - get_interactions_by_session: correct session/empty/unknown - BraintrustConnection: roundtrip/upsert/get-unknown/delete-idempotent - ImportedScore: window/idempotent re-sync/org-filter/empty-noop Lint + pyright clean. Supabase + Postgres backends are explicit follow-ups. Depends on: PR #86 (Plan B-backend), PR #87 (Plan C-backend), PR #88 (Plan C-overview). All three must merge before this rebases cleanly.
Plan B-backend ships the aggregate endpoint that the redesigned
/evaluationspage (Plan B-frontend) reads from. All hero/tile/attribution/distribution computation happens server-side; the frontend renders, never computes.Summary
POST /api/get_evaluation_overview— single round-trip returning hero state, four context tiles with week-over-week deltas, top rule attribution, and corrections distribution.Config.shadow_mode_enabled: bool = False— new toggle that drives the hero state machine (drives full vs early vs shadow_off).hero_state.compute_hero_state(...)— 4-state machine per spec §3.2distribution.bucket_corrections(...)— 6-bin corrections histogram (0/1/2/3/4/5+)rule_attribution.compute_net_sessions(...)— joins citations + success outcomes, top-N by net sessionsEvaluationOverviewServicecomposes the four pieces.BaseStorage(count_sessions_with_shadow_content,get_interactions_by_session) so backends can override without breaking the endpoint.Two adjustments from spec
AgentSuccessEvaluationResulthad a 0.0–1.0 score field; it doesn't (onlyis_successexists). Replaced with corrections-per-session bins — still shows the "shape of quality" and shifts visibly as sessions improve.PlaybookApplicationStat. The spec §4.4 proposed a newrule_applicationtable; the existingInteraction.citationsJSON column +/api/get_playbook_application_statsalready cover this. No new table.Test plan
Companion PR
ReflexioAI/reflexio-enterprise#<TBD>bumps the submodule pointer here.Out of scope (out-of-scope follow-ups)
get_interactions_by_sessiondefaults to[]; rule attribution panel is empty until backends implement real queries.References
docs/superpowers/specs/2026-05-21-evaluations-redesign-and-braintrust-integration-design.md(§3.2, §3.3, §4.1, §4.2, §4.4, §4.5)docs/superpowers/plans/2026-05-23-evaluations-plan-b-backend.mdSummary by CodeRabbit