feat(storage): SQLite implementations for Plan B + Plan C storage methods by yilu331 · Pull Request #89 · ReflexioAI/reflexio

yilu331 · 2026-05-23T22:44:52Z

Combines items 7 + 8 from the remaining-work list — same file, same test fixture pattern, one PR.

Plan B-backend storage gaps:

count_sessions_with_shadow_content — JOINs interactions ↔ requests, COALESCE(shadow_content, '') != ''
get_interactions_by_session — JOINs to filter by Request.session_id
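
For illustration, the two Plan B queries could look roughly like the sketch below. The table layout, the column names (`request_id`, `created_at`, `interaction_id`), and the inclusive time window are assumptions made for this example; only the JOIN through `requests`, the `COALESCE(shadow_content, '') != ''` filter, and the session filter come from the PR text.

```python
import sqlite3

# Hypothetical minimal schema: the real reflexio tables have more columns
# and may use different names.
SCHEMA = """
CREATE TABLE requests (
    request_id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL
);
CREATE TABLE interactions (
    interaction_id INTEGER PRIMARY KEY,
    request_id TEXT NOT NULL REFERENCES requests(request_id),
    shadow_content TEXT,
    created_at INTEGER NOT NULL
);
"""


def count_sessions_with_shadow_content(conn, from_ts, to_ts):
    """Count distinct sessions with non-empty shadow_content in [from_ts, to_ts]."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT r.session_id)
        FROM interactions AS i
        JOIN requests AS r ON r.request_id = i.request_id
        WHERE COALESCE(i.shadow_content, '') != ''
          AND i.created_at >= ? AND i.created_at <= ?
        """,
        (from_ts, to_ts),
    ).fetchone()
    return row[0]


def get_interactions_by_session(conn, session_id):
    """Return interactions for one session, oldest first."""
    return conn.execute(
        """
        SELECT i.interaction_id, i.shadow_content, i.created_at
        FROM interactions AS i
        JOIN requests AS r ON r.request_id = i.request_id
        WHERE r.session_id = ?
        ORDER BY i.created_at
        """,
        (session_id,),
    ).fetchall()


def build_demo_db():
    """In-memory database with three sessions; only s1 has shadow content."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO requests VALUES (?, ?)",
        [("req1", "s1"), ("req2", "s1"), ("req3", "s2"), ("req4", "s3")],
    )
    conn.executemany(
        "INSERT INTO interactions (request_id, shadow_content, created_at) "
        "VALUES (?, ?, ?)",
        [("req1", "shadow A", 100), ("req2", "shadow B", 200),
         ("req3", None, 150), ("req4", "", 160)],
    )
    return conn
```

The `COALESCE` makes NULL and the empty string behave the same way, so session s2 (NULL) and session s3 (empty string) are both excluded, and `COUNT(DISTINCT ...)` counts s1 once even though it has two matching interactions.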

Plan C-backend + Plan C-overview storage gaps:

New tables braintrust_connection (PK org_id) and imported_score (UNIQUE org_id+source+source_run_id+scorer_name; indexes on session_id and ts)
4 connection methods (save/get/delete + the new get for tiles aggregation)
save_imported_scores upsert via ON CONFLICT DO UPDATE for idempotent re-sync
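
A minimal sketch of the idempotent re-sync, assuming a simplified `imported_score` table. The UNIQUE key and the two indexes follow the description above; the remaining columns (`id`, `score`), the index names, and the dict-based input are assumptions for this example. SQLite needs version 3.24 or newer for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3

# Hypothetical column set; only the UNIQUE key and the indexed columns
# (session_id, ts) are taken from the PR description.
SCHEMA = """
CREATE TABLE imported_score (
    id INTEGER PRIMARY KEY,
    org_id TEXT NOT NULL,
    source TEXT NOT NULL,
    source_run_id TEXT NOT NULL,
    scorer_name TEXT NOT NULL,
    session_id TEXT,
    score REAL NOT NULL,
    ts INTEGER NOT NULL,
    UNIQUE (org_id, source, source_run_id, scorer_name)
);
CREATE INDEX idx_imported_score_session ON imported_score(session_id);
CREATE INDEX idx_imported_score_ts ON imported_score(ts);
"""

UPSERT = """
INSERT INTO imported_score
    (org_id, source, source_run_id, scorer_name, session_id, score, ts)
VALUES
    (:org_id, :source, :source_run_id, :scorer_name, :session_id, :score, :ts)
ON CONFLICT (org_id, source, source_run_id, scorer_name) DO UPDATE SET
    session_id = excluded.session_id,
    score = excluded.score,
    ts = excluded.ts
"""


def save_imported_scores(conn, scores):
    """Insert or update scores; returns the number of rows submitted."""
    if not scores:
        return 0  # empty input is a no-op
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(UPSERT, scores)
    return len(scores)


def open_demo_db():
    """In-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Saving the same `(org_id, source, source_run_id, scorer_name)` twice leaves a single row carrying the latest values, which is what makes a repeated sync safe.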

Tests

15 pass, integration-marked, tempfile-isolated. Covers shadow-count window/distinct/empty, session filter cases, connection roundtrip/upsert/delete-idempotent, imported scores idempotent re-sync + window + org-filter + empty-noop.

Depends on

reflexio#86 + reflexio#87 + reflexio#88 must merge first.

Out of scope

Supabase + Postgres backends — explicit follow-up ("Plan C-backends" / "Plan B-backends").

Summary by CodeRabbit

New Features

Braintrust Integration: Connect and sync Braintrust projects to import evaluation scores into the platform.
Evaluation Overview Dashboard: New endpoint providing comprehensive evaluation metrics including hero state, context tiles, rule attribution analysis, score distribution, and Braintrust score aggregates.
Shadow Mode Configuration: Added toggle to enable shadow-mode evaluations for side-by-side comparison scenarios.
Secure Credential Storage: Implemented encrypted storage for external connector credentials.

…apper

Adds the minimum foundation for the Braintrust connector. A customer can paste an API key, see their workspaces, pick projects, and trigger a one-shot import of scorer outputs. Scheduled (cron) sync, score aggregation into /api/get_evaluation_overview, and the frontend settings UI are explicit follow-ups (Plan C-cron, Plan C-overview, Plan C-frontend).

- **Schemas**: BraintrustConnection, ImportedScore, plus request/response models for the four endpoints.
- **BraintrustClient**: HTTP wrapper over the Braintrust REST API (validate_key, list_organizations, list_projects, list_experiments, list_spans). Pure httpx; raises BraintrustAuthError / BraintrustHTTPError.
- **Encryption helper**: Fernet via REFLEXIO_FERNET_KEYS env var. Passes through unchanged when no key is configured (matches existing EncryptManager behavior, keeps OS dev mode workable).
- **BraintrustConnectorService**: orchestrates connect / select_projects / status / disconnect / sync_once. Extracts session_id from span.metadata['reflexio_session_id'] when present (progressive matching).
- **Storage default no-op stubs** on BaseStorage for save/get/delete Braintrust connection + save imported scores. Per-backend SQLite/Supabase implementations are a follow-up.
- **5 routes**: /api/braintrust/{connect,select_projects,status,sync}, DELETE /api/braintrust/connection. API keys never echoed in responses.
34 tests pass:
- 8 schema round-trip tests
- 4 encryption tests (passthrough + Fernet roundtrip)
- 8 HTTP client tests (mocked transport)
- 10 service tests (in-memory storage + stub client, including span-metadata session matching and last_sync_ts persistence)
- 4 endpoint smoke tests (disconnected state, sync without connection, etc.)

Follow-ups:
- Plan C-cron: scheduled 15-min sync via a new RecurringJobsScheduler
- Plan C-overview: surface Braintrust tiles in /api/get_evaluation_overview
- Plan C-frontend: settings UI + context-band tiles
- Plan C-backends: real SQLite + Supabase + Postgres implementations of the four new storage methods

Spec: docs/superpowers/specs/2026-05-21-evaluations-redesign-and-braintrust-integration-design.md §5, §6, §7
Plan: docs/superpowers/plans/2026-05-23-evaluations-plan-c-backend.md (in the enterprise repo)

…valuation_overview

Surfaces imported_score data alongside the existing tiles in the overview response.

- Adds `BraintrustTileRow` (scorer_name, current, n, delta) to the response model, exposed as a new `braintrust_tiles` list (defaults to []).
- Adds a default no-op `get_imported_scores(org_id, from_ts, to_ts)` on BaseStorage. Real per-backend implementations are a follow-up.
- Adds `EvaluationOverviewService._build_braintrust_tiles` that aggregates scores by scorer_name (mean + count) and computes a vs-prior-window delta. When no prior data exists, delta == current, which matches the existing tile "no baseline" convention so the frontend can detect it.
- Updates the storage-unconfigured default response to include `braintrust_tiles: []`.

3 new tests (aggregator, empty case, deltas with + without baseline). 53 tests pass; ruff + pyright clean on touched files.

Depends on PR #86 (Plan B-backend, eval overview endpoint) AND PR #87 (Plan C-backend, Braintrust connector foundation).
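
The tile aggregation described in this commit can be sketched as follows. This is not the actual `_build_braintrust_tiles` code: the `TileRow` dataclass, the `(scorer_name, score)` pair input, and the alphabetical ordering are assumptions for illustration. The mean, the count, and the "delta equals current when there is no baseline" rule come from the commit message.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class TileRow:
    """Stand-in for BraintrustTileRow (scorer_name, current, n, delta)."""
    scorer_name: str
    current: float
    n: int
    delta: float


def _group_by_scorer(pairs):
    """Group (scorer_name, score) pairs into {scorer_name: [scores]}."""
    grouped = defaultdict(list)
    for name, score in pairs:
        grouped[name].append(score)
    return grouped


def build_tiles(current_scores, prior_scores):
    """Aggregate current-window scores and compare them to the prior window."""
    current = _group_by_scorer(current_scores)
    prior = _group_by_scorer(prior_scores)
    tiles = []
    for name in sorted(current):  # ordering is an assumption of this sketch
        mean_now = fmean(current[name])
        if name in prior:
            delta = mean_now - fmean(prior[name])
        else:
            delta = mean_now  # "no baseline" convention: delta == current
        tiles.append(TileRow(name, mean_now, len(current[name]), delta))
    return tiles
```

A scorer that appears only in the prior window produces no tile here, since there is no current value to show; whether the real service behaves the same way is not stated in the PR.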

…hods

Combines items 7 and 8: same file (sqlite_storage/_extras.py), same test fixture pattern, one PR.

**Plan B-backend (count_sessions_with_shadow_content, get_interactions_by_session):**
- Both methods JOIN interactions ↔ requests since session_id lives on the request. count_sessions_with_shadow_content filters by COALESCE of shadow_content != ''; get_interactions_by_session orders by i.created_at.

**Plan C-backend + Plan C-overview (4 connection methods + get_imported_scores):**
- New tables `braintrust_connection` (PK org_id) and `imported_score` (UNIQUE org_id+source+source_run_id+scorer_name for idempotent re-sync; indexes on session_id and ts).
- Upsert via ON CONFLICT DO UPDATE on the existing PK / UNIQUE.
- API keys stored encrypted (the service encrypts before save).

**Tests (15 pass, all integration-marked, tempfile-isolated):**
- Shadow-count: zero/distinct-sessions/window filter
- get_interactions_by_session: correct session/empty/unknown
- BraintrustConnection: roundtrip/upsert/get-unknown/delete-idempotent
- ImportedScore: window/idempotent re-sync/org-filter/empty-noop

Lint + pyright clean. Supabase + Postgres backends are explicit follow-ups.

Depends on: PR #86 (Plan B-backend), PR #87 (Plan C-backend), PR #88 (Plan C-overview). All three must merge before this rebases cleanly.

coderabbitai · 2026-05-23T22:45:02Z

📝 Walkthrough

Walkthrough

This PR introduces two major features: a Braintrust connector for syncing evaluation scores with encrypted API key storage, and a comprehensive evaluation overview dashboard endpoint. Both features require new schemas, storage infrastructure, and API wiring alongside supporting configuration and test coverage.

Changes

Braintrust Connector and Evaluation Overview Dashboard

Layer / File(s)	Summary
Data Schemas and Configuration `reflexio/models/api_schema/braintrust_schema.py`, `reflexio/models/api_schema/eval_overview_schema.py`, `reflexio/models/config_schema.py`, `tests/models/test_braintrust_schema.py`, `tests/models/test_config_shadow_mode.py`, `tests/models/test_eval_overview_schema.py`	Braintrust schemas define persisted connections, imported scores, workspace/project summaries, and request/response models for connect/select/status/sync flows. Evaluation overview schemas define hero block, context tiles, rule attribution, score distribution, and response aggregation. Config adds `shadow_mode_enabled` flag with tests for defaults and serialization.
Braintrust HTTP Client and Encryption `reflexio/server/services/braintrust/_encryption.py`, `reflexio/server/services/braintrust/client.py`, `tests/server/services/braintrust/test_encryption.py`, `tests/server/services/braintrust/test_client.py`	BraintrustClient provides typed methods to validate API keys, list organizations/projects/experiments/spans via httpx with bearer token auth and response unwrapping. Encryption module provides Fernet-based encrypt/decrypt with lazy loading from `REFLEXIO_FERNET_KEYS` environment variable and passthrough fallback when unconfigured.
Braintrust Service and Storage Persistence `reflexio/server/services/braintrust/service.py`, `reflexio/server/services/storage/sqlite_storage/_base.py`, `reflexio/server/services/storage/sqlite_storage/_extras.py`, `reflexio/server/services/storage/storage_base/_extras.py`, `tests/server/services/braintrust/test_service.py`	BraintrustConnectorService orchestrates connect/select/status/disconnect/sync workflows, managing encrypted connection storage and score import. SQLite schema adds `braintrust_connection` and `imported_score` tables with indexes. Storage base and implementation provide persistence methods for connections and imported scores, with full integration test coverage for save/get/delete/upsert and window-based queries.
Braintrust API Endpoints and Dashboard Wrappers `reflexio/lib/_dashboard.py`, `reflexio/server/api.py`, `tests/server/api_endpoints/test_braintrust_api.py`	API routes wire connect/select/status/disconnect/sync endpoints that resolve org_id and delegate to per-org reflexio instance. DashboardMixin adds braintrust_* wrapper methods normalizing dict inputs into request models and constructing service from storage and request context.
Evaluation Overview Helper Modules `reflexio/server/services/evaluation_overview/distribution.py`, `reflexio/server/services/evaluation_overview/hero_state.py`, `reflexio/server/services/evaluation_overview/rule_attribution.py`, `tests/server/services/evaluation_overview/test_distribution.py`, `tests/server/services/evaluation_overview/test_hero_state.py`, `tests/server/services/evaluation_overview/test_rule_attribution.py`	Pure utility modules compute hero state (empty/shadow_off/early/full) with shadow mode awareness and threshold precedence, bucket correction counts into six bins (0–5+), and aggregate rule citations into net_sessions attribution rows with top_n ranking.
Evaluation Overview Service `reflexio/server/services/evaluation_overview/service.py`, `tests/server/services/evaluation_overview/test_service_integration.py`, `tests/server/services/evaluation_overview/test_braintrust_tiles.py`	EvaluationOverviewService fetches current/prior evaluation results and orchestrates building hero block, context tiles (success %, corrections, turns, escalation), rule attribution, score distribution, and Braintrust tile aggregates. Includes window-aligned bucketing, per-scorer Braintrust score grouping with delta computation, and integration tests with mocked storage.
Evaluation Overview API and Dashboard Integration `reflexio/lib/_dashboard.py`, `reflexio/server/api.py`, `tests/server/api_endpoints/test_evaluation_overview_api.py`	POST `/api/get_evaluation_overview` endpoint resolves org_id and delegates to service. DashboardMixin wrapper normalizes dict input, returns empty response when storage unconfigured, otherwise runs service with storage and config to produce full metric response.
Storage Infrastructure and Tests `reflexio/server/services/storage/storage_base/_extras.py`, `reflexio/server/services/storage/sqlite_storage/__init__.py`, `reflexio/server/services/storage/sqlite_storage/_base.py`, `reflexio/server/services/storage/sqlite_storage/_extras.py`, `tests/server/services/storage/test_sqlite_storage_bc_extras.py`	Base storage mixin defines default no-op methods for shadow content counting, session interaction lookup, and Braintrust connection/score persistence. SQLite implementation provides actual SQL queries with locking and transactions. Integration tests cover roundtrip persistence, upsert overwriting, window filtering, idempotent operations, and empty-input edge cases.
Minor Cleanup and Datetime Updates `reflexio/server/api_endpoints/stall_state_api.py`, `reflexio/server/llm/providers/claude_code_stream_parser.py`, `tests/client/test_stall_state_client.py`, `tests/server/api_endpoints/test_stall_state_api.py`, `tests/server/llm/test_claude_code_stream_parser.py`, `tests/server/services/storage/test_stall_state.py`	Update datetime imports across modules to use `UTC` directly from `datetime` instead of `timezone.utc`. Reformat imports for consistency (parenthesized multi-line blocks).
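
As a rough illustration of the six-bin bucketing listed for `distribution.py`, the sketch below clamps counts of 5 or more into the last bin and skips negative counts, which is one of the two guards the review suggests for that loop. The function name and the choice to skip rather than clamp negatives are assumptions; the bin labels match the ones asserted in the integration test.

```python
BUCKET_COUNT = 6
LABELS = ["0", "1", "2", "3", "4", "5+"]


def bucket_corrections(corrections):
    """Count sessions per correction count, with 5 or more sharing the last bin."""
    bins = [0] * BUCKET_COUNT
    for c in corrections:
        if c < 0:
            # Without this guard, min(BUCKET_COUNT - 1, c) stays negative and
            # Python's negative indexing would increment a bin from the end.
            continue
        bins[min(BUCKET_COUNT - 1, c)] += 1
    return bins
```

For example, `bucket_corrections([0, 1, 3])` yields `[1, 1, 0, 1, 0, 0]`, while a stray `-1` is ignored instead of landing in the "5+" bin.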

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Scores from Braintrust now encrypted and stored,
Hero states computed with shadow modes explored,
Rules that moved needles ranked by their might,
Dashboard tiles gleaming with metrics so bright,
Two features dancing—connected and tight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 57.39% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(storage): SQLite implementations for Plan B + Plan C storage methods' directly and specifically describes the main change: adding SQLite implementations for both Plan B and Plan C storage operations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (2)

tests/server/services/evaluation_overview/test_rule_attribution.py (1)

76-76: ⚡ Quick win

Remove the no-op assignment.

The assignment _ = RuleAttribution doesn't verify anything. If the intent is to ensure the import works, the import itself at line 4 already does that.

♻️ Remove the no-op line

-    _ = RuleAttribution  # imported for export sanity; no-op
     citations_by_session = {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/evaluation_overview/test_rule_attribution.py` at line
76, Remove the redundant no-op assignment `_ = RuleAttribution`; the import of
RuleAttribution already exercises the symbol, so delete the line containing `_ =
RuleAttribution` in the test file (leave the original import intact and no other
changes to function/class names such as RuleAttribution or the test structure).

tests/server/services/evaluation_overview/test_service_integration.py (1)

51-57: ⚡ Quick win

Tighten this test to be deterministic and regression-sensitive.

Line 51 and Lines 54-57 currently allow broad pass conditions, so real behavior regressions can slip through.

Suggested tightening

-    response = svc.run(
-        GetEvaluationOverviewRequest(from_ts=0, to_ts=int(time.time()))
-    )
+    response = svc.run(
+        GetEvaluationOverviewRequest(from_ts=0, to_ts=1702592000)
+    )
 
-    assert response.hero.state in ("full", "early", "shadow_off", "empty")
-    assert response.context_tiles.success.current >= 0.0
-    assert len(response.score_distribution.current_bins) == 6
+    assert response.hero.state == "full"
+    assert response.context_tiles.success.current > 0.0
+    assert response.score_distribution.current_bins == [1, 1, 0, 1, 0, 0]
     assert response.score_distribution.labels == ["0", "1", "2", "3", "4", "5+"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/evaluation_overview/test_service_integration.py` around
lines 51 - 57, The test currently allows broad, non-deterministic outcomes; make
it deterministic by using a fixed to_ts in the GetEvaluationOverviewRequest
(replace int(time.time()) with a stable timestamp or a test fixture) and tighten
the assertions to exact expected values: assert response.hero.state equals the
single expected state string, assert response.context_tiles.success.current
equals the expected numeric value, and assert
response.score_distribution.current_bins and response.score_distribution.labels
equal the exact expected lists (not just length or a loose subset); update the
assertions in this test (the GetEvaluationOverviewRequest call and checks of
response.hero.state, response.context_tiles.success.current,
response.score_distribution.current_bins, and
response.score_distribution.labels) so the test fails on any regression rather
than accepting broad ranges.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@reflexio/models/api_schema/eval_overview_schema.py`:
- Around line 104-105: Replace the Unicode minus character in the docstring for
the parameter named "delta" with an ASCII hyphen; locate the docstring in
reflexio/models/api_schema/eval_overview_schema.py where the parameter
description reads "delta (float): Mean current − mean prior-window..." and
change the Unicode minus "−" to the ASCII "-" so the line becomes "Mean current
- mean prior-window..." to eliminate the RUF002 ambiguity warning.

In `@reflexio/server/services/braintrust/_encryption.py`:
- Around line 52-53: The code currently returns None from _load() when no valid
Fernet keys are parsed, which lets encrypt() silently fall back to plaintext;
change _load() so that when the REFLEXIO_FERNET_KEYS env var is set but parsing
yields zero valid Fernet instances (i.e., local variable fernets is empty), it
raises an explicit exception (e.g., ValueError or RuntimeError) to fail closed;
alternatively, ensure encrypt()/decrypt() check for the presence of
REFLEXIO_FERNET_KEYS and raise if fernets is None/empty instead of defaulting to
plaintext. Reference _load(), encrypt(), decrypt(), and the REFLEXIO_FERNET_KEYS
parsing/fernets variable to locate the change.

In `@reflexio/server/services/braintrust/client.py`:
- Around line 141-152: The _get method in BraintrustClient should catch
httpx.RequestError raised by self._client.get and normalize transport/timeouts
into the connector's structured error; wrap the call to self._client.get in a
try/except catching httpx.RequestError as exc and re-raise a BraintrustHTTPError
that includes the transport error details (e.g., an explicit error message and
the original exception info) so transport-level failures no longer escape as raw
httpx exceptions; preserve the existing handling for HTTP 401/403 and non-2xx
responses in BraintrustClient._get.

In `@reflexio/server/services/braintrust/service.py`:
- Around line 167-169: The call to self.storage.save_imported_scores(all_scores)
can raise and currently prevents updating the connection error state; wrap that
call in a try/except that catches exceptions, records the exception text into
the connection last_error (use the same path _persist_sync_outcome(connection,
error=...)) and ensure _persist_sync_outcome is invoked in both success and
failure paths before returning or re-raising; update the SyncBraintrustResponse
flow so when save_imported_scores fails you persist the error and return or
raise consistently instead of letting the exception escape untracked.
- Around line 70-88: connect() and sync_once() create a BraintrustClient via
client_factory(request.api_key) but never close it and sync_once() may skip
_persist_sync_outcome if storage.save_imported_scores(all_scores) raises; update
both connect() and sync_once() to ensure the BraintrustClient is closed (call
client.close() in a finally block or use a context manager around
BraintrustClient) and in sync_once() wrap
self.storage.save_imported_scores(all_scores) in its own try/except so any
StorageError (or generic Exception) is caught, still call
_persist_sync_outcome(...) with a failure/result indicating persistence failed,
and re-raise or log as appropriate; reference the client creation site via
client_factory and the client type BraintrustClient plus _persist_sync_outcome
and self.storage.save_imported_scores to locate the changes.

In `@reflexio/server/services/evaluation_overview/distribution.py`:
- Around line 28-30: The loop over corrections allows negative c values which
trigger Python negative indexing; update the loop in distribution.py (the for c
in corrections: block using _BUCKET_COUNT, idx, and bins) to guard negatives
before computing idx—e.g., if c < 0: continue (or clamp c = 0 if you prefer
mapping negatives to the first bucket) so that idx = min(_BUCKET_COUNT - 1, c)
never receives a negative value and invalid corrections are not mis-bucketed.

In `@reflexio/server/services/evaluation_overview/service.py`:
- Around line 74-75: The code computes days_since_first_eval from the
window-filtered `results`, which causes hero state flips when older evaluations
exist; update the logic so `days_since_first_eval` is derived from the
full/unfiltered evaluations collection (e.g., `all_results` or whatever variable
holds all evaluations) rather than `results`, and pass that full collection into
`_build_hero` (or compute it before calling `_build_hero`) — update both places
mentioned (around the `days_since_first_eval` calculation and the `_build_hero`
call in the same block and the 92-102 region) so the hero uses the earliest
evaluation date across all evaluations, not just the current window.

In `@tests/server/api_endpoints/test_evaluation_overview_api.py`:
- Line 21: The assertion currently allows non-empty variants; tighten it to only
accept the empty-state variants by changing the check on body["hero"]["state"]
so it only permits the expected empty values (e.g., "shadow_off" and "empty")
instead of ("full", "early", "shadow_off", "empty"); update the assertion that
references body["hero"]["state"] in the test to reflect this narrower set.

In `@tests/server/services/braintrust/test_encryption.py`:
- Around line 43-46: In test_finalize_env_cleanup, avoid mutating process-wide
env directly: add the pytest monkeypatch fixture parameter to the test signature
and replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with
monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) so pytest will restore
the original environment automatically; keep the call to
_encryption._reset_for_test() as-is.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ef3cd802-8f65-4ff5-8158-9aa1b945c20a

📥 Commits

Reviewing files that changed from the base of the PR and between 25f8034 and 240cfe6.

📒 Files selected for processing (39)

reflexio/lib/_dashboard.py
reflexio/models/api_schema/braintrust_schema.py
reflexio/models/api_schema/eval_overview_schema.py
reflexio/models/config_schema.py
reflexio/server/api.py
reflexio/server/api_endpoints/stall_state_api.py
reflexio/server/llm/providers/claude_code_stream_parser.py
reflexio/server/services/braintrust/__init__.py
reflexio/server/services/braintrust/_encryption.py
reflexio/server/services/braintrust/client.py
reflexio/server/services/braintrust/service.py
reflexio/server/services/evaluation_overview/__init__.py
reflexio/server/services/evaluation_overview/distribution.py
reflexio/server/services/evaluation_overview/hero_state.py
reflexio/server/services/evaluation_overview/rule_attribution.py
reflexio/server/services/evaluation_overview/service.py
reflexio/server/services/storage/sqlite_storage/__init__.py
reflexio/server/services/storage/sqlite_storage/_base.py
reflexio/server/services/storage/sqlite_storage/_extras.py
reflexio/server/services/storage/storage_base/_extras.py
tests/client/test_stall_state_client.py
tests/models/test_braintrust_schema.py
tests/models/test_config_shadow_mode.py
tests/models/test_eval_overview_schema.py
tests/server/api_endpoints/test_braintrust_api.py
tests/server/api_endpoints/test_evaluation_overview_api.py
tests/server/api_endpoints/test_stall_state_api.py
tests/server/llm/test_claude_code_stream_parser.py
tests/server/services/braintrust/__init__.py
tests/server/services/braintrust/test_client.py
tests/server/services/braintrust/test_encryption.py
tests/server/services/braintrust/test_service.py
tests/server/services/evaluation_overview/test_braintrust_tiles.py
tests/server/services/evaluation_overview/test_distribution.py
tests/server/services/evaluation_overview/test_hero_state.py
tests/server/services/evaluation_overview/test_rule_attribution.py
tests/server/services/evaluation_overview/test_service_integration.py
tests/server/services/storage/test_sqlite_storage_bc_extras.py
tests/server/services/storage/test_stall_state.py

coderabbitai · 2026-05-23T22:53:32Z

+        delta (float): Mean current − mean prior-window. Equals `current`
+            when no baseline (the frontend renders "no baseline").


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the Unicode minus with ASCII - in this docstring.

The current character triggers Ruff RUF002 ambiguity warnings; swapping to - avoids lint/tooling friction.

🧰 Tools

🪛 Ruff (0.15.13)

[warning] 104-104: Docstring contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 104 - 105, Replace the Unicode minus character in the docstring for the parameter named "delta" with an ASCII hyphen; locate the docstring in reflexio/models/api_schema/eval_overview_schema.py where the parameter description reads "delta (float): Mean current − mean prior-window..." and change the Unicode minus "−" to the ASCII "-" so the line becomes "Mean current - mean prior-window..." to eliminate the RUF002 ambiguity warning.

coderabbitai · 2026-05-23T22:53:32Z

+    if not fernets:
+        return None


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when REFLEXIO_FERNET_KEYS is set but unusable.

If the env var is present and all keys are invalid, _load() returns None, and encrypt() falls back to plaintext storage. That silently disables encryption under production misconfiguration.

Suggested fix

     if not fernets:
-        return None
+        raise RuntimeError(
+            f"{_ENV_KEY} is set but contains no valid Fernet keys"
+        )
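
The fail-closed behavior can be sketched without the `cryptography` dependency. This is a stand-in for `_load()`, not the project's code: the comma-separated key format, the `load_keys` name, and treating a blank value as "unset" are assumptions, and the check only verifies key shape (32 bytes of url-safe base64) rather than building real Fernet instances.

```python
import base64
import binascii

ENV_KEY = "REFLEXIO_FERNET_KEYS"


def _is_plausible_fernet_key(candidate):
    """Shape check only: a Fernet key is 32 bytes encoded as url-safe base64."""
    try:
        raw = base64.urlsafe_b64decode(candidate.encode("ascii"))
    except ValueError:  # covers binascii.Error and UnicodeEncodeError
        return False
    return len(raw) == 32


def load_keys(environ):
    """Return usable keys, None when unconfigured, or raise when misconfigured."""
    raw = environ.get(ENV_KEY, "")
    if not raw.strip():
        return None  # not configured: passthrough mode stays available
    candidates = [part.strip() for part in raw.split(",") if part.strip()]
    valid = [key for key in candidates if _is_plausible_fernet_key(key)]
    if not valid:
        # Fail closed: the operator asked for encryption but no key is usable.
        raise RuntimeError(f"{ENV_KEY} is set but contains no valid Fernet keys")
    return valid
```

The distinction is between "no key configured", where passthrough is intended, and "key configured but broken", where storing plaintext would silently defeat the operator's intent.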


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/braintrust/_encryption.py` around lines 52 - 53, The code currently returns None from _load() when no valid Fernet keys are parsed, which lets encrypt() silently fall back to plaintext; change _load() so that when the REFLEXIO_FERNET_KEYS env var is set but parsing yields zero valid Fernet instances (i.e., local variable fernets is empty), it raises an explicit exception (e.g., ValueError or RuntimeError) to fail closed; alternatively, ensure encrypt()/decrypt() check for the presence of REFLEXIO_FERNET_KEYS and raise if fernets is None/empty instead of defaulting to plaintext. Reference _load(), encrypt(), decrypt(), and the REFLEXIO_FERNET_KEYS parsing/fernets variable to locate the change.

coderabbitai · 2026-05-23T22:53:32Z

+    def _get(
+        self, path: str, *, params: dict[str, Any] | None = None
+    ) -> httpx.Response:
+        url = f"{self.base_url}{path}"
+        response = self._client.get(url, params=params)
+        if response.status_code in (401, 403):
+            raise BraintrustAuthError(
+                f"Braintrust rejected the API key (HTTP {response.status_code})"
+            )
+        if not (200 <= response.status_code < 300):
+            raise BraintrustHTTPError(response.status_code, response.text)
+        return response


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In httpx 0.28.1, does httpx.Client.get(...) raise httpx.RequestError (including timeout/transport failures) instead of always returning a Response?

💡 Result:

In httpx 0.28.1, httpx.Client.get(...) does not always return a Response object [1][2]. When a request fails due to transport-level issues such as connection timeouts, read/write timeouts, or network errors, the method raises an exception that inherits from httpx.RequestError [3][4]. Specifically, httpx.RequestError is the base class for all exceptions that occur when issuing a request [3][4]. If a transport failure occurs (e.g., a timeout), httpx raises a specific subclass of RequestError, such as httpx.TimeoutException (which includes ConnectTimeout, ReadTimeout, etc.) or httpx.NetworkError [3][4]. Because these exceptions are raised during the execution of the request, the method call is interrupted and does not return a Response [3][5]. It is important to distinguish this from HTTP status errors (like 4xx or 5xx responses). If the server successfully returns an HTTP error response, httpx.Client.get(...) will still return a Response object [6][2]. To raise an exception for those status codes, you must explicitly call response.raise_for_status() [3][4].

Citations:

1: https://pypi.org/project/httpx/0.28.1/

2: https://www.python-httpx.org/api/

3: https://github.com/encode/httpx/blob/master/httpx/_exceptions.py

4: https://www.python-httpx.org/exceptions/

5: https://stackoverflow.com/questions/73676661/how-to-trap-this-nested-httpx-exception

6: https://github.com/encode/httpx/blob/master/docs/api.md

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the client file around the _get implementation and nearby error definitions/uses
sed -n '1,220p' reflexio/server/services/braintrust/client.py

# Find where these errors are caught/handled within the repo
rg -n "BraintrustAuthError|BraintrustHTTPError" -S .

# Also inspect the service layer entry points that likely call _get
rg -n "services/braintrust|braintrust" reflexio/server -S

Repository: ReflexioAI/reflexio

Length of output: 23748

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Show Braintrust service implementation around exception handling
sed -n '1,260p' reflexio/server/services/braintrust/service.py
# Show the client tests that assert expected exception types
sed -n '1,220p' tests/server/services/braintrust/test_client.py

Repository: ReflexioAI/reflexio

Length of output: 13485

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "status_code\s*==\s*0|BraintrustHTTPError\(\s*0\b" -S .
rg -n "last_error|Braintrust HTTP error" reflexio/server/services/braintrust -S

Repository: ReflexioAI/reflexio

Length of output: 45

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Where is BraintrustHTTPError.status_code referenced?
rg -n "BraintrustHTTPError.*status_code|\.status_code\b" reflexio tests -S
# Inspect braintrust service tests that may assert exact error strings/status codes
sed -n '1,240p' tests/server/services/braintrust/test_service.py

Repository: ReflexioAI/reflexio

Length of output: 13811

Normalize httpx transport failures into BraintrustHTTPError.

In reflexio/server/services/braintrust/client.py (BraintrustClient._get), httpx.Client.get() raises httpx.RequestError (including timeout/network transport failures). Those currently bypass the connector’s BraintrustAuthError/BraintrustHTTPError handling in BraintrustConnectorService, causing transport-level failures to escape as unhandled exceptions instead of structured sync/connect failures.

Suggested fix

     def _get(
         self, path: str, *, params: dict[str, Any] | None = None
     ) -> httpx.Response:
         url = f"{self.base_url}{path}"
-        response = self._client.get(url, params=params)
+        try:
+            response = self._client.get(url, params=params)
+        except httpx.RequestError as e:
+            raise BraintrustHTTPError(0, f"Request failed: {e}") from e
         if response.status_code in (401, 403):
             raise BraintrustAuthError(
                 f"Braintrust rejected the API key (HTTP {response.status_code})"
             )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    def _get(
-        self, path: str, *, params: dict[str, Any] | None = None
-    ) -> httpx.Response:
-        url = f"{self.base_url}{path}"
-        response = self._client.get(url, params=params)
-        if response.status_code in (401, 403):
-            raise BraintrustAuthError(
-                f"Braintrust rejected the API key (HTTP {response.status_code})"
-            )
-        if not (200 <= response.status_code < 300):
-            raise BraintrustHTTPError(response.status_code, response.text)
-        return response
+    def _get(
+        self, path: str, *, params: dict[str, Any] | None = None
+    ) -> httpx.Response:
+        url = f"{self.base_url}{path}"
+        try:
+            response = self._client.get(url, params=params)
+        except httpx.RequestError as e:
+            raise BraintrustHTTPError(0, f"Request failed: {e}") from e
+        if response.status_code in (401, 403):
+            raise BraintrustAuthError(
+                f"Braintrust rejected the API key (HTTP {response.status_code})"
+            )
+        if not (200 <= response.status_code < 300):
+            raise BraintrustHTTPError(response.status_code, response.text)
+        return response

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/braintrust/client.py` around lines 141 - 152, The _get method in BraintrustClient should catch httpx.RequestError raised by self._client.get and normalize transport/timeouts into the connector's structured error; wrap the call to self._client.get in a try/except catching httpx.RequestError as exc and re-raise a BraintrustHTTPError that includes the transport error details (e.g., an explicit error message and the original exception info) so transport-level failures no longer escape as raw httpx exceptions; preserve the existing handling for HTTP 401/403 and non-2xx responses in BraintrustClient._get.

coderabbitai · 2026-05-23T22:53:32Z

+        client = self.client_factory(request.api_key)
+        try:
+            if not client.validate_key():
+                return ConnectBraintrustResponse(
+                    success=False, msg="Braintrust rejected the API key."
+                )
+            workspaces = self._fetch_workspace_tree(client)
+        except BraintrustAuthError:
+            return ConnectBraintrustResponse(
+                success=False, msg="Braintrust rejected the API key."
+            )
+        except BraintrustHTTPError as e:
+            return ConnectBraintrustResponse(
+                success=False, msg=f"Braintrust HTTP error: {e}"
+            )
+
+        return ConnectBraintrustResponse(
+            success=True, workspaces=workspaces, msg=""
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n --type=py 'client_factory\(|\.close\(' \
  reflexio/server/services/braintrust/service.py \
  reflexio/server/services/braintrust/client.py

Repository: ReflexioAI/reflexio

Length of output: 342

🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,120p' reflexio/server/services/braintrust/service.py | nl -ba | sed -n '55,105p'
echo "----"
sed -n '120,240p' reflexio/server/services/braintrust/service.py | nl -ba | sed -n '120,220p'
echo "----"
sed -n '1,140p' reflexio/server/services/braintrust/client.py | nl -ba | sed -n '1,140p'

Repository: ReflexioAI/reflexio

Length of output: 106

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "---- service.py (approx 55-105) ----"
sed -n '55,105p' reflexio/server/services/braintrust/service.py | cat -n
echo "---- service.py (approx 120-220) ----"
sed -n '120,220p' reflexio/server/services/braintrust/service.py | cat -n
echo "---- client.py (1-140) ----"
sed -n '1,140p' reflexio/server/services/braintrust/client.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 12631

🏁 Script executed:

#!/bin/bash
rg -n "client\.close|BraintrustClient|\.close\(" reflexio/server/services/braintrust/service.py

Repository: ReflexioAI/reflexio

Length of output: 361

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n --hidden --type=py "def save_imported_scores|save_imported_scores\(" reflexio/server | head -n 50

Repository: ReflexioAI/reflexio

Length of output: 468

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "---- storage_base/_extras.py: ~200-280 ----"
sed -n '190,270p' reflexio/server/services/storage/storage_base/_extras.py | cat -n
echo "---- sqlite_storage/_extras.py: ~510-620 ----"
sed -n '520,620p' reflexio/server/services/storage/sqlite_storage/_extras.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 5526

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "def handle_exceptions|handle_exceptions" reflexio/server/services/storage/sqlite_storage -S
rg -n "def handle_exceptions|handle_exceptions" reflexio/server/services/storage/storage_base -S

Repository: ReflexioAI/reflexio

Length of output: 12826

🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '520,740p' reflexio/server/services/storage/sqlite_storage/_base.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 10152

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "class StorageError" reflexio/server/services/storage -S

Repository: ReflexioAI/reflexio

Length of output: 138

🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,120p' reflexio/server/services/storage/error.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 407

Close the Braintrust client (and handle score persistence failures)

connect() and sync_once() create a BraintrustClient but never call client.close() (it wraps an httpx.Client), so HTTP connections can leak.

sync_once() doesn’t guard self.storage.save_imported_scores(all_scores); if it raises (e.g., SQLite wraps failures in StorageError), _persist_sync_outcome(...) is skipped and the exception can escape.

Suggested fix (client lifecycle)

         client = self.client_factory(request.api_key)
         try:
             if not client.validate_key():
                 return ConnectBraintrustResponse(
                     success=False, msg="Braintrust rejected the API key."
                 )
             workspaces = self._fetch_workspace_tree(client)
         except BraintrustAuthError:
             return ConnectBraintrustResponse(
                 success=False, msg="Braintrust rejected the API key."
             )
         except BraintrustHTTPError as e:
             return ConnectBraintrustResponse(
                 success=False, msg=f"Braintrust HTTP error: {e}"
             )
+        finally:
+            client.close()
@@
         client = self.client_factory(api_key)
         since_ts = max(0, int(time.time()) - backfill_days * 24 * 60 * 60)
@@
         try:
             for project_id in connection.project_ids:
                 experiments = client.list_experiments(project_id, since_ts=since_ts)
                 for exp in experiments:
                     spans = client.list_spans(exp["id"])
                     all_scores.extend(_scores_from_spans(spans, self.org_id))
         except BraintrustAuthError:
             self._persist_sync_outcome(connection, error="API key invalid.")
             return SyncBraintrustResponse(
                 success=False, msg="API key invalid; halting sync."
             )
         except BraintrustHTTPError as e:
             self._persist_sync_outcome(connection, error=str(e))
             return SyncBraintrustResponse(success=False, msg=str(e))
+        finally:
+            client.close()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/braintrust/service.py` around lines 70 - 88, connect() and sync_once() create a BraintrustClient via client_factory(request.api_key) but never close it and sync_once() may skip _persist_sync_outcome if storage.save_imported_scores(all_scores) raises; update both connect() and sync_once() to ensure the BraintrustClient is closed (call client.close() in a finally block or use a context manager around BraintrustClient) and in sync_once() wrap self.storage.save_imported_scores(all_scores) in its own try/except so any StorageError (or generic Exception) is caught, still call _persist_sync_outcome(...) with a failure/result indicating persistence failed, and re-raise or log as appropriate; reference the client creation site via client_factory and the client type BraintrustClient plus _persist_sync_outcome and self.storage.save_imported_scores to locate the changes.
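
The client-lifecycle point can also be sketched with `contextlib.closing`, which only requires the object to expose `close()`. `FakeBraintrustClient` and this `connect` function are hypothetical stand-ins for illustration; `PermissionError` plays the role of `BraintrustAuthError`, and whether the real `BraintrustClient` supports `with` directly is not confirmed by the thread.

```python
from contextlib import closing


class FakeBraintrustClient:
    """Hypothetical stand-in exposing only what this sketch needs."""

    def __init__(self, key_is_valid: bool) -> None:
        self.key_is_valid = key_is_valid
        self.closed = False

    def validate_key(self) -> bool:
        if not self.key_is_valid:
            raise PermissionError("rejected")  # stands in for BraintrustAuthError
        return True

    def close(self) -> None:
        self.closed = True


def connect(client: FakeBraintrustClient) -> bool:
    # closing() calls client.close() on every exit path,
    # including early returns and handled exceptions.
    with closing(client):
        try:
            return client.validate_key()
        except PermissionError:
            return False
```

Compared with an explicit `finally: client.close()`, this keeps the cleanup next to the acquisition, at the cost of one extra indentation level.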

coderabbitai · 2026-05-23T22:53:32Z

+        self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
+        self._persist_sync_outcome(connection, error=None)
+        return SyncBraintrustResponse(


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle save_imported_scores failures and persist last_error.

If Line 167 raises, sync exits without updating connection error state, and callers get an unstructured failure path.

Suggested fix

-        self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
-        self._persist_sync_outcome(connection, error=None)
+        try:
+            self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
+        except Exception as e:  # noqa: BLE001
+            self._persist_sync_outcome(
+                connection, error=f"Failed to persist imported scores: {e}"
+            )
+            return SyncBraintrustResponse(
+                success=False, msg="Failed to persist imported scores."
+            )
+        self._persist_sync_outcome(connection, error=None)

📝 Committable suggestion

Suggested change

-        self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
-        self._persist_sync_outcome(connection, error=None)
-        return SyncBraintrustResponse(
+        try:
+            self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
+        except Exception as e:  # noqa: BLE001
+            self._persist_sync_outcome(
+                connection, error=f"Failed to persist imported scores: {e}"
+            )
+            return SyncBraintrustResponse(
+                success=False, msg="Failed to persist imported scores."
+            )
+        self._persist_sync_outcome(connection, error=None)
+        return SyncBraintrustResponse(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/braintrust/service.py` around lines 167 - 169, The call to self.storage.save_imported_scores(all_scores) can raise and currently prevents updating the connection error state; wrap that call in a try/except that catches exceptions, records the exception text into the connection last_error (use the same path _persist_sync_outcome(connection, error=...)) and ensure _persist_sync_outcome is invoked in both success and failure paths before returning or re-raising; update the SyncBraintrustResponse flow so when save_imported_scores fails you persist the error and return or raise consistently instead of letting the exception escape untracked.
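
A self-contained sketch of the behavior this comment asks for: the sync outcome is recorded on both the success and the failure path. `FlakyStorage`, `persist_scores`, and `record_outcome` are hypothetical names standing in for the storage backend, the tail of `sync_once`, and `_persist_sync_outcome`.

```python
from typing import Callable, Optional


class FlakyStorage:
    """Hypothetical storage whose write always fails."""

    def save_imported_scores(self, scores: list) -> None:
        raise RuntimeError("disk full")


class OkStorage:
    """Hypothetical storage whose write always succeeds."""

    def save_imported_scores(self, scores: list) -> None:
        return None


def persist_scores(
    storage, scores: list, record_outcome: Callable[[Optional[str]], None]
) -> bool:
    try:
        storage.save_imported_scores(scores)
    except Exception as e:  # broad on purpose: any failure must be recorded
        record_outcome(f"Failed to persist imported scores: {e}")
        return False
    record_outcome(None)  # success clears the stored error
    return True
```

Catching `Exception` here is a deliberate trade-off: it guarantees the error state is written, but it can also mask programming errors, so logging the original exception alongside is probably worthwhile.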

coderabbitai · 2026-05-23T22:53:33Z

+    for c in corrections:
+        idx = min(_BUCKET_COUNT - 1, c)
+        bins[idx] += 1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative correction counts before bin indexing.

Line 29 allows negative c, which uses Python negative indexing and silently mis-buckets invalid data (e.g., -1 maps to the last bin).

Proposed fix

 def bucket_corrections(corrections: Iterable[int]) -> tuple[int, int, int, int, int, int]:
@@
     bins = [0] * _BUCKET_COUNT
     for c in corrections:
+        if c < 0:
+            raise ValueError("corrections must be non-negative")
         idx = min(_BUCKET_COUNT - 1, c)
         bins[idx] += 1
     return tuple(bins)  # type: ignore[return-value]

📝 Committable suggestion

Suggested change

-    for c in corrections:
-        idx = min(_BUCKET_COUNT - 1, c)
-        bins[idx] += 1
+    for c in corrections:
+        if c < 0:
+            raise ValueError("corrections must be non-negative")
+        idx = min(_BUCKET_COUNT - 1, c)
+        bins[idx] += 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/evaluation_overview/distribution.py` around lines 28 - 30, The loop over corrections allows negative c values which trigger Python negative indexing; update the loop in distribution.py (the for c in corrections: block using _BUCKET_COUNT, idx, and bins) to guard negatives before computing idx—e.g., if c < 0: continue (or clamp c = 0 if you prefer mapping negatives to the first bucket) so that idx = min(_BUCKET_COUNT - 1, c) never receives a negative value and invalid corrections are not mis-bucketed.
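
To make the mis-bucketing concrete, here is a small sketch comparing the unguarded loop with the guarded one. `_BUCKET_COUNT = 6` is an assumption inferred from the six-element return type in the proposed fix, and the two function names are hypothetical.

```python
from collections.abc import Iterable

_BUCKET_COUNT = 6  # assumed from the six-int tuple in the suggested signature


def bucket_unguarded(corrections: Iterable[int]) -> tuple[int, ...]:
    bins = [0] * _BUCKET_COUNT
    for c in corrections:
        idx = min(_BUCKET_COUNT - 1, c)  # c = -1 gives idx = -1, the last bin
        bins[idx] += 1
    return tuple(bins)


def bucket_guarded(corrections: Iterable[int]) -> tuple[int, ...]:
    bins = [0] * _BUCKET_COUNT
    for c in corrections:
        if c < 0:
            raise ValueError("corrections must be non-negative")
        bins[min(_BUCKET_COUNT - 1, c)] += 1
    return tuple(bins)


# -1 is silently counted in the overflow bin alongside 7.
print(bucket_unguarded([0, 1, 7, -1]))  # (1, 1, 0, 0, 0, 2)
```

Note that in the unguarded version only values from -1 down to -6 are silently mis-bucketed; anything below `-_BUCKET_COUNT` raises `IndexError`, so the failure mode is inconsistent as well as silent.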

coderabbitai · 2026-05-23T22:53:33Z

+        hero = self._build_hero(request, results)
+        tiles = self._build_tiles(results, results_prev)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute days_since_first_eval from all evaluations, not just the current window.

Line 101 currently derives days_since_first_eval from results (window-filtered), which can flip hero state incorrectly when older evaluations exist outside the requested range.

Suggested fix

-        hero = self._build_hero(request, results)
+        hero = self._build_hero(request, results, all_results)
@@
     def _build_hero(
         self,
         request: GetEvaluationOverviewRequest,
         results: list[AgentSuccessEvaluationResult],
+        all_results: list[AgentSuccessEvaluationResult],
     ) -> HeroBlock:
-        if not results:
+        if not all_results:
             days_since = None
         else:
-            earliest = min(r.created_at for r in results)
+            earliest = min(r.created_at for r in all_results)
             days_since = (int(datetime.now(UTC).timestamp()) - earliest) // 86_400

Also applies to: 92-102

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@reflexio/server/services/evaluation_overview/service.py` around lines 74 - 75, The code computes days_since_first_eval from the window-filtered `results`, which causes hero state flips when older evaluations exist; update the logic so `days_since_first_eval` is derived from the full/unfiltered evaluations collection (e.g., `all_results` or whatever variable holds all evaluations) rather than `results`, and pass that full collection into `_build_hero` (or compute it before calling `_build_hero`) — update both places mentioned (around the `days_since_first_eval` calculation and the `_build_hero` call in the same block and the 92-102 region) so the hero uses the earliest evaluation date across all evaluations, not just the current window.
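
A worked example of why the input set matters, assuming `created_at` is an epoch timestamp in seconds (as the `// 86_400` arithmetic in the suggestion implies). `days_since_first_eval` is a hypothetical helper, and the dates are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_since_first_eval(created_ats: list[int], now: int) -> Optional[int]:
    """Whole days between the earliest evaluation and `now`, or None if empty."""
    if not created_ats:
        return None
    return (now - min(created_ats)) // SECONDS_PER_DAY


now = int(datetime(2026, 5, 23, tzinfo=timezone.utc).timestamp())
# Two evaluations: one 40 days old, one 2 days old.
all_created = [now - 40 * SECONDS_PER_DAY, now - 2 * SECONDS_PER_DAY]
# A 7-day request window only sees the recent one.
window_start = now - 7 * SECONDS_PER_DAY
windowed = [t for t in all_created if t >= window_start]
```

Here the windowed input yields 2 days while the full history yields 40, which is the kind of gap that could flip a threshold-based hero state.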

coderabbitai · 2026-05-23T22:53:33Z

+
+    assert response.status_code == 200
+    body = response.json()
+    assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten hero-state assertion to match test intent.

Line 21 currently allows "full" and "early", so a broken empty-state path could still pass. Narrow this to the expected empty-state variants only.

Suggested patch

-    assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty")
+    assert body["hero"]["state"] in ("shadow_off", "empty")

📝 Committable suggestion

Suggested change

-    assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty")
+    assert body["hero"]["state"] in ("shadow_off", "empty")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/server/api_endpoints/test_evaluation_overview_api.py` at line 21, The assertion currently allows non-empty variants; tighten it to only accept the empty-state variants by changing the check on body["hero"]["state"] so it only permits the expected empty values (e.g., "shadow_off" and "empty") instead of ("full", "early", "shadow_off", "empty"); update the assertion that references body["hero"]["state"] in the test to reflect this narrower set.

coderabbitai · 2026-05-23T22:53:33Z

+def test_finalize_env_cleanup() -> None:
+    """Reset module state so other tests aren't affected."""
+    os.environ.pop("REFLEXIO_FERNET_KEYS", None)
+    _encryption._reset_for_test()


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
ls -la
fd "test_encryption.py" . -a > /tmp/matches.txt || true
cat /tmp/matches.txt
FILE="$(cat /tmp/matches.txt | head -n 1)"
echo "Using: $FILE"
nl -ba "$FILE" | sed -n '1,120p'

Repository: ReflexioAI/reflexio

Length of output: 2209

Avoid process-wide env mutation in test_finalize_env_cleanup.

Replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) and add a monkeypatch fixture parameter to ensure the prior environment state is restored by pytest.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/server/services/braintrust/test_encryption.py` around lines 43 - 46, In test_finalize_env_cleanup, avoid mutating process-wide env directly: add the pytest monkeypatch fixture parameter to the test signature and replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) so pytest will restore the original environment automatically; keep the call to _encryption._reset_for_test() as-is.

yilu331 · 2026-05-26T01:31:11Z

Superseded by #91 — consolidated evaluations + Braintrust PR. Branch remains on remote for history.

yilu331 added 10 commits May 23, 2026 20:42

feat(config): add shadow_mode_enabled toggle to Config

3486bdf

feat(eval-overview): hero state machine (4 states from spec)

b15a4b5

feat(eval-overview): corrections-per-session histogram buckets

75016c0

feat(eval-overview): rule net-sessions attribution

cfbb7c8

feat(eval-overview): request/response Pydantic models

70bbf9d

feat(eval-overview): aggregator service + storage stubs

f861ce5

feat(eval-overview): POST /api/get_evaluation_overview route + lib wr…

2abf014

…apper

coderabbitai Bot reviewed May 23, 2026

yilu331 mentioned this pull request May 26, 2026

feat(evaluations): consolidated — Plan A pipeline fix + Plan B/C evaluation overview + Braintrust #91

Open

5 tasks

yilu331 closed this May 26, 2026

		delta (float): Mean current − mean prior-window. Equals `current`
		when no baseline (the frontend renders "no baseline").

