feat(storage): SQLite implementations for Plan B + Plan C storage methods#89
feat(storage): SQLite implementations for Plan B + Plan C storage methods#89yilu331 wants to merge 10 commits into
Conversation
Adds the minimum foundation for the Braintrust connector. A customer can
paste an API key, see their workspaces, pick projects, and trigger a
one-shot import of scorer outputs. Scheduled (cron) sync, score
aggregation into /api/get_evaluation_overview, and the frontend settings
UI are explicit follow-ups (Plan C-cron, Plan C-overview, Plan C-frontend).
- **Schemas** — BraintrustConnection, ImportedScore, plus request/response
models for the four endpoints.
- **BraintrustClient** — HTTP wrapper over the Braintrust REST API
(validate_key, list_organizations, list_projects, list_experiments,
list_spans). Pure httpx; raises BraintrustAuthError / BraintrustHTTPError.
- **Encryption helper** — Fernet via REFLEXIO_FERNET_KEYS env var. Passes
through unchanged when no key is configured (matches existing
EncryptManager behavior, keeps OS dev mode workable).
- **BraintrustConnectorService** — orchestrates connect / select_projects /
status / disconnect / sync_once. Extracts session_id from
span.metadata['reflexio_session_id'] when present (progressive matching).
- **Storage default no-op stubs** on BaseStorage for save/get/delete
Braintrust connection + save imported scores. Per-backend SQLite/Supabase
implementations are a follow-up.
- **5 routes** — /api/braintrust/{connect,select_projects,status,sync},
DELETE /api/braintrust/connection. API keys never echoed in responses.
34 tests pass:
- 8 schema round-trip tests
- 4 encryption tests (passthrough + Fernet roundtrip)
- 8 HTTP client tests (mocked transport)
- 10 service tests (in-memory storage + stub client, including span-metadata
session matching and last_sync_ts persistence)
- 4 endpoint smoke tests (disconnected state, sync without connection, etc)
- Plan C-cron: scheduled 15-min sync via a new RecurringJobsScheduler
- Plan C-overview: surface Braintrust tiles in /api/get_evaluation_overview
- Plan C-frontend: settings UI + context-band tiles
- Plan C-backends: real SQLite + Supabase + Postgres implementations of the
four new storage methods
Spec: docs/superpowers/specs/2026-05-21-evaluations-redesign-and-braintrust-integration-design.md §5, §6, §7
Plan: docs/superpowers/plans/2026-05-23-evaluations-plan-c-backend.md (in the enterprise repo)
…valuation_overview Surfaces imported_score data alongside the existing tiles in the overview response. - Adds `BraintrustTileRow` (scorer_name, current, n, delta) to the response model, exposed as a new `braintrust_tiles` list (defaults to []). - Adds a default no-op `get_imported_scores(org_id, from_ts, to_ts)` on BaseStorage. Real per-backend implementations are a follow-up. - Adds `EvaluationOverviewService._build_braintrust_tiles` that aggregates scores by scorer_name (mean + count) and computes a vs-prior-window delta. When no prior data exists, delta == current — matches the existing tile "no baseline" convention so the frontend can detect it. - Updates the storage-unconfigured default response to include `braintrust_tiles: []`. 3 new tests (aggregator, empty case, deltas with + without baseline). 53 tests pass; ruff + pyright clean on touched files. Depends on PR #86 (Plan B-backend — eval overview endpoint) AND PR #87 (Plan C-backend — Braintrust connector foundation).
…hods Combines items 7 and 8 — same file (sqlite_storage/_extras.py), same test fixture pattern, one PR. **Plan B-backend (count_sessions_with_shadow_content, get_interactions_by_session):** - Both methods JOIN interactions ↔ requests since session_id lives on the request. count_sessions_with_shadow_content filters by COALESCE of shadow_content != ''; get_interactions_by_session orders by i.created_at. **Plan C-backend + Plan C-overview (4 connection methods + get_imported_scores):** - New tables `braintrust_connection` (PK org_id) and `imported_score` (UNIQUE org_id+source+source_run_id+scorer_name for idempotent re-sync; indexes on session_id and ts). - Upsert via ON CONFLICT DO UPDATE on the existing PK / UNIQUE. - API keys stored encrypted (the service encrypts before save). **Tests (15 pass, all integration-marked, tempfile-isolated):** - Shadow-count: zero/distinct-sessions/window filter - get_interactions_by_session: correct session/empty/unknown - BraintrustConnection: roundtrip/upsert/get-unknown/delete-idempotent - ImportedScore: window/idempotent re-sync/org-filter/empty-noop Lint + pyright clean. Supabase + Postgres backends are explicit follow-ups. Depends on: PR #86 (Plan B-backend), PR #87 (Plan C-backend), PR #88 (Plan C-overview). All three must merge before this rebases cleanly.
📝 WalkthroughWalkthroughThis PR introduces two major features: a Braintrust connector for syncing evaluation scores with encrypted API key storage, and a comprehensive evaluation overview dashboard endpoint. Both features require new schemas, storage infrastructure, and API wiring alongside supporting configuration and test coverage. ChangesBraintrust Connector and Evaluation Overview Dashboard
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (2)
tests/server/services/evaluation_overview/test_rule_attribution.py (1)
76-76: ⚡ Quick winRemove the no-op assignment.
The assignment
_ = RuleAttributiondoesn't verify anything. If the intent is to ensure the import works, the import itself at line 4 already does that.♻️ Remove the no-op line
- _ = RuleAttribution # imported for export sanity; no-op citations_by_session = {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/server/services/evaluation_overview/test_rule_attribution.py` at line 76, Remove the redundant no-op assignment `_ = RuleAttribution`; the import of RuleAttribution already exercises the symbol, so delete the line containing `_ = RuleAttribution` in the test file (leave the original import intact and no other changes to function/class names such as RuleAttribution or the test structure).tests/server/services/evaluation_overview/test_service_integration.py (1)
51-57: ⚡ Quick winTighten this test to be deterministic and regression-sensitive.
Line 51 and Lines 54-57 currently allow broad pass conditions, so real behavior regressions can slip through.
Suggested tightening
- response = svc.run( - GetEvaluationOverviewRequest(from_ts=0, to_ts=int(time.time())) - ) + response = svc.run( + GetEvaluationOverviewRequest(from_ts=0, to_ts=1702592000) + ) - assert response.hero.state in ("full", "early", "shadow_off", "empty") - assert response.context_tiles.success.current >= 0.0 - assert len(response.score_distribution.current_bins) == 6 + assert response.hero.state == "full" + assert response.context_tiles.success.current > 0.0 + assert response.score_distribution.current_bins == [1, 1, 0, 1, 0, 0] assert response.score_distribution.labels == ["0", "1", "2", "3", "4", "5+"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/server/services/evaluation_overview/test_service_integration.py` around lines 51 - 57, The test currently allows broad, non-deterministic outcomes; make it deterministic by using a fixed to_ts in the GetEvaluationOverviewRequest (replace int(time.time()) with a stable timestamp or a test fixture) and tighten the assertions to exact expected values: assert response.hero.state equals the single expected state string, assert response.context_tiles.success.current equals the expected numeric value, and assert response.score_distribution.current_bins and response.score_distribution.labels equal the exact expected lists (not just length or a loose subset); update the assertions in this test (the GetEvaluationOverviewRequest call and checks of response.hero.state, response.context_tiles.success.current, response.score_distribution.current_bins, and response.score_distribution.labels) so the test fails on any regression rather than accepting broad ranges.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@reflexio/models/api_schema/eval_overview_schema.py`:
- Around line 104-105: Replace the Unicode minus character in the docstring for
the parameter named "delta" with an ASCII hyphen; locate the docstring in
reflexio/models/api_schema/eval_overview_schema.py where the parameter
description reads "delta (float): Mean current − mean prior-window..." and
change the Unicode minus "−" to the ASCII "-" so the line becomes "Mean current
- mean prior-window..." to eliminate the RUF002 ambiguity warning.
In `@reflexio/server/services/braintrust/_encryption.py`:
- Around line 52-53: The code currently returns None from _load() when no valid
Fernet keys are parsed, which lets encrypt() silently fall back to plaintext;
change _load() so that when the REFLEXIO_FERNET_KEYS env var is set but parsing
yields zero valid Fernet instances (i.e., local variable fernets is empty), it
raises an explicit exception (e.g., ValueError or RuntimeError) to fail closed;
alternatively, ensure encrypt()/decrypt() check for the presence of
REFLEXIO_FERNET_KEYS and raise if fernets is None/empty instead of defaulting to
plaintext. Reference _load(), encrypt(), decrypt(), and the REFLEXIO_FERNET_KEYS
parsing/fernets variable to locate the change.
In `@reflexio/server/services/braintrust/client.py`:
- Around line 141-152: The _get method in BraintrustClient should catch
httpx.RequestError raised by self._client.get and normalize transport/timeouts
into the connector's structured error; wrap the call to self._client.get in a
try/except catching httpx.RequestError as exc and re-raise a BraintrustHTTPError
that includes the transport error details (e.g., an explicit error message and
the original exception info) so transport-level failures no longer escape as raw
httpx exceptions; preserve the existing handling for HTTP 401/403 and non-2xx
responses in BraintrustClient._get.
In `@reflexio/server/services/braintrust/service.py`:
- Around line 167-169: The call to self.storage.save_imported_scores(all_scores)
can raise and currently prevents updating the connection error state; wrap that
call in a try/except that catches exceptions, records the exception text into
the connection last_error (use the same path _persist_sync_outcome(connection,
error=...)) and ensure _persist_sync_outcome is invoked in both success and
failure paths before returning or re-raising; update the SyncBraintrustResponse
flow so when save_imported_scores fails you persist the error and return or
raise consistently instead of letting the exception escape untracked.
- Around line 70-88: connect() and sync_once() create a BraintrustClient via
client_factory(request.api_key) but never close it and sync_once() may skip
_persist_sync_outcome if storage.save_imported_scores(all_scores) raises; update
both connect() and sync_once() to ensure the BraintrustClient is closed (call
client.close() in a finally block or use a context manager around
BraintrustClient) and in sync_once() wrap
self.storage.save_imported_scores(all_scores) in its own try/except so any
StorageError (or generic Exception) is caught, still call
_persist_sync_outcome(...) with a failure/result indicating persistence failed,
and re-raise or log as appropriate; reference the client creation site via
client_factory and the client type BraintrustClient plus _persist_sync_outcome
and self.storage.save_imported_scores to locate the changes.
In `@reflexio/server/services/evaluation_overview/distribution.py`:
- Around line 28-30: The loop over corrections allows negative c values which
trigger Python negative indexing; update the loop in distribution.py (the for c
in corrections: block using _BUCKET_COUNT, idx, and bins) to guard negatives
before computing idx—e.g., if c < 0: continue (or clamp c = 0 if you prefer
mapping negatives to the first bucket) so that idx = min(_BUCKET_COUNT - 1, c)
never receives a negative value and invalid corrections are not mis-bucketed.
In `@reflexio/server/services/evaluation_overview/service.py`:
- Around line 74-75: The code computes days_since_first_eval from the
window-filtered `results`, which causes hero state flips when older evaluations
exist; update the logic so `days_since_first_eval` is derived from the
full/unfiltered evaluations collection (e.g., `all_results` or whatever variable
holds all evaluations) rather than `results`, and pass that full collection into
`_build_hero` (or compute it before calling `_build_hero`) — update both places
mentioned (around the `days_since_first_eval` calculation and the `_build_hero`
call in the same block and the 92-102 region) so the hero uses the earliest
evaluation date across all evaluations, not just the current window.
In `@tests/server/api_endpoints/test_evaluation_overview_api.py`:
- Line 21: The assertion currently allows non-empty variants; tighten it to only
accept the empty-state variants by changing the check on body["hero"]["state"]
so it only permits the expected empty values (e.g., "shadow_off" and "empty")
instead of ("full", "early", "shadow_off", "empty"); update the assertion that
references body["hero"]["state"] in the test to reflect this narrower set.
In `@tests/server/services/braintrust/test_encryption.py`:
- Around line 43-46: In test_finalize_env_cleanup, avoid mutating process-wide
env directly: add the pytest monkeypatch fixture parameter to the test signature
and replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with
monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) so pytest will restore
the original environment automatically; keep the call to
_encryption._reset_for_test() as-is.
---
Nitpick comments:
In `@tests/server/services/evaluation_overview/test_rule_attribution.py`:
- Line 76: Remove the redundant no-op assignment `_ = RuleAttribution`; the
import of RuleAttribution already exercises the symbol, so delete the line
containing `_ = RuleAttribution` in the test file (leave the original import
intact and no other changes to function/class names such as RuleAttribution or
the test structure).
In `@tests/server/services/evaluation_overview/test_service_integration.py`:
- Around line 51-57: The test currently allows broad, non-deterministic
outcomes; make it deterministic by using a fixed to_ts in the
GetEvaluationOverviewRequest (replace int(time.time()) with a stable timestamp
or a test fixture) and tighten the assertions to exact expected values: assert
response.hero.state equals the single expected state string, assert
response.context_tiles.success.current equals the expected numeric value, and
assert response.score_distribution.current_bins and
response.score_distribution.labels equal the exact expected lists (not just
length or a loose subset); update the assertions in this test (the
GetEvaluationOverviewRequest call and checks of response.hero.state,
response.context_tiles.success.current,
response.score_distribution.current_bins, and
response.score_distribution.labels) so the test fails on any regression rather
than accepting broad ranges.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ef3cd802-8f65-4ff5-8158-9aa1b945c20a
📒 Files selected for processing (39)
reflexio/lib/_dashboard.pyreflexio/models/api_schema/braintrust_schema.pyreflexio/models/api_schema/eval_overview_schema.pyreflexio/models/config_schema.pyreflexio/server/api.pyreflexio/server/api_endpoints/stall_state_api.pyreflexio/server/llm/providers/claude_code_stream_parser.pyreflexio/server/services/braintrust/__init__.pyreflexio/server/services/braintrust/_encryption.pyreflexio/server/services/braintrust/client.pyreflexio/server/services/braintrust/service.pyreflexio/server/services/evaluation_overview/__init__.pyreflexio/server/services/evaluation_overview/distribution.pyreflexio/server/services/evaluation_overview/hero_state.pyreflexio/server/services/evaluation_overview/rule_attribution.pyreflexio/server/services/evaluation_overview/service.pyreflexio/server/services/storage/sqlite_storage/__init__.pyreflexio/server/services/storage/sqlite_storage/_base.pyreflexio/server/services/storage/sqlite_storage/_extras.pyreflexio/server/services/storage/storage_base/_extras.pytests/client/test_stall_state_client.pytests/models/test_braintrust_schema.pytests/models/test_config_shadow_mode.pytests/models/test_eval_overview_schema.pytests/server/api_endpoints/test_braintrust_api.pytests/server/api_endpoints/test_evaluation_overview_api.pytests/server/api_endpoints/test_stall_state_api.pytests/server/llm/test_claude_code_stream_parser.pytests/server/services/braintrust/__init__.pytests/server/services/braintrust/test_client.pytests/server/services/braintrust/test_encryption.pytests/server/services/braintrust/test_service.pytests/server/services/evaluation_overview/test_braintrust_tiles.pytests/server/services/evaluation_overview/test_distribution.pytests/server/services/evaluation_overview/test_hero_state.pytests/server/services/evaluation_overview/test_rule_attribution.pytests/server/services/evaluation_overview/test_service_integration.pytests/server/services/storage/test_sqlite_storage_bc_extras.pytests/server/services/storage/test_stall_state.py
| delta (float): Mean current − mean prior-window. Equals `current` | ||
| when no baseline (the frontend renders "no baseline"). |
There was a problem hiding this comment.
Replace the Unicode minus with ASCII - in this docstring.
The current character triggers Ruff RUF002 ambiguity warnings; swapping to - avoids lint/tooling friction.
🧰 Tools
🪛 Ruff (0.15.13)
[warning] 104-104: Docstring contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 104 - 105,
Replace the Unicode minus character in the docstring for the parameter named
"delta" with an ASCII hyphen; locate the docstring in
reflexio/models/api_schema/eval_overview_schema.py where the parameter
description reads "delta (float): Mean current − mean prior-window..." and
change the Unicode minus "−" to the ASCII "-" so the line becomes "Mean current
- mean prior-window..." to eliminate the RUF002 ambiguity warning.
| if not fernets: | ||
| return None |
There was a problem hiding this comment.
Fail closed when REFLEXIO_FERNET_KEYS is set but unusable.
If the env var is present and all keys are invalid, _load() returns None, and encrypt() falls back to plaintext storage. That silently disables encryption under production misconfiguration.
Suggested fix
if not fernets:
- return None
+ raise RuntimeError(
+ f"{_ENV_KEY} is set but contains no valid Fernet keys"
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not fernets: | |
| return None | |
| if not fernets: | |
| raise RuntimeError( | |
| f"{_ENV_KEY} is set but contains no valid Fernet keys" | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/braintrust/_encryption.py` around lines 52 - 53, The
code currently returns None from _load() when no valid Fernet keys are parsed,
which lets encrypt() silently fall back to plaintext; change _load() so that
when the REFLEXIO_FERNET_KEYS env var is set but parsing yields zero valid
Fernet instances (i.e., local variable fernets is empty), it raises an explicit
exception (e.g., ValueError or RuntimeError) to fail closed; alternatively,
ensure encrypt()/decrypt() check for the presence of REFLEXIO_FERNET_KEYS and
raise if fernets is None/empty instead of defaulting to plaintext. Reference
_load(), encrypt(), decrypt(), and the REFLEXIO_FERNET_KEYS parsing/fernets
variable to locate the change.
| def _get( | ||
| self, path: str, *, params: dict[str, Any] | None = None | ||
| ) -> httpx.Response: | ||
| url = f"{self.base_url}{path}" | ||
| response = self._client.get(url, params=params) | ||
| if response.status_code in (401, 403): | ||
| raise BraintrustAuthError( | ||
| f"Braintrust rejected the API key (HTTP {response.status_code})" | ||
| ) | ||
| if not (200 <= response.status_code < 300): | ||
| raise BraintrustHTTPError(response.status_code, response.text) | ||
| return response |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In httpx 0.28.1, does httpx.Client.get(...)raisehttpx.RequestError (including timeout/transport failures) instead of always returning a Response?
💡 Result:
In httpx 0.28.1, httpx.Client.get(...) does not always return a Response object [1][2]. When a request fails due to transport-level issues—such as connection timeouts, read/write timeouts, or network errors—the method raises an exception that inherits from httpx.RequestError [3][4]. Specifically, httpx.RequestError is the base class for all exceptions that occur when issuing a request [3][4]. If a transport failure occurs (e.g., a timeout), httpx raises a specific subclass of RequestError, such as httpx.TimeoutException (which includes ConnectTimeout, ReadTimeout, etc.) or httpx.NetworkError [3][4]. Because these exceptions are raised during the execution of the request, the method call is interrupted and does not return a Response [3][5]. It is important to distinguish this from HTTP status errors (like 4xx or 5xx responses). If the server successfully returns an HTTP error response, httpx.Client.get(...) will still return a Response object [6][2]. To raise an exception for those status codes, you must explicitly call response.raise_for_status [3][4].
Citations:
- 1: https://pypi.org/project/httpx/0.28.1/
- 2: https://www.python-httpx.org/api/
- 3: https://github.com/encode/httpx/blob/master/httpx/_exceptions.py
- 4: https://www.python-httpx.org/exceptions/
- 5: https://stackoverflow.com/questions/73676661/how-to-trap-this-nested-httpx-exception
- 6: https://github.com/encode/httpx/blob/master/docs/api.md
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the client file around the _get implementation and nearby error definitions/uses
sed -n '1,220p' reflexio/server/services/braintrust/client.py
# Find where these errors are caught/handled within the repo
rg -n "BraintrustAuthError|BraintrustHTTPError" -S .
# Also inspect the service layer entry points that likely call _get
rg -n "services/braintrust|braintrust" reflexio/server -SRepository: ReflexioAI/reflexio
Length of output: 23748
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show Braintrust service implementation around exception handling
sed -n '1,260p' reflexio/server/services/braintrust/service.py
# Show the client tests that assert expected exception types
sed -n '1,220p' tests/server/services/braintrust/test_client.pyRepository: ReflexioAI/reflexio
Length of output: 13485
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "status_code\s*==\s*0|BraintrustHTTPError\(\s*0\b" -S .
rg -n "last_error|Braintrust HTTP error" reflexio/server/services/braintrust -SRepository: ReflexioAI/reflexio
Length of output: 45
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Where is BraintrustHTTPError.status_code referenced?
rg -n "BraintrustHTTPError.*status_code|\.status_code\b" reflexio tests -S
# Inspect braintrust service tests that may assert exact error strings/status codes
sed -n '1,240p' tests/server/services/braintrust/test_service.pyRepository: ReflexioAI/reflexio
Length of output: 13811
Normalize httpx transport failures into BraintrustHTTPError.
In reflexio/server/services/braintrust/client.py (BraintrustClient._get), httpx.Client.get() raises httpx.RequestError (including timeout/network transport failures). Those currently bypass the connector’s BraintrustAuthError/BraintrustHTTPError handling in BraintrustConnectorService, causing transport-level failures to escape as unhandled exceptions instead of structured sync/connect failures.
Suggested fix
def _get(
self, path: str, *, params: dict[str, Any] | None = None
) -> httpx.Response:
url = f"{self.base_url}{path}"
- response = self._client.get(url, params=params)
+ try:
+ response = self._client.get(url, params=params)
+ except httpx.RequestError as e:
+ raise BraintrustHTTPError(0, f"Request failed: {e}") from e
if response.status_code in (401, 403):
raise BraintrustAuthError(
f"Braintrust rejected the API key (HTTP {response.status_code})"
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _get( | |
| self, path: str, *, params: dict[str, Any] | None = None | |
| ) -> httpx.Response: | |
| url = f"{self.base_url}{path}" | |
| response = self._client.get(url, params=params) | |
| if response.status_code in (401, 403): | |
| raise BraintrustAuthError( | |
| f"Braintrust rejected the API key (HTTP {response.status_code})" | |
| ) | |
| if not (200 <= response.status_code < 300): | |
| raise BraintrustHTTPError(response.status_code, response.text) | |
| return response | |
| def _get( | |
| self, path: str, *, params: dict[str, Any] | None = None | |
| ) -> httpx.Response: | |
| url = f"{self.base_url}{path}" | |
| try: | |
| response = self._client.get(url, params=params) | |
| except httpx.RequestError as e: | |
| raise BraintrustHTTPError(0, f"Request failed: {e}") from e | |
| if response.status_code in (401, 403): | |
| raise BraintrustAuthError( | |
| f"Braintrust rejected the API key (HTTP {response.status_code})" | |
| ) | |
| if not (200 <= response.status_code < 300): | |
| raise BraintrustHTTPError(response.status_code, response.text) | |
| return response |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/braintrust/client.py` around lines 141 - 152, The
_get method in BraintrustClient should catch httpx.RequestError raised by
self._client.get and normalize transport/timeouts into the connector's
structured error; wrap the call to self._client.get in a try/except catching
httpx.RequestError as exc and re-raise a BraintrustHTTPError that includes the
transport error details (e.g., an explicit error message and the original
exception info) so transport-level failures no longer escape as raw httpx
exceptions; preserve the existing handling for HTTP 401/403 and non-2xx
responses in BraintrustClient._get.
| client = self.client_factory(request.api_key) | ||
| try: | ||
| if not client.validate_key(): | ||
| return ConnectBraintrustResponse( | ||
| success=False, msg="Braintrust rejected the API key." | ||
| ) | ||
| workspaces = self._fetch_workspace_tree(client) | ||
| except BraintrustAuthError: | ||
| return ConnectBraintrustResponse( | ||
| success=False, msg="Braintrust rejected the API key." | ||
| ) | ||
| except BraintrustHTTPError as e: | ||
| return ConnectBraintrustResponse( | ||
| success=False, msg=f"Braintrust HTTP error: {e}" | ||
| ) | ||
|
|
||
| return ConnectBraintrustResponse( | ||
| success=True, workspaces=workspaces, msg="" | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n --type=py 'client_factory\(|\.close\(' \
reflexio/server/services/braintrust/service.py \
reflexio/server/services/braintrust/client.pyRepository: ReflexioAI/reflexio
Length of output: 342
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,120p' reflexio/server/services/braintrust/service.py | nl -ba | sed -n '55,105p'
echo "----"
sed -n '120,240p' reflexio/server/services/braintrust/service.py | nl -ba | sed -n '120,220p'
echo "----"
sed -n '1,140p' reflexio/server/services/braintrust/client.py | nl -ba | sed -n '1,140p'Repository: ReflexioAI/reflexio
Length of output: 106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- service.py (approx 55-105) ----"
sed -n '55,105p' reflexio/server/services/braintrust/service.py | cat -n
echo "---- service.py (approx 120-220) ----"
sed -n '120,220p' reflexio/server/services/braintrust/service.py | cat -n
echo "---- client.py (1-140) ----"
sed -n '1,140p' reflexio/server/services/braintrust/client.py | cat -nRepository: ReflexioAI/reflexio
Length of output: 12631
🏁 Script executed:
#!/bin/bash
rg -n "client\.close|BraintrustClient|\.close\(" reflexio/server/services/braintrust/service.pyRepository: ReflexioAI/reflexio
Length of output: 361
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --hidden --type=py "def save_imported_scores|save_imported_scores\(" reflexio/server | head -n 50Repository: ReflexioAI/reflexio
Length of output: 468
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- storage_base/_extras.py: ~200-280 ----"
sed -n '190,270p' reflexio/server/services/storage/storage_base/_extras.py | cat -n
echo "---- sqlite_storage/_extras.py: ~510-620 ----"
sed -n '520,620p' reflexio/server/services/storage/sqlite_storage/_extras.py | cat -nRepository: ReflexioAI/reflexio
Length of output: 5526
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "def handle_exceptions|handle_exceptions" reflexio/server/services/storage/sqlite_storage -S
rg -n "def handle_exceptions|handle_exceptions" reflexio/server/services/storage/storage_base -SRepository: ReflexioAI/reflexio
Length of output: 12826
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '520,740p' reflexio/server/services/storage/sqlite_storage/_base.py | cat -nRepository: ReflexioAI/reflexio
Length of output: 10152
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "class StorageError" reflexio/server/services/storage -SRepository: ReflexioAI/reflexio
Length of output: 138
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,120p' reflexio/server/services/storage/error.py | cat -nRepository: ReflexioAI/reflexio
Length of output: 407
Close the Braintrust client (and handle score persistence failures)
connect()andsync_once()create aBraintrustClientbut never callclient.close()(it wraps anhttpx.Client), so HTTP connections can leak.sync_once()doesn’t guardself.storage.save_imported_scores(all_scores); if it raises (e.g., SQLite wraps failures inStorageError),_persist_sync_outcome(...)is skipped and the exception can escape.
Suggested fix (client lifecycle)
client = self.client_factory(request.api_key)
try:
if not client.validate_key():
return ConnectBraintrustResponse(
success=False, msg="Braintrust rejected the API key."
)
workspaces = self._fetch_workspace_tree(client)
except BraintrustAuthError:
return ConnectBraintrustResponse(
success=False, msg="Braintrust rejected the API key."
)
except BraintrustHTTPError as e:
return ConnectBraintrustResponse(
success=False, msg=f"Braintrust HTTP error: {e}"
)
+ finally:
+ client.close()
@@
client = self.client_factory(api_key)
since_ts = max(0, int(time.time()) - backfill_days * 24 * 60 * 60)
@@
try:
for project_id in connection.project_ids:
experiments = client.list_experiments(project_id, since_ts=since_ts)
for exp in experiments:
spans = client.list_spans(exp["id"])
all_scores.extend(_scores_from_spans(spans, self.org_id))
except BraintrustAuthError:
self._persist_sync_outcome(connection, error="API key invalid.")
return SyncBraintrustResponse(
success=False, msg="API key invalid; halting sync."
)
except BraintrustHTTPError as e:
self._persist_sync_outcome(connection, error=str(e))
return SyncBraintrustResponse(success=False, msg=str(e))
+ finally:
+ client.close()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/braintrust/service.py` around lines 70 - 88,
connect() and sync_once() create a BraintrustClient via
client_factory(request.api_key) but never close it and sync_once() may skip
_persist_sync_outcome if storage.save_imported_scores(all_scores) raises; update
both connect() and sync_once() to ensure the BraintrustClient is closed (call
client.close() in a finally block or use a context manager around
BraintrustClient) and in sync_once() wrap
self.storage.save_imported_scores(all_scores) in its own try/except so any
StorageError (or generic Exception) is caught, still call
_persist_sync_outcome(...) with a failure/result indicating persistence failed,
and re-raise or log as appropriate; reference the client creation site via
client_factory and the client type BraintrustClient plus _persist_sync_outcome
and self.storage.save_imported_scores to locate the changes.
| self.storage.save_imported_scores(all_scores) # type: ignore[attr-defined] | ||
| self._persist_sync_outcome(connection, error=None) | ||
| return SyncBraintrustResponse( |
There was a problem hiding this comment.
Handle save_imported_scores failures and persist last_error.
If Line 167 raises, sync exits without updating connection error state, and callers get an unstructured failure path.
Suggested fix
- self.storage.save_imported_scores(all_scores) # type: ignore[attr-defined]
- self._persist_sync_outcome(connection, error=None)
+ try:
+ self.storage.save_imported_scores(all_scores) # type: ignore[attr-defined]
+ except Exception as e: # noqa: BLE001
+ self._persist_sync_outcome(
+ connection, error=f"Failed to persist imported scores: {e}"
+ )
+ return SyncBraintrustResponse(
+ success=False, msg="Failed to persist imported scores."
+ )
+ self._persist_sync_outcome(connection, error=None)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| self.storage.save_imported_scores(all_scores) # type: ignore[attr-defined] | |
| self._persist_sync_outcome(connection, error=None) | |
| return SyncBraintrustResponse( | |
| try: | |
| self.storage.save_imported_scores(all_scores) # type: ignore[attr-defined] | |
| except Exception as e: # noqa: BLE001 | |
| self._persist_sync_outcome( | |
| connection, error=f"Failed to persist imported scores: {e}" | |
| ) | |
| return SyncBraintrustResponse( | |
| success=False, msg="Failed to persist imported scores." | |
| ) | |
| self._persist_sync_outcome(connection, error=None) | |
| return SyncBraintrustResponse( |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/braintrust/service.py` around lines 167 - 169, The
call to self.storage.save_imported_scores(all_scores) can raise and currently
prevents updating the connection error state; wrap that call in a try/except
that catches exceptions, records the exception text into the connection
last_error (use the same path _persist_sync_outcome(connection, error=...)) and
ensure _persist_sync_outcome is invoked in both success and failure paths before
returning or re-raising; update the SyncBraintrustResponse flow so when
save_imported_scores fails you persist the error and return or raise
consistently instead of letting the exception escape untracked.
| for c in corrections: | ||
| idx = min(_BUCKET_COUNT - 1, c) | ||
| bins[idx] += 1 |
There was a problem hiding this comment.
Guard negative correction counts before bin indexing.
Line 29 allows negative c, which uses Python negative indexing and silently mis-buckets invalid data (e.g., -1 maps to the last bin).
Proposed fix
def bucket_corrections(corrections: Iterable[int]) -> tuple[int, int, int, int, int, int]:
@@
bins = [0] * _BUCKET_COUNT
for c in corrections:
+ if c < 0:
+ raise ValueError("corrections must be non-negative")
idx = min(_BUCKET_COUNT - 1, c)
bins[idx] += 1
return tuple(bins) # type: ignore[return-value]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for c in corrections: | |
| idx = min(_BUCKET_COUNT - 1, c) | |
| bins[idx] += 1 | |
| for c in corrections: | |
| if c < 0: | |
| raise ValueError("corrections must be non-negative") | |
| idx = min(_BUCKET_COUNT - 1, c) | |
| bins[idx] += 1 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/evaluation_overview/distribution.py` around lines 28
- 30, The loop over corrections allows negative c values which trigger Python
negative indexing; update the loop in distribution.py (the for c in corrections:
block using _BUCKET_COUNT, idx, and bins) to guard negatives before computing
idx—e.g., if c < 0: continue (or clamp c = 0 if you prefer mapping negatives to
the first bucket) so that idx = min(_BUCKET_COUNT - 1, c) never receives a
negative value and invalid corrections are not mis-bucketed.
| hero = self._build_hero(request, results) | ||
| tiles = self._build_tiles(results, results_prev) |
There was a problem hiding this comment.
Compute days_since_first_eval from all evaluations, not just the current window.
Line 101 currently derives days_since_first_eval from results (window-filtered), which can flip hero state incorrectly when older evaluations exist outside the requested range.
Suggested fix
- hero = self._build_hero(request, results)
+ hero = self._build_hero(request, results, all_results)
@@
def _build_hero(
self,
request: GetEvaluationOverviewRequest,
results: list[AgentSuccessEvaluationResult],
+ all_results: list[AgentSuccessEvaluationResult],
) -> HeroBlock:
- if not results:
+ if not all_results:
days_since = None
else:
- earliest = min(r.created_at for r in results)
+ earliest = min(r.created_at for r in all_results)
days_since = (int(datetime.now(UTC).timestamp()) - earliest) // 86_400Also applies to: 92-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@reflexio/server/services/evaluation_overview/service.py` around lines 74 -
75, The code computes days_since_first_eval from the window-filtered `results`,
which causes hero state flips when older evaluations exist; update the logic so
`days_since_first_eval` is derived from the full/unfiltered evaluations
collection (e.g., `all_results` or whatever variable holds all evaluations)
rather than `results`, and pass that full collection into `_build_hero` (or
compute it before calling `_build_hero`) — update both places mentioned (around
the `days_since_first_eval` calculation and the `_build_hero` call in the same
block and the 92-102 region) so the hero uses the earliest evaluation date
across all evaluations, not just the current window.
|
|
||
| assert response.status_code == 200 | ||
| body = response.json() | ||
| assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty") |
There was a problem hiding this comment.
Tighten hero-state assertion to match test intent.
Line 21 currently allows "full" and "early", so a broken empty-state path could still pass. Narrow this to the expected empty-state variants only.
Suggested patch
- assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty")
+ assert body["hero"]["state"] in ("shadow_off", "empty")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty") | |
| assert body["hero"]["state"] in ("shadow_off", "empty") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/server/api_endpoints/test_evaluation_overview_api.py` at line 21, The
assertion currently allows non-empty variants; tighten it to only accept the
empty-state variants by changing the check on body["hero"]["state"] so it only
permits the expected empty values (e.g., "shadow_off" and "empty") instead of
("full", "early", "shadow_off", "empty"); update the assertion that references
body["hero"]["state"] in the test to reflect this narrower set.
| def test_finalize_env_cleanup() -> None: | ||
| """Reset module state so other tests aren't affected.""" | ||
| os.environ.pop("REFLEXIO_FERNET_KEYS", None) | ||
| _encryption._reset_for_test() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ls -la
fd "test_encryption.py" . -a > /tmp/matches.txt || true
cat /tmp/matches.txt
FILE="$(cat /tmp/matches.txt | head -n 1)"
echo "Using: $FILE"
nl -ba "$FILE" | sed -n '1,120p'Repository: ReflexioAI/reflexio
Length of output: 2209
Avoid process-wide env mutation in test_finalize_env_cleanup.
Replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) and add a monkeypatch fixture parameter to ensure the prior environment state is restored by pytest.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/server/services/braintrust/test_encryption.py` around lines 43 - 46, In
test_finalize_env_cleanup, avoid mutating process-wide env directly: add the
pytest monkeypatch fixture parameter to the test signature and replace
os.environ.pop("REFLEXIO_FERNET_KEYS", None) with
monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) so pytest will restore
the original environment automatically; keep the call to
_encryption._reset_for_test() as-is.
|
Superseded by #91 — consolidated evaluations + Braintrust PR. Branch remains on remote for history. |
Combines items 7 + 8 from the remaining-work list — same file, same test fixture pattern, one PR.
Plan B-backend storage gaps:
count_sessions_with_shadow_content— JOINs interactions ↔ requests, COALESCE(shadow_content, '') != ''get_interactions_by_session— JOINs to filter by Request.session_idPlan C-backend + Plan C-overview storage gaps:
braintrust_connection(PK org_id) andimported_score(UNIQUE org_id+source+source_run_id+scorer_name; indexes on session_id and ts)save_imported_scoresupsert via ON CONFLICT DO UPDATE for idempotent re-syncTests
15 pass, integration-marked, tempfile-isolated. Covers shadow-count window/distinct/empty, session filter cases, connection roundtrip/upsert/delete-idempotent, imported scores idempotent re-sync + window + org-filter + empty-noop.
Depends on
reflexio#86+reflexio#87+reflexio#88must merge first.Out of scope
Summary by CodeRabbit
New Features