feat(storage): SQLite implementations for Plan B + Plan C storage methods #89

Closed
yilu331 wants to merge 10 commits into main from feat/evaluations-plan-bc-sqlite-storage

Conversation

@yilu331
Collaborator

@yilu331 yilu331 commented May 23, 2026

Combines items 7 + 8 from the remaining-work list — same file, same test fixture pattern, one PR.

Plan B-backend storage gaps:

  • count_sessions_with_shadow_content — JOINs interactions ↔ requests, COALESCE(shadow_content, '') != ''
  • get_interactions_by_session — JOINs to filter by Request.session_id
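
As a rough illustration of the two JOIN-based queries listed above, here is a minimal `sqlite3` sketch. Only the JOIN between interactions and requests, the `COALESCE(shadow_content, '') != ''` filter, and the ordering by `i.created_at` come from this PR; the column names (`request_id`, `interaction_id`, `created_at`), the inclusive time window, and the function signatures are assumptions for illustration and may not match `_extras.py`.

```python
import sqlite3

# Hypothetical minimal schema: only the JOIN shape and the COALESCE filter
# mirror the PR description; the column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE requests (request_id TEXT PRIMARY KEY, session_id TEXT);
    CREATE TABLE interactions (
        interaction_id INTEGER PRIMARY KEY,
        request_id TEXT REFERENCES requests(request_id),
        shadow_content TEXT,
        created_at INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("r1", "s1"), ("r2", "s1"), ("r3", "s2"), ("r4", "s3")],
)
conn.executemany(
    "INSERT INTO interactions (request_id, shadow_content, created_at) "
    "VALUES (?, ?, ?)",
    [("r1", "shadow A", 10), ("r2", "shadow B", 20), ("r3", None, 30), ("r4", "", 40)],
)


def count_sessions_with_shadow_content(conn, from_ts, to_ts):
    """Count distinct sessions with at least one non-empty shadow_content."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT r.session_id)
        FROM interactions i
        JOIN requests r ON r.request_id = i.request_id
        WHERE COALESCE(i.shadow_content, '') != ''
          AND i.created_at BETWEEN ? AND ?
        """,
        (from_ts, to_ts),
    ).fetchone()
    return row[0]


def get_interactions_by_session(conn, session_id):
    """Return interaction rows for one session, oldest first."""
    return conn.execute(
        """
        SELECT i.interaction_id, i.request_id, i.shadow_content, i.created_at
        FROM interactions i
        JOIN requests r ON r.request_id = i.request_id
        WHERE r.session_id = ?
        ORDER BY i.created_at
        """,
        (session_id,),
    ).fetchall()
```

`COALESCE` folds `NULL` and the empty string into the same "no shadow content" case, and `COUNT(DISTINCT ...)` keeps a session with several shadow interactions from being counted more than once.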

Plan C-backend + Plan C-overview storage gaps:

  • New tables braintrust_connection (PK org_id) and imported_score (UNIQUE org_id+source+source_run_id+scorer_name; indexes on session_id and ts)
  • 4 connection methods (save/get/delete + the new get for tiles aggregation)
  • save_imported_scores upsert via ON CONFLICT DO UPDATE for idempotent re-sync
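
The idempotent re-sync described above can be sketched with SQLite's `ON CONFLICT ... DO UPDATE` clause (available in SQLite 3.24 and later). The unique key and the two indexes follow the PR description; the `score` column, the index names, and the tuple-based row format are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE imported_score (
        org_id TEXT NOT NULL,
        source TEXT NOT NULL,
        source_run_id TEXT NOT NULL,
        scorer_name TEXT NOT NULL,
        session_id TEXT,
        score REAL,          -- assumed column name
        ts INTEGER,
        UNIQUE (org_id, source, source_run_id, scorer_name)
    );
    CREATE INDEX idx_imported_score_session ON imported_score (session_id);
    CREATE INDEX idx_imported_score_ts ON imported_score (ts);
    """
)


def save_imported_scores(conn, rows):
    """Insert rows, overwriting mutable columns when the unique key exists."""
    if not rows:
        return 0  # empty input is a no-op
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            """
            INSERT INTO imported_score
                (org_id, source, source_run_id, scorer_name, session_id, score, ts)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (org_id, source, source_run_id, scorer_name)
            DO UPDATE SET
                session_id = excluded.session_id,
                score = excluded.score,
                ts = excluded.ts
            """,
            rows,
        )
    return len(rows)
```

Running the same sync twice leaves one row per unique key, with the values from the most recent run.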

Tests

15 pass, integration-marked, tempfile-isolated. Covers shadow-count window/distinct/empty, session filter cases, connection roundtrip/upsert/delete-idempotent, imported scores idempotent re-sync + window + org-filter + empty-noop.

Depends on

reflexio#86 + reflexio#87 + reflexio#88 must merge first.

Out of scope

  • Supabase + Postgres backends — explicit follow-up ("Plan C-backends" / "Plan B-backends").

Summary by CodeRabbit

New Features

  • Braintrust Integration: Connect and sync Braintrust projects to import evaluation scores into the platform.
  • Evaluation Overview Dashboard: New endpoint providing comprehensive evaluation metrics including hero state, context tiles, rule attribution analysis, score distribution, and Braintrust score aggregates.
  • Shadow Mode Configuration: Added toggle to enable shadow-mode evaluations for side-by-side comparison scenarios.
  • Secure Credential Storage: Implemented encrypted storage for external connector credentials.

yilu331 added 10 commits May 23, 2026 20:42
Adds the minimum foundation for the Braintrust connector. A customer can
paste an API key, see their workspaces, pick projects, and trigger a
one-shot import of scorer outputs. Scheduled (cron) sync, score
aggregation into /api/get_evaluation_overview, and the frontend settings
UI are explicit follow-ups (Plan C-cron, Plan C-overview, Plan C-frontend).

- **Schemas** — BraintrustConnection, ImportedScore, plus request/response
  models for the four endpoints.
- **BraintrustClient** — HTTP wrapper over the Braintrust REST API
  (validate_key, list_organizations, list_projects, list_experiments,
  list_spans). Pure httpx; raises BraintrustAuthError / BraintrustHTTPError.
- **Encryption helper** — Fernet via REFLEXIO_FERNET_KEYS env var. Passes
  through unchanged when no key is configured (matches existing
  EncryptManager behavior, keeps OS dev mode workable).
- **BraintrustConnectorService** — orchestrates connect / select_projects /
  status / disconnect / sync_once. Extracts session_id from
  span.metadata['reflexio_session_id'] when present (progressive matching).
- **Storage default no-op stubs** on BaseStorage for save/get/delete
  Braintrust connection + save imported scores. Per-backend SQLite/Supabase
  implementations are a follow-up.
- **5 routes** — /api/braintrust/{connect,select_projects,status,sync},
  DELETE /api/braintrust/connection. API keys never echoed in responses.

34 tests pass:
- 8 schema round-trip tests
- 4 encryption tests (passthrough + Fernet roundtrip)
- 8 HTTP client tests (mocked transport)
- 10 service tests (in-memory storage + stub client, including span-metadata
  session matching and last_sync_ts persistence)
- 4 endpoint smoke tests (disconnected state, sync without connection, etc)

- Plan C-cron: scheduled 15-min sync via a new RecurringJobsScheduler
- Plan C-overview: surface Braintrust tiles in /api/get_evaluation_overview
- Plan C-frontend: settings UI + context-band tiles
- Plan C-backends: real SQLite + Supabase + Postgres implementations of the
  four new storage methods

Spec: docs/superpowers/specs/2026-05-21-evaluations-redesign-and-braintrust-integration-design.md §5, §6, §7
Plan: docs/superpowers/plans/2026-05-23-evaluations-plan-c-backend.md (in the enterprise repo)
…valuation_overview

Surfaces imported_score data alongside the existing tiles in the
overview response.

- Adds `BraintrustTileRow` (scorer_name, current, n, delta) to the
  response model, exposed as a new `braintrust_tiles` list (defaults to
  []).
- Adds a default no-op `get_imported_scores(org_id, from_ts, to_ts)`
  on BaseStorage. Real per-backend implementations are a follow-up.
- Adds `EvaluationOverviewService._build_braintrust_tiles` that aggregates
  scores by scorer_name (mean + count) and computes a vs-prior-window
  delta. When no prior data exists, delta == current — matches the
  existing tile "no baseline" convention so the frontend can detect it.
- Updates the storage-unconfigured default response to include
  `braintrust_tiles: []`.
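
A minimal sketch of the aggregation described in this commit message, assuming scores arrive as `(scorer_name, score)` pairs and tiles are plain dicts. The function name `build_braintrust_tiles`, the input shape, and the sorted output order are hypothetical; only the mean + count grouping and the "delta == current when no baseline" rule come from the text above.

```python
from collections import defaultdict
from statistics import fmean


def build_braintrust_tiles(current_scores, prior_scores):
    """Aggregate (scorer_name, score) pairs into per-scorer tile dicts."""

    def mean_by_scorer(pairs):
        grouped = defaultdict(list)
        for scorer_name, score in pairs:
            grouped[scorer_name].append(score)
        return {name: (fmean(vals), len(vals)) for name, vals in grouped.items()}

    current = mean_by_scorer(current_scores)
    prior = mean_by_scorer(prior_scores)

    tiles = []
    for name in sorted(current):
        mean_now, n = current[name]
        # No prior-window data: delta falls back to the current mean,
        # which is the "no baseline" convention described above.
        delta = mean_now - prior[name][0] if name in prior else mean_now
        tiles.append(
            {"scorer_name": name, "current": mean_now, "n": n, "delta": delta}
        )
    return tiles
```

One consequence of this convention is that a scorer with no baseline is indistinguishable from one whose prior mean was exactly zero, so the frontend check `delta == current` is a heuristic rather than a definitive flag.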

3 new tests (aggregator, empty case, deltas with + without baseline).
53 tests pass; ruff + pyright clean on touched files.

Depends on PR #86 (Plan B-backend — eval overview endpoint) AND
PR #87 (Plan C-backend — Braintrust connector foundation).
…hods

Combines items 7 and 8 — same file (sqlite_storage/_extras.py), same
test fixture pattern, one PR.

**Plan B-backend (count_sessions_with_shadow_content, get_interactions_by_session):**

- Both methods JOIN interactions ↔ requests since session_id lives on
  the request. count_sessions_with_shadow_content filters by COALESCE
  of shadow_content != ''; get_interactions_by_session orders by
  i.created_at.

**Plan C-backend + Plan C-overview (4 connection methods + get_imported_scores):**

- New tables `braintrust_connection` (PK org_id) and `imported_score`
  (UNIQUE org_id+source+source_run_id+scorer_name for idempotent
  re-sync; indexes on session_id and ts).
- Upsert via ON CONFLICT DO UPDATE on the existing PK / UNIQUE.
- API keys stored encrypted (the service encrypts before save).

**Tests (15 pass, all integration-marked, tempfile-isolated):**

- Shadow-count: zero/distinct-sessions/window filter
- get_interactions_by_session: correct session/empty/unknown
- BraintrustConnection: roundtrip/upsert/get-unknown/delete-idempotent
- ImportedScore: window/idempotent re-sync/org-filter/empty-noop

Lint + pyright clean.

Supabase + Postgres backends are explicit follow-ups.

Depends on: PR #86 (Plan B-backend), PR #87 (Plan C-backend), PR #88
(Plan C-overview). All three must merge before this rebases cleanly.
@coderabbitai

coderabbitai Bot commented May 23, 2026

📝 Walkthrough

This PR introduces two major features: a Braintrust connector for syncing evaluation scores with encrypted API key storage, and a comprehensive evaluation overview dashboard endpoint. Both features require new schemas, storage infrastructure, and API wiring alongside supporting configuration and test coverage.

Changes

Braintrust Connector and Evaluation Overview Dashboard

| Layer | File(s) | Summary |
|---|---|---|
| Data Schemas and Configuration | `reflexio/models/api_schema/braintrust_schema.py`, `reflexio/models/api_schema/eval_overview_schema.py`, `reflexio/models/config_schema.py`, `tests/models/test_braintrust_schema.py`, `tests/models/test_config_shadow_mode.py`, `tests/models/test_eval_overview_schema.py` | Braintrust schemas define persisted connections, imported scores, workspace/project summaries, and request/response models for connect/select/status/sync flows. Evaluation overview schemas define hero block, context tiles, rule attribution, score distribution, and response aggregation. Config adds shadow_mode_enabled flag with tests for defaults and serialization. |
| Braintrust HTTP Client and Encryption | `reflexio/server/services/braintrust/_encryption.py`, `reflexio/server/services/braintrust/client.py`, `tests/server/services/braintrust/test_encryption.py`, `tests/server/services/braintrust/test_client.py` | BraintrustClient provides typed methods to validate API keys, list organizations/projects/experiments/spans via httpx with bearer token auth and response unwrapping. Encryption module provides Fernet-based encrypt/decrypt with lazy loading from REFLEXIO_FERNET_KEYS environment variable and passthrough fallback when unconfigured. |
| Braintrust Service and Storage Persistence | `reflexio/server/services/braintrust/service.py`, `reflexio/server/services/storage/sqlite_storage/_base.py`, `reflexio/server/services/storage/sqlite_storage/_extras.py`, `reflexio/server/services/storage/storage_base/_extras.py`, `tests/server/services/braintrust/test_service.py` | BraintrustConnectorService orchestrates connect/select/status/disconnect/sync workflows, managing encrypted connection storage and score import. SQLite schema adds braintrust_connection and imported_score tables with indexes. Storage base and implementation provide persistence methods for connections and imported scores, with full integration test coverage for save/get/delete/upsert and window-based queries. |
| Braintrust API Endpoints and Dashboard Wrappers | `reflexio/lib/_dashboard.py`, `reflexio/server/api.py`, `tests/server/api_endpoints/test_braintrust_api.py` | API routes wire connect/select/status/disconnect/sync endpoints that resolve org_id and delegate to per-org reflexio instance. DashboardMixin adds braintrust_* wrapper methods normalizing dict inputs into request models and constructing service from storage and request context. |
| Evaluation Overview Helper Modules | `reflexio/server/services/evaluation_overview/distribution.py`, `reflexio/server/services/evaluation_overview/hero_state.py`, `reflexio/server/services/evaluation_overview/rule_attribution.py`, `tests/server/services/evaluation_overview/test_distribution.py`, `tests/server/services/evaluation_overview/test_hero_state.py`, `tests/server/services/evaluation_overview/test_rule_attribution.py` | Pure utility modules compute hero state (empty/shadow_off/early/full) with shadow mode awareness and threshold precedence, bucket correction counts into six bins (0–5+), and aggregate rule citations into net_sessions attribution rows with top_n ranking. |
| Evaluation Overview Service | `reflexio/server/services/evaluation_overview/service.py`, `tests/server/services/evaluation_overview/test_service_integration.py`, `tests/server/services/evaluation_overview/test_braintrust_tiles.py` | EvaluationOverviewService fetches current/prior evaluation results and orchestrates building hero block, context tiles (success %, corrections, turns, escalation), rule attribution, score distribution, and Braintrust tile aggregates. Includes window-aligned bucketing, per-scorer Braintrust score grouping with delta computation, and integration tests with mocked storage. |
| Evaluation Overview API and Dashboard Integration | `reflexio/lib/_dashboard.py`, `reflexio/server/api.py`, `tests/server/api_endpoints/test_evaluation_overview_api.py` | POST /api/get_evaluation_overview endpoint resolves org_id and delegates to service. DashboardMixin wrapper normalizes dict input, returns empty response when storage unconfigured, otherwise runs service with storage and config to produce full metric response. |
| Storage Infrastructure and Tests | `reflexio/server/services/storage/storage_base/_extras.py`, `reflexio/server/services/storage/sqlite_storage/__init__.py`, `reflexio/server/services/storage/sqlite_storage/_base.py`, `reflexio/server/services/storage/sqlite_storage/_extras.py`, `tests/server/services/storage/test_sqlite_storage_bc_extras.py` | Base storage mixin defines default no-op methods for shadow content counting, session interaction lookup, and Braintrust connection/score persistence. SQLite implementation provides actual SQL queries with locking and transactions. Integration tests cover roundtrip persistence, upsert overwriting, window filtering, idempotent operations, and empty-input edge cases. |
| Minor Cleanup and Datetime Updates | `reflexio/server/api_endpoints/stall_state_api.py`, `reflexio/server/llm/providers/claude_code_stream_parser.py`, `tests/client/test_stall_state_client.py`, `tests/server/api_endpoints/test_stall_state_api.py`, `tests/server/llm/test_claude_code_stream_parser.py`, `tests/server/services/storage/test_stall_state.py` | Update datetime imports across modules to use UTC directly from datetime instead of timezone.utc. Reformat imports for consistency (parenthesized multi-line blocks). |
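
To make the six-bin bucketing (0–5+) listed for `distribution.py` in the table above concrete, here is a small sketch. It is not the module's actual code: the function name `bucket_corrections` and the choice to skip negative counts are assumptions. The labels match those asserted in the integration test, and the negative-count guard reflects the inline review finding about Python's negative indexing.

```python
BUCKET_COUNT = 6
LABELS = ["0", "1", "2", "3", "4", "5+"]


def bucket_corrections(corrections):
    """Count per-session correction totals into six bins; the last is 5+."""
    bins = [0] * BUCKET_COUNT
    for c in corrections:
        if c < 0:
            # Skip invalid negative counts so they never reach Python's
            # negative indexing (bins[-1] would silently hit the "5+" bin).
            continue
        # Clamp anything at or above 5 into the final bin.
        bins[min(BUCKET_COUNT - 1, c)] += 1
    return bins
```

Clamping negatives to bucket 0 instead of skipping them is an equally plausible choice; which one is right depends on whether a negative count should be treated as bad data or as "no corrections".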

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Scores from Braintrust now encrypted and stored,
Hero states computed with shadow modes explored,
Rules that moved needles ranked by their might,
Dashboard tiles gleaming with metrics so bright,
Two features dancing—connected and tight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.39% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(storage): SQLite implementations for Plan B + Plan C storage methods' directly and specifically describes the main change: adding SQLite implementations for both Plan B and Plan C storage operations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (2)
tests/server/services/evaluation_overview/test_rule_attribution.py (1)

76-76: ⚡ Quick win

Remove the no-op assignment.

The assignment _ = RuleAttribution doesn't verify anything. If the intent is to ensure the import works, the import itself at line 4 already does that.

♻️ Remove the no-op line
-    _ = RuleAttribution  # imported for export sanity; no-op
     citations_by_session = {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/evaluation_overview/test_rule_attribution.py` at line
76, Remove the redundant no-op assignment `_ = RuleAttribution`; the import of
RuleAttribution already exercises the symbol, so delete the line containing `_ =
RuleAttribution` in the test file (leave the original import intact and no other
changes to function/class names such as RuleAttribution or the test structure).
tests/server/services/evaluation_overview/test_service_integration.py (1)

51-57: ⚡ Quick win

Tighten this test to be deterministic and regression-sensitive.

Line 51 and Lines 54-57 currently allow broad pass conditions, so real behavior regressions can slip through.

Suggested tightening
-    response = svc.run(
-        GetEvaluationOverviewRequest(from_ts=0, to_ts=int(time.time()))
-    )
+    response = svc.run(
+        GetEvaluationOverviewRequest(from_ts=0, to_ts=1702592000)
+    )
 
-    assert response.hero.state in ("full", "early", "shadow_off", "empty")
-    assert response.context_tiles.success.current >= 0.0
-    assert len(response.score_distribution.current_bins) == 6
+    assert response.hero.state == "full"
+    assert response.context_tiles.success.current > 0.0
+    assert response.score_distribution.current_bins == [1, 1, 0, 1, 0, 0]
     assert response.score_distribution.labels == ["0", "1", "2", "3", "4", "5+"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/evaluation_overview/test_service_integration.py` around
lines 51 - 57, The test currently allows broad, non-deterministic outcomes; make
it deterministic by using a fixed to_ts in the GetEvaluationOverviewRequest
(replace int(time.time()) with a stable timestamp or a test fixture) and tighten
the assertions to exact expected values: assert response.hero.state equals the
single expected state string, assert response.context_tiles.success.current
equals the expected numeric value, and assert
response.score_distribution.current_bins and response.score_distribution.labels
equal the exact expected lists (not just length or a loose subset); update the
assertions in this test (the GetEvaluationOverviewRequest call and checks of
response.hero.state, response.context_tiles.success.current,
response.score_distribution.current_bins, and
response.score_distribution.labels) so the test fails on any regression rather
than accepting broad ranges.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@reflexio/models/api_schema/eval_overview_schema.py`:
- Around line 104-105: Replace the Unicode minus character in the docstring for
the parameter named "delta" with an ASCII hyphen; locate the docstring in
reflexio/models/api_schema/eval_overview_schema.py where the parameter
description reads "delta (float): Mean current − mean prior-window..." and
change the Unicode minus "−" to the ASCII "-" so the line becomes "Mean current
- mean prior-window..." to eliminate the RUF002 ambiguity warning.

In `@reflexio/server/services/braintrust/_encryption.py`:
- Around line 52-53: The code currently returns None from _load() when no valid
Fernet keys are parsed, which lets encrypt() silently fall back to plaintext;
change _load() so that when the REFLEXIO_FERNET_KEYS env var is set but parsing
yields zero valid Fernet instances (i.e., local variable fernets is empty), it
raises an explicit exception (e.g., ValueError or RuntimeError) to fail closed;
alternatively, ensure encrypt()/decrypt() check for the presence of
REFLEXIO_FERNET_KEYS and raise if fernets is None/empty instead of defaulting to
plaintext. Reference _load(), encrypt(), decrypt(), and the REFLEXIO_FERNET_KEYS
parsing/fernets variable to locate the change.

In `@reflexio/server/services/braintrust/client.py`:
- Around line 141-152: The _get method in BraintrustClient should catch
httpx.RequestError raised by self._client.get and normalize transport/timeouts
into the connector's structured error; wrap the call to self._client.get in a
try/except catching httpx.RequestError as exc and re-raise a BraintrustHTTPError
that includes the transport error details (e.g., an explicit error message and
the original exception info) so transport-level failures no longer escape as raw
httpx exceptions; preserve the existing handling for HTTP 401/403 and non-2xx
responses in BraintrustClient._get.

In `@reflexio/server/services/braintrust/service.py`:
- Around line 167-169: The call to self.storage.save_imported_scores(all_scores)
can raise and currently prevents updating the connection error state; wrap that
call in a try/except that catches exceptions, records the exception text into
the connection last_error (use the same path _persist_sync_outcome(connection,
error=...)) and ensure _persist_sync_outcome is invoked in both success and
failure paths before returning or re-raising; update the SyncBraintrustResponse
flow so when save_imported_scores fails you persist the error and return or
raise consistently instead of letting the exception escape untracked.
- Around line 70-88: connect() and sync_once() create a BraintrustClient via
client_factory(request.api_key) but never close it and sync_once() may skip
_persist_sync_outcome if storage.save_imported_scores(all_scores) raises; update
both connect() and sync_once() to ensure the BraintrustClient is closed (call
client.close() in a finally block or use a context manager around
BraintrustClient) and in sync_once() wrap
self.storage.save_imported_scores(all_scores) in its own try/except so any
StorageError (or generic Exception) is caught, still call
_persist_sync_outcome(...) with a failure/result indicating persistence failed,
and re-raise or log as appropriate; reference the client creation site via
client_factory and the client type BraintrustClient plus _persist_sync_outcome
and self.storage.save_imported_scores to locate the changes.

In `@reflexio/server/services/evaluation_overview/distribution.py`:
- Around line 28-30: The loop over corrections allows negative c values which
trigger Python negative indexing; update the loop in distribution.py (the for c
in corrections: block using _BUCKET_COUNT, idx, and bins) to guard negatives
before computing idx—e.g., if c < 0: continue (or clamp c = 0 if you prefer
mapping negatives to the first bucket) so that idx = min(_BUCKET_COUNT - 1, c)
never receives a negative value and invalid corrections are not mis-bucketed.

In `@reflexio/server/services/evaluation_overview/service.py`:
- Around line 74-75: The code computes days_since_first_eval from the
window-filtered `results`, which causes hero state flips when older evaluations
exist; update the logic so `days_since_first_eval` is derived from the
full/unfiltered evaluations collection (e.g., `all_results` or whatever variable
holds all evaluations) rather than `results`, and pass that full collection into
`_build_hero` (or compute it before calling `_build_hero`) — update both places
mentioned (around the `days_since_first_eval` calculation and the `_build_hero`
call in the same block and the 92-102 region) so the hero uses the earliest
evaluation date across all evaluations, not just the current window.

In `@tests/server/api_endpoints/test_evaluation_overview_api.py`:
- Line 21: The assertion currently allows non-empty variants; tighten it to only
accept the empty-state variants by changing the check on body["hero"]["state"]
so it only permits the expected empty values (e.g., "shadow_off" and "empty")
instead of ("full", "early", "shadow_off", "empty"); update the assertion that
references body["hero"]["state"] in the test to reflect this narrower set.

In `@tests/server/services/braintrust/test_encryption.py`:
- Around line 43-46: In test_finalize_env_cleanup, avoid mutating process-wide
env directly: add the pytest monkeypatch fixture parameter to the test signature
and replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with
monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) so pytest will restore
the original environment automatically; keep the call to
_encryption._reset_for_test() as-is.

---

Nitpick comments:
In `@tests/server/services/evaluation_overview/test_rule_attribution.py`:
- Line 76: Remove the redundant no-op assignment `_ = RuleAttribution`; the
import of RuleAttribution already exercises the symbol, so delete the line
containing `_ = RuleAttribution` in the test file (leave the original import
intact and no other changes to function/class names such as RuleAttribution or
the test structure).

In `@tests/server/services/evaluation_overview/test_service_integration.py`:
- Around line 51-57: The test currently allows broad, non-deterministic
outcomes; make it deterministic by using a fixed to_ts in the
GetEvaluationOverviewRequest (replace int(time.time()) with a stable timestamp
or a test fixture) and tighten the assertions to exact expected values: assert
response.hero.state equals the single expected state string, assert
response.context_tiles.success.current equals the expected numeric value, and
assert response.score_distribution.current_bins and
response.score_distribution.labels equal the exact expected lists (not just
length or a loose subset); update the assertions in this test (the
GetEvaluationOverviewRequest call and checks of response.hero.state,
response.context_tiles.success.current,
response.score_distribution.current_bins, and
response.score_distribution.labels) so the test fails on any regression rather
than accepting broad ranges.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ef3cd802-8f65-4ff5-8158-9aa1b945c20a

📥 Commits

Reviewing files that changed from the base of the PR and between 25f8034 and 240cfe6.

📒 Files selected for processing (39)
  • reflexio/lib/_dashboard.py
  • reflexio/models/api_schema/braintrust_schema.py
  • reflexio/models/api_schema/eval_overview_schema.py
  • reflexio/models/config_schema.py
  • reflexio/server/api.py
  • reflexio/server/api_endpoints/stall_state_api.py
  • reflexio/server/llm/providers/claude_code_stream_parser.py
  • reflexio/server/services/braintrust/__init__.py
  • reflexio/server/services/braintrust/_encryption.py
  • reflexio/server/services/braintrust/client.py
  • reflexio/server/services/braintrust/service.py
  • reflexio/server/services/evaluation_overview/__init__.py
  • reflexio/server/services/evaluation_overview/distribution.py
  • reflexio/server/services/evaluation_overview/hero_state.py
  • reflexio/server/services/evaluation_overview/rule_attribution.py
  • reflexio/server/services/evaluation_overview/service.py
  • reflexio/server/services/storage/sqlite_storage/__init__.py
  • reflexio/server/services/storage/sqlite_storage/_base.py
  • reflexio/server/services/storage/sqlite_storage/_extras.py
  • reflexio/server/services/storage/storage_base/_extras.py
  • tests/client/test_stall_state_client.py
  • tests/models/test_braintrust_schema.py
  • tests/models/test_config_shadow_mode.py
  • tests/models/test_eval_overview_schema.py
  • tests/server/api_endpoints/test_braintrust_api.py
  • tests/server/api_endpoints/test_evaluation_overview_api.py
  • tests/server/api_endpoints/test_stall_state_api.py
  • tests/server/llm/test_claude_code_stream_parser.py
  • tests/server/services/braintrust/__init__.py
  • tests/server/services/braintrust/test_client.py
  • tests/server/services/braintrust/test_encryption.py
  • tests/server/services/braintrust/test_service.py
  • tests/server/services/evaluation_overview/test_braintrust_tiles.py
  • tests/server/services/evaluation_overview/test_distribution.py
  • tests/server/services/evaluation_overview/test_hero_state.py
  • tests/server/services/evaluation_overview/test_rule_attribution.py
  • tests/server/services/evaluation_overview/test_service_integration.py
  • tests/server/services/storage/test_sqlite_storage_bc_extras.py
  • tests/server/services/storage/test_stall_state.py

Comment on lines +104 to +105
delta (float): Mean current − mean prior-window. Equals `current`
when no baseline (the frontend renders "no baseline").

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the Unicode minus with ASCII - in this docstring.

The current character triggers Ruff RUF002 ambiguity warnings; swapping to - avoids lint/tooling friction.

🧰 Tools
🪛 Ruff (0.15.13)

[warning] 104-104: Docstring contains ambiguous `−` (MINUS SIGN). Did you mean `-` (HYPHEN-MINUS)?

(RUF002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/models/api_schema/eval_overview_schema.py` around lines 104 - 105,
Replace the Unicode minus character in the docstring for the parameter named
"delta" with an ASCII hyphen; locate the docstring in
reflexio/models/api_schema/eval_overview_schema.py where the parameter
description reads "delta (float): Mean current − mean prior-window..." and
change the Unicode minus "−" to the ASCII "-" so the line becomes "Mean current
- mean prior-window..." to eliminate the RUF002 ambiguity warning.

Comment on lines +52 to +53
    if not fernets:
        return None

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when REFLEXIO_FERNET_KEYS is set but unusable.

If the env var is present and all keys are invalid, _load() returns None, and encrypt() falls back to plaintext storage. That silently disables encryption under production misconfiguration.

Suggested fix
     if not fernets:
-        return None
+        raise RuntimeError(
+            f"{_ENV_KEY} is set but contains no valid Fernet keys"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/braintrust/_encryption.py` around lines 52 - 53, The
code currently returns None from _load() when no valid Fernet keys are parsed,
which lets encrypt() silently fall back to plaintext; change _load() so that
when the REFLEXIO_FERNET_KEYS env var is set but parsing yields zero valid
Fernet instances (i.e., local variable fernets is empty), it raises an explicit
exception (e.g., ValueError or RuntimeError) to fail closed; alternatively,
ensure encrypt()/decrypt() check for the presence of REFLEXIO_FERNET_KEYS and
raise if fernets is None/empty instead of defaulting to plaintext. Reference
_load(), encrypt(), decrypt(), and the REFLEXIO_FERNET_KEYS parsing/fernets
variable to locate the change.
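
The fail-closed behavior this comment asks for can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `_encryption.py`: the comma-separated key format, the helper names `load_keys` and `_is_plausible_fernet_key`, and treating a whitespace-only value as unset are all hypothetical. The key check is a stand-in for constructing real `Fernet` instances; it relies only on the fact that a Fernet key is 32 bytes, URL-safe base64-encoded.

```python
import base64
import binascii
import os

ENV_KEY = "REFLEXIO_FERNET_KEYS"


def _is_plausible_fernet_key(key):
    """Stand-in check: a Fernet key is 32 bytes, URL-safe base64-encoded."""
    try:
        return len(base64.urlsafe_b64decode(key.encode("ascii"))) == 32
    except (binascii.Error, ValueError):
        return False


def load_keys(environ=os.environ):
    """Return valid keys, None when unset, and raise when set but unusable."""
    raw = environ.get(ENV_KEY, "").strip()
    if not raw:
        return None  # unconfigured: passthrough mode stays available
    # Assumption: multiple keys are comma-separated.
    keys = [k.strip() for k in raw.split(",") if k.strip()]
    valid = [k for k in keys if _is_plausible_fernet_key(k)]
    if not valid:
        # Fail closed: a configured but broken key list must not
        # silently degrade to plaintext storage.
        raise RuntimeError(f"{ENV_KEY} is set but contains no valid Fernet keys")
    return valid
```

The distinction that matters is between "not configured" (passthrough is intentional, as in dev mode) and "configured but invalid" (almost certainly an operator error that should stop startup or the first encrypt call).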

Comment on lines +141 to +152
def _get(
    self, path: str, *, params: dict[str, Any] | None = None
) -> httpx.Response:
    url = f"{self.base_url}{path}"
    response = self._client.get(url, params=params)
    if response.status_code in (401, 403):
        raise BraintrustAuthError(
            f"Braintrust rejected the API key (HTTP {response.status_code})"
        )
    if not (200 <= response.status_code < 300):
        raise BraintrustHTTPError(response.status_code, response.text)
    return response

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In httpx 0.28.1, does `httpx.Client.get(...)` raise `httpx.RequestError` (including timeout/transport failures) instead of always returning a Response?

💡 Result:

In httpx 0.28.1, httpx.Client.get(...) does not always return a Response object [1][2]. When a request fails due to transport-level issues (such as connection timeouts, read/write timeouts, or network errors), the method raises an exception that inherits from httpx.RequestError [3][4]. Specifically, httpx.RequestError is the base class for all exceptions that occur when issuing a request [3][4]. If a transport failure occurs (e.g., a timeout), httpx raises a specific subclass of RequestError, such as httpx.TimeoutException (which includes ConnectTimeout, ReadTimeout, etc.) or httpx.NetworkError [3][4]. Because these exceptions are raised during the execution of the request, the method call is interrupted and does not return a Response [3][5]. It is important to distinguish this from HTTP status errors (like 4xx or 5xx responses). If the server successfully returns an HTTP error response, httpx.Client.get(...) will still return a Response object [6][2]. To raise an exception for those status codes, you must explicitly call response.raise_for_status() [3][4].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the client file around the _get implementation and nearby error definitions/uses
sed -n '1,220p' reflexio/server/services/braintrust/client.py

# Find where these errors are caught/handled within the repo
rg -n "BraintrustAuthError|BraintrustHTTPError" -S .

# Also inspect the service layer entry points that likely call _get
rg -n "services/braintrust|braintrust" reflexio/server -S

Repository: ReflexioAI/reflexio

Length of output: 23748


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show Braintrust service implementation around exception handling
sed -n '1,260p' reflexio/server/services/braintrust/service.py

# Show the client tests that assert expected exception types
sed -n '1,220p' tests/server/services/braintrust/test_client.py

Repository: ReflexioAI/reflexio

Length of output: 13485


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "status_code\s*==\s*0|BraintrustHTTPError\(\s*0\b" -S .
rg -n "last_error|Braintrust HTTP error" reflexio/server/services/braintrust -S

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Where is BraintrustHTTPError.status_code referenced?
rg -n "BraintrustHTTPError.*status_code|\.status_code\b" reflexio tests -S

# Inspect braintrust service tests that may assert exact error strings/status codes
sed -n '1,240p' tests/server/services/braintrust/test_service.py

Repository: ReflexioAI/reflexio

Length of output: 13811


Normalize httpx transport failures into BraintrustHTTPError.

In reflexio/server/services/braintrust/client.py (BraintrustClient._get), httpx.Client.get() can raise httpx.RequestError on transport failures (timeouts, network errors). Those exceptions currently bypass the BraintrustAuthError/BraintrustHTTPError handling in BraintrustConnectorService, so transport-level failures escape as unhandled exceptions instead of being reported as structured sync/connect failures.

Suggested fix
     def _get(
         self, path: str, *, params: dict[str, Any] | None = None
     ) -> httpx.Response:
         url = f"{self.base_url}{path}"
-        response = self._client.get(url, params=params)
+        try:
+            response = self._client.get(url, params=params)
+        except httpx.RequestError as e:
+            raise BraintrustHTTPError(0, f"Request failed: {e}") from e
         if response.status_code in (401, 403):
             raise BraintrustAuthError(
                 f"Braintrust rejected the API key (HTTP {response.status_code})"
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/braintrust/client.py` around lines 141 - 152, The
_get method in BraintrustClient should catch httpx.RequestError raised by
self._client.get and normalize transport/timeouts into the connector's
structured error; wrap the call to self._client.get in a try/except catching
httpx.RequestError as exc and re-raise a BraintrustHTTPError that includes the
transport error details (e.g., an explicit error message and the original
exception info) so transport-level failures no longer escape as raw httpx
exceptions; preserve the existing handling for HTTP 401/403 and non-2xx
responses in BraintrustClient._get.
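
As a rough illustration of the normalization described above, here is a minimal, dependency-free sketch. `FakeTransportError` is a hypothetical stand-in for httpx.RequestError so the snippet runs without httpx installed, the `BraintrustHTTPError` constructor shape (status code plus text) is assumed from how the review's diff calls it, and `get_normalized`, `timed_out`, and `describe` are illustrative names, not part of the PR.

```python
from typing import Callable


class FakeTransportError(Exception):
    """Hypothetical stand-in for httpx.RequestError (keeps the sketch stdlib-only)."""


class BraintrustHTTPError(Exception):
    """Assumed shape: an integer status code plus a message body."""

    def __init__(self, status_code: int, body: str) -> None:
        super().__init__(f"HTTP {status_code}: {body}")
        self.status_code = status_code
        self.body = body


def get_normalized(send: Callable[[], int]) -> int:
    """Call `send`; turn a transport failure into BraintrustHTTPError(0, ...)."""
    try:
        status = send()
    except FakeTransportError as exc:
        # Status 0 marks "no HTTP response was received"; `from exc` keeps the cause.
        raise BraintrustHTTPError(0, f"Request failed: {exc}") from exc
    if not (200 <= status < 300):
        raise BraintrustHTTPError(status, "non-2xx response")
    return status


def timed_out() -> int:
    """Simulate a request that never produces a response."""
    raise FakeTransportError("read timed out")


def describe(send: Callable[[], int]) -> str:
    """Mimic a caller that only knows how to handle BraintrustHTTPError."""
    try:
        return f"ok {get_normalized(send)}"
    except BraintrustHTTPError as exc:
        return f"failed status={exc.status_code}"
```

With this shape, a caller that catches only BraintrustHTTPError sees timeouts and HTTP errors through the same path, and can tell them apart by checking for the sentinel status 0.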

Comment on lines +70 to +88
client = self.client_factory(request.api_key)
try:
    if not client.validate_key():
        return ConnectBraintrustResponse(
            success=False, msg="Braintrust rejected the API key."
        )
    workspaces = self._fetch_workspace_tree(client)
except BraintrustAuthError:
    return ConnectBraintrustResponse(
        success=False, msg="Braintrust rejected the API key."
    )
except BraintrustHTTPError as e:
    return ConnectBraintrustResponse(
        success=False, msg=f"Braintrust HTTP error: {e}"
    )

return ConnectBraintrustResponse(
    success=True, workspaces=workspaces, msg=""
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n --type=py 'client_factory\(|\.close\(' \
  reflexio/server/services/braintrust/service.py \
  reflexio/server/services/braintrust/client.py

Repository: ReflexioAI/reflexio

Length of output: 342


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' reflexio/server/services/braintrust/service.py | nl -ba | sed -n '55,105p'
echo "----"
sed -n '120,240p' reflexio/server/services/braintrust/service.py | nl -ba | sed -n '120,220p'
echo "----"
sed -n '1,140p' reflexio/server/services/braintrust/client.py | nl -ba | sed -n '1,140p'

Repository: ReflexioAI/reflexio

Length of output: 106


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- service.py (approx 55-105) ----"
sed -n '55,105p' reflexio/server/services/braintrust/service.py | cat -n

echo "---- service.py (approx 120-220) ----"
sed -n '120,220p' reflexio/server/services/braintrust/service.py | cat -n

echo "---- client.py (1-140) ----"
sed -n '1,140p' reflexio/server/services/braintrust/client.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 12631


🏁 Script executed:

#!/bin/bash
rg -n "client\.close|BraintrustClient|\.close\(" reflexio/server/services/braintrust/service.py

Repository: ReflexioAI/reflexio

Length of output: 361


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n --hidden --type=py "def save_imported_scores|save_imported_scores\(" reflexio/server | head -n 50

Repository: ReflexioAI/reflexio

Length of output: 468


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- storage_base/_extras.py: ~200-280 ----"
sed -n '190,270p' reflexio/server/services/storage/storage_base/_extras.py | cat -n

echo "---- sqlite_storage/_extras.py: ~510-620 ----"
sed -n '520,620p' reflexio/server/services/storage/sqlite_storage/_extras.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 5526


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "def handle_exceptions|handle_exceptions" reflexio/server/services/storage/sqlite_storage -S
rg -n "def handle_exceptions|handle_exceptions" reflexio/server/services/storage/storage_base -S

Repository: ReflexioAI/reflexio

Length of output: 12826


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '520,740p' reflexio/server/services/storage/sqlite_storage/_base.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 10152


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "class StorageError" reflexio/server/services/storage -S

Repository: ReflexioAI/reflexio

Length of output: 138


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,120p' reflexio/server/services/storage/error.py | cat -n

Repository: ReflexioAI/reflexio

Length of output: 407


Close the Braintrust client (and handle score persistence failures)

  • connect() and sync_once() create a BraintrustClient but never call client.close() (it wraps an httpx.Client), so HTTP connections can leak.
  • sync_once() doesn’t guard self.storage.save_imported_scores(all_scores); if it raises (e.g., SQLite wraps failures in StorageError), _persist_sync_outcome(...) is skipped and the exception can escape.
Suggested fix (client lifecycle)
         client = self.client_factory(request.api_key)
         try:
             if not client.validate_key():
                 return ConnectBraintrustResponse(
                     success=False, msg="Braintrust rejected the API key."
                 )
             workspaces = self._fetch_workspace_tree(client)
         except BraintrustAuthError:
             return ConnectBraintrustResponse(
                 success=False, msg="Braintrust rejected the API key."
             )
         except BraintrustHTTPError as e:
             return ConnectBraintrustResponse(
                 success=False, msg=f"Braintrust HTTP error: {e}"
             )
+        finally:
+            client.close()
@@
         client = self.client_factory(api_key)
         since_ts = max(0, int(time.time()) - backfill_days * 24 * 60 * 60)
@@
         try:
             for project_id in connection.project_ids:
                 experiments = client.list_experiments(project_id, since_ts=since_ts)
                 for exp in experiments:
                     spans = client.list_spans(exp["id"])
                     all_scores.extend(_scores_from_spans(spans, self.org_id))
         except BraintrustAuthError:
             self._persist_sync_outcome(connection, error="API key invalid.")
             return SyncBraintrustResponse(
                 success=False, msg="API key invalid; halting sync."
             )
         except BraintrustHTTPError as e:
             self._persist_sync_outcome(connection, error=str(e))
             return SyncBraintrustResponse(success=False, msg=str(e))
+        finally:
+            client.close()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/braintrust/service.py` around lines 70 - 88,
connect() and sync_once() create a BraintrustClient via
client_factory(request.api_key) but never close it and sync_once() may skip
_persist_sync_outcome if storage.save_imported_scores(all_scores) raises; update
both connect() and sync_once() to ensure the BraintrustClient is closed (call
client.close() in a finally block or use a context manager around
BraintrustClient) and in sync_once() wrap
self.storage.save_imported_scores(all_scores) in its own try/except so any
StorageError (or generic Exception) is caught, still call
_persist_sync_outcome(...) with a failure/result indicating persistence failed,
and re-raise or log as appropriate; reference the client creation site via
client_factory and the client type BraintrustClient plus _persist_sync_outcome
and self.storage.save_imported_scores to locate the changes.
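
To show why a `finally` block covers every exit path in connect(), here is a small sketch. `FakeClient` is a hypothetical stand-in for BraintrustClient that only records whether close() was called, and the string return values replace the real response objects for brevity.

```python
class FakeClient:
    """Hypothetical stand-in for BraintrustClient; only tracks close()."""

    def __init__(self, key_ok: bool, fail: bool = False) -> None:
        self.key_ok = key_ok
        self.fail = fail
        self.closed = False

    def validate_key(self) -> bool:
        if self.fail:
            raise RuntimeError("simulated HTTP failure")
        return self.key_ok

    def close(self) -> None:
        self.closed = True


def connect(client: FakeClient) -> str:
    """Return a status string; the client is closed on every path."""
    try:
        if not client.validate_key():
            return "rejected"  # early return: finally still runs
        return "connected"
    except RuntimeError as exc:
        return f"error: {exc}"  # handled error: finally still runs
    finally:
        client.close()
```

If the real client also implemented the context-manager protocol, a `with` statement would give the same guarantee; the review text does not say whether it does, so the sketch sticks to try/finally.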

Comment on lines +167 to +169
self.storage.save_imported_scores(all_scores) # type: ignore[attr-defined]
self._persist_sync_outcome(connection, error=None)
return SyncBraintrustResponse(

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle save_imported_scores failures and persist last_error.

If the call on line 167 raises, sync_once() exits without updating the connection's error state, and callers see an unhandled exception instead of a structured failure response.

Suggested fix
-        self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
-        self._persist_sync_outcome(connection, error=None)
+        try:
+            self.storage.save_imported_scores(all_scores)  # type: ignore[attr-defined]
+        except Exception as e:  # noqa: BLE001
+            self._persist_sync_outcome(
+                connection, error=f"Failed to persist imported scores: {e}"
+            )
+            return SyncBraintrustResponse(
+                success=False, msg="Failed to persist imported scores."
+            )
+        self._persist_sync_outcome(connection, error=None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/braintrust/service.py` around lines 167 - 169, The
call to self.storage.save_imported_scores(all_scores) can raise and currently
prevents updating the connection error state; wrap that call in a try/except
that catches exceptions, records the exception text into the connection
last_error (use the same path _persist_sync_outcome(connection, error=...)) and
ensure _persist_sync_outcome is invoked in both success and failure paths before
returning or re-raising; update the SyncBraintrustResponse flow so when
save_imported_scores fails you persist the error and return or raise
consistently instead of letting the exception escape untracked.
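
The control flow being asked for can be sketched in isolation as follows. `Connection`, the `save` callable, and this free-standing `sync_once` are hypothetical simplifications of the service's real types; the point is only that the error state is written on both the success and the failure path.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Connection:
    """Hypothetical stand-in for the stored connection row."""

    last_error: Optional[str] = None


def sync_once(
    connection: Connection, save: Callable[[list], None], scores: list
) -> bool:
    """Persist scores; record the outcome on the connection either way."""
    try:
        save(scores)
    except Exception as exc:  # broad on purpose: any storage failure is recorded
        connection.last_error = f"Failed to persist imported scores: {exc}"
        return False
    connection.last_error = None  # success clears any stale error
    return True


def failing_save(scores: list) -> None:
    """Simulate a storage backend that cannot write."""
    raise OSError("disk full")
```

Returning a failure value here mirrors the review's suggested fix; re-raising after recording the error would be an equally valid choice if callers prefer exceptions.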

Comment on lines +28 to +30
for c in corrections:
    idx = min(_BUCKET_COUNT - 1, c)
    bins[idx] += 1

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative correction counts before bin indexing.

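Line 29 accepts a negative c. Because min(_BUCKET_COUNT - 1, c) returns c unchanged when c is negative, the value is used as a Python negative index and invalid data is silently mis-bucketed (e.g., -1 lands in the last bin); a value below -_BUCKET_COUNT would instead raise IndexError.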

Proposed fix
 def bucket_corrections(corrections: Iterable[int]) -> tuple[int, int, int, int, int, int]:
@@
     bins = [0] * _BUCKET_COUNT
     for c in corrections:
+        if c < 0:
+            raise ValueError("corrections must be non-negative")
         idx = min(_BUCKET_COUNT - 1, c)
         bins[idx] += 1
     return tuple(bins)  # type: ignore[return-value]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/evaluation_overview/distribution.py` around lines 28
- 30, The loop over corrections allows negative c values which trigger Python
negative indexing; update the loop in distribution.py (the for c in corrections:
block using _BUCKET_COUNT, idx, and bins) to guard negatives before computing
idx—e.g., if c < 0: continue (or clamp c = 0 if you prefer mapping negatives to
the first bucket) so that idx = min(_BUCKET_COUNT - 1, c) never receives a
negative value and invalid corrections are not mis-bucketed.

Comment on lines +74 to +75
hero = self._build_hero(request, results)
tiles = self._build_tiles(results, results_prev)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute days_since_first_eval from all evaluations, not just the current window.

Line 101 currently derives days_since_first_eval from results (window-filtered), which can flip hero state incorrectly when older evaluations exist outside the requested range.

Suggested fix
-        hero = self._build_hero(request, results)
+        hero = self._build_hero(request, results, all_results)
@@
     def _build_hero(
         self,
         request: GetEvaluationOverviewRequest,
         results: list[AgentSuccessEvaluationResult],
+        all_results: list[AgentSuccessEvaluationResult],
     ) -> HeroBlock:
-        if not results:
+        if not all_results:
             days_since = None
         else:
-            earliest = min(r.created_at for r in results)
+            earliest = min(r.created_at for r in all_results)
             days_since = (int(datetime.now(UTC).timestamp()) - earliest) // 86_400

Also applies to: 92-102

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reflexio/server/services/evaluation_overview/service.py` around lines 74 -
75, The code computes days_since_first_eval from the window-filtered `results`,
which causes hero state flips when older evaluations exist; update the logic so
`days_since_first_eval` is derived from the full/unfiltered evaluations
collection (e.g., `all_results` or whatever variable holds all evaluations)
rather than `results`, and pass that full collection into `_build_hero` (or
compute it before calling `_build_hero`) — update both places mentioned (around
the `days_since_first_eval` calculation and the `_build_hero` call in the same
block and the 92-102 region) so the hero uses the earliest evaluation date
across all evaluations, not just the current window.
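
A small worked example makes the difference concrete. The helper below is a hypothetical, simplified version of the calculation (it takes raw epoch-second timestamps and an explicit `now_ts` instead of result objects and the system clock), using the same 86_400-second day as the suggested fix.

```python
from typing import Iterable, Optional

SECONDS_PER_DAY = 86_400


def days_since_first_eval(created_ats: Iterable[int], now_ts: int) -> Optional[int]:
    """Whole days between the earliest timestamp and now_ts, or None if empty."""
    timestamps = list(created_ats)
    if not timestamps:
        return None
    return (now_ts - min(timestamps)) // SECONDS_PER_DAY


now = 100 * SECONDS_PER_DAY
# Evaluations on day 10, day 95, and day 99 (epoch seconds).
all_evals = [10 * SECONDS_PER_DAY, 95 * SECONDS_PER_DAY, 99 * SECONDS_PER_DAY]
# A 7-day window drops the day-10 evaluation.
window = [t for t in all_evals if t >= now - 7 * SECONDS_PER_DAY]

print(days_since_first_eval(window, now))     # 5
print(days_since_first_eval(all_evals, now))  # 90
```

Here the windowed input reports 5 days since the first evaluation while the full history reports 90, which is the kind of gap that could push the hero block into the wrong state.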


assert response.status_code == 200
body = response.json()
assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty")

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten hero-state assertion to match test intent.

Line 21 currently allows "full" and "early", so a broken empty-state path could still pass. Narrow this to the expected empty-state variants only.

Suggested patch
-    assert body["hero"]["state"] in ("full", "early", "shadow_off", "empty")
+    assert body["hero"]["state"] in ("shadow_off", "empty")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/api_endpoints/test_evaluation_overview_api.py` at line 21, The
assertion currently allows non-empty variants; tighten it to only accept the
empty-state variants by changing the check on body["hero"]["state"] so it only
permits the expected empty values (e.g., "shadow_off" and "empty") instead of
("full", "early", "shadow_off", "empty"); update the assertion that references
body["hero"]["state"] in the test to reflect this narrower set.

Comment on lines +43 to +46
def test_finalize_env_cleanup() -> None:
    """Reset module state so other tests aren't affected."""
    os.environ.pop("REFLEXIO_FERNET_KEYS", None)
    _encryption._reset_for_test()

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
ls -la
fd "test_encryption.py" . -a > /tmp/matches.txt || true
cat /tmp/matches.txt
FILE="$(cat /tmp/matches.txt | head -n 1)"
echo "Using: $FILE"
nl -ba "$FILE" | sed -n '1,120p'

Repository: ReflexioAI/reflexio

Length of output: 2209


Avoid process-wide env mutation in test_finalize_env_cleanup.

Replace os.environ.pop("REFLEXIO_FERNET_KEYS", None) with monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) and add a monkeypatch fixture parameter to ensure the prior environment state is restored by pytest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/services/braintrust/test_encryption.py` around lines 43 - 46, In
test_finalize_env_cleanup, avoid mutating process-wide env directly: add the
pytest monkeypatch fixture parameter to the test signature and replace
os.environ.pop("REFLEXIO_FERNET_KEYS", None) with
monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False) so pytest will restore
the original environment automatically; keep the call to
_encryption._reset_for_test() as-is.
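
The difference between a permanent and a scoped environment change can be seen with the standard library alone. This sketch deliberately swaps in unittest.mock.patch.dict instead of pytest's monkeypatch fixture so that it runs without pytest; both restore the previous environment when their scope ends. The helper names are illustrative only.

```python
import os
from unittest import mock

KEY = "REFLEXIO_FERNET_KEYS"


def pop_directly() -> None:
    """Process-wide removal: nothing restores the variable afterwards."""
    os.environ.pop(KEY, None)


def pop_scoped() -> bool:
    """Scoped removal: patch.dict restores os.environ when the block exits."""
    with mock.patch.dict(os.environ):
        os.environ.pop(KEY, None)
        return KEY in os.environ  # evaluated inside the scope, so False
```

In a pytest test, `monkeypatch.delenv("REFLEXIO_FERNET_KEYS", raising=False)` gives the scoped behavior without the explicit `with` block, since the fixture undoes its changes at teardown.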

@yilu331
Collaborator Author

yilu331 commented May 26, 2026

Superseded by #91 — consolidated evaluations + Braintrust PR. Branch remains on remote for history.

@yilu331 yilu331 closed this May 26, 2026