feat(cloud): honest A/B proof — /public/proof endpoint + ablation export#44
Conversation
Replaces the ABProofPanel's fabricated marketing copy ("200 blind expert
evaluators", "3,000 comparisons", "70% win rate") with real ablation-backed
numbers served by a new public endpoint.
New pieces:
- cloud/app/routes/proof.py: GET /public/proof, reads cloud/data/proof_results.json,
returns {available, source, subjects, judge, trials, dimensions, per_model}.
Graceful empty state if file missing or corrupt — never fabricates.
- cloud/scripts/export_ab_proof.py: aggregates an ablation run's JSONL judgments
into proof_results.json. Computes per-dimension means, 95% CIs, deltas, and
per-model breakdown (a rough aggregation sketch follows after this message). Run: python cloud/scripts/export_ab_proof.py
- cloud/data/proof_results.json: placeholder (overwritten by export script).
- cloud/dashboard/src/components/brain/ABProofPanel.tsx: fetches /public/proof,
shows real trials/subjects/judge when live, falls back to demo fixture with
a visible "demo data" label when empty. Delta color flips on regressions
(honest: we show -pp in red rather than only shipping positive results).
- cloud/tests/test_proof.py: 5 new tests (missing file, present file,
corrupt file, unauthenticated public access, export helper loadable).
Pipeline: run ablation → run export script → redeploy cloud → dashboard lights
up with honest data automatically. No marketing claims the data doesn't
support.
Co-Authored-By: Gradata <noreply@gradata.ai>
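For illustration, a rough sketch of the aggregation step described above. The field names (`dimension`, `condition`, `score`) and the normal-approximation 95% CI are assumptions about the JSONL judgment format, not a transcript of export_ab_proof.py:

```python
import json
import math
from collections import defaultdict
from pathlib import Path

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0

def ci95_half_width(xs: list[float]) -> float:
    # Normal-approximation half-width: 1.96 * s / sqrt(n)
    if len(xs) < 2:
        return 0.0
    m = mean(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return 1.96 * math.sqrt(var / len(xs))

def aggregate(jsonl_path: str) -> dict:
    # Bucket judge scores by (dimension, condition); field names are assumed.
    pools: dict[tuple[str, str], list[float]] = defaultdict(list)
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        rec = json.loads(line)
        pools[(rec["dimension"], rec["condition"])].append(float(rec["score"]))

    dimensions = []
    for dim in sorted({d for d, _ in pools}):
        base = pools[(dim, "base")]
        full = pools[(dim, "full")]
        half = ci95_half_width(full)
        dimensions.append({
            "dimension": dim,
            "baseline_mean": mean(base),
            "with_full_mean": mean(full),
            "delta_pp": round((mean(full) - mean(base)) * 100, 1),  # percentage points
            "ci_low": mean(full) - half,
            "ci_high": mean(full) + half,
            "n_base": len(base),
            "n_with": len(full),
        })
    return {"available": True, "source": jsonl_path, "dimensions": dimensions}
```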
📝 Walkthrough
Adds a new public A/B proof API endpoint that serves JSON from a repository file, integrates the endpoint into the dashboard component to fetch live proof data, and adds tests plus a sample results file to exercise endpoint behaviors.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Dashboard as Dashboard Client
    participant API as FastAPI Server
    participant FS as File System
    Dashboard->>API: GET /public/proof
    API->>FS: check for proof_results.json
    alt file exists
        FS-->>API: file found
        API->>FS: read file
        FS-->>API: JSON content
        API->>API: parse, validate top-level object, set available=true
        API-->>Dashboard: { available: true, dimensions: [...] }
    else missing or unreadable
        FS-->>API: missing / read error / parse error
        API->>API: log warning, prepare unavailable payload
        API-->>Dashboard: { available: false, reason: "..." }
    end
    Dashboard->>Dashboard: render live rows if available else fallback demo
```
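A minimal sketch of the handler flow shown above, assuming FastAPI and an `APIRouter` mounted under the `/api/v1` prefix. The path constant, logger name, and helper are assumptions based on this description, not the actual contents of proof.py:

```python
import json
import logging
from pathlib import Path

from fastapi import APIRouter

logger = logging.getLogger(__name__)
router = APIRouter()

# Assumed location; the real constant lives in cloud/app/routes/proof.py.
_PROOF_PATH = Path("cloud/data/proof_results.json")

def _unavailable(reason: str) -> dict:
    return {"available": False, "source": None, "reason": reason}

@router.get("/public/proof")
async def public_proof() -> dict:
    # Missing file: graceful empty state, never fabricated numbers.
    if not _PROOF_PATH.exists():
        return _unavailable("results file not found")
    try:
        payload = json.loads(_PROOF_PATH.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        logger.warning("proof results unreadable: %s", exc)
        return _unavailable("results file unreadable")
    # Validate the top-level shape before using dict methods on it.
    if not isinstance(payload, dict):
        logger.warning("proof results had unexpected type: %s", type(payload).__name__)
        return _unavailable("results file unreadable or unexpected JSON type")
    payload.setdefault("available", True)
    return payload
```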
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Deploying gradata-dashboard with Cloudflare Pages

| Latest commit: | 6a28b60 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://1ed6a41d.gradata-dashboard.pages.dev |
| Branch Preview URL: | https://feat-honest-ab-proof.gradata-dashboard.pages.dev |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/app/routes/proof.py`:
- Around line 64-69: The code assumes payload is a dict and calls
payload.setdefault("available", True); instead validate the parsed JSON before
using dict methods: after json.loads(_PROOF_PATH.read_text(...)) check
isinstance(payload, dict), and if not, log a warning referencing the payload
type/value and return {"available": False, "source": None, "reason": "results
file unreadable or unexpected JSON type"}; only call
payload.setdefault("available", True) and return payload when payload is a dict.
This prevents payload.setdefault from raising on values like [] or "ok".
In `@cloud/dashboard/src/components/brain/ABProofPanel.tsx`:
- Around line 15-25: In ABProofPanel, the live "rules+meta vs baseline"
comparison is currently using ProofDim.best_mean; change the mapping so the UI
uses ProofDim.with_full_mean for the live comparison cell/variable instead of
best_mean, and if with_full_mean can be null ensure a safe fallback (e.g., use
best_mean or display N/A) to avoid rendering null. Update any occurrences where
best_mean is used for the live comparison rendering to reference with_full_mean
(with the fallback) so the panel copy matches the displayed data.
In `@cloud/tests/test_proof.py`:
- Around line 62-71: Add a regression test alongside
test_proof_returns_unavailable_on_corrupt_file that writes a valid JSON with an
invalid top-level shape (e.g., "[]" or a string) to the temporary _PROOF_PATH
and asserts the /api/v1/public/proof endpoint still returns a graceful
unavailable payload; specifically, in the same test module import proof as
proof_module, monkeypatch proof_module._PROOF_PATH to point at the temp file
containing "[]" and then call client.get("/api/v1/public/proof") and assert
resp.status_code == 200 and resp.json()["available"] is False to ensure the
handler tolerates valid-but-incorrect JSON shapes.
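For reference, a rough sketch of that regression test, assuming the existing pytest `client` fixture wrapping the FastAPI app and an importable route module (the import path is an assumption, not verified against the repo layout):

```python
import json

# Import path is an assumption; adjust to wherever the proof router lives.
from app.routes import proof as proof_module

def test_proof_returns_unavailable_on_wrong_shape_json(tmp_path, monkeypatch, client):
    # Valid JSON, but the wrong top-level shape: a list instead of an object.
    bad_file = tmp_path / "proof_results.json"
    bad_file.write_text(json.dumps([]), encoding="utf-8")
    monkeypatch.setattr(proof_module, "_PROOF_PATH", bad_file)

    resp = client.get("/api/v1/public/proof")

    assert resp.status_code == 200
    assert resp.json()["available"] is False
```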
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c8ac291a-e267-46b3-bcc0-de4f02b81470
📒 Files selected for processing (5)
- cloud/app/routes/__init__.py
- cloud/app/routes/proof.py
- cloud/dashboard/src/components/brain/ABProofPanel.tsx
- cloud/data/proof_results.json
- cloud/tests/test_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: test (3.12)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (5)
cloud/app/routes/__init__.py (1)
16-16: Proof router registration is correctly wired. Import and inclusion match the existing route aggregation pattern, and this should expose the endpoint under the app-level `/api/v1` prefix as intended.
Also applies to: 37-37
cloud/data/proof_results.json (1)
1-1: Placeholder unavailable payload is appropriate. This provides a safe cold-start response for `/public/proof` without inventing metrics.
cloud/app/routes/proof.py (1)
57-63: Missing-file fallback behavior is solid. Returning a structured unavailable payload here is clean and keeps the endpoint honest during fresh deploys.
cloud/tests/test_proof.py (1)
35-79: Endpoint behavior coverage is strong for core scenarios. Missing-file, present-file, corrupt-file, and unauthenticated access checks align well with the public endpoint contract.
cloud/dashboard/src/components/brain/ABProofPanel.tsx (1)
117-125: Regression-aware delta styling is a good improvement. The sign/color split now correctly surfaces negative deltas instead of always presenting improvements.
…ression test
- proof.py: validate parsed JSON is a dict before calling setdefault; degrade gracefully when someone drops a list/string into the results file (test coverage added).
- ABProofPanel: use with_full_mean (rules + meta) for the 'rules+meta vs baseline' comparison cell, with fallback to best_mean, then to 0. Matches panel copy.
- test_proof: regression test for valid-but-wrong-shape JSON (`[]`).
Co-Authored-By: Gradata <noreply@gradata.ai>
Addressed all 3 CR actionables: dict validation in proof.py, with_full_mean fallback chain in ABProofPanel, regression test for wrong-shape JSON. Tests 6/6 green. @coderabbitai full review
✅ Actions performed: Full review triggered.
Combines two complementary honesty additions:
- main added a 'published baseline (S103)' warning badge + disclaimer to the ABProofPanel, signaling the numbers are a stress-test reference, not the user's brain.
- this branch added the /public/proof endpoint + export script + dynamic data switch with real ablation numbers when present.
Combined behavior: when /public/proof returns live data, show a green 'live · this brain' badge with real trial/subject/judge counts. Otherwise show the yellow 'published baseline' badge with the S103 disclaimer.
Co-Authored-By: Gradata <noreply@gradata.ai>
Replaces placeholder with output of export_ab_proof.py against .tmp/rule-ablation-v2 (Sonnet/DeepSeek/qwen14b × base/rules/full × 16 tasks × 3 iters, judged blind by Haiku 4.5).
Headline numbers (with_full_mean — rules + meta-rules):
- correctness: 0.833 → 0.832 (-0.1 pp)
- preference_adherence: 0.732 → 0.755 (+2.3 pp)
- quality: 0.793 → 0.799 (+0.6 pp)
Note: with_rules_mean is higher than with_full_mean across all dimensions. The deterministic meta-rule overlay regresses results; this is the empirical evidence that drove the source-aware injection filter (PR #45).
Co-Authored-By: Gradata <noreply@gradata.ai>
Real ablation data committed in 6a28b60 (427 trials × 3 dimensions).
♻️ Duplicate comments (1)
cloud/dashboard/src/components/brain/ABProofPanel.tsx (1)
15-26: 🧹 Nitpick | 🔵 Trivial
Type definition mismatch: `best_mean` should be optional.
The interface declares `best_mean: number` as required, but line 86 uses `d.best_mean ?? 0`, which implies it could be undefined. Make the type match the defensive fallback logic.
♻️ Proposed fix
```diff
 interface ProofDim {
   dimension: string
   baseline_mean: number
   with_rules_mean: number | null
   with_full_mean: number | null
-  best_mean: number
+  best_mean?: number
   ci_low: number
   ci_high: number
   delta_pp: number
   n_base: number
   n_with: number
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloud/dashboard/src/components/brain/ABProofPanel.tsx` around lines 15 - 26, The ProofDim interface declares best_mean as required but the code uses a defensive fallback (d.best_mean ?? 0); update the type to reflect that best_mean can be undefined by making best_mean optional (e.g., best_mean?: number) so the type matches the usage in places like d.best_mean ?? 0 and prevents TypeScript errors; change the declaration on the ProofDim interface accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cloud/dashboard/src/components/brain/ABProofPanel.tsx`:
- Around line 15-26: The ProofDim interface declares best_mean as required but
the code uses a defensive fallback (d.best_mean ?? 0); update the type to
reflect that best_mean can be undefined by making best_mean optional (e.g.,
best_mean?: number) so the type matches the usage in places like d.best_mean ??
0 and prevents TypeScript errors; change the declaration on the ProofDim
interface accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8e058b52-fc12-4867-bb40-80bf3bacd318
📒 Files selected for processing (4)
- cloud/app/routes/proof.py
- cloud/dashboard/src/components/brain/ABProofPanel.tsx
- cloud/data/proof_results.json
- cloud/tests/test_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (6)
cloud/app/routes/proof.py (1)
29-79: LGTM — endpoint correctly handles all failure modes. The implementation properly degrades gracefully for missing files, corrupt JSON, and unexpected JSON shapes. The dict validation fix from the previous review is correctly in place.
One minor note: the endpoint uses synchronous file I/O (`read_text`) within an async handler. For a small JSON file read infrequently, this is acceptable, but if the file grows or is accessed under load, consider using `aiofiles` for non-blocking reads.
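If that ever becomes a concern, a minimal sketch of a non-blocking variant, assuming `aiofiles` is added as a dependency and reusing the unavailable payload shape described above:

```python
import json
import aiofiles  # assumed extra dependency for non-blocking file reads

async def read_proof_payload(path) -> dict:
    # Non-blocking read of the proof results file inside the async handler.
    try:
        async with aiofiles.open(path, mode="r", encoding="utf-8") as f:
            raw = await f.read()
        payload = json.loads(raw)
    except (OSError, json.JSONDecodeError):
        return {"available": False, "source": None, "reason": "results file unreadable"}
    if not isinstance(payload, dict):
        return {"available": False, "source": None, "reason": "unexpected JSON type"}
    payload.setdefault("available", True)
    return payload
```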
cloud/tests/test_proof.py (1)
1-104: LGTM — comprehensive test coverage for endpoint behavior. The tests cover all major branches: missing file, valid JSON, corrupt JSON, wrong JSON shape, and unauthenticated access. The regression test for wrong-shape JSON (lines 73-81) correctly addresses the previous review feedback.
cloud/data/proof_results.json (1)
1-126: LGTM — valid ablation results payload. The JSON structure correctly matches the expected schema consumed by the `/public/proof` endpoint and `ABProofPanel.tsx`. The data includes honest results showing both improvements (preference_adherence +2.3pp, quality +0.6pp) and a minor regression (correctness -0.1pp), which aligns with the PR's goal of showing truthful numbers.
cloud/dashboard/src/components/brain/ABProofPanel.tsx (3)
56-75: LGTM — proper cleanup pattern for async effects. The `mounted` flag correctly prevents state updates after unmount, avoiding the React warning about updating unmounted components. The error handling gracefully degrades to `available: false`.
77-90: LGTM — fallback chain correctly prioritizes `with_full_mean`. The implementation at line 86 (`d.with_full_mean ?? d.best_mean ?? 0`) correctly addresses the previous review feedback, ensuring the panel displays "rules+meta vs baseline" data when available.
128-170: LGTM — delta coloring correctly reflects regressions. The sign/color logic (lines 135-136) properly shows green for improvements (`delta >= 0`) and red for regressions (`delta < 0`), which is essential for honest reporting of results like the -0.1pp correctness regression in the current data.
Conflicts in cloud/app/db.py, routes/brains.py, routes/users.py. Resolution strategy (preserve main's behavior additions, keep simplify intent where non-conflicting):
- db.py: accepted main's explicit in_= parameter (required by activity.py, rule_patches.py, brains.py callers merged via #44). Kept simplify's _raise_db_error for uniform error handling.
- routes/brains.py: kept main's batched lessons+corrections IN-query for list_brains (perf) + main's new POST /brains endpoint. Preserved simplify's _brain_detail helper, Depends(get_brain_for_request) pattern, and _is_demo consolidation.
- routes/users.py: accepted main's version wholesale (adds email, _derive_plan, _primary_workspace_id scoped notification updates). Simplify's _hydrate_workspaces helper dropped — main's inline form carries the required new behavior and this PR is a refactor (no behavior changes).
No new behavior introduced by this merge commit.
* fix(proof): add missing export_ab_proof.py script (forgotten in PR #44)
* fix(proof): address CR round-2 — filter unknown conditions, fix best-arm selection
  - Skip records whose condition is not in {base, rules, full} so trials, subjects, and dimension zeroing stay consistent with the three reported arms.
  - dim_payload: pick the arm with the highest mean and lock ci_pool + n_with to that arm so reported CI and sample size match best_mean instead of drifting from a mixed rules+full pool.
  - per_model: pick max(rules_mean, full_mean) for with_best_mean instead of truthy-OR'ing pools, which silently preferred rules even when full was higher.
CR review: #48 (review)
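A small sketch of the corrected best-arm selection described above. Variable names like `rules_pool` and `full_pool` are assumptions; the real logic lives in export_ab_proof.py:

```python
def pick_best_arm(rules_pool: list[float], full_pool: list[float]) -> tuple[str, list[float]]:
    """Return (arm_name, pool) for the arm with the higher mean.

    The CI and n_with reported alongside best_mean are then computed from this
    single pool, instead of drifting from a mixed rules+full pool or from a
    truthy-or that silently prefers the rules arm.
    """
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    if mean(full_pool) >= mean(rules_pool):
        return "full", full_pool
    return "rules", rules_pool
```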
Summary
Replaces the ABProofPanel's fabricated marketing copy ('200 blind expert evaluators', '3,000 comparisons', '70% win rate') with honest ablation-backed numbers served from a new public endpoint.
Changes
Pipeline
Why this matters
The current panel makes claims ('200 blind experts', '3,000 comparisons') we cannot defend. Any skeptic who asks 'where's your data?' gets an awkward answer. This PR makes that claim truth-checkable: the panel shows exactly what our ablation produced, with CIs, judge model, and subject list visible in the UI.
Test plan