feat(cloud): honest A/B proof — /public/proof endpoint + ablation export#44
Conversation
Replaces the ABProofPanel's fabricated marketing copy ("200 blind expert
evaluators", "3,000 comparisons", "70% win rate") with real ablation-backed
numbers served by a new public endpoint.
New pieces:
- cloud/app/routes/proof.py: GET /public/proof, reads cloud/data/proof_results.json,
returns {available, source, subjects, judge, trials, dimensions, per_model}.
Graceful empty state if file missing or corrupt — never fabricates.
- cloud/scripts/export_ab_proof.py: aggregates an ablation run's JSONL judgments
into proof_results.json. Computes per-dimension means, 95% CIs, deltas, and
per-model breakdown (a rough aggregation sketch follows after this message). Run: python cloud/scripts/export_ab_proof.py
- cloud/data/proof_results.json: placeholder (overwritten by export script).
- cloud/dashboard/src/components/brain/ABProofPanel.tsx: fetches /public/proof,
shows real trials/subjects/judge when live, falls back to demo fixture with
a visible "demo data" label when empty. Delta color flips on regressions
(honest: we show -pp in red rather than only shipping positive results).
- cloud/tests/test_proof.py: 5 new tests (missing file, present file,
corrupt file, unauthenticated public access, export helper loadable).
Pipeline: run ablation → run export script → redeploy cloud → dashboard lights
up with honest data automatically. No marketing claims the data doesn't
support.
Co-Authored-By: Gradata <noreply@gradata.ai>
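For illustration, a rough sketch of the aggregation step described above. The field names (`dimension`, `condition`, `score`) and the normal-approximation 95% CI are assumptions about the JSONL judgment format, not a transcript of export_ab_proof.py:

```python
import json
import math
from collections import defaultdict
from pathlib import Path

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0

def ci95_half_width(xs: list[float]) -> float:
    # Normal-approximation half-width: 1.96 * s / sqrt(n)
    if len(xs) < 2:
        return 0.0
    m = mean(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return 1.96 * math.sqrt(var / len(xs))

def aggregate(jsonl_path: str) -> dict:
    # Bucket judge scores by (dimension, condition); field names are assumed.
    pools: dict[tuple[str, str], list[float]] = defaultdict(list)
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        rec = json.loads(line)
        pools[(rec["dimension"], rec["condition"])].append(float(rec["score"]))

    dimensions = []
    for dim in sorted({d for d, _ in pools}):
        base = pools[(dim, "base")]
        full = pools[(dim, "full")]
        half = ci95_half_width(full)
        dimensions.append({
            "dimension": dim,
            "baseline_mean": mean(base),
            "with_full_mean": mean(full),
            "delta_pp": round((mean(full) - mean(base)) * 100, 1),  # percentage points
            "ci_low": mean(full) - half,
            "ci_high": mean(full) + half,
            "n_base": len(base),
            "n_with": len(full),
        })
    return {"available": True, "source": jsonl_path, "dimensions": dimensions}
```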
📝 Walkthrough
Adds a new public A/B proof API endpoint that serves JSON from a repository file, integrates the endpoint into the dashboard component to fetch live proof data, and adds tests plus a sample results file to exercise endpoint behaviors.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Dashboard as Dashboard Client
    participant API as FastAPI Server
    participant FS as File System
    Dashboard->>API: GET /public/proof
    API->>FS: check for proof_results.json
    alt file exists
        FS-->>API: file found
        API->>FS: read file
        FS-->>API: JSON content
        API->>API: parse, validate top-level object, set available=true
        API-->>Dashboard: { available: true, dimensions: [...] }
    else missing or unreadable
        FS-->>API: missing / read error / parse error
        API->>API: log warning, prepare unavailable payload
        API-->>Dashboard: { available: false, reason: "..." }
    end
    Dashboard->>Dashboard: render live rows if available else fallback demo
```
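A minimal sketch of the handler flow shown above, assuming FastAPI and an `APIRouter` mounted under the `/api/v1` prefix. The path constant, logger name, and helper are assumptions based on this description, not the actual contents of proof.py:

```python
import json
import logging
from pathlib import Path

from fastapi import APIRouter

logger = logging.getLogger(__name__)
router = APIRouter()

# Assumed location; the real constant lives in cloud/app/routes/proof.py.
_PROOF_PATH = Path("cloud/data/proof_results.json")

def _unavailable(reason: str) -> dict:
    return {"available": False, "source": None, "reason": reason}

@router.get("/public/proof")
async def public_proof() -> dict:
    # Missing file: graceful empty state, never fabricated numbers.
    if not _PROOF_PATH.exists():
        return _unavailable("results file not found")
    try:
        payload = json.loads(_PROOF_PATH.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        logger.warning("proof results unreadable: %s", exc)
        return _unavailable("results file unreadable")
    # Validate the top-level shape before using dict methods on it.
    if not isinstance(payload, dict):
        logger.warning("proof results had unexpected type: %s", type(payload).__name__)
        return _unavailable("results file unreadable or unexpected JSON type")
    payload.setdefault("available", True)
    return payload
```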
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Deploying gradata-dashboard with Cloudflare Pages

| Latest commit: | 6a28b60 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://1ed6a41d.gradata-dashboard.pages.dev |
| Branch Preview URL: | https://feat-honest-ab-proof.gradata-dashboard.pages.dev |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/app/routes/proof.py`:
- Around line 64-69: The code assumes payload is a dict and calls
payload.setdefault("available", True); instead validate the parsed JSON before
using dict methods: after json.loads(_PROOF_PATH.read_text(...)) check
isinstance(payload, dict), and if not, log a warning referencing the payload
type/value and return {"available": False, "source": None, "reason": "results
file unreadable or unexpected JSON type"}; only call
payload.setdefault("available", True) and return payload when payload is a dict.
This prevents payload.setdefault from raising on values like [] or "ok".
In `@cloud/dashboard/src/components/brain/ABProofPanel.tsx`:
- Around line 15-25: In ABProofPanel, the live "rules+meta vs baseline"
comparison is currently using ProofDim.best_mean; change the mapping so the UI
uses ProofDim.with_full_mean for the live comparison cell/variable instead of
best_mean, and if with_full_mean can be null ensure a safe fallback (e.g., use
best_mean or display N/A) to avoid rendering null. Update any occurrences where
best_mean is used for the live comparison rendering to reference with_full_mean
(with the fallback) so the panel copy matches the displayed data.
In `@cloud/tests/test_proof.py`:
- Around line 62-71: Add a regression test alongside
test_proof_returns_unavailable_on_corrupt_file that writes a valid JSON with an
invalid top-level shape (e.g., "[]" or a string) to the temporary _PROOF_PATH
and asserts the /api/v1/public/proof endpoint still returns a graceful
unavailable payload; specifically, in the same test module import proof as
proof_module, monkeypatch proof_module._PROOF_PATH to point at the temp file
containing "[]" and then call client.get("/api/v1/public/proof") and assert
resp.status_code == 200 and resp.json()["available"] is False to ensure the
handler tolerates valid-but-incorrect JSON shapes.
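For reference, a rough sketch of that regression test, assuming the existing pytest `client` fixture wrapping the FastAPI app and an importable route module (the import path is an assumption, not verified against the repo layout):

```python
import json

# Import path is an assumption; adjust to wherever the proof router lives.
from app.routes import proof as proof_module

def test_proof_returns_unavailable_on_wrong_shape_json(tmp_path, monkeypatch, client):
    # Valid JSON, but the wrong top-level shape: a list instead of an object.
    bad_file = tmp_path / "proof_results.json"
    bad_file.write_text(json.dumps([]), encoding="utf-8")
    monkeypatch.setattr(proof_module, "_PROOF_PATH", bad_file)

    resp = client.get("/api/v1/public/proof")

    assert resp.status_code == 200
    assert resp.json()["available"] is False
```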
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c8ac291a-e267-46b3-bcc0-de4f02b81470
📒 Files selected for processing (5)
- cloud/app/routes/__init__.py
- cloud/app/routes/proof.py
- cloud/dashboard/src/components/brain/ABProofPanel.tsx
- cloud/data/proof_results.json
- cloud/tests/test_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: test (3.12)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (5)
cloud/app/routes/__init__.py (1)
16-16: Proof router registration is correctly wired. Import and inclusion match the existing route aggregation pattern, and this should expose the endpoint under the app-level `/api/v1` prefix as intended.
Also applies to: 37-37
cloud/data/proof_results.json (1)
1-1: Placeholder unavailable payload is appropriate. This provides a safe cold-start response for `/public/proof` without inventing metrics.
cloud/app/routes/proof.py (1)
57-63: Missing-file fallback behavior is solid. Returning a structured unavailable payload here is clean and keeps the endpoint honest during fresh deploys.
cloud/tests/test_proof.py (1)
35-79: Endpoint behavior coverage is strong for core scenarios. Missing-file, present-file, corrupt-file, and unauthenticated access checks align well with the public endpoint contract.
cloud/dashboard/src/components/brain/ABProofPanel.tsx (1)
117-125: Regression-aware delta styling is a good improvement. The sign/color split now correctly surfaces negative deltas instead of always presenting improvements.
…ression test
- proof.py: validate parsed JSON is a dict before calling setdefault; degrade gracefully when someone drops a list/string into the results file (test coverage added).
- ABProofPanel: use with_full_mean (rules + meta) for the 'rules+meta vs baseline' comparison cell, with fallback to best_mean, then to 0. Matches panel copy.
- test_proof: regression test for valid-but-wrong-shape JSON (`[]`).
Co-Authored-By: Gradata <noreply@gradata.ai>
Addressed all 3 CR actionables: dict validation in proof.py, with_full_mean fallback chain in ABProofPanel, regression test for wrong-shape JSON. Tests 6/6 green. @coderabbitai full review
✅ Actions performed: Full review triggered.
Combines two complementary honesty additions:
- main added a 'published baseline (S103)' warning badge + disclaimer to the ABProofPanel, signaling the numbers are a stress-test reference, not the user's brain.
- this branch added the /public/proof endpoint + export script + dynamic data switch with real ablation numbers when present.
Combined behavior: when /public/proof returns live data, show a green 'live · this brain' badge with real trial/subject/judge counts. Otherwise show the yellow 'published baseline' badge with the S103 disclaimer.
Co-Authored-By: Gradata <noreply@gradata.ai>
Replaces placeholder with output of export_ab_proof.py against .tmp/rule-ablation-v2 (Sonnet/DeepSeek/qwen14b × base/rules/full × 16 tasks × 3 iters, judged blind by Haiku 4.5).
Headline numbers (with_full_mean — rules + meta-rules):
- correctness: 0.833 → 0.832 (-0.1 pp)
- preference_adherence: 0.732 → 0.755 (+2.3 pp)
- quality: 0.793 → 0.799 (+0.6 pp)
Note: with_rules_mean is higher than with_full_mean across all dimensions. The deterministic meta-rule overlay regresses results; this is the empirical evidence that drove the source-aware injection filter (PR #45).
Co-Authored-By: Gradata <noreply@gradata.ai>
Real ablation data committed in 6a28b60 (427 trials × 3 dimensions).
♻️ Duplicate comments (1)
cloud/dashboard/src/components/brain/ABProofPanel.tsx (1)
15-26: 🧹 Nitpick | 🔵 Trivial
Type definition mismatch: `best_mean` should be optional.
The interface declares `best_mean: number` as required, but line 86 uses `d.best_mean ?? 0`, which implies it could be undefined. Make the type match the defensive fallback logic.
♻️ Proposed fix
```diff
 interface ProofDim {
   dimension: string
   baseline_mean: number
   with_rules_mean: number | null
   with_full_mean: number | null
-  best_mean: number
+  best_mean?: number
   ci_low: number
   ci_high: number
   delta_pp: number
   n_base: number
   n_with: number
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloud/dashboard/src/components/brain/ABProofPanel.tsx` around lines 15 - 26, The ProofDim interface declares best_mean as required but the code uses a defensive fallback (d.best_mean ?? 0); update the type to reflect that best_mean can be undefined by making best_mean optional (e.g., best_mean?: number) so the type matches the usage in places like d.best_mean ?? 0 and prevents TypeScript errors; change the declaration on the ProofDim interface accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cloud/dashboard/src/components/brain/ABProofPanel.tsx`:
- Around line 15-26: The ProofDim interface declares best_mean as required but
the code uses a defensive fallback (d.best_mean ?? 0); update the type to
reflect that best_mean can be undefined by making best_mean optional (e.g.,
best_mean?: number) so the type matches the usage in places like d.best_mean ??
0 and prevents TypeScript errors; change the declaration on the ProofDim
interface accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8e058b52-fc12-4867-bb40-80bf3bacd318
📒 Files selected for processing (4)
- cloud/app/routes/proof.py
- cloud/dashboard/src/components/brain/ABProofPanel.tsx
- cloud/data/proof_results.json
- cloud/tests/test_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (6)
cloud/app/routes/proof.py (1)
29-79: LGTM — endpoint correctly handles all failure modes. The implementation properly degrades gracefully for missing files, corrupt JSON, and unexpected JSON shapes. The dict validation fix from the previous review is correctly in place.
One minor note: the endpoint uses synchronous file I/O (`read_text`) within an async handler. For a small JSON file read infrequently, this is acceptable, but if the file grows or is accessed under load, consider using `aiofiles` for non-blocking reads.
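If that ever becomes a concern, a minimal sketch of a non-blocking variant, assuming `aiofiles` is added as a dependency and reusing the unavailable payload shape described above:

```python
import json
import aiofiles  # assumed extra dependency for non-blocking file reads

async def read_proof_payload(path) -> dict:
    # Non-blocking read of the proof results file inside the async handler.
    try:
        async with aiofiles.open(path, mode="r", encoding="utf-8") as f:
            raw = await f.read()
        payload = json.loads(raw)
    except (OSError, json.JSONDecodeError):
        return {"available": False, "source": None, "reason": "results file unreadable"}
    if not isinstance(payload, dict):
        return {"available": False, "source": None, "reason": "unexpected JSON type"}
    payload.setdefault("available", True)
    return payload
```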
cloud/tests/test_proof.py (1)
1-104: LGTM — comprehensive test coverage for endpoint behavior. The tests cover all major branches: missing file, valid JSON, corrupt JSON, wrong JSON shape, and unauthenticated access. The regression test for wrong-shape JSON (lines 73-81) correctly addresses the previous review feedback.
cloud/data/proof_results.json (1)
1-126: LGTM — valid ablation results payload. The JSON structure correctly matches the expected schema consumed by the `/public/proof` endpoint and `ABProofPanel.tsx`. The data includes honest results showing both improvements (preference_adherence +2.3pp, quality +0.6pp) and a minor regression (correctness -0.1pp), which aligns with the PR's goal of showing truthful numbers.
cloud/dashboard/src/components/brain/ABProofPanel.tsx (3)
56-75: LGTM — proper cleanup pattern for async effects. The `mounted` flag correctly prevents state updates after unmount, avoiding the React warning about updating unmounted components. The error handling gracefully degrades to `available: false`.
77-90: LGTM — fallback chain correctly prioritizes `with_full_mean`. The implementation at line 86 (`d.with_full_mean ?? d.best_mean ?? 0`) correctly addresses the previous review feedback, ensuring the panel displays "rules+meta vs baseline" data when available.
128-170: LGTM — delta coloring correctly reflects regressions. The sign/color logic (lines 135-136) properly shows green for improvements (`delta >= 0`) and red for regressions (`delta < 0`), which is essential for honest reporting of results like the -0.1pp correctness regression in the current data.
Conflicts in cloud/app/db.py, routes/brains.py, routes/users.py. Resolution strategy (preserve main's behavior additions, keep simplify intent where non-conflicting):
- db.py: accepted main's explicit in_= parameter (required by activity.py, rule_patches.py, brains.py callers merged via #44). Kept simplify's _raise_db_error for uniform error handling.
- routes/brains.py: kept main's batched lessons+corrections IN-query for list_brains (perf) + main's new POST /brains endpoint. Preserved simplify's _brain_detail helper, Depends(get_brain_for_request) pattern, and _is_demo consolidation.
- routes/users.py: accepted main's version wholesale (adds email, _derive_plan, _primary_workspace_id scoped notification updates). Simplify's _hydrate_workspaces helper dropped — main's inline form carries the required new behavior and this PR is a refactor (no behavior changes).
No new behavior introduced by this merge commit.
* fix(proof): add missing export_ab_proof.py script (forgotten in PR #44)
* fix(proof): address CR round-2 — filter unknown conditions, fix best-arm selection
  - Skip records whose condition is not in {base, rules, full} so trials, subjects, and dimension zeroing stay consistent with the three reported arms.
  - dim_payload: pick the arm with the highest mean and lock ci_pool + n_with to that arm so reported CI and sample size match best_mean instead of drifting from a mixed rules+full pool.
  - per_model: pick max(rules_mean, full_mean) for with_best_mean instead of truthy-OR'ing pools, which silently preferred rules even when full was higher.
CR review: #48 (review)
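A small sketch of the corrected best-arm selection described above. Variable names like `rules_pool` and `full_pool` are assumptions; the real logic lives in export_ab_proof.py:

```python
def pick_best_arm(rules_pool: list[float], full_pool: list[float]) -> tuple[str, list[float]]:
    """Return (arm_name, pool) for the arm with the higher mean.

    The CI and n_with reported alongside best_mean are then computed from this
    single pool, instead of drifting from a mixed rules+full pool or from a
    truthy-or that silently prefers the rules arm.
    """
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    if mean(full_pool) >= mean(rules_pool):
        return "full", full_pool
    return "rules", rules_pool
```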
Summary
Replaces the ABProofPanel's fabricated marketing copy ('200 blind expert evaluators', '3,000 comparisons', '70% win rate') with honest ablation-backed numbers served from a new public endpoint.
Changes
Pipeline
Why this matters
The current panel makes claims ('200 blind experts', '3,000 comparisons') we cannot defend. Any skeptic who asks 'where's your data?' gets an awkward answer. This PR makes that claim truth-checkable: the panel shows exactly what our ablation produced, with CIs, judge model, and subject list visible in the UI.
Test plan