
exp(ablation): pilot harness for GRADATA_BETA_LB_GATE #92

Merged
Gradata merged 1 commit into main from exp/beta-lb-ablation-harness on Apr 15, 2026

Conversation

@Gradata (Owner) commented Apr 15, 2026

Summary

Stages a pilot ablation harness to measure the effect of the Beta lower-bound promotion gate shipped in PR #86 (self_improvement._passes_beta_lb_gate). Default-OFF until we have an in-band signal; this is the tool that produces that signal.

No production code changes. Harness + README + tests only.

What the harness does

Runs two conditions on the same seeded synthetic brain (~20 PATTERN-tier lessons with varied (alpha, beta_param) so some would be blocked by the gate, some wouldn't):

  • A (baseline): GRADATA_BETA_LB_GATE=0
  • B (gate on): GRADATA_BETA_LB_GATE=1
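Conceptually, the two-condition loop looks like the sketch below. The gate semantics (a lower quantile of Beta(alpha, beta_param) checked against a 0.70 threshold), the helper names, and the synthetic pool are illustrative assumptions, not the harness's actual code:

```python
import random

def beta_lower_bound(alpha: float, beta: float, n: int = 2000, q: float = 0.05) -> float:
    # Monte Carlo estimate of the q-quantile of Beta(alpha, beta)
    draws = sorted(random.betavariate(alpha, beta) for _ in range(n))
    return draws[int(q * n)]

def passes_gate(alpha: float, beta: float, threshold: float = 0.70) -> bool:
    # Assumed gate semantics: promote only if the Beta lower bound clears the threshold
    return beta_lower_bound(alpha, beta) >= threshold

def graduations(lessons, gate_on: bool) -> int:
    # Baseline (gate off) promotes every lesson; the gated condition filters by the bound
    if not gate_on:
        return len(lessons)
    return sum(passes_gate(a, b) for a, b in lessons)

random.seed(7)  # seeded so both conditions see the same synthetic pool
pool = [(20, 2), (5, 5), (30, 1), (2, 8)]  # toy (alpha, beta_param) pairs
baseline = graduations(pool, gate_on=False)  # condition A
gated = graduations(pool, gate_on=True)      # condition B
```

With this toy pool, the confident lessons (high alpha, low beta_param) clear the gate and the uncertain ones are blocked, which is exactly the discrimination the real synthetic brain is seeded to exercise.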

Measures:

  • Graduation-rate delta (PATTERN -> RULE promotions blocked by the gate)
  • Preference lift (Sonnet generations scored by Haiku judge on per-task quality)
  • Per-lesson decision trace (which specific lessons promote/block under each condition)

Writes .tmp/ablation_beta_lb_<timestamp>.json + human summary.
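A minimal sketch of the result-writing step; the key names below are illustrative placeholders, not the harness's actual output schema:

```python
import json
import time
from pathlib import Path

def write_results(results: dict, out_dir: str = ".tmp") -> Path:
    # Timestamped JSON artifact, mirroring .tmp/ablation_beta_lb_<timestamp>.json
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"ablation_beta_lb_{int(time.time())}.json"
    path.write_text(json.dumps(results, indent=2))
    return path

path = write_results({
    "preference_lift_pct": 1.8,   # example values only
    "graduation_drop_pct": 40.0,
    "per_lesson": [{"lesson_id": "p-001", "baseline": "promote", "gated": "block"}],
})
print(f"{path.name}: lift +1.8%, graduation drop 40%")  # human-readable summary line
```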

How to run the pilot

```bash
# Dry run (safe, zero API calls):
python brain/scripts/ablation_beta_lb_gate.py --tasks 10 --iterations 2

# Actually run:
GRADATA_ABLATION_CONFIRM=1 python brain/scripts/ablation_beta_lb_gate.py --tasks 10 --iterations 2
```

Estimated cost: ~$1 for 10 tasks x 2 iterations (20 trials = 40 Sonnet gens + 20 Haiku judges). Dry-run prints a precise estimate for any --tasks / --iterations combo.
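The arithmetic behind that estimate, as a sketch. The per-call prices are illustrative assumptions chosen to reproduce the ~$1 figure, not published rates:

```python
def estimate(tasks: int, iterations: int,
             sonnet_cost: float = 0.02, haiku_cost: float = 0.01) -> dict:
    trials = tasks * iterations
    sonnet_gens = trials * 2   # one generation per condition (A and B)
    haiku_judges = trials      # one pairwise judgment per trial
    return {
        "trials": trials,
        "sonnet_gens": sonnet_gens,
        "haiku_judges": haiku_judges,
        "usd": sonnet_gens * sonnet_cost + haiku_judges * haiku_cost,
    }

est = estimate(tasks=10, iterations=2)
# 20 trials -> 40 Sonnet gens + 20 Haiku judges, $1.00 at the assumed prices
```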

Safety gate

Without `GRADATA_ABLATION_CONFIRM=1`, the script runs a dry-run only — prints trial count / token estimate / dollar estimate, exits 0, makes zero LLM calls. Enforced by a test that monkey-patches `_make_anthropic_client` to raise on any access, then invokes `main()` and asserts it still exits 0.
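The confirm-gate pattern described above, reduced to a self-contained sketch; `run_experiment` and the printed estimate are stand-ins for the real script's logic:

```python
import os

def run_experiment() -> int:
    # Stand-in: the real script would build a client and make LLM calls here
    raise RuntimeError("would make LLM calls")

def main() -> int:
    if os.environ.get("GRADATA_ABLATION_CONFIRM") != "1":
        # Dry run: print estimates, exit 0, never touch the client factory
        print("dry run: 20 trials, ~$1.00 estimated; "
              "set GRADATA_ABLATION_CONFIRM=1 to execute")
        return 0
    return run_experiment()

os.environ.pop("GRADATA_ABLATION_CONFIRM", None)  # ensure the dry-run path
rc = main()
```

Because the dry-run branch returns before any client is constructed, a test that makes the client factory raise on access can still assert a clean exit.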

Decision criteria (when to default the gate ON)

Default gate ON when both hold on the pilot:

  1. `preference_lift_pct >= +1.0%`
  2. `graduation_drop_pct <= 50%`

If lift is positive but graduation drops too far, tune `GRADATA_BETA_LB_THRESHOLD` down from 0.70 and re-run. Source: `.tmp/autoresearch-synthesis.md` §5 / §6.
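The two-part rule expressed as a predicate, with thresholds taken directly from the criteria above:

```python
def should_default_gate_on(preference_lift_pct: float,
                           graduation_drop_pct: float) -> bool:
    # Both conditions must hold: meaningful lift AND acceptable graduation cost
    return preference_lift_pct >= 1.0 and graduation_drop_pct <= 50.0
```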

Files

  • `brain/scripts/ablation_beta_lb_gate.py` (~500 LOC)
  • `brain/scripts/README-ablation-beta-lb.md` (context, usage, cost, decision rule)
  • `tests/test_ablation_beta_lb_gate.py` (6 tests, all mocked — never hits API)

Test plan

  • `pytest tests/test_ablation_beta_lb_gate.py -xvs` — 6/6 pass
  • Dry-run CLI prints estimate, exits 0, makes zero LLM calls
  • `pytest tests/test_ablation.py tests/test_beta_scoring.py tests/test_ablation_beta_lb_gate.py` — 38/38 pass
  • Oliver runs `GRADATA_ABLATION_CONFIRM=1 ... --tasks 10` after merge

Generated with Gradata

Stages a small, manual-kickoff A/B harness to measure the Beta lower-
bound promotion gate shipped in PR #86. Does not run the experiment —
Oliver runs it with GRADATA_ABLATION_CONFIRM=1 when he wants a signal.

- brain/scripts/ablation_beta_lb_gate.py: synthetic 20-lesson brain,
  graduation simulation under gate on/off, Sonnet generate + Haiku judge,
  writes .tmp/ablation_beta_lb_<ts>.json + human summary.
- brain/scripts/README-ablation-beta-lb.md: context, run commands, cost
  table, decision criteria (pref-lift >= +1.0% AND grad-drop <= 50%).
- tests/test_ablation_beta_lb_gate.py: dry-run zero-LLM-call proof,
  gate discriminates on synthetic pool, env-var restore, output schema.

Safety gate: without GRADATA_ABLATION_CONFIRM=1 the script prints the
trial count + token + dollar estimate and exits 0. Dry-run is verified
by a test that raises AssertionError on any client-factory access.

No changes to production code — harness PR only.

@greptile-apps Bot left a comment
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough
  • Ablation harness for Beta lower-bound gate: Pilot experiment to measure the effect of the GRADATA_BETA_LB_GATE introduced in PR #86, with no production-code changes
  • Two-condition experiment: Compares baseline (gate OFF) vs gate ON on seeded synthetic brain with ~20 PATTERN-tier lessons
  • Dry-run by default: Prints cost estimate and exits without LLM calls; requires GRADATA_ABLATION_CONFIRM=1 environment variable for actual execution
  • Measured outputs: Graduation-rate delta (PATTERN → RULE), preference lift (Sonnet vs Haiku quality judgments), and per-lesson decision traces written to JSON
  • Decision criteria: Default gate ON if preference_lift_pct ≥ +1.0% AND graduation_drop_pct ≤ 50%
  • New test suite: 6 mocked tests in tests/test_ablation_beta_lb_gate.py verifying dry-run safety, cost estimation, simulation accuracy, environment isolation, and output schema
  • No breaking changes or public API additions
  • Includes documentation: README explaining harness operation and cost estimates (~$1 for 10 tasks × 2 iterations)

Walkthrough

A new comprehensive pytest suite (test_ablation_beta_lb_gate.py) has been added to test the brain/scripts/ablation_beta_lb_gate.py module. The test suite covers CLI behavior, cost estimation logic, simulation outcomes, environment variable handling, output structure validation, and Anthropic client integration with mocked API calls.

Changes

New Test Suite: tests/test_ablation_beta_lb_gate.py
Comprehensive test coverage for the ablation Beta LB gate harness, including CLI dry-run behavior verification, cost-estimation scaling, graduation outcomes with the gate on vs. off, environment-variable preservation, output-structure validation (required keys and substructure), and format-summary verification. Uses monkeypatching to stub the Anthropic client and prevent real API access.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

feature

🚥 Pre-merge checks: ✅ 3 passed

  • Title check ✅ Passed: The title clearly and specifically references the main change (a pilot ablation harness for GRADATA_BETA_LB_GATE), aligning with the primary changeset addition.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, explaining the purpose, functionality, usage, test plan, and decision criteria for the ablation harness.
  • Docstring coverage ✅ Passed: No functions found in the changed files; docstring coverage check skipped.


@coderabbitai coderabbitai Bot added the feature label Apr 15, 2026
@Gradata Gradata merged commit 5cc6589 into main Apr 15, 2026
16 of 17 checks passed
@Gradata Gradata deleted the exp/beta-lb-ablation-harness branch April 17, 2026 19:45
