
exp(ablation): pilot harness for GRADATA_BETA_LB_GATE #92

Merged
Gradata merged 1 commit into main from exp/beta-lb-ablation-harness on Apr 15, 2026

Conversation

@Gradata (Owner) commented Apr 15, 2026

Summary

Stages a pilot ablation harness to measure the effect of the Beta lower-bound promotion gate shipped in PR #86 (self_improvement._passes_beta_lb_gate). Default-OFF until we have an in-band signal; this is the tool that produces that signal.

No production code changes. Harness + README + tests only.

What the harness does

Runs two conditions on the same seeded synthetic brain (~20 PATTERN-tier lessons with varied (alpha, beta_param) so some would be blocked by the gate, some wouldn't):

  • A (baseline): GRADATA_BETA_LB_GATE=0
  • B (gate on): GRADATA_BETA_LB_GATE=1
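Conceptually, the two-condition loop looks like the sketch below. The gate semantics (a lower quantile of Beta(alpha, beta_param) checked against a 0.70 threshold), the helper names, and the synthetic pool are illustrative assumptions, not the harness's actual code:

```python
import random

def beta_lower_bound(alpha: float, beta: float, n: int = 2000, q: float = 0.05) -> float:
    # Monte Carlo estimate of the q-quantile of Beta(alpha, beta)
    draws = sorted(random.betavariate(alpha, beta) for _ in range(n))
    return draws[int(q * n)]

def passes_gate(alpha: float, beta: float, threshold: float = 0.70) -> bool:
    # Assumed gate semantics: promote only if the Beta lower bound clears the threshold
    return beta_lower_bound(alpha, beta) >= threshold

def graduations(lessons, gate_on: bool) -> int:
    # Baseline (gate off) promotes every lesson; the gated condition filters by the bound
    if not gate_on:
        return len(lessons)
    return sum(passes_gate(a, b) for a, b in lessons)

random.seed(7)  # seeded so both conditions see the same synthetic pool
pool = [(20, 2), (5, 5), (30, 1), (2, 8)]  # toy (alpha, beta_param) pairs
baseline = graduations(pool, gate_on=False)  # condition A
gated = graduations(pool, gate_on=True)      # condition B
```

With this toy pool, the confident lessons (high alpha, low beta_param) clear the gate and the uncertain ones are blocked, which is exactly the discrimination the real synthetic brain is seeded to exercise.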

Measures:

  • Graduation-rate delta (PATTERN -> RULE promotions blocked by the gate)
  • Preference lift (Sonnet generations scored by Haiku judge on per-task quality)
  • Per-lesson decision trace (which specific lessons promote/block under each condition)

Writes .tmp/ablation_beta_lb_<timestamp>.json + human summary.
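A minimal sketch of the result-writing step; the key names below are illustrative placeholders, not the harness's actual output schema:

```python
import json
import time
from pathlib import Path

def write_results(results: dict, out_dir: str = ".tmp") -> Path:
    # Timestamped JSON artifact, mirroring .tmp/ablation_beta_lb_<timestamp>.json
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"ablation_beta_lb_{int(time.time())}.json"
    path.write_text(json.dumps(results, indent=2))
    return path

path = write_results({
    "preference_lift_pct": 1.8,   # example values only
    "graduation_drop_pct": 40.0,
    "per_lesson": [{"lesson_id": "p-001", "baseline": "promote", "gated": "block"}],
})
print(f"{path.name}: lift +1.8%, graduation drop 40%")  # human-readable summary line
```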

How to run the pilot

```bash
# Dry run (safe, zero API calls):
python brain/scripts/ablation_beta_lb_gate.py --tasks 10 --iterations 2

# Actually run:
GRADATA_ABLATION_CONFIRM=1 python brain/scripts/ablation_beta_lb_gate.py --tasks 10 --iterations 2
```

Estimated cost: ~$1 for 10 tasks x 2 iterations (20 trials = 40 Sonnet gens + 20 Haiku judges). Dry-run prints a precise estimate for any --tasks / --iterations combo.
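The arithmetic behind that estimate, as a sketch. The per-call prices are illustrative assumptions chosen to reproduce the ~$1 figure, not published rates:

```python
def estimate(tasks: int, iterations: int,
             sonnet_cost: float = 0.02, haiku_cost: float = 0.01) -> dict:
    trials = tasks * iterations
    sonnet_gens = trials * 2   # one generation per condition (A and B)
    haiku_judges = trials      # one pairwise judgment per trial
    return {
        "trials": trials,
        "sonnet_gens": sonnet_gens,
        "haiku_judges": haiku_judges,
        "usd": sonnet_gens * sonnet_cost + haiku_judges * haiku_cost,
    }

est = estimate(tasks=10, iterations=2)
# 20 trials -> 40 Sonnet gens + 20 Haiku judges, $1.00 at the assumed prices
```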

Safety gate

Without `GRADATA_ABLATION_CONFIRM=1`, the script runs a dry-run only — prints trial count / token estimate / dollar estimate, exits 0, makes zero LLM calls. Enforced by a test that monkey-patches `_make_anthropic_client` to raise on any access, then invokes `main()` and asserts it still exits 0.
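The confirm-gate pattern described above, reduced to a self-contained sketch; `run_experiment` and the printed estimate are stand-ins for the real script's logic:

```python
import os

def run_experiment() -> int:
    # Stand-in: the real script would build a client and make LLM calls here
    raise RuntimeError("would make LLM calls")

def main() -> int:
    if os.environ.get("GRADATA_ABLATION_CONFIRM") != "1":
        # Dry run: print estimates, exit 0, never touch the client factory
        print("dry run: 20 trials, ~$1.00 estimated; "
              "set GRADATA_ABLATION_CONFIRM=1 to execute")
        return 0
    return run_experiment()

os.environ.pop("GRADATA_ABLATION_CONFIRM", None)  # ensure the dry-run path
rc = main()
```

Because the dry-run branch returns before any client is constructed, a test that makes the client factory raise on access can still assert a clean exit.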

Decision criteria (when to default the gate ON)

Default gate ON when both hold on the pilot:

  1. `preference_lift_pct >= +1.0%`
  2. `graduation_drop_pct <= 50%`

If lift is positive but graduation drops too far, tune `GRADATA_BETA_LB_THRESHOLD` down from 0.70 and re-run. Source: `.tmp/autoresearch-synthesis.md` §5 / §6.
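The two-part rule expressed as a predicate, with thresholds taken directly from the criteria above:

```python
def should_default_gate_on(preference_lift_pct: float,
                           graduation_drop_pct: float) -> bool:
    # Both conditions must hold: meaningful lift AND acceptable graduation cost
    return preference_lift_pct >= 1.0 and graduation_drop_pct <= 50.0
```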

Files

  • `brain/scripts/ablation_beta_lb_gate.py` (~500 LOC)
  • `brain/scripts/README-ablation-beta-lb.md` (context, usage, cost, decision rule)
  • `tests/test_ablation_beta_lb_gate.py` (6 tests, all mocked — never hits API)

Test plan

  • `pytest tests/test_ablation_beta_lb_gate.py -xvs` — 6/6 pass
  • Dry-run CLI prints estimate, exits 0, makes zero LLM calls
  • `pytest tests/test_ablation.py tests/test_beta_scoring.py tests/test_ablation_beta_lb_gate.py` — 38/38 pass
  • Oliver runs `GRADATA_ABLATION_CONFIRM=1 ... --tasks 10` after merge

Generated with Gradata

Stages a small, manual-kickoff A/B harness to measure the Beta lower-
bound promotion gate shipped in PR #86. Does not run the experiment —
Oliver runs it with GRADATA_ABLATION_CONFIRM=1 when he wants a signal.

- brain/scripts/ablation_beta_lb_gate.py: synthetic 20-lesson brain,
  graduation simulation under gate on/off, Sonnet generate + Haiku judge,
  writes .tmp/ablation_beta_lb_<ts>.json + human summary.
- brain/scripts/README-ablation-beta-lb.md: context, run commands, cost
  table, decision criteria (pref-lift >= +1.0% AND grad-drop <= 50%).
- tests/test_ablation_beta_lb_gate.py: dry-run zero-LLM-call proof,
  gate discriminates on synthetic pool, env-var restore, output schema.

Safety gate: without GRADATA_ABLATION_CONFIRM=1 the script prints the
trial count + token + dollar estimate and exits 0. Dry-run is verified
by a test that raises AssertionError on any client-factory access.

No changes to production code — harness PR only.

@greptile-apps Bot left a comment
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough
  • Ablation harness for Beta lower-bound gate: Pilot experiment to measure the effect of the GRADATA_BETA_LB_GATE introduced in PR #86, with no production-code changes
  • Two-condition experiment: Compares baseline (gate OFF) vs gate ON on seeded synthetic brain with ~20 PATTERN-tier lessons
  • Dry-run by default: Prints cost estimate and exits without LLM calls; requires GRADATA_ABLATION_CONFIRM=1 environment variable for actual execution
  • Measured outputs: Graduation-rate delta (PATTERN → RULE), preference lift (Sonnet vs Haiku quality judgments), and per-lesson decision traces written to JSON
  • Decision criteria: Default gate ON if preference_lift_pct ≥ +1.0% AND graduation_drop_pct ≤ 50%
  • New test suite: 6 mocked tests in tests/test_ablation_beta_lb_gate.py verifying dry-run safety, cost estimation, simulation accuracy, environment isolation, and output schema
  • No breaking changes or public API additions
  • Includes documentation: README explaining harness operation and cost estimates (~$1 for 10 tasks × 2 iterations)

Walkthrough

A new comprehensive pytest suite (test_ablation_beta_lb_gate.py) has been added to test the brain/scripts/ablation_beta_lb_gate.py module. The test suite covers CLI behavior, cost estimation logic, simulation outcomes, environment variable handling, output structure validation, and Anthropic client integration with mocked API calls.

Changes

New Test Suite: tests/test_ablation_beta_lb_gate.py
Comprehensive test coverage for the ablation Beta LB gate harness, including CLI dry-run behavior verification, cost-estimation scaling, graduation outcomes with the gate on vs. off, environment-variable preservation, output-structure validation (required keys and substructure), and format-summary verification. Uses monkeypatching to stub the Anthropic client and prevent real API access.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

feature

🚥 Pre-merge checks: ✅ 3 passed

  • Title check ✅ Passed: The title clearly and specifically references the main change (a pilot ablation harness for GRADATA_BETA_LB_GATE), aligning with the primary changeset addition.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, explaining the purpose, functionality, usage, test plan, and decision criteria for the ablation harness.
  • Docstring coverage ✅ Passed: No functions found in the changed files; docstring coverage check skipped.


@coderabbitai coderabbitai Bot added the feature label Apr 15, 2026
@Gradata Gradata merged commit 5cc6589 into main Apr 15, 2026
16 of 17 checks passed
@Gradata Gradata deleted the exp/beta-lb-ablation-harness branch April 17, 2026 19:45
