feat(wiring): canary + rules.injected + scipy Beta PPF + Beta LB gate #86
Compound wiring fix derived from the autoresearch synthesis
(.tmp/autoresearch-synthesis.md §1-§2). Four independent recommendations
from three separate reports collapse into one PR.
## Changes
### rules/rule_engine.py:504 — scipy-backed Beta PPF
Replace normal approximation in `_beta_ppf_05` with `scipy.stats.beta.ppf`
when scipy is available; fall back to the existing approximation otherwise.
Closes the known small-sample bias (α+β < 10) that affects ~40% of
PATTERN-tier rules. Ship-alongside since scipy is already in `dev` extras.
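A minimal sketch of the prefer-scipy, fall-back-otherwise pattern described above. The function names and the exact form of the normal approximation are assumptions for illustration; the real `_beta_ppf_05` in `rules/rule_engine.py` may differ.

```python
import math

try:
    # scipy is optional; when present it provides the exact Beta quantile.
    from scipy.stats import beta as _scipy_beta
except ImportError:
    _scipy_beta = None

_Z_05 = 1.6448536269514722  # standard normal quantile for a one-sided 5% tail


def normal_approx_ppf_05(alpha: float, beta_param: float) -> float:
    """Hypothetical fallback: 5th percentile via mean - z * stddev, clamped to [0, 1]."""
    total = alpha + beta_param
    mean = alpha / total
    var = (alpha * beta_param) / (total * total * (total + 1.0))
    return min(1.0, max(0.0, mean - _Z_05 * math.sqrt(var)))


def beta_ppf_05(alpha: float, beta_param: float) -> float:
    """5th percentile of Beta(alpha, beta_param); exact when scipy is importable."""
    if _scipy_beta is not None:
        return float(_scipy_beta.ppf(0.05, alpha, beta_param))
    return normal_approx_ppf_05(alpha, beta_param)
```

The two paths diverge most when α+β is small, which is the regime the PR targets; for large counts the approximation and the exact quantile converge.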
### enhancements/self_improvement.py — Beta LB gate on RULE promotion
New `_passes_beta_lb_gate(lesson)` called in the PATTERN→RULE promotion
condition. Gate is OPT-IN via `GRADATA_BETA_LB_GATE=1` (default off) to
preserve v4-ablation calibration. When enabled, requires:
- `fire_count >= GRADATA_BETA_LB_MIN_FIRES` (default 5), and
- `_beta_ppf_05(α, β) >= GRADATA_BETA_LB_THRESHOLD` (default 0.70)
Targets the min2022 random-label control failure: ~15–20% of current
RULE-tier graduations pass on format, not content.
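The gate logic above can be sketched as follows. This is an illustration under stated assumptions, not the shipped `_passes_beta_lb_gate`: the real helper takes a `lesson`, whereas this version takes the counts directly, treats only the literal value `1` as "enabled", and uses a stand-in lower-bound function (`_lower_bound_05`) so the block is self-contained.

```python
import math
import os


def _lower_bound_05(alpha: float, beta_param: float) -> float:
    """Stand-in 5% lower bound (normal approximation); swap in an exact PPF if available."""
    total = alpha + beta_param
    mean = alpha / total
    var = (alpha * beta_param) / (total * total * (total + 1.0))
    return min(1.0, max(0.0, mean - 1.6448536269514722 * math.sqrt(var)))


def passes_beta_lb_gate(fire_count, alpha, beta_param, env=None, ppf=_lower_bound_05):
    """Return True when promotion may proceed.

    With the gate disabled (the default) every lesson passes, so existing
    promotion behaviour is unchanged.
    """
    env = os.environ if env is None else env
    if env.get("GRADATA_BETA_LB_GATE") != "1":
        return True  # opt-in gate is off
    min_fires = int(env.get("GRADATA_BETA_LB_MIN_FIRES", "5"))
    threshold = float(env.get("GRADATA_BETA_LB_THRESHOLD", "0.70"))
    if fire_count < min_fires:
        return False
    return ppf(alpha, beta_param) >= threshold
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment.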
### _core.py:680 — wire GRADUATION → promote_to_canary
Every fresh RULE graduation now enrolls the lesson's category in canary
state. `promote_to_canary(category, session, db_path)` closes the wiring
audit §3 gap where `enhancements/rule_canary.py` was shipped but never
called from runtime. Best-effort — graduation never fails if the canary
table is unavailable.
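One way to read "best-effort" here is a wrapper that swallows any canary failure so graduation continues. The sketch below assumes that shape; `enroll_canary_best_effort` is a hypothetical name, and the `promote` callable stands in for `promote_to_canary(category, session, db_path)`.

```python
import logging

logger = logging.getLogger(__name__)


def enroll_canary_best_effort(promote, category, session, db_path):
    """Call the canary enrollment function; never let its failure propagate.

    Returns True if enrollment succeeded, False if it was skipped or failed.
    """
    try:
        promote(category, session, db_path)
        return True
    except Exception as exc:  # e.g. the canary table is missing
        logger.debug("canary enrollment skipped for %s: %s", category, exc)
        return False
```

The caller can ignore the return value entirely; it exists mainly so tests can distinguish the two paths.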
### _core.py:end_session — canary health sweep
Before `SESSION_END` emits, iterate RULE-tier lessons and call
`check_canary_health(category, session)`. Recommendations:
- PROMOTE (0 corrections in CANARY_SESSIONS) → `promote_to_active`
- ROLLBACK (1+ corrections) → `rollback_rule`
Closes the wiring audit §3 "canary is built but architecturally bypassed"
finding. Implementation is best-effort and per-category-deduped.
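The sweep described above might look roughly like this. Several details are assumptions: lessons are modelled as plain dicts, `check_health` is assumed to return the strings `"PROMOTE"` or `"ROLLBACK"` (the real `check_canary_health` return type is not shown in this PR text), and the callables are injected with simplified signatures rather than imported from `enhancements/rule_canary.py`.

```python
import logging

logger = logging.getLogger(__name__)


def sweep_canaries(lessons, session, check_health, promote_to_active, rollback_rule):
    """Check each RULE-tier category once and act on the recommendation."""
    seen = set()   # per-category dedup: several lessons can share a category
    actions = {}
    for lesson in lessons:
        if lesson.get("state") != "RULE":
            continue
        category = lesson.get("category")
        if not category or category in seen:
            continue
        seen.add(category)
        try:
            recommendation = check_health(category, session)
            if recommendation == "PROMOTE":
                promote_to_active(category)
                actions[category] = "promoted"
            elif recommendation == "ROLLBACK":
                rollback_rule(category, "corrections during canary window")
                actions[category] = "rolled_back"
        except Exception as exc:  # best-effort: never block session end
            logger.debug("canary sweep skipped for %s: %s", category, exc)
    return actions
```

Any other recommendation value leaves the category untouched, so a canary still inside its observation window is simply checked again at the next session end.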
### brain.py:apply_brain_rules — rules.injected + bus wiring
Pass `self.bus` into `apply_rules()` / `apply_rules_with_tree()` so
`rule_scoped_out` events fire in production (wiring audit §6B). Emit
`rules.injected` after `applied` is computed so
`SessionHistory.compute_effectiveness()` starts returning real data
instead of {} (wiring audit §4 — subscriber existed, emitter didn't).
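To illustrate the emitter/subscriber relationship, here is a small sketch with a toy bus. The `EventBus` class, `emit_rules_injected` helper, and the dict shape of each rule are hypothetical; only the event name `rules.injected` and the payload fields (id, category, confidence, state, scope, task) come from the PR text.

```python
class EventBus:
    """Toy in-process bus, standing in for the project's real bus."""

    def __init__(self):
        self._handlers = {}

    def on(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    def emit(self, name, payload):
        for handler in self._handlers.get(name, []):
            handler(payload)


def emit_rules_injected(bus, applied, scope, task):
    """Emit `rules.injected` once `applied` is known; failures are swallowed."""
    if bus is None:
        return False
    try:
        payload = {
            "rules": [
                {
                    "id": rule["id"],
                    "category": rule["category"],
                    "confidence": rule["confidence"],
                    "state": rule["state"],
                }
                for rule in applied
            ],
            "scope": scope,
            "task": task,
        }
        bus.emit("rules.injected", payload)
        return True
    except Exception:
        return False  # rule application must not fail because of telemetry
```

Without an emitter, a subscriber registered via `bus.on("rules.injected", ...)` never runs, which matches the audit finding that the effectiveness computation had nothing to aggregate.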
## Why this corrects a leanness false-positive
The leanness audit flagged `rule_ranker.py` and `self_healing.py` as dead
code. The *reason* they're dead is this wiring gap: without
`rules.injected`, `SessionHistory` can't compute effectiveness, so the
ranker never gets live feedback. Wire the emit → both files become live.
Do not delete.
## Test plan
- [x] `pytest tests/test_wiring_compound.py` — 14 new tests pass
(Beta PPF shape, Beta LB gate on/off/thresholds/min-fires, canary
enrollment, rules.injected payload shape, end_session sweep no-crash)
- [x] `pytest tests/test_beta_scoring.py` — adjusted bias-measuring
assertion (> 0.8 → > 0.75) since scipy PPF is more accurate than
the normal approximation; statistical intent ("20/21 successes gives
high reliability") preserved
- [x] Full suite — 2561 pass, 24 skipped locally
## Follow-ups
- Measure Beta LB gate in ablation with `GRADATA_BETA_LB_GATE=1` before
defaulting on. Expected direction: tightens v4's +7.8% Sonnet lift by
blocking the ~15–20% false-RULE graduations the min2022 control found.
- BM25 rule ranking + Thompson sampling sit on this PR's `rules.injected`
emit (follow-up, not this PR).
Co-Authored-By: Gradata <noreply@gradata.ai>
Walkthrough
This PR integrates canary rollout management into the core system by adding canary enrollment during graduation transitions, health sweeps at session end, event bus wiring for rule injection notifications, and a Beta distribution lower-bound gate for PATTERN→RULE promotion. It also updates the Beta percentile computation to prefer SciPy when available and includes comprehensive test coverage for the new functionality.
Sequence Diagrams
sequenceDiagram
participant Brain
participant GraduationLogic
participant RuleCanary
participant Database as DB
Brain->>GraduationLogic: lesson.state = PATTERN→RULE transition
GraduationLogic->>RuleCanary: promote_to_canary(category, session, db_path)
RuleCanary->>Database: INSERT/UPDATE rule_canary table
Database-->>RuleCanary: acknowledgement
GraduationLogic->>Brain: emit lesson.graduated event
sequenceDiagram
participant Brain
participant SessionEnd as Session End
participant RuleCanary
participant Database as DB
Brain->>SessionEnd: brain_end_session()
SessionEnd->>RuleCanary: iterate RULE-state lessons by category
RuleCanary->>RuleCanary: check_canary_health(category, current_session, db_path)
RuleCanary->>Database: query canary metrics & session counts
Database-->>RuleCanary: health metrics
alt Health Recommendation
RuleCanary->>RuleCanary: promote_to_active(category, db_path)
RuleCanary->>Database: UPDATE rule_canary status
else Rollback Required
RuleCanary->>RuleCanary: rollback_rule(category, reason, db_path)
RuleCanary->>Database: DELETE/UPDATE rule_canary & lessons
end
SessionEnd->>Brain: return session result
sequenceDiagram
participant Brain
participant ApplyRules as apply_brain_rules()
participant RuleEngine
participant EventBus as Event Bus
participant Listener
Brain->>ApplyRules: capture bus instance
ApplyRules->>RuleEngine: apply_rules_with_tree(event_bus=bus) or apply_rules(bus=bus)
RuleEngine-->>ApplyRules: injected rules metadata
ApplyRules->>EventBus: emit rules.injected event with rule payload + scope + task
EventBus->>Listener: dispatch event to registered observer
Listener-->>EventBus: acknowledgement
ApplyRules->>Brain: return formatted result
Estimated code review effort: 4 (Complex), ~45 minutes
#88 landed at the same time as #86 and #87 shipping from a parallel session. The merges didn't conflict line-wise, but the diffs overlap:

- brain.py:apply_brain_rules: #86 already wired `rules.injected` with a richer payload (id + category + confidence + state + scope) and a try/except guard. #88 added a second, thinner emit after the cache.put. Result: double-fire on fresh compute. Harmless in practice, since SessionHistory dedups via a set, but clearly wrong. Removing #88's emit, keeping #86's.
- .gitignore: #87 already added `/cloud/` and `/sdk/`. #88's re-adds are duplicates. Removing; keeping `/railway.toml` and `apollo-leads-*.csv`, which are genuinely new from #88.

The regression test in tests/test_session_history.py stays: it asserts the emit fires end-to-end from a real Brain + correct() loop, complementing #86's test_wiring_compound.py coverage of payload shape. Both pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…test LOC) (#90)

Deletes dead code flagged in the autoresearch leanness audit after grep-verifying that no runtime import path exists. All 2453 tests pass.

Source files removed (2101 LOC):
- src/gradata/contrib/enhancements/outcome_feedback.py (1 LOC, docstring stub)
- src/gradata/enhancements/super_meta_rules.py (197 LOC, no importers; SuperMetaRule dataclass + SQL table live in meta_rules.py / meta_rules_storage.py and remain wired)
- src/gradata/enhancements/pubsub_pipeline.py (49 LOC, test-only)
- src/gradata/rules/budget.py (43 LOC, test-only)
- src/gradata/rules/rw_lock.py (54 LOC, test-only)
- src/gradata/cloud/wiki_store.py (451 LOC, only cloud/__init__.py re-export + test)
- src/gradata/enhancements/rule_verifier.py (243 LOC, only manifest string + test reference)
- src/gradata/enhancements/rule_evolution.py (434 LOC, only manifest string + test references; contradiction_detector.py covers the live path via self_improvement.py:545)
- src/gradata/security/privacy_model.py (113 LOC, test + docs only; _core.py / brain.py / _export_brain.py grep-clean)
- src/gradata/benchmarks/swe_bench.py (516 LOC, docstring example + test only, no CLI/docs runtime reference)

Test files removed (1042 LOC): matching tests for each module plus targeted pruning of rule_evolution test classes (TestRuleConflicts, TestRuleRelationEnum, rule_evolution imports in TestIntegration) from tests/test_steals.py and the TestRuleABTesting block in tests/test_adaptations.py.

Registry + docstring updates:
- contrib/enhancements/install_manifest.py: drop rule_verifier from rule-integrity module components
- _manifest_helpers.py: drop rule_evolution from _core_modules
- enhancements/__init__.py: drop rule_verifier docstring line
- cloud/__init__.py: drop WikiStore lazy re-export
- enhancements/meta_rules_storage.py: docstring no longer points at the deleted super_meta_rules.py

NOT DELETED (verified live via PRs #77/#81/#86):
- enhancements/rule_ranker.py, self_healing.py, rule_canary.py, rule_to_hook.py (all have runtime callers)
- middleware/ (flagged empty in the audit but actually contains _core.py + 4 adapters; kept)
- src/gradata/graphify-out/ (did not exist in this tree)

Tests: 2453 passed, 24 skipped (test_integration_full.py ignored per task spec).

Co-authored-by: Gradata <noreply@gradata.ai>
Stages a small, manual-kickoff A/B harness to measure the Beta lower-bound promotion gate shipped in PR #86. Does not run the experiment: Oliver runs it with GRADATA_ABLATION_CONFIRM=1 when he wants a signal.

- brain/scripts/ablation_beta_lb_gate.py: synthetic 20-lesson brain, graduation simulation under gate on/off, Sonnet generate + Haiku judge, writes .tmp/ablation_beta_lb_<ts>.json + human summary.
- brain/scripts/README-ablation-beta-lb.md: context, run commands, cost table, decision criteria (pref-lift >= +1.0% AND grad-drop <= 50%).
- tests/test_ablation_beta_lb_gate.py: dry-run zero-LLM-call proof, gate discriminates on synthetic pool, env-var restore, output schema.

Safety gate: without GRADATA_ABLATION_CONFIRM=1 the script prints the trial count + token + dollar estimate and exits 0. Dry-run is verified by a test that raises AssertionError on any client-factory access.

No changes to production code; harness PR only.
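The safety gate and decision criteria above can be sketched as two small functions. Everything here is illustrative: the function names, parameters, and the cost arithmetic are assumptions, not the contents of `ablation_beta_lb_gate.py`; only the env var name and the two thresholds come from the commit message.

```python
import os
import sys


def confirm_or_estimate(n_trials, tokens_per_trial, usd_per_1k_tokens, env=None, out=None):
    """Return True only when the run is explicitly confirmed.

    Otherwise print a cost estimate and return False, so the caller can
    exit 0 without constructing any LLM client.
    """
    env = os.environ if env is None else env
    out = sys.stdout if out is None else out
    if env.get("GRADATA_ABLATION_CONFIRM") == "1":
        return True
    total_tokens = n_trials * tokens_per_trial
    est_usd = total_tokens / 1000.0 * usd_per_1k_tokens
    print(f"dry run: {n_trials} trials, ~{total_tokens} tokens, ~${est_usd:.2f}", file=out)
    return False


def should_default_gate_on(pref_lift_pct, grad_drop_pct):
    """Decision rule from the README: pref-lift >= +1.0% AND grad-drop <= 50%."""
    return pref_lift_pct >= 1.0 and grad_drop_pct <= 50.0
```

Keeping the confirmation check ahead of any client construction is what lets a test prove the dry run makes zero LLM calls.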
## Summary
Compound wiring PR from the 2026-04-15 autoresearch synthesis (`.tmp/autoresearch-synthesis.md`). Four recommendations from three independent audit reports collapse into five small, contained edits.

Closes wiring gaps:
- `_core.py:680` now calls `promote_to_canary` on every RULE graduation
- `end_session` now checks each RULE-tier canary; promotes/rolls back per `check_canary_health`
- `rules.injected`: emitted from `brain.apply_brain_rules` so `SessionHistory.compute_effectiveness` starts returning real data (subscriber existed, emitter didn't)
- `apply_rules` / `apply_rules_with_tree` now receive the bus, so `rule_scoped_out` actually fires in production

Algo-gaps shipped:
- `_beta_ppf_05`: closes small-sample bias in the ~40% of PATTERN-tier rules with α+β < 10
- `Beta.ppf(0.05, α, β) ≥ 0.70` AND `fire_count ≥ 5`. Off by default to preserve v4 ablation calibration; flip via `GRADATA_BETA_LB_GATE=1`

## Why one PR, not five
Three separate audit reports identified these as independent issues. Cross-referencing them shows they share change sites and the fixes compose:
- `_core.py:680` (GRADUATION emit point)
- `rules.injected` emission is the unlock for `SessionHistory.compute_effectiveness`, which is the unlock for `rule_ranker`'s live effectiveness scores. So the leanness audit's "delete rule_ranker" recommendation was a false positive; wire, don't delete

Full synthesis with cross-report compound analysis in `.tmp/autoresearch-synthesis.md`.

## Changes (6 files, +386 / -8)
| File | Change |
| --- | --- |
| `rules/rule_engine.py` | `_beta_ppf_05` uses scipy when available, falls back to normal approx |
| `enhancements/self_improvement.py` | `_passes_beta_lb_gate` helper; gate wired into PATTERN→RULE promotion |
| `_core.py` | `promote_to_canary` call after GRADUATION emit (RULE only); canary health sweep before SESSION_END |
| `brain.py` | `apply_brain_rules` passes `self.bus` to apply_rules; emits `rules.injected` with rule ids + scope + task |
| `tests/test_wiring_compound.py` | |
| `tests/test_beta_scoring.py` | |

## Test plan
- `pytest tests/test_wiring_compound.py`: 14 new tests pass
- test (3.11)/(3.12)/(3.13)

## Follow-ups (not in this PR)
- `GRADATA_BETA_LB_GATE=1` in ablation before defaulting on
- `rules.injected` emit this PR adds
- `.tmp/autoresearch-synthesis.md` §4 (~1,460 LOC across 9 files): separate hygiene PR

Co-Authored-By: Gradata <noreply@gradata.ai>