feat: Sim 9 engine hardening — rule scoping, convergence gate, efficiency metric #15
Conversation
…gory tracking

Replaces naive first-half/second-half comparison with Mann-Kendall statistical trend test (pure Python, no scipy). Now returns:

- p_value for trend significance
- by_category breakdown (each category's own convergence curve)
- Proper handling of ties and noisy data

Driven by MiroFish Sim 9 consensus: 200 personas unanimously recommended Mann-Kendall over the half-split approach.

Co-Authored-By: Gradata <noreply@gradata.ai>
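For context on the statistic this commit adopts, here is a minimal pure-Python sketch of the Mann-Kendall idea. It is illustrative only; the PR's `_mann_kendall` additionally corrects for ties and derives a p-value from a normal-CDF approximation.

```python
import math

def mann_kendall_sketch(counts: list[int]) -> str:
    """Classify a series as increasing / decreasing / no_trend (illustrative)."""
    n = len(counts)
    if n < 3:
        return "no_trend"  # too few sessions to call a trend
    # S statistic: net count of rising vs. falling pairs (j > i)
    s = sum(
        (counts[j] > counts[i]) - (counts[j] < counts[i])
        for i in range(n) for j in range(i + 1, n)
    )
    if s == 0:
        return "no_trend"
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S, no tie correction
    z = (s - math.copysign(1, s)) / math.sqrt(var_s)  # continuity correction
    if z > 1.96:  # two-sided 5% significance
        return "increasing"
    if z < -1.96:
        return "decreasing"
    return "no_trend"
```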
… estimates

Co-Authored-By: Gradata <noreply@gradata.ai>
📝 Walkthrough

The changes implement convergence-aware behavioral instruction gating, statistical Mann–Kendall trend testing, domain-scoped rule disabling via misfire rate thresholds, and a new efficiency metric API to quantify effort savings across learning sessions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Brain
participant RuleEngine as Rule Engine
participant EventBus as Event Bus
participant Core as _core.py
Client->>Brain: correct(correction, category)
activate Brain
Brain->>Brain: _get_convergence()
activate Brain
alt Cached & same session
Brain-->>Brain: return cached result
else Cache miss or session changed
Brain->>Core: brain_convergence(self)
activate Core
Core->>Core: _mann_kendall() for trend
Core-->>Brain: {trend, p_value, by_category}
deactivate Core
Brain->>Brain: store in _convergence_cache
end
deactivate Brain
alt Category trend is "converged"
Brain->>Brain: skip extract_behavioral_instruction()
Brain->>Brain: use primary.description
else Trend is "diverging" or "converging"
Brain->>Brain: call extract_behavioral_instruction()
end
Brain->>RuleEngine: apply_rules(..., domain)
activate RuleEngine
RuleEngine->>RuleEngine: filter by is_rule_disabled_for_domain()
alt Rule disabled for domain
RuleEngine->>EventBus: emit "rule_scoped_out"
RuleEngine-->>EventBus: {domain, category, misfire_rate}
end
RuleEngine-->>Brain: eligible_lessons
deactivate RuleEngine
Brain->>Core: _attribute_domain_fires()
activate Core
Core->>Core: increment fires/misfires in domain_scores
deactivate Core
Brain-->>Client: lesson created/updated
deactivate Brain
```

```mermaid
sequenceDiagram
participant Client
participant Brain
participant Core as _core.py
Client->>Brain: efficiency(estimate_time=False)
activate Brain
Brain->>Brain: _get_convergence()
activate Brain
Brain->>Core: brain_convergence(self)
activate Core
Core-->>Brain: {by_category, ...}
deactivate Core
deactivate Brain
Brain->>Core: brain_efficiency(self)
activate Core
Core->>Core: extract first/last 3 sessions
Core->>Core: compute effort_ratio
alt estimate_time is True
Core->>Core: multiply ratio by _SEVERITY_SECONDS
Core->>Core: compute estimated_seconds_saved
end
Core-->>Brain: {effort_ratio, estimated_seconds_saved?, ...}
deactivate Core
Brain-->>Client: efficiency metrics dict
deactivate Brain
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
```python
by_category: dict[str, dict] = {}
for cat, session_counts in cat_by_session.items():
    cat_counts = [session_counts.get(s, 0) for s in sessions]
    cat_mk, cat_p = _mann_kendall(cat_counts)
    cat_trend = "converging" if cat_mk == "decreasing" else (
        "diverging" if cat_mk == "increasing" else "converged")
```
**Per-category convergence has no `insufficient_data` guard** (`src/gradata/_core.py`, lines 962–967)
At the top level, `brain_convergence` correctly emits `"insufficient_data"` when `len(counts) < 3`. But the per-category branch maps every `"no_trend"` result straight to `"converged"`:

```python
cat_trend = "converging" if cat_mk == "decreasing" else (
    "diverging" if cat_mk == "increasing" else "converged")
```

`cat_counts` is built with zeros for every session where the category had no corrections, so a category that appeared in only one or two sessions out of seven gets a list like `[0, 0, 0, 5, 0, 0, 0]`. Because this sequence has no monotonic trend, Mann-Kendall returns `"no_trend"` → `"converged"`. The convergence gate in `_core.py` then sees `trend == "converged"` and **skips LLM extraction for future corrections in that category**, even though the category has never meaningfully converged — it simply has sparse history.
Suggested fix:
if cat_mk == "decreasing":
cat_trend = "converging"
elif cat_mk == "increasing":
cat_trend = "diverging"
elif sum(cat_counts) < 3: # not enough non-zero data
cat_trend = "insufficient_data"
else:
cat_trend = "converged"The convergence gate already checks cat_convergence.get("trend") == "converged", so adding the "insufficient_data" branch here is a backward-compatible fix — categories with thin history will continue to get LLM extraction.
```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        scores = lesson.domain_scores.get(current_domain, {})
        bus.emit("rule_scoped_out", {
            "lesson_category": lesson.category,
            "lesson_description": lesson.description[:80],
            "domain": current_domain,
            "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
        })
```
**Unguarded `bus.emit` inside eligibility-filter loop — Rule 4 violation** (`src/gradata/rules/rule_engine.py`, lines 494–502)
The bus.emit("rule_scoped_out", ...) call is inside a for lesson in eligible loop with no try-except:
if is_rule_disabled_for_domain(lesson, current_domain):
if bus:
scores = lesson.domain_scores.get(current_domain, {})
bus.emit("rule_scoped_out", { ... })Per Rule 4, EventBus emit calls must be wrapped so a misbehaving subscriber cannot propagate exceptions to the caller. If the handler raises here, the for loop terminates early — rules that should have been kept in eligible are silently dropped, producing a truncated rule set with no warning.
Suggested change:

```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        try:
            scores = lesson.domain_scores.get(current_domain, {})
            bus.emit("rule_scoped_out", {
                "lesson_category": lesson.category,
                "lesson_description": lesson.description[:80],
                "domain": current_domain,
                "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
            })
        except Exception:
            _log.debug("rule_scoped_out emit failed")
```
```python
correction_desc = ""
if 'desc' in locals():
    correction_desc = desc
```
**`locals()` inspection is fragile for `desc`-variable lookup** (`src/gradata/_core.py`, lines 317–319)

`desc` is assigned inside the broad `try` block that ends at line 311. Whether it exists in the local namespace after the `except` depends on exactly how far execution got. Using `'desc' in locals()` to probe this is an anti-pattern — easy to misread and could silently yield an empty `correction_desc` if the assignment was skipped part-way through the block.
A cleaner approach is to initialise a dedicated variable before the try block:
_extracted_desc: str = "" # initialise before the try block
# ... inside the try, after desc is computed:
_extracted_desc = desc
# ... after the try block:
if brain._fired_rules and (category or classifications):
_attribute_domain_fires(brain, category or "UNKNOWN", _extracted_desc or summary)This makes the intent explicit and eliminates the locals() introspection.
```python
"""Tests for rule domain scoping — per-domain misfire tracking and auto-disable."""
from gradata._types import Lesson, LessonState
```
**Missing `from __future__ import annotations` (Rule 6)** (`tests/test_rule_scoping.py`, lines 1–2)

Both new test files added in this PR are missing the required first import. Per Rule 6, all `.py` files in the SDK should include `from __future__ import annotations` to enable `str | None` syntax consistently and maintain Pyright strict-mode compatibility.

This applies here (`tests/test_rule_scoping.py`) and also to `tests/test_convergence_gate.py` line 1.
| """Tests for rule domain scoping — per-domain misfire tracking and auto-disable.""" | |
| from gradata._types import Lesson, LessonState | |
| from __future__ import annotations | |
| """Tests for rule domain scoping — per-domain misfire tracking and auto-disable.""" |
Rule Used: Code Review Rules, Rule 1: Never use print() ... ([source](https://app.greptile.com/review/custom-context?memory=dee613fe-ca52-4382-b9d7-fad6d0b079ec))
```python
if estimate_time:
    corrections_avoided = max(0, (initial - recent) * len(counts))
```
**`corrections_avoided` multiplier includes the baseline period, overestimating savings** (`src/gradata/_core.py`, line 1031)

```python
corrections_avoided = max(0, (initial - recent) * len(counts))
```

This multiplies by the **total** session count, including the first three sessions used to establish `initial`. In a 20-session brain with `initial = 10`, `recent = 4`, the formula yields `6 × 20 = 120` — but sessions 1–3 were already running at the initial correction rate, so nothing was avoided there.
A more accurate (still approximate) estimate uses only post-baseline sessions:
```diff
-    corrections_avoided = max(0, (initial - recent) * len(counts))
+    corrections_avoided = max(0, (initial - recent) * max(0, len(counts) - 3))
```
This is still labelled "approximate" in the docstring, but avoids inflating the reported savings for long-running brains.
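A quick arithmetic check of the two formulas, using the hypothetical 20-session numbers from the comment above:

```python
initial, recent, n_sessions = 10, 4, 20
baseline_window = 3  # the first three sessions establish `initial`

old_estimate = max(0, (initial - recent) * n_sessions)        # 6 * 20 = 120
new_estimate = max(0, (initial - recent)
                   * max(0, n_sessions - baseline_window))    # 6 * 17 = 102
```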
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/gradata/_core.py (1)
1-7: 🧹 Nitpick | 🔵 Trivial — **File exceeds 500-line limit (1040 lines).**
As per coding guidelines, Python files should be kept under 500 lines. This file has grown to 1040 lines with the new additions. Consider splitting into focused modules:
- `_core_correct.py`: `brain_correct`, `_attribute_domain_fires`
- `_core_session.py`: `brain_end_session`
- `_core_convergence.py`: `brain_convergence`, `_mann_kendall`, `_normal_cdf`, `brain_efficiency`
- `_core_export.py`: export helpers

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/gradata/_core.py` around lines 1 - 7, The file src/gradata/_core.py is too large; split it into focused modules and update imports: extract heavy correctness logic into a new _core_correct.py containing brain_correct and _attribute_domain_fires, extract session logic into _core_session.py with brain_end_session, extract convergence/statistics into _core_convergence.py with brain_convergence, _mann_kendall, _normal_cdf and brain_efficiency, and move export-related helpers into _core_export.py; in the original _core.py leave thin delegator functions or import-and-reexport these symbols (e.g., from ._core_correct import brain_correct, _attribute_domain_fires) so external callers and the Brain class continue to reference the same function names, and update any local imports/relative references to these functions/classes to use the new modules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6ba4e2b1-7dc5-4ce5-995b-286aa05b305e
📒 Files selected for processing (9)
src/gradata/_core.py, src/gradata/_types.py, src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, src/gradata/rules/rule_engine.py, tests/test_convergence.py, tests/test_convergence_gate.py, tests/test_efficiency.py, tests/test_rule_scoping.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*
📄 CodeRabbit inference engine (CLAUDE.md)
Always read a file before editing it. Always prefer editing over creating new files.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, src/gradata/rules/rule_engine.py, tests/test_convergence_gate.py, src/gradata/_types.py, src/gradata/_core.py
**/*.{py,sh,js,ts,json,env*}
📄 CodeRabbit inference engine (CLAUDE.md)
Never hardcode secrets. Never commit .env files. Pre-commit hook blocks both.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, src/gradata/rules/rule_engine.py, tests/test_convergence_gate.py, src/gradata/_types.py, src/gradata/_core.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Keep files under 500 lines. Validate input at system boundaries.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, src/gradata/rules/rule_engine.py, tests/test_convergence_gate.py, src/gradata/_types.py, src/gradata/_core.py
src/gradata/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use AGPL-3.0 licensing. Source code resides in src/gradata/.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, src/gradata/rules/rule_engine.py, src/gradata/_types.py, src/gradata/_core.py
⚙️ CodeRabbit configuration file
src/gradata/**/*.py: This is the core SDK. Check for: type safety (`from __future__ import annotations` required), no print() statements (use logging), all functions accepting BrainContext where DB access occurs, no hardcoded paths. Severity scoring must clamp to [0,1]. Confidence values must be in [0.0, 1.0].
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, src/gradata/rules/rule_engine.py, src/gradata/_types.py, src/gradata/_core.py
tests/**
⚙️ CodeRabbit configuration file
tests/**: Test files. Verify: no hardcoded paths, assertions check specific values not just truthiness, parametrized tests preferred for boundary conditions, floating point comparisons use pytest.approx.
Files:
tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, tests/test_convergence_gate.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: Gradata/gradata PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T15:53:08.573Z
Learning: Correction pipeline: correct → extract behavioral instruction → graduate (INSTINCT→PATTERN→RULE) → inject into next session. Meta-rules emerge when patterns cluster across domains.
🔇 Additional comments (13)
src/gradata/_types.py (1)

125-125: LGTM! The new `domain_scores` field is well-typed with proper default factory initialization. The nested `dict[str, dict[str, int]]` structure appropriately models per-domain fire/misfire tracking.

src/gradata/brain.py (2)
344-351: LGTM — convergence caching logic is correct. The session-based caching correctly compares `self._convergence_session == self.session` before returning cached results, ensuring fresh data when the session changes.

353-360: LGTM! The `efficiency()` method properly delegates to `brain_efficiency` with the keyword-only `estimate_time` parameter.

src/gradata/rules/rule_engine.py (1)
419-430: LGTM — domain disabling logic is correct. The 30% misfire threshold with minimum 3 fires requirement provides reasonable statistical significance before disabling a rule for a domain.
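A minimal sketch of what such a check can look like, reconstructed from the threshold described here and the `misfire_rate` expression shown in the diff; the actual `is_rule_disabled_for_domain` at lines 419-430 may differ:

```python
def is_rule_disabled_for_domain(lesson, domain: str) -> bool:
    # Hypothetical reconstruction; see rule_engine.py for the real logic.
    scores = lesson.domain_scores.get(domain, {})
    fires = scores.get("fires", 0)
    if fires < 3:               # require a minimum sample before disabling
        return False
    misfire_rate = scores.get("misfires", 0) / fires
    return misfire_rate > 0.30  # >30% misfires in this domain disables the rule
```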
tests/test_convergence_gate.py (2)

6-29: LGTM! Good test coverage for the convergence gate behavior. The mock setup correctly simulates a converged category and verifies extraction is skipped.

58-70: LGTM — cache invalidation test is thorough. The test correctly verifies both cache hit (same session) and cache miss (session change) scenarios by manipulating `_convergence_session`.

tests/test_rule_scoping.py (3)

5-27: LGTM — field existence and round-trip tests are well-structured. Good coverage of the new `domain_scores` field initialization and serialization round-trip.

113-132: LGTM — event emission test is thorough. Good test verifying the `rule_scoped_out` event is emitted with correct payload when rules are filtered by domain.

135-151: The test will pass. The actual function signature is `_attribute_domain_fires(brain, correction_category: str, correction_desc: str)`, and the body correctly uses these parameter names. The test properly passes "DRAFTING" and "some correction", which map to these parameters in order, and the domain's fires counter will correctly increment from 5 to 6 as expected.

src/gradata/_core.py (4)
43-67: LGTM — Domain fire attribution logic is sound. The defensive `hasattr` check and initialization of missing `domain_scores` entries is appropriate. The integration with `_classify_correction_direction` returning `"CONTRADICTING"` correctly increments misfires only when the correction contradicts the rule.

182-206: LGTM — Convergence gate implementation is well-structured. The gating logic correctly skips expensive LLM extraction for converged categories, with appropriate fallback to `primary.description`. The `isinstance` guard prevents redundant `InstructionCache` instantiation.

832-880: LGTM — Mann-Kendall implementation is statistically correct. The tie correction, continuity correction, and normal approximation follow the standard Mann-Kendall algorithm. Edge cases (n < 3, var_s == 0) are handled appropriately.

883-892: LGTM — Standard normal CDF approximation is correct. The Abramowitz & Stegun polynomial approximation provides sufficient precision for the Mann-Kendall p-value calculations.
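For reference, a sketch of one standard variant of that approximation (Abramowitz & Stegun 26.2.17, absolute error below 7.5e-8); the PR's `_normal_cdf` may use different coefficients:

```python
import math

def normal_cdf_sketch(x: float) -> float:
    # Upper-tail polynomial approximation of the standard normal CDF
    t = 1.0 / (1.0 + 0.2316419 * abs(x))
    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
           + t * (-1.821255978 + t * 1.330274429))))
    upper_tail = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) * poly
    return 1.0 - upper_tail if x >= 0 else upper_tail
```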
```python
# Domain-scoped misfire attribution
try:
    if brain._fired_rules and (category or classifications):
        correction_desc = ""
        if 'desc' in locals():
            correction_desc = desc
        elif summary:
            correction_desc = summary
        _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc)
except Exception as e:
    _log.debug("Domain fire attribution failed: %s", e)
```
🛠️ Refactor suggestion | 🟠 Major
**Avoid `if 'desc' in locals()` — initialize `desc` explicitly.**

Using a `locals()` lookup to check variable existence is fragile and obscure. The variable `desc` is conditionally assigned across multiple branches earlier in this function (lines 189, 202, 205, 207, 209), making this check difficult to verify.
♻️ Proposed fix: Initialize `desc` before the conditional blocks
Add initialization near line 178 (before the classification branches):
cat = (category or "UNKNOWN").upper()
desc = "" # Will be set by classification or summary
if classifications:
...Then simplify the attribution block:
```diff
 # Domain-scoped misfire attribution
 try:
     if brain._fired_rules and (category or classifications):
-        correction_desc = ""
-        if 'desc' in locals():
-            correction_desc = desc
-        elif summary:
-            correction_desc = summary
+        correction_desc = desc or summary
         _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc)
 except Exception as e:
     _log.debug("Domain fire attribution failed: %s", e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/_core.py` around lines 314 - 324, The code uses if 'desc' in
locals() to detect a variable set earlier; instead initialize desc explicitly
before the classification branches (e.g., set desc = "" near where
cat/category/classifications are prepared) so later code can rely on it, then
replace the locals() check in the attribution block with a simple use of desc
(e.g., use correction_desc = desc or correction_desc = summary fallback) when
calling _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc);
update references to desc across the classification branches to assign into this
initialized variable.
```python
by_category: dict[str, dict] = {}
for cat, session_counts in cat_by_session.items():
    cat_counts = [session_counts.get(s, 0) for s in sessions]
    cat_mk, cat_p = _mann_kendall(cat_counts)
    cat_trend = "converging" if cat_mk == "decreasing" else (
        "diverging" if cat_mk == "increasing" else "converged")
    by_category[cat] = {
        "corrections_per_session": cat_counts,
        "trend": cat_trend,
        "p_value": cat_p,
    }
```
**Per-category trend logic lacks `insufficient_data` handling.**

The overall trend logic (lines 941-948) maps `no_trend` to `converged` only when `len(counts) >= 3`, otherwise to `insufficient_data`. However, the per-category logic unconditionally maps `no_trend` to `converged`, which could prematurely gate extraction for categories with sparse data.
🐛 Proposed fix: Apply consistent length check
```diff
 for cat, session_counts in cat_by_session.items():
     cat_counts = [session_counts.get(s, 0) for s in sessions]
     cat_mk, cat_p = _mann_kendall(cat_counts)
-    cat_trend = "converging" if cat_mk == "decreasing" else (
-        "diverging" if cat_mk == "increasing" else "converged")
+    if cat_mk == "decreasing":
+        cat_trend = "converging"
+    elif cat_mk == "increasing":
+        cat_trend = "diverging"
+    elif sum(1 for c in cat_counts if c > 0) >= 3:
+        cat_trend = "converged"
+    else:
+        cat_trend = "insufficient_data"
     by_category[cat] = {
         "corrections_per_session": cat_counts,
         "trend": cat_trend,
         "p_value": cat_p,
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/_core.py` around lines 962 - 972, The per-category trend mapping
currently always maps a "no_trend" result from _mann_kendall to "converged",
which ignores sparse data; update the logic in the by_category loop (where
cat_counts, cat_mk, cat_p are computed) to mirror the global logic: when cat_mk
== "no_trend" return "converged" only if len(cat_counts) >= 3, otherwise set
"insufficient_data"; keep the existing mappings for "increasing" -> "diverging"
and "decreasing" -> "converging" and ensure by_category[cat]["trend"] uses this
corrected cat_trend determination.
```python
if estimate_time:
    corrections_avoided = max(0, (initial - recent) * len(counts))
    avg_severity_weight = _SEVERITY_SECONDS.get("moderate", 45)
    estimated_seconds = int(corrections_avoided * avg_severity_weight)
    result["estimated_seconds_saved"] = estimated_seconds
    result["time_breakdown"] = {
        "corrections_avoided": round(corrections_avoided, 1),
        "avg_seconds_per_correction": avg_severity_weight,
    }
```
🧹 Nitpick | 🔵 Trivial
**Time estimation formula may significantly overestimate savings.**

The calculation `corrections_avoided = (initial - recent) * len(counts)` assumes every session would have had the initial correction rate without learning, which overestimates savings. For 10 sessions with `initial=10` and `recent=5`, this yields 50 avoided corrections, but actual savings would be lower since sessions 1-3 contributed to the "initial" baseline.
Consider a more conservative estimate, e.g., using sessions after the initial window:
♻️ More accurate estimation
```diff
 if estimate_time:
-    corrections_avoided = max(0, (initial - recent) * len(counts))
+    # Estimate based on sessions after the initial baseline window
+    sessions_with_learning = max(0, len(counts) - 3)
+    corrections_avoided = max(0, (initial - recent) * sessions_with_learning)
     avg_severity_weight = _SEVERITY_SECONDS.get("moderate", 45)
     estimated_seconds = int(corrections_avoided * avg_severity_weight)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/_core.py` around lines 1030 - 1038, The current time-estimate
overstates savings by computing corrections_avoided = (initial - recent) *
len(counts); instead, compute avoided corrections per session (or only for
sessions after an initial baseline window) and sum those non-negative
differences so you don’t assume every session would have stayed at the initial
rate. Update the block guarded by estimate_time to derive corrections_avoided as
sum(max(0, counts[i] - recent) for each session i) or use a baseline mean of the
first N sessions (e.g., baseline = mean(counts[:N]) then sum(max(0, counts[i] -
recent) for i>=N)), then reuse avg_severity_weight from _SEVERITY_SECONDS and
set result["estimated_seconds_saved"] and result["time_breakdown"] accordingly
(referencing variables initial, recent, counts, estimate_time, result,
_SEVERITY_SECONDS, avg_severity_weight).
```python
open_encrypted_db(self.dir, self._encryption_key)

self._instruction_cache: object | None = None  # lazy: InstructionCache
self._fired_rules: list = []  # Rules injected this session (for misfire attribution)
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Search for any code that appends to or assigns _fired_rules
rg -n '_fired_rules' --type=py -C3
```

Repository: Gradata/gradata
Length of output: 1927
**`_fired_rules` is never populated — domain attribution will silently fail.**

The `_fired_rules` list is initialized on line 70 but never populated in the application code. The grep search confirms it only appears in:

- `src/gradata/brain.py:70` — initialization as empty list
- `src/gradata/_core.py:56` — reading it in `_attribute_domain_fires()`
- `tests/test_rule_scoping.py:147` — manually set in test code only

Since nothing appends to this list during normal execution, the loop at `src/gradata/_core.py:56` never executes and domain fire/misfire counts remain permanently at their initial values. The rule injection flow must populate `_fired_rules` when rules are applied, likely in `apply_brain_rules()` or its callchain.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/brain.py` at line 70, The _fired_rules list is never updated so
domain attribution in _attribute_domain_fires() never runs; update the rule
application flow (e.g., inside apply_brain_rules() or the concrete function that
executes individual rules) to append a concise identifier for each rule when it
successfully fires (use the same identifier format tests expect, e.g., rule
id/name/object used in tests/test_rule_scoping.py), and ensure you append to
self._fired_rules on the Brain instance before/after any side-effectful action
so attribution logic can iterate over it; also ensure any existing test that
directly sets _fired_rules still passes by matching the same identifier
convention.
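A minimal sketch of the kind of change this prompt asks for. `apply_brain_rules()`, the `applied` list, and the `.lesson` attribute are assumptions drawn from the flowchart at the end of this review, not verified SDK API:

```python
# Hypothetical sketch; the real apply_brain_rules() call chain may differ.
def apply_brain_rules(self, ctx) -> str:
    applied = apply_rules(self.lessons, build_scope(ctx), bus=self.bus)
    # Record which rules were injected this session so
    # _attribute_domain_fires() can increment per-domain fires/misfires.
    self._fired_rules = [rule.lesson for rule in applied]
    return format_rules_for_prompt(applied)
```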
| elif meta_line.startswith("Domain scores:"): | ||
| import json as _json | ||
| try: | ||
| domain_scores = _json.loads(meta_line[len("Domain scores:"):].strip()) | ||
| except _json.JSONDecodeError: | ||
| domain_scores = {} |
🧹 Nitpick | 🔵 Trivial
**Move `json` import to module level.**

The `json` module is imported inside the parsing loop. While functional, this creates overhead on each lesson with domain scores. Consider moving the import to the module level alongside the existing imports.
♻️ Proposed refactor
Add at the top of the file with other imports:
```python
import json
```

Then simplify the parsing:

```diff
 elif meta_line.startswith("Domain scores:"):
-    import json as _json
     try:
-        domain_scores = _json.loads(meta_line[len("Domain scores:"):].strip())
-    except _json.JSONDecodeError:
+        domain_scores = json.loads(meta_line[len("Domain scores:"):].strip())
+    except json.JSONDecodeError:
         domain_scores = {}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/enhancements/self_improvement.py` around lines 305 - 310, Move
the in-loop import of json out to the module level and update the parsing to use
that module; specifically, add "import json" with the other top-level imports,
then in the block handling meta_line.startswith("Domain scores:") replace
"import json as _json" and uses of "_json.loads" / "_json.JSONDecodeError" with
"json.loads" and "json.JSONDecodeError" so domain_scores parsing still falls
back to {} on decode errors (refer to the meta_line handling and the
domain_scores variable).
```python
if lesson.domain_scores:
    import json as _json
    lines.append(f"  Domain scores: {_json.dumps(lesson.domain_scores)}")
```
🧹 Nitpick | 🔵 Trivial
**Same import inefficiency in formatting.**

Similar to the parsing code, `json` is imported inside the loop. If moved to module level as suggested above, update this block to use the module-level import.
♻️ Proposed refactor
```diff
 if lesson.domain_scores:
-    import json as _json
-    lines.append(f"  Domain scores: {_json.dumps(lesson.domain_scores)}")
+    lines.append(f"  Domain scores: {json.dumps(lesson.domain_scores)}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/enhancements/self_improvement.py` around lines 901 - 903, The
block that formats domain scores imports json inside the conditional (import
json as _json) which is inefficient; remove the inline import and use a
module-level import (e.g., import json at top of
src/gradata/enhancements/self_improvement.py) and update the usage in the block
that appends domain scores (the lines.append(f" Domain scores:
{_json.dumps(lesson.domain_scores)}") statement referencing
lesson.domain_scores) to call the module-level json.dumps (or the chosen alias)
instead.
```python
# Step 1.5 — domain scoping: filter out rules disabled for current domain
current_domain = scope.domain.upper() if scope.domain else ""
if current_domain:
    filtered = []
    for lesson in eligible:
        if is_rule_disabled_for_domain(lesson, current_domain):
            if bus:
                scores = lesson.domain_scores.get(current_domain, {})
                bus.emit("rule_scoped_out", {
                    "lesson_category": lesson.category,
                    "lesson_description": lesson.description[:80],
                    "domain": current_domain,
                    "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
                })
        else:
            filtered.append(lesson)
    eligible = filtered
```
**`rule_scoped_out` events won't fire in production — `bus` is not passed.**

The domain scoping filter and event emission logic is correct, but according to the context snippet from `src/gradata/brain.py:383-406`, the production call site `apply_brain_rules()` calls `apply_rules(lessons, build_scope(ctx))` without passing the `bus` parameter. This means `rule_scoped_out` events will never be emitted in normal usage.

Consider updating `apply_brain_rules()` to pass `self.bus` if observability of rule scoping is desired.
🔧 Suggested fix in brain.py
In `src/gradata/brain.py` around line 406:

```diff
-    return format_rules_for_prompt(apply_rules(lessons, build_scope(ctx)))
+    return format_rules_for_prompt(apply_rules(lessons, build_scope(ctx), bus=self.bus))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/rules/rule_engine.py` around lines 489 - 505, The production call
to apply_rules via apply_brain_rules is not passing the event bus, so
rule_scoped_out events never fire; update apply_brain_rules (the method that
calls apply_rules(lessons, build_scope(ctx))) to pass the instance bus
(self.bus) as the bus argument when calling apply_rules, ensuring the existing
domain-scoping code in rule_engine.py (which checks bus and emits
"rule_scoped_out") can emit events; verify signature of apply_rules matches
(lessons, scope, bus=None) and propagate the change only in apply_brain_rules to
avoid breaking other callers.
```python
def test_convergence_per_category():
    """Should return per-category breakdown."""
    d = tempfile.mkdtemp()
    (Path(d) / "lessons.md").write_text("", encoding="utf-8")
    brain = Brain(d)
    # Session 1: 3 TONE, 2 CODE corrections
    for _ in range(3):
        brain.emit("CORRECTION", "test", {"category": "TONE", "severity": "minor"}, ["category:TONE"], 1)
    for _ in range(2):
        brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 1)
    # Session 2: 1 TONE, 2 CODE
    brain.emit("CORRECTION", "test", {"category": "TONE", "severity": "minor"}, ["category:TONE"], 2)
    for _ in range(2):
        brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 2)
    # Session 3: 0 TONE, 1 CODE
    brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 3)

    result = brain.convergence()
    assert "by_category" in result
    assert "TONE" in result["by_category"]
    assert "CODE" in result["by_category"]
    # TONE went 3→1→0 = converging
    assert result["by_category"]["TONE"]["corrections_per_session"] == [3, 1, 0]
```
**Test is incomplete — missing assertion for CODE category.**

The test sets up both TONE and CODE corrections across sessions but only asserts on TONE's `corrections_per_session`. Consider adding assertions for CODE to ensure both categories are tracked correctly.
💚 Proposed fix to complete the test
```diff
     # TONE went 3→1→0 = converging
     assert result["by_category"]["TONE"]["corrections_per_session"] == [3, 1, 0]
+    # CODE went 2→2→1
+    assert result["by_category"]["CODE"]["corrections_per_session"] == [2, 2, 1]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_convergence.py` around lines 110 - 132, The test
test_convergence_per_category creates CODE corrections across three sessions but
never asserts their per-session counts; update the test to assert that result
contains "by_category" and "CODE" (similar to the TONE checks) and that
result["by_category"]["CODE"]["corrections_per_session"] equals the expected
list [2, 2, 1] (session 1:2, session 2:2, session 3:1), so the CODE category is
validated alongside TONE in Brain.convergence().
```python
def test_efficiency_ratio_declining(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([10, 12, 8, 5, 4, 3])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert "effort_ratio" in result
    assert result["effort_ratio"] < 1.0
    assert result["corrections_initial"] == 10.0  # avg(10,12,8)
    assert result["corrections_recent"] == 4.0  # avg(5,4,3)
```
**Use `pytest.approx` for floating-point comparisons.**

Lines 34-35 compare floats using `==`, which can be fragile. While these specific values (10.0, 4.0) are exact, using `pytest.approx` is the recommended practice for float comparisons per coding guidelines.
💚 Proposed fix
```diff
+import pytest
+
 def test_efficiency_ratio_declining(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([10, 12, 8, 5, 4, 3])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
     assert "effort_ratio" in result
     assert result["effort_ratio"] < 1.0
-    assert result["corrections_initial"] == 10.0  # avg(10,12,8)
-    assert result["corrections_recent"] == 4.0  # avg(5,4,3)
+    assert result["corrections_initial"] == pytest.approx(10.0)  # avg(10,12,8)
+    assert result["corrections_recent"] == pytest.approx(4.0)  # avg(5,4,3)
```

As per coding guidelines: "floating point comparisons use pytest.approx"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_efficiency.py` around lines 27 - 35, In
test_efficiency_ratio_declining replace the direct float equality checks for
corrections_initial and corrections_recent with pytest.approx comparisons:
update the two asserts that compare result["corrections_initial"] and
result["corrections_recent"] (from the test that constructs brain via
_make_brain and uses _mock_convergence and brain.efficiency()) to use
pytest.approx(...) so they tolerate floating-point rounding; ensure pytest is
available/imported in the test module if not already.
```python
def test_efficiency_ratio_no_improvement(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([5, 5, 5, 5, 5, 5])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert result["effort_ratio"] == 1.0


def test_efficiency_insufficient_data(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([5, 3])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert result["effort_ratio"] == 1.0
    assert result["corrections_initial"] == 0
    assert result["corrections_recent"] == 0
```
**Same floating-point comparison issue.**

Lines 43 and 51 also use `==` for float comparisons.
💚 Proposed fix
```diff
 def test_efficiency_ratio_no_improvement(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([5, 5, 5, 5, 5, 5])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
-    assert result["effort_ratio"] == 1.0
+    assert result["effort_ratio"] == pytest.approx(1.0)

 def test_efficiency_insufficient_data(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([5, 3])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
-    assert result["effort_ratio"] == 1.0
+    assert result["effort_ratio"] == pytest.approx(1.0)
```

As per coding guidelines: "floating point comparisons use pytest.approx"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_efficiency.py` around lines 38 - 53, The assertions in
test_efficiency_ratio_no_improvement and test_efficiency_insufficient_data use
direct float equality for result["effort_ratio"]; change these to use
pytest.approx for robust floating-point comparison (e.g., assert
result["effort_ratio"] == pytest.approx(1.0)), and ensure the test file imports
pytest if not already; keep the other integer assertions for corrections_initial
and corrections_recent unchanged.
Summary
- `rule_scoped_out` emitted when a rule is suppressed by domain

Source

Sim 9 consensus (2000 posts, 200 agents) + Kenoodl synthesis. See `docs/superpowers/specs/2026-04-06-sim9-engine-hardening-design.md`.

Test plan

- `pytest tests/test_rule_scoping.py -v` — 8 domain scoping tests
- `pytest tests/test_convergence_gate.py -v` — 3 gate tests
- `pytest tests/test_efficiency.py -v` — 5 efficiency metric tests
- `pytest tests/test_convergence.py -v` — 11 Mann-Kendall tests
- `pytest tests/ -q` — full regression suite (1374 pass)
- `pyright src/gradata/` — 0 errors

Generated with Gradata
Greptile Summary
This PR hardens the Sim 9 engine with three new capabilities: per-domain rule scoping (auto-disables rules with >30% misfire rate in a given domain), a convergence gate (skips LLM extraction when a category's correction rate has statistically stabilized via Mann-Kendall), and a `brain.efficiency()` metric that computes an effort-ratio from correction trends with an optional time-saved estimate.

- `domain_scores` added to `Lesson`, `is_rule_disabled_for_domain()` added to `rule_engine`, and `_attribute_domain_fires()` added to `_core.py`; a new `rule_scoped_out` EventBus event is emitted when a rule is suppressed per domain.
- `_get_convergence()` caches the Mann-Kendall result per session; per-category trend is checked before calling `extract_behavioral_instruction`, saving LLM calls when a category is already stable.
- `brain.efficiency()` computes the `recent / initial` correction-rate ratio; optional `estimate_time=True` adds a severity-weighted seconds-saved estimate (usage sketch below).
- Two findings to note: the per-category trend mapping has no `insufficient_data` guard (sparse-data categories can be falsely labeled `converged`, silently suppressing extraction), and the `bus.emit("rule_scoped_out", ...)` call inside the filtering loop is not wrapped in try-except, meaning a misbehaving subscriber can abort the loop and truncate the applied rule set.
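A usage sketch of the new metric as the tests above exercise it; the import path and directory argument are assumptions:

```python
from gradata import Brain  # import path assumed

brain = Brain("./my_brain")  # hypothetical brain directory
metrics = brain.efficiency(estimate_time=True)

effort_ratio = metrics["effort_ratio"]          # recent / initial correction rate
saved = metrics.get("estimated_seconds_saved")  # present only with estimate_time=True
```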
Confidence Score: 3/5

Two P1 logic bugs found that should be fixed before merge: sparse-category convergence mislabeling and an unguarded EventBus emit inside the rule-filtering loop.
The overall architecture is clean, test coverage is solid, and the three new features are well-reasoned. However, the per-category convergence gap is a production-quality bug — a category with sparse history will be silently marked `converged` and skip extraction, degrading lesson quality over time. The unguarded `bus.emit` in `apply_rules` violates Rule 4 and can cause silent rule-set truncation. Both are targeted fixes (a few lines each). P2s are minor style concerns. Score of 3 reflects two genuine logic bugs on the critical path that warrant a fix before merge.

Files needing attention: `src/gradata/_core.py` (`brain_convergence` per-category mapping, lines 962–967; `locals()` pattern, lines 317–319) and `src/gradata/rules/rule_engine.py` (unguarded `bus.emit` in `apply_rules` loop, lines 494–502).
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[brain.correct called] --> B[compute_diff + classify_edits]
    B --> C{_get_convergence\ncached this session?}
    C -->|yes| D[use cached result]
    C -->|no| E[brain_convergence — DB query]
    E --> F[Mann-Kendall overall + per-category]
    F --> D
    D --> G{cat_convergence.trend\n== 'converged'?}
    G -->|yes| H[use primary.description\nskip LLM extraction]
    G -->|no| I[extract_behavioral_instruction\nLLM + cache + templates]
    H --> J[find / create / update Lesson]
    I --> J
    J --> K[update_confidence + write lessons.md]
    K --> L{_fired_rules present?}
    L -->|yes| M[_attribute_domain_fires\nincrement domain fires/misfires]
    M --> N[bus.emit correction.created]
    L -->|no| N
    subgraph apply_rules [apply_rules — rule selection]
        O[filter PATTERN + RULE] --> P{scope.domain set?}
        P -->|yes| Q{is_rule_disabled_for_domain?}
        Q -->|yes| R[bus.emit rule_scoped_out\ndrop rule]
        Q -->|no| S[keep rule]
        P -->|no| S
        S --> T[score + rank → AppliedRule list]
    end
```
Reviews (1): Last reviewed commit: "feat: add brain.efficiency() — effort-sa..."
Context used:
Rule 1: Never use print() ... (source)