feat: Sim 9 engine hardening — rule scoping, convergence gate, efficiency metric #15
Conversation
…gory tracking

Replaces naive first-half/second-half comparison with Mann-Kendall statistical trend test (pure Python, no scipy). Now returns:

- p_value for trend significance
- by_category breakdown (each category's own convergence curve)
- Proper handling of ties and noisy data

Driven by MiroFish Sim 9 consensus: 200 personas unanimously recommended Mann-Kendall over the half-split approach.

Co-Authored-By: Gradata <noreply@gradata.ai>
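For context on the statistic this commit adopts, here is a minimal pure-Python sketch of the Mann-Kendall idea. It is illustrative only; the PR's `_mann_kendall` additionally corrects for ties and derives a p-value from a normal-CDF approximation.

```python
import math

def mann_kendall_sketch(counts: list[int]) -> str:
    """Classify a series as increasing / decreasing / no_trend (illustrative)."""
    n = len(counts)
    if n < 3:
        return "no_trend"  # too few sessions to call a trend
    # S statistic: net count of rising vs. falling pairs (j > i)
    s = sum(
        (counts[j] > counts[i]) - (counts[j] < counts[i])
        for i in range(n) for j in range(i + 1, n)
    )
    if s == 0:
        return "no_trend"
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S, no tie correction
    z = (s - math.copysign(1, s)) / math.sqrt(var_s)  # continuity correction
    if z > 1.96:  # two-sided 5% significance
        return "increasing"
    if z < -1.96:
        return "decreasing"
    return "no_trend"
```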
… estimates

Co-Authored-By: Gradata <noreply@gradata.ai>
📝 Walkthrough

The changes implement convergence-aware behavioral instruction gating, statistical Mann–Kendall trend testing, domain-scoped rule disabling via misfire rate thresholds, and a new efficiency metric API to quantify effort savings across learning sessions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Brain
participant RuleEngine as Rule Engine
participant EventBus as Event Bus
participant Core as _core.py
Client->>Brain: correct(correction, category)
activate Brain
Brain->>Brain: _get_convergence()
activate Brain
alt Cached & same session
Brain-->>Brain: return cached result
else Cache miss or session changed
Brain->>Core: brain_convergence(self)
activate Core
Core->>Core: _mann_kendall() for trend
Core-->>Brain: {trend, p_value, by_category}
deactivate Core
Brain->>Brain: store in _convergence_cache
end
deactivate Brain
alt Category trend is "converged"
Brain->>Brain: skip extract_behavioral_instruction()
Brain->>Brain: use primary.description
else Trend is "diverging" or "converging"
Brain->>Brain: call extract_behavioral_instruction()
end
Brain->>RuleEngine: apply_rules(..., domain)
activate RuleEngine
RuleEngine->>RuleEngine: filter by is_rule_disabled_for_domain()
alt Rule disabled for domain
RuleEngine->>EventBus: emit "rule_scoped_out"
RuleEngine-->>EventBus: {domain, category, misfire_rate}
end
RuleEngine-->>Brain: eligible_lessons
deactivate RuleEngine
Brain->>Core: _attribute_domain_fires()
activate Core
Core->>Core: increment fires/misfires in domain_scores
deactivate Core
Brain-->>Client: lesson created/updated
deactivate Brain
```

```mermaid
sequenceDiagram
participant Client
participant Brain
participant Core as _core.py
Client->>Brain: efficiency(estimate_time=False)
activate Brain
Brain->>Brain: _get_convergence()
activate Brain
Brain->>Core: brain_convergence(self)
activate Core
Core-->>Brain: {by_category, ...}
deactivate Core
deactivate Brain
Brain->>Core: brain_efficiency(self)
activate Core
Core->>Core: extract first/last 3 sessions
Core->>Core: compute effort_ratio
alt estimate_time is True
Core->>Core: multiply ratio by _SEVERITY_SECONDS
Core->>Core: compute estimated_seconds_saved
end
Core-->>Brain: {effort_ratio, estimated_seconds_saved?, ...}
deactivate Core
Brain-->>Client: efficiency metrics dict
deactivate Brain
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
```python
by_category: dict[str, dict] = {}
for cat, session_counts in cat_by_session.items():
    cat_counts = [session_counts.get(s, 0) for s in sessions]
    cat_mk, cat_p = _mann_kendall(cat_counts)
    cat_trend = "converging" if cat_mk == "decreasing" else (
        "diverging" if cat_mk == "increasing" else "converged")
```
**Per-category convergence has no `insufficient_data` guard** (`src/gradata/_core.py`, lines 962–967)
At the top level, `brain_convergence` correctly emits `"insufficient_data"` when `len(counts) < 3`. But the per-category branch maps every `"no_trend"` result straight to `"converged"`:

```python
cat_trend = "converging" if cat_mk == "decreasing" else (
    "diverging" if cat_mk == "increasing" else "converged")
```

`cat_counts` is built with zeros for every session where the category had no corrections, so a category that appeared in only one or two sessions out of seven gets a list like `[0, 0, 0, 5, 0, 0, 0]`. Because this sequence has no monotonic trend, Mann-Kendall returns `"no_trend"` → `"converged"`. The convergence gate in `_core.py` then sees `trend == "converged"` and **skips LLM extraction for future corrections in that category**, even though the category has never meaningfully converged — it simply has sparse history.
Suggested fix:
if cat_mk == "decreasing":
cat_trend = "converging"
elif cat_mk == "increasing":
cat_trend = "diverging"
elif sum(cat_counts) < 3: # not enough non-zero data
cat_trend = "insufficient_data"
else:
cat_trend = "converged"The convergence gate already checks cat_convergence.get("trend") == "converged", so adding the "insufficient_data" branch here is a backward-compatible fix — categories with thin history will continue to get LLM extraction.
```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        scores = lesson.domain_scores.get(current_domain, {})
        bus.emit("rule_scoped_out", {
            "lesson_category": lesson.category,
            "lesson_description": lesson.description[:80],
            "domain": current_domain,
            "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
        })
```
**Unguarded `bus.emit` inside eligibility-filter loop — Rule 4 violation** (`src/gradata/rules/rule_engine.py`, lines 494–502)
The bus.emit("rule_scoped_out", ...) call is inside a for lesson in eligible loop with no try-except:
if is_rule_disabled_for_domain(lesson, current_domain):
if bus:
scores = lesson.domain_scores.get(current_domain, {})
bus.emit("rule_scoped_out", { ... })Per Rule 4, EventBus emit calls must be wrapped so a misbehaving subscriber cannot propagate exceptions to the caller. If the handler raises here, the for loop terminates early — rules that should have been kept in eligible are silently dropped, producing a truncated rule set with no warning.
Suggested change:

```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        try:
            scores = lesson.domain_scores.get(current_domain, {})
            bus.emit("rule_scoped_out", {
                "lesson_category": lesson.category,
                "lesson_description": lesson.description[:80],
                "domain": current_domain,
                "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
            })
        except Exception:
            _log.debug("rule_scoped_out emit failed")
```
```python
correction_desc = ""
if 'desc' in locals():
    correction_desc = desc
```
**`locals()` inspection is fragile for `desc`-variable lookup** (`src/gradata/_core.py`, lines 317–319)

`desc` is assigned inside the broad `try` block that ends at line 311. Whether it exists in the local namespace after the `except` depends on exactly how far execution got. Using `'desc' in locals()` to probe this is an anti-pattern — easy to misread and could silently yield an empty `correction_desc` if the assignment was skipped part-way through the block.
A cleaner approach is to initialise a dedicated variable before the try block:
_extracted_desc: str = "" # initialise before the try block
# ... inside the try, after desc is computed:
_extracted_desc = desc
# ... after the try block:
if brain._fired_rules and (category or classifications):
_attribute_domain_fires(brain, category or "UNKNOWN", _extracted_desc or summary)This makes the intent explicit and eliminates the locals() introspection.
```python
"""Tests for rule domain scoping — per-domain misfire tracking and auto-disable."""
from gradata._types import Lesson, LessonState
```
**Missing `from __future__ import annotations` (Rule 6)** (`tests/test_rule_scoping.py`, lines 1–2)

Both new test files added in this PR are missing the required first import. Per Rule 6, all `.py` files in the SDK should include `from __future__ import annotations` to enable `str | None` syntax consistently and maintain Pyright strict-mode compatibility.

This applies here (`tests/test_rule_scoping.py`) and also to `tests/test_convergence_gate.py` line 1.
| """Tests for rule domain scoping — per-domain misfire tracking and auto-disable.""" | |
| from gradata._types import Lesson, LessonState | |
| from __future__ import annotations | |
| """Tests for rule domain scoping — per-domain misfire tracking and auto-disable.""" |
Rule Used: Code Review Rules, Rule 1: Never use print() ... ([source](https://app.greptile.com/review/custom-context?memory=dee613fe-ca52-4382-b9d7-fad6d0b079ec))
```python
if estimate_time:
    corrections_avoided = max(0, (initial - recent) * len(counts))
```
**`corrections_avoided` multiplier includes the baseline period, overestimating savings** (`src/gradata/_core.py`, line 1031)

```python
corrections_avoided = max(0, (initial - recent) * len(counts))
```

This multiplies by the **total** session count, including the first three sessions used to establish `initial`. In a 20-session brain with `initial = 10`, `recent = 4`, the formula yields `6 × 20 = 120` — but sessions 1–3 were already running at the initial correction rate, so nothing was avoided there.
A more accurate (still approximate) estimate uses only post-baseline sessions:
```diff
-    corrections_avoided = max(0, (initial - recent) * len(counts))
+    corrections_avoided = max(0, (initial - recent) * max(0, len(counts) - 3))
```
This is still labelled "approximate" in the docstring, but avoids inflating the reported savings for long-running brains.
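A quick arithmetic check of the two formulas, using the hypothetical 20-session numbers from the comment above:

```python
initial, recent, n_sessions = 10, 4, 20
baseline_window = 3  # the first three sessions establish `initial`

old_estimate = max(0, (initial - recent) * n_sessions)        # 6 * 20 = 120
new_estimate = max(0, (initial - recent)
                   * max(0, n_sessions - baseline_window))    # 6 * 17 = 102
```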
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/gradata/_core.py (1)
1-7: 🧹 Nitpick | 🔵 Trivial — **File exceeds 500-line limit (1040 lines).**
As per coding guidelines, Python files should be kept under 500 lines. This file has grown to 1040 lines with the new additions. Consider splitting into focused modules:
- `_core_correct.py`: `brain_correct`, `_attribute_domain_fires`
- `_core_session.py`: `brain_end_session`
- `_core_convergence.py`: `brain_convergence`, `_mann_kendall`, `_normal_cdf`, `brain_efficiency`
- `_core_export.py`: export helpers

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/gradata/_core.py` around lines 1 - 7, The file src/gradata/_core.py is too large; split it into focused modules and update imports: extract heavy correctness logic into a new _core_correct.py containing brain_correct and _attribute_domain_fires, extract session logic into _core_session.py with brain_end_session, extract convergence/statistics into _core_convergence.py with brain_convergence, _mann_kendall, _normal_cdf and brain_efficiency, and move export-related helpers into _core_export.py; in the original _core.py leave thin delegator functions or import-and-reexport these symbols (e.g., from ._core_correct import brain_correct, _attribute_domain_fires) so external callers and the Brain class continue to reference the same function names, and update any local imports/relative references to these functions/classes to use the new modules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6ba4e2b1-7dc5-4ce5-995b-286aa05b305e
📒 Files selected for processing (9)
src/gradata/_core.py, src/gradata/_types.py, src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, src/gradata/rules/rule_engine.py, tests/test_convergence.py, tests/test_convergence_gate.py, tests/test_efficiency.py, tests/test_rule_scoping.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*
📄 CodeRabbit inference engine (CLAUDE.md)
Always read a file before editing it. Always prefer editing over creating new files.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, src/gradata/rules/rule_engine.py, tests/test_convergence_gate.py, src/gradata/_types.py, src/gradata/_core.py
**/*.{py,sh,js,ts,json,env*}
📄 CodeRabbit inference engine (CLAUDE.md)
Never hardcode secrets. Never commit .env files. Pre-commit hook blocks both.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, src/gradata/rules/rule_engine.py, tests/test_convergence_gate.py, src/gradata/_types.py, src/gradata/_core.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Keep files under 500 lines. Validate input at system boundaries.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, src/gradata/rules/rule_engine.py, tests/test_convergence_gate.py, src/gradata/_types.py, src/gradata/_core.py
src/gradata/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use AGPL-3.0 licensing. Source code resides in src/gradata/.
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, src/gradata/rules/rule_engine.py, src/gradata/_types.py, src/gradata/_core.py
⚙️ CodeRabbit configuration file
src/gradata/**/*.py: This is the core SDK. Check for: type safety (`from __future__ import annotations` required), no print() statements (use logging), all functions accepting BrainContext where DB access occurs, no hardcoded paths. Severity scoring must clamp to [0,1]. Confidence values must be in [0.0, 1.0].
Files:
src/gradata/brain.py, src/gradata/enhancements/self_improvement.py, src/gradata/rules/rule_engine.py, src/gradata/_types.py, src/gradata/_core.py
tests/**
⚙️ CodeRabbit configuration file
tests/**: Test files. Verify: no hardcoded paths, assertions check specific values not just truthiness, parametrized tests preferred for boundary conditions, floating point comparisons use pytest.approx.
Files:
tests/test_rule_scoping.py, tests/test_convergence.py, tests/test_efficiency.py, tests/test_convergence_gate.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: Gradata/gradata PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T15:53:08.573Z
Learning: Correction pipeline: correct → extract behavioral instruction → graduate (INSTINCT→PATTERN→RULE) → inject into next session. Meta-rules emerge when patterns cluster across domains.
🔇 Additional comments (13)
src/gradata/_types.py (1)

125-125: LGTM! The new `domain_scores` field is well-typed with proper default factory initialization. The nested `dict[str, dict[str, int]]` structure appropriately models per-domain fire/misfire tracking.

src/gradata/brain.py (2)
344-351: LGTM — convergence caching logic is correct. The session-based caching correctly compares `self._convergence_session == self.session` before returning cached results, ensuring fresh data when the session changes.

353-360: LGTM! The `efficiency()` method properly delegates to `brain_efficiency` with the keyword-only `estimate_time` parameter.

src/gradata/rules/rule_engine.py (1)
419-430: LGTM — domain disabling logic is correct. The 30% misfire threshold with minimum 3 fires requirement provides reasonable statistical significance before disabling a rule for a domain.
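A minimal sketch of what such a check can look like, reconstructed from the threshold described here and the `misfire_rate` expression shown in the diff; the actual `is_rule_disabled_for_domain` at lines 419-430 may differ:

```python
def is_rule_disabled_for_domain(lesson, domain: str) -> bool:
    # Hypothetical reconstruction; see rule_engine.py for the real logic.
    scores = lesson.domain_scores.get(domain, {})
    fires = scores.get("fires", 0)
    if fires < 3:               # require a minimum sample before disabling
        return False
    misfire_rate = scores.get("misfires", 0) / fires
    return misfire_rate > 0.30  # >30% misfires in this domain disables the rule
```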
tests/test_convergence_gate.py (2)

6-29: LGTM! Good test coverage for the convergence gate behavior. The mock setup correctly simulates a converged category and verifies extraction is skipped.

58-70: LGTM — cache invalidation test is thorough. The test correctly verifies both cache hit (same session) and cache miss (session change) scenarios by manipulating `_convergence_session`.

tests/test_rule_scoping.py (3)

5-27: LGTM — field existence and round-trip tests are well-structured. Good coverage of the new `domain_scores` field initialization and serialization round-trip.

113-132: LGTM — event emission test is thorough. Good test verifying the `rule_scoped_out` event is emitted with correct payload when rules are filtered by domain.

135-151: The test will pass. The actual function signature is `_attribute_domain_fires(brain, correction_category: str, correction_desc: str)`, and the body correctly uses these parameter names. The test properly passes "DRAFTING" and "some correction", which map to these parameters in order, and the domain's fires counter will correctly increment from 5 to 6 as expected.

src/gradata/_core.py (4)
43-67: LGTM — Domain fire attribution logic is sound. The defensive `hasattr` check and initialization of missing `domain_scores` entries is appropriate. The integration with `_classify_correction_direction` returning `"CONTRADICTING"` correctly increments misfires only when the correction contradicts the rule.

182-206: LGTM — Convergence gate implementation is well-structured. The gating logic correctly skips expensive LLM extraction for converged categories, with appropriate fallback to `primary.description`. The `isinstance` guard prevents redundant `InstructionCache` instantiation.

832-880: LGTM — Mann-Kendall implementation is statistically correct. The tie correction, continuity correction, and normal approximation follow the standard Mann-Kendall algorithm. Edge cases (n < 3, var_s == 0) are handled appropriately.

883-892: LGTM — Standard normal CDF approximation is correct. The Abramowitz & Stegun polynomial approximation provides sufficient precision for the Mann-Kendall p-value calculations.
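For reference, a sketch of one standard variant of that approximation (Abramowitz & Stegun 26.2.17, absolute error below 7.5e-8); the PR's `_normal_cdf` may use different coefficients:

```python
import math

def normal_cdf_sketch(x: float) -> float:
    # Upper-tail polynomial approximation of the standard normal CDF
    t = 1.0 / (1.0 + 0.2316419 * abs(x))
    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
           + t * (-1.821255978 + t * 1.330274429))))
    upper_tail = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) * poly
    return 1.0 - upper_tail if x >= 0 else upper_tail
```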
```python
# Domain-scoped misfire attribution
try:
    if brain._fired_rules and (category or classifications):
        correction_desc = ""
        if 'desc' in locals():
            correction_desc = desc
        elif summary:
            correction_desc = summary
        _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc)
except Exception as e:
    _log.debug("Domain fire attribution failed: %s", e)
```
🛠️ Refactor suggestion | 🟠 Major
**Avoid `if 'desc' in locals()` — initialize `desc` explicitly.**

Using a `locals()` lookup to check variable existence is fragile and obscure. The variable `desc` is conditionally assigned across multiple branches earlier in this function (lines 189, 202, 205, 207, 209), making this check difficult to verify.
♻️ Proposed fix: Initialize `desc` before the conditional blocks
Add initialization near line 178 (before the classification branches):
cat = (category or "UNKNOWN").upper()
desc = "" # Will be set by classification or summary
if classifications:
...Then simplify the attribution block:
```diff
 # Domain-scoped misfire attribution
 try:
     if brain._fired_rules and (category or classifications):
-        correction_desc = ""
-        if 'desc' in locals():
-            correction_desc = desc
-        elif summary:
-            correction_desc = summary
+        correction_desc = desc or summary
         _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc)
 except Exception as e:
     _log.debug("Domain fire attribution failed: %s", e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/_core.py` around lines 314 - 324, The code uses if 'desc' in
locals() to detect a variable set earlier; instead initialize desc explicitly
before the classification branches (e.g., set desc = "" near where
cat/category/classifications are prepared) so later code can rely on it, then
replace the locals() check in the attribution block with a simple use of desc
(e.g., use correction_desc = desc or correction_desc = summary fallback) when
calling _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc);
update references to desc across the classification branches to assign into this
initialized variable.
```python
by_category: dict[str, dict] = {}
for cat, session_counts in cat_by_session.items():
    cat_counts = [session_counts.get(s, 0) for s in sessions]
    cat_mk, cat_p = _mann_kendall(cat_counts)
    cat_trend = "converging" if cat_mk == "decreasing" else (
        "diverging" if cat_mk == "increasing" else "converged")
    by_category[cat] = {
        "corrections_per_session": cat_counts,
        "trend": cat_trend,
        "p_value": cat_p,
    }
```
**Per-category trend logic lacks `insufficient_data` handling.**

The overall trend logic (lines 941-948) maps `no_trend` to `converged` only when `len(counts) >= 3`, otherwise to `insufficient_data`. However, the per-category logic unconditionally maps `no_trend` to `converged`, which could prematurely gate extraction for categories with sparse data.
🐛 Proposed fix: Apply consistent length check
```diff
 for cat, session_counts in cat_by_session.items():
     cat_counts = [session_counts.get(s, 0) for s in sessions]
     cat_mk, cat_p = _mann_kendall(cat_counts)
-    cat_trend = "converging" if cat_mk == "decreasing" else (
-        "diverging" if cat_mk == "increasing" else "converged")
+    if cat_mk == "decreasing":
+        cat_trend = "converging"
+    elif cat_mk == "increasing":
+        cat_trend = "diverging"
+    elif sum(1 for c in cat_counts if c > 0) >= 3:
+        cat_trend = "converged"
+    else:
+        cat_trend = "insufficient_data"
     by_category[cat] = {
         "corrections_per_session": cat_counts,
         "trend": cat_trend,
         "p_value": cat_p,
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/_core.py` around lines 962 - 972, The per-category trend mapping
currently always maps a "no_trend" result from _mann_kendall to "converged",
which ignores sparse data; update the logic in the by_category loop (where
cat_counts, cat_mk, cat_p are computed) to mirror the global logic: when cat_mk
== "no_trend" return "converged" only if len(cat_counts) >= 3, otherwise set
"insufficient_data"; keep the existing mappings for "increasing" -> "diverging"
and "decreasing" -> "converging" and ensure by_category[cat]["trend"] uses this
corrected cat_trend determination.
```python
if estimate_time:
    corrections_avoided = max(0, (initial - recent) * len(counts))
    avg_severity_weight = _SEVERITY_SECONDS.get("moderate", 45)
    estimated_seconds = int(corrections_avoided * avg_severity_weight)
    result["estimated_seconds_saved"] = estimated_seconds
    result["time_breakdown"] = {
        "corrections_avoided": round(corrections_avoided, 1),
        "avg_seconds_per_correction": avg_severity_weight,
    }
```
🧹 Nitpick | 🔵 Trivial
**Time estimation formula may significantly overestimate savings.**

The calculation `corrections_avoided = (initial - recent) * len(counts)` assumes every session would have had the initial correction rate without learning, which overestimates savings. For 10 sessions with `initial=10` and `recent=5`, this yields 50 avoided corrections, but actual savings would be lower since sessions 1-3 contributed to the "initial" baseline.
Consider a more conservative estimate, e.g., using sessions after the initial window:
♻️ More accurate estimation
```diff
 if estimate_time:
-    corrections_avoided = max(0, (initial - recent) * len(counts))
+    # Estimate based on sessions after the initial baseline window
+    sessions_with_learning = max(0, len(counts) - 3)
+    corrections_avoided = max(0, (initial - recent) * sessions_with_learning)
     avg_severity_weight = _SEVERITY_SECONDS.get("moderate", 45)
     estimated_seconds = int(corrections_avoided * avg_severity_weight)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/_core.py` around lines 1030 - 1038, The current time-estimate
overstates savings by computing corrections_avoided = (initial - recent) *
len(counts); instead, compute avoided corrections per session (or only for
sessions after an initial baseline window) and sum those non-negative
differences so you don’t assume every session would have stayed at the initial
rate. Update the block guarded by estimate_time to derive corrections_avoided as
sum(max(0, counts[i] - recent) for each session i) or use a baseline mean of the
first N sessions (e.g., baseline = mean(counts[:N]) then sum(max(0, counts[i] -
recent) for i>=N)), then reuse avg_severity_weight from _SEVERITY_SECONDS and
set result["estimated_seconds_saved"] and result["time_breakdown"] accordingly
(referencing variables initial, recent, counts, estimate_time, result,
_SEVERITY_SECONDS, avg_severity_weight).
```python
open_encrypted_db(self.dir, self._encryption_key)

self._instruction_cache: object | None = None  # lazy: InstructionCache
self._fired_rules: list = []  # Rules injected this session (for misfire attribution)
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Search for any code that appends to or assigns _fired_rules
rg -n '_fired_rules' --type=py -C3
```

Repository: Gradata/gradata
Length of output: 1927
**`_fired_rules` is never populated — domain attribution will silently fail.**

The `_fired_rules` list is initialized on line 70 but never populated in the application code. The grep search confirms it only appears in:

- `src/gradata/brain.py:70` — initialization as empty list
- `src/gradata/_core.py:56` — reading it in `_attribute_domain_fires()`
- `tests/test_rule_scoping.py:147` — manually set in test code only

Since nothing appends to this list during normal execution, the loop at `src/gradata/_core.py:56` never executes and domain fire/misfire counts remain permanently at their initial values. The rule injection flow must populate `_fired_rules` when rules are applied, likely in `apply_brain_rules()` or its callchain.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/brain.py` at line 70, The _fired_rules list is never updated so
domain attribution in _attribute_domain_fires() never runs; update the rule
application flow (e.g., inside apply_brain_rules() or the concrete function that
executes individual rules) to append a concise identifier for each rule when it
successfully fires (use the same identifier format tests expect, e.g., rule
id/name/object used in tests/test_rule_scoping.py), and ensure you append to
self._fired_rules on the Brain instance before/after any side-effectful action
so attribution logic can iterate over it; also ensure any existing test that
directly sets _fired_rules still passes by matching the same identifier
convention.
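A minimal sketch of the kind of change this prompt asks for. `apply_brain_rules()`, the `applied` list, and the `.lesson` attribute are assumptions drawn from the flowchart at the end of this review, not verified SDK API:

```python
# Hypothetical sketch; the real apply_brain_rules() call chain may differ.
def apply_brain_rules(self, ctx) -> str:
    applied = apply_rules(self.lessons, build_scope(ctx), bus=self.bus)
    # Record which rules were injected this session so
    # _attribute_domain_fires() can increment per-domain fires/misfires.
    self._fired_rules = [rule.lesson for rule in applied]
    return format_rules_for_prompt(applied)
```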
| elif meta_line.startswith("Domain scores:"): | ||
| import json as _json | ||
| try: | ||
| domain_scores = _json.loads(meta_line[len("Domain scores:"):].strip()) | ||
| except _json.JSONDecodeError: | ||
| domain_scores = {} |
🧹 Nitpick | 🔵 Trivial
**Move `json` import to module level.**

The `json` module is imported inside the parsing loop. While functional, this creates overhead on each lesson with domain scores. Consider moving the import to the module level alongside the existing imports.
♻️ Proposed refactor
Add at the top of the file with other imports:
```python
import json
```

Then simplify the parsing:

```diff
 elif meta_line.startswith("Domain scores:"):
-    import json as _json
     try:
-        domain_scores = _json.loads(meta_line[len("Domain scores:"):].strip())
-    except _json.JSONDecodeError:
+        domain_scores = json.loads(meta_line[len("Domain scores:"):].strip())
+    except json.JSONDecodeError:
         domain_scores = {}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/enhancements/self_improvement.py` around lines 305 - 310, Move
the in-loop import of json out to the module level and update the parsing to use
that module; specifically, add "import json" with the other top-level imports,
then in the block handling meta_line.startswith("Domain scores:") replace
"import json as _json" and uses of "_json.loads" / "_json.JSONDecodeError" with
"json.loads" and "json.JSONDecodeError" so domain_scores parsing still falls
back to {} on decode errors (refer to the meta_line handling and the
domain_scores variable).
```python
if lesson.domain_scores:
    import json as _json
    lines.append(f"  Domain scores: {_json.dumps(lesson.domain_scores)}")
```
🧹 Nitpick | 🔵 Trivial
**Same import inefficiency in formatting.**

Similar to the parsing code, `json` is imported inside the loop. If moved to module level as suggested above, update this block to use the module-level import.
♻️ Proposed refactor
```diff
 if lesson.domain_scores:
-    import json as _json
-    lines.append(f"  Domain scores: {_json.dumps(lesson.domain_scores)}")
+    lines.append(f"  Domain scores: {json.dumps(lesson.domain_scores)}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/enhancements/self_improvement.py` around lines 901 - 903, The
block that formats domain scores imports json inside the conditional (import
json as _json) which is inefficient; remove the inline import and use a
module-level import (e.g., import json at top of
src/gradata/enhancements/self_improvement.py) and update the usage in the block
that appends domain scores (the lines.append(f" Domain scores:
{_json.dumps(lesson.domain_scores)}") statement referencing
lesson.domain_scores) to call the module-level json.dumps (or the chosen alias)
instead.
```python
# Step 1.5 — domain scoping: filter out rules disabled for current domain
current_domain = scope.domain.upper() if scope.domain else ""
if current_domain:
    filtered = []
    for lesson in eligible:
        if is_rule_disabled_for_domain(lesson, current_domain):
            if bus:
                scores = lesson.domain_scores.get(current_domain, {})
                bus.emit("rule_scoped_out", {
                    "lesson_category": lesson.category,
                    "lesson_description": lesson.description[:80],
                    "domain": current_domain,
                    "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
                })
        else:
            filtered.append(lesson)
    eligible = filtered
```
**`rule_scoped_out` events won't fire in production — `bus` is not passed.**

The domain scoping filter and event emission logic is correct, but according to the context snippet from `src/gradata/brain.py:383-406`, the production call site `apply_brain_rules()` calls `apply_rules(lessons, build_scope(ctx))` without passing the `bus` parameter. This means `rule_scoped_out` events will never be emitted in normal usage.

Consider updating `apply_brain_rules()` to pass `self.bus` if observability of rule scoping is desired.
🔧 Suggested fix in brain.py
In `src/gradata/brain.py` around line 406:

```diff
-    return format_rules_for_prompt(apply_rules(lessons, build_scope(ctx)))
+    return format_rules_for_prompt(apply_rules(lessons, build_scope(ctx), bus=self.bus))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gradata/rules/rule_engine.py` around lines 489 - 505, The production call
to apply_rules via apply_brain_rules is not passing the event bus, so
rule_scoped_out events never fire; update apply_brain_rules (the method that
calls apply_rules(lessons, build_scope(ctx))) to pass the instance bus
(self.bus) as the bus argument when calling apply_rules, ensuring the existing
domain-scoping code in rule_engine.py (which checks bus and emits
"rule_scoped_out") can emit events; verify signature of apply_rules matches
(lessons, scope, bus=None) and propagate the change only in apply_brain_rules to
avoid breaking other callers.
```python
def test_convergence_per_category():
    """Should return per-category breakdown."""
    d = tempfile.mkdtemp()
    (Path(d) / "lessons.md").write_text("", encoding="utf-8")
    brain = Brain(d)
    # Session 1: 3 TONE, 2 CODE corrections
    for _ in range(3):
        brain.emit("CORRECTION", "test", {"category": "TONE", "severity": "minor"}, ["category:TONE"], 1)
    for _ in range(2):
        brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 1)
    # Session 2: 1 TONE, 2 CODE
    brain.emit("CORRECTION", "test", {"category": "TONE", "severity": "minor"}, ["category:TONE"], 2)
    for _ in range(2):
        brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 2)
    # Session 3: 0 TONE, 1 CODE
    brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 3)

    result = brain.convergence()
    assert "by_category" in result
    assert "TONE" in result["by_category"]
    assert "CODE" in result["by_category"]
    # TONE went 3→1→0 = converging
    assert result["by_category"]["TONE"]["corrections_per_session"] == [3, 1, 0]
```
**Test is incomplete — missing assertion for CODE category.**

The test sets up both TONE and CODE corrections across sessions but only asserts on TONE's `corrections_per_session`. Consider adding assertions for CODE to ensure both categories are tracked correctly.
💚 Proposed fix to complete the test
```diff
     # TONE went 3→1→0 = converging
     assert result["by_category"]["TONE"]["corrections_per_session"] == [3, 1, 0]
+    # CODE went 2→2→1
+    assert result["by_category"]["CODE"]["corrections_per_session"] == [2, 2, 1]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_convergence.py` around lines 110 - 132, The test
test_convergence_per_category creates CODE corrections across three sessions but
never asserts their per-session counts; update the test to assert that result
contains "by_category" and "CODE" (similar to the TONE checks) and that
result["by_category"]["CODE"]["corrections_per_session"] equals the expected
list [2, 2, 1] (session 1:2, session 2:2, session 3:1), so the CODE category is
validated alongside TONE in Brain.convergence().
```python
def test_efficiency_ratio_declining(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([10, 12, 8, 5, 4, 3])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert "effort_ratio" in result
    assert result["effort_ratio"] < 1.0
    assert result["corrections_initial"] == 10.0  # avg(10,12,8)
    assert result["corrections_recent"] == 4.0  # avg(5,4,3)
```
**Use `pytest.approx` for floating-point comparisons.**

Lines 34-35 compare floats using `==`, which can be fragile. While these specific values (10.0, 4.0) are exact, using `pytest.approx` is the recommended practice for float comparisons per coding guidelines.
💚 Proposed fix
```diff
+import pytest
+
 def test_efficiency_ratio_declining(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([10, 12, 8, 5, 4, 3])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
     assert "effort_ratio" in result
     assert result["effort_ratio"] < 1.0
-    assert result["corrections_initial"] == 10.0  # avg(10,12,8)
-    assert result["corrections_recent"] == 4.0  # avg(5,4,3)
+    assert result["corrections_initial"] == pytest.approx(10.0)  # avg(10,12,8)
+    assert result["corrections_recent"] == pytest.approx(4.0)  # avg(5,4,3)
```

As per coding guidelines: "floating point comparisons use pytest.approx"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_efficiency.py` around lines 27 - 35, In
test_efficiency_ratio_declining replace the direct float equality checks for
corrections_initial and corrections_recent with pytest.approx comparisons:
update the two asserts that compare result["corrections_initial"] and
result["corrections_recent"] (from the test that constructs brain via
_make_brain and uses _mock_convergence and brain.efficiency()) to use
pytest.approx(...) so they tolerate floating-point rounding; ensure pytest is
available/imported in the test module if not already.
```python
def test_efficiency_ratio_no_improvement(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([5, 5, 5, 5, 5, 5])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert result["effort_ratio"] == 1.0


def test_efficiency_insufficient_data(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([5, 3])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert result["effort_ratio"] == 1.0
    assert result["corrections_initial"] == 0
    assert result["corrections_recent"] == 0
```
**Same floating-point comparison issue.**

Lines 43 and 51 also use `==` for float comparisons.
💚 Proposed fix
```diff
 def test_efficiency_ratio_no_improvement(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([5, 5, 5, 5, 5, 5])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
-    assert result["effort_ratio"] == 1.0
+    assert result["effort_ratio"] == pytest.approx(1.0)

 def test_efficiency_insufficient_data(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([5, 3])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
-    assert result["effort_ratio"] == 1.0
+    assert result["effort_ratio"] == pytest.approx(1.0)
```

As per coding guidelines: "floating point comparisons use pytest.approx"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_efficiency.py` around lines 38 - 53, The assertions in
test_efficiency_ratio_no_improvement and test_efficiency_insufficient_data use
direct float equality for result["effort_ratio"]; change these to use
pytest.approx for robust floating-point comparison (e.g., assert
result["effort_ratio"] == pytest.approx(1.0)), and ensure the test file imports
pytest if not already; keep the other integer assertions for corrections_initial
and corrections_recent unchanged.
Summary
- `rule_scoped_out` emitted when a rule is suppressed by domain

Source

Sim 9 consensus (2000 posts, 200 agents) + Kenoodl synthesis. See `docs/superpowers/specs/2026-04-06-sim9-engine-hardening-design.md`.

Test plan

- `pytest tests/test_rule_scoping.py -v` — 8 domain scoping tests
- `pytest tests/test_convergence_gate.py -v` — 3 gate tests
- `pytest tests/test_efficiency.py -v` — 5 efficiency metric tests
- `pytest tests/test_convergence.py -v` — 11 Mann-Kendall tests
- `pytest tests/ -q` — full regression suite (1374 pass)
- `pyright src/gradata/` — 0 errors

Generated with Gradata
Greptile Summary
This PR hardens the Sim 9 engine with three new capabilities: per-domain rule scoping (auto-disables rules with >30% misfire rate in a given domain), a convergence gate (skips LLM extraction when a category's correction rate has statistically stabilized via Mann-Kendall), and a `brain.efficiency()` metric that computes an effort-ratio from correction trends with an optional time-saved estimate.

- `domain_scores` added to `Lesson`, `is_rule_disabled_for_domain()` added to `rule_engine`, and `_attribute_domain_fires()` added to `_core.py`; a new `rule_scoped_out` EventBus event is emitted when a rule is suppressed per domain.
- `_get_convergence()` caches the Mann-Kendall result per session; per-category trend is checked before calling `extract_behavioral_instruction`, saving LLM calls when a category is already stable.
- `brain.efficiency()` computes the `recent / initial` correction-rate ratio; optional `estimate_time=True` adds a severity-weighted seconds-saved estimate (usage sketch below).
- Two findings to note: the per-category trend mapping has no `insufficient_data` guard (sparse-data categories can be falsely labeled `converged`, silently suppressing extraction), and the `bus.emit("rule_scoped_out", ...)` call inside the filtering loop is not wrapped in try-except, meaning a misbehaving subscriber can abort the loop and truncate the applied rule set.
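A usage sketch of the new metric as the tests above exercise it; the import path and directory argument are assumptions:

```python
from gradata import Brain  # import path assumed

brain = Brain("./my_brain")  # hypothetical brain directory
metrics = brain.efficiency(estimate_time=True)

effort_ratio = metrics["effort_ratio"]          # recent / initial correction rate
saved = metrics.get("estimated_seconds_saved")  # present only with estimate_time=True
```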
Confidence Score: 3/5

Two P1 logic bugs found that should be fixed before merge: sparse-category convergence mislabeling and an unguarded EventBus emit inside the rule-filtering loop.
The overall architecture is clean, test coverage is solid, and the three new features are well-reasoned. However, the per-category convergence gap is a production-quality bug — a category with sparse history will be silently marked `converged` and skip extraction, degrading lesson quality over time. The unguarded `bus.emit` in `apply_rules` violates Rule 4 and can cause silent rule-set truncation. Both are targeted fixes (a few lines each). P2s are minor style concerns. Score of 3 reflects two genuine logic bugs on the critical path that warrant a fix before merge.

Files needing attention: `src/gradata/_core.py` (`brain_convergence` per-category mapping, lines 962–967; `locals()` pattern, lines 317–319) and `src/gradata/rules/rule_engine.py` (unguarded `bus.emit` in `apply_rules` loop, lines 494–502).
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[brain.correct called] --> B[compute_diff + classify_edits]
    B --> C{_get_convergence\ncached this session?}
    C -->|yes| D[use cached result]
    C -->|no| E[brain_convergence — DB query]
    E --> F[Mann-Kendall overall + per-category]
    F --> D
    D --> G{cat_convergence.trend\n== 'converged'?}
    G -->|yes| H[use primary.description\nskip LLM extraction]
    G -->|no| I[extract_behavioral_instruction\nLLM + cache + templates]
    H --> J[find / create / update Lesson]
    I --> J
    J --> K[update_confidence + write lessons.md]
    K --> L{_fired_rules present?}
    L -->|yes| M[_attribute_domain_fires\nincrement domain fires/misfires]
    M --> N[bus.emit correction.created]
    L -->|no| N
    subgraph apply_rules [apply_rules — rule selection]
        O[filter PATTERN + RULE] --> P{scope.domain set?}
        P -->|yes| Q{is_rule_disabled_for_domain?}
        Q -->|yes| R[bus.emit rule_scoped_out\ndrop rule]
        Q -->|no| S[keep rule]
        P -->|no| S
        S --> T[score + rank → AppliedRule list]
    end
```
Reviews (1): Last reviewed commit: "feat: add brain.efficiency() — effort-sa..."
Context used:
Rule 1: Never use print() ... (source)