
feat: Sim 9 engine hardening — rule scoping, convergence gate, efficiency metric#15

Merged
Gradata merged 6 commits into main from feat/sim9-improvements
Apr 6, 2026

Conversation


@Gradata Gradata commented Apr 6, 2026

Summary

  • Cherry-picks Mann-Kendall convergence upgrade (per-category tracking, p-values)
  • Rule domain scoping: per-domain misfire tracking on lessons, auto-disables rules with >30% misfire rate in a domain
  • Convergence gate: skips LLM extraction when a category is converged (Mann-Kendall flat)
  • brain.efficiency(): effort-ratio metric with optional time estimates (Kenoodl reframe: "reduced toil" not "AI learned")
  • New events: rule_scoped_out emitted when a rule is suppressed by domain

Source

Sim 9 consensus (2000 posts, 200 agents) + Kenoodl synthesis. See docs/superpowers/specs/2026-04-06-sim9-engine-hardening-design.md.

Test plan

  • pytest tests/test_rule_scoping.py -v — 8 domain scoping tests
  • pytest tests/test_convergence_gate.py -v — 3 gate tests
  • pytest tests/test_efficiency.py -v — 5 efficiency metric tests
  • pytest tests/test_convergence.py -v — 11 Mann-Kendall tests
  • pytest tests/ -q — full regression suite (1374 pass)
  • pyright src/gradata/ — 0 errors

Generated with Gradata

Greptile Summary

This PR hardens the Sim 9 engine with three new capabilities: per-domain rule scoping (auto-disables rules with >30% misfire rate in a given domain), a convergence gate (skips LLM extraction when a category's correction rate has statistically stabilized via Mann-Kendall), and a brain.efficiency() metric that computes an effort-ratio from correction trends with an optional time-saved estimate.

  • Rule domain scoping: domain_scores added to Lesson, is_rule_disabled_for_domain() added to rule_engine, and _attribute_domain_fires() added to _core.py; a new rule_scoped_out EventBus event is emitted when a rule is suppressed per domain.
  • Convergence gate: _get_convergence() caches the Mann-Kendall result per-session; per-category trend is checked before calling extract_behavioral_instruction, saving LLM calls when a category is already stable.
  • Efficiency metric: brain.efficiency() computes recent / initial correction-rate ratio; optional estimate_time=True adds a severity-weighted seconds-saved estimate.
  • Two issues require attention: the per-category convergence branch lacks an insufficient_data guard (sparse-data categories can be falsely labeled converged, silently suppressing extraction), and the bus.emit("rule_scoped_out", ...) call inside the filtering loop is not wrapped in try-except, meaning a misbehaving subscriber can abort the loop and truncate the applied rule set.
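
A minimal sketch of the >30% misfire threshold described above (the ≥3-fires floor comes from the walkthrough below; the field names and the real `is_rule_disabled_for_domain` signature, which takes a Lesson, are assumptions for illustration):

```python
# Hypothetical reconstruction of the per-domain disable check; a plain dict
# stands in for Lesson.domain_scores to show the same decision logic.
MISFIRE_THRESHOLD = 0.30   # misfire rate above which a rule is scoped out
MIN_FIRES = 3              # require some evidence before disabling

def is_rule_disabled_for_domain(
    domain_scores: dict[str, dict[str, int]], domain: str
) -> bool:
    scores = domain_scores.get(domain, {})
    fires = scores.get("fires", 0)
    if fires < MIN_FIRES:
        return False  # too little history to judge this rule in this domain
    return scores.get("misfires", 0) / fires > MISFIRE_THRESHOLD

# A rule that misfired 4 of 10 times in "sql" (40% > 30%) gets scoped out;
# one with a single fire is left alone regardless of outcome.
print(is_rule_disabled_for_domain({"sql": {"fires": 10, "misfires": 4}}, "sql"))  # True
print(is_rule_disabled_for_domain({"sql": {"fires": 1, "misfires": 1}}, "sql"))   # False
```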

Confidence Score: 3/5

Two P1 logic bugs found that should be fixed before merge: sparse-category convergence mislabeling and an unguarded EventBus emit inside the rule-filtering loop.

The overall architecture is clean, test coverage is solid, and the three new features are well-reasoned. However, the per-category convergence gap is a production-quality bug — a category with sparse history will be silently marked converged and skip extraction, degrading lesson quality over time. The unguarded bus.emit in apply_rules violates Rule 4 and can cause silent rule-set truncation. Both are targeted fixes (a few lines each). P2s are minor style concerns. Score of 3 reflects two genuine logic bugs on the critical path that warrant a fix before merge.

src/gradata/_core.py (brain_convergence per-category mapping, lines 962–967; locals() pattern, lines 317–319) and src/gradata/rules/rule_engine.py (unguarded bus.emit in apply_rules loop, lines 494–502)

Important Files Changed

Filename Overview
src/gradata/_core.py Convergence gate + domain fire attribution added; per-category convergence lacks insufficient_data guard and uses fragile locals() check
src/gradata/rules/rule_engine.py Domain-scoping filter added to apply_rules; bus.emit(rule_scoped_out) inside loop is not try-except protected
src/gradata/brain.py Adds efficiency() and _get_convergence() with correct per-session caching; clean delegation pattern
src/gradata/enhancements/self_improvement.py domain_scores parse/format round-trip implemented cleanly; backward-compatible
src/gradata/_types.py domain_scores field added to Lesson with correct default_factory; no issues
tests/test_rule_scoping.py 8 scoping tests covering disable thresholds, event emission, and attribution; missing from future import annotations
tests/test_convergence_gate.py 3 gate tests covering skip, run, and cache behaviour; missing from future import annotations
tests/test_efficiency.py 5 efficiency tests including edge cases (insufficient data, flat rate, time estimates); well-structured
tests/test_convergence.py 11 Mann-Kendall tests covering monotone, noisy, flat, and per-category scenarios; comprehensive
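
The domain_scores round-trip noted for self_improvement.py reduces to serializing a dict field as JSON and tolerating its absence. A sketch (the Lesson shape and on-disk format here are assumptions, not the SDK's actual serialization):

```python
import json
from dataclasses import dataclass, field

# Hypothetical Lesson shape; the real dataclass lives in gradata._types.
@dataclass
class Lesson:
    category: str
    description: str
    domain_scores: dict[str, dict[str, int]] = field(default_factory=dict)

def format_domain_scores(lesson: Lesson) -> str:
    # Persist the per-domain fire/misfire counters as a JSON string.
    return json.dumps(lesson.domain_scores, sort_keys=True)

def parse_domain_scores(raw: str) -> dict[str, dict[str, int]]:
    # Missing metadata parses to an empty dict, keeping old lessons valid.
    return json.loads(raw) if raw else {}

lesson = Lesson("STYLE", "prefer f-strings", {"sql": {"fires": 4, "misfires": 2}})
assert parse_domain_scores(format_domain_scores(lesson)) == lesson.domain_scores
```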

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[brain.correct called] --> B[compute_diff + classify_edits]
    B --> C{_get_convergence\ncached this session?}
    C -->|yes| D[use cached result]
    C -->|no| E[brain_convergence — DB query]
    E --> F[Mann-Kendall overall + per-category]
    F --> D
    D --> G{cat_convergence.trend\n== 'converged'?}
    G -->|yes| H[use primary.description\nskip LLM extraction]
    G -->|no| I[extract_behavioral_instruction\nLLM + cache + templates]
    H --> J[find / create / update Lesson]
    I --> J
    J --> K[update_confidence + write lessons.md]
    K --> L{_fired_rules present?}
    L -->|yes| M[_attribute_domain_fires\nincrement domain fires/misfires]
    M --> N[bus.emit correction.created]
    L -->|no| N

    subgraph apply_rules [apply_rules — rule selection]
        O[filter PATTERN + RULE] --> P{scope.domain set?}
        P -->|yes| Q{is_rule_disabled_for_domain?}
        Q -->|yes| R[bus.emit rule_scoped_out\ndrop rule]
        Q -->|no| S[keep rule]
        P -->|no| S
        S --> T[score + rank → AppliedRule list]
    end
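
The extraction gate in the flowchart reduces to roughly the following (function and field names are illustrative, and the LLM call is stubbed out, so this is a sketch rather than the SDK's real code path):

```python
def extract_behavioral_instruction(category: str) -> str:
    # Stand-in for the real LLM extraction call.
    return f"llm-extracted instruction for {category}"

def gated_instruction(category: str, convergence: dict, primary_description: str) -> str:
    # When this category's correction trend is statistically flat ("converged"),
    # reuse the classifier's description instead of spending an LLM call.
    cat = convergence.get("by_category", {}).get(category, {})
    if cat.get("trend") == "converged":
        return primary_description
    return extract_behavioral_instruction(category)

conv = {"by_category": {"STYLE": {"trend": "converged", "p_value": 0.03}}}
print(gated_instruction("STYLE", conv, "prefer f-strings"))   # reuses the description
print(gated_instruction("NAMING", conv, "use snake_case"))    # falls through to extraction
```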


Reviews (1): Last reviewed commit: "feat: add brain.efficiency() — effort-sa..."

Greptile also left 5 inline comments on this PR.

Context used:

  • Rule used - # Code Review Rules

Rule 1: Never use print() ... (source)

Oliver Le and others added 6 commits April 6, 2026 12:59
…gory tracking

Replaces naive first-half/second-half comparison with Mann-Kendall
statistical trend test (pure Python, no scipy). Now returns:
- p_value for trend significance
- by_category breakdown (each category's own convergence curve)
- Proper handling of ties and noisy data

Driven by MiroFish Sim 9 consensus: 200 personas unanimously
recommended Mann-Kendall over the half-split approach.

Co-Authored-By: Gradata <noreply@gradata.ai>
Co-Authored-By: Gradata <noreply@gradata.ai>
Co-Authored-By: Gradata <noreply@gradata.ai>
Co-Authored-By: Gradata <noreply@gradata.ai>
Co-Authored-By: Gradata <noreply@gradata.ai>
… estimates

Co-Authored-By: Gradata <noreply@gradata.ai>
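
The pure-Python Mann-Kendall test described in the first commit message might look roughly like this; it is a sketch using the untied-data variance formula (the commit says the real `_mann_kendall` also handles ties, so it will differ):

```python
import math

def mann_kendall(xs: list[float]) -> tuple[str, float]:
    # S statistic: count of increasing pairs minus decreasing pairs.
    n = len(xs)
    s = sum(
        (xs[j] > xs[i]) - (xs[j] < xs[i])
        for i in range(n) for j in range(i + 1, n)
    )
    # Normal approximation of Var(S) for untied data.
    var = n * (n - 1) * (2 * n + 5) / 18
    z = (s - math.copysign(1, s)) / math.sqrt(var) if s != 0 else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided p-value
    if p < 0.05 and s < 0:
        return "decreasing", p
    if p < 0.05 and s > 0:
        return "increasing", p
    return "no_trend", p

trend, p = mann_kendall([9, 7, 6, 4, 2, 1, 0])  # corrections falling session over session
print(trend)  # "decreasing" — i.e. the brain is converging
```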
@coderabbitai

coderabbitai Bot commented Apr 6, 2026

📝 Walkthrough

The changes implement convergence-aware behavioral instruction gating, statistical Mann–Kendall trend testing, domain-scoped rule disabling via misfire rate thresholds, and a new efficiency metric API to quantify effort savings across learning sessions.

Changes

Cohort / File(s) Summary
Core Convergence & Efficiency
src/gradata/_core.py, src/gradata/_types.py, src/gradata/brain.py
Replaces heuristic trend detection with Mann–Kendall statistical testing; gates behavioral instruction extraction when category convergence is "converged"; adds domain_scores tracking to Lesson; introduces brain_efficiency() API and Brain.efficiency() method; implements convergence result caching via _get_convergence() helper.
Domain-Scoped Rule Management
src/gradata/rules/rule_engine.py, src/gradata/enhancements/self_improvement.py
Adds domain-scoped rule disabling when misfire rate exceeds 30% after ≥3 fires; extends apply_rules() to filter ineligible lessons and emit rule_scoped_out events; updates lesson serialization to persist/parse domain_scores metadata as JSON.
Test Coverage
tests/test_convergence.py, tests/test_convergence_gate.py, tests/test_efficiency.py, tests/test_rule_scoping.py
Adds comprehensive test suites validating Mann–Kendall trend detection and p_value output, convergence-gated instruction extraction, efficiency metric computation with optional time estimates, domain-scoped rule filtering and misfire attribution, and domain_scores round-trip serialization.
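
The efficiency computation summarized above can be sketched as follows. The three-session windows, the flat seconds-per-correction weight (the review mentions a severity-weighted `_SEVERITY_SECONDS` table instead), and the dict shape are assumptions; the post-baseline multiplier follows the fix suggested in the review below:

```python
def brain_efficiency(counts: list[int], estimate_time: bool = False,
                     seconds_per_correction: float = 90.0) -> dict[str, float]:
    # counts: corrections per session, oldest first.
    if len(counts) < 6:
        return {"effort_ratio": 1.0}  # not enough history for two 3-session windows
    initial = sum(counts[:3]) / 3     # baseline: first three sessions
    recent = sum(counts[-3:]) / 3     # current: last three sessions
    out = {"effort_ratio": recent / initial if initial else 1.0}
    if estimate_time:
        # Only sessions after the baseline window can have avoided corrections.
        avoided = max(0, (initial - recent) * max(0, len(counts) - 3))
        out["estimated_seconds_saved"] = avoided * seconds_per_correction
    return out

print(brain_efficiency([10, 10, 10, 6, 4, 4, 4, 2]))  # effort_ratio ≈ 0.33
```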

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Brain
    participant RuleEngine as Rule Engine
    participant EventBus as Event Bus
    participant Core as _core.py

    Client->>Brain: correct(correction, category)
    activate Brain
    Brain->>Brain: _get_convergence()
    activate Brain
    alt Cached & same session
        Brain-->>Brain: return cached result
    else Cache miss or session changed
        Brain->>Core: brain_convergence(self)
        activate Core
        Core->>Core: _mann_kendall() for trend
        Core-->>Brain: {trend, p_value, by_category}
        deactivate Core
        Brain->>Brain: store in _convergence_cache
    end
    deactivate Brain
    
    alt Category trend is "converged"
        Brain->>Brain: skip extract_behavioral_instruction()
        Brain->>Brain: use primary.description
    else Trend is "diverging" or "converging"
        Brain->>Brain: call extract_behavioral_instruction()
    end
    
    Brain->>RuleEngine: apply_rules(..., domain)
    activate RuleEngine
    RuleEngine->>RuleEngine: filter by is_rule_disabled_for_domain()
    
    alt Rule disabled for domain
        RuleEngine->>EventBus: emit "rule_scoped_out"
        RuleEngine-->>EventBus: {domain, category, misfire_rate}
    end
    RuleEngine-->>Brain: eligible_lessons
    deactivate RuleEngine
    
    Brain->>Core: _attribute_domain_fires()
    activate Core
    Core->>Core: increment fires/misfires in domain_scores
    deactivate Core
    
    Brain-->>Client: lesson created/updated
    deactivate Brain
sequenceDiagram
    participant Client
    participant Brain
    participant Core as _core.py

    Client->>Brain: efficiency(estimate_time=False)
    activate Brain
    Brain->>Brain: _get_convergence()
    activate Brain
    Brain->>Core: brain_convergence(self)
    activate Core
    Core-->>Brain: {by_category, ...}
    deactivate Core
    deactivate Brain
    
    Brain->>Core: brain_efficiency(self)
    activate Core
    Core->>Core: extract first/last 3 sessions
    Core->>Core: compute effort_ratio
    alt estimate_time is True
        Core->>Core: multiply ratio by _SEVERITY_SECONDS
        Core->>Core: compute estimated_seconds_saved
    end
    Core-->>Brain: {effort_ratio, estimated_seconds_saved?, ...}
    deactivate Core
    
    Brain-->>Client: efficiency metrics dict
    deactivate Brain

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

feature, breaking-change

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 64.58%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the title precisely summarizes the three main features introduced: rule scoping, convergence gate, and efficiency metric.
  • Description check — ✅ Passed: the PR description comprehensively outlines the changes: Mann-Kendall convergence upgrade, rule domain scoping with per-domain misfire tracking, convergence gate for LLM extraction, brain.efficiency() metric, and test coverage.



Comment thread src/gradata/_core.py
Comment on lines +962 to +967

```python
by_category: dict[str, dict] = {}
for cat, session_counts in cat_by_session.items():
    cat_counts = [session_counts.get(s, 0) for s in sessions]
    cat_mk, cat_p = _mann_kendall(cat_counts)
    cat_trend = "converging" if cat_mk == "decreasing" else (
        "diverging" if cat_mk == "increasing" else "converged")
```

P1 Per-category convergence has no insufficient_data guard

At the top level, brain_convergence correctly emits "insufficient_data" when len(counts) < 3. But the per-category branch maps every "no_trend" result straight to "converged":

```python
cat_trend = "converging" if cat_mk == "decreasing" else (
    "diverging" if cat_mk == "increasing" else "converged")
```

cat_counts is built with zeros for every session where the category had no corrections, so a category that appeared in only one or two sessions out of seven gets a list like [0, 0, 0, 5, 0, 0, 0]. Because this sequence has no monotonic trend, Mann-Kendall returns "no_trend", which the mapping above turns into "converged". The convergence gate in _core.py then sees trend == "converged" and skips LLM extraction for future corrections in that category, even though the category has never meaningfully converged — it simply has sparse history.

Suggested fix:

```python
if cat_mk == "decreasing":
    cat_trend = "converging"
elif cat_mk == "increasing":
    cat_trend = "diverging"
elif sum(cat_counts) < 3:          # not enough non-zero data
    cat_trend = "insufficient_data"
else:
    cat_trend = "converged"
```

The convergence gate already checks cat_convergence.get("trend") == "converged", so adding the "insufficient_data" branch here is a backward-compatible fix — categories with thin history will continue to get LLM extraction.
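
The sparse-history failure can be checked directly with the S statistic Mann-Kendall is built on (a self-contained sketch, not the SDK's `_mann_kendall`):

```python
from itertools import combinations

def mk_s(xs: list[int]) -> int:
    # Mann-Kendall S: sum of sign(x_j - x_i) over all pairs i < j.
    return sum((b > a) - (b < a) for a, b in combinations(xs, 2))

print(mk_s([0, 0, 0, 5, 0, 0, 0]))   # 0  → "no_trend", mislabeled "converged"
print(mk_s([9, 7, 6, 4, 2, 1, 0]))   # -21 → genuinely decreasing (converging)
```

The lone burst of 5 corrections contributes +3 against the zeros before it and -3 against the zeros after, cancelling to S = 0 — no trend, despite there being no evidence of convergence.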


Comment on lines +494 to +502

```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        scores = lesson.domain_scores.get(current_domain, {})
        bus.emit("rule_scoped_out", {
            "lesson_category": lesson.category,
            "lesson_description": lesson.description[:80],
            "domain": current_domain,
            "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
        })
```

P1 Unguarded bus.emit inside eligibility-filter loop — Rule 4 violation

The bus.emit("rule_scoped_out", ...) call is inside a for lesson in eligible loop with no try-except:

```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        scores = lesson.domain_scores.get(current_domain, {})
        bus.emit("rule_scoped_out", { ... })
```

Per Rule 4, EventBus emit calls must be wrapped so a misbehaving subscriber cannot propagate exceptions to the caller. If the handler raises here, the for loop terminates early — rules that should have been kept in eligible are silently dropped, producing a truncated rule set with no warning.

Suggested change:

```python
if is_rule_disabled_for_domain(lesson, current_domain):
    if bus:
        try:
            scores = lesson.domain_scores.get(current_domain, {})
            bus.emit("rule_scoped_out", {
                "lesson_category": lesson.category,
                "lesson_description": lesson.description[:80],
                "domain": current_domain,
                "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
            })
        except Exception:
            _log.debug("rule_scoped_out emit failed")
```

Comment thread src/gradata/_core.py
Comment on lines +317 to +319

```python
correction_desc = ""
if 'desc' in locals():
    correction_desc = desc
```

P2 locals() inspection is fragile for desc-variable lookup

desc is assigned inside the broad try block that ends at line 311. Whether it exists in the local namespace after the except depends on exactly how far execution got. Using 'desc' in locals() to probe this is an anti-pattern — easy to misread, and it can silently yield an empty correction_desc if the assignment was skipped partway through the block.

A cleaner approach is to initialise a dedicated variable before the try block:

```python
_extracted_desc: str = ""  # initialise before the try block
# ... inside the try, after desc is computed:
_extracted_desc = desc
# ... after the try block:
if brain._fired_rules and (category or classifications):
    _attribute_domain_fires(brain, category or "UNKNOWN", _extracted_desc or summary)
```

This makes the intent explicit and eliminates the locals() introspection.


Comment on lines +1 to +2

```python
"""Tests for rule domain scoping — per-domain misfire tracking and auto-disable."""
from gradata._types import Lesson, LessonState
```

P2 Missing from __future__ import annotations (Rule 6)

Both new test files added in this PR are missing the required import. Per Rule 6, all .py files in the SDK should include from __future__ import annotations to enable str | None syntax consistently and maintain Pyright strict-mode compatibility.

This applies here (tests/test_rule_scoping.py) and also to tests/test_convergence_gate.py line 1.

Suggested change (note the import goes after the module docstring — a __future__ import may be preceded only by the docstring, comments, and blank lines, and placing it first would demote the docstring to a plain string literal):

```python
"""Tests for rule domain scoping — per-domain misfire tracking and auto-disable."""
from __future__ import annotations
```

Rule Used: # Code Review Rules

Rule 1: Never use print() ... (source)


Comment thread src/gradata/_core.py

```python
}

if estimate_time:
    corrections_avoided = max(0, (initial - recent) * len(counts))
```

P2 corrections_avoided multiplier includes the baseline period, overestimating savings

```python
corrections_avoided = max(0, (initial - recent) * len(counts))
```

This multiplies by the total session count, including the first three sessions used to establish initial. In a 20-session brain with initial = 10, recent = 4, the formula yields 6 × 20 = 120 — but sessions 1–3 were already running at the initial correction rate, so nothing was avoided there.

A more accurate (still approximate) estimate uses only post-baseline sessions:

Suggested change:

```python
corrections_avoided = max(0, (initial - recent) * max(0, len(counts) - 3))
```

This is still labelled "approximate" in the docstring, but it avoids inflating the reported savings for long-running brains.
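
The arithmetic in the example above checks out as follows:

```python
# Worked numbers from the 20-session example in the comment above.
initial, recent, n_sessions = 10, 4, 20

old_estimate = max(0, (initial - recent) * n_sessions)              # 6 * 20 = 120
new_estimate = max(0, (initial - recent) * max(0, n_sessions - 3))  # 6 * 17 = 102

print(old_estimate, new_estimate)  # 120 102
```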



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/gradata/_core.py (1)

1-7: 🧹 Nitpick | 🔵 Trivial

File exceeds 500-line limit (1040 lines).

As per coding guidelines, Python files should be kept under 500 lines. This file has grown to 1040 lines with the new additions. Consider splitting into focused modules:

  • _core_correct.py: brain_correct, _attribute_domain_fires
  • _core_session.py: brain_end_session
  • _core_convergence.py: brain_convergence, _mann_kendall, _normal_cdf, brain_efficiency
  • _core_export.py: export helpers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gradata/_core.py` around lines 1 - 7, The file src/gradata/_core.py is
too large; split it into focused modules and update imports: extract heavy
correctness logic into a new _core_correct.py containing brain_correct and
_attribute_domain_fires, extract session logic into _core_session.py with
brain_end_session, extract convergence/statistics into _core_convergence.py with
brain_convergence, _mann_kendall, _normal_cdf and brain_efficiency, and move
export-related helpers into _core_export.py; in the original _core.py leave thin
delegator functions or import-and-reexport these symbols (e.g., from
._core_correct import brain_correct, _attribute_domain_fires) so external
callers and the Brain class continue to reference the same function names, and
update any local imports/relative references to these functions/classes to use
the new modules.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/gradata/_core.py`:
- Around line 962-972: The per-category trend mapping currently always maps a
"no_trend" result from _mann_kendall to "converged", which ignores sparse data;
update the logic in the by_category loop (where cat_counts, cat_mk, cat_p are
computed) to mirror the global logic: when cat_mk == "no_trend" return
"converged" only if len(cat_counts) >= 3, otherwise set "insufficient_data";
keep the existing mappings for "increasing" -> "diverging" and "decreasing" ->
"converging" and ensure by_category[cat]["trend"] uses this corrected cat_trend
determination.
- Around line 1030-1038: The current time-estimate overstates savings by
computing corrections_avoided = (initial - recent) * len(counts); instead,
compute avoided corrections per session (or only for sessions after an initial
baseline window) and sum those non-negative differences so you don’t assume
every session would have stayed at the initial rate. Update the block guarded by
estimate_time to derive corrections_avoided as sum(max(0, counts[i] - recent)
for each session i) or use a baseline mean of the first N sessions (e.g.,
baseline = mean(counts[:N]) then sum(max(0, counts[i] - recent) for i>=N)), then
reuse avg_severity_weight from _SEVERITY_SECONDS and set
result["estimated_seconds_saved"] and result["time_breakdown"] accordingly
(referencing variables initial, recent, counts, estimate_time, result,
_SEVERITY_SECONDS, avg_severity_weight).
- Around line 314-324: The code uses if 'desc' in locals() to detect a variable
set earlier; instead initialize desc explicitly before the classification
branches (e.g., set desc = "" near where cat/category/classifications are
prepared) so later code can rely on it, then replace the locals() check in the
attribution block with a simple use of desc (e.g., use correction_desc = desc or
correction_desc = summary fallback) when calling _attribute_domain_fires(brain,
category or "UNKNOWN", correction_desc); update references to desc across the
classification branches to assign into this initialized variable.

In `@src/gradata/brain.py`:
- Line 70: The _fired_rules list is never updated so domain attribution in
_attribute_domain_fires() never runs; update the rule application flow (e.g.,
inside apply_brain_rules() or the concrete function that executes individual
rules) to append a concise identifier for each rule when it successfully fires
(use the same identifier format tests expect, e.g., rule id/name/object used in
tests/test_rule_scoping.py), and ensure you append to self._fired_rules on the
Brain instance before/after any side-effectful action so attribution logic can
iterate over it; also ensure any existing test that directly sets _fired_rules
still passes by matching the same identifier convention.

In `@src/gradata/enhancements/self_improvement.py`:
- Around line 305-310: Move the in-loop import of json out to the module level
and update the parsing to use that module; specifically, add "import json" with
the other top-level imports, then in the block handling
meta_line.startswith("Domain scores:") replace "import json as _json" and uses
of "_json.loads" / "_json.JSONDecodeError" with "json.loads" and
"json.JSONDecodeError" so domain_scores parsing still falls back to {} on decode
errors (refer to the meta_line handling and the domain_scores variable).
- Around line 901-903: The block that formats domain scores imports json inside
the conditional (import json as _json) which is inefficient; remove the inline
import and use a module-level import (e.g., import json at top of
src/gradata/enhancements/self_improvement.py) and update the usage in the block
that appends domain scores (the lines.append(f"  Domain scores:
{_json.dumps(lesson.domain_scores)}") statement referencing
lesson.domain_scores) to call the module-level json.dumps (or the chosen alias)
instead.

In `@src/gradata/rules/rule_engine.py`:
- Around line 489-505: The production call to apply_rules via apply_brain_rules
is not passing the event bus, so rule_scoped_out events never fire; update
apply_brain_rules (the method that calls apply_rules(lessons, build_scope(ctx)))
to pass the instance bus (self.bus) as the bus argument when calling
apply_rules, ensuring the existing domain-scoping code in rule_engine.py (which
checks bus and emits "rule_scoped_out") can emit events; verify signature of
apply_rules matches (lessons, scope, bus=None) and propagate the change only in
apply_brain_rules to avoid breaking other callers.

In `@tests/test_convergence.py`:
- Around line 110-132: The test test_convergence_per_category creates CODE
corrections across three sessions but never asserts their per-session counts;
update the test to assert that result contains "by_category" and "CODE" (similar
to the TONE checks) and that
result["by_category"]["CODE"]["corrections_per_session"] equals the expected
list [2, 2, 1] (session 1:2, session 2:2, session 3:1), so the CODE category is
validated alongside TONE in Brain.convergence().

In `@tests/test_efficiency.py`:
- Around line 27-35: In test_efficiency_ratio_declining replace the direct float
equality checks for corrections_initial and corrections_recent with
pytest.approx comparisons: update the two asserts that compare
result["corrections_initial"] and result["corrections_recent"] (from the test
that constructs brain via _make_brain and uses _mock_convergence and
brain.efficiency()) to use pytest.approx(...) so they tolerate floating-point
rounding; ensure pytest is available/imported in the test module if not already.
- Around line 38-53: The assertions in test_efficiency_ratio_no_improvement and
test_efficiency_insufficient_data use direct float equality for
result["effort_ratio"]; change these to use pytest.approx for robust
floating-point comparison (e.g., assert result["effort_ratio"] ==
pytest.approx(1.0)), and ensure the test file imports pytest if not already;
keep the other integer assertions for corrections_initial and corrections_recent
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ba4e2b1-7dc5-4ce5-995b-286aa05b305e

📥 Commits

Reviewing files that changed from the base of the PR and between 9f1b116 and 6c1b5b6.

📒 Files selected for processing (9)
  • src/gradata/_core.py
  • src/gradata/_types.py
  • src/gradata/brain.py
  • src/gradata/enhancements/self_improvement.py
  • src/gradata/rules/rule_engine.py
  • tests/test_convergence.py
  • tests/test_convergence_gate.py
  • tests/test_efficiency.py
  • tests/test_rule_scoping.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*

📄 CodeRabbit inference engine (CLAUDE.md)

Always read a file before editing it. Always prefer editing over creating new files.

Files:

  • src/gradata/brain.py
  • src/gradata/enhancements/self_improvement.py
  • tests/test_rule_scoping.py
  • tests/test_convergence.py
  • tests/test_efficiency.py
  • src/gradata/rules/rule_engine.py
  • tests/test_convergence_gate.py
  • src/gradata/_types.py
  • src/gradata/_core.py
**/*.{py,sh,js,ts,json,env*}

📄 CodeRabbit inference engine (CLAUDE.md)

Never hardcode secrets. Never commit .env files. Pre-commit hook blocks both.

Files:

  • src/gradata/brain.py
  • src/gradata/enhancements/self_improvement.py
  • tests/test_rule_scoping.py
  • tests/test_convergence.py
  • tests/test_efficiency.py
  • src/gradata/rules/rule_engine.py
  • tests/test_convergence_gate.py
  • src/gradata/_types.py
  • src/gradata/_core.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Keep files under 500 lines. Validate input at system boundaries.

Files:

  • src/gradata/brain.py
  • src/gradata/enhancements/self_improvement.py
  • tests/test_rule_scoping.py
  • tests/test_convergence.py
  • tests/test_efficiency.py
  • src/gradata/rules/rule_engine.py
  • tests/test_convergence_gate.py
  • src/gradata/_types.py
  • src/gradata/_core.py
src/gradata/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use AGPL-3.0 licensing. Source code resides in src/gradata/.

Files:

  • src/gradata/brain.py
  • src/gradata/enhancements/self_improvement.py
  • src/gradata/rules/rule_engine.py
  • src/gradata/_types.py
  • src/gradata/_core.py

⚙️ CodeRabbit configuration file

src/gradata/**/*.py: This is the core SDK. Check for: type safety (from future import annotations required), no print()
statements (use logging), all functions accepting BrainContext where DB access occurs, no hardcoded paths. Severity
scoring must clamp to [0,1]. Confidence values must be in [0.0, 1.0].

Files:

  • src/gradata/brain.py
  • src/gradata/enhancements/self_improvement.py
  • src/gradata/rules/rule_engine.py
  • src/gradata/_types.py
  • src/gradata/_core.py
tests/**

⚙️ CodeRabbit configuration file

tests/**: Test files. Verify: no hardcoded paths, assertions check specific values not just truthiness,
parametrized tests preferred for boundary conditions, floating point comparisons use pytest.approx.

Files:

  • tests/test_rule_scoping.py
  • tests/test_convergence.py
  • tests/test_efficiency.py
  • tests/test_convergence_gate.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: Gradata/gradata PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T15:53:08.573Z
Learning: Correction pipeline: correct → extract behavioral instruction → graduate (INSTINCT→PATTERN→RULE) → inject into next session. Meta-rules emerge when patterns cluster across domains.
🔇 Additional comments (13)
src/gradata/_types.py (1)

125-125: LGTM!

The new domain_scores field is well-typed with proper default factory initialization. The nested dict[str, dict[str, int]] structure appropriately models per-domain fire/misfire tracking.

src/gradata/brain.py (2)

344-351: LGTM — convergence caching logic is correct.

The session-based caching correctly compares self._convergence_session == self.session before returning cached results, ensuring fresh data when the session changes.


353-360: LGTM!

The efficiency() method properly delegates to brain_efficiency with keyword-only estimate_time parameter.

src/gradata/rules/rule_engine.py (1)

419-430: LGTM — domain disabling logic is correct.

The 30% misfire threshold with minimum 3 fires requirement provides reasonable statistical significance before disabling a rule for a domain.
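A minimal sketch of that gate, assuming the fire/misfire counters live in the nested `domain_scores` dict shown elsewhere in this review (the real function takes a `Lesson`; this standalone version takes the dict directly, and the constant names are illustrative):

```python
MISFIRE_THRESHOLD = 0.30  # suppress a rule in a domain above a 30% misfire rate
MIN_FIRES = 3             # require at least 3 fires before judging a domain

def is_rule_disabled_for_domain(domain_scores: dict[str, dict[str, int]],
                                domain: str) -> bool:
    """Return True when a rule has misfired too often in this domain."""
    scores = domain_scores.get(domain.upper())
    if not scores:
        return False  # rule never fired here — no evidence either way
    fires = scores.get("fires", 0)
    if fires < MIN_FIRES:
        return False  # sample too small to disable anything
    return scores.get("misfires", 0) / fires > MISFIRE_THRESHOLD

# 2 misfires out of 5 fires = 40% > 30% — suppressed in DRAFTING
assert is_rule_disabled_for_domain({"DRAFTING": {"fires": 5, "misfires": 2}}, "drafting")
# Only 2 fires: below the minimum, so never disabled regardless of rate
assert not is_rule_disabled_for_domain({"DRAFTING": {"fires": 2, "misfires": 2}}, "drafting")
```

The minimum-fires floor is what keeps a single early misfire (1/1 = 100%) from permanently silencing a rule.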

tests/test_convergence_gate.py (2)

6-29: LGTM!

Good test coverage for the convergence gate behavior. The mock setup correctly simulates a converged category and verifies extraction is skipped.


58-70: LGTM — cache invalidation test is thorough.

The test correctly verifies both cache hit (same session) and cache miss (session change) scenarios by manipulating _convergence_session.

tests/test_rule_scoping.py (3)

5-27: LGTM — field existence and round-trip tests are well-structured.

Good coverage of the new domain_scores field initialization and serialization round-trip.


113-132: LGTM — event emission test is thorough.

Good test verifying the rule_scoped_out event is emitted with correct payload when rules are filtered by domain.


135-151: The test will pass. The actual function signature is _attribute_domain_fires(brain, correction_category: str, correction_desc: str), and the body correctly uses these parameter names. The test properly passes "DRAFTING" and "some correction" which map to these parameters in order, and the domain's fires counter will correctly increment from 5 to 6 as expected.

src/gradata/_core.py (4)

43-67: LGTM — Domain fire attribution logic is sound.

The defensive hasattr check and initialization of missing domain_scores entries is appropriate. The integration with _classify_correction_direction returning "CONTRADICTING" correctly increments misfires only when the correction contradicts the rule.


182-206: LGTM — Convergence gate implementation is well-structured.

The gating logic correctly skips expensive LLM extraction for converged categories, with appropriate fallback to primary.description. The isinstance guard prevents redundant InstructionCache instantiation.


832-880: LGTM — Mann-Kendall implementation is statistically correct.

The tie correction, continuity correction, and normal approximation follow the standard Mann-Kendall algorithm. Edge cases (n < 3, var_s == 0) are handled appropriately.


883-892: LGTM — Standard normal CDF approximation is correct.

The Abramowitz & Stegun polynomial approximation provides sufficient precision for the Mann-Kendall p-value calculations.
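For readers unfamiliar with the test, here is a self-contained Mann-Kendall sketch — an independent illustration, not the reviewed code (it uses `math.erf` for the normal CDF rather than the Abramowitz & Stegun polynomial the implementation uses, and the label strings mirror those in this review):

```python
import math
from collections import Counter
from itertools import combinations

def mann_kendall(xs: list[float], alpha: float = 0.05) -> tuple[str, float]:
    """Classify a series as increasing / decreasing / no_trend (two-sided test)."""
    n = len(xs)
    if n < 3:
        return "no_trend", 1.0
    # S statistic: net count of rising minus falling pairs
    s = sum((a < b) - (a > b) for a, b in combinations(xs, 2))
    # Variance of S with the standard tie correction
    tie_sizes = Counter(xs).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18
    if var_s == 0:
        return "no_trend", 1.0  # every value tied — no ordering information
    # Continuity-corrected normal approximation
    z = 0.0 if s == 0 else (s - 1 if s > 0 else s + 1) / math.sqrt(var_s)
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    p = 2 * (1 - phi)
    if p < alpha:
        return ("increasing" if s > 0 else "decreasing"), p
    return "no_trend", p

trend, p = mann_kendall([10, 9, 8, 7, 6, 5, 4, 3])
print(trend, p < 0.05)  # decreasing True
```

A monotone 8-point run is enough to cross the 5% significance threshold, which is why a steadily falling correction count registers as "converging".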

Comment thread src/gradata/_core.py
Comment on lines +314 to +324
        # Domain-scoped misfire attribution
        try:
            if brain._fired_rules and (category or classifications):
                correction_desc = ""
                if 'desc' in locals():
                    correction_desc = desc
                elif summary:
                    correction_desc = summary
                _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc)
        except Exception as e:
            _log.debug("Domain fire attribution failed: %s", e)

🛠️ Refactor suggestion | 🟠 Major

Avoid if 'desc' in locals() — initialize desc explicitly.

Using locals() lookup to check variable existence is fragile and obscure. The variable desc is conditionally assigned across multiple branches earlier in this function (lines 189, 202, 205, 207, 209), making this check difficult to verify.

♻️ Proposed fix: Initialize `desc` before the conditional blocks

Add initialization near line 178 (before the classification branches):

cat = (category or "UNKNOWN").upper()
desc = ""  # Will be set by classification or summary
if classifications:
    ...

Then simplify the attribution block:

     # Domain-scoped misfire attribution
     try:
         if brain._fired_rules and (category or classifications):
-            correction_desc = ""
-            if 'desc' in locals():
-                correction_desc = desc
-            elif summary:
-                correction_desc = summary
+            correction_desc = desc or summary
             _attribute_domain_fires(brain, category or "UNKNOWN", correction_desc)
     except Exception as e:
         _log.debug("Domain fire attribution failed: %s", e)

Comment thread src/gradata/_core.py
Comment on lines +962 to +972
    by_category: dict[str, dict] = {}
    for cat, session_counts in cat_by_session.items():
        cat_counts = [session_counts.get(s, 0) for s in sessions]
        cat_mk, cat_p = _mann_kendall(cat_counts)
        cat_trend = "converging" if cat_mk == "decreasing" else (
            "diverging" if cat_mk == "increasing" else "converged")
        by_category[cat] = {
            "corrections_per_session": cat_counts,
            "trend": cat_trend,
            "p_value": cat_p,
        }

⚠️ Potential issue | 🟡 Minor

Per-category trend logic lacks insufficient_data handling.

The overall trend logic (lines 941-948) maps no_trend to converged only when len(counts) >= 3, otherwise to insufficient_data. However, the per-category logic unconditionally maps no_trend to converged, which could prematurely gate extraction for categories with sparse data.

🐛 Proposed fix: Apply consistent length check
     for cat, session_counts in cat_by_session.items():
         cat_counts = [session_counts.get(s, 0) for s in sessions]
         cat_mk, cat_p = _mann_kendall(cat_counts)
-        cat_trend = "converging" if cat_mk == "decreasing" else (
-            "diverging" if cat_mk == "increasing" else "converged")
+        if cat_mk == "decreasing":
+            cat_trend = "converging"
+        elif cat_mk == "increasing":
+            cat_trend = "diverging"
+        elif sum(1 for c in cat_counts if c > 0) >= 3:
+            cat_trend = "converged"
+        else:
+            cat_trend = "insufficient_data"
         by_category[cat] = {
             "corrections_per_session": cat_counts,
             "trend": cat_trend,
             "p_value": cat_p,
         }

Comment thread src/gradata/_core.py
Comment on lines +1030 to +1038
    if estimate_time:
        corrections_avoided = max(0, (initial - recent) * len(counts))
        avg_severity_weight = _SEVERITY_SECONDS.get("moderate", 45)
        estimated_seconds = int(corrections_avoided * avg_severity_weight)
        result["estimated_seconds_saved"] = estimated_seconds
        result["time_breakdown"] = {
            "corrections_avoided": round(corrections_avoided, 1),
            "avg_seconds_per_correction": avg_severity_weight,
        }

🧹 Nitpick | 🔵 Trivial

Time estimation formula may significantly overestimate savings.

The calculation corrections_avoided = (initial - recent) * len(counts) assumes every session would have had the initial correction rate without learning, which overestimates savings. For 10 sessions with initial=10 and recent=5, this yields 50 avoided corrections, but actual savings would be lower since sessions 1-3 contributed to the "initial" baseline.

Consider a more conservative estimate, e.g., using sessions after the initial window:

♻️ More accurate estimation
     if estimate_time:
-        corrections_avoided = max(0, (initial - recent) * len(counts))
+        # Estimate based on sessions after the initial baseline window
+        sessions_with_learning = max(0, len(counts) - 3)
+        corrections_avoided = max(0, (initial - recent) * sessions_with_learning)
         avg_severity_weight = _SEVERITY_SECONDS.get("moderate", 45)
         estimated_seconds = int(corrections_avoided * avg_severity_weight)

Comment thread src/gradata/brain.py
open_encrypted_db(self.dir, self._encryption_key)

self._instruction_cache: object | None = None # lazy: InstructionCache
self._fired_rules: list = [] # Rules injected this session (for misfire attribution)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for any code that appends to or assigns _fired_rules
rg -n '_fired_rules' --type=py -C3

Repository: Gradata/gradata

Length of output: 1927


_fired_rules is never populated — domain attribution will silently fail.

The _fired_rules list is initialized on line 70 but never populated in the application code. The grep search confirms it only appears in:

  • src/gradata/brain.py:70 — initialization as empty list
  • src/gradata/_core.py:56 — reading it in _attribute_domain_fires()
  • tests/test_rule_scoping.py:147 — manually set in test code only

Since nothing appends to this list during normal execution, the loop at src/gradata/_core.py:56 never executes and domain fire/misfire counts remain permanently at their initial values. The rule injection flow must populate _fired_rules when rules are applied, likely in apply_brain_rules() or its callchain.


Comment on lines +305 to +310
            elif meta_line.startswith("Domain scores:"):
                import json as _json
                try:
                    domain_scores = _json.loads(meta_line[len("Domain scores:"):].strip())
                except _json.JSONDecodeError:
                    domain_scores = {}

🧹 Nitpick | 🔵 Trivial

Move json import to module level.

The json module is imported inside the parsing loop. While functional, this creates overhead on each lesson with domain scores. Consider moving the import to the module level alongside the existing imports.

♻️ Proposed refactor

Add at the top of the file with other imports:

import json

Then simplify the parsing:

             elif meta_line.startswith("Domain scores:"):
-                import json as _json
                 try:
-                    domain_scores = _json.loads(meta_line[len("Domain scores:"):].strip())
-                except _json.JSONDecodeError:
+                    domain_scores = json.loads(meta_line[len("Domain scores:"):].strip())
+                except json.JSONDecodeError:
                     domain_scores = {}

Comment on lines +901 to +903
        if lesson.domain_scores:
            import json as _json
            lines.append(f"  Domain scores: {_json.dumps(lesson.domain_scores)}")

🧹 Nitpick | 🔵 Trivial

Same import inefficiency in formatting.

Similar to the parsing code, json is imported inside the loop. If moved to module level as suggested above, update this block to use the module-level import.

♻️ Proposed refactor
         if lesson.domain_scores:
-            import json as _json
-            lines.append(f"  Domain scores: {_json.dumps(lesson.domain_scores)}")
+            lines.append(f"  Domain scores: {json.dumps(lesson.domain_scores)}")

Comment on lines +489 to +505
    # Step 1.5 — domain scoping: filter out rules disabled for current domain
    current_domain = scope.domain.upper() if scope.domain else ""
    if current_domain:
        filtered = []
        for lesson in eligible:
            if is_rule_disabled_for_domain(lesson, current_domain):
                if bus:
                    scores = lesson.domain_scores.get(current_domain, {})
                    bus.emit("rule_scoped_out", {
                        "lesson_category": lesson.category,
                        "lesson_description": lesson.description[:80],
                        "domain": current_domain,
                        "misfire_rate": scores.get("misfires", 0) / max(1, scores.get("fires", 1)),
                    })
            else:
                filtered.append(lesson)
        eligible = filtered

⚠️ Potential issue | 🟠 Major

rule_scoped_out events won't fire in production — bus is not passed.

The domain scoping filter and event emission logic is correct, but according to the context snippet from src/gradata/brain.py:383-406, the production call site apply_brain_rules() calls apply_rules(lessons, build_scope(ctx)) without passing the bus parameter. This means rule_scoped_out events will never be emitted in normal usage.

Consider updating apply_brain_rules() to pass self.bus if observability of rule scoping is desired.

🔧 Suggested fix in brain.py

In src/gradata/brain.py around line 406:

-        return format_rules_for_prompt(apply_rules(lessons, build_scope(ctx)))
+        return format_rules_for_prompt(apply_rules(lessons, build_scope(ctx), bus=self.bus))

Comment thread tests/test_convergence.py
Comment on lines +110 to +132
def test_convergence_per_category():
    """Should return per-category breakdown."""
    d = tempfile.mkdtemp()
    (Path(d) / "lessons.md").write_text("", encoding="utf-8")
    brain = Brain(d)
    # Session 1: 3 TONE, 2 CODE corrections
    for _ in range(3):
        brain.emit("CORRECTION", "test", {"category": "TONE", "severity": "minor"}, ["category:TONE"], 1)
    for _ in range(2):
        brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 1)
    # Session 2: 1 TONE, 2 CODE
    brain.emit("CORRECTION", "test", {"category": "TONE", "severity": "minor"}, ["category:TONE"], 2)
    for _ in range(2):
        brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 2)
    # Session 3: 0 TONE, 1 CODE
    brain.emit("CORRECTION", "test", {"category": "CODE", "severity": "minor"}, ["category:CODE"], 3)

    result = brain.convergence()
    assert "by_category" in result
    assert "TONE" in result["by_category"]
    assert "CODE" in result["by_category"]
    # TONE went 3→1→0 = converging
    assert result["by_category"]["TONE"]["corrections_per_session"] == [3, 1, 0]

⚠️ Potential issue | 🟡 Minor

Test is incomplete — missing assertion for CODE category.

The test sets up both TONE and CODE corrections across sessions but only asserts on TONE's corrections_per_session. Consider adding assertions for CODE to ensure both categories are tracked correctly.

💚 Proposed fix to complete the test
     # TONE went 3→1→0 = converging
     assert result["by_category"]["TONE"]["corrections_per_session"] == [3, 1, 0]
+    # CODE went 2→2→1
+    assert result["by_category"]["CODE"]["corrections_per_session"] == [2, 2, 1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_convergence.py` around lines 110-132: test_convergence_per_category creates CODE corrections across three sessions but never asserts their per-session counts. Update the test to assert that result contains "by_category" and "CODE" (mirroring the TONE checks) and that result["by_category"]["CODE"]["corrections_per_session"] equals the expected list [2, 2, 1] (session 1: 2, session 2: 2, session 3: 1), so the CODE category is validated alongside TONE in Brain.convergence().
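The per-category series asserted in this test are exactly what the PR's Mann-Kendall convergence check runs on. As a point of reference, here is a textbook sketch of the uncorrected Mann-Kendall statistic and two-sided p-value (an illustration of the technique, not the project's implementation, which per the PR also tracks results per category):

```python
from itertools import combinations
from math import erf, sqrt


def mann_kendall(series):
    """Classic Mann-Kendall trend test, without tie correction.

    Returns (S, Z, p): S sums the signs of all pairwise differences,
    Z is the normal approximation, p is the two-sided p-value.
    """
    n = len(series)
    s = sum((b > a) - (b < a) for a, b in combinations(series, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided p-value
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return s, z, p


# TONE went 3 -> 1 -> 0 in the test above: every pair decreases, so S = -3.
s, z, p = mann_kendall([3, 1, 0])
print(s)  # -3
```

A strongly negative S with a small p-value is what "converging" (a falling correction rate) would look like; a flat series yields S near 0 and a large p-value, which is presumably what the convergence gate keys on.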

Comment thread tests/test_efficiency.py
Comment on lines +27 to +35
def test_efficiency_ratio_declining(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([10, 12, 8, 5, 4, 3])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert "effort_ratio" in result
    assert result["effort_ratio"] < 1.0
    assert result["corrections_initial"] == 10.0  # avg(10,12,8)
    assert result["corrections_recent"] == 4.0    # avg(5,4,3)

⚠️ Potential issue | 🟡 Minor

Use pytest.approx for floating-point comparisons.

Lines 34-35 compare floats using ==, which can be fragile. While these specific values (10.0, 4.0) are exact, using pytest.approx is the recommended practice for float comparisons per coding guidelines.

💚 Proposed fix
+import pytest
+
 def test_efficiency_ratio_declining(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([10, 12, 8, 5, 4, 3])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
     assert "effort_ratio" in result
     assert result["effort_ratio"] < 1.0
-    assert result["corrections_initial"] == 10.0  # avg(10,12,8)
-    assert result["corrections_recent"] == 4.0    # avg(5,4,3)
+    assert result["corrections_initial"] == pytest.approx(10.0)  # avg(10,12,8)
+    assert result["corrections_recent"] == pytest.approx(4.0)    # avg(5,4,3)

As per coding guidelines: "floating point comparisons use pytest.approx"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_efficiency.py` around lines 27-35: in test_efficiency_ratio_declining, replace the direct float equality checks for corrections_initial and corrections_recent with pytest.approx comparisons. Update the two asserts that compare result["corrections_initial"] and result["corrections_recent"] (from the test that constructs brain via _make_brain and uses _mock_convergence and brain.efficiency()) to use pytest.approx(...) so they tolerate floating-point rounding; ensure pytest is imported in the test module if it is not already.

Comment thread tests/test_efficiency.py
Comment on lines +38 to +53
def test_efficiency_ratio_no_improvement(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([5, 5, 5, 5, 5, 5])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert result["effort_ratio"] == 1.0


def test_efficiency_insufficient_data(tmp_path):
    brain = _make_brain(tmp_path)
    conv = _mock_convergence([5, 3])
    with patch.object(brain, "_get_convergence", return_value=conv):
        result = brain.efficiency()
    assert result["effort_ratio"] == 1.0
    assert result["corrections_initial"] == 0
    assert result["corrections_recent"] == 0

⚠️ Potential issue | 🟡 Minor

Same floating-point comparison issue.

Lines 43 and 51 also use == for float comparisons.

💚 Proposed fix
 def test_efficiency_ratio_no_improvement(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([5, 5, 5, 5, 5, 5])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
-    assert result["effort_ratio"] == 1.0
+    assert result["effort_ratio"] == pytest.approx(1.0)


 def test_efficiency_insufficient_data(tmp_path):
     brain = _make_brain(tmp_path)
     conv = _mock_convergence([5, 3])
     with patch.object(brain, "_get_convergence", return_value=conv):
         result = brain.efficiency()
-    assert result["effort_ratio"] == 1.0
+    assert result["effort_ratio"] == pytest.approx(1.0)

As per coding guidelines: "floating point comparisons use pytest.approx"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_efficiency.py` around lines 38-53: the assertions in test_efficiency_ratio_no_improvement and test_efficiency_insufficient_data use direct float equality for result["effort_ratio"]. Change these to pytest.approx for robust floating-point comparison (e.g., assert result["effort_ratio"] == pytest.approx(1.0)), and ensure the test file imports pytest if it does not already; keep the integer assertions for corrections_initial and corrections_recent unchanged.
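Taken together, the three efficiency tests pin down the observable contract of brain.efficiency(): initial effort is the mean of the first three sessions, recent effort is the mean of the last three, and fewer than six sessions is treated as insufficient data with a neutral ratio. A sketch consistent with those assertions (inferred from the tests, not the actual implementation):

```python
def efficiency(corrections_per_session):
    """Effort-ratio sketch inferred from the tests in this thread.

    Fewer than 6 sessions -> neutral ratio of 1.0 with zeroed averages;
    otherwise compare the mean of the first three sessions against the
    mean of the last three.
    """
    if len(corrections_per_session) < 6:
        return {"effort_ratio": 1.0, "corrections_initial": 0, "corrections_recent": 0}
    initial = sum(corrections_per_session[:3]) / 3
    recent = sum(corrections_per_session[-3:]) / 3
    ratio = recent / initial if initial else 1.0
    return {
        "effort_ratio": ratio,
        "corrections_initial": initial,
        "corrections_recent": recent,
    }


# Declining corrections: avg(10,12,8)=10.0 vs avg(5,4,3)=4.0 -> ratio 0.4
print(efficiency([10, 12, 8, 5, 4, 3]))
# Flat corrections and insufficient data both yield a neutral ratio.
print(efficiency([5, 5, 5, 5, 5, 5])["effort_ratio"])  # 1.0
print(efficiency([5, 3])["effort_ratio"])              # 1.0
```

These averaged values are also why the review asks for pytest.approx: the means happen to be exact here, but the division in the general case is not.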
