feat(claude-plugin): record guideline usage per session in audit.log #239
visahak merged 10 commits into AgentToolkit:main
Conversation
Adds the reverse provenance direction: which sessions used a given guideline. Complements the existing source-trajectory stamp on each entity.

- `recall/retrieve_entities.py` now appends one `recall` event per UserPromptSubmit to `.evolve/audit.log`, listing the served entity slugs and the session_id derived from transcript_path. Failures are swallowed so logging cannot break the user-visible recall path.
- `learn/SKILL.md` gains a Step 4 that reads audit.log, reconstructs the set of guidelines served to this session, and emits per-entity verdicts (followed | contradicted | not_applicable) with a short evidence line.
- A new `log_influence.py` script validates and writes those verdicts back into audit.log.
- The e2e test asserts both event types land correctly after session 2.
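The recall-side logging described above can be sketched as follows. This is a minimal illustration, not the real `retrieve_entities.py`: the helper names, the audit path, and the session-id derivation (transcript filename stem) are assumptions based on the examples later in this thread.

```python
# Hedged sketch: append one "recall" JSONL event per UserPromptSubmit.
# Helper names and the session-id derivation are illustrative assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path


def session_id_from_transcript(transcript_path: str) -> str:
    # Assumption: the transcript filename stem is the session id,
    # e.g. ".../e8df99dd-....jsonl" -> "e8df99dd-...".
    return Path(transcript_path).stem


def log_recall(project_root: Path, transcript_path: str, entity_slugs: list[str]) -> None:
    """Append a recall event; swallow failures so recall itself never breaks."""
    try:
        audit_path = project_root / ".evolve" / "audit.log"
        audit_path.parent.mkdir(parents=True, exist_ok=True)
        event = {
            "event": "recall",
            "session_id": session_id_from_transcript(transcript_path),
            "entities": entity_slugs,
            "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        with audit_path.open("a") as f:
            f.write(json.dumps(event) + "\n")
    except Exception:
        pass  # logging must never break the user-visible recall path
```

The broad `except` mirrors the "failures are swallowed" contract stated in the description.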
📝 Walkthrough

Adds recall-to-influence provenance: recall events now include entity ids.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Transcript as Input Transcript
    participant Recall as Recall Service
    participant Audit as Audit Log
    participant Assess as log_influence.py
    participant Test as Test Runner
    Transcript->>Recall: request recall (transcript_path)
    Recall->>Recall: load entity files, set `_id` from path
    Recall->>Audit: append "recall" event (project_root, session_id, entities)
    Assess->>Audit: read `.evolve/audit.log` for session's recall events
    Assess->>Assess: for each recalled entity load guideline, determine verdict & evidence
    Assess->>Audit: append "influence" event (project_root, session_id, entity, verdict, evidence)
    Test->>Audit: read audit log entries
    Test->>Test: assert presence/structure of recall & influence events
```
Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e/test_sandbox_learn_recall.py (1)
Lines 168-213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add unit tests for the new influence logging paths.

This adds solid e2e coverage, but the new feature still needs unit tests for `log_influence.py` validation branches (bad payload shape, non-dict assessments, invalid verdict skipping) so failures are localized and deterministic.

As per coding guidelines for `tests/**/*.py`: "All new features need tests (unit + e2e where applicable)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/test_sandbox_learn_recall.py` around lines 168 - 213, Add unit tests for log_influence.py that target its validation branches: create tests that call the public entry points (e.g., the main handler function in log_influence.py such as process_influence or validate_influence_payload) with (1) a malformed payload shape (missing keys / wrong types) to assert it raises or returns the expected validation error, (2) payloads where "assessments" is not a dict or contains non-dict entries to assert those entries are skipped/flagged, and (3) assessments containing invalid "verdict" values to assert those assessments are ignored and do not produce influence events; assert the function logs or returns the correct diagnostics and does not produce side-effect influence records for invalid inputs. Ensure tests import log_influence.py functions directly and stub/mock any I/O so failures are deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py`:
- Around line 39-55: The handler currently assumes payload and its items are
dicts (using payload.get(...) and a.get(...)), which raises AttributeError for
non-dict JSON; first validate that payload is a mapping (e.g.,
isinstance(payload, dict)) before accessing payload.get("session_id") and
payload.get("assessments"), and ensure assessments is a list of dicts by
filtering or checking each item in the for loop (in the loop over assessments,
replace direct a.get(...) with a type guard: if not isinstance(a, dict): log and
continue), then extract entity/verdict/evidence from dicts only and skip/log
invalid entries; update the checks around session_id, assessments, and the loop
over a in assessments (refer to session_id, assessments, and the loop variable
a) to avoid AttributeError and to produce clear logs when items are skipped.
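The type-guarding this prompt asks for could look like the following sketch. The function name and return shape are assumptions for illustration, not the actual `log_influence.py` API; the allowed verdict set comes from the PR description.

```python
# Hedged sketch of the guarded validation described above; names are
# assumptions, not the real handler in log_influence.py.
def collect_valid_assessments(payload):
    """Return (valid, skipped) without ever raising AttributeError."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    session_id = payload.get("session_id")
    if not isinstance(session_id, str) or not session_id:
        raise ValueError("session_id is required")
    assessments = payload.get("assessments")
    if not isinstance(assessments, list):
        raise ValueError("assessments must be a list")
    allowed = {"followed", "contradicted", "not_applicable"}
    valid, skipped = [], []
    for a in assessments:
        if not isinstance(a, dict):
            skipped.append(a)  # non-dict item: record and continue
            continue
        entity, verdict = a.get("entity"), a.get("verdict")
        if not entity or verdict not in allowed:
            skipped.append(a)  # missing entity or invalid verdict
            continue
        valid.append({"entity": entity, "verdict": verdict,
                      "evidence": a.get("evidence", "")})
    return valid, skipped
```

Only items in `valid` would go on to produce influence events; `skipped` gives the caller something concrete to log.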
In `@platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md`:
- Around line 125-127: The current flow reads recalled slugs from
`.evolve/audit.log` (event == "recall", session_id) and then only opens
`.evolve/entities/guideline/<slug>.md`, which misses slugs served from
subscribed trees; update the resolution step so for each slug you attempt to
resolve it across subscribed trees rather than only
`.evolve/entities/guideline/`. Implement or call a resolver (e.g.,
resolveEntityPath(slug, subscribedTrees) or similar) that searches
`.evolve/entities/<tree>/.../<slug>.md` (falling back to
`.evolve/entities/guideline/<slug>.md` if needed), then read the resolved file
content + trigger for influence assessment. Ensure the union of `entities` from
audit.log is used with this multi-tree resolution.
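A sketch of the multi-tree resolution this comment suggests, rendered in Python under the assumed layout `.evolve/entities/<tree>/.../<slug>.md` (`resolveEntityPath` is the reviewer's hypothetical name; a later commit in this PR supersedes the search with qualified ids):

```python
# Hedged sketch: resolve a slug across all entity trees, preferring the
# local guideline tree. Layout and fallback order are assumptions.
from pathlib import Path
from typing import Optional


def resolve_entity_path(slug: str, entities_root: Path) -> Optional[Path]:
    """Find <slug>.md anywhere under .evolve/entities/, local tree first."""
    local = entities_root / "guideline" / f"{slug}.md"
    if local.exists():
        return local
    # Fall back to a recursive search across subscribed trees.
    matches = sorted(entities_root.rglob(f"{slug}.md"))
    return matches[0] if matches else None
```

Note the ambiguity this leaves when two trees share a slug, which is exactly what the qualified-id commit later in this thread resolves.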
---
Outside diff comments:
In `@tests/e2e/test_sandbox_learn_recall.py`:
- Around line 168-213: Add unit tests for log_influence.py that target its
validation branches: create tests that call the public entry points (e.g., the
main handler function in log_influence.py such as process_influence or
validate_influence_payload) with (1) a malformed payload shape (missing keys /
wrong types) to assert it raises or returns the expected validation error, (2)
payloads where "assessments" is not a dict or contains non-dict entries to
assert those entries are skipped/flagged, and (3) assessments containing invalid
"verdict" values to assert those assessments are ignored and do not produce
influence events; assert the function logs or returns the correct diagnostics
and does not produce side-effect influence records for invalid inputs. Ensure
tests import log_influence.py functions directly and stub/mock any I/O so
failures are deterministic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f1f237e-addc-4971-96a9-9efbbf92934f
📒 Files selected for processing (4)
platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.mdplatform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.pyplatform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.pytests/e2e/test_sandbox_learn_recall.py
Fixes failing CI check: check-formatting
Validate that the JSON payload is a mapping before calling payload.get(), and skip non-dict assessment items instead of letting a.get() raise AttributeError. Logs each skipped item for traceability. Addresses CodeRabbit review finding: Guard payload and assessment item types before calling `.get()`
Step 4 previously hard-coded .evolve/entities/guideline/<slug>.md, which misses entities served from subscribed repositories at .evolve/entities/subscribed/<source>/guideline/<slug>.md. Switch the instructions to a recursive search under .evolve/entities/ so the influence assessment can reach the actual file wherever it lives. Addresses CodeRabbit review finding: Resolve recalled slugs beyond only `.evolve/entities/guideline/`
Covers happy path (single/multiple assessments, default evidence), per-item skip paths (invalid verdict, missing entity, non-dict item), and top-level input validation (non-dict payload, missing session_id, non-list assessments, invalid JSON). Complements the e2e sandbox test with fast, deterministic coverage of the helper's input-validation surface. Addresses CodeRabbit review finding: Add unit tests for log_influence.py
Summary

This PR adds a provenance loop to the Claude plugin.

Findings
Testing
Follow-up Issue

We are opening a separate issue to track the currently failing full-suite e2e/platform tests, since they do not clearly line up with this PR’s write set. Failed tests from the full run:
Additional full-run errors to track in that issue:
Notes for that issue:
@vinodmut Not sure if the findings are a problem.
…trees

The recall audit log stored bare filename stems, so the same slug from a local entity and a subscribed repo collapsed into one entry and the influence step couldn't tell which guideline actually fired.

Switch the stored id to the path relative to `.evolve/entities/` (without `.md`): "guideline/foo" for local entries, "subscribed/alice/guideline/foo" for subscribed ones. The id is unambiguous, names exactly one file, and lets learn open it directly (no recursive search). SKILL.md Step 4 is simplified accordingly — no more find / multi-tree resolver; just Read `.evolve/entities/<id>.md`. The e2e test matches session 1's guidelines against the new qualified ids, and the existing log_influence unit tests pass unchanged.

Addresses review feedback from visahak
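The qualified-id derivation this commit describes reduces to one `pathlib` expression; a hedged sketch (the function name is assumed, not taken from the PR):

```python
# Hedged sketch: derive the qualified entity id as the path relative to
# .evolve/entities/, with the .md suffix stripped.
from pathlib import Path


def entity_id(entity_file: Path, entities_root: Path) -> str:
    """'<root>/guideline/foo.md' -> 'guideline/foo';
    '<root>/subscribed/alice/guideline/foo.md' -> 'subscribed/alice/guideline/foo'."""
    return entity_file.relative_to(entities_root).with_suffix("").as_posix()
```

Because the id is a relative path, the learn step can reopen the file with a single `Read .evolve/entities/<id>.md`, as the commit message notes.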
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/platform_integrations/test_log_influence.py (1)
Lines 42-199: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use `temp_project_dir` instead of `tmp_path` in these platform integration tests.

From line 42 onward, these tests run a subprocess and do file I/O under `tmp_path`. For `tests/platform_integrations/test_*.py`, these operations should use `temp_project_dir` to keep integration-test isolation consistent with suite conventions.

Suggested patch pattern

```diff
-class TestLogInfluence:
-    def test_writes_single_assessment(self, tmp_path):
-        evolve_dir = tmp_path / ".evolve"
+class TestLogInfluence:
+    def test_writes_single_assessment(self, temp_project_dir):
+        evolve_dir = temp_project_dir / ".evolve"
         result = run_log_influence(
-            tmp_path,
+            temp_project_dir,
             {
                 "session_id": "abc-123",
                 "assessments": [
                     {"entity": "slug-a", "verdict": "followed", "evidence": "because"},
                 ],
             },
             evolve_dir=evolve_dir,
         )
```

Apply the same replacement across the remaining test methods.

As per coding guidelines, “All file operations in platform integration tests must use the `temp_project_dir` fixture - never touch real files”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/platform_integrations/test_log_influence.py` around lines 42 - 199, Replace the use of the tmp_path fixture with the temp_project_dir fixture in these platform integration tests: change each test method signature (e.g., test_writes_single_assessment(self, tmp_path) -> test_writes_single_assessment(self, temp_project_dir)) and update all usages of tmp_path inside the test (e.g., tmp_path / ".evolve" -> temp_project_dir / ".evolve" and run_log_influence(tmp_path, ...) -> run_log_influence(temp_project_dir, ...)); apply the same replacement for every test function in this file (test_writes_multiple_assessments, test_skips_assessments_with_invalid_verdict, test_skips_assessments_missing_entity, test_skips_non_dict_assessment_items, test_empty_assessments_list_is_ok, test_evidence_defaults_to_empty_string, test_rejects_non_dict_payload, test_rejects_missing_session_id, test_rejects_non_list_assessments, test_rejects_invalid_json) to ensure all subprocess and file I/O use temp_project_dir.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/platform_integrations/test_log_influence.py`:
- Around line 42-199: Replace the use of the tmp_path fixture with the
temp_project_dir fixture in these platform integration tests: change each test
method signature (e.g., test_writes_single_assessment(self, tmp_path) ->
test_writes_single_assessment(self, temp_project_dir)) and update all usages of
tmp_path inside the test (e.g., tmp_path / ".evolve" -> temp_project_dir /
".evolve" and run_log_influence(tmp_path, ...) ->
run_log_influence(temp_project_dir, ...)); apply the same replacement for every
test function in this file (test_writes_multiple_assessments,
test_skips_assessments_with_invalid_verdict,
test_skips_assessments_missing_entity, test_skips_non_dict_assessment_items,
test_empty_assessments_list_is_ok, test_evidence_defaults_to_empty_string,
test_rejects_non_dict_payload, test_rejects_missing_session_id,
test_rejects_non_list_assessments, test_rejects_invalid_json) to ensure all
subprocess and file I/O use temp_project_dir.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: adf197b4-c721-4d19-aef0-faf4c1751a51
📒 Files selected for processing (5)
platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.mdplatform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.pyplatform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.pytests/e2e/test_sandbox_learn_recall.pytests/platform_integrations/test_log_influence.py
🚧 Files skipped from review as they are similar to previous changes (3)
- platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
- platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
- platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py
@visahak thanks — your first finding reproduced exactly, and is fixed in the commit above.

Change: the recall audit now records each entity as a path relative to `.evolve/entities/` (without the `.md` suffix).

Where:
Repro on this commit — same setup you described, local and subscribed copies of the same slug:

```json
{"event":"recall","session_id":"fake-session-abc123",
 "entities":["guideline/shared-slug","subscribed/alice/guideline/shared-slug"],
 "ts":"2026-05-01T15:32:08Z"}
```

Before the fix, both would have collapsed to `shared-slug`.

Example from the sandbox e2e (single-source case, just to show round-trip on a real run):

```json
{
  "event": "recall",
  "session_id": "e8df99dd-fecb-4a75-840d-06c290bcd3dd",
  "entities": ["guideline/in-this-environment-extract-jpeg-exif-metadata-including"],
  "ts": "2026-05-01T15:06:05Z"
}
{
  "event": "influence",
  "session_id": "e8df99dd-fecb-4a75-840d-06c290bcd3dd",
  "entity": "guideline/in-this-environment-extract-jpeg-exif-metadata-including",
  "verdict": "followed",
  "evidence": "Agent went directly to raw Python struct-based JPEG/EXIF/TIFF parsing without attempting exiftool, ImageMagick, or PIL/Pillow",
  "ts": "2026-05-01T15:07:14Z"
}
```

On the follow-up full-suite issue — I can confirm locally that …
Merging public/main brought in AgentToolkit#243's SKILL.md restructure, which inserted a "Review Existing Guidelines" step and shifted Save Entities to Step 4 — colliding with the "Step 4: Assess Influence" section this branch added. Rename the influence section to Step 5 and update its sub-steps to reference Step 4 (save) and derive session_id from the saved_trajectory_path variable (the post-AgentToolkit#243 name) instead of the removed transcript_path.
Align with the rest of tests/platform_integrations/ which use the temp_project_dir fixture (a thin wrapper over tmp_path that creates an isolated test_project/ subdir). All 11 tests still pass. Addresses CodeRabbit review finding: Replace the use of the tmp_path fixture with the temp_project_dir fixture
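As described, `temp_project_dir` is a thin wrapper over `tmp_path` that creates an isolated `test_project/` subdirectory. A plain-function sketch follows; in the real suite this would presumably be a `@pytest.fixture` in `conftest.py` (an assumption — the fixture's actual definition is not shown in this thread):

```python
# Hedged sketch of the temp_project_dir fixture body, written as a plain
# function; wrap with @pytest.fixture in conftest.py in the real suite.
from pathlib import Path


def make_temp_project_dir(tmp_path: Path) -> Path:
    """Create the isolated test_project/ subdir inside pytest's tmp_path."""
    project = tmp_path / "test_project"
    project.mkdir(parents=True, exist_ok=True)
    return project
```

The extra subdirectory keeps each test's file I/O away from anything pytest itself places directly in `tmp_path`.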
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/platform_integrations/test_log_influence.py (1)
Lines 194-199: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the missing “no audit side effect” assertion in the invalid-JSON case.

This reject-path test checks exit code/stderr, but unlike the other reject tests it does not verify that `audit.log` remains empty.

✅ Suggested patch

```diff
 def test_rejects_invalid_json(self, temp_project_dir):
     evolve_dir = temp_project_dir / ".evolve"
     result = run_log_influence(temp_project_dir, None, raw_input="{not valid json", evolve_dir=evolve_dir)
     assert result.returncode == 1
     assert "json" in result.stderr.lower()
+    assert read_audit(evolve_dir) == []
```
Verify each finding against the current code and only fix it if needed. In `@tests/platform_integrations/test_log_influence.py` around lines 194 - 199, The test_rejects_invalid_json case is missing the "no audit side effect" assertion; after calling run_log_influence (in test_rejects_invalid_json) add an assertion that the audit log in evolve_dir (e.g., check evolve_dir / "audit.log") is either non-existent or empty to match other reject-path tests and ensure no auditing occurred when raw_input is invalid JSON.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/platform_integrations/test_log_influence.py`:
- Around line 194-199: The test_rejects_invalid_json case is missing the "no
audit side effect" assertion; after calling run_log_influence (in
test_rejects_invalid_json) add an assertion that the audit log in evolve_dir
(e.g., check evolve_dir / "audit.log") is either non-existent or empty to match
other reject-path tests and ensure no auditing occurred when raw_input is
invalid JSON.
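The `read_audit` helper these tests rely on might look like the following; its real implementation is not shown in this thread, so the signature and behavior here are assumptions:

```python
# Hedged sketch of the read_audit test helper: parse JSONL audit events,
# returning [] when the log does not exist yet.
import json
from pathlib import Path


def read_audit(evolve_dir: Path) -> list[dict]:
    """Return parsed audit events from <evolve_dir>/audit.log."""
    audit = evolve_dir / "audit.log"
    if not audit.exists():
        return []
    return [json.loads(line) for line in audit.read_text().splitlines() if line.strip()]
```

Returning `[]` for a missing file is what makes `assert read_audit(evolve_dir) == []` a valid "no side effect" check even when the script never created the log.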
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c4edbc29-5e2b-44df-ac32-47bae691e334
📒 Files selected for processing (2)
platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.mdtests/platform_integrations/test_log_influence.py
🚧 Files skipped from review as they are similar to previous changes (1)
- platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
Mirrors the "no audit side effect" assertion present in the other reject-path tests so we catch any regression where invalid JSON would sneak a partial write into audit.log. Addresses CodeRabbit review finding: The test_rejects_invalid_json case is missing the "no audit side effect" assertion
@visahak ready for another look when you have a minute. Since your last review:
All 16 CI checks are green.
Summary
Adds the reverse provenance direction to complement #236: now we know not just which session produced a guideline, but also which sessions used it, and whether it actually influenced the agent's behavior.
- `recall/retrieve_entities.py` — every UserPromptSubmit appends one JSONL record to `.evolve/audit.log` with the served entity slugs and the session_id derived from `transcript_path`. Audit failures are swallowed so logging cannot break the user-visible recall path.
- `learn/SKILL.md` — adds Step 4 (Assess Influence). The forked learn skill reads `.evolve/audit.log`, reconstructs the set of guidelines served to this session, and for each one emits a verdict (followed | contradicted | not_applicable) with a short evidence sentence grounded in the transcript.
- `learn/scripts/log_influence.py` — new stdin → audit bridge. Validates the verdict against the allowed set and writes it back via the existing `audit.append()` helper (`lib/audit.py`).
- `tests/e2e/test_sandbox_learn_recall.py` — the existing end-to-end test is extended with assertions that after session 2, the audit log contains both a `recall` event (with at least one slug from session 1) and an `influence` event (with a valid verdict).

No new dependencies, no entity-file mutation, no per-entity write contention — everything is append-only JSONL.
Example audit log from a real run:
```json
{"event":"recall","session_id":"4bce...","entities":["in-this-environment-use-raw-python-with-the-struct-module"],"ts":"..."}
{"event":"influence","session_id":"4bce...","entity":"in-this-environment-use-raw-python-with-the-struct-module","verdict":"followed","evidence":"Agent immediately used raw Python struct parsing for EXIF extraction without attempting exiftool or PIL first.","ts":"..."}
```

Addresses the second half of #180 (Explainability & Provenance) — specifically "Track entity usage — record when an entity is retrieved and whether it influenced the outcome" and "A user can see which entities were used in a given session."
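The "which entities were used in a given session" question inverts into "which sessions used a given entity" with a simple scan of the append-only log. A hedged sketch using the field names from the example above (the function itself is illustrative, not part of this PR):

```python
# Hedged sketch: scan the JSONL audit log for recall events mentioning
# a given entity id, collecting the sessions that were served it.
import json
from pathlib import Path


def sessions_that_used(audit_log: Path, entity_id: str) -> set[str]:
    sessions = set()
    for line in audit_log.read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("event") == "recall" and entity_id in event.get("entities", []):
            sessions.add(event["session_id"])
    return sessions
```

Append-only JSONL keeps this query a linear pass with no index or locking, which is the design point the summary makes.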
Test plan
`pytest tests/e2e/test_sandbox_learn_recall.py -m e2e` passes (3:15 on macOS with the claude-sandbox image).

Follow-ups (not in this PR)
- `altk_evolve/viz/` — natural place to surface "sessions that used guideline X" once the log schema settles here.

Summary by CodeRabbit
New Features
Documentation
Tests