
feat(claude-plugin): record guideline usage per session in audit.log#239

Merged
visahak merged 10 commits into AgentToolkit:main from vinodmut:provenance-usage
May 2, 2026

Conversation

@vinodmut
Contributor

@vinodmut vinodmut commented Apr 30, 2026

Summary

Adds the reverse provenance direction to complement #236: now we know not just which session produced a guideline, but also which sessions used it, and whether it actually influenced the agent's behavior.

  • recall/retrieve_entities.py — every UserPromptSubmit appends one JSONL record to .evolve/audit.log with the served entity slugs and the session_id derived from transcript_path. Audit failures are swallowed so logging cannot break the user-visible recall path.
  • learn/SKILL.md — adds Step 4 (Assess Influence). The forked learn skill reads .evolve/audit.log, reconstructs the set of guidelines served to this session, and for each one emits a verdict (followed | contradicted | not_applicable) with a short evidence sentence grounded in the transcript.
  • learn/scripts/log_influence.py — new stdin → audit bridge. Validates the verdict against the allowed set and writes it back via the existing audit.append() helper (lib/audit.py).
  • tests/e2e/test_sandbox_learn_recall.py — the existing end-to-end test is extended with assertions that after session 2, the audit log contains both a recall event (with at least one slug from session 1) and an influence event (with a valid verdict).

No new dependencies, no entity-file mutation, no per-entity write contention — everything is append-only JSONL.
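The append-only recall logging described above can be sketched as follows. This is a minimal illustration, not the actual `audit.append()` helper from `lib/audit.py`: the function name, timestamp format, and error handling here are assumptions; only the event fields follow the example log in this description.

```python
import json
import time
from pathlib import Path


def append_recall_event(audit_path: Path, session_id: str, entities: list[str]) -> None:
    """Append one NDJSON recall record; never raise, so logging cannot break recall."""
    try:
        record = {
            "event": "recall",
            "session_id": session_id,
            "entities": entities,
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
        audit_path.parent.mkdir(parents=True, exist_ok=True)
        with audit_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
    except Exception:
        pass  # swallowed: audit failures must not break the user-visible recall path
```

Because each event is one self-contained line appended atomically-enough for this use, concurrent sessions never contend on a shared entity file.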

Example audit log from a real run:

{"event":"recall","session_id":"4bce...","entities":["in-this-environment-use-raw-python-with-the-struct-module"],"ts":"..."}
{"event":"influence","session_id":"4bce...","entity":"in-this-environment-use-raw-python-with-the-struct-module","verdict":"followed","evidence":"Agent immediately used raw Python struct parsing for EXIF extraction without attempting exiftool or PIL first.","ts":"..."}
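Reconstructing the set of guidelines served to a session (what the forked learn skill does in its assessment step) then reduces to a single NDJSON scan. A minimal sketch, with `guidelines_served` as a hypothetical helper name and field names taken from the example log above:

```python
import json
from pathlib import Path


def guidelines_served(audit_path: Path, session_id: str) -> set[str]:
    """Union of `entities` across all recall events for one session."""
    served: set[str] = set()
    for line in audit_path.read_text(encoding="utf-8").splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a torn or partial trailing line
        if rec.get("event") == "recall" and rec.get("session_id") == session_id:
            served.update(rec.get("entities", []))
    return served
```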

Addresses the second half of #180 (Explainability & Provenance) — specifically "Track entity usage — record when an entity is retrieved and whether it influenced the outcome" and "A user can see which entities were used in a given session."

Test plan

  • End-to-end: pytest tests/e2e/test_sandbox_learn_recall.py -m e2e passes (3:15 on macOS with the claude-sandbox image).
  • Manually inspected the resulting audit.log — both recall and influence events land with session_id correctly derived and evidence grounded in the transcript.
  • Codex and bob integrations still use their own recall scripts and don't emit these events — follow-up PRs will propagate the pattern.

Follow-ups (not in this PR)

  • Viz rendering of the new events (altk_evolve/viz/) — natural place to surface "sessions that used guideline X" once the log schema settles here.
  • Per-turn influence granularity (currently we emit one verdict per (session, entity); a long multi-turn session can only report one verdict even if relevance changed across turns).
  • Bob and codex parity.

Summary by CodeRabbit

  • New Features

    • Adds automated post-processing that links recalled guidelines to session impact, emitting validated per-entity verdicts (followed/contradicted/not_applicable) with brief evidence and recording influence events.
    • Improves entity identification to avoid ambiguity across projects.
  • Documentation

    • Updated learning workflow docs to describe the new assessment/evidence step.
  • Tests

    • Expanded tests to cover provenance audit parsing, influence recording, validation rules, and malformed-input handling.

Adds the reverse provenance direction: which sessions used a given
guideline. Complements the existing source-trajectory stamp on each
entity.

- recall/retrieve_entities.py now appends one `recall` event per
  UserPromptSubmit to .evolve/audit.log, listing the served entity
  slugs and the session_id derived from transcript_path. Failures are
  swallowed so logging cannot break the user-visible recall path.
- learn/SKILL.md gains a Step 4 that reads audit.log, reconstructs the
  set of guidelines served to this session, and emits per-entity
  verdicts (followed | contradicted | not_applicable) with a short
  evidence line.
- A new log_influence.py script validates and writes those verdicts
  back into audit.log.
- The e2e test asserts both event types land correctly after session 2.
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Warning

Rate limit exceeded

@vinodmut has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 32 minutes and 27 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b82a557-0564-498d-830a-98ea79864e21

📥 Commits

Reviewing files that changed from the base of the PR and between a18073c and cbef977.

📒 Files selected for processing (1)
  • tests/platform_integrations/test_log_influence.py
📝 Walkthrough


Adds recall-to-influence provenance: recall events now include entity _ids; a new log_influence.py ingests JSON assessments and writes influence audit events; recall script appends recall events; tests validate recall and influence entries in .evolve/audit.log.

Changes

  • Workflow Documentation (platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md): adds a new step describing how recall provenance links to impact evaluation, emitting a JSON payload to log_influence.py for the current session.
  • Influence Logging Script (platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py): new executable that reads JSON from stdin, validates the payload and verdicts (followed, contradicted, not_applicable), appends influence events to the evolve audit log, logs counts, and exits non-zero on a malformed top-level payload or invalid JSON.
  • Entity Recall + Audit (platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py): adds an _id derived from the entity filepath (relative stem) when loading entities; performs a best-effort audit.append(event="recall", session_id, entities) using get_evolve_dir/audit, swallowing audit errors so recall still succeeds.
  • End-to-End Tests (tests/e2e/test_sandbox_learn_recall.py): extends the E2E test to parse .evolve/audit.log (ndjson), assert a recall event for session 2 containing at least one entity from session 1, and assert at least one influence event for session 2 with an allowed verdict.
  • Unit/Integration Tests for Script (tests/platform_integrations/test_log_influence.py): new test suite running log_influence.py via subprocess; verifies correct influence audit writes for valid payloads, that invalid assessments are skipped, and that malformed payloads and bad JSON exit with code 1 without writing events.

Sequence Diagram

sequenceDiagram
    participant Transcript as Input Transcript
    participant Recall as Recall Service
    participant Audit as Audit Log
    participant Assess as log_influence.py
    participant Test as Test Runner

    Transcript->>Recall: request recall (transcript_path)
    Recall->>Recall: load entity files, set `_id` from path
    Recall->>Audit: append "recall" event (project_root, session_id, entities)

    Assess->>Audit: read `.evolve/audit.log` for session's recall events
    Assess->>Assess: for each recalled entity load guideline, determine verdict & evidence
    Assess->>Audit: append "influence" event (project_root, session_id, entity, verdict, evidence)

    Test->>Audit: read audit log entries
    Test->>Test: assert presence/structure of recall & influence events

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested Reviewers

  • visahak
  • gaodan-fang

Poem

🐇 I hopped through logs with curious paws,

Found stems and IDs beneath the laws.
I scribble verdicts, short and spry,
Then stamp the audit—watch them fly. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 21.05%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the PR title accurately summarizes the main change, recording guideline usage (recall and influence verdicts) per session in the audit log, and directly reflects the core objective of adding reverse provenance logging.
  • Linked Issues check (✅ Passed): check skipped; no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped; no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/e2e/test_sandbox_learn_recall.py (1)

168-213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add unit tests for the new influence logging paths.

This adds solid e2e coverage, but the new feature still needs unit tests for log_influence.py validation branches (bad payload shape, non-dict assessments, invalid verdict skipping) so failures are localized and deterministic.

As per coding guidelines tests/**/*.py: “All new features need tests (unit + e2e where applicable)”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_sandbox_learn_recall.py` around lines 168 - 213, Add unit
tests for log_influence.py that target its validation branches: create tests
that call the public entry points (e.g., the main handler function in
log_influence.py such as process_influence or validate_influence_payload) with
(1) a malformed payload shape (missing keys / wrong types) to assert it raises
or returns the expected validation error, (2) payloads where "assessments" is
not a dict or contains non-dict entries to assert those entries are
skipped/flagged, and (3) assessments containing invalid "verdict" values to
assert those assessments are ignored and do not produce influence events; assert
the function logs or returns the correct diagnostics and does not produce
side-effect influence records for invalid inputs. Ensure tests import
log_influence.py functions directly and stub/mock any I/O so failures are
deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py`:
- Around line 39-55: The handler currently assumes payload and its items are
dicts (using payload.get(...) and a.get(...)), which raises AttributeError for
non-dict JSON; first validate that payload is a mapping (e.g.,
isinstance(payload, dict)) before accessing payload.get("session_id") and
payload.get("assessments"), and ensure assessments is a list of dicts by
filtering or checking each item in the for loop (in the loop over assessments,
replace direct a.get(...) with a type guard: if not isinstance(a, dict): log and
continue), then extract entity/verdict/evidence from dicts only and skip/log
invalid entries; update the checks around session_id, assessments, and the loop
over a in assessments (refer to session_id, assessments, and the loop variable
a) to avoid AttributeError and to produce clear logs when items are skipped.

In `@platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md`:
- Around line 125-127: The current flow reads recalled slugs from
`.evolve/audit.log` (event == "recall", session_id) and then only opens
`.evolve/entities/guideline/<slug>.md`, which misses slugs served from
subscribed trees; update the resolution step so for each slug you attempt to
resolve it across subscribed trees rather than only
`.evolve/entities/guideline/`. Implement or call a resolver (e.g.,
resolveEntityPath(slug, subscribedTrees) or similar) that searches
`.evolve/entities/<tree>/.../<slug>.md` (falling back to
`.evolve/entities/guideline/<slug>.md` if needed), then read the resolved file
content + trigger for influence assessment. Ensure the union of `entities` from
audit.log is used with this multi-tree resolution.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f1f237e-addc-4971-96a9-9efbbf92934f

📥 Commits

Reviewing files that changed from the base of the PR and between 6166cf0 and 98e7c32.

📒 Files selected for processing (4)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py
  • platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
  • tests/e2e/test_sandbox_learn_recall.py

Fixes failing CI check: check-formatting
Validate that the JSON payload is a mapping before calling
payload.get(), and skip non-dict assessment items instead of letting
a.get() raise AttributeError. Logs each skipped item for traceability.

Addresses CodeRabbit review finding: Guard payload and assessment item types before calling `.get()`
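The guards that commit describes amount to an isinstance check before every `.get()`. A minimal sketch, assuming the verdict set from this PR; the function name and error messages are illustrative, not the actual script:

```python
import json
import sys

ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}


def parse_assessments(raw: str) -> list[dict]:
    """Validate the stdin payload; return only well-formed assessment dicts."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        print("error: invalid JSON payload", file=sys.stderr)
        raise SystemExit(1)
    # Top-level guards: a malformed payload exits non-zero.
    if not isinstance(payload, dict):
        print("error: payload must be a JSON object", file=sys.stderr)
        raise SystemExit(1)
    if not isinstance(payload.get("session_id"), str):
        print("error: missing session_id", file=sys.stderr)
        raise SystemExit(1)
    assessments = payload.get("assessments")
    if not isinstance(assessments, list):
        print("error: assessments must be a list", file=sys.stderr)
        raise SystemExit(1)
    # Per-item guards: invalid items are logged and skipped, not fatal.
    valid = []
    for a in assessments:
        if not isinstance(a, dict):
            print(f"skipping non-dict assessment item: {a!r}", file=sys.stderr)
            continue
        if not a.get("entity") or a.get("verdict") not in ALLOWED_VERDICTS:
            print(f"skipping invalid assessment: {a!r}", file=sys.stderr)
            continue
        valid.append(a)
    return valid
```

Splitting the guards this way matches the behavior the tests below exercise: top-level problems are fatal, per-item problems only skip that item.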
Step 4 previously hard-coded .evolve/entities/guideline/<slug>.md,
which misses entities served from subscribed repositories at
.evolve/entities/subscribed/<source>/guideline/<slug>.md. Switch the
instructions to a recursive search under .evolve/entities/ so the
influence assessment can reach the actual file wherever it lives.

Addresses CodeRabbit review finding: Resolve recalled slugs beyond only `.evolve/entities/guideline/`
Covers happy path (single/multiple assessments, default evidence),
per-item skip paths (invalid verdict, missing entity, non-dict item),
and top-level input validation (non-dict payload, missing session_id,
non-list assessments, invalid JSON).

Complements the e2e sandbox test with fast, deterministic coverage of
the helper's input-validation surface.

Addresses CodeRabbit review finding: Add unit tests for log_influence.py
@vinodmut vinodmut requested review from illeatmyhat and visahak April 30, 2026 18:44
@visahak
Collaborator

visahak commented May 1, 2026

Summary

This PR adds a provenance loop to the Claude evolve-lite plugin: recall now audits which guideline slugs were served, learn can log post-hoc influence assessments, and the stop hook includes the
saved trajectory path so new guidelines can point back to their source session. It also adds focused tests for log_influence.py, a sandbox e2e for the learn/recall flow, and docs for running that
sandbox test.

Findings

  1. Recall provenance collapses duplicate local/subscribed slugs (confidence: 96/100)
    • Why it matters: the new audit trail is supposed to let learn assess whether recalled guidance was followed, but it currently drops the subscribed-source identity and keeps only the filename stem.
      If the same slug exists locally and in a subscribed repo, the later influence step can attribute behavior to the wrong guideline.
    • Evidence: platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py:86-95 already knows whether an entity came from entities/subscribed/{source}/..., but
      :141-148 writes only entities=[_slug] to audit.log. Then platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md:125-127 tells learn to reopen the "first match" for that slug
      under .evolve/entities/, which is ambiguous across local and subscribed trees.
    • Example: I reproduced this with both .evolve/entities/guideline/shared-slug.md and .evolve/entities/subscribed/alice/guideline/shared-slug.md present. Recall served both entries, but the audit
      row recorded only entities: ["shared-slug"], so the later influence pass has no way to know which one was actually recalled.

Testing

  • Targeted agent e2e checks passed after installing the required extras:
    • uv run pytest tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s -v: passed
    • uv run pytest tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s -v: passed
  • Focused validation on the changed Claude plugin paths passed:
    • uv run pytest tests/platform_integrations/test_log_influence.py -v --run-e2e: 11 passed
    • uv run pytest tests/platform_integrations/test_retrieve.py -v: 18 passed
  • Sandbox learn/recall e2e was not runnable in this environment:
    • uv run pytest tests/e2e/test_sandbox_learn_recall.py -v --run-e2e: skipped locally because sandbox prerequisites/credentials were unavailable
  • Full suite status after installing extras:
    • uv run pytest -m e2e --run-e2e -v: 6 failed, 155 passed, 1 skipped, 32 errors, 524 deselected

Follow-up Issue

We are opening a separate issue to track the currently failing full-suite e2e/platform tests, since they do not clearly line up with this PR’s write set.

Failed tests from the full run:

  • tests/e2e/test_e2e_segmentation.py::test_segment_trajectory_min_subtasks
  • tests/platform_integrations/test_bob_sharing.py::TestBobSync::test_skips_invalid_subscription_name
  • tests/platform_integrations/test_bob_sharing.py::TestBobSync::test_rejects_dot_and_double_dot_names
  • tests/platform_integrations/test_codex_sharing.py::TestCodexSharingScripts::test_subscribe_warns_when_audit_write_fails
  • tests/platform_integrations/test_codex_sharing.py::TestCodexSharingScripts::test_sync_skips_invalid_subscription_name
  • tests/platform_integrations/test_sync.py::TestSync::test_skips_invalid_subscription_name

Additional full-run errors to track in that issue:

  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[manual_phoenix]
  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[simple_openai]
  • all selected tests/e2e/test_mcp.py cases from the full run
  • all selected tests/e2e/test_sharing.py cases from the full run

Notes for that issue:

  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[smolagents] passed when run individually
  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[openai_agents] passed when run individually
  • the earlier package/import issue was resolved by installing the required extras
  • the full-suite pipeline errors were caused by the Phoenix fixture failing to bind gRPC on port 4317
  • the invalid-subscription-name failures are in sync/subscribe paths not modified by this PR

@visahak
Collaborator

visahak commented May 1, 2026

@vinodmut Not sure if the findings are a problem.

visahak
visahak previously approved these changes May 1, 2026
…trees

The recall audit log stored bare filename stems, so the same slug from
a local entity and a subscribed repo collapsed into one entry and the
influence step couldn't tell which guideline actually fired. Switch
the stored id to the path relative to .evolve/entities/ (without .md):
"guideline/foo" for local entries, "subscribed/alice/guideline/foo"
for subscribed ones. The id is unambiguous, names exactly one file,
and lets learn open it directly (no recursive search).

SKILL.md Step 4 is simplified accordingly — no more find / multi-tree
resolver; just Read .evolve/entities/<id>.md.

The e2e test matches session 1's guidelines against the new qualified
ids, and the existing log_influence unit tests pass unchanged.

Addresses review feedback from visahak

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/platform_integrations/test_log_influence.py (1)

42-199: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use temp_project_dir instead of tmp_path in these platform integration tests.

From Line 42 onward, these tests run a subprocess and do file I/O under tmp_path. For tests/platform_integrations/test_*.py, these operations should use temp_project_dir to keep integration-test isolation consistent with suite conventions.

Suggested patch pattern
-class TestLogInfluence:
-    def test_writes_single_assessment(self, tmp_path):
-        evolve_dir = tmp_path / ".evolve"
+class TestLogInfluence:
+    def test_writes_single_assessment(self, temp_project_dir):
+        evolve_dir = temp_project_dir / ".evolve"
         result = run_log_influence(
-            tmp_path,
+            temp_project_dir,
             {
                 "session_id": "abc-123",
                 "assessments": [
                     {"entity": "slug-a", "verdict": "followed", "evidence": "because"},
                 ],
             },
             evolve_dir=evolve_dir,
         )

Apply the same replacement across the remaining test methods.

As per coding guidelines, “All file operations in platform integration tests must use temp_project_dir fixture - never touch real files”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/platform_integrations/test_log_influence.py` around lines 42 - 199,
Replace the use of the tmp_path fixture with the temp_project_dir fixture in
these platform integration tests: change each test method signature (e.g.,
test_writes_single_assessment(self, tmp_path) ->
test_writes_single_assessment(self, temp_project_dir)) and update all usages of
tmp_path inside the test (e.g., tmp_path / ".evolve" -> temp_project_dir /
".evolve" and run_log_influence(tmp_path, ...) ->
run_log_influence(temp_project_dir, ...)); apply the same replacement for every
test function in this file (test_writes_multiple_assessments,
test_skips_assessments_with_invalid_verdict,
test_skips_assessments_missing_entity, test_skips_non_dict_assessment_items,
test_empty_assessments_list_is_ok, test_evidence_defaults_to_empty_string,
test_rejects_non_dict_payload, test_rejects_missing_session_id,
test_rejects_non_list_assessments, test_rejects_invalid_json) to ensure all
subprocess and file I/O use temp_project_dir.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: adf197b4-c721-4d19-aef0-faf4c1751a51

📥 Commits

Reviewing files that changed from the base of the PR and between 98e7c32 and c005954.

📒 Files selected for processing (5)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py
  • platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
  • tests/e2e/test_sandbox_learn_recall.py
  • tests/platform_integrations/test_log_influence.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py

@vinodmut
Contributor Author

vinodmut commented May 1, 2026

@visahak thanks — your first finding reproduced exactly, and is fixed in c005954.

Change: the recall audit now records each entity as a path relative to .evolve/entities/ without the .md suffix, instead of a bare filename stem. Same-slug entries in different trees are now distinct.

  • Local: guideline/<slug>
  • Subscribed: subscribed/<source>/guideline/<slug>

Where:

  • retrieve_entities.py — computes str(md.relative_to(entities_dir).with_suffix("")) and stores it in the entities array of the recall row.
  • learn/SKILL.md Step 4 — simplified. The id names exactly one file, so the instruction is now "open .evolve/entities/<id>.md with Read" — no recursive find, no multi-tree resolver.
  • The influence example in SKILL.md now shows "entity": "guideline/<slug>" with a note that the prefix must match whatever appeared in the recall event.
  • tests/e2e/test_sandbox_learn_recall.py — session1_ids builds qualified ids to match.
  • test_log_influence.py — unchanged; the helper treats entity as an opaque string.
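The id computation quoted above is a pure path operation; a quick sketch of its effect, with `entity_id` as a hypothetical wrapper name around the `relative_to(...).with_suffix("")` expression:

```python
from pathlib import Path


def entity_id(md: Path, entities_dir: Path) -> str:
    """Qualified id: path relative to .evolve/entities/, without the .md suffix."""
    return str(md.relative_to(entities_dir).with_suffix(""))
```

Because the id keeps the tree prefix, a local and a subscribed copy of the same slug map to two distinct strings, which is exactly what the influence step needs to attribute verdicts correctly.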

Repro on this commit — same setup you described, local guideline/shared-slug.md plus a subscribed copy at subscribed/alice/guideline/shared-slug.md. Simulated a UserPromptSubmit hook with a fake session_id=fake-session-abc123:

{"event":"recall","session_id":"fake-session-abc123",
 "entities":["guideline/shared-slug","subscribed/alice/guideline/shared-slug"],
 "ts":"2026-05-01T15:32:08Z"}

Before the fix, both would have collapsed to "shared-slug" in the set literal. Now they're distinct identities, and the influence step can attribute verdicts to the correct guideline.

Example from the sandbox e2e (single-source case, just to show round-trip on a real run):

{
  "event": "recall",
  "session_id": "e8df99dd-fecb-4a75-840d-06c290bcd3dd",
  "entities": ["guideline/in-this-environment-extract-jpeg-exif-metadata-including"],
  "ts": "2026-05-01T15:06:05Z"
}
{
  "event": "influence",
  "session_id": "e8df99dd-fecb-4a75-840d-06c290bcd3dd",
  "entity": "guideline/in-this-environment-extract-jpeg-exif-metadata-including",
  "verdict": "followed",
  "evidence": "Agent went directly to raw Python struct-based JPEG/EXIF/TIFF parsing without attempting exiftool, ImageMagick, or PIL/Pillow",
  "ts": "2026-05-01T15:07:14Z"
}

On the follow-up full-suite issue — I can confirm locally that test_e2e_pipeline.py -k smolagents|openai_agents need OPENAI_API_KEY for the evolve sync phoenix step (gpt-4o via LiteLLM), which this PR doesn't touch. Agreed those belong in a separate issue.

vinodmut added 3 commits May 1, 2026 10:44
Merging public/main brought in AgentToolkit#243's SKILL.md restructure, which
inserted a "Review Existing Guidelines" step and shifted Save Entities
to Step 4 — colliding with the "Step 4: Assess Influence" section this
branch added. Rename the influence section to Step 5 and update its
sub-steps to reference Step 4 (save) and derive session_id from the
saved_trajectory_path variable (the post-AgentToolkit#243 name) instead of the
removed transcript_path.
Align with the rest of tests/platform_integrations/ which use the
temp_project_dir fixture (a thin wrapper over tmp_path that creates
an isolated test_project/ subdir). All 11 tests still pass.

Addresses CodeRabbit review finding: Replace the use of the tmp_path fixture with the temp_project_dir fixture

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/platform_integrations/test_log_influence.py (1)

194-199: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the missing “no audit side effect” assertion in invalid-JSON case.

This reject-path test checks exit code/stderr, but unlike other reject tests it does not verify audit.log remains empty.

✅ Suggested patch
     def test_rejects_invalid_json(self, temp_project_dir):
         evolve_dir = temp_project_dir / ".evolve"
         result = run_log_influence(temp_project_dir, None, raw_input="{not valid json", evolve_dir=evolve_dir)
         assert result.returncode == 1
         assert "json" in result.stderr.lower()
+        assert read_audit(evolve_dir) == []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/platform_integrations/test_log_influence.py` around lines 194 - 199,
The test_rejects_invalid_json case is missing the "no audit side effect"
assertion; after calling run_log_influence (in test_rejects_invalid_json) add an
assertion that the audit log in evolve_dir (e.g., check evolve_dir /
"audit.log") is either non-existent or empty to match other reject-path tests
and ensure no auditing occurred when raw_input is invalid JSON.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4edbc29-5e2b-44df-ac32-47bae691e334

📥 Commits

Reviewing files that changed from the base of the PR and between c005954 and a18073c.

📒 Files selected for processing (2)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • tests/platform_integrations/test_log_influence.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md

Mirrors the "no audit side effect" assertion present in the other
reject-path tests so we catch any regression where invalid JSON would
sneak a partial write into audit.log.

Addresses CodeRabbit review finding: The test_rejects_invalid_json case is missing the "no audit side effect" assertion
@vinodmut
Contributor Author

vinodmut commented May 1, 2026

@visahak ready for another look when you have a minute.

Since your last review:

  • c005954 — qualify recall audit ids with the subscribed-source prefix (addresses your main finding; repro in this earlier comment)
  • Merged public/main (picking up the SKILL.md restructure from #243, "fix(learn): stop the learn-skill permission-prompt storm at Stop") and renumbered the influence section to Step 5 so it doesn't collide with the renamed Save step
  • a18073c — swap tmp_path → temp_project_dir in test_log_influence.py to match the rest of tests/platform_integrations/
  • cbef977 — add the "no audit side effect" assertion to test_rejects_invalid_json

All 16 CI checks are green.

@visahak visahak merged commit 6cc2a5b into AgentToolkit:main May 2, 2026
16 checks passed
