
feat(cloud): Phase 2 — push/pull, materializer, conflict resolution, telemetry + legacy bloat cleanup #142

Merged
Gradata merged 29 commits into main from feat/cloud-sync-phase-2
Apr 22, 2026

Conversation


@Gradata Gradata commented Apr 22, 2026

Summary

Ships Phase 2 of cloud sync end-to-end, plus a significant legacy-bloat audit that removes ~1,600 net LOC of deprecated shims and dead code.

Cloud sync (Phase 2)

  • Unified CloudClient with keyfile + env credential chain (3a76c3ad)
  • /events/push with watermark + retry (ac4ecc4e)
  • /events/pull client, materializer (Tier 1 LWW + Tier 2 conflict detection), and apply pipeline (ac31aba7, a2b61d13)
  • Auto-resume from persisted watermark + pagination (f8762487)
  • CLI surface: gradata cloud enable/rotate-key/status/disconnect + gradata cloud sync-pull (082fc05a, cf945139)
  • Conflict resolution events + watermark persistence (64bee364)
  • Configurable Tier 2 threshold via cloud-config.json (266a76fe)
  • Idempotent apply — category normalized to match persisted form (cfa1cfac)
  • CLOUD_SYNC_COMPLETED telemetry emitted after watermark persist, with sync_status tag enrichment (04d651f1)
  • Convergence property test — 200+ shuffled orderings yield identical state (4c86d323)

Legacy bloat cleanup (c8a106f9)

Shims whose deprecation warnings have fired since 0.6.x are now removed per the new docs/contributing/deprecation-policy.md (two-minor-version carry rule):

  • gradata.patterns/ shim → use gradata.contrib.patterns / gradata.rules
  • gradata.enhancements.carl → use .behavioral_engine
  • gradata.enhancements.profiling.tone_profile (orphan, 546 LOC)
  • gradata.integrations.{anthropic,openai,langchain,crewai}_adapter → use gradata.middleware.{wrap_*, LangChainCallback, CrewAIGuard}
  • gradata login / gradata logout CLI + device-auth helpers → use gradata cloud enable --key gk_live_...

Live modules colocated with the deprecated tree are relocated to gradata.services/:

  • gradata.services.embeddings
  • gradata.services.session_history

Forwarding shims kept at old paths through v0.9.0 per policy.
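For illustration, a forwarding shim of the kind kept at the old paths can be sketched like this. The helper name and message wording are hypothetical, not the SDK's actual shim code:

```python
import warnings

def deprecated_alias(old, new, removal="v0.9.0"):
    """Warn that `old` now forwards to `new` and will be dropped in `removal`."""
    warnings.warn(
        f"{old} moved to {new}; this forwarding shim will be removed in {removal}",
        DeprecationWarning,
        stacklevel=3,
    )

# A shim module left at the old path would call this at import time and then
# re-export the real implementation, e.g.:
#   deprecated_alias("gradata.integrations.embeddings",
#                    "gradata.services.embeddings")
#   from gradata.services.embeddings import *  # noqa: F401,F403
```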

Supporting

  • Emit-time PII redaction + raw side-log (a6e9c518)
  • RULE_GRADUATED events at every state transition (731cb536)
  • Migrations 002 (event identity) + 003 (sync_state) (10df605a)
  • Docs, install_manifest, CHANGELOG (0.7.0 entry) updated

Test plan

  • Full suite: 4025 passed, 2 skipped (pytest tests/)
  • Convergence property test — 200+ shuffled orderings yield same state
  • CLOUD_SYNC_COMPLETED emits on apply, not on dry-run
  • Deprecation shims for services.embeddings / services.session_history still importable from old path
  • Smoke test gradata cloud enable → gradata cloud sync-pull against staging
  • Observe CLOUD_SYNC_COMPLETED in dashboard after real pull

Breaking changes

See CHANGELOG [0.7.0] entry. Public symbols/paths removed have emitted DeprecationWarning for at least two minor versions per the new deprecation policy.

Generated with Gradata

Gradata and others added 20 commits April 21, 2026 21:38
Council Phase 1 tasks 2-3. Migration 002 adds event_id (ULID),
device_id, content_hash, correction_chain_id, origin_agent as
nullable columns with chunked 10k backfill. Migration 003 creates
sync_state(brain_id, device_id, last_push_event_id, last_pull_cursor)
for watermark tracking.

No legacy-* device_id prefix (Pragmatist cut): first device to run
002 assigns its device_id to all historical events.

Co-Authored-By: Gradata <noreply@gradata.ai>
…afety net

Council Phase 1 task 4, rejects Decision 8 as originally specified.
Redacts emails in event data/tags inside emit() before any persistent
write. Raw unredacted events tee to events.raw.jsonl (gitignored via
*.jsonl pattern) for local debugging only.

Fail-closed: if the redactor raises, neither raw side-log nor canonical
events.jsonl is written. RetainOrchestrator.flush() re-runs redaction
as a defense-in-depth safety net in case anyone queues raw events
bypassing emit().

Co-Authored-By: Gradata <noreply@gradata.ai>
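A minimal sketch of the fail-closed ordering this commit describes. The regex, placeholder text, and function shapes are illustrative assumptions, not the SDK's actual implementation:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(value):
    """Recursively replace email addresses with a placeholder."""
    if isinstance(value, str):
        return EMAIL_RE.sub("[redacted-email]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

def emit(event, write_canonical, write_raw):
    """Fail-closed: redaction runs first, so if it raises,
    neither the raw side-log nor the canonical log is written."""
    redacted = redact(event)
    write_raw(event)           # unredacted tee, local debugging only
    write_canonical(redacted)  # the only copy that persists/syncs
```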
Council Phase 1 task 1. Unblocks Decision 4 materializer work
(deferred to Phase 2+) by making graduation state changes visible in
events.jsonl instead of only writing to SQLite lesson_transitions.

Instrumented 7 transitions: instinct->pattern, pattern->rule, demoted
below threshold, zero-confidence, untestable expired/idle, moved to
untestable. Each emits {category, description, old_state, new_state,
confidence, fire_count, reason}.

Co-Authored-By: Gradata <noreply@gradata.ai>
Council Phase 1 task 5. cloud/_credentials.py owns the single
resolve_credential() chain: kwarg > ~/.gradata/key > GRADATA_API_KEY
env > fallback. CloudClient._post() and /events/push both flow
through this chain instead of using config.token directly — closes
the silo where a keyfile-configured brain couldn't reach the wire.

Also adds GRADATA_CLOUD_SYNC_DISABLE kill switch honored by both the
new CloudClient.enabled and the legacy _cloud_sync.enabled() paths.
key_scope field carried in cloud-config.json for future scoped keys.

Co-Authored-By: Gradata <noreply@gradata.ai>
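The kwarg > keyfile > env chain can be sketched as follows. The function name matches the commit text; the exact keyfile handling and fallback-to-None behavior are assumptions:

```python
import os
from pathlib import Path

def resolve_credential(key=None, keyfile=None):
    """Resolution order: explicit kwarg > ~/.gradata/key file >
    GRADATA_API_KEY env var > None (no credential)."""
    if key:
        return key
    path = Path(keyfile) if keyfile else Path.home() / ".gradata" / "key"
    try:
        text = path.read_text().strip()
        if text:
            return text
    except OSError:
        pass  # missing/unreadable keyfile falls through to the env var
    return os.environ.get("GRADATA_API_KEY") or None
```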
…deprecation

Council Phase 1 task 6. Ships four subcommands as the primary path
to configuring cloud sync:

  gradata cloud enable --key gk_live_... [--scope ...]
  gradata cloud rotate-key --key gk_live_...
  gradata cloud status
  gradata cloud disconnect

gradata login prints a deprecation notice redirecting users to the
new path. Preserves the old command per User Advocate's "no silent
breakage" objection to hard deletion.

Co-Authored-By: Gradata <noreply@gradata.ai>
Council Phase 1 task 7. Pushes rows from the local events table to
POST /events/push in 500-row chunks. Exponential backoff on 5xx,
fatal on 4xx. Watermark advances via sync_state.last_push_event_id
only after all batches of a run succeed, so partial failures
re-push on the next call.

Honors GRADATA_CLOUD_SYNC_DISABLE kill switch. Credential flows
through the unified _credentials.resolve_credential() chain so
keyfile, env var, and cloud-config.json all reach the wire.

Co-Authored-By: Gradata <noreply@gradata.ai>
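The batching/backoff/watermark contract above can be sketched as a small loop. Function signature, row shape, and retry count are illustrative; only the semantics (500-row chunks, backoff on 5xx, fatal 4xx, watermark advances only after a fully successful run) come from the commit:

```python
import time

def push_pending(rows, post, batch_size=500, max_retries=3, sleep=time.sleep):
    """Push rows in fixed-size batches. Returns the new watermark only if
    every batch lands; on partial failure the old watermark stands and
    the next run re-pushes from it."""
    watermark = None
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for attempt in range(max_retries):
            status = post(batch)
            if 200 <= status < 300:
                break
            if 400 <= status < 500:
                raise RuntimeError(f"fatal client error {status}")
            sleep(min(2 ** attempt, 30))  # exponential backoff on 5xx
        else:
            return None  # retries exhausted: don't advance the watermark
        watermark = batch[-1]["event_id"]
    return watermark
```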
Locks the /events/pull endpoint contract in code before Phase 2 wires
the merge path. 501 Not Implemented from server surfaces as a clean
'disabled_server_side' status; 200 OK raises NotImplementedError
loudly so nothing accidentally merges into local state before the
materializer and Decision 9 merge semantics ship.

Contract details in docs/specs/events-pull-contract.md.

Co-Authored-By: Gradata <noreply@gradata.ai>
…ontract

Addresses the four council blind spots that must be resolved before
Phase 2 implementation:

- docs/specs/merge-semantics.md — Decision 9. Three-tier conflict
  resolution: last-write-wins for small deltas, user-surfaced conflict
  queue for >= 0.15 confidence gaps or opposite directions, source-
  authority override reserved for team RBAC in Phase 3.
- docs/errors.md — Every user-facing error code + message + remediation
  for cloud/migration/redaction/event-validation paths.
- docs/specs/retention-clock.md — first_cloud_store_ts is the retention
  clock. Rejects emit_ts (backup restore breaks) and push_ts (clock
  reset gaming). Tier-scoped retention windows.
- docs/specs/events-pull-contract.md — frozen interface for GET
  /events/pull, documents the Phase 1 disabled-server-side behavior
  and Phase 2 merge requirements.

Co-Authored-By: Gradata <noreply@gradata.ai>
…ction

Pure event-stream → rule-state fold. Tier 1 LWW (|Δconf| < 0.15 and
matching new_state), Tier 2 conflict hold on large drift or state
disagreement, Tier 3 RULE_OVERRIDE recognized. No writes, no network —
caller decides how to apply materialized state.

Ships with 15 tests covering spec invariants: idempotent replay, dedup,
order-independence for non-conflicting streams, tiebreak on device_id.

Co-Authored-By: Gradata <noreply@gradata.ai>
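The Tier 1 / Tier 2 fold reduces to a few lines. The event field names and the conflict bookkeeping here are assumptions; only the decision rule (LWW on small same-state deltas, hold on drift or state disagreement) mirrors the commit:

```python
def materialize(events, threshold=0.15):
    """Pure fold: event stream -> (rules, conflict keys). Deterministic
    because events are sorted by (ts, device_id) before folding, so any
    input ordering converges to the same state."""
    rules, conflicts = {}, set()
    for ev in sorted(events, key=lambda e: (e["ts"], e["device_id"])):
        key, cur = ev["key"], rules.get(ev["key"])
        if cur is None:
            rules[key] = ev
        elif (abs(ev["confidence"] - cur["confidence"]) < threshold
              and ev["new_state"] == cur["new_state"]):
            rules[key] = ev          # Tier 1: last-write-wins
        else:
            conflicts.add(key)       # Tier 2: hold for user resolution
    return rules, conflicts
```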
…mission

Adapter between the pure event-fold and the persistent lessons store.
apply_to_lessons() non-destructively merges materialized state into an
existing Lesson list — only state/confidence/fire_count overwrite; scope
and metadata are preserved. Rules held in Tier 2 conflict are skipped.

emit_conflict_events() walks result.conflicts and emits one RULE_CONFLICT
per key so the dashboard conflict queue can surface them. Injectable
emit_fn for tests.

Co-Authored-By: Gradata <noreply@gradata.ai>
Replaces the NotImplementedError guard on the 200 OK path with the full
merge pipeline. apply=False (default) returns a preview summary with
rules_materialized + conflict counts and never touches lessons.md.
apply=True merges materialized state into lessons.md and emits
RULE_CONFLICT events for unresolved Tier 2 conflicts.

Summary shape gains rules_materialized, conflicts, applied, and
conflict_events_emitted fields. Malformed JSON now returns
status=error/reason=malformed_response instead of crashing.

Co-Authored-By: Gradata <noreply@gradata.ai>
…ence

Materializer now recognizes RULE_CONFLICT_RESOLVED events: clears the
Tier 2 hold and re-applies the user-picked graduation by (ts, device_id).
Resolved conflicts are pruned from result.conflicts so callers don't
emit stale RULE_CONFLICT events for keys already adjudicated.

New cloud._sync_state module upserts last_pull_cursor keyed by
(tenant_id, device_id). pull_events(apply=True) now persists the
watermark after a successful lessons.md write so the next pull can
resume incrementally. Schema ensure runs lazily so brains predating
Migration 003 still get a working sync_state table.

Co-Authored-By: Gradata <noreply@gradata.ai>
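The resolution replay this commit describes could look roughly like the following. The event shape and field names (`key`, `chosen`) are assumptions for the sketch:

```python
def resolve_conflicts(rules, conflicts, resolutions):
    """Apply RULE_CONFLICT_RESOLVED-style events in (ts, device_id) order:
    clear the Tier 2 hold and install the user-picked state, pruning the
    key so callers don't re-emit stale RULE_CONFLICT events for it."""
    for res in sorted(resolutions, key=lambda r: (r["ts"], r["device_id"])):
        if res["key"] in conflicts:
            conflicts.discard(res["key"])
            rules[res["key"]] = res["chosen"]
    return rules, conflicts
```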
Successive pulls now stream incrementally: when neither rebuild_from nor
cursor is provided, pull_events falls back to last_pull_cursor from
sync_state so the client doesn't re-download history it already applied.

Added a pagination loop (capped at 50 pages) that drains end_of_stream=False
responses by passing the page watermark as the next cursor. Defensive:
stops if the server forgets to return a watermark, so a broken pagination
contract can't spin the client.

Explicit rebuild_from still wins — it's the escape hatch for rebuilding
from retention, and should never be masked by a stale persisted cursor.

4048 pass / 2 skip. 5 new tests cover auto-resume, rebuild_from override,
multi-page drain, no-watermark defense, and the summary shape.

Co-Authored-By: Gradata <noreply@gradata.ai>
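The capped, defensive drain loop can be sketched as below. The page dict keys mirror the /events/pull response fields named in this PR; the function itself is a simplified stand-in for pull_events:

```python
def drain(pull_page, cursor=None, max_pages=50):
    """Drain a paginated pull endpoint, following the page watermark as
    the next cursor. Defensive: a missing watermark on a non-final page
    aborts the loop so a broken contract can't spin the client."""
    events, pages = [], 0
    while pages < max_pages:
        page = pull_page(cursor)
        events.extend(page.get("events", []))
        pages += 1
        watermark = page.get("watermark")
        if watermark:
            cursor = watermark
        if page.get("end_of_stream", True) or not watermark:
            break
    return events, cursor, pages
```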
…pull

Exposes the materialize→apply pipeline under the existing `gradata cloud`
family. Dry-run by default (counts printed, lessons.md untouched);
--apply merges materialized state and emits RULE_CONFLICT events.

Flags: --apply, --rebuild-from <watermark>, --limit <n>. No-db / no-cred
paths surface as printable reasons so smoke tests can grep the output.

2 new tests cover the no-db path and the dry-run summary shape.

Co-Authored-By: Gradata <noreply@gradata.ai>
…e state

Ship-gate from merge-semantics.md §6: any permutation of an event pool
must materialize to identical rules once sorted by ts. Two cases:

- Pure LWW pool (no Tier 2 drift) — 200 permutations × 8 keys × 4 events
  each → byte-equal rules dict.
- Mixed pool with injected 0.95-confidence bursts that force Tier 2
  holds — 100 permutations → same rules AND same conflict key set.

Keeps under 1s so it stays in the default suite rather than hiding behind
a marker.

Co-Authored-By: Gradata <noreply@gradata.ai>
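The ship-gate property reduces to a small reusable harness. The harness is generic over any `materialize` function whose output supports equality; names are illustrative, not the actual test code:

```python
import random

def assert_convergence(pool, materialize, permutations=200, seed=0):
    """Property test: every permutation of the event pool must
    materialize to the same state."""
    rng = random.Random(seed)
    baseline = materialize(list(pool))
    for _ in range(permutations):
        shuffled = list(pool)
        rng.shuffle(shuffled)
        assert materialize(shuffled) == baseline, "ordering changed the fold"
    return baseline
```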
Adds a conflict_reason closed-vocabulary prefix (confidence_drift /
state_disagreement) and wires enrich_tags so RULE_GRADUATED,
RULE_DEMOTED, RULE_OVERRIDE, RULE_CONFLICT, and RULE_CONFLICT_RESOLVED
auto-attach category:<X> pulled from the payload. RULE_CONFLICT also
auto-attaches conflict_reason:<Y>, and is marked required_on so broken
emitters surface as validation issues.

Net: cloud events now query the same way as the SDK's own learning
events — brain.query_events(tags=["category:STYLE"]) returns both local
corrections and cloud-graduated patterns without the caller doing
manual tag plumbing.

7 new taxonomy tests. 4065 pass / 2 skip.

Co-Authored-By: Gradata <noreply@gradata.ai>
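A sketch of the enrichment this commit wires up. The event-type set and payload keys follow the commit text; the function shape and return convention are assumptions:

```python
AUTO_CATEGORY = {  # event types that auto-attach category:<X> from the payload
    "RULE_GRADUATED", "RULE_DEMOTED", "RULE_OVERRIDE",
    "RULE_CONFLICT", "RULE_CONFLICT_RESOLVED",
}

def enrich_tags(event_type, payload, tags=()):
    """Return tags with taxonomy enrichment applied."""
    enriched = list(tags)
    if event_type in AUTO_CATEGORY and payload.get("category"):
        enriched.append(f"category:{payload['category']}")
    if event_type == "RULE_CONFLICT" and payload.get("conflict_reason"):
        enriched.append(f"conflict_reason:{payload['conflict_reason']}")
    return enriched
```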
…json

CloudConfig.conflict_threshold (default 0.0 sentinel = use SDK default)
is threaded through pull_events into materialize(). Out-of-range and
unparsable values collapse to the sentinel so a broken config never
silently changes merge behavior.

gradata cloud sync-pull now prints the active threshold so operators can
verify the override took effect without opening the config file.

4065 pass / 2 skip. 6 new tests: defaults, persist, clamp, garbage,
end-to-end override flips a drift into a conflict.

Co-Authored-By: Gradata <noreply@gradata.ai>
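The sentinel/clamp behavior can be sketched as a single guard function. The default of 0.15 is taken from the merge-semantics threshold elsewhere in this PR; the function name and range bounds are hypothetical:

```python
def effective_threshold(configured, default=0.15):
    """Collapse the 0.0 sentinel, out-of-range, or unparsable values to
    the SDK default so a broken config never silently changes merging."""
    try:
        value = float(configured)
    except (TypeError, ValueError):
        return default
    if value == 0.0 or not (0.0 < value <= 1.0):
        return default
    return value
```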
…form

A second pull-and-apply of the same stream was duplicating rows in
lessons.md. Root cause: apply_to_lessons matched existing lessons with
raw event-payload category (lowercase), but the lessons.md serializer
uppercases on persist. The match always missed on replay → every
materialized rule was appended as a fresh row.

Fix: case-insensitive lookup (.upper() on both sides) and uppercase
category when constructing fresh Lesson rows so the in-memory form
matches what comes back from parse_lessons. Updated the existing
append-test assertion to lock the canonical form.

Also lands 3 end-to-end disaster-recovery tests proving the ship-gate:

- Blank brain + one apply → rules identical to the origin brain's.
- Replay of the same stream → byte-identical lessons.md (regression
  lock for this specific bug).
- Pre-existing local lesson in a different category survives a merge.

4068 pass / 2 skip.

Co-Authored-By: Gradata <noreply@gradata.ai>
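The fix pattern — case-insensitive match plus uppercase persist — in miniature, with a Lesson represented as a plain dict rather than the real Lesson type:

```python
def merge_rule(lessons, category, description, state, confidence):
    """Idempotent apply: match existing rows case-insensitively on
    category, and persist new rows uppercase so the in-memory form
    matches what the serializer writes back on the next parse."""
    for lesson in lessons:
        if (lesson["category"].upper() == category.upper()
                and lesson["description"] == description):
            lesson["state"], lesson["confidence"] = state, confidence
            return lessons
    lessons.append({"category": category.upper(), "description": description,
                    "state": state, "confidence": confidence})
    return lessons
```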
After watermark persist, emit a summary event (events_pulled,
rules_materialized, conflicts, pages_fetched, conflict_threshold,
watermark, status) so the learning dashboard can observe sync
outcomes. Dry-run paths do not emit.

Tag taxonomy: add closed-value sync_status tag ({ok, error,
skipped}) and enrichment block for the new event.

Co-Authored-By: Gradata <noreply@gradata.ai>
Shims whose deprecation warnings have fired since 0.6.x are now
removed per docs/contributing/deprecation-policy.md (new file) and
the two-minor-version carry rule:

- gradata.patterns/ shim (use gradata.contrib.patterns / gradata.rules)
- gradata.enhancements.carl re-export (use .behavioral_engine)
- gradata.enhancements.profiling.tone_profile (orphan, 546 LOC)
- gradata.integrations.{anthropic,openai,langchain,crewai}_adapter
  (use gradata.middleware.{wrap_*, LangChainCallback, CrewAIGuard})
- gradata login / gradata logout CLI + device-auth helpers
  (use `gradata cloud enable --key gk_live_...`)

Live modules mistakenly colocated with the deprecated adapters are
relocated to gradata.services/:
- gradata.services.embeddings
- gradata.services.session_history

Forwarding shims kept at the old paths through v0.9.0 per policy.

Docs, install_manifest, changelog (0.7.0 entry), and affected tests
updated. Full suite: 4025 passed, 2 skipped.

Co-Authored-By: Gradata <noreply@gradata.ai>



coderabbitai Bot commented Apr 22, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Cloud Sync Phase 2 & Legacy Cleanup (v0.7.0)

Breaking Changes:

  • BREAKING: Removed deprecated modules: gradata.patterns, gradata.enhancements.carl, gradata.enhancements.profiling.tone_profile, and all gradata.integrations.*_adapter modules (OpenAI, Anthropic, LangChain, CrewAI)
  • BREAKING: Removed gradata login/gradata logout CLI commands (replaced with gradata cloud enable/rotate-key/status/disconnect)
  • BREAKING: Relocated middleware APIs from gradata.integrations.*_adapter to gradata.middleware (wrap_openai, wrap_anthropic, LangChainCallback, CrewAIGuard)
  • Forwarding shims for gradata.integrations.embeddings and gradata.integrations.session_history retained through v0.9.0

New Cloud Sync Features:

  • Unified CloudClient with keyfile + environment credential chain via gradata.cloud._credentials
  • /events/push with exponential backoff retries, watermark persistence, and event batching
  • /events/pull with pagination, cursor-based resume, and optional rebuild-from support
  • Conflict resolution with Tier 1 (LWW), Tier 2 (conflict detection/hold), and Tier 3 (override) mechanics via new gradata.cloud.materializer
  • Auto-resume from persisted pull watermark via gradata.cloud._sync_state
  • New CLOUD_SYNC_COMPLETED telemetry event with sync_status tag enrichment

Security:

  • Email PII redaction in event payloads with best-effort raw side-log (events.raw.jsonl)

New Public APIs:

  • gradata.cloud.push.push_pending_events(brain_dir, ...)
  • gradata.cloud.pull.pull_events(brain_dir, ...)
  • gradata.cloud.materializer.materialize(events, ...)
  • gradata.services.embeddings and gradata.services.session_history (relocated from integrations)

Database & Events:

  • Migration 002: Adds event identity (event_id ULID, device_id, content_hash)
  • Migration 003: Adds per-device watermark persistence table (sync_state)
  • RULE_GRADUATED events emitted on rule state transitions with transition reasons
  • New tag taxonomy entries: conflict_reason, sync_status

CLI Enhancements:

  • New gradata cloud subcommands: enable, rotate-key, status, disconnect, sync-pull
  • Support for --apply, --rebuild-from, --limit flags on sync-pull

Documentation:

  • New deprecation policy defining 2-version carry window for shims
  • Error specification with structured codes (CRED_, SYNC_, MIG_, REDACT_, EVT_, CONFLICT_)
  • Merge semantics specification for conflict resolution
  • Retention clock specification (Free: 30d raw/90d rules; Personal+: 2y raw/indefinite rules)
  • /events/pull contract specification

Test Coverage:

  • Full test suite: 4,025 passed, 2 skipped
  • Convergence property test: 200+ shuffled event orderings yield identical state
  • Cloud sync, materialization, credential, and CLI testing added

Walkthrough

Adds a v0.7.0 cloud sync system (push/pull) with materializer and apply path, migrations (event identity, sync_state), ULID/device-id utilities, email PII redaction, package relocations (patterns/integrations→contrib/services/middleware), a unified cloud CLI, embedding/session-history services, docs, and many tests.

Changes

Cohort / File(s) / Summary:

  • Cloud sync & materialization
    Files: Gradata/src/gradata/cloud/pull.py, Gradata/src/gradata/cloud/push.py, Gradata/src/gradata/cloud/materializer.py, Gradata/src/gradata/cloud/_apply_materialized.py, Gradata/src/gradata/cloud/_credentials.py, Gradata/src/gradata/cloud/_sync_state.py, Gradata/src/gradata/cloud/sync.py, Gradata/src/gradata/cloud/__init__.py
    Summary: New push/pull clients, credential/keyfile helpers and kill-switch, per-(tenant,device) sync_state persistence, paginated pull with materialize + optional apply, batch push with retries/backoff and watermark advancement, and materializer/apply helpers.
  • Migrations & helpers
    Files: Gradata/src/gradata/_migrations/002_add_event_identity.py, Gradata/src/gradata/_migrations/003_add_sync_state.py, Gradata/src/gradata/_migrations/_ulid.py, Gradata/src/gradata/_migrations/device_uuid.py, Gradata/src/gradata/_migrations/_runner.py
    Summary: Added migrations 002/003 with chunked backfill, ULID generator, disk-scoped device-id persistence, and migration-runner index/unique upgrade behavior.
  • Event pipeline & core
    Files: Gradata/src/gradata/_events.py, Gradata/src/gradata/_tag_taxonomy.py, Gradata/src/gradata/enhancements/self_improvement/_graduation.py, Gradata/src/gradata/_core.py, Gradata/src/gradata/_doctor.py, Gradata/src/gradata/brain.py, Gradata/src/gradata/__init__.py
    Summary: Email PII redaction for persisted events (with raw side-log), tag enrichment additions (conflict_reason, sync_status, auto-category), emit RULE_GRADUATED on transitions, doctor endpoint/env refactor, and multiple formatting/payload tweaks.
  • Namespace, shims & adapters
    Files: Gradata/src/gradata/enhancements/carl.py, Gradata/src/gradata/patterns/__init__.py, Gradata/src/gradata/integrations/..., Gradata/src/gradata/contrib/enhancements/install_manifest.py
    Summary: Removed or converted deprecated compatibility shims/adapters; canonicalized legacy aliases (carl → behavioral_engine, integrations → middleware); updated default manifest registry and removed several adapter modules.
  • Services: embeddings & session history
    Files: Gradata/src/gradata/services/embeddings.py, Gradata/src/gradata/services/session_history.py, Gradata/src/gradata/services/__init__.py
    Summary: Added gradata.services implementations: EmbeddingClient (API + local fallback, cache, clustering, bus subscription) and SessionHistory service (event handlers, effectiveness computation).
  • CLI
    Files: Gradata/src/gradata/cli.py, Gradata/tests/test_cli_cloud.py
    Summary: Removed login/logout; added unified gradata cloud command (enable, rotate-key, status, disconnect, sync-pull) wired to new cloud modules; tests added.
  • Removed legacy modules
    Files: Gradata/src/gradata/_cloud_sync.py, Gradata/src/gradata/enhancements/profiling/tone_profile.py, Gradata/src/gradata/enhancements/profiling/__init__.py
    Summary: Deleted legacy push-only _cloud_sync and tone-profiling modules (and associated exports/docstring removal).
  • Docs, specs & policies
    Files: Gradata/docs/... (api, architecture, specs, guides, contributing, changelog, errors)
    Summary: Docs updated to new import paths (gradata.contrib.patterns, gradata.enhancements.behavioral_engine, gradata.middleware, etc.), added event-pull/merge/retention specs, deprecation policy, error catalog, and v0.7.0 changelog entry.
  • Tests: added/updated/removed
    Files: Gradata/tests/... (many files added/modified/removed)
    Summary: Extensive tests for cloud CLI, pull/push/materializer, migrations, PII redaction, rule-graduated events, tag enrichment and credentials; updated imports for relocated packages; removed legacy adapter integration and device-auth tests.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,200,255,0.5)
    participant CLI
    participant PullClient
    participant CloudAPI
    participant Materializer
    participant Lessons
    participant Brain
    end
    CLI->>PullClient: pull_events(brain_dir, apply=...)
    PullClient->>CloudAPI: GET /events/pull?brain_id&device_id...(auth)
    CloudAPI-->>PullClient: 200 {events, watermark, end_of_stream}
    PullClient->>Materializer: materialize(events)
    Materializer-->>PullClient: MaterializeResult(rules, conflicts, counts)
    alt apply=True
        PullClient->>Lessons: apply_to_lessons(MaterializeResult)
        Lessons-->>PullClient: updated lessons file
        PullClient->>Brain: emit(CLOUD_SYNC_COMPLETED)
        PullClient->>PullClient: update_pull_cursor(watermark)
    end
    PullClient-->>CLI: summary/status
sequenceDiagram
    rect rgba(255,200,200,0.5)
    participant PushClient
    participant DB
    participant CloudAPI
    participant SyncState
    end
    PushClient->>DB: fetch events > last_push_event_id (chunk)
    PushClient->>CloudAPI: POST /events/push (batch)
    alt 2xx
        CloudAPI-->>PushClient: success
        PushClient->>SyncState: upsert last_push_event_id
    else 4xx
        CloudAPI-->>PushClient: client error -> abort (no retry)
    else 5xx/transport
        CloudAPI-->>PushClient: transient -> retry/backoff
    end
    loop until no pending events
    end
sequenceDiagram
    rect rgba(200,255,200,0.5)
    participant Graduation
    participant EventEmitter
    participant EventStore
    end
    Graduation->>EventEmitter: _emit_rule_graduated(lesson, old_state, new_state, reason)
    EventEmitter->>EventEmitter: redact payload (emails)
    EventEmitter->>EventStore: append redacted event to events.jsonl
    EventEmitter->>EventStore: persist redacted event to system.db
    EventEmitter->>EventStore: write raw unredacted events.raw.jsonl (best-effort)
    EventEmitter-->>Graduation: emit result

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


@coderabbitai coderabbitai Bot added the feature label Apr 22, 2026
….cloud.*

_cloud_sync.py was the pre-Phase-2 Supabase row-push MVP with its own
env-var scheme (GRADATA_CLOUD_SYNC / GRADATA_CLOUD_URL / GRADATA_CLOUD_KEY)
and direct PostgREST calls. It had zero production callers after the
new gradata.cloud.{client,push,pull,sync,materializer,_credentials}
package shipped.

Removed:
- src/gradata/_cloud_sync.py (440 LOC)
- tests/test_cloud_row_push.py (tested the deleted module)

Updated:
- _doctor._check_cloud_env_vars: now checks GRADATA_API_KEY /
  GRADATA_ENDPOINT / GRADATA_CLOUD_SYNC_DISABLE (the env-var surface the
  new gradata.cloud.* uses) instead of the old Supabase trio.
- test_cloud_credentials.test_kill_switch: now targets
  cloud.push.push_pending_events status=kill_switch (the real network
  entrypoint that honours the switch) instead of the deleted row-push
  enabled() helper.
- test_doctor_cloud env-var tests rewritten for the new surface.
- _migrations/003_add_sync_state docstring: drop stale references.
- test_embeddings_integration: switch to gradata.services.embeddings
  (missed in prior services/ relocation sweep).

Full suite: 4019 passed, 2 skipped, 0 warnings.

Co-Authored-By: Gradata <noreply@gradata.ai>



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 33

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Gradata/src/gradata/_doctor.py (1)

252-267: ⚠️ Potential issue | 🟠 Major

Reuse the unified credential resolver here.

These checks still only accept [cloud].api_key from config.toml. After the new keyfile/env credential chain, a valid setup can now be reported as missing/fail, and cloud_auth can be skipped entirely. doctor --cloud should resolve credentials the same way the runtime client does.

Also applies to: 346-365

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/_doctor.py` around lines 252 - 267, update
_check_cloud_config to use the unified credential resolver instead of directly
reading config.toml: call the shared resolver used by the runtime client (the
credential resolution function that checks keyfile, env vars, and config) and
base the returned status on whether that resolver yields valid credentials;
return "ok" when credentials are resolved, "missing" when resolver indicates no
credentials available, and "fail" when resolver indicates invalid credentials or
an error. Replace uses of _read_cloud_config/_gradata_config_path in
_check_cloud_config and mirror the exact same replacement in the similar block
at lines 346-365 (the cloud_auth-related check) so both doctor --cloud checks
use the same resolver logic and outcomes. Ensure you reference the resolver
function by its existing name in the codebase when integrating.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/docs/api/patterns.md`:
- Line 114: The docs import paths are wrong: replace any occurrences of
gradata.contrib.patterns.* with gradata.rules.* so the examples import the rule
system modules correctly; specifically update the import that currently reads
"from gradata.contrib.patterns.scope import classify_scope, register_task_type"
to "from gradata.rules.scope import classify_scope, register_task_type" (and do
the same for the other imports in this document that reference
gradata.contrib.patterns, updating their module paths to gradata.rules.* to
match the actual symbols).

In `@Gradata/docs/changelog.md`:
- Around line 7-13: The changelog currently claims the `gradata.patterns` shim
(and other shims) were removed in the 0.7.0 "BREAKING" entry but an earlier note
states removal is scheduled for v0.8.0; reconcile by restoring consistency:
either move the removal bullet(s) out of the 0.7.0 entry and mark them as
deprecated (keep the deprecation text and reference the planned removal in
v0.8.0), or update the wording to say "deprecated in 0.7.0; removal scheduled
for v0.8.0"; specifically edit the bullets referencing `gradata.patterns` (and
any related shim entries like `gradata.enhancements.carl`,
`gradata.integrations.*_adapter`) so they match the previously announced v0.8.0
removal plan and do not claim they are already removed. Ensure the final message
uses the exact symbols `gradata.patterns`, `gradata.enhancements.carl`, and
`gradata.integrations.{anthropic,openai,langchain,crewai}_adapter` so readers
can reconcile with code and releases.

In `@Gradata/docs/contributing/deprecation-policy.md`:
- Around line 3-37: The headings (e.g., "## Scope", "## Lifecycle", "##
Exceptions", "## Non-deprecation refactors", "## Why two versions") currently
lack surrounding blank lines and trigger markdownlint MD022; update the document
by ensuring there is a blank line before and after each section heading so every
"## ..." heading is separated from surrounding content by one empty line, then
run the linter to verify the MD022 warning is resolved.

In `@Gradata/docs/errors.md`:
- Around line 67-72: The table row for CONFLICT_TIER2_HOLD contains the
threshold expression `|Δconfidence| >= 0.15` whose literal pipes break the
Markdown table; update that cell to escape the pipes (e.g., replace
`|Δconfidence| >= 0.15` with either backtick-wrapped code ``|Δconfidence| >=
0.15`` or escaped pipes `\|Δconfidence\| >= 0.15`) so the `CONFLICT_TIER2_HOLD`
row renders correctly in the table.

In `@Gradata/docs/specs/merge-semantics.md`:
- Line 87: The spec line with the property test "apply(device_A, stream) ==
apply(device_B, stream)" currently says "10k shuffled orderings" but the PR
implements ~200+ orderings; update the spec to match the implemented test count
or mark the 10k target as a documented stretch goal. Edit the checklist item
text for the property test in merge-semantics.md to either set the number to 200
(or the exact implemented count) or add a parenthetical note such as "(200+
implemented; 10k planned)"; ensure the unique expression apply(device_A, stream)
== apply(device_B, stream) remains unchanged.
- Line 39: The spec uses a strict inequality (`|Δconfidence| < 0.15`) for Tier 1
while the implementation checklist references `conflict_threshold: 0.15`,
creating an ambiguous boundary at exactly 0.15; update the spec text (the Tier 1
condition) to use a consistent comparison such as `|Δconfidence| <=
conflict_threshold` (or explicitly state that `|Δconfidence| ==
conflict_threshold` falls into Tier 2) and ensure the symbol
`conflict_threshold` is referenced so readers and implementers (and the
checklist) share the same definitive rule.
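Pinning the boundary could be as small as one shared predicate. A sketch, assuming the errors.md convention that `|Δconfidence| >= 0.15` triggers the Tier 2 hold; the function name and signature here are illustrative, not the project's API:

```python
def classify_tier(delta_confidence: float, conflict_threshold: float = 0.15) -> int:
    """Illustrative boundary rule: the exact threshold value falls into
    Tier 2, matching a CONFLICT_TIER2_HOLD condition of
    |delta_confidence| >= conflict_threshold."""
    if abs(delta_confidence) >= conflict_threshold:
        return 2  # hold for Tier 2 conflict resolution
    return 1      # Tier 1: last-writer-wins auto-resolve
```

Whichever direction the spec picks, encoding the comparison once keeps the spec, the checklist, and the implementation aligned.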

In `@Gradata/src/gradata/_core.py`:
- Around line 589-595: The provenance event payload is using the constant
INITIAL_CONFIDENCE instead of the actual initial confidence assigned when
approval gating is active; change the event dict built in the
provenance/emission block (the dict with keys
"action","lesson_category","lesson_description","initial_confidence","source_correction_id")
to set "initial_confidence" to the variable that holds the real initial
confidence (e.g. init_conf or whatever local variable is computed when
approval_required=True) rather than INITIAL_CONFIDENCE so the emitted event
matches the created lesson; update references where the event is created in
Gradata/src/gradata/_core.py (look for approval_required, init_conf, and
INITIAL_CONFIDENCE) to use the computed init_conf variable.
- Around line 945-955: The INSERT into lesson_transitions is writing NULL for
the session column (currently passing None); update the values tuple in the
lesson_transitions insert so the session column gets the correct session
identifier (e.g., use lesson.session or the current session variable like
session_id/current_session from the surrounding scope) instead of None; ensure
the symbol sequence around this change (lesson_transitions insert, lesson, _tid,
now) remains unchanged and add a fallback to preserve lineage if the session
variable can be missing.

In `@Gradata/src/gradata/_events.py`:
- Around line 703-710: The loop over new_events currently only calls
_redact_payload for dict/list types, letting scalar queued payloads (e.g.
"data": "alice@example.com") bypass redaction; update the loop to call
_redact_payload for any existing "data" or "tags" field (not just when
isinstance(..., (dict, list))) so that scalar values are passed through
_redact_payload as well; the type checks then become unnecessary because
_redact_payload already no-ops for unsupported types. Ensure you reference
the existing variables and functions in this block: new_events, _redact_payload,
and emit() to maintain the defense-in-depth safety net.
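A minimal sketch of the suggested loop shape; `sanitize_events` and the injected `redact` callable are hypothetical stand-ins for the real new_events loop and _redact_payload:

```python
def sanitize_events(new_events: list, redact) -> list:
    """Pass 'data' and 'tags' through the redactor unconditionally so scalar
    payloads (e.g. a bare email string) cannot bypass redaction; the redactor
    is assumed to no-op on types it does not handle."""
    for ev in new_events:
        for field in ("data", "tags"):
            if field in ev:
                ev[field] = redact(ev[field])
    return new_events
```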

In `@Gradata/src/gradata/_migrations/_ulid.py`:
- Around line 46-50: The code currently parses iso_ts with
datetime.fromisoformat and calls dt.timestamp(), which treats naive datetimes as
local time; update the parsing logic so that if dt.tzinfo is None it is treated
as UTC (e.g. dt = dt.replace(tzinfo=timezone.utc)) before computing ts_ms;
modify the block around datetime.fromisoformat/iso_ts/ts_ms in _ulid.py to set
UTC for naive datetimes and ensure timezone is imported so new_ulid() generation
becomes deterministic across hosts.

In `@Gradata/src/gradata/_migrations/device_uuid.py`:
- Around line 47-77: The current flow may return a transient new_did without
persisting it when fpath exists but contains an invalid device id. In the block
that writes tmp, after writing new_did, atomically promote it with
os.replace(tmp, fpath) when fpath does not exist or when its contents (read via
fpath.read_text() and checked with _is_valid()) are invalid; only fall back to
os.unlink(tmp) when the existing on-disk value is valid. This ensures new_did
is persisted whenever the on-disk id is corrupt.

In `@Gradata/src/gradata/cli.py`:
- Around line 674-675: Replace the call to _resolve_brain_root() and the eager
brain_root.mkdir() with the CLI’s canonical brain resolution via
_get_brain(args) so cloud commands use the same brain as other commands; obtain
the brain root from the returned brain object (e.g., from the brain's .root /
.path attribute) instead of calling _resolve_brain_root(), and remove the
unconditional brain_root.mkdir(parents=True, exist_ok=True) so status/disconnect
do not create the directory.

In `@Gradata/src/gradata/cloud/_sync_state.py`:
- Around line 91-101: The INSERT currently uses tenant_id in place of brain_id
and conflicts only on brain_id, collapsing cursors across devices; change the
VALUES parameter order to pass the actual brain_id variable (use (brain_id,
tenant_id, device_id, cursor, now)) and update the ON CONFLICT target to include
device_id (ON CONFLICT(brain_id, device_id) DO UPDATE SET ...) so each device
keeps its own last_pull_cursor in the sync_state upsert. Ensure the parameter
tuple and conflict target use the symbols brain_id, tenant_id, device_id,
cursor, and now exactly as in this function.
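A sketch of the corrected upsert; the table and column names (last_pull_cursor, updated_at) are assumptions inferred from the comment, and the conflict target requires a UNIQUE(brain_id, device_id) constraint on sync_state:

```python
import sqlite3

def update_pull_cursor_sketch(conn, brain_id, tenant_id, device_id, cursor, now):
    """Per-device cursor upsert: conflicting on (brain_id, device_id) keeps
    one last_pull_cursor per device instead of collapsing them."""
    conn.execute(
        """
        INSERT INTO sync_state (brain_id, tenant_id, device_id, last_pull_cursor, updated_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(brain_id, device_id) DO UPDATE SET
            last_pull_cursor = excluded.last_pull_cursor,
            updated_at = excluded.updated_at
        """,
        (brain_id, tenant_id, device_id, cursor, now),
    )
```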

In `@Gradata/src/gradata/cloud/materializer.py`:
- Around line 83-90: The materializer key builder is using the raw category case
causing mismatches with the apply path which collapses categories using upper();
update the key construction in _rule_key (the block that computes category,
pattern_hash, description, ident) to normalize category the same way as apply
(e.g., category = str(data.get("category") or "").strip().upper()) so identical
logical categories (e.g., "style" vs "STYLE") produce the same key and surface
conflicts deterministically; leave ident logic unchanged and return (category,
ident) as before.

In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 218-259: The current code only calls update_pull_cursor (via
update_pull_cursor/db/tenant_id/device_id/cursor) inside the branch guarded by
if apply and (mat.rules or mat.conflicts), so when apply=True but nothing
materialized the watermark is never persisted; move or duplicate the watermark
persistence logic (the block that reads watermark = summary.get("watermark") and
calls update_pull_cursor) so it runs whenever apply is True and a watermark
exists — regardless of whether
apply_to_lessons/emit_conflict_events/write_lessons_safe ran — and keep the
existing try/except that logs debug on failure; reference apply_to_lessons,
emit_conflict_events, write_lessons_safe, and update_pull_cursor to locate the
relevant code.
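The shape of that watermark fix, with `persist_cursor` standing in for the update_pull_cursor call and its db/tenant/device plumbing:

```python
def finalize_pull(summary: dict, apply: bool, persist_cursor) -> None:
    """Persist the watermark whenever apply is on and the server returned
    one, independent of whether any rules or conflicts materialized."""
    watermark = summary.get("watermark")
    if apply and watermark:
        try:
            persist_cursor(watermark)
        except Exception as exc:
            # Mirror the existing behavior: cursor persistence failures are
            # logged but never fail the pull itself.
            summary["cursor_error"] = str(exc)
```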
- Around line 73-76: The branch that checks db.is_file() currently sets
summary["status"] = "error" and summary["reason"] = "no_db", but according to
the function contract it must return the advertised stable status "no_db";
change the assignment so summary["status"] = "no_db" (keeping summary["reason"]
= "no_db" if desired) and return summary so callers can key off status; update
the code around the db.is_file() check in pull.py (the function that constructs
summary) to use "no_db" as the status value instead of "error".
- Around line 99-106: Replace the direct read of config.api_base with the
project’s centralized endpoint resolver and then enforce HTTPS; specifically,
call the shared resolver (the function that resolves GRADATA_ENDPOINT /
GRADATA_CLOUD_API_BASE into the effective API base) to produce api_base instead
of using config.api_base, then call require_https(api_base, "api_base") and keep
the existing error handling that sets summary["status"] and summary["reason"];
update references to api_base and require_https in pull.py accordingly.

In `@Gradata/src/gradata/cloud/push.py`:
- Around line 253-254: Replace the manual open/try/finally pattern for sqlite3
connections with context managers. Note that `with sqlite3.connect(...) as
conn` scopes a transaction (commit on success, rollback on exception) but does
not close the connection; wrap the connection in contextlib.closing(...) if it
must also be closed on exit. Update code that calls conn.commit() in
_write_watermark to account for the commit-on-exit behavior (remove redundant
commits or adjust logic if explicit commits are required), and ensure all uses
of conn inside the surrounding block reference the new context-managed conn
variable.
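A sketch of the closing-plus-transaction pairing, using an illustrative key/value table; the nesting pattern, not the schema, is the point:

```python
import sqlite3
from contextlib import closing

def write_watermark_sketch(db_path: str, key: str, value: str) -> None:
    """closing() guarantees the connection is closed on exit; the inner
    `with conn` scopes a transaction that commits on success and rolls
    back on error."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, value),
            )
```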
- Around line 130-152: The upsert in _write_watermark is using ON
CONFLICT(brain_id) and inserts brain_id=tenant_id, which causes per-device state
to be overwritten; change the upsert to use ON CONFLICT(tenant_id, device_id) so
watermarks are per (tenant,device), update the INSERT/VALUES to still set
brain_id for backward compatibility but make the conflict target the composite
key (tenant_id, device_id), and ensure the sync_state table has a
UNIQUE(tenant_id, device_id) constraint (or a composite primary key) so the ON
CONFLICT(tenant_id, device_id) target is valid.

In `@Gradata/src/gradata/cloud/sync.py`:
- Around line 187-190: The check for a missing credential in _post() is likely
redundant because enabled already calls _resolved_credential(); either remove
the dead branch or make the defensive intent explicit—choose one: (A) remove the
if not credential: log.debug(...) return None block from _post() and rely on
enabled to gate calls, or (B) keep the check but add a clear comment above it
explaining the race-condition/keyfile-deletion scenario and why we still bail
out, and consider changing log.debug to log.warning if you want visibility;
reference _post(), the enabled property, and _resolved_credential() when making
the change.

In `@Gradata/src/gradata/contrib/enhancements/install_manifest.py`:
- Around line 315-324: The manifest entry was renamed to id="middleware" but
callers may still pass the legacy id "integrations", causing Unknown module
errors; update the install manifest resolution to preserve backward
compatibility by mapping the legacy id "integrations" to the new id "middleware"
(mirror how the existing "carl" alias is handled) — either add "integrations" as
an alias on the manifest entry or add a normalization mapping in the same
resolver/normalizer code path that currently recognizes the "carl" alias so that
lookups for "integrations" resolve to the middleware module.

In `@Gradata/src/gradata/enhancements/self_improvement/_graduation.py`:
- Around line 76-88: The current flow falls back to module-level _events_emit
after a failed brain.emit, which can create duplicate RULE_GRADUATED events;
change the logic so the module-level fallback (_events_emit) is only invoked
when no brain emitter is available (i.e., brain is None or lacks emit).
Specifically, in the block that calls brain.emit("RULE_GRADUATED", "graduate",
payload, []), keep the try/except to log failures from brain.emit but do NOT
call or fall through to _events_emit afterwards—either return or re-raise after
logging to stop further processing; only call _events_emit when the initial
condition (brain is None or not hasattr(brain, "emit")) is true.

In `@Gradata/src/gradata/services/embeddings.py`:
- Around line 166-172: The current embedding cache uses FIFO eviction because
accesses don't update recency; modify the logic around _embedding_cache (the
dict used with _CACHE_MAX_SIZE and the get_client().embed(desc) path) so that
when a description is found in _embedding_cache you call
_embedding_cache.move_to_end(desc) to mark it as recently used, and when
inserting a newly fetched vec you also ensure recency is set before the eviction
loop that pops with _embedding_cache.popitem(last=False); this will implement
true LRU eviction rather than FIFO.
- Around line 44-53: The SSRF guard in _is_trusted_url is too permissive because
hostname.endswith("gradata.ai") allows sibling domains; tighten it by only
allowing parsed.hostname == "gradata.ai" or
parsed.hostname.endswith(".gradata.ai") (note the leading dot) and ensure
parsed.hostname is present; also restrict the localhost/127.0.0.1 exception to
only permit safe schemes (e.g., parsed.scheme in ("http", "https")) instead of
allowing any scheme. Update the checks in _is_trusted_url (using the parsed
variable) accordingly.

In `@Gradata/src/gradata/services/session_history.py`:
- Around line 43-50: on_session_ended currently mutates the passed payload
in-place by assigning payload["rule_effectiveness"] =
self.compute_effectiveness(), which can surprise callers; change it to compute
effectiveness via self.compute_effectiveness(), call self.reset(), and return
the effectiveness data instead of modifying payload, update the on_session_ended
docstring to clearly state it returns the effectiveness dict (and no longer
mutates payload), and update any callers that relied on the in-place mutation to
merge the returned dict into their payloads; reference symbols:
on_session_ended, compute_effectiveness, reset.
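A minimal sketch of the suggested contract, with illustrative internal bookkeeping; only the return-instead-of-mutate shape of on_session_ended is the point:

```python
class SessionHistory:
    """Sketch: on_session_ended computes effectiveness, resets state, and
    returns the dict instead of mutating a caller-supplied payload."""

    def __init__(self) -> None:
        self.injected_this_session = {}
        self.corrected = set()

    def compute_effectiveness(self) -> dict:
        # A rule is effective if it was injected and not corrected.
        return {
            rule: rule not in self.corrected
            for rule in self.injected_this_session
        }

    def reset(self) -> None:
        self.injected_this_session = {}
        self.corrected = set()

    def on_session_ended(self) -> dict:
        """Return the effectiveness dict; callers merge it themselves."""
        effectiveness = self.compute_effectiveness()
        self.reset()
        return effectiveness
```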
- Around line 10-11: Update the usage example in the module docstring to
reference the new canonical import path: replace "from
gradata.integrations.session_history import SessionHistory" with "from
gradata.services.session_history import SessionHistory" (the class symbol is
SessionHistory in gradata.services.session_history); leave any mentions of
forwarding shims intact but ensure the example shows the new module location so
documentation reflects the canonical API.

In `@Gradata/tests/test_cli_cloud.py`:
- Around line 29-36: The helper _run currently returns an empty string and
leaves tests to call capsys.readouterr().out; change _run(monkeypatch, *argv:
str) to accept a capsys fixture parameter (e.g., capsys) and after calling
cli.main() capture and return capsys.readouterr().out; keep the existing
SystemExit handling around cli.main() and preserve monkeypatch.setattr(sys,
"argv", ["gradata", *argv]) so callers can simply call _run(monkeypatch, capsys,
...) to get the CLI output directly.

In `@Gradata/tests/test_cloud_events_pull.py`:
- Around line 30-37: The fixture _isolate_keyfile doesn't clear all
cloud-related env vars, causing non-hermetic tests; in the _isolate_keyfile
fixture add monkeypatch.delenv calls for GRADATA_CLOUD_URL, GRADATA_CLOUD_KEY,
and GRADATA_CLOUD_SYNC (in addition to the existing
GRADATA_CLOUD_SYNC_DISABLE/GRADATA_API_KEY clears) so that _save_cfg() and tests
always control the endpoint and credentials; update references in the fixture
around monkeypatch and _creds (KEYFILE_PATH/KEYFILE_DIR) to ensure the new
deletions run before yield.

In `@Gradata/tests/test_cloud_sync_state.py`:
- Around line 34-39: Seed a non-empty cursor into the DB before calling
update_pull_cursor in test_empty_cursor_is_noop, then assert
update_pull_cursor(db, tenant_id="t", device_id="dev_a", cursor="") returns
False and that the persisted cursor value is unchanged afterwards; locate the
cursor using the existing helper or API (e.g., get_pull_cursor or the same
storage access used elsewhere in tests) and compare the stored value to the
original seeded cursor to ensure the empty-cursor call is a true no-op.

In `@Gradata/tests/test_emit_pii_redaction.py`:
- Around line 103-104: Replace the broad pytest.raises(Exception) with an
explicit expectation for the error types the code may raise: use
pytest.raises((EventPersistenceError, RuntimeError)) as a context manager
around the brain.emit call
(or only EventPersistenceError if the implementation should always wrap
failures). Ensure EventPersistenceError is imported into
Gradata/tests/test_emit_pii_redaction.py and keep the same call to
brain.emit("SHOULD_NOT_LAND", "test", {"note": SECRET_EMAIL}, []) so the test
fails only for the intended error types.

In `@Gradata/tests/test_migration_002_event_identity.py`:
- Around line 171-184: emit() currently omits the identity columns event_id,
device_id, and content_hash when inserting new events, which will leave
post-migration rows permanently NULL; update the emit() implementation so its
INSERT populates event_id (UUID or deterministic id), device_id (derive from
supplied session/source/device context), and content_hash (compute from
data_json or canonicalized payload) for every new event, ensuring the same
column names are written as Migration 002 backfills; alternatively, if you
prefer the migration approach, change Migration 002 to be idempotent and
rerunnable (detect and backfill only rows with NULL identity columns) so fresh
inserts remain consistent with the historical backfill contract.
- Around line 37-40: The test helper _run_002 currently calls
_apply_numbered(conn, brain.dir) and then directly invokes module.up(conn,
tenant_id="unused"), which bypasses the runner path; modify _run_002 to only
call _apply_numbered(conn, brain.dir) (so the migration is executed via the
runner and marked applied) and remove the direct module.up(...) call, ensuring
the runner's has_applied() logic is exercised and the test reflects production
behavior.

In `@Gradata/tests/test_rule_graduated_events.py`:
- Around line 13-18: The import of Brain is only used for type annotations and
should be guarded by TYPE_CHECKING to avoid importing gradata.brain at test
collection; add "from typing import TYPE_CHECKING" and move "from gradata.brain
import Brain" into an "if TYPE_CHECKING:" block, keeping the rest of the imports
(Lesson, LessonState, graduate, init_brain) unchanged and ensuring any
annotations referencing Brain remain valid (use a forward-reference string
'Brain' or rely on future annotations if present).
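The guarded-import pattern the comment describes, in its generic form; `describe` is a placeholder function, and the forward-reference string keeps the annotation valid without the runtime import:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Annotation-only import: never executed at runtime, so test
    # collection does not load gradata.brain.
    from gradata.brain import Brain

def describe(brain: "Brain") -> str:
    """Placeholder showing a valid forward-reference annotation."""
    return type(brain).__name__
```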

---

Outside diff comments:
In `@Gradata/src/gradata/_doctor.py`:
- Around line 252-267: Update _check_cloud_config to use the unified credential
resolver instead of directly reading config.toml: call the shared resolver used
by the runtime client (the credential resolution function that checks keyfile,
env vars, and config) and base the returned status on whether that resolver
yields valid credentials; return "ok" when credentials are resolved, "missing"
when resolver indicates no credentials available, and "fail" when resolver
indicates invalid credentials or an error. Replace uses of
_read_cloud_config/_gradata_config_path in _check_cloud_config and mirror the
exact same replacement in the similar block at lines 346-365 (the
cloud_auth-related check) so both doctor --cloud checks use the same resolver
logic and outcomes. Ensure you reference the resolver function by its existing
name in the codebase when integrating.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8127e2b0-9c06-45b8-b498-2d6dccd4cb7f

📥 Commits

Reviewing files that changed from the base of the PR and between 129c83f and c8a106f.

📒 Files selected for processing (69)
  • Gradata/docs/api/enhancements.md
  • Gradata/docs/api/patterns.md
  • Gradata/docs/architecture/patterns.md
  • Gradata/docs/changelog.md
  • Gradata/docs/contributing/deprecation-policy.md
  • Gradata/docs/errors.md
  • Gradata/docs/guides/custom-domains.md
  • Gradata/docs/guides/mcp.md
  • Gradata/docs/sdk/middleware.md
  • Gradata/docs/specs/events-pull-contract.md
  • Gradata/docs/specs/merge-semantics.md
  • Gradata/docs/specs/retention-clock.md
  • Gradata/src/gradata/__init__.py
  • Gradata/src/gradata/_cloud_sync.py
  • Gradata/src/gradata/_core.py
  • Gradata/src/gradata/_doctor.py
  • Gradata/src/gradata/_events.py
  • Gradata/src/gradata/_migrations/002_add_event_identity.py
  • Gradata/src/gradata/_migrations/003_add_sync_state.py
  • Gradata/src/gradata/_migrations/_ulid.py
  • Gradata/src/gradata/_migrations/device_uuid.py
  • Gradata/src/gradata/_tag_taxonomy.py
  • Gradata/src/gradata/brain.py
  • Gradata/src/gradata/cli.py
  • Gradata/src/gradata/cloud/__init__.py
  • Gradata/src/gradata/cloud/_apply_materialized.py
  • Gradata/src/gradata/cloud/_credentials.py
  • Gradata/src/gradata/cloud/_sync_state.py
  • Gradata/src/gradata/cloud/materializer.py
  • Gradata/src/gradata/cloud/pull.py
  • Gradata/src/gradata/cloud/push.py
  • Gradata/src/gradata/cloud/sync.py
  • Gradata/src/gradata/contrib/enhancements/install_manifest.py
  • Gradata/src/gradata/enhancements/carl.py
  • Gradata/src/gradata/enhancements/profiling/__init__.py
  • Gradata/src/gradata/enhancements/profiling/tone_profile.py
  • Gradata/src/gradata/enhancements/self_improvement/_graduation.py
  • Gradata/src/gradata/integrations/__init__.py
  • Gradata/src/gradata/integrations/anthropic_adapter.py
  • Gradata/src/gradata/integrations/crewai_adapter.py
  • Gradata/src/gradata/integrations/embeddings.py
  • Gradata/src/gradata/integrations/langchain_adapter.py
  • Gradata/src/gradata/integrations/openai_adapter.py
  • Gradata/src/gradata/integrations/session_history.py
  • Gradata/src/gradata/patterns/__init__.py
  • Gradata/src/gradata/services/__init__.py
  • Gradata/src/gradata/services/embeddings.py
  • Gradata/src/gradata/services/session_history.py
  • Gradata/tests/security/test_device_auth_url.py
  • Gradata/tests/test_bug_fixes.py
  • Gradata/tests/test_cli_cloud.py
  • Gradata/tests/test_cloud_apply_materialized.py
  • Gradata/tests/test_cloud_credentials.py
  • Gradata/tests/test_cloud_disaster_recovery.py
  • Gradata/tests/test_cloud_events_pull.py
  • Gradata/tests/test_cloud_events_push.py
  • Gradata/tests/test_cloud_materializer.py
  • Gradata/tests/test_cloud_sync.py
  • Gradata/tests/test_cloud_sync_state.py
  • Gradata/tests/test_doctor_cloud.py
  • Gradata/tests/test_embeddings.py
  • Gradata/tests/test_emit_pii_redaction.py
  • Gradata/tests/test_integrations.py
  • Gradata/tests/test_migration_002_event_identity.py
  • Gradata/tests/test_migration_003_sync_state.py
  • Gradata/tests/test_rule_graduated_events.py
  • Gradata/tests/test_session_history.py
  • Gradata/tests/test_spec_compliance.py
  • Gradata/tests/test_tag_taxonomy.py
💤 Files with no reviewable changes (10)
  • Gradata/src/gradata/enhancements/profiling/__init__.py
  • Gradata/src/gradata/enhancements/carl.py
  • Gradata/src/gradata/integrations/langchain_adapter.py
  • Gradata/src/gradata/patterns/__init__.py
  • Gradata/src/gradata/integrations/openai_adapter.py
  • Gradata/src/gradata/integrations/anthropic_adapter.py
  • Gradata/tests/security/test_device_auth_url.py
  • Gradata/src/gradata/integrations/crewai_adapter.py
  • Gradata/tests/test_integrations.py
  • Gradata/src/gradata/enhancements/profiling/tone_profile.py
📜 Review details
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/src/gradata/_doctor.py
  • Gradata/tests/test_cloud_sync_state.py
  • Gradata/tests/test_emit_pii_redaction.py
  • Gradata/tests/test_bug_fixes.py
  • Gradata/tests/test_cloud_disaster_recovery.py
  • Gradata/tests/test_cli_cloud.py
  • Gradata/tests/test_cloud_credentials.py
  • Gradata/tests/test_migration_002_event_identity.py
  • Gradata/tests/test_rule_graduated_events.py
  • Gradata/tests/test_cloud_materializer.py
  • Gradata/src/gradata/_core.py
  • Gradata/tests/test_cloud_events_pull.py
  • Gradata/tests/test_cloud_events_push.py
🪛 LanguageTool
Gradata/docs/specs/events-pull-contract.md

[style] ~12-~12: Use ‘will’ instead of ‘going to’ if the following action is certain.
Context: ...e is the commitment." If /events/pull is going to be the disaster-recovery path for the m...

(GOING_TO_WILL)


[style] ~36-~36: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...vents with emit_ts >= rebuild_from. - If absent, the server uses the requester's...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.22.0)
Gradata/docs/errors.md

[warning] 71-71: Table column count
Expected: 4; Actual: 6; Too many cells, extra data will be missing

(MD056, table-column-count)

Gradata/docs/contributing/deprecation-policy.md

[warning] 3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 8-8: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 22-22: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 29-29: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 34-34: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🪛 Ruff (0.15.10)
Gradata/tests/test_emit_pii_redaction.py

[warning] 103-103: Do not assert blind exception: Exception

(B017)

Gradata/src/gradata/cloud/_sync_state.py

[warning] 54-57: Use contextlib.suppress(sqlite3.OperationalError) instead of try-except-pass

(SIM105)

Gradata/tests/test_migration_002_event_identity.py

[warning] 99-99: Unpacked variable ts is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

Gradata/src/gradata/cloud/_credentials.py

[warning] 53-56: Use contextlib.suppress(OSError) instead of try-except-pass

Replace try-except-pass with with contextlib.suppress(OSError): ...

(SIM105)

Gradata/src/gradata/cloud/materializer.py

[warning] 35-35: Import from collections.abc instead: Iterable

Import from collections.abc

(UP035)


[warning] 307-313: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

Gradata/tests/test_rule_graduated_events.py

[warning] 16-16: Move application import gradata.brain.Brain into a type-checking block

Move into type-checking block

(TC001)

Gradata/src/gradata/cloud/_apply_materialized.py

[warning] 22-22: Move application import gradata.cloud.materializer.MaterializeResult into a type-checking block

Move into type-checking block

(TC001)

Gradata/tests/test_cloud_materializer.py

[error] 320-320: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (36)
Gradata/src/gradata/__init__.py (3)

21-31: Version resolution fallback remains correct.

The importlib.metadata/fallback chain is unchanged functionally; this is safe.


95-99: Pattern import guidance now matches the migrated namespace.

Line 97 correctly points users to gradata.contrib.patterns, consistent with the lazy import map.


145-152: Lazy-load deprecation path remains intact.

No behavioral drift in __getattr__; warning + dynamic import flow is preserved.

Gradata/docs/guides/mcp.md (1)

71-71: Import path update is correct.

Line 71 aligns the example with the new gradata.contrib.patterns.mcp namespace.

Gradata/docs/guides/custom-domains.md (1)

45-45: Behavioral engine import migration looks consistent.

Line 45 updates the docs to the new module path without changing example behavior.

Gradata/docs/api/enhancements.md (1)

94-94: Enhancements import migration is coherent.

Line 94 consistently points CARL-related examples at gradata.enhancements.behavioral_engine.

Gradata/docs/architecture/patterns.md (1)

12-12: Architecture examples are consistently migrated.

These snippet imports correctly reflect gradata.contrib.patterns.* for Layer 0 APIs.

Also applies to: 23-23, 38-38, 66-66, 87-87

Gradata/docs/api/patterns.md (1)

8-104: Most pattern import migrations are correct.

The updated examples for orchestrator/pipeline/parallel/reflection/evaluator/memory/guardrails/human_loop/tools/mcp align with the contrib namespace move.

Also applies to: 149-162

Gradata/tests/test_spec_compliance.py (2)

26-33: Spec compliance migration checks are updated correctly.

The architecture/enhancement importability assertions now match the new module layout.

Also applies to: 77-84


318-322: Good strengthening of zero-dependency import assertions.

Adding explicit reflection and guardrails assertions makes this check less brittle.

Gradata/tests/test_bug_fixes.py (1)

355-388: New import-surface regression test is solid.

This effectively replaces deprecated-shim expectations with direct checks of current canonical namespaces.

Gradata/tests/test_doctor_cloud.py (1)

44-44: Test assertion update is correct and aligned with new CLI guidance.

Gradata/tests/test_embeddings.py (1)

5-5: Import migration is correct; tests now exercise the canonical services module.

Gradata/tests/test_session_history.py (1)

2-3: Session history test updates look good; migration + formatting changes are safe.

Also applies to: 8-15, 26-34

Gradata/src/gradata/brain.py (1)

195-201: Subscriber wiring migration is clean and safely guarded.

Gradata/src/gradata/services/__init__.py (1)

1-9: Good package-level documentation for the services namespace transition.

Gradata/src/gradata/cloud/__init__.py (1)

31-34: Package exports are consistent with the new cloud credentials module surface.

Gradata/tests/test_cloud_sync.py (1)

67-99: Good coverage for conflict_threshold coercion.

These tests lock down the default sentinel, valid override, out-of-range fallback, and garbage-input fallback paths.

Gradata/src/gradata/_cloud_sync.py (1)

229-239: Kill-switch short-circuit looks good.

Checking the shared cloud kill switch before the legacy env gating keeps emergency write-disable behavior centralized and predictable.

Gradata/tests/test_tag_taxonomy.py (1)

49-96: Nice regression coverage for the rule-event tags.

This pins both enrichment behavior and conflict_reason validation, including the non-duplication path for caller-supplied category tags.

Gradata/src/gradata/integrations/session_history.py (1)

1-19: Shim behavior matches the stated deprecation window.

The import-time warning plus re-export keeps legacy imports working while steering callers to gradata.services.session_history.

Gradata/docs/specs/events-pull-contract.md (3)

34-36: Minor: Consider varying sentence structure.

Static analysis flagged three successive sentences beginning with "If". While this is clear and systematic for a spec, you could optionally vary the structure for readability.


118-126: LGTM! Smart Phase 1 stub design.

The intentional NotImplementedError on 200 response is a clever safety mechanism - it ensures nothing accidentally gets merged before the materializer ships, while still allowing the client interface to be released and frozen.


108-115: LGTM! Clear GA readiness invariants.

The property invariants provide a concrete checklist for validating the implementation before GA: idempotent replay, order-independence, watermark monotonicity, dedup invariant, and retention boundary behavior.

Gradata/src/gradata/cloud/sync.py (2)

73-85: LGTM! Defensive threshold coercion.

The _coerce_threshold function properly handles invalid inputs (non-numeric, out-of-range) by collapsing to the 0.0 sentinel, ensuring a broken config never silently changes merge behavior.


151-174: LGTM! Unified credential resolution chain.

The _resolved_credential() method correctly implements the documented precedence (config.token > keyfile > env var), and the enabled property properly gates sync on both sync_enabled flag AND credential availability.

Gradata/src/gradata/cloud/push.py (1)

171-189: LGTM! Robust retry logic with proper backoff.

The _post_batch function correctly:

  • Distinguishes non-retryable 4xx errors from transient failures
  • Implements exponential backoff (2**attempt)
  • Limits retries appropriately
  • Returns a clear success/failure tuple
Gradata/src/gradata/services/session_history.py (1)

54-69: LGTM! Clean effectiveness computation.

The compute_effectiveness method correctly implements the inverse relationship: a rule is effective if it was NOT corrected during the session. The early return for empty injected_this_session avoids unnecessary computation.

Gradata/tests/test_cli_cloud.py (1)

118-179: LGTM! Comprehensive dry-run test with proper isolation.

The test properly:

  • Sets up a real Brain with a seeded event to create system.db
  • Configures cloud sync with a valid token
  • Mocks the HTTP response with a realistic payload
  • Verifies dry-run output format and that lessons.md is unchanged
Gradata/docs/specs/merge-semantics.md (1)

67-77: LGTM! Clear convergence property definition.

The formal convergence guarantee with pseudocode clearly defines the required property: any two devices applying the same ordered event stream must produce bit-identical results.
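The property is easy to exercise with a toy reducer. This is not the Gradata materializer, just a minimal LWW sketch showing why a total sort key (timestamp, then event_id as tiebreak) makes replay order irrelevant:

```python
import json
import random

def materialize(events):
    """Toy last-write-wins reducer: events are ordered by a total key,
    so any arrival order produces the same final state."""
    state = {}
    for ev in sorted(events, key=lambda e: (e["ts"], e["event_id"])):
        state[ev["key"]] = ev["value"]
    return state

events = [
    {"event_id": f"e{i:02d}", "ts": i % 4, "key": f"k{i % 3}", "value": i}
    for i in range(12)
]
baseline = json.dumps(materialize(events), sort_keys=True)

# Property check: 200 shuffled orderings all converge to the baseline.
for seed in range(200):
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    assert json.dumps(materialize(shuffled), sort_keys=True) == baseline
converged = True
```

This is the same shape as the convergence property test mentioned in the PR summary (200+ shuffled orderings yielding identical state), shrunk to a few lines.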

Gradata/docs/specs/retention-clock.md (2)

37-48: LGTM! Clear tier-based retention windows.

The retention windows table is comprehensive and the "indefinite" clarification is important for user expectations and storage planning.


75-81: LGTM! Edge cases well documented.

The spec covers important edge cases: backup restoration, duplicate pushes, cloud-only events, and migration backfill. This documentation will help prevent bugs during implementation.

Gradata/tests/test_cloud_credentials.py (3)

44-63: LGTM! Comprehensive credential precedence tests.

The tests thoroughly cover the credential resolution chain: explicit argument wins over keyfile, keyfile wins over environment variable, and empty string is returned when nothing is set.


105-116: LGTM! Important integration test for kill switch.

This test verifies critical behavior: the kill switch (GRADATA_CLOUD_SYNC_DISABLE) overrides all other sync settings, ensuring users can always disable cloud sync regardless of other configuration.


119-128: LGTM! Verifies unified auth chain in CloudClient.

This test confirms that CloudClient properly consults the keyfile when cloud-config.json has no token, validating the unified credential resolution chain documented in sync.py.

Gradata/tests/test_cloud_events_push.py (1)

24-37: Good hermetic setup for the push contract tests.

Rebinding the keyfile path and clearing the cloud env overrides here keeps the suite deterministic across developer shells and CI.


```python
from gradata.patterns.scope import classify_scope, register_task_type
from gradata.contrib.patterns.scope import classify_scope, register_task_type
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify canonical locations for rule modules
fd -i '^(scope|rule_engine|rule_tracker)\.py$' | sort

# Verify current docs usage for these imports
rg -n --type=md -C1 'from gradata\.(contrib\.patterns|rules)\.(scope|rule_engine|rule_tracker) import' Gradata/docs

Repository: Gradata/gradata

Length of output: 830


🏁 Script executed:

# Find all Python files related to rule modules
find Gradata -name "rule_engine.py" -o -name "scope.py" -o -name "rule_tracker.py" 2>/dev/null

# Check if contrib.patterns directory exists
find Gradata -type d -name "patterns" 2>/dev/null

# Search for rule_engine references anywhere in the codebase
rg -l "rule_engine" --type=py Gradata 2>/dev/null | head -20

Repository: Gradata/gradata

Length of output: 989


🏁 Script executed:

# List contents of contrib/patterns directory
ls -la Gradata/src/gradata/contrib/patterns/

# Also check rules directory structure
ls -la Gradata/src/gradata/rules/

# Search specifically for rule_engine.py
find Gradata -name "rule_engine.py" -type f

Repository: Gradata/gradata

Length of output: 2343


🏁 Script executed:

# Check the rule_engine directory structure
ls -la Gradata/src/gradata/rules/rule_engine/

# Check what's exported from rule_engine module
cat -n Gradata/src/gradata/rules/rule_engine/__init__.py | head -50

Repository: Gradata/gradata

Length of output: 2316


🏁 Script executed:

# Check if contrib/patterns has these modules
ls -la Gradata/src/gradata/contrib/patterns/

# Check what's in contrib/patterns __init__.py
cat -n Gradata/src/gradata/contrib/patterns/__init__.py | head -100

# Check the test_spec_compliance.py at lines 64-66 mentioned in the review
sed -n '60,70p' Gradata/tests/test_spec_compliance.py

Repository: Gradata/gradata

Length of output: 7036


🏁 Script executed:

# Verify all three lines in the docs that need fixing
sed -n '113,140p' Gradata/docs/api/patterns.md

# Double-check the actual module locations
cat -n Gradata/src/gradata/rules/__init__.py | head -40

Repository: Gradata/gradata

Length of output: 1876


Update import paths in documentation to use gradata.rules.* instead of gradata.contrib.patterns.*

Lines 114, 127, and 136 import from gradata.contrib.patterns.*, but these modules are located in gradata.rules.*. The contrib/patterns directory contains pattern implementations (Pipeline, RAG, Guards, etc.), not the rule system modules. This will cause copy-paste import failures for users.

📌 Docs fix

```diff
-from gradata.contrib.patterns.scope import classify_scope, register_task_type
+from gradata.rules.scope import classify_scope, register_task_type
 ...
-from gradata.contrib.patterns.rule_engine import apply_rules, format_rules_for_prompt
+from gradata.rules.rule_engine import apply_rules, format_rules_for_prompt
 ...
-from gradata.contrib.patterns.rule_tracker import log_application
+from gradata.rules.rule_tracker import log_application
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-from gradata.contrib.patterns.scope import classify_scope, register_task_type
+from gradata.rules.scope import classify_scope, register_task_type
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/docs/api/patterns.md` at line 114, The docs import paths are wrong:
replace any occurrences of gradata.contrib.patterns.* with gradata.rules.* so
the examples import the rule system modules correctly; specifically update the
import that currently reads "from gradata.contrib.patterns.scope import
classify_scope, register_task_type" to "from gradata.rules.scope import
classify_scope, register_task_type" (and do the same for the other imports in
this document that reference gradata.contrib.patterns, updating their module
paths to gradata.rules.* to match the actual symbols).

Comment thread Gradata/docs/changelog.md
Comment on lines +3 to +37
## Scope
Applies to every public symbol, import path, CLI command, and config key in
`gradata` that is marked for removal via `DeprecationWarning`, a
`.. deprecated::` docstring tag, or a CHANGELOG entry tagged `DEPRECATED`.

## Lifecycle
1. **Mark.** The deprecating release MUST:
- Emit `DeprecationWarning` at first use (module-level via `__getattr__`
where possible, otherwise at function entry).
- Add `.. deprecated:: X.Y.Z` to the symbol's docstring.
- Add a `DEPRECATED` bullet to CHANGELOG naming the replacement.
2. **Carry.** The symbol SHALL remain callable through **two additional minor
versions** after the deprecating release, then be removed.
- Deprecated in 0.6.x → removed in 0.8.0 (earliest).
- No removal inside a patch bump.
3. **Remove.** The removing release MUST:
- Add a `BREAKING` bullet to CHANGELOG with the old path → new path.
- Delete the shim, its tests, and any docs that reference it.

## Exceptions
- **Security fixes** may remove a deprecated path earlier. CHANGELOG must
state "removed early for security" with a CVE or internal ticket.
- **Pre-0.7.0 relaxation.** Before 0.7.0 ships, early removal is permitted
if the CHANGELOG `BREAKING` section names the symbol explicitly. Once
0.7.0 ships, the two-minor-version rule binds.

## Non-deprecation refactors
Moving a non-deprecated module (rename, relocation) counts as a breaking
change and follows the same two-minor-version carry rule via a forwarding
shim. The move itself triggers the deprecation clock.

## Why two versions
One version is the window users realistically notice the warning in their
test suite. The second is the window they schedule the migration. Removing
after one minor version breaks users who followed the warning in good faith.

⚠️ Potential issue | 🟡 Minor

Add blank lines around the section headings.

markdownlint is already flagging MD022 here, so this doc will keep failing style checks until the headings are separated by blank lines.

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 8-8: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 22-22: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 29-29: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 34-34: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/docs/contributing/deprecation-policy.md` around lines 3 - 37, The
headings (e.g., "## Scope", "## Lifecycle", "## Exceptions", "## Non-deprecation
refactors", "## Why two versions") currently lack surrounding blank lines and
trigger markdownlint MD022; update the document by ensuring there is a blank
line before and after each section heading so every "## ..." heading is
separated from surrounding content by one empty line, then run the linter to
verify the MD022 warning is resolved.

Comment thread Gradata/docs/errors.md
Comment thread Gradata/docs/specs/merge-semantics.md Outdated
Comment thread Gradata/tests/test_cloud_sync_state.py
Comment thread Gradata/tests/test_emit_pii_redaction.py Outdated
Comment on lines +37 to +40
# Migration 001 must land first so the migrations table exists etc.
_apply_inline(conn)
_apply_numbered(conn, brain.dir)
summary = module.up(conn, tenant_id="unused")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e
# Verify whether `_apply_numbered` includes 002 and how numbered migrations are ordered.
rg -n -C4 '\bdef _apply_numbered\b|\b_apply_numbered\(' Gradata/src/gradata
fd -a '^[0-9]{3}_.*\.py$' Gradata/src/gradata/_migrations
rg -n -C3 '002_add_event_identity' Gradata/src/gradata/_migrations Gradata/tests

Repository: Gradata/gradata

Length of output: 1538


🏁 Script executed:

# Get full _apply_numbered implementation to check if it skips already-applied migrations
sed -n '127,150p' Gradata/src/gradata/_migrations/__init__.py

# Get the test file context around lines 37-40 to see _run_002() usage
sed -n '20,60p' Gradata/tests/test_migration_002_event_identity.py

# Find and show the _run_002 helper definition
rg -n -A20 'def _run_002' Gradata/tests/test_migration_002_event_identity.py

Repository: Gradata/gradata

Length of output: 3394


🏁 Script executed:

# Show migration 002 implementation
cat Gradata/src/gradata/_migrations/002_add_event_identity.py

# Show test assertions in test_migration_002_event_identity.py
sed -n '60,120p' Gradata/tests/test_migration_002_event_identity.py

Repository: Gradata/gradata

Length of output: 10933


Simplify the test helper to call migration 002 once, via the runner path.

The helper currently calls _apply_numbered(conn, brain.dir) (which marks 002 as applied) and then immediately calls module.up(...) directly, exercising both the runner and a bypass of it. While migration 002's up() is idempotent—columns use add_column_if_missing, indexes use create_index_if_missing, and backfill only processes WHERE event_id IS NULL—the test still diverges from production behavior. In production, the runner's has_applied() check prevents re-entry after the first successful run. Refactor _run_002() to remove the direct module.up() call and rely solely on _apply_numbered() to invoke the migration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/tests/test_migration_002_event_identity.py` around lines 37 - 40, The
test helper _run_002 currently calls _apply_numbered(conn, brain.dir) and then
directly invokes module.up(conn, tenant_id="unused"), which bypasses the runner
path; modify _run_002 to only call _apply_numbered(conn, brain.dir) (so the
migration is executed via the runner and marked applied) and remove the direct
module.up(...) call, ensuring the runner's has_applied() logic is exercised and
the test reflects production behavior.

Comment on lines +171 to +184
def test_new_emit_leaves_identity_columns_null_for_now(tmp_path):
"""emit() does not yet populate identity columns — only Migration 002 backfill does.

Wiring emit() to write event_id/device_id/content_hash is deferred; this
test pins the current contract so a future change flips it deliberately.
"""
brain = init_brain(tmp_path)
brain.emit("FRESH", "src", {"k": "v"}, [])

with _conn(brain) as conn:
row = conn.execute(
"SELECT event_id, device_id, content_hash FROM events WHERE type = 'FRESH'"
).fetchone()
assert row == (None, None, None)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e
# Verify whether the event insert path populates the new identity columns.
rg -n -C4 '\bdef emit\b|INSERT INTO events|event_id|device_id|content_hash' Gradata/src/gradata

Repository: Gradata/gradata

Length of output: 50498


🏁 Script executed:

#!/bin/bash
# Find the main brain/emit implementation
find Gradata -name "*.py" -type f | xargs grep -l "def emit" | head -5

Repository: Gradata/gradata

Length of output: 241


🏁 Script executed:

#!/bin/bash
# Find migration 002
find Gradata -name "*002*" -o -name "*migration*" | grep -E "\.py$|002"

Repository: Gradata/gradata

Length of output: 214


🏁 Script executed:

#!/bin/bash
# Search specifically for event insertion logic in the main codebase (not tests)
rg -n "INSERT INTO events" Gradata/src --type py -A 5 -B 2

Repository: Gradata/gradata

Length of output: 168


🏁 Script executed:

#!/bin/bash
# Examine the brain.emit() implementation
wc -l Gradata/src/gradata/brain.py

Repository: Gradata/gradata

Length of output: 92


🏁 Script executed:

#!/bin/bash
# Look at the emit() function in brain.py
rg -n "def emit" Gradata/src/gradata/brain.py -A 30

Repository: Gradata/gradata

Length of output: 1132


🏁 Script executed:

#!/bin/bash
# Examine _events.py for emit implementation
cat Gradata/src/gradata/_events.py | head -100

Repository: Gradata/gradata

Length of output: 3534


🏁 Script executed:

#!/bin/bash
# Look at migration 002
cat Gradata/src/gradata/_migrations/002_add_event_identity.py

Repository: Gradata/gradata

Length of output: 8904


🏁 Script executed:

#!/bin/bash
# Find the emit() function implementation in _events.py
rg -n "^def emit" Gradata/src/gradata/_events.py -A 100

Repository: Gradata/gradata

Length of output: 8323


🏁 Script executed:

#!/bin/bash
# Search for event_id population in the insert logic
rg -n "event_id.*=" Gradata/src/gradata/_events.py | head -20

Repository: Gradata/gradata

Length of output: 41


🏁 Script executed:

#!/bin/bash
# Look at the SQLite insert statement in emit() - lines 254-300
sed -n '254,300p' Gradata/src/gradata/_events.py

Repository: Gradata/gradata

Length of output: 2087


🏁 Script executed:

#!/bin/bash
# Check if there's any code that populates event_id/device_id/content_hash during emit
rg -n "INSERT INTO events" Gradata/src/gradata/_events.py -A 10

Repository: Gradata/gradata

Length of output: 41


Update emit() to populate identity columns before migration 002 is applied.

The test correctly documents current behavior but locks in a critical bug. Migration 002 backfills historical rows with event_id, device_id, and content_hash, but once marked applied, it never reruns. Meanwhile, emit() does not populate these columns during INSERT—it only provides ts, session, type, source, data_json, tags_json, valid_from, valid_until, tenant_id, and schema_version. All events inserted after migration 002 is applied will have NULL identity columns permanently, breaking the push/dedup/conflict logic this PR is shipping.

Either wire emit() to generate and populate these fields, or revise the migration contract so it remains idempotent and can backfill fresh events on each run.
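The first option can be sketched as a small helper that emit() would call per insert. The canonical-JSON hashing and random-UUID choice here are assumptions for illustration, not the shipped contract; whatever scheme is chosen must match what Migration 002 writes during backfill:

```python
import hashlib
import json
import uuid

def identity_fields(data: dict, device_id: str) -> dict:
    """Illustrative identity columns for a fresh event row: a unique
    event_id, the originating device, and a content hash computed over
    a canonical JSON serialization (sorted keys, no whitespace)."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return {
        "event_id": str(uuid.uuid4()),
        "device_id": device_id,
        "content_hash": hashlib.sha256(canonical.encode()).hexdigest(),
    }
```

Canonicalizing before hashing matters: two devices emitting semantically identical payloads with different key order must agree on `content_hash`, or the dedup and conflict logic this PR ships cannot match them.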

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/tests/test_migration_002_event_identity.py` around lines 171 - 184,
emit() currently omits the identity columns event_id, device_id, and
content_hash when inserting new events, which will leave post-migration rows
permanently NULL; update the emit() implementation so its INSERT populates
event_id (UUID or deterministic id), device_id (derive from supplied
session/source/device context), and content_hash (compute from data_json or
canonicalized payload) for every new event, ensuring the same column names are
written as Migration 002 backfills; alternatively, if you prefer the migration
approach, change Migration 002 to be idempotent and rerunnable (detect and
backfill only rows with NULL identity columns) so fresh inserts remain
consistent with the historical backfill contract.

Comment on lines +13 to +18
from pathlib import Path

from gradata._types import Lesson, LessonState
from gradata.brain import Brain
from gradata.enhancements.self_improvement import graduate
from tests.conftest import init_brain

🧹 Nitpick | 🔵 Trivial

Move Brain behind TYPE_CHECKING.

Brain is only used in annotations here, so importing gradata.brain at runtime adds an unnecessary application import during test collection. Ruff is already flagging this as TC001.

♻️ Proposed cleanup

```diff
 from pathlib import Path
+from typing import TYPE_CHECKING

 from gradata._types import Lesson, LessonState
-from gradata.brain import Brain
 from gradata.enhancements.self_improvement import graduate
 from tests.conftest import init_brain
+
+if TYPE_CHECKING:
+    from gradata.brain import Brain
```
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 16-16: Move application import gradata.brain.Brain into a type-checking block

Move into type-checking block

(TC001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/tests/test_rule_graduated_events.py` around lines 13 - 18, The import
of Brain is only used for type annotations and should be guarded by
TYPE_CHECKING to avoid importing gradata.brain at test collection; add "from
typing import TYPE_CHECKING" and move "from gradata.brain import Brain" into an
"if TYPE_CHECKING:" block, keeping the rest of the imports (Lesson, LessonState,
graduate, init_brain) unchanged and ensuring any annotations referencing Brain
remain valid (use a forward-reference string 'Brain' or rely on future
annotations if present).

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/src/gradata/_doctor.py`:
- Around line 279-302: The cloud doctor probes currently build probe URLs from
api_url/GRADATA_API_URL causing mismatches with env-only overrides; update the
probe functions _check_cloud_reachable, _check_cloud_auth, and
_check_cloud_has_data to call the central resolve_endpoint() (from
cloud/_credentials.py) to compute the effective endpoint and use that for all
probe URLs so the probes honor GRADATA_ENDPOINT / GRADATA_CLOUD_API_BASE and any
other resolution logic in resolve_endpoint().
- Around line 282-286: The disable flag in gradata._doctor (the local `disabled
= os.environ.get("GRADATA_CLOUD_SYNC_DISABLE", "").strip().lower() in (...)`
logic) duplicates parsing and disagrees with the canonical parser; replace this
ad-hoc logic by calling the shared utility
gradata.cloud._credentials.kill_switch_set() (importing it into gradata._doctor)
so that `"on"` and all other truthy variants are interpreted consistently with
the rest of the codebase.

In `@Gradata/src/gradata/_migrations/003_add_sync_state.py`:
- Around line 37-50: The migration currently creates sync_state with brain_id as
the primary key which collapses per-device watermarks; change SYNC_STATE_SQL to
define a composite primary key (tenant_id, device_id) or add a UNIQUE(tenant_id,
device_id) and include tenant_id and device_id columns in the table definition
so watermarks are per-device, and adjust NEW_COLUMNS as needed; then update the
upsert writers referenced (Gradata/src/gradata/cloud/push.py and
Gradata/src/gradata/cloud/_sync_state.py) to use ON CONFLICT (tenant_id,
device_id) instead of ON CONFLICT(brain_id) (and update any code that
sets/assumes brain_id as the PK to use the tenant_id/device_id pair or a
device-qualified brain_id).
- Around line 26-33: Ensure the migrations tracker exists before
checking/marking migration 003 by creating or ensuring the migrations table
up-front in _main(); call a helper (or add logic using resolve_brain_db +
table_exists/create operations) before invoking has_applied() and
mark_applied(). Specifically, in _main() (the function that calls has_applied()
and later mark_applied()), add a step that uses resolve_brain_db() and
table_exists() (or an explicit create-migrations step) and create the migrations
table if missing so mark_applied() never runs without the tracker present; use
the existing helpers (add_column_if_missing/create_index_if_missing) as needed
to create the table schema.

In `@Gradata/tests/test_embeddings_integration.py`:
- Around line 35-36: The nested patch.context managers around the test client
should be flattened: replace the nested with patch.object(client, "_embed_api",
return_value=[0.5, 0.6]) as mock_api: with patch.object(client, "_embed_local")
as mock_local: blocks by a single with statement using comma-separated context
managers (e.g., with patch.object(client, "_embed_api", return_value=[0.5, 0.6])
as mock_api, patch.object(client, "_embed_local") as mock_local:) for both
occurrences so the mocks for client._embed_api and client._embed_local are
created in one line and used in the test body.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6f42ec87-a591-4762-a489-d9457f006e8d

📥 Commits

Reviewing files that changed from the base of the PR and between c8a106f and 4bed87c.

📒 Files selected for processing (7)
  • Gradata/src/gradata/_cloud_sync.py
  • Gradata/src/gradata/_doctor.py
  • Gradata/src/gradata/_migrations/003_add_sync_state.py
  • Gradata/tests/test_cloud_credentials.py
  • Gradata/tests/test_cloud_row_push.py
  • Gradata/tests/test_doctor_cloud.py
  • Gradata/tests/test_embeddings_integration.py
💤 Files with no reviewable changes (2)
  • Gradata/tests/test_cloud_row_push.py
  • Gradata/src/gradata/_cloud_sync.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest windows-latest / py3.12
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/tests/test_doctor_cloud.py
  • Gradata/src/gradata/_doctor.py
🪛 Ruff (0.15.10)
Gradata/tests/test_embeddings_integration.py

[warning] 35-36: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


[warning] 43-44: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

🔇 Additional comments (1)
Gradata/tests/test_embeddings_integration.py (1)

3-3: Good migration to the canonical embeddings module path.

Line 3 and Line 29 are aligned with the service-layer move and keep test behavior unchanged.

Also applies to: 29-29

Comment thread Gradata/src/gradata/_doctor.py
Comment thread Gradata/src/gradata/_doctor.py Outdated
Comment thread Gradata/src/gradata/_migrations/003_add_sync_state.py
Comment thread Gradata/src/gradata/_migrations/003_add_sync_state.py
Comment on lines +35 to +36
with patch.object(client, "_embed_api", return_value=[0.5, 0.6]) as mock_api:
with patch.object(client, "_embed_local") as mock_local:

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="$(fd -i '^test_embeddings_integration\.py$' | head -n1)"
python - <<'PY' "$FILE"
import re
import sys
from pathlib import Path

p = Path(sys.argv[1])
lines = p.read_text().splitlines()
for i in range(len(lines) - 1):
    if re.match(r'^\s*with patch\.object\(.*\):\s*$', lines[i]) and re.match(r'^\s+with patch\.object\(.*\):\s*$', lines[i + 1]):
        print(f"Nested patch context managers at Line {i+1}-{i+2}")
PY

Repository: Gradata/gradata

Length of output: 41


🏁 Script executed:

fd -i 'test_embeddings_integration.py' --type f

Repository: Gradata/gradata

Length of output: 103


🏁 Script executed:

sed -n '30,50p' Gradata/tests/test_embeddings_integration.py

Repository: Gradata/gradata

Length of output: 1015


🌐 Web query:

SIM117 flake8-simplify nested with statement

💡 Result:

SIM117 is a rule from the flake8-simplify plugin (also ported to Ruff as SIM117) that detects nested `with` statements in the same scope and suggests merging them into a single `with` statement with multiple context managers, reducing indentation and improving readability. Example of a violation: `with A as a: with B as b: print("hello")`. Suggested fix: `with A as a, B as b: print("hello")`. The rule was added via GitHub issue #35 and PR #36 in flake8-simplify, implemented by MartinThoma based on a suggestion by Skylion007, and is documented in the plugin's README and PyPI page. Ruff's documentation notes exemptions for certain async context managers (such as those from anyio, asyncio, and trio) when used standalone. Fixes may be unavailable if the merged line exceeds the configured line length.



Flatten nested patch context managers into a single with statement.

Lines 35-36 and 43-44 contain nested patch.object calls. Merge them using comma-separated context managers to improve readability and eliminate the SIM117 violation:

♻️ Proposed refactor

```diff
-        with patch.object(client, "_embed_api", return_value=[0.5, 0.6]) as mock_api:
-            with patch.object(client, "_embed_local") as mock_local:
-                client.embed("test")
-                mock_api.assert_called_once()
-                mock_local.assert_not_called()
+        with patch.object(client, "_embed_api", return_value=[0.5, 0.6]) as mock_api, patch.object(
+            client, "_embed_local"
+        ) as mock_local:
+            client.embed("test")
+            mock_api.assert_called_once()
+            mock_local.assert_not_called()
@@
-        with patch.object(client, "_embed_api", side_effect=Exception("down")):
-            with patch.object(client, "_embed_local", return_value=[0.1]) as mock_local:
-                client.embed("test")
-                mock_local.assert_called_once()
+        with patch.object(client, "_embed_api", side_effect=Exception("down")), patch.object(
+            client, "_embed_local", return_value=[0.1]
+        ) as mock_local:
+            client.embed("test")
+            mock_local.assert_called_once()
```
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 35-36: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/tests/test_embeddings_integration.py` around lines 35 - 36, The
nested patch.context managers around the test client should be flattened:
replace the nested with patch.object(client, "_embed_api", return_value=[0.5,
0.6]) as mock_api: with patch.object(client, "_embed_local") as mock_local:
blocks by a single with statement using comma-separated context managers (e.g.,
with patch.object(client, "_embed_api", return_value=[0.5, 0.6]) as mock_api,
patch.object(client, "_embed_local") as mock_local:) for both occurrences so the
mocks for client._embed_api and client._embed_local are created in one line and
used in the test body.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (7)
Gradata/src/gradata/_migrations/_runner.py (1)

114-135: ⚠️ Potential issue | 🟠 Major

Prevent dropping an index that is not attached to the target table.

At Line 122, existing_unique defaults to False. If Line 114 finds a global index name match but that index is not listed for table in PRAGMA index_list, Line 133 can still drop it and recreate it on the wrong table.

Suggested fix

```diff
     if index_exists(conn, index):
         if not unique:
             return False
@@
-        existing_unique = False
+        existing_unique: bool | None = None
         for _, idx_name, is_unique, *_rest in conn.execute(
             f"PRAGMA index_list({table})"
         ).fetchall():
             if idx_name == index:
                 existing_unique = bool(is_unique)
                 break
+        if existing_unique is None:
+            raise sqlite3.OperationalError(
+                f"Index '{index}' exists but is not attached to table '{table}'"
+            )
         if existing_unique:
             return False
         # IF EXISTS closes the TOCTOU window between index_exists() and the
```
#!/bin/bash
# Repro SQLite behavior showing why a table-ownership sentinel is required.
python - <<'PY'
import sqlite3

def create_index_if_missing_current(conn, index, table, columns, *, unique=False):
    def table_exists(c, t):
        return c.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (t,)
        ).fetchone() is not None
    def index_exists(c, i):
        return c.execute(
            "SELECT 1 FROM sqlite_master WHERE type='index' AND name=?", (i,)
        ).fetchone() is not None

    if not table_exists(conn, table):
        return False
    if index_exists(conn, index):
        if not unique:
            return False
        existing_unique = False
        for _, idx_name, is_unique, *_ in conn.execute(f"PRAGMA index_list({table})").fetchall():
            if idx_name == index:
                existing_unique = bool(is_unique)
                break
        if existing_unique:
            return False
        conn.execute(f"DROP INDEX IF EXISTS {index}")
    kw = "UNIQUE INDEX" if unique else "INDEX"
    conn.execute(f"CREATE {kw} {index} ON {table} ({columns})")
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a(id INTEGER)")
conn.execute("CREATE TABLE b(id INTEGER)")
conn.execute("CREATE INDEX idx_shared ON b(id)")

print("before:", conn.execute("SELECT name, tbl_name FROM sqlite_master WHERE type='index'").fetchall())
create_index_if_missing_current(conn, "idx_shared", "a", "id", unique=True)
print("after: ", conn.execute("SELECT name, tbl_name FROM sqlite_master WHERE type='index'").fetchall())
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/_migrations/_runner.py` around lines 114 - 135, The code
can drop a same-named index from a different table because existing_unique
defaults to False and the PRAGMA check only inspects indexes listed for the
target table; change the logic to verify the index actually belongs to the
target table before dropping it: after index_exists(conn, index) query
sqlite_master for the index row (SELECT tbl_name FROM sqlite_master WHERE
type='index' AND name=?), compare tbl_name to the target table, and only proceed
to DROP INDEX IF EXISTS {index} (or return False/raise) when they match;
alternatively set a boolean found_on_table when iterating PRAGMA
index_list({table}) and only drop/recreate when found_on_table is true (and
preserve the existing unique-check logic using existing_unique).
Gradata/src/gradata/services/embeddings.py (3)

45-55: ⚠️ Potential issue | 🔴 Critical

Restrict the localhost exception to HTTP(S).

Line 50 still returns True for any scheme on localhost/127.0.0.1, so a misconfigured GRADATA_API_URL like file://localhost/... bypasses the SSRF guard and will still carry the bearer token. Gate the localhost exception on parsed.scheme in {"http", "https"}.

🔒 Proposed fix
         parsed = urlparse(url)
-        if parsed.hostname in ("localhost", "127.0.0.1"):
-            return True
         host = parsed.hostname or ""
+        if host in ("localhost", "127.0.0.1"):
+            return parsed.scheme in {"http", "https"}
         return bool(
             parsed.scheme == "https" and (host == "gradata.ai" or host.endswith(".gradata.ai"))
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/services/embeddings.py` around lines 45 - 55, The
_is_trusted_url function currently treats any scheme as trusted for
localhost/127.0.0.1; update the localhost check so it only returns True when
parsed.hostname is "localhost" or "127.0.0.1" AND parsed.scheme is one of
{"http", "https"} (use parsed.scheme and parsed.hostname variables), then keep
the existing HTTPS check for gradata.ai and subdomains unchanged; this prevents
non-HTTP schemes like file:// from bypassing the SSRF guard.
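A minimal runnable sketch of the proposed guard (the `gradata.ai` allowlist comes from the diff above; the helper name and test URLs are illustrative):

```python
from urllib.parse import urlparse

def _is_trusted_url_fixed(url: str) -> bool:
    # Localhost is trusted only over HTTP(S); everything else must be
    # HTTPS on gradata.ai or a subdomain of it.
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host in ("localhost", "127.0.0.1"):
        return parsed.scheme in {"http", "https"}
    return parsed.scheme == "https" and (
        host == "gradata.ai" or host.endswith(".gradata.ai")
    )

ok_local = _is_trusted_url_fixed("http://localhost:8080/v1/embed")
bad_file = _is_trusted_url_fixed("file://localhost/etc/passwd")
ok_prod = _is_trusted_url_fixed("https://api.gradata.ai/v1/embed")
bad_plain = _is_trusted_url_fixed("http://api.gradata.ai/v1/embed")
print(ok_local, bad_file, ok_prod, bad_plain)  # True False True False
```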

87-90: ⚠️ Potential issue | 🟡 Minor

Fix the documented range for semantic_similarity.

gradata._math.cosine_similarity() can return values in [-1.0, 1.0], so the current [0.0, 1.0] contract is incorrect.

📝 Proposed fix
     def semantic_similarity(self, text_a: str, text_b: str) -> float:
         """Cosine similarity between embeddings of two texts.
-        Returns 0.0 if either text is empty, otherwise [0.0, 1.0].
+        Returns 0.0 if either text is empty, otherwise a value in [-1.0, 1.0].
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/services/embeddings.py` around lines 87 - 90, The
docstring for semantic_similarity incorrectly states the similarity range as
[0.0, 1.0]; update the triple-quoted docstring in the semantic_similarity method
to reflect that it returns values in [-1.0, 1.0] (or 0.0 if either input is
empty), and mention that it delegates to gradata._math.cosine_similarity so the
contract matches that function's output.
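For illustration, a plain-Python cosine (a stand-in for gradata._math.cosine_similarity, whose internals are not shown in this PR) makes the lower bound of the range concrete:

```python
import math

def cosine_similarity(a, b):
    # Stand-in assuming a conventional cosine; returns 0.0 on a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Opposed vectors hit the lower bound, so [0.0, 1.0] cannot be the contract.
val = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(val)  # -1.0
```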

111-116: ⚠️ Potential issue | 🟠 Major

Normalize nullable descriptions before embedding.

lesson.get("description", "") does not cover the case where the key exists with a None value; in local mode that None reaches text.lower() and aborts clustering for the whole batch. Normalize to a string and keep None for blank values so the guard at Line 140 actually skips them.

🛠️ Proposed fix
     if not lessons:
         return []
     if client is None:
         client = get_client()
-    vectors = [client.embed(l.get("description", "")) for l in lessons]
+    vectors = []
+    for lesson in lessons:
+        desc = lesson.get("description")
+        desc = desc.strip() if isinstance(desc, str) else ""
+        vectors.append(client.embed(desc) if desc else None)
     parent = list(range(len(lessons)))
     rank = [0] * len(lessons)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/services/embeddings.py` around lines 111 - 116, In
cluster_lessons_by_similarity, the list comprehension using l.get("description",
"") allows non-string or empty values to reach the embedder and crash (e.g.,
None causing .lower()), so replace that comprehension with normalization: for
each lesson get desc = lesson.get("description"); if desc is None keep None (so
the later guard at Line 140 can skip it), otherwise convert to str(desc).strip()
and if that yields an empty string set to None; then call client.embed only with
the normalized non-None strings. Ensure you update the variable used later
(vectors/descriptions) to reflect this normalization and preserve lesson
ordering.
Gradata/src/gradata/cloud/materializer.py (1)

355-365: ⚠️ Potential issue | 🟠 Major

Defensively parse winner numerics before building MaterializedRule.

RULE_CONFLICT_RESOLVED can supply winning_event / winning_snapshot payloads that bypass _apply_tier(). The direct float(...) / int(...) casts here will still raise on malformed values and abort the whole pull/materialization.

Suggested fix
     for key, evt in winners.items():
         data = evt.get("data") or {}
+        try:
+            confidence = float(data.get("confidence") or 0.0)
+        except (TypeError, ValueError):
+            confidence = 0.0
+        try:
+            fire_count = int(data.get("fire_count") or 0)
+        except (TypeError, ValueError):
+            fire_count = 0
         result.rules[key] = MaterializedRule(
             category=key[0],
             description=str(data.get("description") or ""),
             state=str(data.get("new_state") or ""),
-            confidence=float(data.get("confidence") or 0.0),
-            fire_count=int(data.get("fire_count") or 0),
+            confidence=confidence,
+            fire_count=fire_count,
             winning_event_ts=str(evt.get("ts") or ""),
             winning_device_id=str(data.get("device_id") or ""),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/cloud/materializer.py` around lines 355 - 365, The loop
that constructs MaterializedRule from winners uses direct float(...) and
int(...) casts which can raise on malformed values; update the construction in
the for key, evt in winners.items() block to defensively parse confidence and
fire_count (from evt.get("data") or {}) using safe conversion (e.g., try/except
or helper functions like safe_float/safe_int) that return a default 0.0 / 0 on
failure or non-numeric input, then pass those sanitized values into
MaterializedRule(...); ensure winning_event_ts and winning_device_id are also
coerced to strings safely as currently done.
Gradata/tests/test_cloud_materializer.py (1)

385-388: ⚠️ Potential issue | 🟡 Minor

Drop the unused f-string prefix.

This literal has no interpolation, so Ruff will keep flagging F541.

Suggested fix
-                    ts=f"2026-04-20T00:30:00Z",
+                    ts="2026-04-20T00:30:00Z",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/tests/test_cloud_materializer.py` around lines 385 - 388, The ts
argument in the _evt(...) call is using an unnecessary f-string
(ts=f"2026-04-20T00:30:00Z"); remove the leading "f" so it becomes a plain
string literal while leaving any truly interpolated args (e.g.,
category=f"cat{k}") unchanged; update the _evt invocations in the test where ts
is a static timestamp to eliminate the unused f-string prefix.
Gradata/src/gradata/cloud/pull.py (1)

198-210: ⚠️ Potential issue | 🟠 Major

Reject malformed events/watermark/end_of_stream values instead of coercing them.

parsed.get("events") or [] silently accepts falsy non-lists like ""/0, and bool(parsed.get("end_of_stream", True)) turns "false" into True. A non-scalar watermark also flows into pagination/persistence. Those 200s should still return {"status": "error", "reason": "malformed_response"} rather than looking successful.

Suggested fix
-        page_events = parsed.get("events") or []
-        if not isinstance(page_events, list) or any(
+        if "events" in parsed:
+            page_events = parsed["events"]
+        else:
+            page_events = []
+        if not isinstance(page_events, list) or any(
             not isinstance(evt, dict) for evt in page_events
         ):
             log.warning("events/pull: events field is not a list of objects")
             summary["status"] = "error"
             summary["reason"] = "malformed_response"
             return summary
         all_events.extend(page_events)
         page_watermark = parsed.get("watermark")
-        if page_watermark:
-            final_watermark = page_watermark
-        end_of_stream = bool(parsed.get("end_of_stream", True))
+        if page_watermark is not None and not isinstance(
+            page_watermark, (str, int, float)
+        ):
+            log.warning("events/pull: watermark is not scalar")
+            summary["status"] = "error"
+            summary["reason"] = "malformed_response"
+            return summary
+        eos = parsed.get("end_of_stream", True)
+        if not isinstance(eos, bool):
+            log.warning("events/pull: end_of_stream is not boolean")
+            summary["status"] = "error"
+            summary["reason"] = "malformed_response"
+            return summary
+        if page_watermark is not None:
+            final_watermark = str(page_watermark)
+        end_of_stream = eos
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/cloud/pull.py` around lines 198 - 210, The code currently
coerces falsy/non-boolean values; instead validate types strictly: when reading
parsed.get("events") check whether "events" is present and is a list and every
item is a dict (use the page_events variable and the existing any(...) check)
and if not, set summary["status"]="error" and
summary["reason"]="malformed_response" and return; for watermark, read
parsed.get("watermark") into page_watermark and reject (error/return) if it is
present but not a scalar type you expect (e.g., not instance of (str, int,
float)); for end_of_stream, do not coerce with bool(...): if "end_of_stream" in
parsed then require isinstance(parsed["end_of_stream"], bool) and assign to
end_of_stream, otherwise treat it as missing and continue; update uses of
final_watermark and all_events accordingly so malformed inputs are rejected
rather than coerced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/src/gradata/cloud/materializer.py`:
- Around line 141-145: The current try/except around parsing e_conf and i_conf
causes both to be reset to 0.0 if either cast fails; change this to parse each
confidence independently (e.g., separate try/except blocks or a small helper
like safe_parse_float) so that a malformed e_data["confidence"] does not wipe
out i_conf and vice versa; update the parsing for e_conf and i_conf (referencing
e_data, i_data, e_conf, i_conf) inside the same function in materializer.py to
preserve the valid confidence value when the other fails.

In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 151-159: The User-Agent is hard-coded in the Request construction
(see req variable in pull.py) which causes incorrect SDK version reporting;
change the header to use a central version constant or package metadata instead
of "gradata-sdk/0.6" (e.g., import the shared VERSION or __version__ used
elsewhere in the package and set User-Agent to f"gradata-sdk/{VERSION}") so it
automatically reflects the current release; update the import at the top of the
module to reference that constant and use it in the headers for the Request
creation.

In `@Gradata/src/gradata/cloud/push.py`:
- Around line 52-73: _fetch_events_since currently returns events across all
tenants which can cause cross-tenant uploads; add a tenant_id parameter to
_fetch_events_since(conn, last_event_id, limit, tenant_id) and include a WHERE
condition filtering on tenant_id in the SELECT (e.g. "AND tenant_id = ?") and
pass tenant_id as a bound param before the LIMIT; update every call site that
invokes _fetch_events_since (including the batch-stamping callsite that later
sets tenant_id) to pass the active tenant_id so the query only returns rows for
that tenant.
- Around line 199-205: Validate push_pending_events input parameters
(backoff_base, timeout, max_retries, chunk_size) at the top of the function
before any network or sleep calls and return a structured error summary dict
instead of letting exceptions propagate; specifically check backoff_base >= 0,
timeout >= 0, max_retries >= 0 (and chunk_size > 0 if applicable), and if any
check fails return the same summary shape used on other failures with an error
message and non-success status. Update push_pending_events to perform these
checks before entering the try/except that currently only catches sqlite3.Error
so that invalid inputs never reach time.sleep or urllib.request.urlopen and
violate the "never raise" contract.

In `@Gradata/src/gradata/services/embeddings.py`:
- Around line 65-68: The code currently returns data["embedding"] directly;
update the embedding-returning method (the function that builds req =
Request(str(self.api_url), ...) and currently returns data["embedding"]) to
validate the payload: confirm the "embedding" key exists, that its value is a
list/sequence of numeric types, and that its length matches the expected
dimensionality (use the class field used elsewhere, e.g., self.dim or the
expected_dim variable) and raise a ValueError (or similar) on any mismatch so
callers (like embed()) will trigger the existing fallback path instead of
failing later in clustering/similarity.

---

Duplicate comments:
In `@Gradata/src/gradata/_migrations/_runner.py`:
- Around line 114-135: The code can drop a same-named index from a different
table because existing_unique defaults to False and the PRAGMA check only
inspects indexes listed for the target table; change the logic to verify the
index actually belongs to the target table before dropping it: after
index_exists(conn, index) query sqlite_master for the index row (SELECT tbl_name
FROM sqlite_master WHERE type='index' AND name=?), compare tbl_name to the
target table, and only proceed to DROP INDEX IF EXISTS {index} (or return
False/raise) when they match; alternatively set a boolean found_on_table when
iterating PRAGMA index_list({table}) and only drop/recreate when found_on_table
is true (and preserve the existing unique-check logic using existing_unique).

In `@Gradata/src/gradata/cloud/materializer.py`:
- Around line 355-365: The loop that constructs MaterializedRule from winners
uses direct float(...) and int(...) casts which can raise on malformed values;
update the construction in the for key, evt in winners.items() block to
defensively parse confidence and fire_count (from evt.get("data") or {}) using
safe conversion (e.g., try/except or helper functions like safe_float/safe_int)
that return a default 0.0 / 0 on failure or non-numeric input, then pass those
sanitized values into MaterializedRule(...); ensure winning_event_ts and
winning_device_id are also coerced to strings safely as currently done.

In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 198-210: The code currently coerces falsy/non-boolean values;
instead validate types strictly: when reading parsed.get("events") check whether
"events" is present and is a list and every item is a dict (use the page_events
variable and the existing any(...) check) and if not, set
summary["status"]="error" and summary["reason"]="malformed_response" and return;
for watermark, read parsed.get("watermark") into page_watermark and reject
(error/return) if it is present but not a scalar type you expect (e.g., not
instance of (str, int, float)); for end_of_stream, do not coerce with bool(...):
if "end_of_stream" in parsed then require isinstance(parsed["end_of_stream"],
bool) and assign to end_of_stream, otherwise treat it as missing and continue;
update uses of final_watermark and all_events accordingly so malformed inputs
are rejected rather than coerced.

In `@Gradata/src/gradata/services/embeddings.py`:
- Around line 45-55: The _is_trusted_url function currently treats any scheme as
trusted for localhost/127.0.0.1; update the localhost check so it only returns
True when parsed.hostname is "localhost" or "127.0.0.1" AND parsed.scheme is one
of {"http", "https"} (use parsed.scheme and parsed.hostname variables), then
keep the existing HTTPS check for gradata.ai and subdomains unchanged; this
prevents non-HTTP schemes like file:// from bypassing the SSRF guard.
- Around line 87-90: The docstring for semantic_similarity incorrectly states
the similarity range as [0.0, 1.0]; update the triple-quoted docstring in the
semantic_similarity method to reflect that it returns values in [-1.0, 1.0] (or
0.0 if either input is empty), and mention that it delegates to
gradata._math.cosine_similarity so the contract matches that function's output.
- Around line 111-116: In cluster_lessons_by_similarity, the list comprehension
using l.get("description", "") allows non-string or empty values to reach the
embedder and crash (e.g., None causing .lower()), so replace that comprehension
with normalization: for each lesson get desc = lesson.get("description"); if
desc is None keep None (so the later guard at Line 140 can skip it), otherwise
convert to str(desc).strip() and if that yields an empty string set to None;
then call client.embed only with the normalized non-None strings. Ensure you
update the variable used later (vectors/descriptions) to reflect this
normalization and preserve lesson ordering.

In `@Gradata/tests/test_cloud_materializer.py`:
- Around line 385-388: The ts argument in the _evt(...) call is using an
unnecessary f-string (ts=f"2026-04-20T00:30:00Z"); remove the leading "f" so it
becomes a plain string literal while leaving any truly interpolated args (e.g.,
category=f"cat{k}") unchanged; update the _evt invocations in the test where ts
is a static timestamp to eliminate the unused f-string prefix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9aee0fca-35ea-4a54-8834-7760ddcca719

📥 Commits

Reviewing files that changed from the base of the PR and between 7f5b558 and 15ab7a1.

📒 Files selected for processing (8)
  • Gradata/src/gradata/_doctor.py
  • Gradata/src/gradata/_migrations/_runner.py
  • Gradata/src/gradata/cloud/materializer.py
  • Gradata/src/gradata/cloud/pull.py
  • Gradata/src/gradata/cloud/push.py
  • Gradata/src/gradata/services/embeddings.py
  • Gradata/tests/test_cloud_events_push.py
  • Gradata/tests/test_cloud_materializer.py
📜 Review details
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/src/gradata/_doctor.py
  • Gradata/tests/test_cloud_materializer.py
  • Gradata/tests/test_cloud_events_push.py
🪛 Ruff (0.15.10)
Gradata/tests/test_cloud_materializer.py

[error] 387-387: f-string without any placeholders

Remove extraneous f prefix

(F541)

Gradata/src/gradata/cloud/materializer.py

[warning] 35-35: Import from collections.abc instead: Iterable

Import from collections.abc

(UP035)


[warning] 369-375: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (9)
Gradata/src/gradata/_doctor.py (7)

24-24: LGTM!

The urllib.parse import is correctly added to support the urlsplit() call in _check_cloud_reachable().


253-274: LGTM!

The error messages correctly guide users to the new gradata cloud enable CLI command, consistent with the legacy login/logout removal documented in the PR objectives.


277-296: LGTM!

The _resolve_doctor_endpoint() function correctly delegates to resolve_endpoint() from _credentials.py, ensuring doctor probes honor the same endpoint precedence chain as the rest of the SDK. The defensive exception handling with fallback is appropriate for a diagnostics tool.


299-332: LGTM!

The function correctly imports and uses kill_switch_set() from the credentials module, with a fallback that exactly matches the canonical parser's truthy set ("1", "true", "yes", "on"). This ensures consistent behavior between doctor checks and the rest of the SDK.


335-353: LGTM!

The function correctly uses urlsplit() to parse the endpoint URL and extract the actual port, ensuring that custom endpoints like https://host:8443/api/v1 are probed at the correct port rather than always defaulting to :443.


379-405: LGTM!

The auth check now uses _resolve_doctor_endpoint() for consistent endpoint resolution, and the 401/403 error message correctly directs users to gradata cloud enable.


408-443: LGTM!

The data check now uses _resolve_doctor_endpoint(), completing the consistent endpoint resolution across all cloud probe functions.

Gradata/src/gradata/_migrations/_runner.py (2)

43-44: Behavior-preserving cleanup looks good.

The compact execute(...).fetchone() form keeps the same semantics and improves readability.


105-111: Docstring clarification is solid.

Good addition explaining why UNIQUE backing is required for ON CONFLICT targets.

Addresses 12 review items across two incremental CR rounds.

Major correctness / schema:
- _runner.create_index_if_missing: verify index belongs to target table via
  sqlite_master before dropping. Prevents an index of the same name on a
  different table from being silently dropped during the UNIQUE upgrade.
- push._fetch_events_since: thread tenant_id into the SELECT with
  "(tenant_id = ? OR tenant_id IS NULL)" so foreign-tenant rows are never
  uploaded under the current tenant's identity. Legacy NULL rows still
  push so pre-migration work is not stranded.
- push.push_pending_events: validate chunk_size / max_retries /
  backoff_base / timeout up front and return reason=invalid_params
  instead of letting negative values reach time.sleep or the network loop.
- materializer._apply_tier: parse e_conf / i_conf independently; a bad
  cast on one side no longer resets both to 0.0 or masks legitimate
  Tier 2 drift as zero-delta Tier 1.
- services.embeddings._embed_api: validate response shape (dict, embedding
  list, dim match, numeric entries). Malformed payloads raise ValueError
  so embed() falls through to the local path instead of poisoning the LRU
  cache.

Minor hardening:
- materializer: defensive numeric coercion in the MaterializedRule builder
  so a non-numeric confidence/fire_count cannot abort the whole pass.
- materializer: collapse _MATERIALIZER_EVENT_TYPES duplicate into the
  existing _MATERIALIZABLE_EVENT_TYPES source of truth.
- materializer: switch Iterable import to collections.abc.
- services.embeddings.semantic_similarity: docstring range is [-1, 1].
- services.embeddings.cluster_lessons_by_similarity: skip blank/whitespace
  descriptions instead of embedding empty strings.
- pull/push: User-Agent now embeds the resolved SDK version rather than
  the hardcoded 0.6 placeholder.
- tests/test_cloud_materializer.py: drop extraneous f"" prefix on literal ts.

New regression tests (+13):
- test_cloud_events_push: invalid-tunables short-circuit, foreign-tenant
  rows are filtered, legacy NULL-tenant rows still push.
- test_migration_runner_helpers: cross-table index drop refused,
  same-table non-UNIQUE -> UNIQUE upgrade works, idempotent when already UNIQUE.
- test_embeddings: API shape validation (missing key, wrong dim,
  non-numeric) raises; embed() falls back to local on malformed response;
  cluster skips blank descriptions.
- test_cloud_materializer: independent confidence parse still detects
  drift; materialize survives non-numeric winner fields.

Full suite: 4036 passed, 2 skipped.

Co-Authored-By: Gradata <noreply@gradata.ai>
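The tenant filter described in this commit message can be sketched against a toy schema (table and column names here are illustrative, not the repo's actual event schema):

```python
import sqlite3

def fetch_events_since(conn, last_event_id, limit, tenant_id):
    # Tenant-scoped query: current tenant's rows plus legacy NULL-tenant
    # rows, never a foreign tenant's.
    return conn.execute(
        "SELECT id, tenant_id, payload FROM events "
        "WHERE id > ? AND (tenant_id = ? OR tenant_id IS NULL) "
        "ORDER BY id LIMIT ?",
        (last_event_id, tenant_id, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events(id INTEGER PRIMARY KEY, tenant_id TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events(tenant_id, payload) VALUES (?, ?)",
    [("t1", "ours"), ("t2", "foreign"), (None, "legacy")],
)
rows = fetch_events_since(conn, 0, 10, "t1")
payloads = [r[2] for r in rows]
print(payloads)  # ['ours', 'legacy']
```

The `OR tenant_id IS NULL` arm is what keeps pre-migration rows pushable, matching the "legacy NULL rows still push" behavior above.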

@greptile-apps greptile-apps Bot left a comment


Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Gradata
Owner Author

Gradata commented Apr 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/src/gradata/cloud/materializer.py`:
- Around line 97-108: The category normalization in the materializer currently
uses .lower() (variable category) which mismatches the apply path’s .upper() and
causes failed lookups; update the normalization so category =
str(data.get("category") or "").strip().upper() (keeping pattern_hash,
description and ident logic unchanged) so the returned tuple (category, ident)
matches the keys built in the apply path and prevents duplicate lesson creation.
- Around line 383-389: The exported names in the __all__ list
(CONFLICT_THRESHOLD, Conflict, MaterializedRule, MaterializeResult, materialize)
are not alphabetically sorted; update the __all__ assignment so the strings are
ordered lexicographically (e.g., "CONFLICT_THRESHOLD", "Conflict",
"MaterializedRule", "MaterializeResult", "materialize") to satisfy
static-analysis/style checks and maintain consistency.

In `@Gradata/src/gradata/cloud/push.py`:
- Around line 287-289: tenant_for(brain) and get_or_create_device_id(brain) are
invoked unguarded inside push_pending_events which violates the "never raises"
contract; wrap the calls to tenant_for and get_or_create_device_id in a
try/except that catches OSError (or a broader Exception if necessary), log the
error with context (including brain) using the module logger, and return/exit
push_pending_events gracefully (or skip pushing events) rather than allowing the
exception to propagate; update the code paths around tenant_id and device_id
initialization in push_pending_events to rely on the guarded values so no file
I/O errors escape.

In `@Gradata/src/gradata/services/embeddings.py`:
- Around line 113-120: get_client() has a race when multiple threads call it
concurrently and may create multiple EmbeddingClient instances; add a
module-level threading.Lock (e.g., _default_client_lock) and use a
double-checked locking pattern in get_client: if _default_client is None,
acquire the lock, check again, then instantiate EmbeddingClient and assign to
_default_client; ensure you import threading and only hold the lock briefly
while constructing/assigning to avoid contention.
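The double-checked locking pattern requested above, as a self-contained sketch (EmbeddingClient is stubbed here; the real class takes configuration):

```python
import threading

class EmbeddingClient:
    # Stub standing in for the real client; construction is what we guard.
    pass

_default_client = None
_default_client_lock = threading.Lock()

def get_client():
    # Fast path skips the lock once initialized; the second check inside
    # the lock prevents two racing threads from both constructing a client.
    global _default_client
    if _default_client is None:
        with _default_client_lock:
            if _default_client is None:
                _default_client = EmbeddingClient()
    return _default_client

a = get_client()
b = get_client()
print(a is b)  # True
```

In CPython the unlocked read of `_default_client` is safe because attribute/global reads are atomic; the lock is only needed around construction and assignment.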

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 81374656-a56f-4359-a2d4-58a70d836403

📥 Commits

Reviewing files that changed from the base of the PR and between 15ab7a1 and 0705be6.

📒 Files selected for processing (9)
  • Gradata/src/gradata/_migrations/_runner.py
  • Gradata/src/gradata/cloud/materializer.py
  • Gradata/src/gradata/cloud/pull.py
  • Gradata/src/gradata/cloud/push.py
  • Gradata/src/gradata/services/embeddings.py
  • Gradata/tests/test_cloud_events_push.py
  • Gradata/tests/test_cloud_materializer.py
  • Gradata/tests/test_embeddings.py
  • Gradata/tests/test_migration_runner_helpers.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest macos-latest / py3.12
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/src/gradata/cloud/push.py
  • Gradata/tests/test_cloud_materializer.py
  • Gradata/tests/test_cloud_events_push.py
🪛 Ruff (0.15.10)
Gradata/src/gradata/cloud/materializer.py

[warning] 383-389: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (45)
Gradata/src/gradata/_migrations/_runner.py (3)

43-43: LGTM!

Formatting cleanup that collapses the query onto a single line while preserving the same row is None / row is not None logic.


97-111: Clean API extension with backward compatibility.

The keyword-only unique parameter maintains full backward compatibility with existing migration callers (all current usages in 001/002/003 use default unique=False). The docstring clearly documents the SQLite-specific constraint requirement.


114-145: All prior review issues addressed; logic is correct.

The implementation properly handles the three flagged scenarios:

  1. Cross-table guard (lines 119–124): Queries sqlite_master.tbl_name to refuse mutation when the index belongs to a different table.
  2. Reliable UNIQUE detection (lines 127–138): Uses PRAGMA index_list rather than substring matching on DDL text.
  3. TOCTOU window (line 143): DROP INDEX IF EXISTS closes the race between existence check and drop.

The logic flow is sound—after confirming table ownership, the existing_unique flag from PRAGMA is authoritative.
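The three safeguards above can be condensed into a small self-contained sketch. Names and the exact helper signature are illustrative, not the actual `_runner.py` API; the point is the cross-table guard via `sqlite_master`, UNIQUE detection via `PRAGMA index_list` rather than DDL text, and `DROP INDEX IF EXISTS` to close the TOCTOU window:

```python
import sqlite3

def ensure_unique_index(conn, table, index, column):
    """Hypothetical sketch: upgrade a non-UNIQUE index to UNIQUE safely.
    Returns True if the index was (re)created, False on no-op/refusal."""
    # Cross-table guard: refuse to mutate an index owned by another table.
    row = conn.execute(
        "SELECT tbl_name FROM sqlite_master WHERE type='index' AND name=?",
        (index,),
    ).fetchone()
    if row is not None and row[0] != table:
        return False
    # Reliable UNIQUE detection: PRAGMA index_list, not DDL substring matching.
    # Rows are (seq, name, unique, origin, partial); r[2] == 1 means UNIQUE.
    existing_unique = any(
        r[1] == index and r[2] == 1
        for r in conn.execute(f"PRAGMA index_list({table})")
    )
    if existing_unique:
        return False  # idempotent no-op
    # IF EXISTS closes the race between the existence check and the drop.
    conn.execute(f"DROP INDEX IF EXISTS {index}")
    conn.execute(f"CREATE UNIQUE INDEX {index} ON {table}({column})")
    return True
```

(The f-string interpolation of identifiers is fine here only because migration code controls the names; user input would need quoting.)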

Gradata/tests/test_migration_runner_helpers.py (3)

16-39: Solid regression test for the cross-table bug.

This test directly reproduces the scenario flagged in prior reviews—an index with the same name on a different table must not be dropped. The assertions verify both the return value and the index's continued ownership.


42-58: Good coverage of the happy-path upgrade.

Verifies the non-UNIQUE → UNIQUE upgrade still works on the same table. The for/else pattern with AssertionError ensures the test fails explicitly if the index disappears.


61-69: Covers the idempotent no-op case.

Ensures the function returns False without mutation when the index is already UNIQUE.

Consider adding a simple test for the base case (index doesn't exist, gets created) for completeness, though the core regression scenarios are well covered.

Gradata/src/gradata/cloud/pull.py (6)

72-76: LGTM — no_db status now properly returned.

The status handling for missing system.db correctly returns "no_db" as the stable status value, matching the documented contract.


99-111: LGTM — Endpoint resolution now uses the shared resolver.

The code correctly routes through _creds.resolve_endpoint() so GRADATA_ENDPOINT / GRADATA_CLOUD_API_BASE env overrides take effect consistently with push and CLI.


191-206: LGTM — Response shape validation is thorough.

The code now validates that parsed is a dict and that events is a list of dicts before proceeding, preventing crashes from malformed server responses.


266-292: LGTM — Watermark persistence failure now blocks success telemetry.

The code correctly captures the return value of update_pull_cursor(), checks for persistence failure, and returns an error status before emitting CLOUD_SYNC_COMPLETED. This prevents callers from treating the run as successful when the cursor wasn't persisted.


247-264: Apply path only writes lessons when there are rules or conflicts to merge.

When apply=True but the materialized result has neither rules nor conflicts (e.g., page contained only tombstones or events from other devices), the lessons file is left untouched but the watermark is still persisted (lines 271-292). This is correct behavior per the comment on lines 266-270.


294-312: Telemetry emission is best-effort and doesn't affect return status.

The CLOUD_SYNC_COMPLETED event emission is wrapped in a try/except that logs failures without changing the summary status. This is appropriate for telemetry that shouldn't block the success path.

Gradata/tests/test_cloud_materializer.py (6)

23-46: LGTM — Test helper is well-structured.

The _evt() helper provides sensible defaults for all required fields while allowing tests to override specific values. The conditional pattern_hash inclusion keeps test events minimal.


96-128: Good Tier 2 conflict coverage.

The tests correctly validate:

  • Large confidence drift (≥0.15) triggers conflict
  • State disagreement triggers conflict
  • Post-conflict events are held (not advancing state)

The assertion result.events_skipped >= 1 on line 128 appropriately allows for implementation flexibility.


131-241: Comprehensive RULE_CONFLICT_RESOLVED test coverage.

The tests cover all resolution paths:

  • Resolution with matching graduation history
  • Resolution with embedded winning_event (incremental pull case)
  • Resolution with winning_snapshot reconstruction
  • Mismatched winning_ts clears hold but allows subsequent graduation

This aligns with the materializer's fallback chain implementation.


338-404: Convergence property tests are valuable ship-gate coverage.

The tests verify that 200+ shuffled orderings yield identical final state after sorting by ts. The second test (test_convergence_with_injected_conflicts) validates that conflict keys remain stable across permutations—important for deterministic Tier 2 behavior.


407-449: Good regression coverage for malformed confidence handling.

test_independent_confidence_parse_still_detects_drift validates that a malformed confidence on one side doesn't zero out the other side's valid confidence. test_materialize_survives_non_numeric_confidence_on_winner confirms the defensive parsing in the winner-building loop.


333-335: LGTM — Threshold constant pinned to spec.

This assertion ensures CONFLICT_THRESHOLD stays at 0.15 per the spec, catching accidental changes.

Gradata/src/gradata/cloud/materializer.py (6)

141-151: LGTM — Independent confidence parsing implemented.

Each confidence value is now parsed in its own try/except block, preventing a malformed value on one side from zeroing out the other side's valid confidence. This addresses the previously flagged Tier 2 bypass issue.
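A minimal sketch of the per-side parsing pattern, with hypothetical helper names (the real materializer does this inline). The property being protected: a malformed value on one side must not zero out the other side before the Tier 2 drift comparison:

```python
CONFLICT_THRESHOLD = 0.15  # per spec, pinned by the test suite

def parse_confidences(local_raw, remote_raw):
    """Parse each side in its own try/except so one bad value
    cannot reset the other side's valid confidence."""
    try:
        local = float(local_raw)
    except (TypeError, ValueError):
        local = 0.0
    try:
        remote = float(remote_raw)
    except (TypeError, ValueError):
        remote = 0.0
    return local, remote

def drift_is_conflict(local_raw, remote_raw):
    local, remote = parse_confidences(local_raw, remote_raw)
    return abs(local - remote) >= CONFLICT_THRESHOLD
```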


241-255: LGTM — Event type gating is correct.

Non-materializable event types are properly skipped (lines 241-246), and META_RULE_SYNTHESIZED is explicitly held per spec §4 (lines 247-255). This prevents non-graduation events from influencing winner selection.


267-321: LGTM — RULE_CONFLICT_RESOLVED handles all resolution paths.

The implementation correctly:

  1. Looks up matching history by (winning_ts, winning_device_id)
  2. Falls back to embedded winning_event for incremental pulls
  3. Falls back to winning_snapshot reconstruction
  4. Clears the hold even if no winner can be reconstructed (allowing next graduation to materialize)

357-379: LGTM — Defensive numeric coercion in winner building.

The try/except blocks around confidence and fire_count parsing prevent malformed data from crashing the entire materialization pass.


170-204: LGTM — DB event loading is robust.

The function handles missing DB files, SQLite errors, and malformed JSON gracefully by returning partial results rather than raising. The comment on lines 181-184 explains the consolidation of event type sets.


351-355: LGTM — Conflict pruning prevents stale events.

Conflicts resolved later in the same stream are correctly filtered out before returning, preventing callers from emitting RULE_CONFLICT events for already-adjudicated keys.

Gradata/src/gradata/cloud/push.py (5)

53-123: LGTM!

The _fetch_events_since function correctly implements tenant-scoped event fetching with proper handling of legacy NULL tenant rows, safe JSON parsing with fallbacks, and correct parameter binding order.


126-139: LGTM!

The watermark read correctly queries by the (tenant_id, device_id) composite key and handles missing table/row gracefully.


142-170: LGTM!

The watermark upsert correctly uses ON CONFLICT(tenant_id, device_id) to target the composite unique index, maintaining per-device isolation while preserving brain_id for backward compatibility.
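The composite-key upsert pattern described here can be sketched as follows; table and column names mirror the review comments but the function itself is illustrative, not the module's real `_write_watermark`:

```python
import sqlite3

def write_watermark(conn, tenant_id, device_id, event_id, brain_id):
    """Upsert keyed on the (tenant_id, device_id) unique index so each
    device keeps an isolated watermark; brain_id is preserved on insert."""
    conn.execute(
        """
        INSERT INTO sync_state (tenant_id, device_id, last_push_event_id, brain_id)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(tenant_id, device_id)
        DO UPDATE SET last_push_event_id = excluded.last_push_event_id
        """,
        (tenant_id, device_id, event_id, brain_id),
    )
```

`ON CONFLICT(tenant_id, device_id)` requires a UNIQUE index on exactly those columns, which is why the tests lean on `_ensure_sync_state_schema` to create it.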


173-207: LGTM!

The batch POST implementation correctly handles retry logic with exponential backoff, distinguishes between fatal 4xx and retriable 5xx/transport errors, and includes proper headers including a versioned User-Agent.
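The retry policy reads roughly like the sketch below (status codes stand in for the real HTTP layer; `send` is a hypothetical callable). Note the two properties called out above: 4xx is fatal with no retry, and backoff sleeps only between attempts, never after the last one:

```python
import time

def post_with_retry(send, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Sketch: returns 'ok', 'fatal' (4xx), or 'exhausted' (retries spent)."""
    for attempt in range(max_retries + 1):
        try:
            status = send()
        except OSError:
            status = None  # transport error: treat as retriable
        if status is not None and 200 <= status < 300:
            return "ok"
        if status is not None and 400 <= status < 500:
            return "fatal"  # client error: retrying won't help
        if attempt < max_retries:  # back off only between attempts
            sleep(base_delay * (2 ** attempt))
    return "exhausted"
```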


291-368: LGTM!

The database workflow is correctly wrapped with sqlite3.Error handling, the watermark only advances after full batch acceptance, and the connection is properly closed in the finally block. The partial 2xx safeguards (rejected list and accepted count mismatch) are well-implemented.

Gradata/tests/test_cloud_events_push.py (6)

24-40: LGTM!

The isolation fixture correctly rebinds keyfile paths and clears all relevant environment variables including the new endpoint overrides.


84-96: Test schema relies on production _ensure_sync_state_schema for composite key index.

The test fixture creates sync_state without the UNIQUE INDEX idx_sync_state_tenant_device that push.py requires for ON CONFLICT(tenant_id, device_id). This works because push_pending_events calls _ensure_sync_state_schema() before any watermark operations, which idempotently adds the missing index. This is the intended behavior per the comment in _sync_state.py ("lets a callsite use the cursor API on brains that predate 003").

No change needed, but documenting this coupling for future maintainers.


138-153: LGTM!

The mock response class correctly implements the context manager protocol and the default empty body aligns with the push client's backward-compatible handling of missing accepted keys.


189-218: LGTM!

Excellent end-to-end test that verifies the complete happy path including request body structure, summary counters, and watermark persistence in the database.


358-409: LGTM!

The partial 2xx tests provide essential regression coverage for the watermark-advancement gate. Both scenarios (rejected list and count mismatch) correctly assert that the watermark remains unset, preventing permanently skipped events.


434-516: LGTM!

The tenant isolation tests correctly verify that foreign tenant rows are excluded and legacy NULL tenant rows are included, matching the _fetch_events_since SQL logic in the production code.

Gradata/src/gradata/services/embeddings.py (6)

50-51: Localhost SSRF guard still allows arbitrary schemes.

The _is_trusted_url method returns True for localhost/127.0.0.1 without validating the scheme. A URL like file://localhost/etc/passwd would pass the trust check. Restrict localhost to safe schemes.

🔒 Proposed fix
-        if parsed.hostname in ("localhost", "127.0.0.1"):
-            return True
+        if parsed.hostname in ("localhost", "127.0.0.1"):
+            return parsed.scheme in {"http", "https"}

1-28: LGTM!

Module structure, imports, and constants are well-organized with clear documentation.


57-80: LGTM!

The API response validation is thorough — correctly rejects non-dict responses, wrong dimensions, non-numeric entries, and booleans. Exceptions properly trigger the local embedding fallback.


82-97: LGTM!

Good use of MD5 for deterministic trigram hashing with clear documentation explaining why the built-in hash() is unsuitable. The normalization correctly handles edge cases.
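For context, deterministic trigram hashing looks roughly like this sketch (the dimension and normalization are assumptions, not the module's actual constants). The built-in `hash()` varies per process under `PYTHONHASHSEED`, so a fixed digest is used for bucket assignment:

```python
import hashlib

DIM = 256  # illustrative vector dimension

def trigram_vector(text, dim=DIM):
    """Bucket-count character trigrams into a fixed-size vector
    using MD5 so the result is stable across processes."""
    vec = [0.0] * dim
    normalized = " ".join(text.lower().split())
    for i in range(len(normalized) - 2):
        tri = normalized[i : i + 3]
        digest = hashlib.md5(tri.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec  # texts under 3 chars yield a zero vector
```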


123-171: LGTM!

Clean union-find implementation with path splitting and union-by-rank for near-O(α(n)) amortized complexity. The blank-description filtering avoids unnecessary embedding calls, and the pairwise similarity check is appropriate for single-linkage clustering.
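The union-find variant named here can be sketched as below; this is the textbook path-splitting + union-by-rank structure, not a copy of the module's code:

```python
class UnionFind:
    """Disjoint-set with path splitting and union by rank,
    giving near-O(alpha(n)) amortized find/union."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path splitting: every visited node is repointed to its grandparent.
        while self.parent[x] != x:
            self.parent[x], x = self.parent[self.parent[x]], self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra  # attach the shorter tree under the taller one
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
```

For single-linkage clustering, each pair whose similarity clears the threshold gets a `union`, and final clusters are the `find` equivalence classes.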


174-225: LGTM!

Thread-safe LRU cache implementation with proper lock handling — released during the blocking embed() call to avoid serializing the thread pool. The move_to_end() on cache hits ensures true LRU eviction.
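The lock-release-during-I/O pattern described above, as a standalone sketch (class and method names are illustrative, not the real embeddings API):

```python
import threading
from collections import OrderedDict

class EmbeddingCache:
    """LRU cache that drops its lock around the blocking embed() call
    so concurrent embeds aren't serialized on network I/O."""
    def __init__(self, embed, maxsize=128):
        self._embed = embed
        self._maxsize = maxsize
        self._cache = OrderedDict()
        self._lock = threading.Lock()

    def get(self, text):
        with self._lock:
            if text in self._cache:
                self._cache.move_to_end(text)  # true LRU on hits
                return self._cache[text]
        # Lock intentionally released here: embed() may block on the network.
        vec = self._embed(text)
        with self._lock:
            self._cache[text] = vec  # __setitem__ places new keys at the end
            while len(self._cache) > self._maxsize:
                self._cache.popitem(last=False)  # evict least-recently-used
        return vec
```

The trade-off of releasing the lock is that two threads may embed the same text concurrently on a cold miss; the second write simply overwrites the first, which is acceptable for deterministic embeddings.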

Gradata/tests/test_embeddings.py (4)

1-31: LGTM!

Import updates correctly reflect the module relocation to gradata.services.embeddings. Existing similarity tests are preserved.


33-45: LGTM!

Clean context manager helper for mocking urlopen responses in tests.


47-88: LGTM!

Good coverage of malformed API response scenarios including missing keys, wrong dimensions, non-numeric values, and the local fallback path.


91-104: LGTM!

Effective test for blank-description handling — confirms both empty strings and whitespace-only strings are excluded from clustering while identical descriptions properly merge.

Comment thread Gradata/src/gradata/cloud/materializer.py
Comment thread Gradata/src/gradata/cloud/materializer.py
Comment thread Gradata/src/gradata/cloud/push.py Outdated
Comment thread Gradata/src/gradata/services/embeddings.py
Addresses 4 items from CR review at 16:53:31Z (commit 0705be6).

- materializer._rule_key: normalize category to upper() instead of lower(),
  matching the apply path (_apply_materialized.py) and the lessons.md
  convention. Functionally equivalent today — both sides uppercase on
  lookup — but aligns MaterializedRule.category with Lesson.category so
  external readers no longer need to re-normalize.
- materializer.__all__: alphabetical sort (Ruff RUF022).
- push.push_pending_events: guard tenant_for() / get_or_create_device_id()
  with try/except OSError so a corrupted brain dir returns a summary
  instead of raising — the public contract is "never raises".
- pull.pull_events: same identity-resolution guard for symmetry.
- embeddings.get_client: double-checked locking around _default_client
  initialization so concurrent async event-bus handlers can't race into
  creating two clients.

New regression tests (+2):
- test_identity_resolution_oserror_returns_summary: forces OSError out
  of tenant_for() and asserts summary.reason == "identity_error".
- test_get_client_is_threadsafe_singleton: 16 threads racing on the
  first get_client() call all receive the same instance.

Full suite: 4038 passed, 2 skipped.

Co-Authored-By: Gradata <noreply@gradata.ai>

@greptile-apps greptile-apps Bot left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Gradata
Owner Author

Gradata commented Apr 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
Gradata/src/gradata/cloud/pull.py (1)

219-222: ⚠️ Potential issue | 🟠 Major

Reject non-boolean end_of_stream and non-string watermark.

The response-shape validation is still incomplete here: bool(parsed.get("end_of_stream", True)) turns values like "false" into True, and a dict/list watermark will later be stringified and persisted. These should fail as malformed_response instead of being coerced.

Proposed fix
         all_events.extend(page_events)
         page_watermark = parsed.get("watermark")
+        if page_watermark is not None and not isinstance(page_watermark, str):
+            log.warning("events/pull: watermark field is not a string")
+            summary["status"] = "error"
+            summary["reason"] = "malformed_response"
+            return summary
+        page_end_of_stream = parsed.get("end_of_stream", True)
+        if not isinstance(page_end_of_stream, bool):
+            log.warning("events/pull: end_of_stream field is not a boolean")
+            summary["status"] = "error"
+            summary["reason"] = "malformed_response"
+            return summary
         if page_watermark:
             final_watermark = page_watermark
-        end_of_stream = bool(parsed.get("end_of_stream", True))
+        end_of_stream = page_end_of_stream
         pages_fetched += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/src/gradata/cloud/pull.py` around lines 219 - 222, The code coercing
response fields is accepting wrong types: instead of
bool(parsed.get("end_of_stream", True)) and implicitly stringifying watermark,
validate types explicitly — check parsed.get("end_of_stream") is a bool and
parsed.get("watermark") is a str (or absent) before assigning to end_of_stream
and final_watermark (symbols: parsed, end_of_stream, page_watermark,
final_watermark); if either type check fails, raise or return the existing
malformed_response error path (use the same malformed_response mechanism used
elsewhere) so non-boolean end_of_stream and non-string watermark are rejected
rather than coerced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/src/gradata/cloud/materializer.py`:
- Around line 201-206: The loop that parses rows currently appends any valid
JSON value (in for loop over rows) which can be non-object types and later
breaks functions like _rule_key() and materialize(db_path=...) that expect
evt["data"].get(); after json.loads(data_json) in the loop (the block handling
data = json.loads(...)), check that the decoded value is a mapping/dict (e.g.,
isinstance(data, dict)) and skip (continue) non-dict values so only object JSON
rows are appended to out.

In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 173-175: The code calls resp.read().decode("utf-8") which can
raise UnicodeDecodeError and bypass the JSON/error-guarding logic; wrap the byte
read and decode (the resp.read() / .decode("utf-8") operation around the
urllib.request.urlopen block) in a try/except that catches UnicodeDecodeError
and handles it exactly like a malformed_response (e.g., populate the stable
summary/error state and return or raise the same malformed_response path), and
apply the same change to the second occurrence of resp.read().decode("utf-8")
around the block referenced at lines 194-200 so both places consistently treat
bad UTF-8 as malformed_response.

In `@Gradata/tests/test_cloud_events_push.py`:
- Around line 215-218: Replace the manual open/close pattern with a context
manager: use "with sqlite3.connect(brain / 'system.db') as conn:" and run wm =
conn.execute("SELECT last_push_event_id FROM sync_state").fetchone() inside that
block (no explicit conn.close()), then assert wm[0] ==
"01HN000000000000000000000C"; this makes the code consistent with other tests
that use the same pattern.

---

Duplicate comments:
In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 219-222: The code coercing response fields is accepting wrong
types: instead of bool(parsed.get("end_of_stream", True)) and implicitly
stringifying watermark, validate types explicitly — check
parsed.get("end_of_stream") is a bool and parsed.get("watermark") is a str (or
absent) before assigning to end_of_stream and final_watermark (symbols: parsed,
end_of_stream, page_watermark, final_watermark); if either type check fails,
raise or return the existing malformed_response error path (use the same
malformed_response mechanism used elsewhere) so non-boolean end_of_stream and
non-string watermark are rejected rather than coerced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e1a8adfb-da09-4dc2-8c29-1a8489442092

📥 Commits

Reviewing files that changed from the base of the PR and between 0705be6 and ed32b6b.

📒 Files selected for processing (6)
  • Gradata/src/gradata/cloud/materializer.py
  • Gradata/src/gradata/cloud/pull.py
  • Gradata/src/gradata/cloud/push.py
  • Gradata/src/gradata/services/embeddings.py
  • Gradata/tests/test_cloud_events_push.py
  • Gradata/tests/test_embeddings.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest windows-latest / py3.12
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/src/gradata/cloud/push.py
  • Gradata/tests/test_cloud_events_push.py
🔇 Additional comments (28)
Gradata/src/gradata/services/embeddings.py (7)

50-51: Localhost SSRF guard still accepts any scheme.

The localhost check returns True for any URL scheme, including potentially dangerous ones like file://localhost/etc/passwd. This concern was already raised in a previous review.


1-28: LGTM!

Module structure is clean. The import of cosine_similarity at line 27 after the constants is slightly unconventional (typically all imports are grouped at the top), but this is a minor style preference and doesn't affect correctness.


57-80: LGTM!

Thorough validation of API response shape. The explicit not isinstance(v, bool) check correctly handles Python's bool-is-int subtyping quirk. Any validation failure surfaces as an exception that the outer embed() catches for clean fallback.


82-97: LGTM!

The MD5 usage is correctly justified in the comment—avoiding PYTHONHASHSEED non-determinism. Short texts (< 3 chars) produce zero vectors, which cosine_similarity handles gracefully by returning 0.0.


110-129: LGTM!

Double-checked locking pattern correctly implemented. The outer check avoids lock contention on the hot path after initialization; the inner check prevents duplicate instantiation under concurrent first-call races.


132-180: LGTM!

Union-find implementation is well-optimized with path splitting and union-by-rank for near-O(α(n)) amortized operations. Blank descriptions correctly produce None vectors and remain as singleton clusters.


183-234: LGTM!

Thread-safe LRU cache implementation is solid. The lock is correctly released during the embed() call to avoid serializing the thread pool on network I/O. The move_to_end() on cache hits implements true LRU semantics, and no redundant call is made on inserts since OrderedDict.__setitem__ already places new keys at the end.

Gradata/tests/test_embeddings.py (5)

3-9: LGTM!

Import updates correctly reference the relocated gradata.services.embeddings module.


33-44: LGTM!

Clean test helper that correctly implements the context manager protocol needed to mock urlopen responses.


47-88: LGTM!

Comprehensive coverage of _embed_api validation paths. The tests correctly verify that malformed responses raise ValueError and that embed() gracefully falls back to local embeddings.


91-104: LGTM!

Good coverage of blank-description handling. The assertion correctly expects two singleton clusters for the blank entries and one cluster of size 2 for the identical descriptions.


107-132: LGTM!

Well-structured concurrency test. The Event barrier ensures threads race simultaneously on get_client(), and the final assertion correctly verifies singleton semantics by checking all returned instances share the same id().

Gradata/src/gradata/cloud/pull.py (1)

113-122: Nice contract hardening for corrupt brain dirs.

Catching OSError around tenant_for() / get_or_create_device_id() keeps pull_events() on the documented summary-return path instead of surfacing filesystem corruption as an exception. Based on learnings, PR #102 addressed identity helpers returning a summary on corruption instead of raising.

Gradata/src/gradata/cloud/materializer.py (1)

144-154: Independent confidence parsing is the right fix.

A malformed confidence on one side no longer wipes out the valid value from the other side before Tier 2 comparison.

Gradata/src/gradata/cloud/push.py (6)

1-51: Module structure and imports look good.

The module docstring clearly documents the endpoint contract, safety properties (never raises, honors kill switch, watermark advancement semantics), and the overall design. The constants for chunking and retry are reasonable defaults.


53-123: Tenant scoping correctly addresses cross-tenant upload risk.

The query filter (tenant_id = ? OR tenant_id IS NULL) ensures:

  1. Only current tenant's rows are pushed
  2. Legacy pre-tenant rows (NULL) are still included for migration continuity

JSON parsing handles malformed data gracefully with fallback values.

Based on learnings, _events.py required tenant-scoped fallback SELECT for dedup — this follows the same pattern.


126-170: Watermark read/write correctly scoped to (tenant_id, device_id).

The _read_watermark query uses WHERE tenant_id = ? AND device_id = ?, and _write_watermark uses ON CONFLICT(tenant_id, device_id) targeting the composite unique index created by _ensure_sync_state_schema. This correctly isolates watermarks per-device for multi-device scenarios.


173-207: Retry logic with exponential backoff is well-implemented.

The distinction between 4xx (client error, fatal) and 5xx/transport errors (transient, retry) is correct. Backoff is applied only between retries (if attempt < max_retries), avoiding unnecessary delay after the final failure.


210-298: Comprehensive input validation maintains the "never raises" contract.

The validation sequence properly handles:

  • Invalid tunable parameters (returns invalid_params)
  • Missing database, kill switch, disabled sync, missing credential, non-HTTPS endpoint
  • Identity resolution failures (OSError caught and returns identity_error)

All paths return a structured summary dict.


300-377: Main push loop correctly handles partial acceptance and database errors.

The partial 2xx safeguard (lines 338-355) prevents watermark advancement when rejected is non-empty or accepted != len(events), ensuring rejected events aren't permanently skipped. The sqlite3.Error handler maintains the "never raises" contract for database failures.

Gradata/tests/test_cloud_events_push.py (8)

24-40: Isolation fixture properly includes endpoint override vars.

The fixture now clears GRADATA_ENDPOINT and GRADATA_CLOUD_API_BASE (lines 31-32), preventing CI/shell environment leakage from affecting test behavior.


43-135: Test helper creates a well-structured minimal brain.

The helper creates the necessary schema and identity files. The sync_state table intentionally lacks the UNIQUE index on (tenant_id, device_id) — this tests that _ensure_sync_state_schema correctly upgrades the schema at runtime.


156-187: Early-exit tests correctly verify all short-circuit paths.

Tests cover: missing database, kill switch, disabled config, missing credential, and non-HTTPS rejection — all returning the expected structured error summaries.


189-259: Happy path and batching tests provide solid coverage.

The tests verify:

  • Correct request body shape and event count
  • Watermark persistence after successful push
  • Resume semantics (skip already-pushed events)
  • Chunked batching produces expected batch sizes [2, 2, 1] for 5 events with chunk_size=2
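The batch-size assertion above follows directly from straightforward slicing; a one-line sketch (the real push code inlines this):

```python
def chunked(events, chunk_size):
    """Split events into consecutive batches of at most chunk_size."""
    return [events[i : i + chunk_size] for i in range(0, len(events), chunk_size)]
```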

262-325: Retry tests correctly verify 4xx fatal vs 5xx transient distinction.

The tests assert:

  • 4xx errors trigger immediate failure (1 call only)
  • 5xx errors exhaust retries (1 + max_retries calls)
  • Transient failure followed by success yields status=ok

Sleep is monkeypatched for fast execution.


358-409: Partial 2xx regression tests verify watermark safety.

These critical tests ensure the watermark isn't advanced when:

  1. Server returns a non-empty rejected list
  2. accepted count doesn't match events sent

Both correctly assert status=error, reason=batch_rejected, and verify no watermark row exists.


434-516: Tenant scoping tests verify cross-tenant isolation and legacy row handling.

The tests confirm:

  • Foreign tenant rows (different tenant_id) are excluded from push
  • Legacy rows (tenant_id IS NULL) are included under the current tenant

This matches the (tenant_id = ? OR tenant_id IS NULL) filter in _fetch_events_since.


519-529: Identity resolution error test verifies the "never raises" contract.

The test confirms that OSError from corrupted brain files is caught and returns a structured error summary (reason=identity_error) rather than propagating.

Comment thread Gradata/src/gradata/cloud/materializer.py
Comment thread Gradata/src/gradata/cloud/pull.py Outdated
Comment thread Gradata/tests/test_cloud_events_push.py Outdated
…materializer rows

CodeRabbit round on batch-5 flagged four residual hardening gaps; this
lands them together so the PR stays on the narrow cloud-sync footprint.

pull.py — response decode:
  - ``resp.read().decode("utf-8")`` could raise ``UnicodeDecodeError``
    that bypassed the HTTP / transport / JSON guards, escaping the
    "never raise — return summary" contract. Now caught and mapped to
    ``reason=malformed_response`` like other payload failures.

pull.py — end_of_stream / watermark type validation:
  - ``bool(parsed.get("end_of_stream", True))`` coerced the string
    ``"false"`` to True, silently claiming the stream was drained.
  - A dict / list ``watermark`` would later be ``str()``-ified and
    persisted into ``sync_state`` as garbage, poisoning every
    subsequent incremental pull.
  Both now fail fast as ``malformed_response`` before the value is
  consulted.

materializer._load_events_from_db — skip non-dict ``data`` payloads:
  - A legacy events row whose ``data_json`` decoded to a scalar / list
    / string would reach ``_rule_key`` / ``_apply_tier`` where
    ``evt["data"].get(...)`` crashes. Drop non-mapping payloads
    alongside the existing JSONDecodeError skip.

test_cloud_events_push.py — sqlite context manager:
  - Replaced the manual ``connect`` / ``close`` pair in the happy-path
    watermark assertion with ``with sqlite3.connect(...) as conn:`` so
    a failing assertion can't leak the connection handle and mask the
    real failure behind a Windows file-lock on teardown.

Regression coverage (+4 tests, full suite 4042 passed):
  - test_server_200_non_utf8_body_returns_malformed_response
  - test_server_200_non_string_watermark_is_rejected
  - test_server_200_non_bool_end_of_stream_is_rejected
  - test_load_events_from_db_skips_non_dict_data

Co-Authored-By: Gradata <noreply@gradata.ai>

@greptile-apps greptile-apps Bot left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Gradata
Owner Author

Gradata commented Apr 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/src/gradata/cloud/materializer.py`:
- Around line 184-192: The default event types tuple is built from
_MATERIALIZABLE_EVENT_TYPES which includes META_RULE_SYNTHESIZED but the
materialize loop (which checks for META_RULE_SYNTHESIZED and always skips it)
makes fetching those rows wasted I/O; change the default building of types in
_load_events_from_db (the variable types and placeholders/query construction) to
filter out META_RULE_SYNTHESIZED (e.g., remove it from the tuple when
event_types is None) or explicitly pass a filtered tuple so the SQL query never
includes META_RULE_SYNTHESIZED, ensuring the SELECT only requests materializable
types that will be processed.

In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 282-297: The review points out that summary["applied"] is only set
when lessons were actually written (inside the mat.rules or mat.conflicts
branch), so callers can't distinguish "apply ran successfully but nothing to
change" from "apply not run"; change the logic so that when the apply path
completes successfully (i.e., apply=True and you persist the watermark and no
exception occurred), you set summary["applied"] = True regardless of whether
mat.rules or mat.conflicts were present. Locate the apply path code handling mat
(symbols: mat.rules, mat.conflicts, parse_lessons, apply_to_lessons,
write_lessons_safe, summary) and move or add the summary["applied"] = True
assignment to the successful completion point of that apply branch (after
persisting the watermark and after any lessons write try/except), keeping the
existing error handling that sets summary["status"] = "error" and returns on
exceptions.

In `@Gradata/tests/test_cloud_events_pull.py`:
- Around line 23-27: The brain fixture creates a Brain instance and emits an
event but never persists cloud config, so tests that expect the config to exist
may hit the "config load failed" path; update the fixture (or tests) so the
Brain's configuration is saved by calling its _save_cfg() with sensible defaults
after instantiation (e.g., call Brain(...); b.emit(...); b._save_cfg() )—or
alternatively add a brief comment on the fixture documenting that callers must
call _save_cfg() explicitly; reference the fixture name brain and the
Brain._save_cfg method when making the change.

In `@Gradata/tests/test_cloud_events_push.py`:
- Around line 84-96: The test schema for sync_state is missing the
UNIQUE(tenant_id, device_id) index that production upsert logic in
_write_watermark() expects; update the test table creation in the test helper so
the CREATE TABLE or a subsequent CREATE UNIQUE INDEX ensures UNIQUE(tenant_id,
device_id) exists (i.e., add UNIQUE constraint on tenant_id and device_id or
create a unique index for those columns) so ON CONFLICT(tenant_id, device_id)
behavior in _write_watermark() is exercised correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d21de32d-e687-4e67-af8d-78702e0d1b85

📥 Commits

Reviewing files that changed from the base of the PR and between ed32b6b and e9ebdba.

📒 Files selected for processing (5)
  • Gradata/src/gradata/cloud/materializer.py
  • Gradata/src/gradata/cloud/pull.py
  • Gradata/tests/test_cloud_events_pull.py
  • Gradata/tests/test_cloud_events_push.py
  • Gradata/tests/test_cloud_materializer.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.12
  • GitHub Check: pytest windows-latest / py3.12
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.11
  • GitHub Check: pytest macos-latest / py3.12
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/tests/test_cloud_events_pull.py
  • Gradata/tests/test_cloud_materializer.py
  • Gradata/tests/test_cloud_events_push.py
🔇 Additional comments (6)
Gradata/tests/test_cloud_events_push.py (1)

1-528: LGTM!

Comprehensive test coverage for push_pending_events() including:

  • Early-exit status mappings (no_db, kill_switch, disabled, no_credential, https_required)
  • Happy path with watermark advancement
  • Resume from persisted watermark
  • Chunked batching across multiple POSTs
  • Retry semantics (4xx fatal, 5xx retried with backoff)
  • Partial 2xx rejection handling
  • Tenant scoping (foreign tenant rows excluded, legacy NULL rows included)
  • Identity resolution error handling

Past review comments have been addressed (env var isolation, partial 2xx test, context manager usage).

Gradata/src/gradata/cloud/pull.py (1)

1-349: LGTM overall — well-structured pull client with comprehensive error handling.

The implementation correctly:

  • Uses shared endpoint resolver for symmetric env override handling
  • Validates response JSON shape (dict, events list, watermark type, end_of_stream type)
  • Handles UTF-8 decode failures as malformed_response
  • Persists watermark only after successful apply and checks return value
  • Emits telemetry only after watermark persistence succeeds
  • Uses dynamic SDK version in User-Agent header

All past review comments have been addressed.

Gradata/tests/test_cloud_events_pull.py (1)

1-583: LGTM!

Comprehensive contract test suite for pull_events() covering:

  • All early-exit status mappings (no_db, kill_switch, disabled, no_credential, https_required, 501 → disabled_server_side, 410 → rewind_beyond_retention)
  • Dry-run vs. apply behavior (lessons.md mutation, applied flag)
  • Watermark persistence after successful apply
  • Malformed response handling (non-JSON, non-UTF8, wrong types)
  • Request composition (brain_id, device_id, Authorization header, limit clamping)
  • Auto-resume from persisted cursor
  • Pagination (end_of_stream, cursor advancement, defensive stop on missing watermark)
  • Configurable conflict threshold
  • Credential resolution from keyfile
  • Telemetry emission (CLOUD_SYNC_COMPLETED on apply only)

Past review comments have been addressed (env var isolation expanded).

Gradata/tests/test_cloud_materializer.py (2)

452-496: Excellent regression test for _load_events_from_db skipping non-dict data.

This test directly validates the fix for the past review comment about rows with scalar/list data_json crashing the materializer. The test correctly inserts well-formed and malformed rows (scalar 42, list [1,2,3], string "hi") and verifies only the dict row is returned.


1-496: LGTM!

Thorough test coverage for the materializer including:

  • Single-device graduation and promotion semantics
  • Tier 1 LWW (ts wins, device_id tiebreak, small drift)
  • Tier 2 conflict detection (confidence drift, state disagreement, post-conflict hold)
  • Conflict resolution via RULE_CONFLICT_RESOLVED with various payload variants (winning_ts lookup, winning_event, winning_snapshot)
  • Tier 3 RULE_OVERRIDE clearing conflict holds
  • Property invariants (idempotent replay, duplicate noop, order-independence)
  • Convergence across 200+ shuffled orderings
  • Regression tests for independent confidence parsing and non-dict data skipping

Past review comment (extraneous f-string) has been addressed.

Gradata/src/gradata/cloud/materializer.py (1)

1-397: LGTM overall — well-structured materializer with comprehensive Tier 1/2/3 handling.

The implementation correctly:

  • Normalizes category to uppercase for consistent key matching with the apply path
  • Parses confidence values independently to avoid masking legitimate Tier 2 conflicts
  • Filters event types to only process materializable events
  • Handles RULE_CONFLICT_RESOLVED with multiple fallback strategies (history lookup, winning_event, winning_snapshot)
  • Defensively parses numeric fields when building MaterializedRule
  • Skips non-dict data_json values from the database
  • Prunes conflicts that were resolved within the same stream

All past review comments have been addressed.

Comment on lines +84 to +96
conn.execute(
"""
CREATE TABLE sync_state (
brain_id TEXT PRIMARY KEY,
last_push_at TEXT,
updated_at TEXT,
device_id TEXT,
last_push_event_id TEXT,
last_pull_cursor TEXT,
tenant_id TEXT
)
"""
)
🧹 Nitpick | 🔵 Trivial

Test sync_state schema lacks the unique index required by production code.

The test helper creates sync_state without the UNIQUE(tenant_id, device_id) index that _write_watermark() relies on (see Gradata/src/gradata/cloud/push.py lines 155-161 using ON CONFLICT(tenant_id, device_id)). SQLite rejects an ON CONFLICT clause whose target has no matching PRIMARY KEY or UNIQUE constraint, so unless the schema is migrated first the upsert fails outright; either way, this divergence means tests won't catch regressions if the production upsert logic changes.

♻️ Proposed fix to align test schema with production
     conn.execute(
         """
         CREATE TABLE sync_state (
             brain_id TEXT PRIMARY KEY,
             last_push_at TEXT,
             updated_at TEXT,
             device_id TEXT,
             last_push_event_id TEXT,
             last_pull_cursor TEXT,
-            tenant_id TEXT
+            tenant_id TEXT,
+            UNIQUE(tenant_id, device_id)
         )
         """
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Gradata/tests/test_cloud_events_push.py` around lines 84 - 96, The test
schema for sync_state is missing the UNIQUE(tenant_id, device_id) index that
production upsert logic in _write_watermark() expects; update the test table
creation in the test helper so the CREATE TABLE or a subsequent CREATE UNIQUE
INDEX ensures UNIQUE(tenant_id, device_id) exists (i.e., add UNIQUE constraint
on tenant_id and device_id or create a unique index for those columns) so ON
CONFLICT(tenant_id, device_id) behavior in _write_watermark() is exercised
correctly.
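The behavior this nitpick hinges on is easy to demonstrate in isolation: SQLite rejects an `ON CONFLICT(...)` target that has no matching PRIMARY KEY or UNIQUE constraint (the table below mirrors the test schema; column set trimmed for brevity):

```python
import sqlite3

UPSERT = """
INSERT INTO sync_state (tenant_id, device_id, last_pull_cursor)
VALUES (?, ?, ?)
ON CONFLICT(tenant_id, device_id) DO UPDATE SET last_pull_cursor = excluded.last_pull_cursor
"""

# Without UNIQUE(tenant_id, device_id) the upsert is rejected outright.
bare = sqlite3.connect(":memory:")
bare.execute(
    "CREATE TABLE sync_state (tenant_id TEXT, device_id TEXT, last_pull_cursor TEXT)"
)
failed = False
try:
    bare.execute(UPSERT, ("t1", "d1", "w-1"))
except sqlite3.OperationalError:
    # "ON CONFLICT clause does not match any PRIMARY KEY or UNIQUE constraint"
    failed = True

# With the constraint, the second insert updates in place instead.
good = sqlite3.connect(":memory:")
good.execute(
    "CREATE TABLE sync_state (tenant_id TEXT, device_id TEXT, last_pull_cursor TEXT,"
    " UNIQUE(tenant_id, device_id))"
)
good.execute(UPSERT, ("t1", "d1", "w-1"))
good.execute(UPSERT, ("t1", "d1", "w-2"))
(value,) = good.execute("SELECT last_pull_cursor FROM sync_state").fetchone()
```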

materializer._load_events_from_db — drop META_RULE_SYNTHESIZED from the
default types tuple. The materialize() loop always skips that type (see
line 255), so including it in the SQL IN (...) list was pure I/O waste
— every pulled meta-rule row was fetched just to be discarded in the
next loop iteration. Callers that genuinely want every type can still
pass ``event_types`` explicitly. Also guards against an empty types
tuple so a malformed override can't produce a ``WHERE type IN ()``
query.
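The default-type filtering plus the empty-tuple guard can be sketched like this (event-type names other than META_RULE_SYNTHESIZED, RULE_CONFLICT_RESOLVED, and RULE_OVERRIDE are hypothetical, as is the helper itself):

```python
# Hypothetical names mirroring the commit text; not the real module.
_MATERIALIZABLE_EVENT_TYPES = frozenset(
    {"RULE_GRADUATED", "RULE_CONFLICT_RESOLVED", "RULE_OVERRIDE", "META_RULE_SYNTHESIZED"}
)

def build_query(event_types=None):
    if event_types is None:
        # materialize() always skips META_RULE_SYNTHESIZED, so never fetch it
        types = tuple(sorted(_MATERIALIZABLE_EVENT_TYPES - {"META_RULE_SYNTHESIZED"}))
    else:
        types = tuple(event_types)
    if not types:
        # guard: an empty tuple would render "WHERE type IN ()", which is invalid SQL
        raise ValueError("event_types must be non-empty")
    placeholders = ", ".join("?" for _ in types)
    return f"SELECT data_json FROM events WHERE type IN ({placeholders})", types

query, params = build_query()
```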

pull.py — ``applied`` now reflects "apply path ran end-to-end", not
"lessons write touched at least one byte":
  - Previous behavior: a pull that materialized to an empty delta kept
    ``applied=False``, indistinguishable from apply=False preview mode.
    Callers scheduling follow-ups based on that flag would re-run the
    same drained page forever.
  - Fix: set ``summary["applied"] = True`` only after every hard
    failure path (lessons_write_failed, watermark_persist_failed) has
    had its chance to short-circuit, but unconditionally on the empty-
    delta happy path.
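A minimal sketch of the intended flag semantics (function and status names are illustrative, not the real pull.py API):

```python
def run_apply(materialized, write_lessons, persist_watermark):
    summary = {"status": "ok", "applied": False}
    if materialized["rules"] or materialized["conflicts"]:
        try:
            write_lessons(materialized)
        except OSError:
            summary["status"] = "lessons_write_failed"
            return summary
    if not persist_watermark():
        summary["status"] = "watermark_persist_failed"
        return summary
    # Reached only after every hard-failure path had its chance to
    # short-circuit -- so an empty delta still reports applied=True.
    summary["applied"] = True
    return summary

empty = {"rules": [], "conflicts": []}
result = run_apply(empty, write_lessons=lambda m: None, persist_watermark=lambda: True)
```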

test_cloud_events_pull.py — document that the ``brain`` fixture
intentionally omits cloud-config; individual tests call ``_save_cfg``
with per-test flags so the unconfigured path stays exercisable.

test_cloud_events_push.py — document that the in-test ``sync_state``
table intentionally lacks UNIQUE(tenant_id, device_id). That's the
regression test for ``_ensure_sync_state_schema`` — if the migration
ever regresses, the first upsert will fail with "ON CONFLICT clause
does not match any PRIMARY KEY or UNIQUE constraint" and every push
test breaks loudly.

Regression coverage (+1 test, full suite 4043 passed):
  - test_apply_true_on_empty_delta_still_reports_applied

Co-Authored-By: Gradata <noreply@gradata.ai>
@Gradata
Owner Author

Gradata commented Apr 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Gradata/src/gradata/cloud/pull.py`:
- Around line 282-298: The summary dict currently only gets
"conflict_events_emitted" when the branch for mat.rules or mat.conflicts runs,
causing an inconsistent schema; ensure summary always contains this key by
initializing summary["conflict_events_emitted"] = 0 before the if (or assigning
it after the block) so that when apply=True and there are no rules/conflicts
callers can rely on summary; update the code around the summary variable in
pull.py where mat and emit_conflict_events are used (referencing summary, mat,
and emit_conflict_events) to set a default zero value and overwrite it with
emit_conflict_events(mat) inside the existing branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c294d02f-2303-4d15-b5a9-c6cf9c8738eb

📥 Commits

Reviewing files that changed from the base of the PR and between e9ebdba and a404a6b.

📒 Files selected for processing (4)
  • Gradata/src/gradata/cloud/materializer.py
  • Gradata/src/gradata/cloud/pull.py
  • Gradata/tests/test_cloud_events_pull.py
  • Gradata/tests/test_cloud_events_push.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.12
  • GitHub Check: pytest macos-latest / py3.12
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest windows-latest / py3.12
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/tests/test_cloud_events_push.py
  • Gradata/tests/test_cloud_events_pull.py
🔇 Additional comments (13)
Gradata/tests/test_cloud_events_push.py (3)

84-105: Test schema design is intentional and well-documented.

The deliberate omission of UNIQUE(tenant_id, device_id) serves as a regression test for the schema migration in _ensure_sync_state_schema(). The inline comment (lines 84-92) clearly documents this rationale. This is a sound testing strategy that will catch migration regressions early.


198-226: Comprehensive happy-path test with proper assertions.

This test validates the complete push flow: event capture, HTTP request shaping, watermark advancement, and summary fields. The use of context manager for SQLite (line 224) is consistent with other tests. Good coverage of the core contract.


366-394: Regression test for partial acceptance correctly validates watermark safety.

This test ensures that when the server returns a 2xx with rejected events, the watermark is not advanced. This prevents permanently skipping rejected events on subsequent runs. The assertion row is None or row[0] is None correctly handles both "no row inserted" and "row exists but watermark is NULL" cases.

Gradata/src/gradata/cloud/pull.py (2)

1-36: Module structure and imports are well-organized.

The module docstring clearly documents the contract behavior for different HTTP status codes and the apply=False default. Imports are clean and the use of __version__ for User-Agent ensures version consistency.


144-148: Good defensive bounds on pagination.

PAGE_CAP = 50 prevents runaway servers from causing infinite loops. Combined with limit=1000, this bounds a single pull_events() call to at most 50,000 events — a reasonable safety limit that still allows large syncs.

Gradata/tests/test_cloud_events_pull.py (3)

34-54: Comprehensive environment isolation fixture.

The _isolate_keyfile fixture correctly clears all cloud-related environment variables including the newer GRADATA_ENDPOINT, GRADATA_CLOUD_API_BASE, and GRADATA_API_URL. This ensures tests remain hermetic regardless of the runner's environment.


225-255: Excellent regression test for empty delta behavior.

This test validates a critical fix: applied=True must reflect "the apply path ran" not "bytes were written". The test correctly verifies both the summary flag and watermark persistence, ensuring empty-delta pulls don't re-fetch the same page indefinitely.


578-602: Telemetry test correctly validates emission and payload structure.

The test verifies that CLOUD_SYNC_COMPLETED is emitted exactly once with all expected fields. The mock recorder pattern cleanly captures events for assertion. Good coverage of the telemetry contract.

Gradata/src/gradata/cloud/materializer.py (5)

39-55: Event type allowlist is well-documented.

The _MATERIALIZABLE_EVENT_TYPES frozenset with its inline comment clearly explains why certain event types are excluded from winner selection. This prevents spec-excluded types like CORRECTION from affecting graduation state.


144-154: Independent confidence parsing prevents cross-contamination.

The separate try/except blocks for e_conf and i_conf ensure that a malformed confidence on one side doesn't zero out both values. This was correctly refactored from the previous coupled parsing.


281-335: RULE_CONFLICT_RESOLVED handling is robust with multiple fallback mechanisms.

The implementation correctly handles resolver events that arrive in incremental pulls without the original graduation:

  1. First checks graduation_history for events in the same batch
  2. Falls back to embedded winning_event payload
  3. Falls back to winning_snapshot scalar fields and reconstructs a synthetic event

This addresses the previous critical issue about resolvers becoming no-ops on incremental pulls.


184-194: Default event types correctly exclude META_RULE_SYNTHESIZED to avoid wasted I/O.

Since the materialize loop always skips META_RULE_SYNTHESIZED events, excluding them from the default DB query prevents fetching rows that will never be processed. The guard at line 193-194 for empty types also prevents malformed SQL.


371-393: Defensive numeric coercion in result building handles malformed data gracefully.

The separate try/except blocks for confidence and fire_count ensure that malformed values in winning events don't crash the entire materialization pass. Defaults to 0.0/0 on failure, which is a safe fallback.

…ys present

CodeRabbit batch-7 flagged a subtle schema inconsistency: ``summary["conflict_events_emitted"]`` was only populated inside the ``if mat.rules or mat.conflicts:`` branch, so callers on the empty-delta apply path had to fall back to ``.get(..., 0)`` while callers on the non-empty path could index directly.

Initialize the key to ``0`` at the top of the ``if apply:`` branch so the summary schema is stable across both paths and ``emit_conflict_events(mat)`` just overwrites it when rules/conflicts actually ran.

Regression lock-in: extend the existing empty-delta test to assert
``result["conflict_events_emitted"] == 0``.

Co-Authored-By: Gradata <noreply@gradata.ai>
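The stable-schema fix in miniature (field names follow the commit text; the surrounding function is hypothetical):

```python
def build_summary(mat, apply, emit_conflict_events):
    summary = {"status": "ok"}
    if apply:
        # default first, so the key exists even on an empty delta
        summary["conflict_events_emitted"] = 0
        if mat["rules"] or mat["conflicts"]:
            summary["conflict_events_emitted"] = emit_conflict_events(mat)
    return summary

empty = build_summary({"rules": [], "conflicts": []}, True, lambda m: 99)
nonempty = build_summary({"rules": ["r"], "conflicts": []}, True, lambda m: 3)
```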

@Gradata
Owner Author

Gradata commented Apr 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.


@Gradata Gradata merged commit 5635a66 into main Apr 22, 2026
9 checks passed
@Gradata Gradata deleted the feat/cloud-sync-phase-2 branch April 22, 2026 18:47