feat(dedup): fingerprint-based observation dedup with seen_count persistence #67
…e semantics

Problem: 10 sessions of "don't use em-dashes" produced 10 lesson reinforcements and inflated confidence/fire_count 10x. Ties into gap-analysis gap #4 (survival-bonus bypass inflating confidence).

Solution: A single sha1 fingerprint on (category, normalized(draft||final)) is persisted in a new observation_dedup SQLite table with seen_count, first/last session, and timestamps. The ingestion path in brain_correct (_core.py) calls check_and_register at exactly one point; if the fingerprint was seen inside a recent 10-session window, the correction is tagged observation_deduped and the lesson-create / lesson-reinforce branch is skipped. This prevents confidence inflation while preserving pattern extraction, FTS indexing, bus emit, and the raw CORRECTION event.

Semantics: default is DROP within window. seen_count is still tracked in the persisted row so MERGE (bump fire_count by seen_count at window rollover) can be wired in later without data loss. True MERGE wiring requires extending update_confidence in self_improvement.py, which is off-limits for this worktree; flagged for polyclaude policy review.

Files:
- src/gradata/enhancements/dedup.py (new)
- src/gradata/_core.py (single hook point in brain_correct)
- tests/test_dedup.py (14 tests: fingerprint stability, normalization, category-awareness, window boundaries, end-to-end via Brain.correct)

Tests: 2271 pass, 23 skipped. ruff clean.

Co-Authored-By: Gradata <noreply@gradata.ai>
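The fingerprint-and-window check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in dedup.py: the normalization rule (lowercase, collapse whitespace), the column layout, the integer session numbers, and the choice to measure the window from the last sighting are all assumptions, and the timestamp columns are left out for brevity. Only the names observation_dedup, seen_count, check_and_register and the 10-session window come from the commit message.

```python
import hashlib
import re
import sqlite3

WINDOW_SESSIONS = 10  # window length quoted in the commit message

# Assumed schema: the real table also stores timestamps.
SCHEMA = """
CREATE TABLE IF NOT EXISTS observation_dedup (
    fingerprint   TEXT PRIMARY KEY,
    seen_count    INTEGER NOT NULL,
    first_session INTEGER NOT NULL,
    last_session  INTEGER NOT NULL
)
"""


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace runs.
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(category: str, draft: str, final: str) -> str:
    # sha1 over (category, normalized(draft || final)).
    body = normalize(draft + "||" + final)
    payload = category.lower() + "\x1f" + body
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def check_and_register(conn: sqlite3.Connection, fp: str, session: int) -> bool:
    """Record a sighting; return True if it is a duplicate inside the window."""
    conn.execute(SCHEMA)
    row = conn.execute(
        "SELECT seen_count, last_session FROM observation_dedup"
        " WHERE fingerprint = ?",
        (fp,),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO observation_dedup VALUES (?, 1, ?, ?)",
            (fp, session, session),
        )
        return False
    seen_count, last_session = row
    # Assumption: the window is measured from the most recent sighting.
    duplicate = (session - last_session) < WINDOW_SESSIONS
    # seen_count is bumped either way, so a later MERGE policy loses no data.
    conn.execute(
        "UPDATE observation_dedup SET seen_count = ?, last_session = ?"
        " WHERE fingerprint = ?",
        (seen_count + 1, session, fp),
    )
    return duplicate
```

Under these assumptions, a correction first seen in session 1 and repeated in session 2 is dropped, while a repeat arriving 10 or more sessions after the last sighting is treated as fresh again.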
- Pull the 29-line inline dedup block in brain_correct into a single annotate_event_with_dedup() helper inside dedup.py. Call site in _core.py is now 7 lines including the import: one seam, one signature.
- Reuse gradata._db.get_connection and ensure_table instead of hand-rolling a sqlite3.connect + PRAGMA busy_timeout + CREATE TABLE path. Aligns dedup.py with the project's standard schema-creation pattern without touching _migrations.py.
- Collapse _normalize_text into two statements. No behavior change.

All 14 dedup tests pass, full suite still at 2271 passed / 23 skipped. ruff clean on the touched files.

Co-Authored-By: Gradata <noreply@gradata.ai>
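The "one seam, one signature" idea from this refactor might look something like the sketch below. The real signature of annotate_event_with_dedup is not shown in the PR, so the parameters here (an event dict and a seen_recently callable) are hypothetical, as is the correct_sketch stand-in for the brain_correct call site. Only the helper name and the observation_deduped tag are taken from the commit messages.

```python
from typing import Any, Callable

DEDUP_TAG = "observation_deduped"  # tag name from the commit message

Event = dict[str, Any]


def annotate_event_with_dedup(
    event: Event, seen_recently: Callable[[Event], bool]
) -> bool:
    """Tag the event if it is a recent duplicate; return the verdict.

    Hypothetical signature: seen_recently stands in for the fingerprint
    lookup against the observation_dedup table.
    """
    duplicate = seen_recently(event)
    if duplicate:
        event.setdefault("tags", []).append(DEDUP_TAG)
    return duplicate


def correct_sketch(
    event: Event,
    seen_recently: Callable[[Event], bool],
    reinforce_lesson: Callable[[Event], None],
) -> list[str]:
    """Stand-in for the call site: one hook, lesson branch skipped on dup."""
    steps = ["raw_correction_event"]  # always recorded
    if not annotate_event_with_dedup(event, seen_recently):
        reinforce_lesson(event)  # lesson-create / lesson-reinforce branch
        steps.append("lesson")
    # Downstream work still runs for duplicates.
    steps += ["pattern_extraction", "fts_index", "bus_emit"]
    return steps
```

Keeping the verdict in a return value and the tag on the event means the call site needs only one conditional, which matches the small hook the commit describes.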
Deploying gradata-dashboard with
| Latest commit: | 8ca5457 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://8173d841.gradata-dashboard.pages.dev |
| Branch Preview URL: | https://wt-observation-dedup.gradata-dashboard.pages.dev |
Summary
Dedupes near-identical correction observations within a rolling window so repeated corrections don't inflate confidence N-fold. Policy: DROP (not MERGE), validated via 5-perspective polyclaude council (unanimous HIGH).
seen_count is persisted either way so the policy is reversible via a backfill worker.
Council verdict (observation-dedup council, 2026-04-15)
Unanimous DROP across SRE, Statistician, Red-Team, Product/UX, Pipeline-Architect:
Files
- src/gradata/enhancements/dedup.py (new): fingerprint, is_duplicate, register_observation, annotate_event_with_dedup
- src/gradata/_core.py: single 7-line hook at the ingestion seam
- tests/test_dedup.py: 14 tests
Tests
2271 pass (+14), ruff clean.
Commits
- 771cf66 feat(dedup): fingerprint-based observation dedup
- 8ca5457 refactor(dedup): extract hook helper, reuse _db.get_connection
Follow-up flagged
- observation_dedup DDL should move to _migrations._BASE_TABLES once this worktree's NO-TOUCH list expires

Co-Authored-By: Gradata <noreply@gradata.ai>