fix(staged_tasks): dedup promote_staged_task against active action_items#7092
builtbycnob wants to merge 3 commits into BasedHardware:main
Conversation
Adds a normalized lookup that returns the first active (non-completed, non-deleted) action_item whose description matches a given string. Normalization is case-insensitive and strips legacy `[screen]` markers, mirroring the dedup rule used by `database.staged_tasks.create_staged_task` so the staged → action_item promotion path can avoid creating semantic duplicates. Streams active items (typically a small bounded set per user) rather than relying on a Firestore equality filter — Firestore cannot do case-insensitive matching without a normalized companion field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
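A minimal sketch of the lookup described above, with the Firestore stream replaced by a plain iterable of dicts (the field names `description`, `completed`, and `deleted` follow the PR text; the function and variable names here are illustrative, not the real module API):

```python
from typing import Iterable, Optional


def _normalize(desc: Optional[str]) -> str:
    # Case-insensitive match; strip the legacy [screen] prefix/suffix marker.
    if not desc:
        return ''
    s = desc.strip()
    if s.startswith('[screen] '):
        s = s[len('[screen] '):]
    if s.endswith(' [screen]'):
        s = s[:-len(' [screen]')]
    return s.strip().lower()


def find_active_by_description(items: Iterable[dict], description: str) -> Optional[dict]:
    """Return the first non-completed, non-deleted item whose normalized
    description matches. The comparison happens in Python because Firestore
    has no case-insensitive equality filter."""
    target = _normalize(description)
    if not target:
        return None
    for item in items:
        if item.get('completed') or item.get('deleted'):
            continue
        if _normalize(item.get('description')) == target:
            return item
    return None
```

Streaming the small active set and comparing in Python trades a few extra reads for not having to backfill a normalized companion field on every document.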
Greptile Summary

This PR closes the active duplicate action_item creation path by adding a case-insensitive description lookup (`get_active_action_item_by_description`).

Confidence Score: 4/5. Safe to merge — the dedup guard is read-only before any write, the fallback is the existing creation path, and the change is well-tested. No P0/P1 issues found. Two P2 findings:

- staged-task metadata (due_at, priority) is silently dropped on a dedup match;
- the normalization helper is slightly more aggressive than create_staged_task's own dedup.

Neither affects correctness of the primary fix.

Important Files Changed
- backend/database/staged_tasks.py — dedup skip path drops staged-task metadata
- backend/database/action_items.py — normalization divergence from create_staged_task
Sequence Diagram

sequenceDiagram
participant Client as Desktop Client (TaskPromotionService)
participant Router as POST /v1/staged-tasks/promote
participant StagedDB as database.staged_tasks
participant ActionDB as database.action_items
participant Firestore
Client->>Router: POST /promote
Router->>StagedDB: promote_staged_task(uid)
StagedDB->>Firestore: query top staged_task (completed=false, asc relevance)
Firestore-->>StagedDB: staged doc
StagedDB->>ActionDB: get_active_action_item_by_description(uid, desc)
ActionDB->>Firestore: query action_items (completed=false)
Firestore-->>ActionDB: stream docs
alt Duplicate found (normalized desc matches)
ActionDB-->>StagedDB: existing action_item
StagedDB->>Firestore: update staged_task {completed, promotion_skipped, promoted_to}
StagedDB-->>Router: existing action_item
Router-->>Client: {promoted: true, promoted_task: existing}
else No duplicate
ActionDB-->>StagedDB: None
StagedDB->>ActionDB: create_action_item(uid, data)
ActionDB->>Firestore: add() new doc id
StagedDB->>Firestore: update staged_task {completed, promoted_at}
StagedDB-->>Router: new action_item
Router-->>Client: {promoted: true, promoted_task: new}
end
```python
existing = action_items_db.get_active_action_item_by_description(uid, staged['description'])
if existing is not None:
    col.document(staged['id']).update(
        {
            'completed': True,
            'promoted_at': datetime.now(timezone.utc),
            'promotion_skipped': 'duplicate',
            'promoted_to': existing['id'],
        }
    )
    logger.info(
        "Skipped promotion of staged task %s for user %s — duplicate of action_item %s",
        staged['id'],
        uid,
        existing['id'],
    )
    return existing
```
Staged task metadata silently dropped on dedup
When a duplicate is detected, the staged task's richer fields (due_at, priority, category, relevance_score) are discarded — the existing action_item is returned as-is without merging any updated context. This is observable when a subsequent conversation extracts the same task but with a due date or priority the original action_item lacks; the user never sees that scheduling information applied.
Consider a lightweight merge step before returning existing:
```python
update_fields = {}
for field in ('due_at', 'priority', 'category'):
    if staged.get(field) is not None and not existing.get(field):
        update_fields[field] = staged[field]
if update_fields:
    update_fields['updated_at'] = datetime.now(timezone.utc)
    action_items_db.update_action_item(uid, existing['id'], update_fields)
```

Or at minimum document that these fields are intentionally not merged.
```python
def _normalize_description(desc: Optional[str]) -> str:
    """Normalize a task description for case-insensitive duplicate matching.

    Strips whitespace + the legacy ``[screen]`` prefix/suffix marker that the
    AI promotion pipeline used to add to AI-generated tasks (still appears in
    historical data and on tasks that round-tripped through ``migrate_ai_tasks``).
    """
    if not desc:
        return ''
    s = desc.strip()
    if s.startswith('[screen] '):
        s = s[len('[screen] ') :]
    if s.endswith(' [screen]'):
        s = s[: -len(' [screen]')]
    return s.strip().lower()
```
Normalization diverges from `create_staged_task`
`create_staged_task` deduplicates with `description.strip().lower()` — no `[screen]` stripping. `_normalize_description` goes further by also stripping the `[screen]` prefix/suffix. This means two staged tasks with descriptions "Email John" and "[screen] Email John" can coexist (different strings in `create_staged_task`'s check), but at promotion time both normalize to "email john" and match the same action_item. The second staged task is silently closed as a duplicate rather than promoting its own doc.

In practice the result is correct (both descriptions refer to the same task), but the PR description's claim that this "mirrors create_staged_task's dedup logic" is slightly inaccurate, and the order of promotion determines which description ends up on the live action_item — a `[screen]`-tagged description could be promoted first, leaving a marker-polluted description in the user's action list.
Greptile review on BasedHardware#7092:

P2 — Staged metadata silently dropped on dedup hit. When a later conversation extracts the same task with richer context (a due_at the user mentioned later, a priority assignment), the existing action_item never sees that information because we just return it as-is. Now we gather the staged-task fields the existing item is missing (due_at, priority, category) and call update_action_item before returning. The merge is best-effort — a write failure logs a warning but does not block the dedup itself.

P2 — Normalization divergence with create_staged_task. The promote-time helper (_normalize_description) strips [screen] markers, but create_staged_task was using a plain strip().lower(). That meant "[screen] Email John" and "Email John" could coexist as two staged candidates and only one would actually promote — leaving a marker-polluted description on whichever one won the race. Use the same helper in both places so the dedup boundary is uniform. Cross-module use of action_items_db._normalize_description is fine within the database/ package; the underscore is a soft convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new cases:

- test_promote_merges_missing_fields_on_dedup: a staged task carrying due_at/priority/category populates the existing action_item only for fields that are MISSING (must not overwrite a priority the user already set).
- test_promote_dedup_no_merge_when_existing_already_has_fields: no update_action_item call when the existing item is already populated.
- test_create_staged_task_uses_normalized_dedup: regression for the divergence note — "[screen] Email John" matches an existing "Email John" staged task instead of producing a duplicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
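The merge rule the first test exercises can be sketched in isolation (field names are from the review thread; `merge_missing_fields` is a hypothetical standalone version of the inline loop, and the dicts stand in for Firestore documents):

```python
def merge_missing_fields(existing: dict, staged: dict) -> dict:
    """Collect staged-task fields the existing action_item lacks.

    Only fills gaps: a priority or due date the user already set on the
    existing item is never overwritten.
    """
    update_fields = {}
    for field in ('due_at', 'priority', 'category'):
        if staged.get(field) is not None and not existing.get(field):
            update_fields[field] = staged[field]
    return update_fields


staged = {'due_at': '2025-07-01T09:00:00Z', 'priority': 'high', 'category': None}
existing = {'due_at': None, 'priority': 'low'}  # user already set a priority

# Only due_at is missing on the existing item, so only due_at is merged:
assert merge_missing_fields(existing, staged) == {'due_at': '2025-07-01T09:00:00Z'}
```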
Symptom
Users accumulate 5–6 duplicate action_items per task description within a few hours of active use. Confirmed on a production account: 8 distinct task descriptions × 5–6 backend duplicates each = ~50 phantom rows, all with distinct backendIds (the local `UNIQUE` constraint is honoured — these are genuinely separate Firestore documents). The desktop client's `TaskPromotionService` runs a 5-minute safety-net timer (and fires on task complete/delete + app startup), each tick calling `POST /v1/staged-tasks/promote`.

Root cause

`database.staged_tasks.promote_staged_task` calls `database.action_items.create_action_item` unconditionally: `create_action_item` does `action_items_ref.add(action_item_data)` → Firestore allocates a fresh document id on every call. There is no idempotency, no content-hash key, no description-based lookup. `create_staged_task` already dedups its own collection (case-insensitive description match), so a single description never spawns more than one staged task at a time. But the promotion handoff loses that property: every conversation that re-mentions the same task → new staged task → eventual promote → fresh action_item, even if an active action_item with the same description already exists.

This is the active duplicating path. The reproducible loop:
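In miniature, with a dict-backed stand-in for Firestore (all names here are illustrative, not the real modules), the unconditional `add()` path versus the guarded one:

```python
import itertools


class FakeCollection:
    """Dict-backed stand-in for a Firestore collection: add() always mints a fresh id."""

    def __init__(self):
        self.docs = {}
        self._ids = itertools.count()

    def add(self, data):
        doc_id = f'doc{next(self._ids)}'
        self.docs[doc_id] = {**data, 'id': doc_id}
        return doc_id


def promote(col, description, dedup=False):
    """Pre-fix behaviour with dedup=False; the PR's guard with dedup=True."""
    if dedup:
        for doc in col.docs.values():
            if not doc['completed'] and doc['description'].strip().lower() == description.strip().lower():
                return doc['id']  # reuse the existing active item
    return col.add({'description': description, 'completed': False})


col = FakeCollection()
for _ in range(5):
    promote(col, 'Email John')          # five safety-net ticks, pre-fix
assert len(col.docs) == 5               # five phantom duplicates

guarded = FakeCollection()
for _ in range(5):
    promote(guarded, 'Email John', dedup=True)
assert len(guarded.docs) == 1           # the guard keeps a single active item
```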
Fix
Two pieces in `database/`:

1. `database/action_items.py` — new helper. Streams active items (Firestore `completed == false`) and applies a normalized comparison in Python. Bounded set per user; the streaming approach matches what `create_staged_task` already does for its own collection.
2. `database/staged_tasks.py` — `promote_staged_task` guard. The staged task is closed (so it doesn't re-attempt promotion on the next tick) and tagged with breadcrumbs (`promotion_skipped`, `promoted_to`) for forensics.

Tests
12 new unit tests in `backend/tests/unit/test_staged_tasks_dedup.py`, wired into `backend/test.sh`. Covers:

- `_normalize_description`: whitespace, case, `[screen]` prefix/suffix, `None`/empty
- `get_active_action_item_by_description`: empty set, case-insensitive match, deleted skip, unrelated skip, `[screen]`-marker normalization on both sides of comparison
- `promote_staged_task`: skip on duplicate (returns existing, tags staged with `promotion_skipped` + `promoted_to`), happy-path creation, empty staged returns `None`

Heavy deps (`firebase_admin`, `google.cloud.firestore*`) are stubbed at import-time, matching the pattern in `tests/unit/test_action_item_dedup.py`.

What this fixes (and what it does not)
- The active duplicating path (the desktop `TaskPromotionService` 5-min timer) will stop accumulating new phantom action_items.
- Existing dedup behaviours (`create_staged_task`) are not regressed — the helper is consistent with their normalization.
- Does not clean up already-existing duplicates (that needs a one-off script setting `deleted=true` for affected users).
- Does not add idempotency to `create_action_item` itself. Other callers (`routers/action_items.py:create_action_item`, `routers/developer.py:create_action_item`, `tools/action_item_tools.py:create_action_item_tool`) can still produce content-equivalent duplicates if the client retries a request. A follow-up adding optional content-hash idempotency is described in fix(desktop/sync): adopt orphan action_items by description only #7091's body.

Companion PR (desktop client): #7091 — fixes the orphan-adoption mismatch that produces the `1 manual + N` shape on the local SQLite.

Risk
Low. The dedup is read-only against the same collection that `promote_staged_task` already reads (it streams active items via `get_action_items` upstream too). The fallback is the existing path. `promotion_skipped`/`promoted_to` are net-new fields, opt-in to read.

If two staged tasks with the same description race to promote concurrently, both will see "no existing match" before either commits — at most one extra duplicate slips through compared to the pre-fix state, and it'll be caught on the next tick (one of them now has an active action_item).
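That race window can be forced by hand in a pure-Python simulation (no real concurrency needed; `FakeCollection` is an illustrative stand-in, not the real Firestore client):

```python
class FakeCollection:
    """Minimal in-memory stand-in; add() mints a fresh id per call."""

    def __init__(self):
        self.docs = {}
        self._n = 0

    def add(self, data):
        self._n += 1
        doc_id = f'doc{self._n}'
        self.docs[doc_id] = {**data, 'id': doc_id}
        return doc_id

    def find_active(self, desc):
        key = desc.strip().lower()
        for d in self.docs.values():
            if not d['completed'] and d['description'].strip().lower() == key:
                return d
        return None


col = FakeCollection()
desc = 'Email John'

# Force the interleaving: both promoters finish their read phase
# before either commits a write.
seen_a = col.find_active(desc)
seen_b = col.find_active(desc)
if seen_a is None:
    col.add({'description': desc, 'completed': False})
if seen_b is None:
    col.add({'description': desc, 'completed': False})

assert len(col.docs) == 2  # one extra duplicate, matching the pre-fix worst case

# A later tick re-runs the guard and now sees an active match, so no third copy:
assert col.find_active(desc) is not None
```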
Test plan
- `pytest tests/unit/test_staged_tasks_dedup.py -v` (12 passed)
- `black --line-length 120 --skip-string-normalization` clean
- `/v1/staged-tasks/promote` with an existing active duplicate: confirm response returns existing item, no new Firestore document, staged doc has `promotion_skipped='duplicate'` and `promoted_to=<id>`.
- Happy path: `promotion_skipped` is not set.

🤖 Generated with Claude Code