Bug: Heartbeat merge fails due to SQL query mismatch between get_events and replace_last_event #554

Summary

There's a critical bug causing duplicate events and failed heartbeat merges. The root cause is a mismatch between two SQL queries that should target the same event but use different selection criteria.

Root Cause

get_events (datastore.rs:736):

ORDER BY starttime DESC LIMIT 1

replace_last_event (datastore.rs:563):

WHERE endtime = (SELECT max(endtime) FROM events WHERE bucketrow = ?1)

When two events share a starttime but have different durations (common during AFK status transitions), these queries can target different rows: the ORDER BY is free to return either tied event, while max(endtime) always picks the one that ends latest.
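
The divergence is easy to demonstrate in isolation. Here is a minimal rusqlite sketch against a stripped-down schema (the real events table has more columns; the values are illustrative):

use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, bucketrow INTEGER,
                              starttime INTEGER, endtime INTEGER, data TEXT);
         INSERT INTO events VALUES (1, 1, 1000, 1000, 'not-afk'); -- duration 0
         INSERT INTO events VALUES (2, 1, 1000, 1500, 'afk');     -- duration 500",
    )?;

    // What get_events treats as the last event: with both rows tied on
    // starttime, SQLite is free to return either one.
    let seen: i64 = conn.query_row(
        "SELECT id FROM events WHERE bucketrow = 1
         ORDER BY starttime DESC LIMIT 1",
        [],
        |row| row.get(0),
    )?;

    // What replace_last_event would UPDATE: always the row ending latest.
    let targeted: i64 = conn.query_row(
        "SELECT id FROM events WHERE bucketrow = 1 AND endtime =
         (SELECT max(endtime) FROM events WHERE bucketrow = 1)",
        [],
        |row| row.get(0),
    )?;

    // `targeted` is always 2; `seen` may be 1 or 2.
    println!("get_events saw id {seen}, replace_last_event targets id {targeted}");
    Ok(())
}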

Impact

In my database:

  • 284,995 AFK events reduced to 10,816 after deduplication + merge (96% were duplicates/unmerged)
  • 92% of events had zero duration (heartbeats never merged)
  • 14,013 events on a single day (should be ~50 status transitions)

Reproduction

  1. AFK watcher sends two rapid heartbeats on status change (e.g., becoming idle):

    • ping(not-afk, T, duration=0)
    • ping(afk, T, duration=X)
  2. Both events have the same starttime T but different data and duration

  3. Next heartbeat arrives:

    • get_events with ORDER BY starttime DESC LIMIT 1 returns one of the two tied events (SQLite's ordering is undefined when starttime values are equal)
    • If the returned event's data doesn't match, the merge fails → a new event is INSERTed
    • If the merge succeeds, replace_last_event UPDATEs the row with max(endtime), which may be a different event than the one get_events returned
  4. Cache and database diverge, and duplicates accumulate (see the sketch after this list)
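
To make step 3 concrete, here is a simplified model of the decision path; the Event type, Action enum, and function shape are all illustrative, not the actual aw-server-rust code:

#[derive(Clone, Debug)]
struct Event {
    starttime: i64,
    endtime: i64,
    data: String, // e.g. "afk" / "not-afk"
}

enum Action {
    /// Merge succeeded: the SQL UPDATE keys on max(endtime), which is
    /// not necessarily the row `last` came from when starttimes tie.
    ReplaceLast(Event),
    /// Merge failed: INSERT a brand-new row.
    Insert(Event),
}

fn on_heartbeat(last: &Event, incoming: Event) -> Action {
    if last.data == incoming.data {
        // The cache is updated as if `last` was extended, but the
        // database UPDATE may hit a different row entirely.
        Action::ReplaceLast(Event {
            endtime: incoming.endtime.max(last.endtime),
            ..last.clone()
        })
    } else {
        // With two rows tied on starttime, `last` may be the wrong one
        // of the pair, so this INSERT duplicates an existing status.
        Action::Insert(incoming)
    }
}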

Additional Issue

replace_last_event doesn't check rows affected (datastore.rs:584-590):

match stmt.execute([...]) {
    Ok(_) => self.update_endtime(&mut bucket, event),  // Ignores rows_affected!
    Err(err) => return Err(...)
}

If the UPDATE matches 0 rows, the method still returns Ok(()) and update_endtime runs anyway, so the in-memory cache advances past a database row that was never changed.

Proposed Fixes

Fix 1: Consistent query selection (Required)

Either:

  • Pass the event ID to replace_last_event and use WHERE id = ? (sketched below)
  • Or change get_events to ORDER BY endtime DESC so both queries agree on which event is "last"
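
A sketch of the first option; the function name, signature, and error type here are hypothetical stand-ins for the real replace_last_event:

use rusqlite::{params, Connection};

// Sketch of Fix 1: update exactly the row that get_events returned,
// instead of re-deriving "last" via max(endtime).
fn replace_event_by_id(
    conn: &Connection,
    event_id: i64,
    endtime: i64,
    data: &str,
) -> Result<(), String> {
    let rows = conn
        .execute(
            "UPDATE events SET endtime = ?2, data = ?3 WHERE id = ?1",
            params![event_id, endtime, data],
        )
        .map_err(|e| e.to_string())?;
    // Fix 2 folded in: a zero-row UPDATE is surfaced as an error
    // instead of silently "succeeding" and corrupting the cache.
    if rows == 0 {
        return Err(format!("replace_event_by_id matched 0 rows for id {event_id}"));
    }
    Ok(())
}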

Fix 2: Verify row count (Required)

match stmt.execute([...]) {
    Ok(0) => return Err(DatastoreError::InternalError(
        "replace_last_event matched 0 rows".into()
    )),
    Ok(_) => self.update_endtime(&mut bucket, event),
    Err(err) => return Err(...)
}

Fix 3: Add tiebreaker to ORDER BY (Recommended)

ORDER BY starttime DESC, id DESC LIMIT 1

This ensures deterministic ordering when starttime is identical.

Environment

  • aw-server-rust: latest (cloned today)
  • awatcher client
  • ~8 months of data

Workaround

Data can be recovered by the following passes (sketched after the list):

  1. Deduplicating (keep longest duration per starttime+data)
  2. Merging consecutive same-status events
  3. Reconstructing durations from next event's starttime
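
A sketch of those three passes over an exported event list; the Event type and its fields are assumptions for illustration, not the actual schema:

#[derive(Clone, Debug)]
struct Event {
    starttime: i64, // epoch millis
    duration: i64,  // millis
    data: String,   // status payload, e.g. "afk"
}

fn recover(mut events: Vec<Event>) -> Vec<Event> {
    // 1. Deduplicate: keep the longest duration per (starttime, data).
    //    Sort so the longest duplicate comes first, then drop the rest.
    events.sort_by(|a, b| {
        (a.starttime, &a.data, b.duration).cmp(&(b.starttime, &b.data, a.duration))
    });
    events.dedup_by(|next, kept| next.starttime == kept.starttime && next.data == kept.data);

    // 2. Merge runs of consecutive events with the same status.
    let mut merged: Vec<Event> = Vec::new();
    for e in events {
        if let Some(prev) = merged.last_mut() {
            if prev.data == e.data {
                // Extend the kept event to cover the absorbed one.
                prev.duration = (e.starttime + e.duration) - prev.starttime;
                continue;
            }
        }
        merged.push(e);
    }

    // 3. Reconstruct durations from the next event's starttime (the
    //    last event keeps whatever duration it already has).
    for i in 0..merged.len().saturating_sub(1) {
        merged[i].duration = merged[i + 1].starttime - merged[i].starttime;
    }
    merged
}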

Happy to provide more details or test patches.
