
Read transaction stays open indefinitely under high write load (renewingRefCount stuck) #684

@kriszyp


Under high write throughput in a sharded cluster, read transactions fail to close, and the following error is logged every second until someone intervenes manually:

[error]: Read transaction detected that has been open too long (over one minute)
Txn {
  address: 140427120637888,
  timerTracked: true,
  refCount: 1,
  renewingRefCount: 1,  ← stuck
  notCurrent: true,
  openTimer: 58
}

The renewingRefCount: 1, notCurrent: true combination suggests the transaction renewal cycle is failing to complete — the transaction knows it's stale but can't close itself because of a pending renewer.
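To make the suspected state concrete, here is a minimal sketch of a check for it. The `TxnSnapshot` interface and the `looksStuckOnRenewal` helper are hypothetical names invented for illustration; only the field names are taken from the log output above, and their exact semantics inside the storage engine are an assumption.

```typescript
// Hypothetical shape mirroring the fields printed in the error log.
interface TxnSnapshot {
  address: number;
  timerTracked: boolean;
  refCount: number;
  renewingRefCount: number;
  notCurrent: boolean;
  openTimer: number;
}

// Hypothetical predicate: the transaction is marked stale (notCurrent)
// but is still pinned by at least one pending renewer.
function looksStuckOnRenewal(txn: TxnSnapshot): boolean {
  return txn.notCurrent && txn.renewingRefCount > 0;
}

// Values copied from the reported log entry.
const reported: TxnSnapshot = {
  address: 140427120637888,
  timerTracked: true,
  refCount: 1,
  renewingRefCount: 1,
  notCurrent: true,
  openTimer: 58,
};

const stuck: boolean = looksStuckOnRenewal(reported);
```

A check like this could be used to tag the "open too long" log line, so that stuck-renewer cases can be told apart from legitimately long reads.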

Reproduction context

  • 5-node sharded cluster, 50k req/s write throughput
  • 1 KB record size with blobs
  • Seen on 4.6.0-alpha.3

Related

  • HarperFast/harper PR #304 — "make read txn timeout configurable and set default to 1min" (in review) adds a configurable timeout, but doesn't appear to address the stuck-renewer root cause.

Acceptance criteria

  • Read transactions close or are forcibly recycled even when renewingRefCount stays positive.
  • Under high write load, the "open too long" error storm doesn't recur.
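One possible reading of the first criterion is a periodic sweep that recycles any read transaction past a deadline, regardless of its renewer count. The sketch below is a simplified model, not the engine's actual code: `TrackedTxn`, `sweepExpired`, `openedAt`, and `closed` are hypothetical names, and the 60-second limit is taken from the "over one minute" wording in the error message.

```typescript
// Hypothetical bookkeeping record for an open read transaction.
interface TrackedTxn {
  refCount: number;
  renewingRefCount: number;
  notCurrent: boolean;
  openedAt: number; // milliseconds, same clock as `now` below
  closed: boolean;
}

// Assumed limit, based on "over one minute" in the reported error.
const MAX_OPEN_MS = 60_000;

// Force-close every open transaction older than maxOpenMs,
// even when renewingRefCount is still positive.
function sweepExpired(
  txns: TrackedTxn[],
  now: number,
  maxOpenMs: number = MAX_OPEN_MS,
): TrackedTxn[] {
  const recycled: TrackedTxn[] = [];
  for (const txn of txns) {
    if (txn.closed) continue;
    if (now - txn.openedAt > maxOpenMs) {
      // Drop all outstanding references so the stale snapshot can be released.
      txn.refCount = 0;
      txn.renewingRefCount = 0;
      txn.closed = true;
      recycled.push(txn);
    }
  }
  return recycled;
}
```

A real fix would also need to ensure that no reader is still using the snapshot when it is recycled; this model leaves that out and only illustrates that the deadline, not the renewer count, decides when the transaction goes away.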

🤖 Filed by Claude on behalf of Kris.

Metadata

Labels

  • area:storage (Storage engine, LMDB/RocksDB, compaction)
  • bug (Something isn't working)
  • from-jira (Migrated or originated from a Jira ticket)
