Update: Initial Analysis & Benchmark Plan

After reviewing the concurrency issue more carefully, here's my current thinking on the three questions:

Q1: asyncio.Lock vs SQLite timeout

Leaning towards … Rationale: stdlib …

```python
# Proposed pattern
async with self._write_lock:
    conn.executemany(...)
    conn.commit()
    # audit runs here, same connection, same lock
    self._audit_sliding_window(conn)
```

Q3: Audit isolation

Planning to try a separate read-only connection for the sliding window audit:

```python
import sqlite3

# Write connection (locked)
write_conn = sqlite3.connect(db_path, timeout=5)
# Read-only connection (no lock needed, WAL allows concurrent reads)
read_conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
```

In WAL mode, this means the audit query doesn't block writes, and writes don't block the audit.

Next Steps

I'll put together a minimal benchmark script comparing:

Will share the results here once I have numbers.

@Ikalus1988, does this direction align with the project's architecture? Specifically, is the separate read-only connection pattern acceptable given the zero-dep constraint?
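A quick way to sanity-check the Q3 claim before benchmarking is a minimal sketch like the one below. It assumes WAL mode and a hypothetical `events` table; the function name and schema are illustrative, not taken from the PR. It opens the write connection with `timeout=5` as in the snippet above, opens a read-only connection through a `file:` URI with `mode=ro`, leaves a write transaction uncommitted, and then queries from the reader. Building the URI with `Path.as_uri()` is a swap from the f-string above: it percent-encodes characters such as `?` or `#` that SQLite would otherwise parse as URI syntax.

```python
from __future__ import annotations

import os
import sqlite3
import tempfile
from pathlib import Path


def demo_wal_reader_during_write() -> tuple[int, int]:
    """Count rows from a read-only connection while a write transaction is open."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "telemetry.db")

        # Hypothetical schema: one row per telemetry event.
        write_conn = sqlite3.connect(db_path, timeout=5)
        write_conn.execute("PRAGMA journal_mode=WAL")
        write_conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
        write_conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])
        write_conn.commit()

        # Read-only connection; as_uri() needs an absolute path and percent-encodes it.
        read_uri = Path(db_path).resolve().as_uri() + "?mode=ro"
        read_conn = sqlite3.connect(read_uri, uri=True)

        try:
            # Take the write lock and insert a row without committing it.
            write_conn.execute("BEGIN IMMEDIATE")
            write_conn.execute("INSERT INTO events (payload) VALUES ('c')")

            # WAL: the reader is not blocked and sees only committed rows.
            before = read_conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

            write_conn.commit()
            after = read_conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
        finally:
            read_conn.close()
            write_conn.close()
        return before, after
```

Calling `demo_wal_reader_during_write()` should return `(2, 3)`: the reader is not blocked by the open write transaction and sees only committed rows until the writer commits. The sketch uses only the stdlib `sqlite3`, `tempfile`, and `pathlib` modules.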
Great write-up, @zsxh1990. These are exactly the questions I've been turning over since the merge. A few thoughts from the maintainer perspective:

On asyncio.Lock vs sqlite3 timeout: I lean toward an …

On connection lifecycle: Single persistent connection per consumer instance. The overhead of opening and closing a connection per batch becomes non-trivial under the multi-node telemetry throughput we're targeting for federation mode. The consumer start/shutdown lifecycle is well-defined, so we can open at …

On audit isolation: Personally, I'd keep them on the same connection but use a separate cursor. A read-only connection adds another moving part to the lifecycle without clear benefit here: the audit is a fast local query (last 10 rows) that shouldn't meaningfully contend with the batch write. If we see contention in production, separating reads is a cheap refactor later.

That said, these are my opinions from reading the code; actual production profiling would tell a more complete story. Alternative perspectives and benchmark data from anyone running similar patterns are always welcome.

Thanks again for kicking off this discussion. This is exactly the kind of collaborative engineering I want to see more of in the community.
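The persistent-connection and separate-cursor suggestions can be combined with the `_write_lock` from the proposed pattern earlier in the thread. The sketch below is one hypothetical shape for that, not the PR's implementation: `TelemetryConsumer`, `write_batch`, `_write_and_audit`, and the `events` schema are illustrative names, and it assumes Python 3.9+ for `asyncio.to_thread`. Because `sqlite3` calls block, they run in a worker thread while the lock is held; that is why the connection is opened with `check_same_thread=False`, and why the lock keeps the batch write, the audit, and `shutdown()` from overlapping.

```python
from __future__ import annotations

import asyncio
import sqlite3


class TelemetryConsumer:
    """Hypothetical consumer holding one connection from start() to shutdown()."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        self._conn: sqlite3.Connection | None = None
        self._write_lock = asyncio.Lock()  # serializes batch writes and shutdown

    async def start(self) -> None:
        # check_same_thread=False because the blocking calls run in worker threads;
        # the asyncio.Lock ensures only one thread uses the connection at a time.
        self._conn = sqlite3.connect(self._db_path, timeout=5, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        self._conn.commit()

    def _write_and_audit(self, payloads: list[str]) -> list[str]:
        """Blocking part: batch insert, commit, then audit on a separate cursor."""
        assert self._conn is not None, "start() must be called first"
        self._conn.executemany(
            "INSERT INTO events (payload) VALUES (?)", [(p,) for p in payloads]
        )
        self._conn.commit()
        cur = self._conn.cursor()  # separate cursor, same connection
        try:
            # Sliding-window audit over the last 10 rows.
            rows = cur.execute(
                "SELECT payload FROM events ORDER BY id DESC LIMIT 10"
            ).fetchall()
        finally:
            cur.close()
        return [row[0] for row in rows]

    async def write_batch(self, payloads: list[str]) -> list[str]:
        async with self._write_lock:
            # Run the blocking sqlite3 calls off the event loop thread.
            return await asyncio.to_thread(self._write_and_audit, payloads)

    async def shutdown(self) -> None:
        async with self._write_lock:  # wait for any in-flight batch to finish
            if self._conn is not None:
                self._conn.close()
                self._conn = None
```

A caller would `await consumer.start()` once, `await consumer.write_batch([...])` per batch, and `await consumer.shutdown()` at the end; since `shutdown()` takes the same lock, it waits for an in-flight batch before closing. Note that `asyncio.Lock` only serializes coroutines on one event loop; writers in other processes would still rely on the `timeout=5` busy timeout.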
Context
While working on migrating synchronous audit logic into an async TelemetryPipeline consumer (PR #147), I hit an interesting concurrency pattern that I think is worth discussing.
The Problem
The pipeline uses an asyncio producer-consumer pattern:
The issue: under high-throughput multi-node emulation, the `_drain_queue()` call (flushing remaining events during shutdown) races with the consumer loop's batch-write + audit cycle, causing sporadic SQLite lock contention.

Questions for the Community
1. asyncio.Lock vs sqlite3 timeout: Is it better to gate the entire consumer loop with an `asyncio.Lock`, or rely on SQLite's built-in `timeout` parameter for WAL mode concurrency?
2. Connection lifecycle: Should the consumer create one persistent connection (opened at `start()`, closed at `shutdown()`), or open/close per batch? The current implementation opens per batch for safety, but this adds overhead.
3. Audit isolation: Should the sliding window audit run in the same SQLite connection as the batch write, or use a separate read-only connection to avoid blocking writes?
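The race and the trade-off in question 1 can be modeled with a small, self-contained script. This is a hypothetical model, not the PR's code: `batch_write`, `run`, `init_db`, and the `events` table are illustrative; each batch opens its own connection (mirroring the current open-per-batch implementation); and `asyncio.sleep(hold)` stands in for any `await` that happens while a write transaction is open. `BEGIN IMMEDIATE` takes the write lock at the start of the transaction so the contention is deterministic.

```python
from __future__ import annotations

import asyncio
import os
import sqlite3
import tempfile


def init_db(db_path: str) -> None:
    """Create a hypothetical `events` table in WAL mode."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.commit()
    conn.close()


async def batch_write(db_path: str, rows: list[str], hold: float, timeout: float) -> str:
    """Open-per-batch write; `hold` stands in for an await inside the transaction."""
    conn = sqlite3.connect(db_path, timeout=timeout)
    try:
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        conn.executemany("INSERT INTO events (payload) VALUES (?)", [(r,) for r in rows])
        await asyncio.sleep(hold)  # other coroutines run here while the lock is held
        conn.commit()
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)  # typically "database is locked"
    finally:
        conn.close()  # any uncommitted work is rolled back


async def run(db_path: str, use_lock: bool, timeout: float) -> list[str]:
    lock = asyncio.Lock()

    async def guarded(rows: list[str], hold: float) -> str:
        if not use_lock:
            return await batch_write(db_path, rows, hold, timeout)
        async with lock:
            return await batch_write(db_path, rows, hold, timeout)

    consumer = guarded(["batch"], hold=0.05)  # consumer loop: batch write + audit
    drain = guarded(["leftover"], hold=0.0)  # _drain_queue() during shutdown
    return list(await asyncio.gather(consumer, drain))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "telemetry.db")
        init_db(path)
        print(asyncio.run(run(path, use_lock=False, timeout=0.0)))  # ['ok', 'database is locked']
        print(asyncio.run(run(path, use_lock=False, timeout=0.2)))  # ['ok', 'database is locked']
        print(asyncio.run(run(path, use_lock=True, timeout=0.0)))  # ['ok', 'ok']
```

Under these assumptions, raising the busy timeout does not rescue the drain: the waiting connection blocks the event loop thread, so the consumer holding the write lock cannot resume to commit until the timeout expires. The busy timeout is still useful when the competing writer lives in another thread or process, which suggests combining an `asyncio.Lock` for in-process coroutines with a nonzero `timeout` for everything else.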
Environment
Looking forward to hearing how others have solved similar patterns in agent telemetry or observability pipelines.