Asyncio + SQLite concurrency patterns in agent telemetry pipelines #152

zsxh1990 · 2026-06-05T03:51:18Z

Context

While working on migrating synchronous audit logic into an async TelemetryPipeline consumer (PR #147), I hit an interesting concurrency pattern that I think is worth discussing.

The Problem

The pipeline uses an asyncio producer-consumer pattern:

Producer: search queries emit telemetry events to an async queue
Consumer: batch-writes events to SQLite every 1s or every 10 events
Audit: sliding window check runs after each batch write

The issue: under high-throughput multi-node emulation, the _drain_queue() call (flushing remaining events during shutdown) races with the consumer loop's batch-write + audit cycle, causing sporadic SQLite lock contention.

async def _consumer_loop(self):
    while not self._shutdown_event.is_set():
        batch = await self._drain_queue()
        if batch:
            self._sync_write(batch)
            self._audit_sliding_window()  # ← races with drain
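
For readers who want to reproduce the setup, the "every 1s or every 10 events" batching rule can be sketched with nothing but asyncio.Queue and a deadline. This is a minimal sketch, not the project's _drain_queue(): the names collect_batch, max_items and max_wait are hypothetical and chosen only for illustration.

```python
import asyncio


async def collect_batch(
    queue: asyncio.Queue, max_items: int = 10, max_wait: float = 1.0
) -> list:
    """Return up to max_items events, waiting at most max_wait seconds overall.

    Hypothetical helper: it only mirrors the batching rule described above.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait
    batch = []
    while len(batch) < max_items:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # time budget used up: flush whatever we have
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # queue stayed empty until the deadline
    return batch
```

The batch is returned as soon as either limit is hit, so a quiet queue yields small (possibly empty) batches and a busy one yields full batches without waiting out the full interval.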

Questions for the Community

1. asyncio.Lock vs sqlite3 timeout: Is it better to gate the entire consumer loop with an asyncio.Lock, or rely on SQLite's built-in timeout parameter for WAL mode concurrency?
2. Connection lifecycle: Should the consumer create one persistent connection (opened at start(), closed at shutdown()), or open/close per batch? The current implementation opens per batch for safety, but this adds overhead.
3. Audit isolation: Should the sliding window audit run in the same SQLite connection as the batch write, or use a separate read-only connection to avoid blocking writes?

Environment

Python 3.11+, asyncio only (no threading)
SQLite with WAL mode enabled
stdlib only constraint (no aiosqlite)

Looking forward to hearing how others have solved similar patterns in agent telemetry or observability pipelines.

zsxh1990 (Author) · 2026-06-05T04:04:14Z

Update: Initial Analysis & Benchmark Plan

After reviewing the concurrency issue more carefully, here's my current thinking on the three questions:

Q1: asyncio.Lock vs SQLite timeout

Leaning towards asyncio.Lock to protect the entire write + audit cycle.

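Rationale: stdlib sqlite3 is synchronous. Even in WAL mode, the timeout parameter only controls how long a connection waits for a lock held elsewhere before giving up with an error; it does not limit how long the holder keeps that lock. In an async context, a blocking conn.commit() called directly on the event loop thread stalls every other coroutine until it returns. Moving it into run_in_executor would keep the loop responsive, but only by adding a worker thread, which the no-threading constraint rules out.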
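
The scope of the timeout parameter can be checked with two connections to the same file. The following is a minimal sketch under stated assumptions: second_writer_outcome is a hypothetical helper, the events table is made up, and the 0.1 s timeout is chosen only to keep the demo short.

```python
import sqlite3


def second_writer_outcome(db_path: str, timeout: float = 0.1) -> str:
    """Hold the write lock on one connection, then try to write from another."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # transaction boundaries below are exactly the ones written out by hand.
    holder = sqlite3.connect(db_path, isolation_level=None)
    holder.execute("PRAGMA journal_mode=WAL")
    holder.execute("CREATE TABLE IF NOT EXISTS events (payload TEXT)")
    holder.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
    holder.execute("INSERT INTO events VALUES ('held')")

    waiter = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        # Blocks the calling thread for up to `timeout` seconds.
        waiter.execute("INSERT INTO events VALUES ('blocked')")
        return "written"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        waiter.close()
        holder.execute("ROLLBACK")
        holder.close()
```

The second connection waits for the timeout and then fails with a "database is locked" error, while the first connection keeps its lock for as long as it likes. Note that the wait itself is a blocking call, so on the event loop thread it would stall the loop for its whole duration.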

# Proposed pattern
async with self._write_lock:
    conn.executemany(...)
    conn.commit()
    # audit runs here, same connection, same lock
    self._audit_sliding_window(conn)
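
To see what the lock buys, here is a runnable sketch of the same idea with two callers competing for the cycle. Everything in it is an assumption made for illustration: LockedCycle, write_and_audit, the events table and the trace list are hypothetical, and asyncio.sleep(0) stands in for any yield point that might appear between the write and the audit.

```python
import asyncio
import sqlite3


class LockedCycle:
    """Hypothetical stand-in for the consumer's write + audit cycle."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute("CREATE TABLE IF NOT EXISTS events (payload TEXT)")
        self._write_lock = asyncio.Lock()
        self.trace: list[str] = []  # order of steps, recorded for inspection

    async def write_and_audit(self, caller: str, batch: list[str]) -> int:
        async with self._write_lock:
            self.trace.append(f"{caller}:write")
            self._conn.executemany(
                "INSERT INTO events (payload) VALUES (?)", [(e,) for e in batch]
            )
            self._conn.commit()
            # Stand-in for a yield point inside the cycle; without the lock,
            # another task could start its own write right here.
            await asyncio.sleep(0)
            self.trace.append(f"{caller}:audit")
            return self._conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


async def demo() -> tuple[list[str], list[int]]:
    cycle = LockedCycle()
    counts = await asyncio.gather(
        cycle.write_and_audit("loop", ["a", "b"]),   # regular consumer cycle
        cycle.write_and_audit("drain", ["c"]),       # shutdown flush
    )
    return cycle.trace, list(counts)
```

One caveat on the design: if the locked body contains no await at all, it already runs without interleaving on a single event loop, so the lock only matters once a yield point exists inside the cycle. It also does nothing to make the sqlite3 calls themselves non-blocking.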

Q3: Audit isolation

Planning to try a separate read-only connection for the sliding window audit:

# Write connection (locked)
write_conn = sqlite3.connect(db_path, timeout=5)

# Read-only connection (no lock needed, WAL allows concurrent reads)
read_conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

This way, the audit query doesn't block writes, and writes don't block the audit.
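
A slightly fuller sketch of the two-connection setup, assuming a made-up events table and hypothetical helper names (open_write_and_read, count_events). Building the URI with pathlib is a swapped-in detail rather than something from the snippet above; it percent-encodes characters such as '?' or '#' that would otherwise break a hand-formatted file: URI.

```python
import sqlite3
from pathlib import Path


def open_write_and_read(
    db_path: str,
) -> tuple[sqlite3.Connection, sqlite3.Connection]:
    """Open a read-write connection plus a read-only one on the same file."""
    write_conn = sqlite3.connect(db_path, timeout=5)
    write_conn.execute("PRAGMA journal_mode=WAL")
    write_conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    write_conn.commit()

    # The database file must already exist: mode=ro never creates it.
    read_uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    read_conn = sqlite3.connect(read_uri, uri=True)
    return write_conn, read_conn


def count_events(conn: sqlite3.Connection) -> int:
    """Example read query; the real audit query would go here."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

With this setup the reader sees only committed rows: an INSERT on write_conn that has not been committed yet is invisible to count_events(read_conn), and the read does not wait for the open write transaction to finish.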

Next Steps

I'll put together a minimal benchmark script comparing:

Per-batch connection (current)
Persistent connection + asyncio.Lock
Persistent connection + separate read-only for audit

Will share the results here once I have numbers.
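
As a possible starting point for that script, here is a minimal timing scaffold for the first two variants only. It is a sketch under assumptions: the helper names, the events table and the batch shape are all hypothetical, the asyncio.Lock and read-only audit variants are left out, and no timings are implied.

```python
import sqlite3
import time

INSERT = "INSERT INTO events (payload) VALUES (?)"


def init_db(db_path: str) -> None:
    """Create the hypothetical events table and switch the file to WAL."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS events (payload TEXT)")
    conn.commit()
    conn.close()


def make_batches(n_batches: int, batch_size: int = 10) -> list[list[tuple[str]]]:
    return [
        [(f"event-{b}-{i}",) for i in range(batch_size)] for b in range(n_batches)
    ]


def time_per_batch(db_path: str, batches) -> float:
    """Open and close a connection for every batch."""
    start = time.perf_counter()
    for batch in batches:
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commit on success, roll back on error
                conn.executemany(INSERT, batch)
        finally:
            conn.close()
    return time.perf_counter() - start


def time_persistent(db_path: str, batches) -> float:
    """Reuse one connection for all batches."""
    start = time.perf_counter()
    conn = sqlite3.connect(db_path)
    try:
        for batch in batches:
            with conn:
                conn.executemany(INSERT, batch)
    finally:
        conn.close()
    return time.perf_counter() - start
```

Each variant should run against its own freshly initialised file, and repeated runs are advisable since a single pass is sensitive to disk caching.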

@Ikalus1988 — does this direction align with the project's architecture? Specifically, is the separate read-only connection pattern acceptable given the zero-dep constraint?


Ikalus1988 (Maintainer) · 2026-06-05T04:05:55Z

Great write-up @zsxh1990 — these are exactly the questions I've been turning over since the merge.

A few thoughts from the maintainer perspective:

On asyncio.Lock vs sqlite3 timeout:

I lean toward an asyncio.Lock wrapping the consumer cycle rather than relying on SQLite's timeout alone. WAL mode handles concurrent reads well, but the contention here is writer-side: _sync_write and _audit_sliding_window both hit the same connection under the hood. An asyncio.Lock gives us explicit gating at the task level without blocking the event loop, while SQLite timeout is a last-resort safety net.

On connection lifecycle:

Single persistent connection per consumer instance. The overhead of open/close per batch becomes non-trivial under the multi-node telemetry throughput we're targeting for federation mode. The consumer start/shutdown lifecycle is well-defined, so we can open at start() and close at shutdown() — just need to ensure WAL mode is set on the connection and proper checkpointing happens on shutdown.
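
A minimal sketch of that lifecycle, assuming a hypothetical PersistentWriter class and events table rather than the project's actual consumer. The choice of a TRUNCATE checkpoint at shutdown is also an assumption; other checkpoint modes may suit the project better.

```python
import sqlite3


class PersistentWriter:
    """Hypothetical writer that keeps one connection from start() to shutdown()."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        self._conn = None

    def start(self) -> None:
        self._conn = sqlite3.connect(self._db_path, timeout=5)
        # The PRAGMA returns the journal mode actually in effect.
        mode = self._conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        if mode.lower() != "wal":
            raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
        self._conn.execute("CREATE TABLE IF NOT EXISTS events (payload TEXT)")
        self._conn.commit()

    def write_batch(self, batch: list[str]) -> None:
        # The connection context manager commits on success and rolls back
        # on an exception; it does not close the connection.
        with self._conn:
            self._conn.executemany(
                "INSERT INTO events (payload) VALUES (?)", [(e,) for e in batch]
            )

    def shutdown(self) -> None:
        # Fold the WAL back into the main database file before closing.
        self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        self._conn.close()
        self._conn = None
```

WAL is a property of the database file and persists across connections, so the PRAGMA in start() mostly acts as a check after the first run.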

On audit isolation:

Personally I'd keep them on the same connection but use a separate cursor. A read-only connection adds another moving part to the lifecycle without clear benefit here — the audit is a fast local query (last 10 rows) that shouldn't meaningfully contend with the batch write. If we see contention in production, separating reads is a cheap refactor later.
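
The same-connection, separate-cursor variant could look like the sketch below. The schema, the helper names and the audit rule itself (at most max_errors "error" rows among the last 10) are all hypothetical; the thread does not say what the real sliding window checks.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory database with a made-up events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, status TEXT)")
    return conn


def write_then_audit(
    conn: sqlite3.Connection,
    statuses: list[str],
    window: int = 10,
    max_errors: int = 3,
) -> bool:
    """Write a batch, then audit the most recent rows on the same connection."""
    write_cur = conn.cursor()
    write_cur.executemany(
        "INSERT INTO events (status) VALUES (?)", [(s,) for s in statuses]
    )
    conn.commit()

    # A second cursor keeps the audit's result set apart from the write cursor.
    audit_cur = conn.cursor()
    audit_cur.execute(
        "SELECT status FROM events ORDER BY id DESC LIMIT ?", (window,)
    )
    recent = [row[0] for row in audit_cur]
    return recent.count("error") <= max_errors
```

Cursors on one connection share that connection's transaction state, so the second cursor separates result sets rather than providing isolation. Committing before the audit, as above, keeps the audit from running inside a half-finished write.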

That said, these are my opinions from reading the code — actual production profiling would tell a more complete story. Always welcome alternative perspectives or benchmark data from anyone running similar patterns.

Thanks again for kicking off this discussion — this is exactly the kind of collaborative engineering I want to see more of in the community.
