Environment
- asyncpg version: 0.31.0
- PostgreSQL version: 16 (also reproduces against a trivial in-process fake server — no real PostgreSQL needed)
- Do you use a PostgreSQL SaaS? Can you reproduce locally?: Original incident was self-hosted PostgreSQL over plaintext (no SSL). Yes — fully reproducible locally, both with a local PostgreSQL + network blackhole and with a self-contained fake server (below).
- Python version: 3.11.11 (CPython)
- Platform: Linux x86_64 (6.8.0)
- Do you use pgbouncer?: No
- Did you install asyncpg with pip?: Yes (binary wheel)
- If built locally, Cython version: N/A
- Reproduced under both asyncio and uvloop?: Yes, both. The affected code path is event-loop agnostic.
Summary
When a command times out (via command_timeout) or is cancelled, asyncpg sends a PostgreSQL CancelRequest by opening a brand-new connection in connect_utils._cancel. Neither that new connection (loop.create_connection, connect_utils.py:1279) nor the subsequent await pr.on_disconnect (connect_utils.py:1288) is bounded by any timeout. If the server is unreachable or wedged at that moment (network partition / blackhole), connection teardown hangs indefinitely — including Connection.close(timeout=N), which ignores its documented timeout.
Precise framing — what works vs. what doesn't
This is not a "command_timeout doesn't fire" bug. command_timeout works correctly: the query is aborted at the deadline.
The defect is in the cancel / teardown path:
command_timeout fires → the query coroutine correctly raises TimeoutError on time.
- asyncpg schedules a background cancel task (
_cancel_current_command, connection.py:1681) that opens a new connection to deliver the CancelRequest.
- That cancel connection is subject to no timeout at all.
- When you then close the connection,
close() blocks waiting for that in-flight cancel — and no timeout you can set (command_timeout, connect timeout, or close(timeout=)) bounds it.
Documented-behavior contradiction
Connection.close() docstring (connection.py:1502-1510):
Close the connection gracefully.
:param float timeout: Optional timeout value in seconds.
close(timeout=2) is documented to bound the graceful close. It does not — see repro: close(timeout=2) still hangs > 15s.
Self-contained reproduction (no PostgreSQL, no root, single file)
"""asyncpg query-cancel path has no timeout -> connection teardown hangs.
Zero deps beyond asyncpg: a tiny in-process fake server speaks just enough of the
PG v3 wire protocol to let asyncpg connect, stays silent on the query so
command_timeout fires, and holds the resulting CANCEL connection open forever.
conn.close(timeout=2) then hangs despite the explicit timeout.
pip install asyncpg # 0.31.0
python repro.py
"""
import asyncio, struct, time
import asyncpg
CODE_SSLREQUEST = 80877103
CODE_CANCELREQUEST = 80877102
def _param(k, v):
p = k + b"\x00" + v + b"\x00"
return struct.pack("!I", 4 + len(p)) + p
async def _read(reader):
(length,) = struct.unpack("!I", await reader.readexactly(4))
body = await reader.readexactly(length - 4)
return struct.unpack("!I", body[:4])[0]
async def _hold(writer):
try:
while True:
await asyncio.sleep(3600)
finally:
writer.close()
async def handler(reader, writer):
code = await _read(reader)
if code == CODE_SSLREQUEST: # asyncpg ssl='disable' -> won't hit this
writer.write(b"N"); await writer.drain()
code = await _read(reader)
if code == CODE_CANCELREQUEST:
# The brand-new connection asyncpg opened just to send CancelRequest.
# Accept it and never respond/close -> `await pr.on_disconnect` never resolves.
print(" [server] CANCEL connection received -> holding open forever", flush=True)
await _hold(writer)
return
# StartupMessage on the DATA connection: minimal handshake.
writer.write(b"R" + struct.pack("!II", 8, 0)) # AuthenticationOk
for k, v in [(b"server_version", b"16.0"), (b"client_encoding", b"UTF8"),
(b"server_encoding", b"UTF8"), (b"integer_datetimes", b"on")]:
writer.write(b"S" + _param(k, v)) # ParameterStatus
writer.write(b"K" + struct.pack("!III", 12, 4242, 9999)) # BackendKeyData
writer.write(b"Z" + struct.pack("!I", 5) + b"I") # ReadyForQuery
await writer.drain()
# Stay silent on the query (command_timeout fires), then close cleanly after 3s.
# Closing the DATA socket proves the later hang is the CANCEL conn, not this one.
await asyncio.sleep(3)
writer.close()
async def main():
server = await asyncio.start_server(handler, "127.0.0.1", 0)
host, port = server.sockets[0].getsockname()[:2]
conn = await asyncio.wait_for(
asyncpg.connect(host=host, port=port, user="x", password="x", database="x",
timeout=5, command_timeout=1, ssl="disable"),
timeout=10)
print("connected (command_timeout=1)", flush=True)
t0 = time.monotonic()
try:
await conn.fetchval("SELECT 1")
except asyncio.TimeoutError:
print(f"fetchval timed out at {time.monotonic()-t0:.1f}s (command_timeout works)", flush=True)
t1 = time.monotonic()
try:
await asyncio.wait_for(conn.close(timeout=2), timeout=15) # documented timeout
print(f"conn.close(timeout=2) returned after {time.monotonic()-t1:.1f}s", flush=True)
except asyncio.TimeoutError:
print("BUG: conn.close(timeout=2) HUNG > 15s despite explicit timeout=2", flush=True)
finally:
server.close()
if __name__ == "__main__":
asyncio.run(main())
Actual output (CPython 3.11.11 and uvloop, asyncpg 0.31.0):
connected (command_timeout=1)
fetchval timed out at 1.0s (command_timeout works)
[server] CANCEL connection received -> holding open forever
BUG: conn.close(timeout=2) HUNG > 15s despite explicit timeout=2
Note the DATA socket is closed cleanly after 3s, so the teardown hang is attributable solely to the held-open cancel connection.
Faithful reproduction (real PostgreSQL + true network blackhole)
The fake server above demonstrates the await pr.on_disconnect (line 1288) variant. A real network blackhole hits the loop.create_connection (line 1279) variant — the cancel connection's SYN is silently dropped:
docker run -d --name pg -e POSTGRES_HOST_AUTH_METHOD=trust postgres:16
IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pg)
# establish an asyncpg/SQLAlchemy connection to $IP, then blackhole it:
sudo iptables -I OUTPUT -d $IP -j DROP # silent drop, no RST
# issue a query/checkout -> command_timeout=5 fires, cancel connection's SYN is
# dropped -> tcp_syn_retries (~127s) -> total hang ~140s
sudo iptables -D OUTPUT -d $IP -j DROP # cleanup
Through SQLAlchemy 2.0.44 (create_async_engine(..., pool_pre_ping=True, connect_args={"command_timeout":5,"timeout":5})), session.connection() hangs 139.6s (the pre_ping BEGIN times out at 5s, then the cancel connection's SYN blackholes for ~127s, while SQLAlchemy's _terminate_graceful_close → close(timeout=2) is stuck behind it). This matches a production incident where a transient network partition caused ~140s stalls per DB write despite a 5s command_timeout.
Root cause
connect_utils.py:
async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
...
if params.ssl and params.sslmode != SSLMode.allow:
tr, pr = await _create_ssl_connection(CancelProto, *addr, loop=loop, ...) # no timeout
else:
tr, pr = await loop.create_connection(CancelProto, *addr) # line 1279, no timeout
_set_nodelay(_get_socket(tr))
msg = struct.pack('!llll', 16, 80877102, backend_pid, backend_secret)
try:
tr.write(msg)
await pr.on_disconnect # line 1288, no timeout
finally:
tr.close()
Both the connect and the on_disconnect wait are unbounded. The connection's command_timeout / connect timeout are not propagated here, and Connection.close(timeout=...) cannot interrupt it.
Relationship to existing issues (not a duplicate)
I'm happy to open a PR if this is accepted as a bug.
Environment
Summary
When a command times out (via
command_timeout) or is cancelled, asyncpg sends a PostgreSQLCancelRequestby opening a brand-new connection inconnect_utils._cancel. Neither that new connection (loop.create_connection, connect_utils.py:1279) nor the subsequentawait pr.on_disconnect(connect_utils.py:1288) is bounded by any timeout. If the server is unreachable or wedged at that moment (network partition / blackhole), connection teardown hangs indefinitely — includingConnection.close(timeout=N), which ignores its documentedtimeout.Precise framing — what works vs. what doesn't
This is not a "command_timeout doesn't fire" bug.
command_timeoutworks correctly: the query is aborted at the deadline.The defect is in the cancel / teardown path:
command_timeoutfires → the query coroutine correctly raisesTimeoutErroron time._cancel_current_command, connection.py:1681) that opens a new connection to deliver theCancelRequest.close()blocks waiting for that in-flight cancel — and no timeout you can set (command_timeout, connecttimeout, orclose(timeout=)) bounds it.Documented-behavior contradiction
Connection.close()docstring (connection.py:1502-1510):close(timeout=2)is documented to bound the graceful close. It does not — see repro:close(timeout=2)still hangs > 15s.Self-contained reproduction (no PostgreSQL, no root, single file)
Actual output (CPython 3.11.11 and uvloop, asyncpg 0.31.0):
Note the DATA socket is closed cleanly after 3s, so the teardown hang is attributable solely to the held-open cancel connection.
Faithful reproduction (real PostgreSQL + true network blackhole)
The fake server above demonstrates the
await pr.on_disconnect(line 1288) variant. A real network blackhole hits theloop.create_connection(line 1279) variant — the cancel connection's SYN is silently dropped:Through SQLAlchemy 2.0.44 (
create_async_engine(..., pool_pre_ping=True, connect_args={"command_timeout":5,"timeout":5})),session.connection()hangs 139.6s (the pre_pingBEGINtimes out at 5s, then the cancel connection's SYN blackholes for ~127s, while SQLAlchemy's_terminate_graceful_close→close(timeout=2)is stuck behind it). This matches a production incident where a transient network partition caused ~140s stalls per DB write despite a 5scommand_timeout.Root cause
connect_utils.py:Both the connect and the
on_disconnectwait are unbounded. The connection'scommand_timeout/ connecttimeoutare not propagated here, andConnection.close(timeout=...)cannot interrupt it.Relationship to existing issues (not a duplicate)
ConnectionErrorthat escapes. This report is about the cancel connection not failing at all — hanging because it has no timeout. ConnectionError from _cancel() during CancelledError not caught, crashes callers #1310's suggestedexcept OSErrorcannot help a hang (there is nothing to catch). A complete fix for_cancelshould address both (wrong exception type and missing timeout).command_timeoutset, yet__rollbackhangs — likely the same root cause via the transaction rollback path), Upon postgres segfault, connections inasync for record in conn.cursorfail to detect loss of connectivity and hang indefinitely #549 (hang on connectivity loss).I'm happy to open a PR if this is accepted as a bug.