Background
The "wait events for server logging" patch (thread) is one instance of a broader pattern: a backend blocks on something, but pg_stat_activity.wait_event is NULL, so it reads as "on CPU" and the stall is invisible. This issue catalogs the similar gaps so they can be tracked/prioritized.
Confirmed against the current tree: src/backend/libpq/auth.c and contrib/postgres_fdw contain zero pgstat_report_wait_start / WAIT_EVENT_* instrumentation.
Genuine gaps — backend blocks but reports NULL
- Outbound libpq / FDW network waits. postgres_fdw, dblink, and libpqwalreceiver block on a remote server through libpq with no wait event; there is no ClientRead/ClientWrite equivalent for outbound connections. A backend stuck on a slow foreign server looks like it is busy on CPU. Likely the biggest real-world gap.
- Authentication. auth.c is entirely uninstrumented: LDAP (ldap_search), PAM conversation, RADIUS socket, Kerberos/GSSAPI, SSPI, plus DNS (getaddrinfo) and reverse lookups during pg_hba matching. These block on external services, often for seconds. (The gap exists partly because auth runs before stats are fully set up.)
- TLS handshake. SSL_accept and cert/CRL loading during connection setup. The negotiation itself is not covered by a wait event; only steady-state ClientRead/ClientWrite are.
- archive_command / restore_command. The archiver/startup process shells out via system() and blocks for the entire external command with no wait event. Recovery stalled on a slow restore_command is invisible.
- Non-VFD file ops. Directory scans and metadata syscalls outside the smgr/File layer: opendir/readdir over pg_wal and tablespaces; unlink/rename/stat during startup and checkpoint segment recycling; config-file and pg_hba/SSL-file reads on reload. Most are not wrapped in a wait event.
- The syslogger's own disk writes. write_syslogger_file() calls fwrite() on the on-disk log file, and there is also rotation (fopen/fclose) and the current_logfiles metainfo write. This is where a slow log device actually blocks hardest. Caveat: the syslogger does not attach to shared memory and has no pg_stat_activity entry, so it currently cannot report wait events at all; fixing this needs a bigger change.
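Most of the gaps above share one shape: a single blocking call with nothing reported around it. The following is a minimal standalone sketch of the bracketing pattern, not PostgreSQL code. Every name in it (current_wait_event, sketch_wait_start, sketch_wait_end, call_with_wait_event, fake_restore_command, and the "RestoreCommandSketch" label) is hypothetical and only stands in for the real reporting calls such as pgstat_report_wait_start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-process slot; NULL means "no wait reported". */
static const char *current_wait_event = NULL;

static void sketch_wait_start(const char *event) { current_wait_event = event; }
static void sketch_wait_end(void) { current_wait_event = NULL; }

/* What an outside observer of the slot would see right now. */
static int
sketch_wait_event_is(const char *name)
{
    return current_wait_event != NULL &&
           strcmp(current_wait_event, name) == 0;
}

/* Any call that may block: a shell-out, an LDAP search, a TLS handshake. */
typedef int (*blocking_call) (void *arg);

/* Advertise a wait event for exactly the duration of the blocking call. */
static int
call_with_wait_event(const char *event, blocking_call fn, void *arg)
{
    int rc;

    assert(current_wait_event == NULL);  /* this sketch does not nest waits */
    sketch_wait_start(event);
    rc = fn(arg);
    sketch_wait_end();
    return rc;
}

/* Stand-in for a blocking call: records whether the wait was visible. */
static int
fake_restore_command(void *arg)
{
    *(int *) arg = sketch_wait_event_is("RestoreCommandSketch");
    return 0;
}
```

Calling call_with_wait_event("RestoreCommandSketch", fake_restore_command, &visible) sets visible to 1 while the callback runs and leaves the slot NULL afterwards. The syslogger item is the exception: as noted above, it has nowhere to report to, so the bracketing alone would not help there.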
By-design NULL, but arguably the same problem
- Pure CPU work (sort, hash, aggregation, expression/qual eval, (de)compression). The backend is genuinely on CPU, so NULL is "correct", but it is indistinguishable from an uninstrumented wait. This ambiguity is the meta-gap behind all of the above.
- Memory pressure: malloc/mmap/page faults/swap-in. The backend is blocked in the kernel and reports NULL. This is hard to capture from userspace.
- Spinlocks. LWLocks are instrumented; spinlock spin-delays under contention are not. This is by design, since spinlocks are meant to be held only briefly, but pathological cases hide here.
Structural root cause
NULL is overloaded: it means both "on CPU" and "blocked somewhere we didn't instrument." Two recurring directions worth pursuing:
- Give auxiliary/early processes (syslogger, archiver, some auth contexts) real backend-status entries so they can report.
- Add a way to positively signal "actually running" so NULL stops being a catch-all.
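To make the second direction concrete, here is a minimal standalone sketch of a three-state activity slot in which "running" is reported explicitly rather than inferred from the absence of a wait event. It is only an illustration of the idea; ActivityState, ActivitySlot, and the activity_* helpers are hypothetical names and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ActivityState
{
    ACTIVITY_UNKNOWN = 0,       /* nothing reported: CPU or uninstrumented stall */
    ACTIVITY_RUNNING,           /* code positively declared itself on CPU */
    ACTIVITY_WAITING            /* blocked, with a named wait event */
} ActivityState;

typedef struct ActivitySlot
{
    ActivityState state;
    const char *wait_event;     /* non-NULL only in ACTIVITY_WAITING */
} ActivitySlot;

static void
activity_set_running(ActivitySlot *slot)
{
    slot->state = ACTIVITY_RUNNING;
    slot->wait_event = NULL;
}

static void
activity_set_waiting(ActivitySlot *slot, const char *event)
{
    assert(event != NULL);
    slot->state = ACTIVITY_WAITING;
    slot->wait_event = event;
}

static void
activity_clear(ActivitySlot *slot)
{
    slot->state = ACTIVITY_UNKNOWN;
    slot->wait_event = NULL;
}

/* Text a monitoring view could show instead of a bare NULL. */
static const char *
activity_describe(const ActivitySlot *slot)
{
    switch (slot->state)
    {
        case ACTIVITY_WAITING:
            return slot->wait_event;
        case ACTIVITY_RUNNING:
            return "running";
        case ACTIVITY_UNKNOWN:
            break;
    }
    return "unknown";
}
```

With a slot like this, instrumented CPU-heavy paths would call activity_set_running(), and anything still showing "unknown" would point at code that has not been instrumented either way.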
There is a generic Extension wait event that extensions can adopt, but in-core libpq users like postgres_fdw don't, which is why the first gap above (outbound libpq waits) stays dark.
Catalog compiled while reviewing the server-logging wait-events patch; each item is a candidate for its own focused patch.