Skip to content

Do not attribute logical decoding aborts to xact_rollback#28

Open
NikolayS wants to merge 3 commits intomasterfrom
fix/xact-rollback-spike-logical-decoding
Open

Do not attribute logical decoding aborts to xact_rollback#28
NikolayS wants to merge 3 commits intomasterfrom
fix/xact-rollback-spike-logical-decoding

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

@NikolayS NikolayS commented Apr 17, 2026

Summary

v2 — reworked after a thorough synthetic review pass. Key shape changes vs. v1:

  • API: the fix no longer exposes pgstat_cancel_xact_rollback() as a "decrement the counter" helper. It now uses a module-local flag in pgstat_database.c (pgstat_begin_internal_xact_cleanup() / pgstat_end_internal_xact_cleanup()), which AtEOXact_PgStat_Database() reads to skip the bump entirely. No more bump-and-undo.
  • Covers more sites: also fixes SnapBuildClearExportedSnapshot() (same-shape bug, hit on replication-snapshot cleanup).
  • Test: folded into src/test/subscription/t/100_bugs.pl as a new bug section (not a new file). Asserts a delta-from-baseline (not a literal 0 after pg_stat_reset), runs only 5 INSERTs, and polls walsender-gone filtered by application_name.
  • Commit message: "Do not …" subject, Backpatch-through: 15 trailer, no Co-Authored-By in the PG commit body.

Fixes report CAG0ozMo_xWQn+Avv8jzbbhePGp5OnhdO+YWTkdg4faWSXz0Jzg@mail.gmail.com (Rafael Thofehrn Castro, 2024-06-14), still present in master as of 2026-04-16. See also postgres-ai/tests-and-benchmarks#39 for the production-incident context.

Root cause

  • ReorderBufferProcessTXN() ends each decoded transaction with AbortCurrentTransaction() for catalog cleanup.
  • In the walsender (using_subtxn == false, entered via StartTransactionCommand()), that abort is top-level, so AtEOXact_PgStat_Database(isCommit=false) bumps the backend-local pgStatXactRollback. Counts flush to shared stats on walsender exit → the spike.
  • In the pg_logical_slot_get_changes() path (using_subtxn == true), the abort is of an internal subtransaction and goes through AtEOSubXact_PgStat, which never touches the counter.

Fix

  • New module-local pgStatInternalXactCleanup flag in pgstat_database.c, set/cleared by two public helpers.
  • AtEOXact_PgStat_Database() early-returns when the flag is set (same slot as the existing parallel skip).
  • Two call sites in ReorderBufferProcessTXN() bracket AbortCurrentTransaction() with the begin/end pair, gated on !using_subtxn.
  • SnapBuildClearExportedSnapshot() gets the same bracket (unconditionally — that path is always top-level).
  • applyparallelworker.c:1458 is intentionally left alone: that is a real user-visible rollback on the subscriber side.

Reproduction evidence

Two postgres:17 Docker containers:

                            xact_commit | xact_rollback
baseline (after reset)     |          1 |             0
after 1000 autocommit INS  |       1001 |             0    <- walsender still running
after ALTER SUBSCRIPTION   |       1005 |          1000    <- walsender gone, spike

With the patch, the post-DISABLE rollback value stays at baseline.

TAP test (RED / GREEN)

New section in src/test/subscription/t/100_bugs.pl: 5 autocommit INSERTs, wait_for_catchup, ALTER SUBSCRIPTION xrb_sub DISABLE, poll walsender-gone filtered by application_name='xrb_sub', assert cmp_ok($final - $baseline, '==', 0).

  • Against unpatched master: got: 5 / expected: 0 — FAIL.
  • With the patch: PASS.

Test plan

  • meson test --suite subscription --suite test_decoding — 42/42 + 3/3 pass (incl. the new 100_bugs.pl section).
  • meson test postgresql:recovery/006_logical_decoding postgresql:recovery/010_logical_decoding_timelines postgresql:pg_basebackup/030_pg_recvlogical — pass.
  • meson test postgresql:recovery/029_stats_restart postgresql:pg_stat_statements/regress — pass.
  • meson test postgresql:regress/regress — 248/248 pass.

Review trail

  • v1 review round — three agents, verdicts: needs revision / accept with minor / needs work. Consensus blockers: API shape (don't expose pgStatXactRollback through a public helper), > 0 silent-flooring guard, too-tight is(…, '0') assertion, missing fix in snapbuild.c. All addressed here.

🤖 Generated with Claude Code

@NikolayS NikolayS force-pushed the fix/xact-rollback-spike-logical-decoding branch from 26e23d8 to 2c80918 Compare April 17, 2026 02:53
@NikolayS NikolayS changed the title Don't attribute logical decoding aborts to xact_rollback Do not attribute logical decoding aborts to xact_rollback Apr 17, 2026
@NikolayS NikolayS force-pushed the fix/xact-rollback-spike-logical-decoding branch 3 times, most recently from c992cfb to 11b70a0 Compare April 17, 2026 03:44
@NikolayS NikolayS force-pushed the fix/xact-rollback-spike-logical-decoding branch 2 times, most recently from f58166e to 130c588 Compare April 22, 2026 17:37
@NikolayS NikolayS force-pushed the fix/xact-rollback-spike-logical-decoding branch from 130c588 to e4c9697 Compare April 22, 2026 18:19
Logical decoding aborts the current transaction after decoding committed
transactions to clean up catalog access and other transaction-local state. In a
logical walsender this can be a top-level abort, so
pg_stat_database.xact_rollback is incremented even though no user-visible
transaction rolled back.

Keep these internal cleanup aborts out of xact_rollback by adding
AbortCurrentTransactionWithoutXactStats(), a narrow wrapper around
AbortCurrentTransaction() that suppresses only the pg_stat_database
xact_commit/xact_rollback counter update while preserving the rest of
transaction cleanup.

Add a TAP test that fails without the fix: five committed transactions decoded
by a subscription produce a publisher xact_rollback delta of 5 when the
walsender exits. With the fix, the delta remains 0.

Reported-by: Rafael Thofehrn Castro
Discussion: https://postgr.es/m/CAG0ozMo_xWQn%2BAvv8jzbbhePGp5OnhdO%2BYWTkdg4faWSXz0Jzg%40mail.gmail.com
@NikolayS NikolayS force-pushed the fix/xact-rollback-spike-logical-decoding branch from e4c9697 to 72a73f5 Compare April 30, 2026 03:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant