|
| 1 | +Originally from: [tweet](https://twitter.com/samokhvalov/status/1721036531580973494), [LinkedIn post](). |
| 2 | + |
| 3 | +--- |
| 4 | + |
| 5 | +# How to break a database, Part 2: Simulate infamous transaction ID wraparound |
| 6 | + |
| 7 | +> I post a new PostgreSQL "howto" article every day. Join me in this |
| 8 | +> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share! |
| 9 | +
|
| 10 | +See also [Part 1: How to Corrupt](0039_how_to_break_a_database_part_1_how_to_corrupt.md). |
| 11 | + |
| 12 | +## Straightforward simulation method |
| 13 | + |
| 14 | +The method requires some time but works pretty well. It has been discussed in a couple of places: |
| 15 | + |
| 16 | +- My tweet [A question from Telegram chat](https://twitter.com/samokhvalov/status/1415575072081809409) |
| 17 | +- Article |
| 18 | + [How to simulate the deadly, infamous, misunderstood & complex ‘Transaction Wraparound Problem’ in PostgreSQL](https://fatdba.com/2021/07/20/how-to-simulate-the-deadly-transaction-wraparound-problem-in-postgresql/) |
| 19 | + |
| 20 | +First, open a "long-running transaction" (with normal XID assigned to it, as it would be a writing transaction) – and |
| 21 | +keep the session open (we'll use a named pipe a.k.a. FIFO to "hang" our `psql` session): |
| 22 | + |
| 23 | +```bash |
| 24 | +mkfifo dummy |
| 25 | + |
| 26 | +psql -Xc " |
| 27 | + set idle_in_transaction_session_timeout = 0; |
| 28 | + begin; |
| 29 | + select pg_current_xact_id() |
| 30 | +" \ |
| 31 | +-f dummy & |
| 32 | +``` |
| 33 | + |
| 34 | +Make sure the session is open and in transaction: |
| 35 | + |
| 36 | +```bash |
| 37 | +❯ psql -Xc "select state |
| 38 | + from pg_stat_activity |
| 39 | + where |
| 40 | + pid <> pg_backend_pid() |
| 41 | + and query ~ 'idle_in_tran' |
| 42 | +" |
| 43 | + state |
| 44 | +--------------------- |
| 45 | + idle in transaction |
| 46 | +(1 row) |
| 47 | +``` |
| 48 | + |
| 49 | +Now, having an open long-running transaction with assigned XID, we just need to shift transaction ID at high speed – we |
| 50 | +can use the same function for it, and `pgbench` with multiple connections: |
| 51 | + |
| 52 | +```bash |
| 53 | +pgbench -c8 -j8 -P60 -T36000 -rn \ |
| 54 | + -f - <<< 'select pg_current_xact_id()' |
| 55 | +``` |
| 56 | + |
| 57 | +This should move the current XID at very high pace (100-200k TPS). And since we have an open a long-running |
| 58 | +transaction, `autovacuum` cannot operate. |
| 59 | + |
| 60 | +While `pgbench` is running, we can observe the state of the database using a monitoring tool that includes XID/multiXID |
| 61 | +wraparound checks (every tool must, but not every tool does), or just some snippet – e.g. |
| 62 | +from [Managing Transaction ID Exhaustion (Wraparound) in PostgreSQL](https://crunchydata.com/blog/managing-transaction-id-wraparound-in-postgresql). |
| 63 | + |
| 64 | +After a few hours: |
| 65 | + |
| 66 | +``` |
| 67 | +WARNING: database "nik" must be vacuumed within 39960308 transactions |
| 68 | +HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. |
| 69 | +You might also need to commit or roll back old prepared transactions, or drop stale replication slots. |
| 70 | +``` |
| 71 | + |
| 72 | +And if we proceed, then this: |
| 73 | + |
| 74 | +```bash |
| 75 | +nik=# create table t3(); |
| 76 | +ERROR: database is not accepting commands to avoid wraparound data loss in database "nik" |
| 77 | +HINT: Stop the postmaster and vacuum that database in single-user mode. |
| 78 | +You might also need to commit or roll back old prepared transactions, or drop stale replication slots. |
| 79 | +``` |
| 80 | + |
| 81 | +Good, the harm is done! It's time to release FIFO (this will close our long-lasting idle-in-transaction session): |
| 82 | + |
| 83 | +```bash |
| 84 | +exec 3>dummy && exec 3>&- |
| 85 | +``` |
| 86 | + |
| 87 | +## How to escape – the basics |
| 88 | + |
| 89 | +How to escape? We'll probably discuss this in a separate howto, but let's talk about some basics. |
| 90 | + |
| 91 | +See real-life use cases to learn from the other people's mistakes: |
| 92 | + |
| 93 | +- [Sentry's case](https://blog.sentry.io/transaction-id-wraparound-in-postgres/), |
| 94 | +- [Mailchimp's one](https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/) |
| 95 | + |
| 96 | +are good lessons. |
| 97 | + |
| 98 | +Escaping from this takes, in general case, takes a lot of time. Traditionally, as the `HINT` above suggests, it is done |
| 99 | +in the single-user mode. But I very much like the ideas by |
| 100 | +[@PostSQL](https://twitter.com/PostSQL) |
| 101 | +from the GCP team – see the talk [Do you vacuum every day?](https://youtube.com/watch?v=JcRi8Z7rkPg) and discussion in |
| 102 | +the `pgsql-hackers` mailing list |
| 103 | +[We should stop telling users to "vacuum that database in single-user mode"](https://postgresql.org/message-id/flat/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com). |
| 104 | + |
| 105 | +## Alternative method to simulate wraparound |
| 106 | + |
| 107 | +There is another method for simulation which works much faster – no need to wait many hours, we can use `pg_resetwal`'s |
| 108 | +option `-x` ([docs](https://postgresql.org/docs/current/app-pgresetwal.html)) to fast-forward the transaction ID. This |
| 109 | +method is described in [Transaction ID wraparound: a walk on the wild |
| 110 | +side](https://cybertec-postgresql.com/en/transaction-id-wraparound-a-walk-on-the-wild-side/). It's quite interesting |
| 111 | +and looks like this: |
| 112 | + |
| 113 | +```bash |
| 114 | +pg_ctl stop -D $PGDATA |
| 115 | + |
| 116 | +pg_resetwal -x 2137483648 -D testdb ### 2^31 - 10000000 |
| 117 | + |
| 118 | +dd if=/dev/zero of=$PGDATA/pg_xact/07F6 bs=8192 count=15 |
| 119 | +``` |
| 120 | + |
| 121 | +## MultiXact ID wraparound |
| 122 | + |
| 123 | +MultiXact ID is also possible: |
| 124 | +[Multixacts and Wraparound](https://postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND). |
| 125 | +This risk is often overlooked. But it exists, especially if you have FKs and/or use |
| 126 | +`SELECT ... FOR SHARE` (and `SELECT ... FOR UPDATE`) combined with substransactions). |
| 127 | + |
| 128 | +Some articles to learn about it: |
| 129 | + |
| 130 | +- [A foreign key pathology to avoid](https://thebuild.com/blog/2023/01/18/a-foreign-key-pathology-to-avoid/) |
| 131 | +- [Subtransactions considered harmful: Problem 3: unexpected use of Multixact IDs](https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful#problem-3-unexpected-use-of-multixact-ids) |
| 132 | +- [Notes on some PostgreSQL implementation details](https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/) |
| 133 | + |
| 134 | +Simulation of MultiXact ID wraparound is also possible – for this, we need to create multiple overlapping transactions |
| 135 | +that lock the same rows (for example, using explicit `SELECT ... FOR SHARE`). |
| 136 | + |
| 137 | +> 🎯 **TODO:** working example – left for future edits. |
| 138 | +
|
| 139 | +Everyone should not only monitor (with alerting) for traditional XID wraparound risks, but also for MultiXID wraparound |
| 140 | +risks. And it should be included in the snippets showing how much of the "capacity" is used. |
| 141 | + |
| 142 | +> 🎯 **TODO:** snippet to show both XID w-d and MultiXID w-d, at both DB and table level |
| 143 | +> Obviously, multixid wraparrounds are encountered less often in the wild – I don't see people have `datminmxid` and |
| 144 | +> `relminmxid` used in the snippets. |
| 145 | +> Basic version: |
| 146 | +> |
| 147 | +> ```sql |
| 148 | +> select |
| 149 | +> datname, |
| 150 | +> age(datfrozenxid), |
| 151 | +> mxid_age(datminmxid) |
| 152 | +> from pg_database; |
| 153 | +> ``` |
0 commit comments