Allow declaration after statement #1

JelteF · 2020-12-07T21:17:48Z

What is this?

It's a draft patch that replaces code like this:

pg_file_unlink(PG_FUNCTION_ARGS)
{
	char	   *filename;
	requireSuperuser();
	filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));

With this shorter version:

pg_file_unlink(PG_FUNCTION_ARGS)
{
	requireSuperuser();
	char      *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));

Why would we want this?

1. It removes 22434 lines of code that don't have any impact on how the code behaves [1]. This is roughly 2.7% of all the lines of code in the codebase.
2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
   a. It limits variable scope to what is necessary, which in turn simplifies the mental model of a function that you have to keep while reading the code.
   b. In most cases it allows you to see the type of a variable without needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of them. This makes it clear which invariants hold for the function (as seen in the example above and the changes for pg_file_rename [3]).
4. Declaring variables after statements is allowed in C99. Postgres already requires C99, so it might as well use this feature too.
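
As a minimal sketch of the points about limited scope and checks-first ordering, the C99 function below validates its input before declaring anything. The function and its names are hypothetical and are not taken from the Postgres source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: count how often `needle`
 * occurs in `haystack`.
 */
size_t
count_char(const char *haystack, char needle)
{
	/* Input checks come first, so the function's invariants are visible. */
	assert(haystack != NULL);
	if (needle == '\0')
		return 0;

	/* C99 allows these declarations after the statements above. */
	size_t		len = strlen(haystack);
	size_t		count = 0;

	/* The loop index is scoped to the loop itself. */
	for (size_t i = 0; i < len; i++)
	{
		if (haystack[i] == needle)
			count++;
	}
	return count;
}
```

In C89 style, `len`, `count` and `i` would all have to be declared before the `assert`, which pushes the checks further away from the function signature.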

How was this changeset created?

I created a script that modifies all C files in the following way:

Find a non static declaration.
If it has an initializer and it is not a single variable/constant, don't consider replacing it (reason: a function call initializer might have side effects).
Otherwise (it's a simple initializer or it has no initializer at all), take the type and variable from that declaration.
Find the next use of the variable.
If the next use is not an assignment, don't do anything (the value of the original initialization is used).
If the next use is an assignment:
1. Remove the old declaration
2. Prefix the found assignment with the type
3. Skip both steps if the variable is also used elsewhere on the line of the new initialization, since its old value is still needed there, e.g.:
```
int a = 1;
a = a + a;
```
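
To make the rules above concrete, here is a hedged before/after sketch in C. All names are invented for illustration; `next_id()` merely stands in for any initializer that has side effects, and the "after" function is hand-written to match the described rules, not actual script output.

```c
#include <assert.h>

/* Hypothetical global state, so that next_id() has a visible side effect. */
int			id_counter = 0;

static int
next_id(void)
{
	return ++id_counter;
}

/* Before: declarations grouped at the top of the function. */
int
sum_before(int x)
{
	int			total;			/* no initializer: candidate for moving */
	int			id = next_id();	/* call initializer: left alone */
	int			twice = 1;		/* simple initializer, but see below */

	assert(x >= 0);
	total = x + id;				/* first use is an assignment */
	twice = twice + twice;		/* reads its own old value: not rewritten */
	return total * twice;
}

/* After: only `total` has moved down to its first assignment. */
int
sum_after(int x)
{
	int			id = next_id();
	int			twice = 1;

	assert(x >= 0);
	int			total = x + id;
	twice = twice + twice;
	return total * twice;
}
```

Both functions return the same value for the same starting `id_counter`, which is the property the script relies on: only declarations whose initial value is never read are moved.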

How does this script work?

It uses a Perl regex to search and replace! (obligatory jokes at the bottom of the email) This regex is based on the ones used in this PR to citus[4] and the similar PR to pg_auto_failover[5]. The script is in the 3rd commit of this patchset.

To visualize the regex in the script in a reasonable way, copy the search part of it:

\n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( = [\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable) =(?![^;]*?[^>_]\b(?P=variable)\b[^_])

And paste it into https://www.debuggex.com/, then select PCRE from the select box. (Sharing seems to be down at this point, so this is the only way to do it at the moment.) Try it out! The regex is not as crazy as it looks.

There are two important assumptions this regex makes:

Code is indented using tabs, and the indent level determines the scope. (this is true, because of pgindent)
Declared variables are actually used. (this is true, because we would get compiler warnings otherwise)

There are two cases where this regex has some special behaviour:

Stop searching for the first usage of a variable when either a goto or a preprocessor command is found (outside a string or comment). These things affect the control flow in a way that the regex does not understand. (any # character is assumed to be a preprocessor command).
Don't replace if the assignment is right after a label:, by checking if there was a : as the last character on the previous line. This works around errors like this:

hashovfl.c:865:3: error: a label can only be part of a statement and a declaration is not a statement
   OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage);
   ^~~~~~~~~~~~

Detecting this case in this way is not perfect, because sometimes there is an extra newline or a comment between the label and the assignment. This is not easily detectable by the regex, because lookbehinds cannot have a variable length in Perl (or most regex engines for that matter). For these few cases (5 in the entire codebase) a manual change was done either before or after the automatic replacement to fix them so the code compiles again. (2nd and 5th commits of this patchset)
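
The compiler error above comes from the C grammar before C23: a label must be followed by a statement, and a declaration is not one. One way to keep a declaration directly after a label is to put an empty statement between them. The sketch below shows that approach with a hypothetical function; this description does not say whether the manual fixes in the patchset use this technique or simply keep the declaration at the top.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical example: doubles n, treating negative input as 0. */
int
clamp_then_double(int n)
{
	assert(n <= INT_MAX / 2);	/* keep n * 2 from overflowing */

	if (n >= 0)
		goto compute;
	n = 0;

compute:
	;							/* empty statement: the label now has a statement */
	int			result = n * 2;	/* declaration after a statement, fine in C99 */

	return result;
}
```

Without the lone `;`, compilers that follow the pre-C23 grammar reject the declaration with the "a label can only be part of a statement" error quoted above.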

After all these changes make -s world doesn't show any warnings or errors and make check-world succeeds. I configured compilation like this:

./configure --enable-cassert --enable-depend --enable-debug --with-openssl --with-libxml --with-libxslt --with-uuid=e2fs --with-icu

What do I want with this email?

For people to look at a small sample of the changes made by this script. If you prefer, this changeset is also available on Github[6], which might make viewing this enormous diff easier. If you see something weird that is not covered in the "Known issues" section below, please share it so it can be discussed and/or fixed.
A discussion on whether this type of code change would be a net positive for the Postgres codebase. Please explain clearly why or why not.
Some links to resources on what's necessary to get a big refactoring patch like this merged.

What don't I want?

For someone to go over all the changes right now. There are likely to be changes to the script or something else. A full review of the changes is better saved for a final review later.

Known issues with the currently generated code

There are a few issues with the final generated code that I've already spotted. These should all be relatively easy to fix in an automated way. However, I think doing so is probably better done by pgindent or some other auto formatting tool, instead of with the regex. Note that I did run pgindent, it just didn't address these things:

Whitespace between type and variable is kept the same as it was before moving the declaration. If the original declaration had whitespace added in such a way that all variable names of declarations aligned, then this whitespace is preserved. This looks weird in various cases. See [7] for an example.
pgindent adds a newline after each block of declarations, even if they are not at the start of a function. Whether this is desired is debatable, but to me it seems like it adds excessive newlines. See [8] for an example.
If all declarations are moved away from the top of the function, then an empty newline is kept at the top of the function. This seems like an unnecessary newline to me. See [9] for an example.

Sources:

tokei[10] results of lines of code:
Before:

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1353      1304323       828701       317382       158240
===============================================================================

After:

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1353      1291143       806267       317396       167480
===============================================================================

Obligatory jokes:

The idea behind this patch is to design out bugs like the one fixed by commit 9d52311. Previously, once one did RelationOpenSmgr(rel), it was considered okay to access rel->rd_smgr directly for some not-very-clear interval. But since that pointer will be cleared by relcache flushes, we had bugs arising from overreliance on a previous RelationOpenSmgr call still being effective. Now, very little code except that in rel.h and relcache.c should ever touch the rd_smgr field directly. The normal coding rule is to use RelationGetSmgr(rel) and not expect the result to be valid for longer than one smgr function call. There are a couple of places where using the function every single time seemed like overkill, but they are now annotated with large warning comments. Amul Sul, after an idea of mine. Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com

Apple's mechanism for dealing with functions that are available in only some OS versions confuses AC_CHECK_FUNCS, and therefore AC_REPLACE_FUNCS. We can use AC_CHECK_DECLS instead, so long as we enable -Werror=unguarded-availability-new. This allows people compiling for macOS to control whether or not preadv/pwritev are used by setting MACOSX_DEPLOYMENT_TARGET, rather than supplying a back-rev SDK. (Of course, the latter still works, too.) James Hilliard Discussion: https://postgr.es/m/20210122193230.25295-1-james.hilliard1@gmail.com

Allow a pager to be used by the \watch command. This works but isn't very useful with traditional pagers like "less", so use a different environment variable. The popular open source tool "pspg" (also by Pavel) knows how to display the output if you set PSQL_WATCH_PAGER="pspg --stream". To make \watch react quickly when the user quits the pager or presses ^C, and also to increase the accuracy of its timing and decrease the rate of useless context switches, change the main loop of the \watch command to use sigwait() rather than a sleeping/polling loop, on Unix. Supported on Unix only for now (like pspg). Author: Pavel Stehule <pavel.stehule@gmail.com> Author: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CAFj8pRBfzUUPz-3gN5oAzto9SDuRSq-TQPfXU_P6h0L7hO%2BEhg%40mail.gmail.com

This fixes a theoretical bug in tuplesort.c which, if a bounded sort was used in combination with a byval Datum sort (tuplesort_begin_datum), when switching the sort to a bounded heap in make_bounded_heap(), we'd call free_sort_tuple(). The problem was that when sorting Datums of a byval type, the tuple is NULL and free_sort_tuple() would free the memory for it regardless of that. This would result in a crash. Here we fix that simply by adding a check to see if the tuple is NULL before trying to disassociate and free any memory belonging to it. The reason this bug is only theoretical is that nowhere in the current code base do we do tuplesort_set_bound() when performing a Datum sort. However, let's backpatch a fix for this as if any extension uses the code in this way then it's likely to cause problems. Author: Ronan Dunklau Discussion: https://postgr.es/m/CAApHDvpdoqNC5FjDb3KUTSMs5dg6f+XxH4Bg_dVcLi8UYAG3EQ@mail.gmail.com Backpatch-through: 9.6, oldest supported version

4146925 went to the trouble of removing a theoretical bug from free_sort_tuple by checking if the tuple was NULL before freeing it. Let's make this a little more robust by also setting the tuple to NULL so that should we be called again we won't end up doing a pfree on the already pfree'd tuple. Per advice from Tom Lane. Discussion: https://postgr.es/m/3188192.1626136953@sss.pgh.pa.us Backpatch-through: 9.6, same as 4146925
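
The two commit messages above describe a NULL check before freeing and then resetting the pointer afterwards. A minimal sketch of that pattern follows; `DemoTuple` and `demo_free_tuple` are hypothetical stand-ins, and `malloc`/`free` replace Postgres's `palloc`/`pfree` so that the example is self-contained.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a sort tuple; not the real tuplesort struct. */
typedef struct DemoTuple
{
	void	   *tuple;			/* NULL when the datum is stored by value */
	long		datum;
} DemoTuple;

/* Free the out-of-line part of a tuple, tolerating NULL and repeated calls. */
void
demo_free_tuple(DemoTuple *stup)
{
	assert(stup != NULL);

	if (stup->tuple != NULL)
	{
		free(stup->tuple);
		stup->tuple = NULL;		/* a second call becomes a no-op */
	}
}
```

Standard `free(NULL)` is harmless, so the explicit check matters less here than it does for the allocator described in the commit message. Resetting the pointer is what protects against a double free.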

There's no point in checking whether an INT8 sequence's seqmin and seqmax values are outside the range of the minimum and maximum values for an int64 type. These both use the same underlying types so an INT8 certainly cannot be outside the minimum and maximum values supported by int64. This code is fairly harmless and it seems likely that most compilers would optimize it out anyway; nevertheless, let's remove it, replacing it with a small comment to mention why the check is not needed. Author: Greg Nancarrow, with the comment revised by David Rowley Discussion: https://postgr.es/m/CAJcOf-c9KBUZ8ow_6e%3DWSfbbEyTKfqV%3DVwoFuODQVYMySHtusw%40mail.gmail.com

The name introduced by commit 4656e3d was agreed to be unreasonably long. To match this change, rename initdb's recently-added --clobber-cache option to --discard-caches. Discussion: https://postgr.es/m/1374320.1625430433@sss.pgh.pa.us

"Result Cache" was never a great name for this node, but nobody managed to come up with another name that anyone liked enough. That was until David Johnston mentioned "Node Memoization", which Tom Lane revised to just "Memoize". People seem to like "Memoize", so let's do the rename. Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/20210708165145.GG1176@momjian.us Backpatch-through: 14, where Result Cache was introduced

The internals of the frontend-side callbacks for SASL are visible in libpq-int.h, but the header was not getting installed. This would cause compilation failures for applications playing with the internals of libpq. Issue introduced in 9fd8557. Author: Mikhail Kulagin Reviewed-by: Jacob Champion Discussion: https://postgr.es/m/05ce01d777cb$40f31d60$c2d95820$@postgrespro.ru

To add support for streaming transactions at prepare time into the built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API callbacks, by leveraging the extended replication protocol.
* Modify the replication apply worker, to properly handle two-phase transactions by replaying them on prepare.
* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase transactions. We enable the two_phase once the initial data sync is over.

We however must explicitly disable replication of two-phase transactions during replication slot creation, even if the plugin supports it. We don't need to replicate the changes accumulated during this phase, and moreover, we don't have a replication connection open so we don't know where to send the data anyway. The streaming option is not allowed with this new two_phase option. This can be done as a separate patch. We don't allow to toggle two_phase option of a subscription because it can lead to an inconsistent replica. For the same reason, we don't allow to refresh the publication once the two_phase is enabled for a subscription unless copy_data option is false.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi, Greg Nancarrow
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com

Previously, we would send each line as a separate CopyData message. That's pretty wasteful if the table is narrow, as each CopyData message has 5 bytes of overhead. For efficiency, buffer up and pack 8 kB of input data into each CopyData message. The server also sends each line as a separate CopyData message in COPY TO STDOUT, and that's similarly wasteful. But that's documented in the FE/BE protocol description, so changing that would be a wire protocol break. Reviewed-by: Aleksander Alekseev Discussion: https://www.postgresql.org/message-id/40b2cec0-d0fb-3191-2ae1-9a3fe16a7e48%40iki.fi
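
The buffering idea in the message above can be sketched as follows. This is an illustration only, assuming a fixed 8 kB buffer that is flushed whenever the next line would not fit; the `Demo*` names are hypothetical, and "sending" is reduced to counting messages.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 8192		/* 8 kB, as in the description above */

typedef struct DemoCopyBuffer
{
	char		data[DEMO_BUF_SIZE];
	size_t		used;			/* bytes currently buffered */
	size_t		messages_sent;	/* number of messages produced so far */
} DemoCopyBuffer;

/* Pretend to send the buffered bytes as one message, then reset. */
void
demo_flush(DemoCopyBuffer *buf)
{
	if (buf->used > 0)
	{
		buf->messages_sent++;
		buf->used = 0;
	}
}

/* Append one line, flushing first if it would not fit. */
void
demo_add_line(DemoCopyBuffer *buf, const char *line, size_t len)
{
	assert(len <= DEMO_BUF_SIZE);	/* oversized lines are out of scope here */

	if (buf->used + len > DEMO_BUF_SIZE)
		demo_flush(buf);
	memcpy(buf->data + buf->used, line, len);
	buf->used += len;
}
```

With 1000 lines of 10 bytes each, this sketch produces 2 messages instead of 1000, which is where the saving in per-message overhead comes from.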

Commit 0563a3a changed how partition constraints were generated such that this function no longer computes the mapping of parent attnos to child attnos. This is an external function that extensions could use, so this is potentially a breaking change. No external callers are known, however, and this will make it simpler to write such callers in the future. Author: Hou Zhijie Reviewed-by: David Rowley, Michael Paquier, Soumyadeep Chakraborty Discussion: https://www.postgresql.org/message-id/flat/OS0PR01MB5716A75A45BE46101A1B489894379@OS0PR01MB5716.jpnprd01.prod.outlook.com

This allows Param substitution to produce just the same result as writing a constant value literally would have done. While it hardly matters so far as the current core code is concerned, extensions might take more interest in node location fields. Julien Rouhaud Discussion: https://postgr.es/m/20170311220932.GJ15188@nol.local

Build farm animals running ancient HPUX and Solaris have a non-standard sigwait() from draft versions of POSIX, so they didn't like commit 7c09d27. To avoid the problem in general, only try to use sigwait() if it's declared by <signal.h> and matches the expected declaration. To select the modern declaration on Solaris (even in non-threaded programs), move -D_POSIX_PTHREAD_SEMANTICS into the right place to affect all translation units. Also fix the error checking. Modern sigwait() doesn't set errno. Thanks to Tom Lane for help with this. Discussion: https://postgr.es/m/3187588.1626136248%40sss.pgh.pa.us

There is a non-trivial amount of code that handles ZLIB compression in pg_receivewal, from basics like the format name, the calculation of the start streaming position and of course the compression itself, but there was no automated coverage for it. This commit introduces a set of conditional tests (if the build supports ZLIB) to cover the creation of ZLIB-compressed WAL segments, the handling of the partial, compressed, WAL segments and the compression operation in itself. Note that there is an extra phase checking the validity of the generated files by using directly a gzip command, passed down by the Makefile of pg_receivewal. This part is skipped if the command cannot be found, something likely going to happen on Windows with MSVC except if one sets the variable GZIP_PROGRAM in the environment of the test. This set of tests will become handy for upcoming patches that add more options for the compression methods used by pg_receivewal, like LZ4, to make sure that no existing facilities are broken. Author: Georgios Kokolatos Reviewed-by: Gilles Darold, Michael Paquier Discussion: https://postgr.es/m/07BK3Mk5aEOsTwGaY77qBVyf9GjoEzn8TMgHLyPGfEFPIpTEmoQuP2P4c7teesjSg-LPeUafsp1flnPeQYINMSMB_UpggJDoduB5EDYBqaQ=@protonmail.com

When reporting "conflicting or redundant options" errors, try to ensure that errposition() is used, to help the user identify the offending option. Formerly, errposition() was invoked in less than 60% of cases. This patch raises that to over 90%, but there remain a few places where the ParseState is not readily available. Using errdetail() might improve the error in such cases, but that is left as a task for the future. Additionally, since this error is thrown from over 100 places in the codebase, introduce a dedicated function to throw it, reducing code duplication. Extracted from a slightly larger patch by Vignesh C. Reviewed by Bharath Rupireddy, Alvaro Herrera, Dilip Kumar, Hou Zhijie, Peter Smith, Daniel Gustafsson, Julien Rouhaud and me. Discussion: https://postgr.es/m/CALDaNm33FFSS5tVyvmkoK2cCMuDVxcui=gFrjti9ROfynqSAGA@mail.gmail.com

This commit fixes the description of a couple of multirange operators and oprjoin for another multirange operator. The change of oprjoin is more cosmetic since both old and new functions return the same constant. These cosmetic changes aren't worth a catalog incompatibility between 14beta2 and 14beta3, so catversion isn't bumped. Discussion: https://postgr.es/m/CAPpHfdv9OZEuZDqOQoUKpXhq%3Dmc-qa4gKCPmcgG5Vvesu7%3Ds1w%40mail.gmail.com Backpatch-through: 14

Currently the pc files use hard coded paths for "includedir" and "libdir."

Example:

    Cflags: -I/usr/include
    Libs: -L/usr/lib -lpq

This is not very fortunate when cross compiling inside a buildroot, where the includes and libs are inside a staging directory, because this introduces host paths into the build:

    checking for pkg-config... /builder/shared-workdir/build/sdk/staging_dir/host/bin/pkg-config
    checking for PostgreSQL libraries via pkg_config... -L/usr/lib <----

This commit addresses this by doing the following two things:

1. Instead of hard coding the paths in "Cflags" and "Libs" "${includedir}" and "${libdir}" are used. Note: these variables can be overridden on the pkg-config command line ("--define-variable=libdir=/some/path").
2. Add the variables "prefix" and "exec_prefix". If "includedir" and/or "libdir" are using these then construct them accordingly. This is done because buildroots (for instance OpenWrt) tend to rename the real pkg-config and call it indirectly from a script that sets "prefix", "exec_prefix" and "bindir", like so:

    pkg-config.real --define-variable=prefix=${STAGING_PREFIX} \
        --define-variable=exec_prefix=${STAGING_PREFIX} \
        --define-variable=bindir=${STAGING_PREFIX}/bin $@

Example #1: user calls ./configure with "--libdir=/some/lib" and "--includedir=/some/include":

    prefix=/usr/local/pgsql
    exec_prefix=${prefix}
    libdir=/some/lib
    includedir=/some/include

    Name: libpq
    Description: PostgreSQL libpq library
    Url: http://www.postgresql.org/
    Version: 12.1
    Requires:
    Requires.private:
    Cflags: -I${includedir}
    Libs: -L${libdir} -lpq
    Libs.private: -lcrypt -lm

Example #2: user calls ./configure with no arguments:

    prefix=/usr/local/pgsql
    exec_prefix=${prefix}
    libdir=${exec_prefix}/lib
    includedir=${prefix}/include

    Name: libpq
    Description: PostgreSQL libpq library
    Url: http://www.postgresql.org/
    Version: 12.1
    Requires:
    Requires.private:
    Cflags: -I${includedir}
    Libs: -L${libdir} -lpq
    Libs.private: -lcrypt -lm

Like this the paths can be forced into the staging directory when using a buildroot setup:

    checking for pkg-config... /home/sk/tmp/openwrt/staging_dir/host/bin/pkg-config
    checking for PostgreSQL libraries via pkg_config... -L/home/sk/tmp/openwrt/staging_dir/target-mips_24kc_musl/usr/lib

Author: Sebastian Kemper <sebastian_ml@gmx.net>
Co-authored-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/20200305213827.GA25135%40darth.lan

Due to how pg_size_pretty(bigint) was implemented, it's possible that when given a negative number of bytes that the returning value would not match the equivalent positive return value when given the equivalent positive number of bytes. This was due to two separate issues. 1. The function used bit shifting to convert the number of bytes into larger units. The rounding performed by bit shifting is not the same as dividing. For example -3 >> 1 = -2, but -3 / 2 = -1. These two operations are only equivalent with positive numbers. 2. The half_rounded() macro rounded towards positive infinity. This meant that negative numbers rounded towards zero and positive numbers rounded away from zero. Here we fix #1 by dividing the values instead of bit shifting. We fix #2 by adjusting the half_rounded macro always to round away from zero. Additionally, adjust the pg_size_pretty(numeric) function to be more explicit that it's using division rather than bit shifting. A casual observer might have believed bit shifting was used due to a static function being named numeric_shift_right. However, that function was calculating the divisor from the number of bits and performed division. Here we make that more clear. This change is just cosmetic and does not affect the return value of the numeric version of the function. Here we also add a set of regression tests both versions of pg_size_pretty() which test the values directly before and after the function switches to the next unit. This bug was introduced in 8a1fab3. Prior to that negative values were always displayed in bytes. Author: Dean Rasheed, David Rowley Discussion: https://postgr.es/m/CAEZATCXnNW4HsmZnxhfezR5FuiGgp+mkY4AzcL5eRGO4fuadWg@mail.gmail.com Backpatch-through: 9.6, where the bug was introduced.
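
The rounding difference described above can be reproduced with a few small functions. This is a sketch under stated assumptions: the function names are hypothetical, and the bodies are only one plausible reading of "add one then halve" versus "round away from zero", not a copy of the actual macro.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Division truncates toward zero in C99, so -3 / 2 is -1.  By contrast,
 * x >> 1 on a negative value is implementation-defined; with the usual
 * arithmetic shift it yields -2 for -3, as the message above notes.
 */
int64_t
halve_by_division(int64_t x)
{
	return x / 2;
}

/* Assumed old behaviour: add 1, then halve.  Not symmetric around zero. */
int64_t
half_rounded_add_one(int64_t x)
{
	assert(x < INT64_MAX);
	return (x + 1) / 2;
}

/* Assumed new behaviour: push away from zero first, then halve. */
int64_t
half_rounded_away(int64_t x)
{
	assert(x > INT64_MIN && x < INT64_MAX);
	return (x + (x < 0 ? -1 : 1)) / 2;
}
```

With these definitions, `half_rounded_away(-x)` equals `-half_rounded_away(x)`, which is the symmetry between negative and positive inputs that the fix is after.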

Remove four probes for members of sockaddr_storage. Keep only the probe for sockaddr's sa_len, which is enough for our two remaining places that know about _len fields: 1. ifaddr.c needs to know if sockaddr has sa_len to understand the result of ioctl(SIOCGIFCONF). Only AIX is still using the relevant code today, but it seems like a good idea to keep it compilable on Linux. 2. ip.c was testing for presence of ss_len to decide whether to fill in sun_len in our getaddrinfo_unix() function. It's just as good to test for sa_len. If you have one, you have them all. (The code in #2 isn't actually needed at all on several OSes I checked since modern versions ignore sa_len on input to system calls. Proving that's the case for all relevant OSes is left for another day, but wouldn't get rid of that last probe anyway if we still want it for #1.) Discussion: https://postgr.es/m/CA%2BhUKGJJjF2AqdU_Aug5n2MAc1gr%3DGykNjVBZq%2Bd6Jrcp3Dyvg%40mail.gmail.com

Load balancing connections across multiple read replicas is a pretty common way of scaling out read queries. There are two main ways of doing so, both with their own advantages and disadvantages: 1. Load balancing at the client level 2. Load balancing by connecting to an intermediary load balancer Both JDBC (Java) and Npgsql (C#) already support client level load balancing (option #1). This patch implements client level load balancing for libpq as well. To stay consistent with the JDBC and Npgsql part of the ecosystem, a similar implementation and name for the option are used. It contains two levels of load balancing: 1. The given hosts are randomly shuffled, before resolving them one-by-one. 2. Once a host's addresses are resolved, those addresses are shuffled, before trying to connect to them one-by-one.
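
The first shuffling level can be sketched with a Fisher-Yates shuffle over the host list. This is an illustration only: `shuffle_hosts` is a hypothetical name, and `rand()` is used purely to keep the example short; the message above does not say which random source the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Fisher-Yates shuffle over an array of host names.  The slight modulo
 * bias of rand() % i is ignored in this sketch.
 */
void
shuffle_hosts(const char **hosts, size_t nhosts)
{
	assert(hosts != NULL || nhosts == 0);

	for (size_t i = nhosts; i > 1; i--)
	{
		size_t		j = (size_t) rand() % i;	/* 0 <= j < i */
		const char *tmp = hosts[i - 1];

		hosts[i - 1] = hosts[j];
		hosts[j] = tmp;
	}
}
```

A client would then try the hosts in the shuffled order. The second level applies the same idea to the addresses each host resolves to.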

Commit e7cb7ee, which introduced the infrastructure for FDWs and custom scan providers to replace joins with scans, failed to add support handling of pseudoconstant quals assigned to replaced joins in createplan.c, leading to an incorrect plan without a gating Result node when postgres_fdw replaced a join with such a qual. To fix, we could add the support by 1) modifying the ForeignPath and CustomPath structs to store the list of RestrictInfo nodes to apply to the join, as in JoinPaths, if they represent foreign and custom scans replacing a join with a scan, and by 2) modifying create_scan_plan() in createplan.c to use that list in that case, instead of the baserestrictinfo list, to get pseudoconstant quals assigned to the join; but #1 would cause an ABI break. So fix by modifying the infrastructure to just disallow replacing joins with such quals. Back-patch to all supported branches. Reported by Nishant Sharma. Patch by me, reviewed by Nishant Sharma and Richard Guo. Discussion: https://postgr.es/m/CADrsxdbcN1vejBaf8a%2BQhrZY5PXL-04mCd4GDu6qm6FigDZd6Q%40mail.gmail.com

Default privileges are represented as NULL::aclitem[] in catalog ACL columns, while revoking all privileges leaves an empty aclitem[]. These two cases used to produce identical output in psql meta-commands like \dp. Using something like "\pset null '(default)'" as a workaround for spotting the difference did not work, because null values were always displayed as empty strings by describe.c's meta-commands. This patch improves that with two changes: 1. Print "(none)" for empty privileges so that the user is able to distinguish them from default privileges, even without special workarounds. 2. Remove the special handling of null values in describe.c, so that "\pset null" is honored like everywhere else. (This affects all output from these commands, not only ACLs.) The privileges shown by \dconfig+ and \ddp as well as the column privileges shown by \dp are not affected by change #1, because the respective aclitem[] is reset to NULL or deleted from the catalog instead of leaving an empty array. Erik Wienhold and Laurenz Albe Discussion: https://postgr.es/m/1966228777.127452.1694979110595@office.mailbox.org

1. TruncateMultiXact() performs the SLRU truncations in a critical section. Deleting the SLRU segments calls ForwardSyncRequest(), which will try to compact the request queue if it's full (CompactCheckpointerRequestQueue()). That in turn allocates memory, which is not allowed in a critical section. Backtrace:

    TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
    postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
    postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
    postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
    postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
    postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
    postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
    postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
    postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
    postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
    postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
    postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
    postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
    postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
    postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
    /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
    postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]

To fix, bail out in CompactCheckpointerRequestQueue() without doing anything, if it's called in a critical section. That covers the above call path, as well as any other similar cases where RegisterSyncRequest might be called in a critical section.

2. After fixing that, another problem became apparent: Autovacuum process doing that truncation can deadlock with the checkpointer process. TruncateMultiXact() sets "MyProc->delayChkptFlags |= DELAY_CHKPT_START". If the sync request queue is full and cannot be compacted, the process will repeatedly sleep and retry, until there is room in the queue. However, if the checkpointer is trying to start a checkpoint at the same time, and is waiting for the DELAY_CHKPT_START processes to finish, the queue will never shrink.

More concretely, the autovacuum process is stuck here:

    #0 0x00007fc934926dc3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
    #1 0x000056220b24348b in WaitEventSetWaitBlock (set=0x56220c2e4b50, occurred_events=0x7ffe7856d040, nevents=1, cur_timeout=<optimized out>) at ../src/backend/storage/ipc/latch.c:1570
    #2 WaitEventSetWait (set=0x56220c2e4b50, timeout=timeout@entry=10, occurred_events=<optimized out>, occurred_events@entry=0x7ffe7856d040, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:1516
    #3 0x000056220b243224 in WaitLatch (latch=<optimized out>, latch@entry=0x0, wakeEvents=wakeEvents@entry=40, timeout=timeout@entry=10, wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:538
    #4 0x000056220b26cf46 in RegisterSyncRequest (ftag=ftag@entry=0x7ffe7856d0a0, type=type@entry=SYNC_FORGET_REQUEST, retryOnError=true) at ../src/backend/storage/sync/sync.c:614
    #5 0x000056220af9db0a in SlruInternalDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1495
    #6 0x000056220af9dab1 in SlruDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1566
    #7 0x000056220af98e1b in PerformMembersTruncation (oldestOffset=<optimized out>, newOldestOffset=<optimized out>) at ../src/backend/access/transam/multixact.c:3006
    #8 TruncateMultiXact (newOldestMulti=newOldestMulti@entry=3221225472, newOldestMultiDB=newOldestMultiDB@entry=4) at ../src/backend/access/transam/multixact.c:3201
    #9 0x000056220b098303 in vac_truncate_clog (frozenXID=749, minMulti=<optimized out>, lastSaneFrozenXid=749, lastSaneMinMulti=3221225472) at ../src/backend/commands/vacuum.c:1917
    #10 vac_update_datfrozenxid () at ../src/backend/commands/vacuum.c:1760
    #11 0x000056220b1c3f76 in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2550
    #12 0x000056220b1c2c3d in AutoVacWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/autovacuum.c:1569

and the checkpointer is stuck here:

    #0 0x00007fc9348ebf93 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    #1 0x00007fc9348fe353 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    #2 0x000056220b40ecb4 in pg_usleep (microsec=microsec@entry=10000) at ../src/port/pgsleep.c:50
    #3 0x000056220afb43c3 in CreateCheckPoint (flags=flags@entry=108) at ../src/backend/access/transam/xlog.c:7098
    #4 0x000056220b1c6e86 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:464

To fix, add AbsorbSyncRequests() to the loops where the checkpointer waits for DELAY_CHKPT_START or DELAY_CHKPT_COMPLETE operations to finish.

Backpatch to v14. Before that, SLRU deletion didn't call RegisterSyncRequest, which avoided this failure. I'm not sure if there are other similar scenarios on older versions, but we haven't had any such reports.

Discussion: https://www.postgresql.org/message-id/ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi

JelteF force-pushed the remove-useless-lines branch 8 times, most recently from 09fab2a to 60378ba · December 8, 2020 17:25

petere and others added 22 commits July 12, 2021 22:16

doc: Fix typo in function prototype

5b60cf3

Fix lack of message pluralization

55b2a23

Clarify description of pg_stat_statements columns

b4deefc

Reported-By: Peter Eisentraut Backpatch-through: 14 Discussion: https://postgr.es/m/8f5e63b8-e8ed-0f80-d8c4-68222624c200@enterprisedb.com

Fix some nonstandard C code indentation in grammar file

9aa8268

Remove unnecessary assertion in postmaster.c

dc2db1e

A code path asserted that the archiver was dead, but a check made that impossible to happen. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACW=CYE1ars+2XyPTEPq0wQvru4c0dPZ=Nrn3EqNBkksvQ@mail.gmail.com Backpatch-through: 14

This was referenced Aug 19, 2021

Allow declaration after statement postgres/postgres#68

Closed

Allow declaration after statement and reformat code to use it #2

Closed
