Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow declaration after statement and reformat code to use it #2

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

JelteF
Copy link
Owner

@JelteF JelteF commented Aug 19, 2021

What is this?

It's a draft patch that replaces code like this:

pg_file_unlink(PG_FUNCTION_ARGS)
{
	char	   *filename;
	requireSuperuser();
	filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));

With this shorter version:

pg_file_unlink(PG_FUNCTION_ARGS)
{
	requireSuperuser();
	char      *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));

Why would we want this?

  1. It removes 23328 lines of code that don't have any impact on how the code behaves [1]. This is roughly 2.7% of all the lines of code in the codebase.
  2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
    a. This limits variable scope to what is necessary. Which in turn makes the mental model you have to keep of a function when reading the code simpler.
    b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
  3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function. (as seen in the example above and the changes for pg_file_rename[3])
  4. Declaring variables after statements is allowed in C99. Postgres already requires C99, so it might as well use this feature too.

How was this changeset created?

I created a script that modifies all C files in the following way:

  1. Find a non static declaration.
  2. If it has an initializer and it is not a single variable/constant, don't consider replacing it. (reason: a function call initializer might have sideffects).
  3. Otherwise (it's a simple initializer or it has no initializer at all), take the type and variable from that declaration.
  4. Find the next use of the variable.
  5. If the next use is not an assignment, don't do anything (the value of the original initialization is used).
  6. If the next use is an assignment:
    1. Remove the old declaration
    2. Prefix the found assignment with the type
    3. Unless the variable is also used in the same line of the new initialization, e.g:
    int a = 1;
    a = a + a;

How does this script work?

It uses a Perl regex to search and replace! (obligatory jokes at the bottom of the email) This regex is based on the ones used in this PR to citus[4] and the similar PR to pg_auto_failover[5]. The script is in the 3rd commit of this patchset.

To visualize the regex in the script in a reasonable way, copy paste the search part of it:

\n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( = [\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable) =(?![^;]*?[^>_]\b(?P=variable)\b[^_])

And paste that into https://www.debuggex.com/, then select PCRE from the dropdown. (Sharing seems to be down at this point, so this is the only way to do it at the moment) Try it out! The regex is not as crazy as it looks.

There's two important assumptions this regex makes:

  1. Code is indented using tabs, and the intent level determines the scope. (this is true, because of pgindent)
  2. Declared variables are actually used. (this is true, because we would get compiler warnings otherwise)

There's two cases where this regex has some special behaviour:

  1. Stop searching for the first usage of a variable when either a goto or a preprocessor command is found (outside a string or comment). These things affect the control flow in a way that the regex does not understand. (any # character is assumed to be a preprocessor commands).
  2. Don't replace if the assignment is right after a label:, by checking if there was a : as the last character on the previous line. This works around errors like this:
hashovfl.c:865:3: error: a label can only be part of a statement and a declaration is not a statement
   OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage);
   ^~~~~~~~~~~~

Detecting this case in this way is not perfect, because sometimes there is an extra newline or a comment between the label and the assignment. This is not easily detectable by the regex, because lookbehinds cannot have a variable length in Perl (or most regex engines for that matter). For these few cases (5 in the entire codebase) a manual change was done either before or after the automatic replacement to fix them so the code compiles again. (2nd and 5th commits of this patchset)

After all these changes make -s world doesn't show any warnings or errors and make check-world succeeds. I configured compilation like this:

./configure --enable-cassert --enable-depend --enable-debug --with-openssl --with-libxml --with-libxslt --with-uuid=e2fs --with-icu

What I want to achieve with this email

  1. For people to look at a small sample of the changes made by this script. I will try to send patches with the actual changes made to the code as attachments in a followup email. However, I don't know if that will succeed since the patch files are very big and it might run into some mailserver limits. So in case that doesn't work, the full changeset is available on Github[6]. This make viewing this enormous diff quite a bit easier IMHO anyaway. If you see something weird that is not covered in the "Known issues" section below, please share it. That way it can be discussed and/or fixed.
  2. A discussion on if this type of code change would be a net positive for Postgres codebase. Please explain clearly why or why not.
  3. Some links to resources on what's necessary to get a big refactoring patch like this merged.

What I don't want to achieve with this email

  1. For someone to go over all the changes right now. There's likely to be changes to the script or something else. Doing a full review of the changes would be better saved for later during a final review.

Known issues with the currently generated code

There's a few issues with the final generated code that I've already spotted. These should all be relatively easy to fix in an automated way. However, I think doing so is probably better done by pgindent or some other auto formatting tool, instead of with the regex. Note that I did run pgindent, it just didn't address these things:

  1. Whitespace between type and variable is kept the same as it was before moving the declaration. If the original declaration had whitespace added in such a way that all variable names of declarations aligned, then this whitespace is preserved. This looks weird in various cases. See [3] for an example.
  2. pgindent adds a newline after each block of declarations, even if they are not at the start of function. If this is desired is debatable, but to me it seems like it adds excessive newlines. See [7] for an example.
  3. If all declarations are moved away from the top of the function, then an empty newline is kept at the top of the function. This seems like an unnecessary newline to me. See [3] for an example.

Sources:

  1. tokei[8] results of lines of code:
    Before:
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1361309       866514       330933       163862
===============================================================================
 Total                1389      1361309       866514       330933       163862
===============================================================================

After:

$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1347515       843186       330952       173377
===============================================================================
 Total                1389      1347515       843186       330952       173377
===============================================================================
  1. https://github.com/mgp/book-notes/blob/master/code-complete.markdown#103-guidelines-for-initializing-variables
@@ -401,11 +389,10 @@ pg_file_rename_internal(text *file1, text *file2, text *file3)
 Datum
 pg_file_unlink(PG_FUNCTION_ARGS)
 {
-	char	   *filename;
 
 	requireSuperuser();
 
-	filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+	char	   *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
 
 	if (access(filename, W_OK) < 0)
  1. Automatically convert useless declarations using regex replace citusdata/citus#3181
  2. Remove useless lines caused by declaration-after-statement warning hapostgres/pg_auto_failover#266
  3. Allow declaration after statement and reformat code to use it #2
@@ -2863,7 +2886,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 	 * longer than we must.
 	 */
 	Buffer		buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
-								state->checkstrategy);
+											state->checkstrategy);
+
 	LockBuffer(buffer, BT_READ);
 
 	/*
  1. https://github.com/XAMPPRocky/tokei

Obligatory jokes:

  1. https://xkcd.com/1171/
  2. https://xkcd.com/208/
  3. https://stackoverflow.com/a/1732454/2570866 (and serious response to it https://neilmadden.blog/2019/02/24/why-you-really-can-parse-html-and-anything-else-with-regular-expressions/)

JelteF and others added 6 commits August 19, 2021 11:55
There's an annoying edge case in C, where a declaration cannot be the
first line after a label. A label needs to point to an expression, and a
declaration is not an expression. There were only 4 cases where this was
a problem in the entire codebase. I fixed those cases manually here in a
way that the script does not autofix them back.
@JelteF JelteF changed the title Allow declaration after statement Allow declaration after statement and reformat code to use it Aug 19, 2021
JelteF pushed a commit that referenced this pull request Mar 31, 2022
Currently the pc files use hard coded paths for "includedir" and
"libdir."

Example:

  Cflags: -I/usr/include
  Libs: -L/usr/lib -lpq

This is not very fortunate when cross compiling inside a buildroot,
where the includes and libs are inside a staging directory, because
this introduces host paths into the build:

  checking for pkg-config... /builder/shared-workdir/build/sdk/staging_dir/host/bin/pkg-config
  checking for PostgreSQL libraries via pkg_config... -L/usr/lib <----

This commit addresses this by doing the following two things:

  1. Instead of hard coding the paths in "Cflags" and "Libs"
     "${includedir}" and "${libdir}" are used.  Note: these variables
     can be overriden on the pkg-config command line
     ("--define-variable=libdir=/some/path").

  2. Add the variables "prefix" and "exec_prefix".  If "includedir"
     and/or "libdir" are using these then construct them accordingly.
     This is done because buildroots (for instance OpenWrt) tend to
     rename the real pkg-config and call it indirectly from a script
     that sets "prefix", "exec_prefix" and "bindir", like so:

     pkg-config.real --define-variable=prefix=${STAGING_PREFIX} \
       --define-variable=exec_prefix=${STAGING_PREFIX} \
       --define-variable=bindir=${STAGING_PREFIX}/bin $@

Example #1: user calls ./configure with "--libdir=/some/lib" and
"--includedir=/some/include":

  prefix=/usr/local/pgsql
  exec_prefix=${prefix}
  libdir=/some/lib
  includedir=/some/include

  Name: libpq
  Description: PostgreSQL libpq library
  Url: http://www.postgresql.org/
  Version: 12.1
  Requires:
  Requires.private:
  Cflags: -I${includedir}
  Libs: -L${libdir} -lpq
  Libs.private:  -lcrypt -lm

Example #2: user calls ./configure with no arguments:

  prefix=/usr/local/pgsql
  exec_prefix=${prefix}
  libdir=${exec_prefix}/lib
  includedir=${prefix}/include

  Name: libpq
  Description: PostgreSQL libpq library
  Url: http://www.postgresql.org/
  Version: 12.1
  Requires:
  Requires.private:
  Cflags: -I${includedir}
  Libs: -L${libdir} -lpq
  Libs.private:  -lcrypt -lm

Like this the paths can be forced into the staging directory when
using a buildroot setup:

  checking for pkg-config... /home/sk/tmp/openwrt/staging_dir/host/bin/pkg-config
  checking for PostgreSQL libraries via pkg_config... -L/home/sk/tmp/openwrt/staging_dir/target-mips_24kc_musl/usr/lib

Author: Sebastian Kemper <sebastian_ml@gmx.net>
Co-authored-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/20200305213827.GA25135%40darth.lan
JelteF pushed a commit that referenced this pull request Jun 24, 2022
Due to how pg_size_pretty(bigint) was implemented, it's possible that when
given a negative number of bytes that the returning value would not match
the equivalent positive return value when given the equivalent positive
number of bytes.  This was due to two separate issues.

1. The function used bit shifting to convert the number of bytes into
larger units.  The rounding performed by bit shifting is not the same as
dividing.  For example -3 >> 1 = -2, but -3 / 2 = -1.  These two
operations are only equivalent with positive numbers.

2. The half_rounded() macro rounded towards positive infinity.  This meant
that negative numbers rounded towards zero and positive numbers rounded
away from zero.

Here we fix #1 by dividing the values instead of bit shifting.  We fix #2
by adjusting the half_rounded macro always to round away from zero.

Additionally, adjust the pg_size_pretty(numeric) function to be more
explicit that it's using division rather than bit shifting.  A casual
observer might have believed bit shifting was used due to a static
function being named numeric_shift_right.  However, that function was
calculating the divisor from the number of bits and performed division.
Here we make that more clear.  This change is just cosmetic and does not
affect the return value of the numeric version of the function.

Here we also add a set of regression tests both versions of
pg_size_pretty() which test the values directly before and after the
function switches to the next unit.

This bug was introduced in 8a1fab3. Prior to that negative values were
always displayed in bytes.

Author: Dean Rasheed, David Rowley
Discussion: https://postgr.es/m/CAEZATCXnNW4HsmZnxhfezR5FuiGgp+mkY4AzcL5eRGO4fuadWg@mail.gmail.com
Backpatch-through: 9.6, where the bug was introduced.
JelteF pushed a commit that referenced this pull request Aug 18, 2022
We've heard a couple of reports of people having trouble with
multi-gigabyte-sized query-texts files.  It occurred to me that on
32-bit platforms, there could be an issue with integer overflow
of calculations associated with the total query text size.
Address that with several changes:

1. Limit pg_stat_statements.max to INT_MAX / 2 not INT_MAX.
The hashtable code will bound it to that anyway unless "long"
is 64 bits.  We still need overflow guards on its use, but
this helps.

2. Add a check to prevent extending the query-texts file to
more than MaxAllocHugeSize.  If it got that big, qtext_load_file
would certainly fail, so there's not much point in allowing it.
Without this, we'd need to consider whether extent, query_offset,
and related variables shouldn't be off_t not size_t.

3. Adjust the comparisons in need_gc_qtexts() to be done in 64-bit
arithmetic on all platforms.  It appears possible that under duress
those multiplications could overflow 32 bits, yielding a false
conclusion that we need to garbage-collect the texts file, which
could lead to repeatedly garbage-collecting after every hash table
insertion.

Per report from Bruno da Silva.  I'm not convinced that these
issues fully explain his problem; there may be some other bug that's
contributing to the query-texts file becoming so large in the first
place.  But it did get that big, so #2 is a reasonable defense,
and #3 could explain the reported performance difficulties.

(See also commit 8bbe4cb, which addressed some related bugs.
The second Discussion: link is the thread that led up to that.)

This issue is old, and is primarily a problem for old platforms,
so back-patch.

Discussion: https://postgr.es/m/CAB+Nuk93fL1Q9eLOCotvLP07g7RAv4vbdrkm0cVQohDVMpAb9A@mail.gmail.com
Discussion: https://postgr.es/m/5601D354.5000703@BlueTreble.com
JelteF pushed a commit that referenced this pull request Aug 25, 2022
Remove four probes for members of sockaddr_storage.  Keep only the probe
for sockaddr's sa_len, which is enough for our two remaining places that
know about _len fields:

1.  ifaddr.c needs to know if sockaddr has sa_len to understand the
result of ioctl(SIOCGIFCONF).  Only AIX is still using the relevant code
today, but it seems like a good idea to keep it compilable on Linux.

2.  ip.c was testing for presence of ss_len to decide whether to fill in
sun_len in our getaddrinfo_unix() function.  It's just as good to test
for sa_len.  If you have one, you have them all.

(The code in #2 isn't actually needed at all on several OSes I checked
since modern versions ignore sa_len on input to system calls.  Proving
that's the case for all relevant OSes is left for another day, but
wouldn't get rid of that last probe anyway if we still want it for #1.)

Discussion: https://postgr.es/m/CA%2BhUKGJJjF2AqdU_Aug5n2MAc1gr%3DGykNjVBZq%2Bd6Jrcp3Dyvg%40mail.gmail.com
JelteF pushed a commit that referenced this pull request Aug 25, 2022
We've heard a couple of reports of people having trouble with
multi-gigabyte-sized query-texts files.  It occurred to me that on
32-bit platforms, there could be an issue with integer overflow
of calculations associated with the total query text size.
Address that with several changes:

1. Limit pg_stat_statements.max to INT_MAX / 2 not INT_MAX.
The hashtable code will bound it to that anyway unless "long"
is 64 bits.  We still need overflow guards on its use, but
this helps.

2. Add a check to prevent extending the query-texts file to
more than MaxAllocHugeSize.  If it got that big, qtext_load_file
would certainly fail, so there's not much point in allowing it.
Without this, we'd need to consider whether extent, query_offset,
and related variables shouldn't be off_t not size_t.

3. Adjust the comparisons in need_gc_qtexts() to be done in 64-bit
arithmetic on all platforms.  It appears possible that under duress
those multiplications could overflow 32 bits, yielding a false
conclusion that we need to garbage-collect the texts file, which
could lead to repeatedly garbage-collecting after every hash table
insertion.

Per report from Bruno da Silva.  I'm not convinced that these
issues fully explain his problem; there may be some other bug that's
contributing to the query-texts file becoming so large in the first
place.  But it did get that big, so #2 is a reasonable defense,
and #3 could explain the reported performance difficulties.

(See also commit 8bbe4cb, which addressed some related bugs.
The second Discussion: link is the thread that led up to that.)

This issue is old, and is primarily a problem for old platforms,
so back-patch.

Discussion: https://postgr.es/m/CAB+Nuk93fL1Q9eLOCotvLP07g7RAv4vbdrkm0cVQohDVMpAb9A@mail.gmail.com
Discussion: https://postgr.es/m/5601D354.5000703@BlueTreble.com
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant