Protocol level gucs #5

Closed
wants to merge 1,402 commits into from

Conversation


@JelteF JelteF commented Jan 8, 2024

No description provided.

@JelteF JelteF force-pushed the protocol-level-gucs branch 10 times, most recently from 67d165e to 9f63c17 Compare January 15, 2024 18:48
@JelteF JelteF force-pushed the protocol-level-gucs branch 2 times, most recently from c18f1bb to fecd994 Compare January 22, 2024 09:13
@JelteF JelteF force-pushed the protocol-level-gucs branch from fecd994 to 9249bb9 Compare January 26, 2024 07:56
@JelteF JelteF force-pushed the protocol-level-gucs branch from 9249bb9 to 17b500c Compare February 20, 2024 14:02
@JelteF JelteF force-pushed the protocol-level-gucs branch 4 times, most recently from 09827df to cf406e4 Compare March 14, 2024 17:52
@JelteF JelteF force-pushed the protocol-level-gucs branch from cf406e4 to 0d76476 Compare April 4, 2024 17:04
@JelteF JelteF force-pushed the protocol-level-gucs branch from 0d76476 to daab02d Compare May 3, 2024 12:32
@JelteF JelteF force-pushed the protocol-level-gucs branch 4 times, most recently from 83dbf59 to 1e848ec Compare May 18, 2024 10:07
@JelteF JelteF force-pushed the protocol-level-gucs branch 6 times, most recently from dfbb1d1 to e37b306 Compare June 3, 2024 13:56
melanieplageman and others added 17 commits December 19, 2024 11:55
1a0da34 replaced Bitmap Table Scan's separate private and
shared bitmap iterators with a unified iterator. It accidentally set up
the prefetch iterator twice for non-parallel bitmap table scans. Remove
the extra setup call to tbm_begin_iterate().
The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify
how many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
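The early-exit behaviour described above can be sketched as a merge-style intersect over two sorted inputs (a minimal stand-in using int arrays; the function name and counter are illustrative, not the nodeSetop.c code):

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many right-hand tuples were examined, to make the
 * early-exit behaviour observable. */
static size_t right_reads;

/* Sketch of sorted-mode INTERSECT over two sorted int arrays (stand-ins
 * for the left and right input relations): never touch the right input
 * when the left is empty, and stop as soon as the left is exhausted. */
static size_t setop_intersect(const int *left, size_t nleft,
                              const int *right, size_t nright,
                              int *out)
{
    size_t l = 0, r = 0, nout = 0;

    right_reads = 0;
    if (nleft == 0)
        return 0;               /* empty left: no output is possible */

    while (l < nleft && r < nright)
    {
        right_reads++;          /* examining right[r] this iteration */
        if (left[l] < right[r])
            l++;
        else if (left[l] > right[r])
            r++;
        else
        {
            out[nout++] = left[l];
            l++, r++;
        }
    }
    return nout;                /* leftover right tuples cannot match */
}
```

With an empty left input the right relation is never read at all, which is the point of the early exit.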

Also, this patch adds some infrastructure for detecting whether
child plan nodes all output the same type of tuple table slot.
If they do, the hash table logic can use slightly more efficient
code based on assuming that that's the input slot type it will see.
We'll make use of that infrastructure in other plan node types later.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
Remove the code for inserting flag columns in the inputs of a SetOp.
That was the only reason why there would be resjunk columns in a
set-operations plan tree, so we can get rid of some code that
supported that, too.

Get rid of choose_hashed_setop() in favor of building Paths for
the hashed and sorted alternatives, and letting them fight it out
within add_path().

Remove set_operation_ordered_results_useful(), which was giving wrong
answers due to examining the wrong ancestor node: we need to examine
the immediate SetOperationStmt parent not the topmost node.  Instead
make each caller of recurse_set_operations() pass down the relevant
parent node.  (This thinko seems to have led only to wasted planning
cycles and possibly-inferior plans, not wrong query answers.  Perhaps
we should back-patch it, but I'm not doing so right now.)

Teach generate_nonunion_paths() to consider pre-sorted inputs for
sorted SetOps, rather than always generating a Sort node.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
Append, MergeAppend, and RecursiveUnion can all use the support
functions added in commit 2762792.  The first two can report a
fixed result slot type if all their children return the same fixed
slot type.  That does nothing for the append step itself, but might
allow optimizations in the parent plan node.  RecursiveUnion can
optimize tuple hash table operations in the same way as SetOp now
does.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
It was reasonable to preserve the old API of BuildTupleHashTable()
in the back branches, but in HEAD we should actively discourage use
of that version.  There are no remaining callers in core, so just
get rid of it.  Then rename BuildTupleHashTableExt() back to
BuildTupleHashTable().

While at it, fix up the miserably-poorly-maintained header comment
for BuildTupleHashTable[Ext].  It looks like more than one patch in
this area has had the opinion that updating comments is beneath them.

Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us
b7493e1 removed leftover mentions of XLOG_HEAP2_FREEZE_PAGE records
from comments but neglected to remove one mention of FREEZE_PAGE.

Reported off-list by Alexander Lakhin
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses.  Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses.
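A sketch of the idea (the field list below is illustrative, not the exact CompactAttribute definition): a 16-byte entry packs the hot columns so that several attributes share one 64-byte cache line, where the 104-byte FormData_pg_attribute spans more than one line on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 16-byte mirror of the hot FormData_pg_attribute fields.
 * Four entries fit in a single 64-byte cache line, so tuple deformation
 * touches far fewer cache lines per row. */
typedef struct CompactAttribute
{
    int32_t attcacheoff;   /* cached offset into the tuple, or -1 */
    int16_t attlen;        /* fixed length of the type, or -1 / -2 */
    bool    attbyval;      /* pass-by-value type? */
    bool    attispackable; /* can the varlena header be packed? */
    bool    atthasmissing; /* has a "missing" (fast default) value? */
    bool    attisdropped;  /* column has been dropped? */
    bool    attnotnull;    /* NOT NULL constraint? */
    uint8_t attalignby;    /* alignment requirement in bytes */
    char    padding[4];    /* pad out to exactly 16 bytes */
} CompactAttribute;
```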

For some workloads, tuple deformation can be the most CPU intensive part
of processing the query.  Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column.  However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.

This also makes pg_attribute.attcacheoff redundant.  A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
One test added in 9aea73f did not take into account that the
backend may still perform some fsyncs even after a checkpoint. Let's
relax it to be more flexible.

Per report from buildfarm member grassquit, via Alexander Lakhin.

Author: Bertrand Drouvot
Discussion: https://postgr.es/m/6143ab0a-9e88-4790-8d9d-50ba45657761@gmail.com
The column is no longer needed as the offset is now cached in the
CompactAttribute struct per commit 5983a4c.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
RelationTruncate() does three things, while holding an
AccessExclusiveLock and preventing checkpoints:

1. Logs the truncation.
2. Drops buffers, even if they're dirty.
3. Truncates some number of files.

Step 2 could previously be canceled if it had to wait for I/O, and step
3 could and still can fail in file APIs.  All orderings of these
operations have data corruption hazards if interrupted, so we can't give
up until the whole operation is done.  When dirty pages were discarded
but the corresponding blocks were left on disk due to ERROR, old page
versions could come back from disk, reviving deleted data (see
pgsql-bugs #18146 and several like it).  When primary and standby were
allowed to disagree on relation size, standbys could panic (see
pgsql-bugs #18426) or revive data unknown to visibility management on
the primary (theorized).

Changes:

 * WAL is now unconditionally flushed first
 * smgrtruncate() is now called in a critical section, preventing
   interrupts and causing PANIC on file API failure
 * smgrtruncate() has a new parameter for existing fork sizes,
   because it can't call smgrnblocks() itself inside a critical section
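The required ordering can be sketched with stubs that record a trace of events (the function below is a stand-in; the real WAL, buffer-manager, and smgr calls are only paraphrased in comments):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stub event trace standing in for WAL, buffer, and smgr operations. */
static const char *trace[8];
static int ntrace;
static bool in_crit_section;
static bool truncated_in_crit;

static void record(const char *ev) { trace[ntrace++] = ev; }

/* Sketch of the corrected RelationTruncate() ordering. */
static void RelationTruncate_sketch(void)
{
    record("xlog_truncate");    /* 1. log the truncation */
    record("xlog_flush");       /* WAL is now unconditionally flushed
                                 * before any irreversible step */

    in_crit_section = true;     /* START_CRIT_SECTION(): interrupts are
                                 * held off and any file API failure
                                 * escalates to PANIC, so we can't be
                                 * left half-done */
    record("drop_buffers");     /* 2. drop buffers, even dirty ones */
    record("smgrtruncate");     /* 3. truncate the files; fork sizes
                                 * were passed in beforehand, since
                                 * smgrnblocks() can't run here */
    truncated_in_crit = in_crit_section;
    in_crit_section = false;    /* END_CRIT_SECTION() */
}
```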

The changes apply to RelationTruncate(), smgr_redo() and
pg_truncate_visibility_map().  That last is also brought up to date with
other evolutions of the truncation protocol.

The VACUUM FileTruncate() failure mode had been discussed in older
reports than the ones referenced below, with independent analysis from
many people, but earlier theories on how to fix it were too complicated
to back-patch.  The more recently invented cancellation bug was
diagnosed by Alexander Lakhin.  Other corruption scenarios were spotted
by me while iterating on this patch and earlier commit 75818b3.

Back-patch to all supported releases.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reported-by: rootcause000@gmail.com
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/18146-04e908c662113ad5%40postgresql.org
Discussion: https://postgr.es/m/18426-2d18da6586f152d6%40postgresql.org
28328ec addressed one overflow danger in
SampleHeapTupleVisible() but introduced another, albeit a less likely
one. Modify the binary search code to remove this danger.

Reported-by: Richard Guo
Reviewed-by: Richard Guo, Ranier Vilela
Discussion: https://postgr.es/m/CAMbWs4_bE%2BNscChbKWzw6HZOipCUyXfA5133qvoXQ654D3B2gQ%40mail.gmail.com
This used to say "nsubxcnt isn't decreased when subtransactions
abort", but there's no variable called nsubxcnt. Commit 8548ddc
changed it to "subxcnt", among other typo fixes, but that was wrong
too: the comment actually talks about txn->nsubtxns. That's the field
that's incremented but never decremented and is used for the
allocation earlier in the function.
Like CurrentSnapshotData, it should not be accessed directly outside
snapmgr.c.
Here we convert CompactAttribute.attalign from a char, which is directly
derived from pg_attribute.attalign, into a uint8, which stores the number
of bytes to align the column's value by in the tuple.

This allows tuple deformation and tuple size calculations to move away
from using the inefficient att_align_nominal() macro, which manually
checks each TYPALIGN_* char to translate that into the alignment bytes
for the given type.  Effectively, this commit changes those to TYPEALIGN
calls, which are branchless and only perform some simple arithmetic with
some bit-twiddling.
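The branchless round-up mirrors PostgreSQL's TYPEALIGN macro; with attalignby holding the byte count directly, no per-type switch over TYPALIGN_* characters is needed (the helper function is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Round LEN up to the next multiple of ALIGNVAL, which must be a power
 * of two.  This is the TYPEALIGN formula: pure arithmetic plus
 * bit-twiddling, with no branches for the CPU to mispredict. */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

/* Hypothetical helper: align a tuple offset by a column's attalignby. */
static uintptr_t align_offset(uintptr_t off, uint8_t attalignby)
{
    return TYPEALIGN(attalignby, off);
}
```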

The removed branches were often mispredicted by CPUs, especially so in
real-world tables which often contain a mishmash of different types
with different alignment requirements.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
GetSnapshotData() set TransactionXmin = MyProc->xmin, but when
SnapshotResetXmin() advanced MyProc->xmin, it did not advance
TransactionXmin correspondingly. That meant that TransactionXmin could
be older than MyProc->xmin, and XIDs between TransactionXmin and
the real MyProc->xmin could be vacuumed away. One known consequence is
in pg_subtrans lookups: we might try to look up the status of an XID
that was already truncated away.

Back-patch to all supported versions.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/d27a046d-a1e4-47d1-a95c-fbabe41debb4@iki.fi
Library unloading has never been supported, and its code was removed in
ab02d70, but some comments still mentioned that it was a possible
operation.

ChangAo has noticed the incorrect references in dfmgr.c, while I have
noticed the other ones while scanning the rest of the tree for similar
mistakes.

Author: ChangAo Chen, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/tencent_1D09840A1632D406A610C8C4E2491D74DB0A@qq.com
Jian He reported the src/include/utility/tcop.h one and the remainder
were found by using a script to look for src/* and check that we have a
filename or directory of that name.

In passing, fix some out-of-date comments.

Reported-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CACJufxGoE3H-7VgO02=PrR4SNuVWDVbfTyUnwO0HvS-Lxurnog@mail.gmail.com
@JelteF JelteF force-pushed the protocol-level-gucs branch 3 times, most recently from cc1b23a to 15613a7 Compare December 23, 2024 15:24
JelteF and others added 8 commits December 23, 2024 17:36
This is necessary to be able to actually use the function on Windows.
This changes the libpq tracing code to correctly trace the
NegotiateProtocolVersion message. Previously it wasn't important that
tracing of the NegotiateProtocolVersion message worked correctly,
because in practice libpq never received it. Now that we are planning to
introduce protocol changes in future commits it starts to become more
useful for testing/debugging.
Previously libpq would always error when the server returns a
NegotiateProtocolVersion message. This was fine because libpq only
supported a single protocol version and did not support any protocol
parameters. But now that we're discussing protocol changes we need to
change this behaviour, and introduce a fallback mechanism when
connecting to an older server.

This patch modifies the client side checks to allow a range of supported
protocol versions, instead of only allowing the exact version that was
requested. Currently this "range" only contains the 3.0 version, but in
a future commit we'll change this. In addition it changes the
protocol errors to say that they occur because we did not request any
parameters, not because the server did not support some of them. In a
future commit more infrastructure for protocol parameters will be added,
so that these checks can also succeed when receiving unsupported
parameters back.
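The relaxed check can be sketched as a simple range test over packed version numbers (the PG_PROTOCOL packing mirrors PostgreSQL's macro; the helper function itself is an illustrative sketch, not libpq's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Protocol versions are packed as (major << 16 | minor), as in
 * PostgreSQL's PG_PROTOCOL macro. */
#define PG_PROTOCOL(m, n)  (((uint32_t) (m) << 16) | (uint32_t) (n))

/* Accept any server-negotiated version inside the client's supported
 * range, instead of requiring the exact version that was requested.
 * Initially the range is just [3.0, 3.0]; a later commit widens it. */
static bool version_acceptable(uint32_t negotiated,
                               uint32_t min_supported,
                               uint32_t max_supported)
{
    return negotiated >= min_supported && negotiated <= max_supported;
}
```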

Note that at the moment this change does not have any behavioural
effect, because libpq will only request version 3.0 and will never send
protocol parameters. Which means that the client never receives a
NegotiateProtocolVersion message from the server.
All officially supported versions of the PostgreSQL server send the
NegotiateProtocolVersion message when an unsupported minor protocol
version is requested by a client. But many other applications that
implement the PostgreSQL protocol (connection poolers, or other
databases) do not, and the same is true for many unsupported PostgreSQL
server versions. Connecting to such other applications thus fails if a
client requests a protocol version different than 3.0.

This patch adds a max_protocol_version connection option to libpq that
specifies the protocol version that libpq should request from the
server. Currently all allowed values result in the use of 3.0, but that
will be changed in a future commit that bumps the protocol version. Even
after that version bump the default will likely stay 3.0 for the time
being. Once more of the ecosystem supports the NegotiateProtocolVersion
message we might want to change the default to the latest minor version.

We also add the similar min_protocol_version connection option, to allow
a client to specify that connecting should fail if a lower protocol
version is attempted by the server. This can be used to ensure certain
protocol features are in use, which can be particularly useful if those
features impact security.
To be able to test new protocol features added by future commits, we
start requesting the latest protocol version in our libpq_pipeline test.
In preparation of new additions to the protocol in a follow up commit
this bumps the minor protocol version number. Instead of bumping the
version number to 3.1, which would be the next minor version, we skip
that one and bump straight to 3.2. The reason for this is that many
PgBouncer releases have, due to an off-by-one bug, reported 3.1 as
supported[1]. These versions would interpret 3.1 as equivalent to 3.0. So
if we now added extra messages to the 3.1 protocol, clients would
succeed in connecting to PgBouncer, but would then hit connection
failures when sending such new messages.

So instead of bumping to 3.1, we bump the protocol version to 3.2, for
which these buggy PgBouncer releases will correctly close the connection
at the startup packet. It's a bit strange to skip a version number due
to a bug in a third-party application, but PgBouncer is used widely
enough that it seems worth it to not cause user confusion when
connecting to recent versions of it. Especially since skipping a single
minor version number in the protocol versioning doesn't really cost us
anything. So, while this is not the most theoretically sound decision,
it is the most pragmatic one.

[1]: pgbouncer/pgbouncer#1007
Since the introduction of the NegotiateProtocolParameter message the
server has been able to say to the client that it doesn't support any of
the protocol parameters that the client requested.

While this is important for backwards compatibility, it doesn't give
much guidance for people wanting to add new protocol parameters. This
commit intends to change that by adding a generic infrastructure that
can be used to introduce new protocol parameters in a standardized way.

There are two key features that are necessary to actually be able to use
protocol parameters:

1. Negotiating the valid values of a protocol parameter (e.g. what
   compression methods are supported). This is needed because we want to
   support protocol parameters without adding an extra round-trip to the
   connection startup. So, a server needs to be able to accept the data
   in the StartupMessage, while also sharing with the client what it
   actually accepts in its response.
2. Changing a protocol parameter after connection startup. This is
   critical for connection poolers, otherwise they would need to
   separate connections with different values for the protocol
   parameters.

To support these two features this commit adds three new protocol
messages, including their code to handle these messages client and
server side:

1. NegotiateProtocolParameter (BE): Sent during connection startup when the
   server supports the protocol parameter. This tells the client if the
   server accepted the value that the client provided for the parameter.
   It also tells the client what other values it accepts.
2. SetProtocolParameter (FE): Can be used to change protocol parameters after
   the connection startup.
3. SetProtocolParameterComplete (BE): Response to SetProtocolParameter
   which tells the client if the new value was accepted or not.
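The three additions above can be summarized in a small table (the directions follow the description; the struct and constants are illustrative, not the wire format):

```c
#include <assert.h>

/* Who sends each message: BE = backend (server), FE = frontend. */
typedef enum MsgDirection { FROM_BACKEND, FROM_FRONTEND } MsgDirection;

typedef struct ProtocolMsg
{
    const char  *name;
    MsgDirection dir;
} ProtocolMsg;

static const ProtocolMsg new_messages[] = {
    /* Startup-time answer: was the client's value accepted, and which
     * other values does the server take? */
    {"NegotiateProtocolParameter",   FROM_BACKEND},
    /* Change a protocol parameter after startup (key for poolers). */
    {"SetProtocolParameter",         FROM_FRONTEND},
    /* Tells the client whether the new value was accepted. */
    {"SetProtocolParameterComplete", FROM_BACKEND},
};
```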
Connection poolers use the ParameterStatus message to track changes in
session level parameters. Before assigning a server connection to a
client, the pooler will first restore the server parameters that the
client expects. This is done to emulate session level SET commands in
transaction pooling mode.

Previously this could only be done for a hard-coded set of backend
parameters, but with this change a connection pooler can choose for
which additional backend parameters it wants to receive status changes.
It can do this by setting the newly added report_parameters protocol
parameter to a list of parameter names.
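A toy model of that pooler bookkeeping (parameter names, values, and the helper are all illustrative; in reality the server-side values would be fed by incoming ParameterStatus messages):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NPARAMS 2

/* Per-parameter state: what the client expects vs. what the server
 * connection currently has (as last reported via ParameterStatus). */
static const char *names[NPARAMS]   = {"search_path", "work_mem"};
static char client_val[NPARAMS][32] = {"app, public", "64MB"};
static char server_val[NPARAMS][32] = {"public",      "64MB"};
static char last_issued_set[128];

/* Before handing the server connection to the client, re-issue SET for
 * every parameter whose server-side value drifted from what the client
 * expects, emulating session-level SET in transaction pooling mode. */
static int restore_parameters(void)
{
    int nrestored = 0;

    for (int i = 0; i < NPARAMS; i++)
        if (strcmp(client_val[i], server_val[i]) != 0)
        {
            snprintf(last_issued_set, sizeof(last_issued_set),
                     "SET %s = '%s'", names[i], client_val[i]);
            strcpy(server_val[i], client_val[i]);   /* server now matches */
            nrestored++;
        }
    return nrestored;
}
```

With report_parameters, the set of names tracked this way is no longer hard-coded: the pooler chooses which parameters the server should report.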
@JelteF JelteF force-pushed the protocol-level-gucs branch from 15613a7 to 5374de7 Compare December 23, 2024 16:39
@JelteF JelteF closed this Feb 14, 2025