Protocol level gucs #5

Closed
wants to merge 1,402 commits into from

Conversation


@JelteF JelteF commented Jan 8, 2024

No description provided.

@JelteF JelteF force-pushed the protocol-level-gucs branch 10 times, most recently from 67d165e to 9f63c17 Compare January 15, 2024 18:48
@JelteF JelteF force-pushed the protocol-level-gucs branch 2 times, most recently from c18f1bb to fecd994 Compare January 22, 2024 09:13
@JelteF JelteF force-pushed the protocol-level-gucs branch from fecd994 to 9249bb9 Compare January 26, 2024 07:56
@JelteF JelteF force-pushed the protocol-level-gucs branch from 9249bb9 to 17b500c Compare February 20, 2024 14:02
@JelteF JelteF force-pushed the protocol-level-gucs branch 4 times, most recently from 09827df to cf406e4 Compare March 14, 2024 17:52
@JelteF JelteF force-pushed the protocol-level-gucs branch from cf406e4 to 0d76476 Compare April 4, 2024 17:04
@JelteF JelteF force-pushed the protocol-level-gucs branch from 0d76476 to daab02d Compare May 3, 2024 12:32
@JelteF JelteF force-pushed the protocol-level-gucs branch 4 times, most recently from 83dbf59 to 1e848ec Compare May 18, 2024 10:07
@JelteF JelteF force-pushed the protocol-level-gucs branch 6 times, most recently from dfbb1d1 to e37b306 Compare June 3, 2024 13:56
melanieplageman and others added 17 commits December 19, 2024 11:55
1a0da34 replaced Bitmap Table Scan's separate private and
shared bitmap iterators with a unified iterator. It accidentally set up
the prefetch iterator twice for non-parallel bitmap table scans. Remove
the extra setup call to tbm_begin_iterate().
The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify
how many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
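The early-exit behaviour described above can be sketched as a merge-style intersect over two sorted inputs (a minimal stand-in using int arrays; the function name and counter are illustrative, not the nodeSetop.c code):

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many right-hand tuples were examined, to make the
 * early-exit behaviour observable. */
static size_t right_reads;

/* Sketch of sorted-mode INTERSECT over two sorted int arrays (stand-ins
 * for the left and right input relations): never touch the right input
 * when the left is empty, and stop as soon as the left is exhausted. */
static size_t setop_intersect(const int *left, size_t nleft,
                              const int *right, size_t nright,
                              int *out)
{
    size_t l = 0, r = 0, nout = 0;

    right_reads = 0;
    if (nleft == 0)
        return 0;               /* empty left: no output is possible */

    while (l < nleft && r < nright)
    {
        right_reads++;          /* examining right[r] this iteration */
        if (left[l] < right[r])
            l++;
        else if (left[l] > right[r])
            r++;
        else
        {
            out[nout++] = left[l];
            l++, r++;
        }
    }
    return nout;                /* leftover right tuples cannot match */
}
```

With an empty left input the right relation is never read at all, which is the point of the early exit.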

Also, this patch adds some infrastructure for detecting whether
child plan nodes all output the same type of tuple table slot.
If they do, the hash table logic can use slightly more efficient
code based on assuming that that's the input slot type it will see.
We'll make use of that infrastructure in other plan node types later.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
Remove the code for inserting flag columns in the inputs of a SetOp.
That was the only reason why there would be resjunk columns in a
set-operations plan tree, so we can get rid of some code that
supported that, too.

Get rid of choose_hashed_setop() in favor of building Paths for
the hashed and sorted alternatives, and letting them fight it out
within add_path().

Remove set_operation_ordered_results_useful(), which was giving wrong
answers due to examining the wrong ancestor node: we need to examine
the immediate SetOperationStmt parent not the topmost node.  Instead
make each caller of recurse_set_operations() pass down the relevant
parent node.  (This thinko seems to have led only to wasted planning
cycles and possibly-inferior plans, not wrong query answers.  Perhaps
we should back-patch it, but I'm not doing so right now.)

Teach generate_nonunion_paths() to consider pre-sorted inputs for
sorted SetOps, rather than always generating a Sort node.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
Append, MergeAppend, and RecursiveUnion can all use the support
functions added in commit 2762792.  The first two can report a
fixed result slot type if all their children return the same fixed
slot type.  That does nothing for the append step itself, but might
allow optimizations in the parent plan node.  RecursiveUnion can
optimize tuple hash table operations in the same way as SetOp now
does.

Patch by me; thanks to Richard Guo and David Rowley for review.

Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
It was reasonable to preserve the old API of BuildTupleHashTable()
in the back branches, but in HEAD we should actively discourage use
of that version.  There are no remaining callers in core, so just
get rid of it.  Then rename BuildTupleHashTableExt() back to
BuildTupleHashTable().

While at it, fix up the miserably-poorly-maintained header comment
for BuildTupleHashTable[Ext].  It looks like more than one patch in
this area has had the opinion that updating comments is beneath them.

Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us
b7493e1 removed leftover mentions of XLOG_HEAP2_FREEZE_PAGE records
from comments but neglected to remove one mention of FREEZE_PAGE.

Reported off-list by Alexander Lakhin
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses.  Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses.
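A sketch of the idea (the field list below is illustrative, not the exact CompactAttribute definition): a 16-byte entry packs the hot columns so that several attributes share one 64-byte cache line, where the 104-byte FormData_pg_attribute spans more than one line on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 16-byte mirror of the hot FormData_pg_attribute fields.
 * Four entries fit in a single 64-byte cache line, so tuple deformation
 * touches far fewer cache lines per row. */
typedef struct CompactAttribute
{
    int32_t attcacheoff;   /* cached offset into the tuple, or -1 */
    int16_t attlen;        /* fixed length of the type, or -1 / -2 */
    bool    attbyval;      /* pass-by-value type? */
    bool    attispackable; /* can the varlena header be packed? */
    bool    atthasmissing; /* has a "missing" (fast default) value? */
    bool    attisdropped;  /* column has been dropped? */
    bool    attnotnull;    /* NOT NULL constraint? */
    uint8_t attalignby;    /* alignment requirement in bytes */
    char    padding[4];    /* pad out to exactly 16 bytes */
} CompactAttribute;
```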

For some workloads, tuple deformation can be the most CPU intensive part
of processing the query.  Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column.  However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.

This also makes pg_attribute.attcacheoff redundant.  A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
One test added in 9aea73f did not take into account that the
backend may still perform some fsyncs even after a checkpoint. Let's
relax it to be more flexible.

Per report from buildfarm member grassquit, via Alexander Lakhin.

Author: Bertrand Drouvot
Discussion: https://postgr.es/m/6143ab0a-9e88-4790-8d9d-50ba45657761@gmail.com
The column is no longer needed as the offset is now cached in the
CompactAttribute struct per commit 5983a4c.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
RelationTruncate() does three things, while holding an
AccessExclusiveLock and preventing checkpoints:

1. Logs the truncation.
2. Drops buffers, even if they're dirty.
3. Truncates some number of files.

Step 2 could previously be canceled if it had to wait for I/O, and step
3 could and still can fail in file APIs.  All orderings of these
operations have data corruption hazards if interrupted, so we can't give
up until the whole operation is done.  When dirty pages were discarded
but the corresponding blocks were left on disk due to ERROR, old page
versions could come back from disk, reviving deleted data (see
pgsql-bugs #18146 and several like it).  When primary and standby were
allowed to disagree on relation size, standbys could panic (see
pgsql-bugs #18426) or revive data unknown to visibility management on
the primary (theorized).

Changes:

 * WAL is now unconditionally flushed first
 * smgrtruncate() is now called in a critical section, preventing
   interrupts and causing PANIC on file API failure
 * smgrtruncate() has a new parameter for existing fork sizes,
   because it can't call smgrnblocks() itself inside a critical section
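The required ordering can be sketched with stubs that record a trace of events (the function below is a stand-in; the real WAL, buffer-manager, and smgr calls are only paraphrased in comments):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stub event trace standing in for WAL, buffer, and smgr operations. */
static const char *trace[8];
static int ntrace;
static bool in_crit_section;
static bool truncated_in_crit;

static void record(const char *ev) { trace[ntrace++] = ev; }

/* Sketch of the corrected RelationTruncate() ordering. */
static void RelationTruncate_sketch(void)
{
    record("xlog_truncate");    /* 1. log the truncation */
    record("xlog_flush");       /* WAL is now unconditionally flushed
                                 * before any irreversible step */

    in_crit_section = true;     /* START_CRIT_SECTION(): interrupts are
                                 * held off and any file API failure
                                 * escalates to PANIC, so we can't be
                                 * left half-done */
    record("drop_buffers");     /* 2. drop buffers, even dirty ones */
    record("smgrtruncate");     /* 3. truncate the files; fork sizes
                                 * were passed in beforehand, since
                                 * smgrnblocks() can't run here */
    truncated_in_crit = in_crit_section;
    in_crit_section = false;    /* END_CRIT_SECTION() */
}
```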

The changes apply to RelationTruncate(), smgr_redo() and
pg_truncate_visibility_map().  That last is also brought up to date with
other evolutions of the truncation protocol.

The VACUUM FileTruncate() failure mode had been discussed in older
reports than the ones referenced below, with independent analysis from
many people, but earlier theories on how to fix it were too complicated
to back-patch.  The more recently invented cancellation bug was
diagnosed by Alexander Lakhin.  Other corruption scenarios were spotted
by me while iterating on this patch and earlier commit 75818b3.

Back-patch to all supported releases.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reported-by: rootcause000@gmail.com
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/18146-04e908c662113ad5%40postgresql.org
Discussion: https://postgr.es/m/18426-2d18da6586f152d6%40postgresql.org
28328ec addressed one overflow danger in
SampleHeapTupleVisible() but introduced another, albeit a less likely
one. Modify the binary search code to remove this danger.

Reported-by: Richard Guo
Reviewed-by: Richard Guo, Ranier Vilela
Discussion: https://postgr.es/m/CAMbWs4_bE%2BNscChbKWzw6HZOipCUyXfA5133qvoXQ654D3B2gQ%40mail.gmail.com
This used to say "nsubxcnt isn't decreased when subtransactions
abort", but there's no variable called nsubxcnt. Commit 8548ddc
changed it to "subxcnt", among other typo fixes, but that was wrong
too: the comment actually talks about txn->nsubtxns. That's the field
that's incremented but never decremented and is used for the
allocation earlier in the function.
Like CurrentSnapshotData, it should not be accessed directly outside
snapmgr.c.
Here we convert CompactAttribute.attalign from a char, which is directly
derived from pg_attribute.attalign, into a uint8, which stores the number
of bytes to align the column's value by in the tuple.

This allows tuple deformation and tuple size calculations to move away
from using the inefficient att_align_nominal() macro, which manually
checks each TYPALIGN_* char to translate that into the alignment bytes
for the given type.  Effectively, this commit changes those to TYPEALIGN
calls, which are branchless and only perform some simple arithmetic with
some bit-twiddling.
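The branchless round-up mirrors PostgreSQL's TYPEALIGN macro; with attalignby holding the byte count directly, no per-type switch over TYPALIGN_* characters is needed (the helper function is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Round LEN up to the next multiple of ALIGNVAL, which must be a power
 * of two.  This is the TYPEALIGN formula: pure arithmetic plus
 * bit-twiddling, with no branches for the CPU to mispredict. */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

/* Hypothetical helper: align a tuple offset by a column's attalignby. */
static uintptr_t align_offset(uintptr_t off, uint8_t attalignby)
{
    return TYPEALIGN(attalignby, off);
}
```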

The removed branches were often mispredicted by CPUs, especially so in
real-world tables which often contain a mishmash of different types
with different alignment requirements.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
GetSnapshotData() set TransactionXmin = MyProc->xmin, but when
SnapshotResetXmin() advanced MyProc->xmin, it did not advance
TransactionXmin correspondingly. That meant that TransactionXmin could
be older than MyProc->xmin, and XIDs between TransactionXmin and
the real MyProc->xmin could be vacuumed away. One known consequence is
in pg_subtrans lookups: we might try to look up the status of an XID
that was already truncated away.

Back-patch to all supported versions.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/d27a046d-a1e4-47d1-a95c-fbabe41debb4@iki.fi
Library unloading has never been supported, and its code was removed in
ab02d70, but some comments still mentioned that it was a possible
operation.

ChangAo has noticed the incorrect references in dfmgr.c, while I have
noticed the other ones while scanning the rest of the tree for similar
mistakes.

Author: ChangAo Chen, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/tencent_1D09840A1632D406A610C8C4E2491D74DB0A@qq.com
Jian He reported the src/include/utility/tcop.h one and the remainder
were found by using a script to look for src/* and check that we have a
filename or directory of that name.

In passing, fix some out-of-date comments.

Reported-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CACJufxGoE3H-7VgO02=PrR4SNuVWDVbfTyUnwO0HvS-Lxurnog@mail.gmail.com
@JelteF JelteF force-pushed the protocol-level-gucs branch 3 times, most recently from cc1b23a to 15613a7 Compare December 23, 2024 15:24
JelteF and others added 8 commits December 23, 2024 17:36
This is necessary to be able to actually use the function on Windows.
This changes the libpq tracing code to correctly trace the
NegotiateProtocolVersion message. Previously it wasn't important that
tracing of the NegotiateProtocolVersion message worked correctly,
because in practice libpq never received it. Now that we are planning to
introduce protocol changes in future commits it starts to become more
useful for testing/debugging.
Previously libpq would always error when the server returns a
NegotiateProtocolVersion message. This was fine because libpq only
supported a single protocol version and did not support any protocol
parameters. But now that we're discussing protocol changes we need to
change this behaviour, and introduce a fallback mechanism when
connecting to an older server.

This patch modifies the client side checks to allow a range of supported
protocol versions, instead of only allowing the exact version that was
requested. Currently this "range" only contains the 3.0 version, but in
a future commit we'll change this. In addition it changes the
protocol errors to say that they occur because we did not request any
parameters, not because the server did not support some of them. In a
future commit more infrastructure for protocol parameters will be added,
so that these checks can also succeed when receiving unsupported
parameters back.
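The relaxed check can be sketched as a simple range test over packed version numbers (the PG_PROTOCOL packing mirrors PostgreSQL's macro; the helper function itself is an illustrative sketch, not libpq's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Protocol versions are packed as (major << 16 | minor), as in
 * PostgreSQL's PG_PROTOCOL macro. */
#define PG_PROTOCOL(m, n)  (((uint32_t) (m) << 16) | (uint32_t) (n))

/* Accept any server-negotiated version inside the client's supported
 * range, instead of requiring the exact version that was requested.
 * Initially the range is just [3.0, 3.0]; a later commit widens it. */
static bool version_acceptable(uint32_t negotiated,
                               uint32_t min_supported,
                               uint32_t max_supported)
{
    return negotiated >= min_supported && negotiated <= max_supported;
}
```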

Note that at the moment this change does not have any behavioural
effect, because libpq will only request version 3.0 and will never send
protocol parameters. Which means that the client never receives a
NegotiateProtocolVersion message from the server.
All officially supported versions of the PostgreSQL server send the
NegotiateProtocolVersion message when an unsupported minor protocol
version is requested by a client. But many other applications that
implement the PostgreSQL protocol (connection poolers, or other
databases) do not, and the same is true for many unsupported PostgreSQL
server versions. Connecting to such other applications thus fails if a
client requests a protocol version different than 3.0.

This patch adds a max_protocol_version connection option to libpq that
specifies the protocol version that libpq should request from the
server. Currently all allowed values result in the use of 3.0, but that
will be changed in a future commit that bumps the protocol version. Even
after that version bump the default will likely stay 3.0 for the time
being. Once more of the ecosystem supports the NegotiateProtocolVersion
message we might want to change the default to the latest minor version.

We also add the similar min_protocol_version connection option, to allow
a client to specify that connecting should fail if a lower protocol
version is attempted by the server. This can be used to ensure certain
protocol features are in use, which can be particularly useful if those
features impact security.
To be able to test new protocol features added by future commits, we
start requesting the latest protocol version in our libpq_pipeline test.
In preparation of new additions to the protocol in a follow up commit
this bumps the minor protocol version number. Instead of bumping the
version number to 3.1, which would be the next minor version, we skip
that one and bump straight to 3.2. The reason for this is that many
PgBouncer releases have, due to an off-by-one bug, reported 3.1 as
supported[1]. These versions would interpret 3.1 as equivalent to 3.0. So
if we now added extra messages to the 3.1 protocol, clients would
succeed in connecting to PgBouncer, but would then hit connection
failures when sending such new messages.

So instead of bumping to 3.1, we bump the protocol version to 3.2, for
which these buggy PgBouncer releases will correctly close the connection
at the startup packet. It's a bit strange to skip a version number due
to a bug in a third-party application, but PgBouncer is used widely
enough that it seems worth it to not cause user confusion when
connecting to recent versions of it. Especially since skipping a single
minor version number in the protocol versioning doesn't really cost us
anything. So, while this is not the most theoretically sound decision,
it is the most pragmatic one.

[1]: pgbouncer/pgbouncer#1007
Since the introduction of the NegotiateProtocolParameter message the
server has been able to say to the client that it doesn't support any of
the protocol parameters that the client requested.

While this is important for backwards compatibility, it doesn't give
much guidance for people wanting to add new protocol parameters. This
commit intends to change that by adding a generic infrastructure that
can be used to introduce new protocol parameters in a standardized way.

There are two key features that are necessary to actually be able to use
protocol parameters:

1. Negotiating the valid values of a protocol parameter (e.g. what
   compression methods are supported). This is needed because we want to
   support protocol parameters without adding an extra round-trip to the
   connection startup. So, a server needs to be able to accept the data
   in the StartupMessage, while also sharing with the client what it
   actually accepts in its response.
2. Changing a protocol parameter after connection startup. This is
   critical for connection poolers, otherwise they would need to
   separate connections with different values for the protocol
   parameters.

To support these two features this commit adds three new protocol
messages, including their code to handle these messages client and
server side:

1. NegotiateProtocolParameter (BE): Sent during connection startup when the
   server supports the protocol parameter. This tells the client if the
   server accepted the value that the client provided for the parameter.
   It also tells the client what other values it accepts.
2. SetProtocolParameter (FE): Can be used to change protocol parameters after
   the connection startup.
3. SetProtocolParameterComplete (BE): Response to SetProtocolParameter
   which tells the client if the new value was accepted or not.
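The three additions above can be summarized in a small table (the directions follow the description; the struct and constants are illustrative, not the wire format):

```c
#include <assert.h>

/* Who sends each message: BE = backend (server), FE = frontend. */
typedef enum MsgDirection { FROM_BACKEND, FROM_FRONTEND } MsgDirection;

typedef struct ProtocolMsg
{
    const char  *name;
    MsgDirection dir;
} ProtocolMsg;

static const ProtocolMsg new_messages[] = {
    /* Startup-time answer: was the client's value accepted, and which
     * other values does the server take? */
    {"NegotiateProtocolParameter",   FROM_BACKEND},
    /* Change a protocol parameter after startup (key for poolers). */
    {"SetProtocolParameter",         FROM_FRONTEND},
    /* Tells the client whether the new value was accepted. */
    {"SetProtocolParameterComplete", FROM_BACKEND},
};
```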
Connection poolers use the ParameterStatus message to track changes in
session level parameters. Before assigning a server connection to a
client, the pooler will first restore the server parameters that the
client expects. This is done to emulate session level SET commands in
transaction pooling mode.

Previously this could only be done for a hard-coded set of backend
parameters, but with this change a connection pooler can choose for
which additional backend parameters it wants to receive status changes.
It can do this by setting the newly added report_parameters protocol
parameter to a list of parameter names.
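A toy model of that pooler bookkeeping (parameter names, values, and the helper are all illustrative; in reality the server-side values would be fed by incoming ParameterStatus messages):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NPARAMS 2

/* Per-parameter state: what the client expects vs. what the server
 * connection currently has (as last reported via ParameterStatus). */
static const char *names[NPARAMS]   = {"search_path", "work_mem"};
static char client_val[NPARAMS][32] = {"app, public", "64MB"};
static char server_val[NPARAMS][32] = {"public",      "64MB"};
static char last_issued_set[128];

/* Before handing the server connection to the client, re-issue SET for
 * every parameter whose server-side value drifted from what the client
 * expects, emulating session-level SET in transaction pooling mode. */
static int restore_parameters(void)
{
    int nrestored = 0;

    for (int i = 0; i < NPARAMS; i++)
        if (strcmp(client_val[i], server_val[i]) != 0)
        {
            snprintf(last_issued_set, sizeof(last_issued_set),
                     "SET %s = '%s'", names[i], client_val[i]);
            strcpy(server_val[i], client_val[i]);   /* server now matches */
            nrestored++;
        }
    return nrestored;
}
```

With report_parameters, the set of names tracked this way is no longer hard-coded: the pooler chooses which parameters the server should report.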
@JelteF JelteF force-pushed the protocol-level-gucs branch from 15613a7 to 5374de7 Compare December 23, 2024 16:39
@JelteF JelteF closed this Feb 14, 2025