Database commands coalescing #7019

Closed · sebastienros opened this issue Jan 10, 2022 · 75 comments

@sebastienros
Contributor

The Npgsql (PostgreSQL driver for .NET) contributors have been looking into other drivers' implementations to understand where improvements can be made.

Here is a current picture for the multi-queries scenario:

[chart: requests per second in the multi-queries scenario, by framework]

There are several frameworks clustered around 30-34K (not all displayed here) and then a jump to ~60K.

After analyzing the network traffic between the Drogon framework and the Postgres server (shown in the following screenshot and validated with other specific payloads), it appears that queries issued from multiple HTTP requests are coalesced into the same network packets and executed with a single "SYNC" command, which tells Postgres to execute them "as a batch".

[screenshot: network capture of the traffic between Drogon and the Postgres server]

Bundling together multiple commands in a single Sync batch has important consequences: if any command errors (e.g. unique constraint violation), all later commands are skipped until the Sync. Even more importantly, all commands are implicitly part of the same transaction.

Here we have not only implicit batching for the same request (20 queries), but more importantly batching across requests.
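
At the protocol level, the captured traffic looks roughly like the following (an illustrative reconstruction of the pattern described above, not the actual capture; real batch sizes and ordering vary):

BIND
EXEC // query 1 of HTTP request A
... 18 other queries of request A
BIND
EXEC // query 20 of HTTP request A
BIND
EXEC // query 1 of HTTP request B
...
SYNC // a single sync point closing the implicit transaction for all of the above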

Our interpretation of the TechEmpower rule below leads us to believe that this technique violates it:

Except where noted, all database queries should be delivered to the database servers as-is and not coalesced or deduplicated at the database driver. In all cases where a database query is required, it is expected that the query will reach and execute on the database server. However, it is permissible to pipeline traffic between the application and database as a network-level optimization. That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query.

We haven't checked whether the other frameworks showing the same high performance use the same technique.

/cc @an-tao @roji

@roji
Contributor

roji commented Jan 10, 2022

Note: most of the investigation work above was done by @NinoFloris and @vonzshik from the Npgsql project.

@an-tao
Contributor

an-tao commented Jan 11, 2022

https://2ndquadrant.github.io/postgres/libpq-batch-mode.html

Here I used the batch mode of the PostgreSQL driver; this work was done two years ago, and the coalescing happens only when:

  1. all SQL statements are read-only;
  2. all SQL statements in the pipeline are in the same call stack of the current event loop.

So if any SQL that writes is executed, the Sync command is sent immediately; you can check the network traffic in the update test to confirm it. The rules say it's not permitted to combine multiple SQL statements into a single complex query, and I don't think Drogon breaks that, because every read query is executed separately (they are in the same batch and sent together; it's a network-level optimization).
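
For reference, the pipelining API that was later upstreamed into libpq (PostgreSQL 14) exposes the same mechanism as the 2ndQuadrant batch-mode patch linked above. A minimal sketch of deferring the Sync until the end of a group of read-only queries could look roughly like this; it is illustrative only (not Drogon's code), and error handling and the non-blocking I/O that the libpq documentation recommends for pipeline mode are omitted:

#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=hello_world");
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    PQenterPipelineMode(conn);            // queue commands without an implicit Sync per command

    // Two independent SELECTs are queued; nothing is finalized until the Sync below.
    PQsendQueryParams(conn, "SELECT id, randomnumber FROM world WHERE id = 1",
                      0, NULL, NULL, NULL, NULL, 0);
    PQsendQueryParams(conn, "SELECT id, randomnumber FROM world WHERE id = 2",
                      0, NULL, NULL, NULL, NULL, 0);

    PQpipelineSync(conn);                 // single Sync: ends the implicit transaction
                                          // covering both queries above

    // Drain results: one result per query, a NULL separator after each,
    // then a PGRES_PIPELINE_SYNC result marking the sync point.
    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL)
    {
        ExecStatusType status = PQresultStatus(res);
        PQclear(res);
        if (status == PGRES_PIPELINE_SYNC)
            break;                        // sync point reached, batch fully processed
        (void) PQgetResult(conn);         // consume the NULL separating the queries
    }

    PQexitPipelineMode(conn);
    PQfinish(conn);
    return 0;
}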

@marty1885

marty1885 commented Jan 11, 2022

Hi, I'm another maintainer of Drogon.

I've chatted with @an-tao privately. He is not as fluent in English, so poke me if you need help explaining his reply. Please allow me to explain my view of and response to the rules in detail. My views may differ from his, so I don't represent his position. I think this is a reasonable framework-level optimization. Hopefully this helps both parties understand the differences.

all database queries should be delivered to the database servers as-is and not coalesced or deduplicated at the database driver.

In Drogon with PQ-batch support, we control when the Sync happens; it's a framework-level optimization. To be specific, a Sync happens when a SQL query that modifies data occurs, when enough commands have been batched, or when the web application isn't busy adding more queries to the batch (see the sketch at the end of this comment). Therefore a batch consists only of read queries or ends with a write query. A failed write (transaction conflict, uniqueness violation, etc.) will not cause reads to fail, and no subsequent commands exist to be skipped.

In all cases where a database query is required, it is expected that the query will reach and execute on the database server.

Besides the above, the SQL benchmark only reads from the DB, so it's impossible to have a transaction error causing queries to fail. The test itself guarantees that. I think we are safe in this regard too.

Furthermore, I also think this is a reasonable optimization in practice. Since a batch always ends with a write, we guarantee all reads are executed (note that SELECT .. FOR UPDATE is considered a write). The only edge case is when the web app runs an intentionally bad SELECT query, which could get placed in the middle of a batch. But I consider that an application bug and I won't blame the framework for it.
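
To make the policy above concrete, here is a rough, hypothetical sketch of the decision logic (made-up names, not Drogon's actual implementation): a query is flushed together with a Sync as soon as it looks like a write, the batch is full, or the application stops adding queries in the current event-loop iteration.

#include <functional>
#include <string>
#include <vector>

// Hypothetical illustration only; Drogon's real implementation differs.
struct SqlBatcher
{
    std::vector<std::string> pending;                    // read-only queries awaiting a Sync
    std::size_t maxBatch = 20;
    std::function<void(const std::vector<std::string> &)> sendWithSync;

    // Naive keyword check; as discussed later in this thread, it cannot detect
    // e.g. a SELECT that calls a function with side effects.
    static bool looksLikeWrite(const std::string &sql)
    {
        return sql.find("INSERT") != std::string::npos ||
               sql.find("UPDATE") != std::string::npos ||
               sql.find("DELETE") != std::string::npos ||
               sql.find("FOR UPDATE") != std::string::npos;
    }

    void enqueue(const std::string &sql)
    {
        pending.push_back(sql);
        if (looksLikeWrite(sql) || pending.size() >= maxBatch)
            flush();                                     // a write or a full batch ends the group
    }

    void onEventLoopIdle()                               // the app stopped adding queries
    {
        if (!pending.empty())
            flush();
    }

    void flush()
    {
        sendWithSync(pending);                           // send queued Bind/Execute pairs + one Sync
        pending.clear();
    }
};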

@Brar

Brar commented Jan 11, 2022

@marty1885
Just for clarification: by "intentionally bad SELECT query" you mean any SELECT query that fails and causes the whole transaction to roll back instead of only the failing query, no?

@marty1885

marty1885 commented Jan 11, 2022

No, I mean something like the following sequence of non-transaction queries:

co_await db->execSqlCoro("SELECT xxx FROM my_table WHERE ....");
co_await db->execSqlCoro("SELECT xxx FROM non_exist_table WHERE ....");;
co_await db->execSqlCoro("SELECT xxx FROM my_table WHERE ...."); // This gets dropped.

I think all queries inside a transaction are considered writes and are therefore never batched. Is that right? If not, I think we need to add a special case. In any case, it doesn't invalidate this optimization. @an-tao

Also, I think the performance benefit comes from sending all the queries at the same time. Only sending the Sync at the end is a quirk of pq_batch. We could spend some time looking into the updated libpq and resolve the entire issue altogether.

@vonzshik

We guarantee all reads are executed (Note that SELECT .. FOR UPDATE is considered writing).

@marty1885 I assume you're talking about query parsing, which doesn't check for function and procedure calls (which can do inserts/updates). There is also a problem of PG adding a new keyword (however improbable it is), which might have side effects.

But I consider it an application bug and I won't blame the framework for it.

Speaking from my experience, I'm working on an app which allows generating queries via SQL with a bit of a meta-language. As an example:

SELECT *
FROM "SomeTable"
WHERE 1 = 1
#param(ID, "ID")
#param(Value, "Value")

A user can specify via the UI whether they want to filter by ID or by Value (or both), and the corresponding SQL is generated with the required parameters. With some queries running to hundreds of lines, it's not that rare for some combinations of parameters to generate invalid SQL. While that's definitely not great, it requires no immediate fixing, since only the specific combination is broken. But if we were using Drogon, that would start breaking unrelated queries, making the issue critical and requiring an immediate fix.

Just wondering, is batching (and possible issues from it) mentioned in drogon's documentation?

@Brar

Brar commented Jan 11, 2022

Sorry, by transaction I meant everything that the driver puts before a sync packet.

The issue I see is something like the following:

co_await db->execSqlCoro("SET statement_timeout TO 10000");
co_await db->execSqlCoro("SELECT xxx FROM my_table WHERE ....");

// Simulates a query that might happen to take slightly longer than statement_timeout under some load conditions
co_await db->execSqlCoro("SELECT pg_sleep(15)");

co_await db->execSqlCoro("SELECT xxx FROM my_table WHERE ...."); // This gets dropped.

@roji
Contributor

roji commented Jan 11, 2022

We guarantee all reads are executed (Note that SELECT .. FOR UPDATE is considered writing).

@marty1885 I assume you're talking about query parsing, which doesn't check for function and procedure calls (which can do inserts/updates). There is also a problem of PG adding a new keyword (however improbable it is), which might have side effects.

I want to make sure this crucial point by @vonzshik is clear... As I understand it, if a command is executed which happens to SELECT from a function, and around the same time another unrelated command is executed which produces an error (e.g. unique constraint violation), then any side-effects from the function may get rolled back, without any indication to the program that this happened. That is, the user code executing the function will continue as if the function completed successfully and its side effects were committed, when in fact they were silently rolled back.

There may be other such scenarios where "implicit" side-effects aren't detected. So I think the idea that a database driver can reliably parse SQL to know whether it's read-only is quite problematic.

@marty1885, you wrote:

Also I think the performance benefits comes from sending all the queries at the same time.

In our experience, sending the Sync at the end has quite a significant impact on performance (FWIW Npgsql does send all commands at the same time like Drogon, but with a Sync after each command instead of at the end). This is likely related with the PostgreSQL MVCC overhead of creating and committing many small transactions (Npgsql) vs. a single transaction which encompasses all commands (Drogon).
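
To make the comparison concrete, the two wire layouts being discussed look roughly like this (illustrative; three queries shown instead of twenty):

// Npgsql: all commands sent together, a Sync after each command
// (each query commits in its own implicit transaction)
BIND
EXEC
SYNC
BIND
EXEC
SYNC
BIND
EXEC
SYNC

// Drogon: all commands sent together, a single Sync at the end
// (one implicit transaction covering all of the queries)
BIND
EXEC
BIND
EXEC
BIND
EXEC
SYNC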

Only sending the Sync at the end is a quirk of pq_batch. We could spend some time looking into the updated libpq and resolve the entire issue altogether.

Importantly, Sync at the end is indeed the desirable behavior when application code wants to send a set of related commands, as a batch: either all commands complete or none do (I believe this was the intent of the PG pipelining mode). The crucial difference is that the application explicitly opts into this behavior - via some API gesture - whereas Drogon is applying this implicitly on unrelated commands, under the hood.

@marty1885

marty1885 commented Jan 11, 2022

@vonzshik

I assume you're talking about query parsing, which doesn't check for function and procedure calls (which can do inserts/updates). There is also a problem of PG adding a new keyword (however improbable it is), which might have side effects.

Yeah. We noticed that earlier too. We will patch that.

Just wondering, is batching (and possible issues from it) mentioned in drogon's documentation?

Drogon's batching support requires a patched version of libpq (we haven't implemented support for the upstreamed version of the feature yet). So we expect users to know what they are getting themselves into at this point.

@Brar

Yeah.. that's an issue. But in a different way. Thanks for finding it.

co_await db->execSqlCoro("SELECT pg_sleep(15)"); suspends current coroutine from execution. No further code in the same coroutine gets executed. Since that statement always error out, it causes co_await <awaiter> to throw exception. And other SELECT statements in the same batch would also error out and throw. (My bad for the code I shared above. It's a quick illustration)

// coroutine 1
{
co_await db->execSqlCoro("SELECT pg_sleep(15)"); // This fails. Throw
co_await db->execSqlCoro("SELECT xxx FROM my_table WHERE ...."); // CPU never runs this because of throw
}

// coroutine 2
{
// Throws because the statement gets bundled and gets skipped.
co_await db->execSqlCoro("SELECT xxx FROM my_table WHERE ....");
}

@roji

In our experience, sending the Sync at the end has quite a significant impact on performance

Thanks for pointing it out. I'm discussing with @an-tao to determine the best course of action to fix this. Likely we'll add an option to opt in to batch mode if we detect batch support.


Back to the original issue. I still think batching should be allowed in the benchmark, as it produces known behavior and does not break program flow when an error happens. And it does not violate the benchmark's rules.

@roji
Contributor

roji commented Jan 11, 2022

I assume you're talking about query parsing, which doesn't check for function and procedure calls (which can do inserts/updates). There is also a problem of PG adding a new keyword (however improbable it is), which might have side effects.

Yeah. We noticed that earlier too. We will patch that.

How do you propose to patch this? If the idea is to disable batching, that would mean the driver parsing arbitrary user SQL to detect arbitrary function calls - which once again sounds quite problematic and brittle?

Back to the original issue. I still think batching should be allowed in the benchmark, as it produces known behavior and does not break program flow when an error happens. And it does not violate the benchmark's rules.

Stepping back here, all the various problems signaled above are a result of Drogon applying implicit batching under the hood; this technique changes actual program behavior in a significant way - and does break program flow in ways that are very hard to predict and diagnose (as everything depends on timing). It's hard to see how this could be treated as a transparent network optimization.

Compare this with sending all the commands together, but with Syncs between each command; this indeed guarantees at the protocol level that each command runs independently of each other, and errors are guaranteed to not affect unrelated commands; IMHO this is what is meant by a network-level optimization.

@marty1885

marty1885 commented Jan 11, 2022

How do you propose to patch this? If the idea is to disable batching, that would mean the driver parsing arbitrary user SQL to detect arbitrary function calls - which once again sounds quite problematic and brittle?

We'll add additional checks to prevent common cases. I think it's quite rare for this to happen, since we haven't had users complain about it yet. Then we'll figure out how to solve this properly.

Drogon applying implicit batching under the hood; this technique changes actual program behavior in a significant way - and does break program flow in ways that are very hard to predict and diagnose ... and errors are guaranteed to not affect unrelated commands; IMHO this is what is meant by a network-level optimization.

I think this is acceptable, at least for me as a C++ developer. We are used to dealing with data races, and it feels fine as long as timing doesn't crash the application or cause an unrecoverable state change. It's perfectly OK for lock-free C++ algorithms to return an error when an attempt doesn't succeed. This characteristic gives me the same vibe. The maintainers are also treating SQL this way.

So in conclusion, is the following allowed?

  • Drogon must send sync for every command
  • Add a batch flag to batch commands that we know do not write (and don't Sync every command)

@roji
Contributor

roji commented Jan 11, 2022

Drogon applying implicit batching under the hood; this technique changes actual program behavior in a significant way - and does break program flow in ways that are very hard to predict and diagnose ... and errors are guaranteed to not affect unrelated commands; IMHO this is what is meant by a network-level optimization.

I think this is acceptable, at least for me as a C++ developer. We are used to dealing with data races, and it feels fine as long as timing doesn't crash the application or cause an unrecoverable state change. It's perfectly OK for lock-free C++ algorithms to return an error when an attempt doesn't succeed. This characteristic gives me the same vibe. The maintainers are also treating SQL this way.

I agree that if an error were returned and the program could try again, this would maybe be acceptable. But here we're talking about database changes being silently rolled back - without an error - because some other unrelated command happened to error. This is quite different from attempting a lock-free operation and retrying it; it's more like your lock-free operation saying it succeeded, when in fact it didn't. I have no idea how I'd write reliable code in this situation, and so IMHO this does qualify as causing "unrecoverable state change". Regardless of how rare such a scenario is or what kind of SQL it requires from users, I don't believe a database driver should do something like this.

I do agree that it would be helpful for the benchmark rules to be more explicit on these nuances - IMHO it's important for the benchmarks not to encourage unsafe/dangerous driver behavior (at least according to my viewpoint).

@marty1885

marty1885 commented Jan 11, 2022

database changes being silently rolled back - without an error

I need to think about this. I think the current behavior is that the driver notices something failed, then the framework understands that some SQL commands got dropped, and then the entire list of yet-to-be-executed SQL calls throws an error, which is then recoverable.

Please help us determine what the acceptable behavior for the benchmark is. We will update our benchmark and framework.

@roji
Contributor

roji commented Jan 11, 2022

database changes being silently rolled back - without an error

I need to think about this. I think the current behavior is that the driver notices something failed, then the framework understands that some SQL commands got dropped, and then the entire list of yet-to-be-executed SQL calls throws an error, which is then recoverable.

@marty1885 I'm not referring to yet-to-be-executed SQL statements - I do understand these are skipped and the user gets an error. The problem is with SQL statements which were already previously executed, already completed, and are now rolled back because of the error.

@marty1885

Ahhh... I didn't realize that

@an-tao
Contributor

an-tao commented Jan 11, 2022

Thanks to all who pointed out the potential risks of this implementation; I now think this is not safe enough, as you said. I wanted to avoid these risks through some keyword matching, but it now seems that my original consideration was not comprehensive. Some read-only SQL can break the current transaction (as in the timeout example mentioned by @Brar), and some write operations can occur when selecting something from a function or procedure call (as @vonzshik said). So for a general application, if we don't send a Sync command after every query, it's hard to avoid unexpected failures.

But on the other hand, I don't want to deprive users of the opportunity to use this mode if they know their SQL is safe with it. So my fix will be an optional parameter (disabled by default) when creating database clients. Since Drogon supports creating multiple database clients (with their own connections) with different configurations, users can send 'safe' SQL with the delayed-Sync database client and send everything else with the normal one (one command, one Sync).

I think this mode can be used in many scenarios; for example, all the test cases of the TFB benchmark can be run in this mode without any problem. Since the C driver provides a separate Sync command interface, why shouldn't we use it? So I think after fixing this, it's not a problem but a feature for those who want extreme performance.

IMO, I still don't think this breaks the rule, which forbids merging requests into one complex SQL statement but doesn't forbid pipelining. The above is just my personal opinion; if the organizers of TFB think this is not allowed, I will change back to one-command-one-Sync mode (still batch mode) or disable the feature if I add the option mentioned above.

@Brar

Brar commented Jan 11, 2022

IMO, I still don't think this breaks the rule, which forbids merging requests into one complex SQL statement but doesn't forbid pipelining.

In my opinion the rules should not encourage dangerous/tricky driver implementations, so no matter how we interpret the rule right now, I'd argue that it should be clarified in a way that requires a Sync between what are meant to be unrelated commands in PostgreSQL.

If someone really thinks that a dangerous high-performance mode makes sense, this should be moved to a separate benchmark that isn't meant to reflect common general-purpose usage scenarios.

@roji
Contributor

roji commented Jan 11, 2022

@an-tao @marty1885 thanks for keeping an open mind in the discussion above!

I agree with @Brar about the undesirability of doing this in the TechEmpower benchmarks. While you can (obviously) decide to offer this unsafe mode in Drogon, I personally do not believe this capability makes sense in a database driver (even opt-in), and prefer to not implement it in Npgsql. However, that would mean that drivers which engage in inherently unsafe practices will always perform better, in effect pushing other drivers (like mine) to support them as well. IMHO this isn't the point of the TechEmpower benchmarks - I'd rather we compared the performance of safe, standard practices which real users would want to use in the real world.

In any case, I agree that at this point the TechEmpower organizers need to make a decision on what they want to allow, how exactly to interpret rule 7 of the general requirements, and whether the above constitutes an allowed "network-level optimization".

@NateBrady23
Member

Let me ask this, since everyone here has a much better understanding of what's actually happening here. If you were to rewrite rule 7 to explicitly allow or deny the optimization that's happening here, what would that look like? I'm hoping that extra sentence or two will be succinct enough to help me understand.

@roji
Contributor

roji commented Jan 11, 2022

I'd propose something along the lines of:

If any pipelining or network optimization is performed, then the execution behavior and semantics must match what they would be if the multiple SQL statements were executed without the optimization, i.e. in separate roundtrips. For example, it is forbidden to perform any optimization that could change the transactionality and/or error handling of the SQL statements. To avoid any doubt, such optimizations are forbidden even if they wouldn't produce the behavioral change with the specific SQL statements of the TechEmpower benchmarks, but would produce it with other statements a user might supply.

I don't want to introduce too much complexity to this discussion, but note that there's also the question of whether to allow explicit batching, which would be triggered via an API gesture in the benchmark code. This definitely wouldn't be a transparent network optimization (since it's explicit), but from a safety point of view it's entirely ok: the user code controls exactly which statements are included in the batch, and typically expects precisely the transactionality/error-handling behavior described above (skip all subsequent commands on any error in the batch and roll back). The crucial point with implicit batching is that it's done without the user specifying it, and on unrelated commands which can now affect each other.

I'm bringing up this point since rule 7 isn't clear on whether an explicit batching API is OK or not. In my personal opinion, it's fine to allow it since it's safe, and is how a developer would do a real-world implementation of e.g. the multiple query scenario. The sentences I wrote above seem like they disallow all forms of batching which would affect behavior, regardless of whether explicit or implicit; if we wanted to allow explicit batching, that would need a bit more tweaking.

@NateBrady23
Member

One thing we do say in the Database Updates Test Rules is:

ix. Using bulk updates—batches of update statements or individual update statements that affect multiple rows—is acceptable but not required. To be clear: bulk reads are not permissible for selecting/reading the rows, but bulk updates are acceptable for writing the updates.

I think the line about bulk reads is something that's accidentally omitted from other test rules.

@an-tao
Contributor

an-tao commented Jan 12, 2022

def verify_queries_count(self, tbl_name, url, concurrency=512, count=2, expected_queries=1024,
                         expected_rows=1024, check_updates=False):
    '''
    Checks that the number of executed queries, at the given concurrency level,
    corresponds to: the total number of http requests made * the number of queries per request.
    No margin is accepted on the number of queries, which seems reliable.
    On the number of rows read or updated, the margin related to the database applies (1% by default see cls.margin)
    On updates, if the use of bulk updates is detected (number of requests close to that expected), a margin
    (5% see bulk_margin) is allowed on the number of updated rows.
    '''

I think the verify_queries_count function checks whether there is a bulk read that breaks the current rules. If we decide to forbid the behavior discussed here, a new function verify_transactions_count should be added.
Maybe add the following sentence to clarify the rules:
Each individual SQL request must be automatically committed in a separate transaction; explicitly or implicitly merging multiple SQL statements into one transaction is prohibited, and disabling transactions is also not allowed.
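
One possible shape for such a verify_transactions_count check (a hypothetical sketch, not part of the current toolset) would be to sample PostgreSQL's own commit counter around the load window and compare the delta with the expected number of queries: with one Sync per command the delta is roughly the query count, while with coalesced Syncs it collapses to roughly the number of batches.

#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical sketch: read the server-side commit counter for the test database.
// pg_stat_database is updated by the stats collector with a small delay, so a real
// check would need a margin, like the existing row/query checks.
static long committed_transactions(PGconn *conn)
{
    PGresult *res = PQexec(conn,
        "SELECT xact_commit FROM pg_stat_database WHERE datname = current_database()");
    long n = (PQresultStatus(res) == PGRES_TUPLES_OK) ? atol(PQgetvalue(res, 0, 0)) : -1;
    PQclear(res);
    return n;
}

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=hello_world");
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    long before = committed_transactions(conn);
    /* ... drive the benchmark load here ... */
    long after = committed_transactions(conn);

    printf("transactions committed during the run: %ld\n", after - before);
    PQfinish(conn);
    return 0;
}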

@Brar

Brar commented Jan 12, 2022

explicitly or implicitly merging multiple SQL statements into one transaction is prohibited

I think that this touches the point that @roji was discussing above about explicit and implicit batching in the driver.

What we discussed above are the problems that arise when a driver implicitly merges SQL commands that the consumer of the API considers unrelated into one transaction so that they may be skipped or rolled back due to a failure of an unrelated command.

Allowing the consumer of a driver API to explicitly batch related SQL commands that they know are safe to run within a transaction is something different IMHO.

To illustrate with some pseudo API:

// Implicit batching
// These commands should not run in the same transaction because the API consumer considers them unrelated
db.Execute("SELECT * FROM test WHERE id = 1");
db.Execute("SELECT * FROM test WHERE id = 2");

// Or even worse implicit batching
// If these commands end up in the same transaction, how is the API consumer
// supposed to know that the failure of one command will affect the other one
Method_A()
{
    db.Execute("SELECT * FROM test WHERE id = 1");
}

Method_B()
{
    db.Execute("SELECT * FROM test WHERE id = 2");
}

// Explicit batching
// Here the API consumers have the option to explicitly group commands that they consider related
var batch = db.StartBatch()
batch.Add("SELECT * FROM test WHERE id = 1");
batch.Add("SELECT * FROM test WHERE id = 2");
batch.Execute();

That being said, I think the commands in the framework benchmark should be considered unrelated in order to better reflect a real-life scenario. For that reason I agree with @an-tao that both explicit and implicit batching should be forbidden, and the requirement that "Each individual SQL request must be automatically committed in a separate transaction" makes sense to me.

A benchmark allowing explicit batching would be a special case that comes pretty close to what I called a "dangerous high-performance mode" in my comment above - except that it isn't as dangerous because the API consumer controls the dangers.
In any case I still think that such a high-perf-mode doesn't reflect general usage scenarios and should live in a separate benchmark if it is considered worthwhile.

@roji
Contributor

roji commented Jan 12, 2022

Agree with @an-tao and @Brar that "Each individual SQL request must be automatically committed in a separate transaction" is a good formulation, and it pretty clearly disallows batching (explicit or implicit).

A benchmark allowing explicit batching would be a special case that comes pretty close to what I called a "dangerous high-performance mode" in my comment above - except that it isn't as dangerous because the API consumer controls the dangers. In any case I still think that such a high-perf-mode doesn't reflect general usage scenarios and should live in a separate benchmark if it is considered worthwhile.

Here I slightly disagree: I personally consider (explicit) batching to be a pretty basic optimization that anyone should be using where possible, rather than a high-perf mode carrying any dangers. The special transactionality and error handling semantics are usually a desirable feature being opted into - programs typically don't want later commands to continue if an earlier command failed. As I wrote above, I'd expect any developer implementing something like "multiple queries" to use it; for this reason, it may make sense to allow it in the benchmarks (or at least to be very clear on why we want to disallow it).

But whether explicit batching should be allowed is ultimately a question of what the benchmarks are actually meant to measure.

@Brar

Brar commented Jan 12, 2022

Here I slightly disagree: I personally consider (explicit) batching to be a pretty basic optimization that anyone should be using where possible

I completely agree that batching is a basic optimization that definitely has its place and is even really common, but let's talk about the Multiple Database Queries test, what kind of real-life scenario it is meant to stand for, and whether this is really "a network-level optimization".

You can probably batch queries in the Multiple Database Queries test, but how would that be compatible with requirement vi., which requires that "This test is designed to exercise multiple queries with each resulting row selected individually", and requirement v., which requires that "Each row must be selected randomly in the same fashion as the single database query test"?
"Same fashion" isn't a very precise term, but I understand it to mean that you should execute the queries pretty much the same way as in the Single Database Query test and not mess with transaction semantics just because - in the context of the benchmark - you happen to know that all queries are read-only and will never fail.
While this would indeed be a pretty basic optimization in the context of the benchmark, I still have a hard time calling it "a network-level optimization".
It is definitely a PostgreSQL transaction handling optimization (you can call it a wire protocol optimization, but that only distracts from the fact that the real optimization runs inside PostgreSQL), where the speed gain doesn't come from the saved Sync messages that don't get transmitted over the network but from the completely different code path this aggregated set of queries takes inside the PostgreSQL implementation.
It is definitely possible to find real-life scenarios where that PostgreSQL transaction handling optimization safely applies, and that is the exact reason why it is and should be implemented in drivers supporting PostgreSQL. But I stand by my opinion that this is not what this benchmark is meant to test and that this optimization should be tested in a different scope if it is considered relevant, since it basically tests the driver's capability to get the most out of the inner workings of PostgreSQL and not how well it manages to concatenate multiple queries into a network buffer.

@ajcvickers

@nbrady-techempower I think the fundamental point here is that any network-level optimization must not change the observed behavior of the application, and that includes when errors happen. The optimizations discussed here have the same behavior when everything on the network and in the database succeeds, but change the behavior in unexpected and damaging ways when something goes wrong. Like @roji said above, I don't believe a driver should behave in this way.

I don't think it is necessary in the rules to go into specifics about the kinds of behavior that are allowed/not allowed in terms of the batching and sync commands. (Although discussions like these will help clarify these things, of course.) Instead the rules simply need to state:

Network-level optimizations are permitted so long as the observable behavior of the driver does not change in any way, either under normal operating conditions, or when network or database errors occur.

Whether or not batching is allowed is really a separate discussion.

@sebastienros
Contributor Author

Adding @matt-42 (Lithium) @redkale (Redkale) @billywhizz (Just-JS)

Could you look at the recommended changes and confirm whether they would mean your implementations need to be adapted? I noticed that your results on multiple-queries are even faster than Drogon's as of the latest continuous run.

@msmith-techempower
Member

msmith-techempower commented Feb 7, 2022

@ajcvickers

Network-level optimizations are permitted so long as the observable behavior of the driver does not change in any way, either under normal operating conditions, or when network or database errors occur.

This sounds like something we could test for in the verification step. Would there be some reproducible error we could cause to ensure that [something - what?] does/doesn't occur?

@roji
Contributor

roji commented Feb 8, 2022

@msmith-techempower I'm not sure what kind of verifications you guys perform... In order to observe this, one would have to send one command with side effects (e.g. insert some row), batched before another (unrelated) command which fails. The expected behavior is for the side effects from the first command to persist, but with the implicit batching described above those side effects silently disappear. This sounds like a special scenario that would have to be coded against each driver, so I'm not sure it's relevant for verification...

Otherwise, it's probably possible to run PostgreSQL in detailed logging/trace mode, and verify protocol messages being received. Short of that, a packet capture and a verification at that level should always be possible, in case you do that kind of thing.

Maybe @sebastienros has other ideas...

@matt-42
Contributor

matt-42 commented Feb 8, 2022

Adding @matt-42 (Lithium) @redkale (Redkale) @billywhizz (Just-JS)

Could you look at the recommended changes and confirm whether they would mean your implementations need to be adapted? I noticed that your results on multiple-queries are even faster than Drogon's as of the latest continuous run.

Lithium is using the upstream libpq batch mode, yes. To me it is fine as long as frameworks using batching are tagged in TFB; that's why I added the -batch suffix to the lithium pgsql implementations in the benchmark. I could also add a lithium-postgres-nobatch if you want.
If the TFB team forbids batching, I'll remove it from the Lithium implementations, but I don't consider batching an unsafe optimization for read-only scenarios like the multi/single-request tests.

However, if the problem is side effects (one error in the batch canceling all the subsequent requests of the batch), you can ask drivers to implement a fallback mode where the remaining requests of an aborted batch are re-run automatically. This would remove the side effect while keeping the same performance for batches with no errors.

@roji
Contributor

roji commented Feb 14, 2022

To add to @vonzshik's answer, and to summarize my views (once again):

  • A general purpose driver cannot reliably know whether a user-provided SQL has side effects or not. Like @vonzshik, I don't believe TFB is about crafting drivers whose sole purpose is to achieve high scores in benchmarks without being usable in the real world. I don't see what the point of that would be, nor how it's helpful to compare such a driver with a real-world driver.
  • If you're assuming read-only queries, then I don't think it matters whether prepared queries or binary numeric data is used. Anything can be safely rolled back without consequences, since there are no side effects. The problem is that the read-only assumption can't be made in the real world.
    • As an aside, even in read-only scenarios, errors can occur. An unoptimized query or a missing index can cause a timeout, for example.
  • "little to no risk involved" is a problematic bar of quality for a database driver... Ask real-world users if they'd accept "little risk of data changes silently disappearing", in order to get slightly better perf, and I don't think you'll find many takers.

@billywhizz
Contributor

@vonzshik

...but saying that a driver isn't going to support calling functions with side effects to make queries run faster in extreme cases doesn't look that great.

i didn't say this. these are choices a developer can make and options the driver can offer. it's perfectly legitimate for the developer to decide to have a connection which has only queries with the properties above while having a different connection that has different properties (like sync on every statement).

And if we do allow this, where exactly we're going to draw the line between being acceptable or not? For example, having a select-only supporting driver might make sense for someone.

once again, i never proposed a "select only" driver. i said it is legitimate and safe to pipeline or batch statements under certain conditions and if the driver implements this pipelining correctly and handles errors appropriately. the only place in the techempower tests where a sync is absolutely required after an exec in order to avoid returning incorrect results is directly after the update statement(s) in the updates test. all other queries have no side effects and are legitimate candidates for this kind of pipelining.

That's not going to work if we want users to decide the platform/provider they're going to use based on the benchmark results since every single driver on the top is going to take shortcuts (which usually are not publicized and are going to blow in your face eventually).

again, you are making assertions that are not backed up by evidence about potential failure scenarios that nobody on this thread is able to demonstrate in the context of the techempower tests as they are currently configured. do you really believe the techempower results have that big an influence on people's choice of database driver? given the dominance of voices from the MSFT/npgsql community on here i think it's more about you being able to continue to promote your framework using the techempower scores - something you have done multiple times in the past and is in fact even mentioned on your github landing page.

Even not taking such an extreme example into account, I have no doubt in my mind that every single one of us here can make a 'raw' provider (I actually have one for testing potential optimizations and platform's overhead) which does satisfy every single benchmark requirement, but forgoes some of the things you expect from a driver (for example, the driver I linked doesn't check whether a database returned an error, this has to be handled by a user). Again, this doesn't make it useful either to us or to users (due to extreme requirements you have to have to use these drivers).

i think anyone making these kinds of decisions should be looking at the code and asking questions about implementations for less popular frameworks/libraries here. i don't buy the argument that the raw RPS numbers have a big influence on these choices. if they actually do, well then i think that says a lot more about the people making the decisions than about any of the frameworks on here.

To summarize, having a general-purpose driver optimize some use cases with explicit user's approval is great.

i agree the decision should be one made explicitly by the developer but i keep seeing reference to "general purpose" database drivers. i didn't see anything in the rules that excludes database drivers based on how general purpose or fully featured they are.

Whether we should have separate benchmarks (or mark these contenders with a tag) is debatable. Normalizing such optimization by saying that there is "little to no loss of security or robustness" will only encourage people to create drivers to abuse benchmarks in a race to the top. While I do work on PG driver in my spare time, I'm definitely not an expert and as such, I would rather be a bit more conservative while dealing with a potential data loss or even data corruption.

again, you are making assertions without any strong evidence that failures and data loss will happen if this pipelining technique is used. while that may be true if you used it without understanding the implications or without explicitly opting into it - that's fair - but as you seem to say above, this is a valid thing to do if the choice is explicit.

it is pretty obvious to me this is a legitimate thing to do and is fully supported by the postgres wire protocol. it's mentioned clearly here.
[screenshot: excerpt from the PostgreSQL protocol documentation on pipelining]

also, in version 14 of postgres, this pipelining is now integrated and fully supported in libpq.
[screenshot: PostgreSQL 14 libpq documentation on pipeline mode]

@billywhizz
Contributor

@roji

To add to @vonzshik answer, and to summarize my views (once again):

re. "once again". i understand your position. i just disagree with it.

  • A general purpose driver cannot reliably know whether a user-provided SQL has side effects or not. Like @vonzshik, I don't believe TFB is about crafting drivers whose sole purpose is to achieve high scores in benchmarks without being usable in the real world. I don't see what the point of that would be, nor how it's helpful to compare such a driver with a real-world driver.

this is your view of what techempower is and seems to be heavily informed by your specific area of interest which is postgres drivers. there are many different implementations here in various stages of development and community adoption (for example, literally nobody is using mine). what you seem to be calling for is much stricter rules on many different aspects of the implementations. maybe you should propose all these new rules in one place and see what the community and the maintainers think? personally, i like the fact that techempower is open to new and unproven frameworks and i've learned a lot from investigating the different implementations and the optimizations they have made. isn't that the whole point? it's an opportunity to learn? it seems to me you have a very different view that it should be purely a marketing tool for more mature and well-backed frameworks and libraries like the ones you work on.

  • If you're assuming read-only queries, then I don't think it matters whether prepared queries or binary numeric data is used. Anything can be safely rolled back without consequences, since there are no side effects. The problem is that the read-only assumption can't be made in the real world.

this read-only assumption absolutely can be made in the real world. if you know the requirements up front then you can implement for this, as long as your database driver allows it and handles it correctly. i don't get how you can make the assertion that this approach is not applicable in the real world. also see my point above about this technique being fully supported in the postgres wire protocol and the latest release of the client libraries. did they do all that work just so nobody could use this technique in the "real world"?

  • As an aside, even in read-only scenarios, errors can occur. An unoptimized query or a missing index can cause a timeout, for example.

a timeout in receiving results back on the database connection? sure, and that would be something that should be handled appropriately on the client side. if the previously returned results in the "batch" have been sent back to the client that is ok if the queries in the batch have no side effects. again, you are making claims that simply aren't true and have yet to provide a concrete example of how this approach will fail within the parameters of the techempower tests. if you can do that, then i'd be prepared to take your arguments more seriously.

  • "little to no risk involved" is a problematic bar of quality for a database driver... Ask real-world users if they'd accept "little risk of data changes silently disappearing", in order to get slightly better perf, and I don't think you'll find many takers.

"once again", this is just innuendo. you are making assertions about potential failure modes in a general technique being used here that you are unable to demonstrate. this is all on top of the opening post claiming the implementations cited violated the current rules when that is patently untrue.

@Brar

Brar commented Feb 15, 2022

@billywhizz I have the impression that we have reached a point where further discussion is pretty much pointless. You have stated your point of view and others have too and there is some fundamental disagreement and little learning opportunity left for all of us.

this is your view of what techempower is

It is also my view of what TechEmpower is, but that's not important. What is important is TechEmpower's view.
They've stated some of it at https://www.techempower.com/benchmarks/#section=intro and at https://www.techempower.com/benchmarks/#section=motivation and I read things like "Each framework is operating in a realistic production configuration." and "Choosing a web application framework involves evaluation of many factors." But these are random picks that I chose to support my point of view that some realistic usage is part of the goals, and you might as well find parts that support your point of view.

The longer this thread becomes and the more we're discussing opinions instead of technology, the harder it will become for people who care about technology to find their way through it.

I'm out now, waiting for TechEmpower's decision on this.

@NateBrady23
Member

NateBrady23 commented Feb 15, 2022

Sorry, everyone, we've been swamped with client work. There's a lot to catch up on here. Something I will point out after reading the last few posts: we do expect that a framework permutation will eventually implement all of the tests. That's to say, if you're relying on a driver optimization that only reliably works for the multiple queries test and shouldn't be used for the updates test, that wouldn't belong here.

Though, I do agree with @billywhizz that we don't want to stifle innovation by making the test rules too strict. However, it's important to note that the whole idea here is seeing how certain configuration changes affect all aspects of the framework. i.e. if you tweak this configuration it helps multiple query reads but slows down updates, etc.

@roji
Contributor

roji commented Feb 15, 2022

Here are some answers to the stuff I find most important...

First, just to be clear, I agree that for read-only workloads (no side-effects), this technique isn't risky per se. Nobody is/was claiming otherwise.

i didn't say this. these are choices a developer can make and options the driver can offer. it's perfectly legitimate for the developer to decide to have a connection which has only queries with the properties above while having a different connection that has different properties (like sync on every statement).

The problem with this - looking at it from a production/safety point of view - is that if someone gets this wrong and does accidentally do something with a side-effect, then you're in a very dangerous situation where the potentially massive error (again, lost data) will go unnoticed. I'm totally for low-level perf opt-ins, but as responsible library developers, we have to also take into account what happens when someone accidentally gets it wrong.

In other words, the question here is whether any behavior can be considered OK/safe as long as it's hidden behind an opt-in - no matter how dangerous. I don't believe so.

do you really believe the techempower results have that big an influence on people's choice of database driver?

Yeah, I do (or on programming language/whatever; see @Brar's quotes from above which seem to support this as a goal of TE).

But even if we ignore that for a moment, at the end of the day, TE shows different drivers/frameworks, sorted by RPS; what exactly is a user supposed to understand by looking at this list? There's obviously no way to know that driver X is significantly faster because they're using a special opt-in technique, that both has dangerous consequences when used wrong and only works for read-only queries. So the results become very hard to actually understand in a useful way.

So yes, I do agree with you that at the end of the day it's a question of what TE is about. If it's a playground for unsafe experimentation that isn't supposed to be production-ready, that's one thing; but as @Brar showed above, it seems that the idea is to show "realistic production configurations", which I don't believe the above is compatible with.

given the dominance of voices from the MSFT/npgsql community on here i think it's more about you being able to continue to promote your framework using the techempower scores - something you have done multiple times in the past and is in fact even mentioned on your github landing page.

To be clear, we (the Npgsql team) could save ourselves the trouble of this discussion and just implement this feature, just like you have, and make a huge jump in the scores. It wouldn't even be particularly difficult for us to do so. It's just that we've discussed this quite a lot and believe this is a dangerous feature to have in a real-world database driver.

also see my point above about this technique being fully supported in postgres wire protocol and latest release of client libraries. did they do all that work just so nobody could use this technique in the "real world"?

The reason Sync-batching is supported at the wire level is IMHO so that explicit batching/pipelining can be implemented, which is perfectly fine and has nothing to do with this. The wire protocol doesn't tell you whether you can/should use this capability for mixing unrelated commands together in a single batch.

[...] this is all on top of the opening post claiming the implementations cited violated the current rules when that is patently untrue.

Regardless of all of the above, the 7th general requirement rule states: "However, it is permissible to pipeline traffic between the application and database as a network-level optimization". I agree that there's some vagueness there which could be interpreted (that's why we're asking for clarification), but IMHO Sync-batching is definitely not a network-level optimization, since it affects protocol and transactionality/error semantics.

@billywhizz
Contributor

@roji

In other words, the question here is whether any behavior can be considered OK/safe as long as it's hidden behind an opt-in - no matter how dangerous. I don't believe so.

is this the question? this is why i am a little exasperated. you keep changing the question you are asking and moving the goalposts. i thought we were talking about something very specific - namely whether batching/pipelining of statements on the wire for postgres database requests is ok or not. then it became whether cross-request batching of the sync messages is ok. it would be useful for everyone if someone from your group stated again clearly what change they are proposing and provided some clear evidence (note: not opinion) as to why this change would benefit the community and the project. i've already asserted that the behaviour here is safe for the techempower tests and i don't see any clear refutation of that.

you are the ones proposing changes to the rules after years of them existing as they are today, so i think the bar should be pretty high for making that case. maybe it would be better if some resources were dedicated to adding better/more comprehensive/less "gameable" tests to the existing suite? i'm all for that as i think the existing tests are not very realistic and tend to stress RPS over all else. not meaning to be critical of the TE team here at all - it's great to have this service provided for free and i wish it were better funded and could grow some more.

But even if we ignore that for a moment, at the end of the day, TE shows different drivers/frameworks, sorted by RPS; what exactly is a user supposed to understand by looking at this list? There's obviously no way to know that driver X is significantly faster because they're using a special opt-in technique, that both has dangerous consequences when used wrong and only works for read-only queries. So the results become very hard to actually understand in a useful way.

but, surely it is up to library/framework developers and team members to do this analysis and discuss/explain the results and the benefits/tradeoffs of the different approaches? again, i think it comes back to very different views of what TE is for. i see it mostly as a community learning tool and think any rankings should be taken with a large pinch of salt, like any microbenchmarks.

If it's a playground for unsafe experimentation that isn't supposed to be production-ready, that's one thing;

have you looked at many of the frameworks here? it's pretty clear many of them are new and not production ready and many are experimental or are using various "unusual" techniques in order to improve perf. again, what specific proposal do you have here and what kind of effect do you think that would have on the submissions if it were implemented?

maybe a good proposal you could make here would be a new flag for "production ready" or "experimental". i'd be fine with that and i'd certainly classify my own as not production ready and experimental at this stage. but it is "realistic" given all the libraries it depends on are having to be built from the ground up and it's not even near a 1.0 release yet. if it was easy to filter on this flag then folks could easily narrow their choices down to "production ready" frameworks and TE folks could decide if they want to only include those ones in the regularly (or not so =)) published roundups?

given the dominance of voices from the MSFT/npgsql community on here i think it's more about you being able to continue to promote your framework using the techempower scores - something you have done multiple times in the past and is in fact even mentioned on your github landing page.

To be clear, we (the Npgsql team) could save ourselves the trouble of this discussion and just implement this feature, just like you have, and make a huge jump in the scores. It wouldn't even be particularly difficult for us to do so. It's just that we've discussed this quite a lot and believe this is a dangerous feature to have in a real-world database driver.

ok, so basically you have decided you don't want to expose postgres pipelining to the developer so everyone has to abide by your decision?

also see my point above about this technique being fully supported in postgres wire protocol and latest release of client libraries. did they do all that work just so nobody could use this technique in the "real world"?

The reason Sync-batching is supported at the wire level is IMHO so that explicit batching/pipelining can be implemented, which is perfectly fine and has nothing to do with this. The wire protocol doesn't tell you whether you can/should use this capability for mixing unrelated commands together in a single batch.

no, it doesn't, so we're just going on your opinion/preferences then? as i said above i agree with you that this isn't a reasonable approach for all scenarios but as i have also said before my opinion is that it can be done safely for all of the techempower tests as they currently stand, even the update test as long as an explicit sync is allowed on the actual update statement.

[...] this is all on top of the opening post claiming the implementations cited violated the current rules when that is patently untrue.

Regardless of all of the above, the 7th general requirement rule states: "However, it is permissible to pipeline traffic between the application and database as a network-level optimization". I agree that there's some vagueness there which could be interpreted (that's why we're asking for clarification), but IMHO Sync-batching is definitely not a network-level optimization, since it affects protocol and transactionality/error semantics.

the problem with your argument here is the tests as they currently stand allow this kind of sync/cross-request batching to be done and for the solution still to be safe. imho, this is perfectly safe to do for all the tests where the queries are readonly given the preconditions discussed in various places above. so, the only question is how do we also handle the update test. here is some pseudocode/work in progress that might make that clearer:

const pg = await postgres.createSocket(config)
// enable pipeline mode for this connection, meaning syncs will automatically be 
// sent for each batch of commands put on the wire for postgres
pg.pipeline = true 

// this command will not have a sync automatically appended by default as the connection 
// has pipeline mode enabled. if pipeline mode was disabled, a sync would be appended
const getWorldById = await pg.compile(getWorldByIdSql) // getWorldByIdSql = the single-row SELECT text

const updateWorlds = await pg.compile(updateWorldsSql) // updateWorldsSql = the UPDATE statement text
// because we set sync = true on this command, it will always have a sync appended 
// to the exec when it is called
updateWorlds.sync = true

// this will put 20 Bind/Exec pairs in the buffer which is flushed onto the wire for postgres 
// if there are no pending responses or buffer is full
const worlds = await Promise.all(spray(20, getRandomWorld)) // getRandomWorld presumably wraps getWorldById with a random id
worlds.forEach(world => world.randomnumber = getRandom())

// this will put a Bind/Exec/Sync in the buffer so it and all preceding queries since the last sync 
// will be part of a transaction/syncpoint
await updateWorlds(worlds)

again. this is perfectly safe. each batch is explicitly tied to a http request and if any query in a batch fails, then the driver can hand an error back which can be handled safely?

we will end up with this on the wire

BIND
EXEC
... 19 other queries
BIND
EXEC // update
SYNC
BIND
EXEC
... 19 other queries
BIND
EXEC // update
SYNC
...

is that wrong/broken/unsafe?

for multi-query test, we could decide whether to do a sync on every http request or not. for the individual query tests (db and fortunes), if we did pipelining, then it would look like this
[screenshot: wire capture of pipelined queries for the single-query tests]
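for illustration, a rough sketch (consistent with the sequence diagram further down) of what that capture shows - the single queries from two /db requests coalesced into one packet with a single sync:

BIND
EXEC // query for request 1
BIND
EXEC // query for request 2
SYNC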

this is, in effect, using the sync mechanism, where it is known the queries have no side effects, to optimise the network interaction with the postgres wire protocol. are you saying, at scale, it would be acceptable to justify paying for 30-50% more resources to service the load just because we don't want to implement some logic client side to retry a failure when the client receives a transaction rollback error, which is highly unlikely to ever happen anyway?

not sure if i can contribute much more to the discussion. i think it is really your move to either provide a more compelling argument for why this kind of optimisation should be disallowed or to modify your proposal to be less disruptive and overbearing for the existing community of implementers and maintainers.

@ajcvickers
Copy link

What about if a network error occurs? I would expect a driver to potentially:

  • Not commit something to the database, and hence not ack to the client that anything has been committed.
  • Or commit something to the database, but not get a chance to ack this to the client.

However, unless I've opted into a fire-and-forget mode, which isn't the case here, I would consider it a driver bug if it:

  • Acked to the client that something has been committed to the database, when it actually has not been committed.

This last case is what we are talking about. Regardless of the chance/conditions required to generate the error, I don't believe a driver should provide acks when data has not been finally committed to the database.

@roji
Copy link
Contributor

roji commented Feb 16, 2022

is this the question? this is why i am a little exasperated. you keep changing the question you are asking and moving the goalposts. i thought we were talking about something very specific - namely whether batching/pipelining of statements on the wire for postgres database requests is ok or not.

That seems... disingenuous... I've repeatedly written above that IMO batching in itself isn't unsafe as long as it's explicit, but that implicit, cross-request batching is unsafe. In your response, you wrote:

it's perfectly legitimate for the developer to decide to have a connection which has only queries with the properties above while having a different connection that has different properties (like sync on every statement).

... which I understood to mean that you consider this behavior safe, as long as the developer opts into implicit, cross-request batching. That is what raises the question of whether a dangerous feature is OK as long as it's behind an opt-in.

it would be useful for everyone if someone from your group stated again clearly what change they are proposing and provided some clear evidence (note: not opinion) as to why this change would benefit the community and the project.

I think this has been done (numerous times) above. You may not be convinced - which is OK - but that isn't to say it hasn't been clearly stated. To attempt yet again in tight form, implicit cross-command batching is unsafe because it can cause effects of earlier commands to disappear silently on failure. This is especially bad without clear user opt-in, and from a cursory check, some/most of the top frameworks don't have this - drivers implicitly batch by default! But even with a clear opt-in I view this as quite dangerous, because of what happens when users get it wrong. I don't know how I can make the above clearer. In addition, beyond safety I think that such a fundamental optimization that's only reliable for a specific workload (i.e. read-only) is problematic in these benchmarks (see @nbrady-techempower comment above).

To drive the point home, here's a thought experiment from @vonzshik: let's consider a feature where the driver recognizes the same query SQL, and caches query results - a form of memoization. Now, such an optimization would obviously make things extremely fast, and a driver implementing it would shoot to the very top in the Fortunes benchmark. This is specifically prohibited in rule 7 ("In all cases where a database query is required, it is expected that the query will reach and execute on the database server."). Now, someone could claim this is a legitimate driver optimization which could help in the real world, but is this something TE should allow? I'd argue against it, since the point of the scenario isn't to show the best perf for some very narrow scenario where some specific optimization is possible (same query over and over), but rather to give an idea of general driver perf, where interacting with the database is unavoidable.
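To make the thought experiment concrete, here is a minimal, purely hypothetical sketch of such a memoizing wrapper (no driver discussed here actually does this):

# purely hypothetical: a wrapper that caches result sets keyed by SQL text and parameters,
# so repeated queries never reach the database server - exactly what rule 7 forbids
class MemoizingConnection:
    def __init__(self, inner):
        self.inner = inner   # the real driver connection
        self.cache = {}      # (sql, params) -> previously fetched rows

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:
            self.cache[key] = self.inner.query(sql, params)
        return self.cache[key]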

i think it comes back to very different views of what TE is for. i see it mostly as a community learning tool and think any rankings should be taken with a large pinch of salt, like any microbenchmarks.

That's not how I read the Motivation and questions page - note especially the "realistic" vs. "stripped" category. There's definitely a place for "non-realistic/potentially unsafe" experiments, and I have nothing against them, as long as they're clearly marked as such. But there's a problem when we start to mix experiments and learning tools with drivers that hold themselves to production-grade reliability/safety standards - the comparison becomes meaningless, and, IMHO, indeed potentially dangerous.

maybe a good proposal you could make here would be a new flag for "production ready" or "experimental".

I agree this is a good idea - and does exist to a certain extent via the realistic/stripped distinction. Though there still needs to be a common set of rules which specify what's allowed and what isn't, and those same rules would need to apply to both production-ready and experimental, right? Otherwise, experimental just becomes a place where you don't have to comply with any rules. In other words, I'm not sure it would make much sense to allow something in experimental that wouldn't ever be allowed in production-ready.

ok, so basically you have decided you don't want to expose postgres pipelining to the developer so everyone has to abide by your decision?

You seem to be consistently misrepresenting my position here (or misunderstanding it). I've already written several times that IMHO PG pipelining is totally fine as long as it's explicit for a given batch, i.e. I as the developer hand the driver a set of SQL commands and tell it to batch them (see this comment). That's very different from the driver batching unrelated SQL commands from different origins/HTTP requests - even if the user opts into that.

Regardless of all of the above, the 7th general requirement rule states: "However, it is permissible to pipeline traffic between the application and database as a network-level optimization". I agree that there's some vagueness there which could be interpreted (that's why we're asking for clarification), but IMHO Sync-batching is definitely not a network-level optimization, since it affects protocol and transactionality/error semantics.

the problem with your argument here is the tests as they currently stand allow this kind of sync/cross-request batching to be done and for the solution still to be safe. imho, this is perfectly safe to do for all the tests where the queries are readonly given the preconditions discussed in various places above.

You misunderstood my point. Rule 7 (and other rules) prohibit certain things regardless of whether they're safe or not - that's another discussion. TE is setting some rules on what it is they'd like to see benchmarked, to make sure things stay apples-to-apples, and so that we benchmark what we actually want to benchmark (i.e. actual database access). If drivers start to do memoization (as above), or to coalesce multiple SQL commands into one, we're leaving that common territory and comparing apples and oranges.

imho, this is perfectly safe to do for all the tests where the queries are readonly given the preconditions discussed in various places above. so, the only question is how do we also handle the update test.

Here again, I don't think that's the discussion - I know (and have known from the beginning...) that specifically for these benchmarks (including the updates one), this technique won't cause issues. The problem, as I've already written, is that these techniques are tailored specifically for these TE scenarios, and would be dangerous in the general case in the real world. And as such, the question is whether it makes sense for TE to allow it.

@billywhizz
Copy link
Contributor

@ajcvickers

This last case is what we are talking about. Regardless of the chance/conditions required to generate the error, I don't believe a driver should provide acks when data has not been finally committed to the database.

nobody here is suggesting acknowledging uncommitted data to the web client is a good idea or should be allowed so i don't really understand what this adds to the conversation.

@billywhizz
Copy link
Contributor

@roji

... which I understood to mean the you consider this behavior safe, as long as the developer opts into implicit, cross-request batching. That is what raises the question of whether an dangerous feature is OK as long as it's behind an opt-in.

yes. after investigating, i consider it safe to send multiple execs from different http requests to the database as a pipeline with a single sync as long as developer has opted into this behaviour and is aware of the behaviour, namely:

  • any commands that are not explicitly synced should have no side effects and will be implicitly pipelined by the database driver
  • any commands with side effects should be explicitly synced (see pseudocode i posted above)

i.e. this is safe and perfectly compliant with postgres wire protocol and a valid optimization

sequenceDiagram
    autonumber
    client->>web: GET /db
    client->>web: GET /db
    web->>db: BIND<br>EXEC (1)<br>BIND<br>EXEC (2)<br>SYNC
    db->>db: process
    db->>web: ROW (1)<br>COMPLETE<br>ROW (2)<br>COMPLETE<br>READY
    web->>client: ROW (1)
    web->>client: ROW (2)

i can accept you could argue this is a specific optimization for the TE tests but the reality is this user-definable (as discussed/agreed above) behaviour is safe across all the TE tests as they are today and have been for a long time. so it's general enough to cover all the scenarios in the TE tests as well as being applicable more widely in the field. it doesn't require any trickery to work across all the tests and it can be safely multiplexed onto one postgres connection per web server thread/process. i.e. @matt-42's implementation is safe at a wire protocol level, although he has accepted, as we all seem to, the need for explicit opt-in from the developer.

i think it's important to stress that, given postgres's process-per-connection model, it's only possible to drive the postgres process on the other end of the connection to capacity by doing sync batching in this way for small/fast queries. if you force a sync/flush on every sql command then you effectively throttle the postgres process. in this scenario, your only option for scaling is by using multiple connections per web server process, which causes many more postgres processes to be created (no. of web servers times no. of connections per web server) and results in much higher contention for all resources on the database server. surely uncovering these kinds of safe, compliant optimizations that allow better utilization of resources is a large part of why TE benchmarks exist?

But even with a clear opt-in I view this as quite dangerous, because of what happens when users get it wrong.

i'm not sure how this is relevant. there are different frameworks with different levels of abstraction and different levels of hand-holding and protection for the developer. you seem to be saying we should only allow one kind of framework and exclude any frameworks that allow lower level access or more control/options for the developer.

I think that such a fundamental optimization that's only reliable for a specific workload (i.e. read-only) is problematic in these benchmarks

it seems you haven't actually read what i have written (the pseudocode above is i think pretty clear?). you can still incorporate commands with side effects (e.g. in the /update test) if they are followed by a sync and this is something that can be controlled by the developer. you even admit later on in this reply:

I know (and have known from the beginning...) that specifically for these benchmarks (including the updates one), this technique won't cause issues

To drive the point home, here's a thought experiment from @vonzshik:

this is just a completely random scenario you have interjected into the conversation here. this kind of caching has nothing to do with what we are discussing, which is using the postgres wire protocol as designed.

That's not how I read the Motivation and questions page - not especially the "realistic" vs. "stripped" category.

realistic v stripped is, in my understanding, more about an idiomatic/common implementation of the web framework code versus a hand-crafted/low-level one that would not be a common pattern for developers using that framework or language. to me, this is not the same as saying in general this framework is experimental or production ready. the reality here is that anybody deciding to adopt any particular framework is going to do that research themselves before coming to a decision. at least i hope they are.

the comparison becomes meaningless, and IMHO indeed, potentially dangerous.

there it is again. "dangerous". more innuendo without evidence. i don't think the comparison is meaningless. i think it's all really interesting and educational.

I'm not sure it would make much sense to allow something in experimental that wouldn't ever be allowed in production-ready.

well yes. this is why i believe we should be flexible/lenient in what we allow. the goal of the exercise imo is to learn about performance and its change over time and the various tradeoffs between different platforms, frameworks and approaches. for something like this, openness is important and is i think a large part of the continued interest in these benchmarks. if we introduce too many rules and make it too strict, then where are the boundaries being pushed and what do we learn over time? do we have to wait for MSFT or someone else to tell us what's allowed and what isn't? doesn't this just kill the innovation and experimentation happening here?

TE is setting some rules on what it is they'd like to see benchmarked, to make sure things stay apples-to-apples, and so that we benchmark what we actually want to benchmark (i.e. actual database access).

but, we are benchmarking actual database access. you seem to be implying that the technique we are discussing here is somehow skipping sql commands when it's patently not. this is explicitly tested for by TE toolset.

If drivers start to do memoization (as above)

again, you introduce memoization when literally nobody is talking about doing this. this seems disingenuous to me.

or to coalesce multiple SQL commands into one

there is no coalescing of multiple SQL commands into one happening here. all the commands are individual exec messages and are run individually in the server process and results are built/buffered as they are run and flushed to the wire when the sync is encountered or the buffers are full. this is all perfectly acceptable usage of the postgres wire protocol as far as i can see.

comparing apples and oranges.

you've said yourself that you could submit an entry with these optimizations. why not do so and give it a different name as @matt-42 has done? that seems fair to me and tbh i would like to do the same because i would like to see the perf difference in TE environment given my framework does not currently do this kind of cross request batching. then you can explain to your users the benefits, tradeoffs and dangers of the different approaches and you can see where there may be room for further optimizations.

Here again, I don't think that's the discussion

no, because you keep changing the discussion.

I know (and have known from the beginning...) that specifically for these benchmarks (including the updates one), this technique won't cause issues

you've known from the beginning but you continue to assert this technique is "not safe".

these techniques are tailored specifically for these TE scenarios

i think this whole thread really comes down to this argument alone. you favour a much stricter interpretation of this rule by TE which has clearly not been in effect to date. this is also a completely different argument to the initial, specific, objection raised by @sebastienros which asserts the cross request batching is a violation of the rule that all sql commands must be executed individually on the server and "not combined into a single complex query".

so, it's up to folks at TE if they want to tighten the rules and explicitly disallow this kind of optimization. imo that would be restrictive and we would be better if we encouraged the community to adopt a convention as @matt-42 has done of marking the solutions that use this technique with a "-batch" suffix or something similar.

the key, hopefully final, point i want to make here is that in all the scenarios in TE using this technique, you are never acknowledging data back to the web client that has not been committed to the database. i'm happy to be proved wrong on this but i'm pretty sure i'm not.

@roji
Copy link
Contributor

roji commented Feb 18, 2022

Stepping back and trying to be crystal clear: there's the question of whether implicit batching of unrelated commands is safe, and there's the question of whether it should be allowed in the benchmarks, according to the rules. The two questions are related, though they are separate.

Yes, I agree (and always have) that there are specific scenarios where implicit batching is safe - read-only SQL chains, and also read-only chains followed by a single update, after which comes the Sync (the latter scenario isn't something we've been discussing recently, so I haven't been mentioning it). You are arguing that users should be allowed to explicitly opt into this when they know their workload happens to match the above; I'd avoid this in a production, real-world driver since the consequences of getting it wrong are disastrous (silent data loss). Even if this discussion only results in some drivers putting this behind an opt-in (which seems to be the case), I feel something good has already come out of this.

But in any case, every driver writer does and will do what they think is right, which is where we come to the 2nd question - should TE allow implicit batching.

My view of the TE benchmarks is that they're supposed to show you typical or standard performance for a given scenario; Fortunes is supposed to tell you e.g. how many RPS a framework does when executing a single database query. Allowing the above optimization would instead show RPS numbers for queries which are known to be side-effect free; that's quite a restriction: we're no longer discussing the perf of any single query, but of a very specific subset of queries. Now, it's true that Fortunes happens to be compatible with this optimization from a safety point of view, but IMHO the idea is to use Fortunes as a proxy for general query performance; otherwise the result becomes rather meaningless.

This is what I meant by my "ad absurdum" memoization example - it's just another example of a possible driver optimization. Although it could be a valid thing to do in the real world (no potential safety issues in any case), this isn't something that we want used in the benchmarks, since it doesn't measure what we want (general database query performance): it would provide us with results that are only relevant when the same queries are repeated over and over again (and lagging results are OK). In the same way, implicit batching provides us with results that are only relevant when queries are known to be 100% side-effect-free.

To summarize, not everything that can speed up perf is something we necessarily want allowed in these benchmarks, and just like TE already disallows memoization and coalescing of multiple SQL commands into one, I believe it should disallow this optimization, which is only safe for use in very specific scenarios as we've already agreed. I don't think this should necessarily limit innovation or experimentation - but it would make sure that we're showing apples-to-apples numbers for the same generalized scenarios.

I hope the above accurately (and respectfully!) represents the state of the discussion and our respective positions.

@michaelhixson
Copy link
Contributor

I'd propose something along the lines of:

If any pipelining or network optimization is performed, then the execution behavior and semantics must match what they would be if the multiple SQL statements were executed without the optimization, i.e. in separate roundtrips. For example, it is forbidden to perform any optimization that could change the transactionality and/or error handling of the SQL statements. To avoid any doubt, such optimizations are forbidden even if they wouldn't produce the behavioral change with the specific SQL statements of the TechEmpower benchmarks, but would produce it with other statements a user might supply.

I agree with this proposed requirement.

I tried to articulate this position in a Google Group thread from 2018 -- read from "I bet a compliant pipelining implementation would" onwards. Those clarifications never made it into the official requirements, and I think the proposed addition does a good job of capturing them.

It might be worth clarifying with an example for the Postgres protocol specifically, as it sounds like the only way to implement this is to put a sync message between each query.

@NateBrady23
Copy link
Member

We're going to update the requirements with the proposed text above. Thank you everybody for your contributions to this discussion. We won't have any automated way to test for violating this requirement right away, so we'll continue to rely on the community to politely point out the frameworks that do. Thanks again!

@roji
Copy link
Contributor

roji commented Mar 8, 2022

@nbrady-techempower thanks for your patience with this long thread!

If you do decide to implement verification for this, I think @billywhizz's suggestion above is a good way to do that:

it should be possible to tighten the current checks by verifying the number of txns for each run is same as number of http responses received - so, at least one txn per http response received, no? this would disallow cross request batching for single and fortunes and would allow per request batching for the multi and update queries and make cross-request batching pretty much impossible on those too.

@billywhizz
Copy link
Contributor

sorry to continue this.... =)

this new requirement still seems a bit unclear to me, especially with @roji's response re. my proposed verification which would allow pipelining, but only within each http request. just to try to clarify, is this compliant with the new rule? i think it should be.

for each http request to
/db
we put at least

BIND
EXEC
SYNC

on the wire

for each http request to
/fortune
we put at least

BIND
EXEC
SYNC

on the wire

for each http request to
/query?q=N
we put at least

BIND
EXEC
... repeat N - 1 times
SYNC

on the wire

for each request
/update?q=N
we put at least

BIND
EXEC
... repeat N - 1 times
BIND
EXEC
SYNC

on the wire

so, essentially, each http request must have an enclosing transaction and transactions/syncpoints cannot be shared across http requests.

OR

are we saying every single sql command must be wrapped in a transaction, so the /query and /update http requests where N = 20 will result in 20 and 21 syncpoints instead of 1?

@michaelhixson
Copy link
Contributor

OR

are we saying every single sql command must be wrapped in a transaction, so the /query and /update http requests where N = 20 will result in 20 and 21 syncpoints instead of 1?

This one.
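
So, to illustrate, a /query?q=3 request would need to put at least this on the wire:

BIND
EXEC
SYNC
BIND
EXEC
SYNC
BIND
EXEC
SYNC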

I'm wondering if there is a way we can make the requirement clearer.

From item 7 under General Test Requirements:

That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query.

Would you have asked the same question if we'd added "or transaction" to the end of that?

... as long as the queries themselves are executed separately on the database server and not combined into a single complex query or transaction.

We could also get really specific about Postgres in there. We could add a sentence like this:

If using PostgreSQL's extended query protocol, each query must be separated by a Sync message.

@Brar
Copy link

Brar commented Mar 11, 2022

I'm wondering if there is a way we can make the requirement clearer.

After seeing the different opinions around this topic I'd say that it cannot be too clear.

Would you have asked the same question if we'd added "or transaction" to the end of that?

I think that would help to specify the intent you stated above in a rather database-agnostic way, even if it may not completely resolve misunderstandings regarding the behavior around PostgreSQL's Sync message, since its effect on transactions is not documented very clearly and is not totally obvious.

We could also get really specific about Postgres in there. We could add a sentence like this:

If using PostgreSQL's extended query protocol, each query must be separated by a Sync message.

In my opinion that would make it crystal clear, resolve any further misunderstandings about your intent and address the major issue of comparability of driver implementations that has been discussed here.
If you specify it this way, people may challenge your intent and ask whether this is a useless limitation forced upon driver implementations, but they most certainly wouldn't misunderstand it. Depending on the depth of their insight you could point them to this thread to better understand the issues around sync, or just tell them that the results of optimizations beyond this point are hard to generalize and are not a goal you are pursuing with the current tests.
The compliance of a driver implementation with this rule would also be measurable/provable at the network level (e.g. via Wireshark).

@billywhizz
Copy link
Contributor

@Brar @michaelhixson @roji @nbrady-techempower

If you specify it this way, people may challenge your intent and ask the question whether this is a useless limitation you force upon driver implementations but most certainly wouldn't misunderstand it.

i agree this would be much clearer but i think it will need a verification implemented in the tfb toolset to enforce this clarified rule, otherwise nobody can have any real confidence in which frameworks are compliant or not without going and testing themselves. i might try to do some work on this over next couple weeks if nobody at techempower is available right now and you agree it would be worth doing.

PS - this response from @michaelhixson on that google group thread is i think the clearest statement of what is actually required: https://groups.google.com/g/framework-benchmarks/c/A78HrYsz4AQ/m/OfFev6s_AwAJ. i wish it had been posted here earlier. 🙄

@michaelhixson
Copy link
Contributor

i agree this would be much clearer but i think it will need a verification implemented in the tfb toolset to enforce this clarified rule, otherwise nobody can have any real confidence in which frameworks are compliant or not without going and testing themselves. i might try to do some work on this over next couple weeks if nobody at techempower is available right now and you agree it would be worth doing.

Right now we're trying to avoid creating more code in TFB's existing Python toolset that we'd have to port over to our new work-in-progress toolset. I think this verification is worth adding, but now is not a good time.

If we were going to add this to the existing toolset, my hope is that it would fit in to the "run siege, then query the database for stats" verifications that we have already.

Related code:

  • AbstractDatabase.verify_queries

    def verify_queries(cls, config, table_name, url, concurrency=512, count=2, check_updates=False):
        '''
        Verify query and row numbers for table_name.
        Retrieve from the database statistics of the number of queries made, the number of rows read, eventually the number of updated rows.
        Run 2 repetitions of http requests at the concurrency level 512 with siege.
        Retrieve statistics again, calculate the number of queries made and the number of rows read.
        '''

  • postgres.Database.get_queries

    def get_queries(cls, config):
        return cls.__exec_and_fetchone(config, "SELECT SUM(calls) FROM pg_stat_statements WHERE query ~* '[[:<:]]%s[[:>:]]'" % cls.tbl_name)

I can't tell if Postgres (or any other database) exposes the stats we need here, though. Nothing is jumping out at me from the pg_stat_statements table that we're currently using or from any of the other pg_stat tables.

If anyone can find out where Postgres stores the information we'd need here, that would be useful to us. If the change to the Python toolset would be small enough, we'd consider making it.

@roji
Copy link
Contributor

roji commented Mar 14, 2022

If anyone can find out where Postgres stores the information we'd need here, that would be useful to us. If the change to the Python toolset would be small enough, we'd consider making it.

So if we're following @billywhizz's suggestion of checking the number of transactions against the number of HTTP requests (and the calculated number of DB requests done inside them), then that information should be available in pg_stat_database.xact_commit (docs) (suggested above). You'd sample this value before starting the benchmark, and then ensure that after the benchmark the increase is equal to the total number of statements that were supposed to be issued.
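
As a rough sketch (not actual toolset code - the connection string and helper names here are illustrative, and it assumes psycopg2 is available), the check could look something like this:

# illustrative only: sample pg_stat_database.xact_commit before and after the load run
# and compare the increase with the number of transactions the HTTP requests should have produced
import psycopg2

DSN = "host=tfb-database dbname=hello_world user=benchmarkdbuser"  # illustrative connection string

def committed_transactions(dsn=DSN):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT xact_commit FROM pg_stat_database WHERE datname = current_database()")
            return cur.fetchone()[0]
    finally:
        conn.close()

def verify_transaction_count(expected_transactions, run_load):
    before = committed_transactions()
    run_load()  # e.g. the existing siege run used by verify_queries
    after = committed_transactions()
    # at least one committed transaction per statement the benchmark was supposed to issue
    return (after - before) >= expected_transactions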

@michaelhixson
Copy link
Contributor

I made those two small edits to the requirements: "or transaction" and "If using PostgreSQL...".

I tried implementing the verification changes using pg_stat_database.xact_commit, but I couldn't get it working correctly. The xact_commit number seems to lag behind what I'd expect. In manual testing, sending a second request makes the number "catch up" to what it should have been after the first request. In automatic verification, the number is much more than 1 request short and it's very inconsistent. I'm not sure what's going on there.

I pushed my (failing) work in progress to a branch: master...michaelhixson:batch-query-verification

@billywhizz
Copy link
Contributor

I tried implementing the verification changes using pg_stat_database.xact_commit, but I couldn't get it working correctly. The xact_commit number seems to lag behind what I'd expect. In manual testing, sending a second request makes the number "catch up" to what it should have been after the first request. In automatic verification, the number is much more than 1 request short and it's very inconsistent. I'm not sure what's going on there.

yep. i saw this same behaviour with xact_commit when i tried it a few weeks ago. didn't have time to investigate why. maybe something that should be raised with the pg devs?

@Brar
Copy link

Brar commented Mar 22, 2022

Maybe the following information about the statistics collector may help:

When using the statistics to monitor collected data, it is important to realize that the information does not update instantaneously. Each individual server process transmits new statistical counts to the collector just before going idle; so a query or transaction still in progress does not affect the displayed totals. Also, the collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 ms unless altered while building the server). So the displayed information lags behind actual activity.

Source: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS

@stephenh
Copy link

I am wary of "yet another opinion", but FWIW I am not a low-level framework optimizer (like @billywhizz :-D and others on this thread), but I am an application-/orm-builder and I could see @billywhizz's outline of:

for each request /update?q=N -->

BIND
EXEC
... repeat N - 1 times
BIND
EXEC
SYNC

Being done safely & automatically for users of ORMs that use a Unit of Work pattern:

// called for each request to /update
async function handleUpdate(n: number) {
  await uow.transaction(async (uow) => {
    await Promise.all(zeroTo(n).map(async () => {
      const w = await uow.load(World, random());
      w.randomNumber = random();
    }));
    // The ORM diffs/sends UPDATE(s) automatically when lambda finishes
  });
}

There are:

  • No hand-written queries here (i.e. to the concern that "skipping SYNC relies on the programmer knowing what they're doing / writing side-effect-free queries"; here the ORM is driving that and it knows/controls the queries),
  • The code is very high-level/realistic production code, i.e. not a micro-optimization just to please the TechEmpower benchmark

And so, AFAIU in this model, the per-request pipelining that @billywhizz proposed would actually be safe & kosher?

Even if it's not allowed by TechEmpower, I want to get my ORM to do that someday. :-D

(I get the push back that cross-request pipelining is even more esoteric, although I wonder if an ORM like ^ could still safely take advantage of it (when outside of transactions), again by knowing/controlling the queries, which given the ORM-focus of the TechEmpower benchmark, again seems kosher.)

(...also I wonder if just saying updates?q=N endpoint must use a transaction would more succinctly highlight that each request must use its own connection and not multiplex a single connection; surely that was discussed, but I didn't see it go by.)
