fix: Limit concurrent schema cache loads #4643
mkleczek wants to merge 3 commits into PostgREST:main
Conversation
force-pushed from 92cf79e to b63457e
src/PostgREST/AppState.hs
Outdated
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This is to prevent thundering herd problem on startup or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
let stmt = SQL.Statement "SELECT pg_catalog.pg_advisory_xact_lock($1)" (HE.param $ HE.nonNullable HE.int8) HD.noResult configDbPreparedStatements
These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.
I guess one drawback is that these advisory locks would run and leave a log trace even if the user never runs into #4642, which is most cases.
WDYT of the solution on #4642 (comment)? Would that be preferable?
I guess one drawback
Also that it's a bit more operational overhead, we would also have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long? (like on a schema cache load that takes too long due to pg_catalog bloat)
I guess one drawback
Also that it's a bit more operational overhead, we would also have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long? (like on a schema cache load that takes too long due to pg_catalog bloat)
Is it really an issue? If we get a lock timeout we are going to retry anyway.
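The retry argument can be sketched as a small simulation (illustrative only; none of these names exist in the PostgREST codebase): a lock_timeout failure is just another retryable outcome of a load attempt, so the existing retry loop absorbs it without special handling.

```haskell
-- Hypothetical outcomes of one schema cache load attempt.
data LoadResult = Loaded | LockTimeout | QueryTimeout
  deriving (Eq, Show)

-- Walk through the outcomes of successive attempts; return the attempt
-- number that succeeded, or Nothing once the retry budget is exhausted.
retrySchemaLoad :: Int -> [LoadResult] -> Maybe Int
retrySchemaLoad maxAttempts = go 1
  where
    go _ [] = Nothing
    go n (r:rs)
      | r == Loaded      = Just n        -- load finished on attempt n
      | n >= maxAttempts = Nothing       -- gave up
      | otherwise        = go (n + 1) rs -- lock or statement timeout: retry
```

Under this view, recommending lock_timeout only changes how fast a blocked attempt turns into the next retry, not the control flow itself.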
These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.
Yeah, they are tx scoped.
I guess one drawback is that these advisory locks would run and leave a log trace even if the user will never run into #4642, which are most cases.
Maybe we should introduce a config property that activates it then?
WDYT of the solution on #4642 (comment)? Would that be preferable?
See #4642 (comment)
I think for now mitigation of thundering herd problem is way more feasible. At least in the short to medium term.
I think for now mitigation of thundering herd problem is way more feasible. At least in the short to medium term.
Maybe we should introduce a config property that activates it then?
Yes, agree... a config sounds good. Should we parametrize the number of locks? Or should we just hardcode to 10 and expose a boolean config? (Also how do we know 10 is the right number?)
We would also need to test this; seems doable to ensure only 10 connections can exist at a time when, say, 20 PostgREST instances with db-pool=1 and PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP are spawned.
Agree, I think 1 is a good number. Given that, I think we should avoid a config and just set it.
Hmm... but why? What harm is there in allowing some level of concurrency? Especially given that right now there is no limit at all?
@steve-chavez @wolfgangwalther
I think we can actually do better.
How about we adjust the number of locks based on the (estimated) number of nodes connected to the same database?
There are two issues to solve:
- How do we estimate the number of cluster nodes?
- What should be the algorithm to calculate the number of locks?
Node number estimation
The idea is to estimate that based on:
- number of active db sessions opened by the same user as session_user
- number of open connections in the pool
The estimate would be: active_sessions_number / connections_in_the_pool
This assumes the load is spread evenly among cluster members so all nodes should have the same number of open connections.
Number of locks calculation
We need a sublinear function and it seems to me a logarithm is a good fit. The number of locks would be round(log2(estimated_number_of_nodes))
That way we can allow concurrent schema loads while protecting from the thundering herd issue in large clusters.
I've committed an implementation of this idea for you to review. If you don't like it we can easily delete the commit. If you think it is OK, we can split it into coherent pieces.
Added a test that verifies the level of concurrency for various cluster sizes and the results are as follows:
| Nodes | Locks |
|---|---|
| 2 | 2 |
| 4 | 3 |
| 6 | 4 |
| 8 | 4 |
| 16 | 5 |
WDYT?
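The proposal above can be sketched in a few lines (hypothetical names, not the PR's actual code). Note that the table values are one higher than a plain round(log2(nodes)), so this sketch adds 1 to reproduce the table — an inference from the results, not something stated in the comment:

```haskell
-- Estimate cluster size from information local to one node: how many
-- active sessions the same database user has, divided by this node's
-- pool size (assumes load is spread evenly across nodes).
estimatedNodes :: Int -> Int -> Int
estimatedNodes activeSessions poolSize =
  max 1 (activeSessions `div` max 1 poolSize)

-- Sublinear lock count; the +1 offset is inferred so that the function
-- matches the table above (2 -> 2, 4 -> 3, 6 -> 4, 8 -> 4, 16 -> 5).
lockCount :: Int -> Int
lockCount nodes =
  1 + round (logBase 2 (fromIntegral (max 2 nodes)) :: Double)
```

For example, 40 active sessions with db-pool=4 estimates 10 nodes, which would allow lockCount 10 concurrent schema cache loads.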
Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.
Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642) then the first 99 have to load before the 100th can even begin, right? And yes, that doesn't look good for the last instance since it prolongs the time it will have a stale schema cache.
Node number estimation
Number of locks calculation
Seems complicated. One simpler idea that occurs to me:
- Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.
- We can also know the time the latest scache query took, since we have the metric
postgrest/src/PostgREST/Metrics.hs
Line 29 in 66fda76
Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?
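One way the two inputs might combine (purely illustrative; this formula is not proposed anywhere in the PR): with k locks, loads proceed in waves of k, so the total reload time is roughly (listeners / k) × last_load_time. Picking the smallest k that keeps that under a time budget gives:

```haskell
-- Hypothetical sizing: smallest lock count such that
-- (listeners / k) * lastLoadSecs <= budgetSecs.
locksForBudget :: Double -> Int -> Double -> Int
locksForBudget budgetSecs listeners lastLoadSecs =
  max 1 (ceiling (fromIntegral listeners * lastLoadSecs / budgetSecs))
```

E.g. 100 listening instances whose last scache load took 0.5s, with a 10s budget, would get 5 locks: 5 concurrent loaders each working through 20 sequential loads of 0.5s.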
Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.
Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642) then the first 99 have to load before the 100th can even begin, right? And yes, that doesn't look good for the last instance since it prolongs the time it will have a stale schema cache.
Node number estimation
Number of locks calculation
Seems complicated. One simpler idea that occurs to me:
- Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.
I am afraid I don't understand. How do you want to count the number of PostgREST instances based on the LISTEN channel?
- We can also know the time the latest scache query took, since we have the metric
postgrest/src/PostgREST/Metrics.hs
Line 29 in 66fda76
Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?
OK, but what should be the formula to calculate the number of locks?
IMHO, If you already have the estimated number of nodes, calculating log2(number_of_nodes) is as simple as it gets - no additional information required.
The main reason why the proposal in this PR has merit IMHO is that all required information is available locally to the node (ie. it is only the number of its opened connections). So acquiring the lock can be done with a single SELECT query taking a single parameter.
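A hypothetical single-statement form of that idea (this is not the PR's actual SQL; the constant 50168275 is taken from the diff above, everything else is an assumption about how the pieces could fit together server-side):

```haskell
import Data.List (isInfixOf)

-- Sketch: the node passes only its pool size ($1); the server estimates
-- the cluster size from pg_stat_activity, derives the lock count via
-- log2, and takes one of the locks at random, all in one statement.
advisoryLockSql :: String
advisoryLockSql = unlines
  [ "SELECT pg_catalog.pg_advisory_xact_lock("
  , "  50168275 + floor(random() * greatest(1, round(log(2,"
  , "    greatest(2.0, (SELECT count(*) FROM pg_stat_activity"
  , "                    WHERE usename = session_user)::numeric / $1)))))::bigint)"
  ]
```

Being transaction scoped, the lock is released automatically when the schema cache load transaction ends, so no explicit unlock statement is needed.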
Also I believe this should be a feature instead of a fix.
I think we're looking at two separate issues here:
- It's a bug that multiple PostgREST instances just end up as a thundering herd.
- There's a performance problem in reloading multiple PostgREST instances at the same time.
We should fix the bug by limiting to 1 concurrent schema cache reloader. We should then discuss how we can best improve performance. This discussion is conflating both.
- It's a bug that multiple PostgREST instances just end up as a thundering herd.
- There's a performance problem in reloading multiple PostgREST instances at the same time.
We should fix the bug by limiting to 1 concurrent schema cache reloader.
Such a "fix" would introduce another (major) bug and hence is not acceptable IMO.
We should then discuss how we can best improve performance. This discussion is conflating both.
Given the above, we must discuss both to come up with the right solution, I'm afraid.
src/PostgREST/AppState.hs
Outdated
withTxLock <- do
  -- Allow 10 concurrent schema cache loads, guarded by advisory locks.
  -- This is to prevent thundering herd problem on startup or when many PostgREST instances receive "reload schema" notifications at the same time
  lockId <- getRandomR (50168275::Int64, 50168275 + 10)
What's the reasoning behind this magic number?
What's the reasoning behind this magic number?
It is just a randomly generated large number. We need something with a low probability of being used by something other than PostgREST here (but it needs to be a hardcoded constant so that there is no risk of instances using different locks).
force-pushed from c72cfeb to 8b8ef5c
force-pushed from 8b8ef5c to f3e0692
@steve-chavez @wolfgangwalther rebased on top of #4672 to make review easier.
force-pushed from 50f945f to 0d02360
force-pushed from ed98903 to 2017326
DISCLAIMER: This commit was authored entirely by a human without the assistance of LLMs.
Some helpers for introspecting metrics are already provided (used in JWT cache tests). This change adds facilities to additionally validate emitted Observation events. A new Spec module is also implemented, adding basic tests of schema cache reloading - their main goal is to exercise the new infrastructure.
DISCLAIMER: This commit was authored entirely by a human without the assistance of LLMs.
Right now the metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because the Hasql pool reports various connection events in multiple phases. The connection state machine is not simple, and to precisely report the number of connections in various states it is necessary to track their lifecycles. This change adds a ConnTrack data structure and logic to track database connection lifecycles. At the moment it supports "connected" and "inUse" connection counts precisely. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
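The lifecycle tracking described here can be sketched as a pure fold over connection events (a minimal sketch with hypothetical names, not the PR's actual ConnTrack types):

```haskell
-- Simplified pool connection lifecycle events.
data ConnEvent = Connected | Acquired | Released | Disconnected

-- Precise per-state counts, updated on every event.
data ConnTrack = ConnTrack { connected :: Int, inUse :: Int }
  deriving (Eq, Show)

track :: ConnTrack -> ConnEvent -> ConnTrack
track t Connected    = t { connected = connected t + 1 }
track t Disconnected = t { connected = connected t - 1 }
track t Acquired     = t { inUse = inUse t + 1 }
track t Released     = t { inUse = inUse t - 1 }

-- A pool-availability metric then falls out as a derived value instead
-- of a separately maintained Gauge.
available :: ConnTrack -> Int
available t = connected t - inUse t
```

The point of the design is that each metric is derived from one consistent state, rather than incrementing and decrementing independent gauges on loosely related pool events.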
DISCLAIMER: This commit was authored entirely by a human without the assistance of LLMs.
Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster. This change limits the number of concurrent schema cache loading queries using advisory locks.
force-pushed from 2017326 to a2876fe
DISCLAIMER: This commit was authored entirely by a human without the assistance of LLMs.
Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.
This change limits the number of concurrent schema cache loading queries using advisory locks.
Fixes #4642