
fix: Limit concurrent schema cache loads#4643

Open
mkleczek wants to merge 3 commits into PostgREST:main from
mkleczek:use-advisory-locks-to-throttle-schema-cache-loading

Conversation

@mkleczek (Collaborator) commented Feb 10, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.

This change limits the number of concurrent schema cache loading queries using advisory locks.
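The mechanism can be sketched as follows. This is a simplified Python model, not the PR's Haskell code: each instance picks a random lock id from a small fixed range and takes a lock on it, so at most range-size loads run concurrently; `pg_advisory_xact_lock` is modeled here with in-process mutexes, and all names are illustrative.

```python
import random
import threading

# Simplified model of PostgreSQL advisory locks: one mutex per lock id.
# In the PR, pg_advisory_xact_lock($1) plays this role and the lock is
# released automatically at the end of the transaction.
LOCK_BASE = 50168275   # hardcoded constant shared by all instances
LOCK_COUNT = 10        # distinct lock ids => maximum concurrency
advisory_locks = {LOCK_BASE + i: threading.Lock() for i in range(LOCK_COUNT)}

peak = 0
active = 0
counter_guard = threading.Lock()

def load_schema_cache():
    """One instance reloading its schema cache, throttled by an advisory lock."""
    global peak, active
    lock_id = random.randrange(LOCK_BASE, LOCK_BASE + LOCK_COUNT)
    with advisory_locks[lock_id]:  # blocks while another holder has this id
        with counter_guard:
            active += 1
            peak = max(peak, active)
        # ... run the (expensive) schema cache query here ...
        with counter_guard:
            active -= 1

threads = [threading.Thread(target=load_schema_cache) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert peak <= LOCK_COUNT  # never more than LOCK_COUNT concurrent loads
```

Even with 100 "instances", no more than 10 loads can overlap, because each must hold one of the 10 locks.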

Fixes #4642

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 2 times, most recently from 92cf79e to b63457e Compare February 10, 2026 21:51
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This prevents a thundering herd problem on startup, or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
let stmt = SQL.Statement "SELECT pg_catalog.pg_advisory_xact_lock($1)" (HE.param $ HE.nonNullable HE.int8) HD.noResult configDbPreparedStatements
Member

These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.

I guess one drawback is that these advisory locks would run and leave a log trace even if the user never runs into #4642, which is most cases.

WDYT of the solution on #4642 (comment)? Would that be preferable?

Member

I guess one drawback

Also, it's a bit more operational overhead: we would have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long (e.g. on a schema cache load that takes too long due to pg_catalog bloat).

Collaborator Author

I guess one drawback

Also, it's a bit more operational overhead: we would have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long (e.g. on a schema cache load that takes too long due to pg_catalog bloat).

Is it really an issue? If we get a lock timeout we are going to retry anyway.

Collaborator Author
@mkleczek mkleczek Feb 11, 2026

These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.

Yeah, they are tx scoped.

I guess one drawback is that these advisory locks would run and leave a log trace even if the user will never run into #4642, which are most cases.

Maybe we should introduce a config property that activates it then?

WDYT of the solution on #4642 (comment)? Would that be preferable?

See #4642 (comment)

I think for now, mitigating the thundering herd problem is way more feasible, at least in the short to medium term.

Member

I think for now mitigation of thundering herd problem is way more feasible. At least in the short to medium term.
Maybe we should introduce a config property that activates it then?

Yes, agree... a config sounds good. Should we parametrize the number of locks? Or should we just hardcode it to 10 and expose a boolean config? (Also, how do we know 10 is the right number?)

We would also need to test this; it seems doable to ensure only 10 connections can exist at a time when, say, 20 PostgREST instances with db-pool=1 and PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP are spawned.

Collaborator Author
@mkleczek mkleczek Feb 20, 2026

Agree, I think 1 is a good number. Given that I think we should avoid a config and just set it.

Hmm... but why? What harm is there in allowing some level of concurrency? Especially since right now there is no limit at all?

@steve-chavez @wolfgangwalther
I think we can actually do better.

How about we adjust the number of locks based on the (estimated) number of nodes connected to the same database?
There are two issues to solve:

  1. How do we estimate the number of cluster nodes?
  2. What should be the algorithm to calculate the number of locks?

Node number estimation

The idea is to estimate that based on:

  • number of active db sessions opened by the same user as session_user
  • number of open connections in the pool

The estimate would be: active_sessions_number / connections_in_the_pool

This assumes the load is spread evenly among cluster members so all nodes should have the same number of open connections.
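The estimation step above can be sketched as follows (an illustrative Python model, not the PR's Haskell implementation; the function name and the clamping to a minimum of one node are assumptions):

```python
def estimate_nodes(active_sessions: int, pool_connections: int) -> int:
    """Estimate cluster size from information available to one node:
    active database sessions opened by the same session_user, divided by
    this node's own open pool connections. Assumes load is spread evenly,
    so every node holds a similar number of connections."""
    if pool_connections <= 0:
        return 1
    return max(1, round(active_sessions / pool_connections))

# e.g. 40 active sessions observed, this node holds 10 pool connections
print(estimate_nodes(40, 10))  # -> 4 estimated nodes
```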

Number of locks calculation

We need a sublinear function and it seems to me a logarithm is a good fit. The number of locks would be round(log2(estimated_number_of_nodes))

That way we can allow concurrent schema loads while protecting against the thundering herd issue in large clusters.

I've committed an implementation of this idea for you to review. If you don't like it we can easily delete the commit. If you think it is OK, we can split it into coherent pieces.
I've added a test that verifies the level of concurrency for various cluster sizes; the results are as follows:

Nodes  Locks
  2      2
  4      3
  6      4
  8      4
 16      5
WDYT?

Member
@steve-chavez steve-chavez Feb 20, 2026

Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.

Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642), then the first 99 have to load before the 100th can even begin, right? And yes, that doesn't look good for the last instance, since it prolongs the time it has a stale schema cache.


Node number estimation
Number of locks calculation

Seems complicated. One simpler idea that occurs to me:

  1. Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.
  2. We can also know the time the latest scache query took, since we have the metric
    schemaCacheQueryTime :: Gauge,

Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?

Collaborator Author
@mkleczek mkleczek Feb 20, 2026

Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.

Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642), then the first 99 have to load before the 100th can even begin, right? And yes, that doesn't look good for the last instance, since it prolongs the time it has a stale schema cache.

Node number estimation
Number of locks calculation

Seems complicated. One simpler idea that occurs to me:

  1. Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.

I am afraid I don't understand. How do you want to count the number of Pgrst instances based on LISTEN channel?

  2. We can also know the time the latest scache query took, since we have the metric
    schemaCacheQueryTime :: Gauge,

Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?

OK, but what should be the formula to calculate the number of locks?

IMHO, if you already have the estimated number of nodes, calculating log2(number_of_nodes) is as simple as it gets - no additional information required.

The main reason why the proposal in this PR has merit, IMHO, is that all the required information is available locally to the node (i.e. it is only the number of its open connections). So acquiring the lock can be done with a single SELECT query taking a single parameter.

Member

Also I believe this should be a feature instead of a fix.

I think we're looking at two separate issues here:

  1. It's a bug that multiple PostgREST instances just end up as a thundering herd.
  2. There's a performance problem in reloading multiple PostgREST instances at the same time.

We should fix the bug by limiting to 1 concurrent schema cache reloader. We should then discuss how we can best improve performance. This discussion is conflating both.

Collaborator Author

  1. It's a bug that multiple PostgREST instances just end up as a thundering herd.
  2. There's a performance problem in reloading multiple PostgREST instances at the same time.

We should fix the bug by limiting to 1 concurrent schema cache reloader.

Such a "fix" would introduce another (major) bug and hence is not acceptable IMO.

We should then discuss how we can best improve performance. This discussion is conflating both.

Given the above, we must discuss both to come up with the right solution, I'm afraid.

withTxLock <- do
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This prevents a thundering herd problem on startup, or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
Member

What's the reasoning behind this magic number?

Collaborator Author

What's the reasoning behind this magic number?

It is just a randomly generated large number. We need something with a low probability of being used by anything other than PostgREST here (but it needs to be a hardcoded constant so that there is no risk of instances using different locks).

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 11 times, most recently from c72cfeb to 8b8ef5c Compare February 20, 2026 16:56
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from 8b8ef5c to f3e0692 Compare February 25, 2026 19:58
@mkleczek
Collaborator Author

@steve-chavez @wolfgangwalther rebased on top of #4672 to make review easier.

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 3 times, most recently from 50f945f to 0d02360 Compare March 4, 2026 10:03
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 3 times, most recently from ed98903 to 2017326 Compare March 13, 2026 05:27
DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Some helpers are already provided for introspecting metrics (used in the JWT cache tests). This change adds facilities to additionally validate emitted Observation events.
A new Spec module is also implemented, adding basic tests of schema cache reloading - their main goal is to exercise the new infrastructure.
DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Right now the metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because the Hasql pool reports various connection events in multiple phases. The connection state machine is not simple, and to precisely report the number of connections in various states it is necessary to track their lifecycles.

This change adds a ConnTrack data structure and logic to track database connection lifecycles. At the moment it supports "connected" and "inUse" connection counts precisely. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.

This change limits the number of concurrent schema cache loading queries using advisory locks.
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from 2017326 to a2876fe Compare March 13, 2026 06:49

Development

Successfully merging this pull request may close these issues.

Thundering herd problem in PostgREST cluster on AWS ECS

3 participants