I was linked to Que, a job queue in Ruby that claims superior performance by using advisory locks. It's true that advisory locks are faster, though using them would require some work (since you're essentially implementing your own concurrency control). However, there are several questions left to answer:
What happens to advisory locks in a failover scenario? It seems like they're entirely lost (since they are held in shared memory and never flushed to disk). We could use a notification here, but notifications aren't completely durable either. I still see a row update needing to happen otherwise (which negates the benefits).
Advisory locks appear to live in the same shared memory pool that PostgreSQL uses for regular locks, which is sized based on the number of connections. To quote the docs:
Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.
Between the number of connections and the locks for a potentially large job queue, this could be catastrophic: a big enough backlog of jobs could create so many advisory locks that normal operations needing locks can't go through.
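For a rough sense of scale (this is my own back-of-the-envelope arithmetic based on the PostgreSQL documentation for max_locks_per_transaction, not a measurement): the shared lock table holds roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) entries, and advisory locks come out of that same pool as regular object locks.

```python
# Rough lock table capacity, assuming stock PostgreSQL defaults (adjust to your config).
max_locks_per_transaction = 64   # default
max_connections = 100            # default
max_prepared_transactions = 0    # default

capacity = max_locks_per_transaction * (max_connections + max_prepared_transactions)
print(capacity)  # 6400 with these defaults -- a deep backlog of per-job locks could approach this
```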
Now, none of these are necessarily deal breakers (perhaps we can offer an "unsafe" mode of execution that uses advisory locks for the extra performance gain, and document the pitfalls there). That being said, research needs to be done in order to determine the best way to handle these scenarios.
FWIW, I talked to Chris, the lead developer of Que, and got some answers based on his experience:
1. You’re right that the locks would get cleared in a failover scenario, and the workers would start from scratch when they connected to the new database. But then, that’s another reason to make jobs idempotent. I guess you could maybe also run into a situation where new workers spin up against the new database while the old workers are still working jobs through the old one, so you could have some jobs being worked more than once simultaneously in that case?
2. The advisory locks are taken only when the job is locked to be worked, so the limiting factor would be the number of lockers/workers, not the number of jobs (at my last job the job queue got backed up to around a million records once, and it was fine). And if you need that many advisory locks I believe there are configuration options you can use to bump up the limits.
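To make his second point concrete, here is a minimal sketch of the lock-while-working pattern (this is not Que's actual implementation; the connection string, table, and column names are hypothetical): the advisory lock is held only while a job is being worked, so the number of held locks tracks busy workers, not queue depth.

```python
import psycopg2

conn = psycopg2.connect("dbname=jobs")  # hypothetical connection string
conn.autocommit = True

def try_work_one(job_id, run):
    """Lock a single job by id, run it, then delete the row and release the lock."""
    with conn.cursor() as cur:
        # Session-level advisory lock keyed on the job id; held only while working.
        cur.execute("SELECT pg_try_advisory_lock(%s)", (job_id,))
        (locked,) = cur.fetchone()
        if not locked:
            return False  # another worker already holds this job
        try:
            run(job_id)
            cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))  # hypothetical table
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s)", (job_id,))
        return True
```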
A couple ideas I think we can take away from the points above:
This could potentially be mitigated by having each worker place a "placeholder" advisory lock: one that never actually guards anything, but whose absence signals that a failover has occurred out from under the application. The scheduler loop can then re-lock any job IDs the worker is currently working on (see the sketch below).
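A minimal sketch of that placeholder idea (the key namespace and helper names are hypothetical; this assumes the worker's connection gets transparently re-established against the new primary, which would drop any advisory locks the old session held):

```python
SENTINEL_CLASS = 42  # hypothetical classid namespace reserved for sentinel locks

def place_sentinel(cur, worker_id):
    # Two-key form (classid, objid); the lock is held for the life of this session.
    cur.execute("SELECT pg_try_advisory_lock(%s, %s)", (SENTINEL_CLASS, worker_id))

def sentinel_still_held(cur, worker_id):
    """Return True if this backend still holds its sentinel lock.

    If the connection was silently re-established against a failed-over primary,
    the sentinel is gone and the worker should re-lock its in-flight job IDs.
    """
    cur.execute(
        """
        SELECT EXISTS (
            SELECT 1 FROM pg_locks
            WHERE locktype = 'advisory'
              AND pid = pg_backend_pid()
              AND classid = %s AND objid = %s AND objsubid = 2
        )
        """,
        (SENTINEL_CLASS, worker_id),
    )
    return cur.fetchone()[0]
```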
Another possibility is using an unlogged table: each worker records its node ID there, and since unlogged tables come back empty after a crash or a failover to a replica, a missing row means you've failed over and should re-lock jobs (sketch below).
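A sketch of the unlogged-table variant (table and column names are hypothetical). Unlogged tables are reset to empty after a crash and their contents don't exist on replicas, so after a failover the worker's row will be gone:

```python
def register_presence(cur, node_id):
    # UNLOGGED: not WAL-logged, so the contents vanish on crash recovery or failover.
    cur.execute(
        "CREATE UNLOGGED TABLE IF NOT EXISTS worker_presence (node_id text PRIMARY KEY)"
    )
    cur.execute(
        "INSERT INTO worker_presence (node_id) VALUES (%s) ON CONFLICT DO NOTHING",
        (node_id,),
    )

def still_registered(cur, node_id):
    """False after a failover: the table comes back empty, so re-lock in-flight jobs."""
    cur.execute(
        "SELECT EXISTS (SELECT 1 FROM worker_presence WHERE node_id = %s)", (node_id,)
    )
    return cur.fetchone()[0]
```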
Advisory lock buildup doesn't seem to be too big of a concern, even amongst our biggest worker fleets at Instructure, so I don't think this is something we have to worry much about. However, we should check how Postgres performs when a large number of nodes and/or a large number of locks exist (also to check: how PgBouncer handles these). A monitoring query for the lock counts is sketched below.
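For watching lock counts, pg_locks exposes granted advisory locks directly; a minimal sketch of such a check follows. On the PgBouncer question, one thing to verify is that session-level advisory locks generally don't work under transaction pooling (the server session can change between statements), so a pooled setup would likely need the pg_advisory_xact_lock variants or session pooling.

```python
def advisory_lock_counts(cur):
    """Count currently granted advisory locks, grouped by the backend holding them."""
    cur.execute(
        """
        SELECT pid, count(*) AS held
        FROM pg_locks
        WHERE locktype = 'advisory' AND granted
        GROUP BY pid
        ORDER BY held DESC
        """
    )
    return cur.fetchall()
```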