
[13.0] Support multi-nodes with lock on jobrunner #256

Closed
guewen wants to merge 4 commits

Conversation

@guewen (Member) commented Sep 30, 2020

Starting several Odoo (main) processes with "--load=web,queue_job"
was unsupported, as it would start several jobrunners, which would all
listen to PostgreSQL notifications and try to enqueue jobs in concurrent
workers.

This is an issue in several cases:

  • it causes issues on odoo.sh, which uses a hybrid model for workers
    and starts several jobrunners (How to set up queue_job in odoo sh? #169 (comment))
  • it defeats any setup that would use several nodes to keep the service
    available in case of failure of a node/host

The solution implemented here uses a PostgreSQL advisory lock, taken
at session level in a connection on the "postgres" database, which
ensures that 2 job runners are not working on the same set of databases.

At loading, the job runner tries to acquire the lock. If it can, it
initializes the connection and listens for jobs. If the lock is taken
by another job runner, it waits and retries to acquire it every 30
seconds.

Example of the logs when one job runner is already started and another one starts:

INFO ? odoo.addons.queue_job.jobrunner.runner: starting
INFO ? odoo.addons.queue_job.jobrunner.runner: already started on another node

The shared lock identifier is computed from the set of databases
the job runner has to listen to: if a job runner is started with
--database=queue1 and another with --database=queue2, they will
have different locks and thus will be able to work in parallel.
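
A minimal sketch of how such an identifier can be derived (the function name and exact hashing scheme here are illustrative, not necessarily what the PR implements):

import hashlib
import struct


def shared_lock_id(db_names):
    # Sort so that --database=a,b and --database=b,a derive the same lock.
    payload = ",".join(sorted(db_names)).encode()
    digest = hashlib.sha256(payload).digest()
    # PostgreSQL advisory locks take a signed 64-bit integer, so only the
    # first 8 bytes of the hash are kept; this loses information, hence a
    # low risk of collision (discussed later in the review).
    return struct.unpack("<q", digest[:8])[0]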

Important: new databases still require a restart of the job runner.
This was already the case before this change; handling new databases
dynamically would be a great improvement, but is out of scope here.
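
As a rough sketch of the acquisition loop described above (connection handling simplified, psycopg2 assumed; the 30-second retry interval comes from the description):

import time

import psycopg2

TRY_ACQUIRE_INTERVAL = 30  # seconds, per the description above


def acquire_shared_lock(lock_id):
    # Session-level advisory lock on the "postgres" database: it is held
    # until the session ends, so the connection must stay open.
    conn = psycopg2.connect(dbname="postgres")
    conn.autocommit = True
    cr = conn.cursor()
    while True:
        cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
        if cr.fetchone()[0]:
            return conn  # lock acquired: this runner becomes the active one
        # The lock is held by another job runner: wait and retry.
        time.sleep(TRY_ACQUIRE_INTERVAL)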

@guewen requested a review from sbidoul September 30, 2020 15:35
@guewen (Member Author) commented Sep 30, 2020

Tested only locally for now, but I'm taking reviews.
Also, I'd love to hear feedback from someone using odoo.sh since it's a huge issue on this platform currently.

@sbidoul (Member) commented Sep 30, 2020

Hi Guewen, thanks for tackling this topic.

So if I understand correctly, if there are multiple databases, we could in theory have some databases handled by one runner and others by another runner?

@guewen (Member Author) commented Sep 30, 2020

So if I understand correctly, if there are multiple databases, we could in theory have some databases handled by one runner and others by another runner?

Yes, that's a possible scenario. If 2 jobrunners are started at the same time for the same set of databases, they'll each race for the locks and could each end up acquiring only some of them.

@sbidoul (Member) commented Sep 30, 2020

If there is more than one active jobrunner (even on different databases), it could overload the system, because each will have its own root channel and each could potentially start as many concurrent jobs.

I was wondering if we could acquire only one advisory lock on the postgres database?

@StefanRijnhart (Member) left a comment

We run a multi node setup with a similar patch that applies an advisory lock to prevent duplicate jobrunners. Works well.

@guewen (Member Author) commented Sep 30, 2020

If there is more than one active jobrunner, it could overload the system, because each will have its own root channel and each could potentially start as many concurrent jobs.

Right, root channels are global across databases. I overlooked this.

I was wondering if we could acquire only one advisory lock on the postgres database?

It means we would have to keep a new connection open on postgres (or template0?), though.

@sbidoul (Member) commented Sep 30, 2020

Or acquire the lock on the first database in alphabetical order? Dangerous if a database gets added, though.

@sbidoul (Member) commented Oct 1, 2020

postgres (or template0?)

Access to the postgres database is (was?) a requirement of Odoo (to be confirmed for v14), so that could be OK?
Keeping a connection open on template0 might be an issue since, IIRC, createdb --template fails if there are open connections on the template database.

@guewen (Member Author) commented Oct 1, 2020

Access to the postgres database is (was?) a requirement of Odoo (to be confirmed for v14), so that could be OK?

Totally, the bus wouldn't work without access to postgres:
https://github.com/odoo/odoo/blob/73a671050c97f6f65f2a1b9916e330f0a7b30dc6/addons/bus/models/bus.py#L63
(same in 14.0).

For the record, I considered using a single connection to postgres, with all notifications going there and the database name carried in the notification payload; but on the side of the worker that delays a job, NOTIFY must be issued within the transaction of the current database, so that's not an option.
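
A small illustration of that constraint (psycopg2 sketch; the channel name "queue_job" is assumed): NOTIFY is transactional and database-local, so it must be sent on the connection of the database where the job is created.

import psycopg2

conn = psycopg2.connect(dbname="project1")  # the *current* database
with conn:  # commits on success
    with conn.cursor() as cr:
        # ... INSERT the queue_job record here, in the same transaction ...
        cr.execute("NOTIFY queue_job")
# Listeners only receive the notification after COMMIT, so they never
# see a job whose transaction was rolled back.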

@StefanRijnhart (Member) commented:

Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same PostgreSQL instance?

@sbidoul (Member) commented Oct 1, 2020

Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same PostgreSQL instance?

If we make the advisory lock name configurable, that may work.

@sbidoul (Member) commented Oct 1, 2020

That additional connection to postgres would be used for acquiring the advisory lock and nothing else.

@guewen (Member Author) commented Oct 1, 2020

Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same PostgreSQL instance?

I was getting to the same point. We have a PostgreSQL cluster with several Odoo instances and one jobrunner per database (--database=myproject1); with a lock in postgres, I would be very annoyed to have a global lock :)

If we make the advisory lock name configurable, that may work.

By default, when we start 2 jobrunners, one with --database=project1 and a second one with --database=project2, we should not require customizing options, IMO. It's too easy to miss.
What about generating the advisory lock name from the list of databases? If I start 2 jobrunners with --database=project1,project2, they'll both generate the same advisory lock, but if I start 2 with different database names, they'll have different locks. (It won't work well if I start --database=p1,p2 and --database=p2,p3, but this could be documented...)

@StefanRijnhart (Member) commented:

Yes, and you could normalize (i.e. sort) the list of databases.

@guewen (Member Author) commented Oct 1, 2020

I pushed a commit to be squashed (:warning: let me squash before any merge, I updated the description :)) implementing the postgres shared lock approach.

@StefanRijnhart self-requested a review October 1, 2020 11:51
@lilee115 commented Oct 1, 2020

Tested only locally for now, but I'm taking reviews.
Also, I'd love to hear feedback from someone using odoo.sh since it's a huge issue on this platform currently.

@guewen Hi Guewen, thanks for your work on this issue.

I had a try with this commit on odoo.sh; unfortunately it didn't work in our project. Here is what I did:

  • We have a branch that works locally with the old version of queue_job

  • I replaced the queue_job folder with this commit

  • Pushed to GitHub and made the branch a staging branch on odoo.sh

  • Step 1: triggered a job; the new job stayed pending

  • Step 2: ran odoo-bin --load=web,queue_job in a shell on odoo.sh; the pending job was then executed

  • Step 3: continued to trigger some jobs; the new jobs stayed pending again

  • Step 4: terminated odoo-bin and re-ran odoo-bin --load=web,queue_job; one job was executed

  • Step 5: terminated and re-ran again; one more job, and only one, was executed

  • It seems only one job was executed every time I did a re-run.

I hope this info helps you with this case. If I did something wrong, please let me know too; I would love to try it again. Thanks!

@guewen (Member Author) commented Oct 5, 2020

Thanks @lilee115, I guess I should finally try to run it on odoo.sh to investigate.

@guewen (Member Author) commented Oct 9, 2020

@lilee115 it seems the issue you mentioned was caused by the anonymous session, which could not be retrieved. After merging #252 and rebasing my branch on top of it, I could test my branch on odoo.sh, with jobs executing properly.

I pushed a new commit for the required configuration on odoo.sh.

I couldn't test with more than one Odoo worker, though, because I am on a trial project; still, I don't see why the lock wouldn't work.

@Cedric-Pigeon (Contributor) commented:

Hello, I am facing the same problem as @lilee115, but not on odoo.sh. The job runner only runs once and after that jobs remain pending.

@guewen (Member Author) commented Oct 18, 2020

Hello, I am facing the same problem as @lilee115, but not on odoo.sh. The job runner only runs once and after that jobs remain pending.

Do you have logs using the debug level?

@Cedric-Pigeon (Contributor) commented:

It seems the problem only appears on one given server, not elsewhere, so it is probably a wrong parameter setting.

@nilshamerlinck (Contributor) commented:

This would be a great built-in feature for HA support!

2 remarks:

  • I think we need a SET idle_in_transaction_session_timeout = 60000; to make sure that the pg server closes the session and frees the lock if the host is unreachable
  • this implementation assumes a stable list of databases; it could be the default case, but it would be nice to have an alternative option to explicitly define a custom lock_name; for example ODOO_QUEUE_JOB_JOBRUNNER_LOCK_NAME=static

@guewen (Member Author) commented Dec 11, 2020

* I think we need a `SET idle_in_transaction_session_timeout = 60000;` to make sure that the pg server closes the session and frees the lock if the host is unreachable

Very good point!

  • this implementation assumes a stable list of databases; it could be the default case, but it would be nice to have an alternative option to explicitly define a custom lock_name; for example ODOO_QUEUE_JOB_JOBRUNNER_LOCK_NAME=static

What's your idea behind the custom lock name? Is it to complement #267? If so, wouldn't releasing the lock and taking a new one with the new list of databases be better? Or could a custom lock name be better for another reason?

@guewen (Member Author) commented Mar 28, 2021

I added

cr.execute("SET idle_in_transaction_session_timeout = 60000;")

before taking the lock.

This comment:

this implementation assumes a stable list of databases; it could be the default case, but it would be nice to have an alternative option to explicitly define a custom lock_name; for example ODOO_QUEUE_JOB_JOBRUNNER_LOCK_NAME=static

Could be part of a follow-up pull request.

@nilshamerlinck (Contributor) commented:

Hi @guewen

I did some tests. We need to keep a transaction open if we want to leverage idle_in_transaction_session_timeout.
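
A sketch of that keep-alive mechanism (illustrative; conn is assumed to be the psycopg2 connection holding the advisory lock, and SHARED_LOCK_KEEP_ALIVE is named after the constant appearing in the scenarios below):

import time

SHARED_LOCK_KEEP_ALIVE = 30  # seconds; the value here is illustrative


def hold_lock_and_keep_alive(conn):
    conn.autocommit = False  # the statements below open and keep a transaction
    with conn.cursor() as cr:
        cr.execute("SET idle_in_transaction_session_timeout = 60000")
        while True:
            # Each statement resets the idle-in-transaction timer. If this
            # host becomes unreachable, the server terminates the session
            # after 60s, releasing the session-level advisory lock so a
            # standby runner can take over.
            cr.execute("SELECT 1")
            time.sleep(SHARED_LOCK_KEEP_ALIVE)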

With these changes, I confirm that these scenarios are supported:

  1. multi nodes (on different hosts)
  • instance1 and instance2 start
  • instance1 acquires the lock, and becomes the active jobrunner
  • instance2 becomes a standby jobrunner, and retries to acquire the lock every TRY_ACQUIRE_INTERVAL seconds
  • somehow instance1 becomes unreachable (network issue)
  • after at most 2*SHARED_LOCK_KEEP_ALIVE seconds, the pg server terminates the idle connection
  • instance2 takes over as the new active jobrunner
  • instance1 connectivity is back to normal, an exception is raised (psycopg2.DatabaseError: SSL SYSCALL error: Connection timed out or psycopg2.OperationalError: server closed the connection unexpectedly)
  • instance1 becomes a standby jobrunner
  • somehow instance2 becomes unreachable (network issue)
  • instance1 takes over as the new active jobrunner
  2. odoo.sh "hybrid" workers
  • worker1 starts, becomes the active jobrunner
  • worker2 starts, becomes a standby jobrunner
  • odoo.sh stops worker1 (graceful stop, so the sharedlock connection is properly closed)
  • worker2 becomes the active jobrunner

My only concern in the context of odoo.sh is that, in the absence of HTTP traffic, all HTTP workers will be stopped, and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next HTTP request (which will trigger the start of an HTTP worker). One workaround is to use an external service (e.g. Pingdom or a custom Prometheus) to request the instance periodically.

@guewen (Member Author) commented Apr 6, 2021

Excellent! Many thanks @nilshamerlinck for the test, fix and details!

My only concern in the context of odoo.sh is that, in the absence of HTTP traffic, all HTTP workers will be stopped, and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next HTTP request (which will trigger the start of an HTTP worker). One workaround is to use an external service (e.g. Pingdom or a custom Prometheus) to request the instance periodically.

Yes, I suppose you are right. Same story for jobs with an ETA in the future.

@yelizariev (Member) commented:

Can't you use an odoo cron for that?

My only concern in the context of odoo.sh is that, in the absence of HTTP traffic, all HTTP workers will be stopped, and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next HTTP request (which will trigger the start of an HTTP worker). One workaround is to use an external service (e.g. Pingdom or a custom Prometheus) to request the instance periodically.

@nilshamerlinck (Contributor) commented:

Hi @yelizariev

Can't you use an odoo cron for that?

A cron could help to keep workers warm indeed, but:

StefanRijnhart added a commit to EssentNovaTeam/queue that referenced this pull request May 17, 2021
[14.0] Support multi-nodes with lock on jobrunner (port of OCA#256)
@hoangtrann commented:

Hi guys, is this good to run on a production instance on Odoo.sh?

@amigrave commented Aug 9, 2021

Hi guys, is this good to run on a production instance on Odoo.sh?

Sorry to hijack the thread again, but unfortunately this solution will still cause issues on Odoo.sh.
The advisory lock breaks our server consolidation system by preventing PostgreSQL replication from occurring.
As more and more instances using queue_job were causing issues, we had to document it as incompatible with Odoo.sh.
We will check whether an alternative locking mechanism is possible and advise accordingly (I will keep you posted here but can't provide an ETA for now).

self.initialize_databases()
_logger.info("database connections ready")

last_keep_alive = None

# inner loop does the normal processing
while not self._stop:
    self.process_notifications()
    self.run_jobs()
    self.wait_notification()


I'm not 100% sure what the exact problem is that @amigrave has with locking vs. replication, but if it is the fact that the advisory lock is held for a long time, maybe a solution would be to try to re-acquire the lock at the beginning of this loop and release it at the end? Then the lock would not be held for so long.

If, by chance, another instance of the job runner is 'first' to acquire the lock, it will just change the master job runner to that one, but there will still only be one running at a time.
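
A sketch of that idea (hypothetical; cr and self.lock_id are assumed names from the runner's context, the other names come from the loop quoted above):

while not self._stop:
    cr.execute("SELECT pg_try_advisory_lock(%s)", (self.lock_id,))
    if cr.fetchone()[0]:
        try:
            self.process_notifications()
            self.run_jobs()
        finally:
            # Release immediately so the lock is never held across the
            # wait below, keeping lock hold times short.
            cr.execute("SELECT pg_advisory_unlock(%s)", (self.lock_id,))
    self.wait_notification()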

* PostgreSQL advisory locks are based on an integer: the list of database names
is sorted, hashed and converted to an int64, so we lose information in the
identifier. There is a low risk of collision. If it happens some day, we
should add an option for a custom lock identifier.


Could it be a solution to try to lock the channel records using:

SELECT name FROM queue_job_channel FOR UPDATE;

instead of going for the integer lock?

That could solve both above issues.

@guewen (Member Author) replied:

Long locks on a table will lead to vacuum issues (and probably replication issues as well).

* If 2 job runners have a database in common but a different list (e.g.
``db_name=project1,project2`` and ``db_name=project2,project3``), both job
runners will work and listen to ``project2``, which will lead to unexpected
behavior.


Perhaps hold a lock for each of the database names separately?
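
A sketch of that suggestion (hypothetical, not what the PR implements; cr is assumed to be a cursor on the shared "postgres" connection):

import hashlib


def acquire_per_database_locks(cr, db_names):
    # One advisory lock per database name: two runners with overlapping
    # lists (p1,p2 and p2,p3) would then conflict on the shared name
    # instead of both listening to it.
    acquired = []
    for name in sorted(db_names):
        lock_id = int.from_bytes(
            hashlib.sha256(name.encode()).digest()[:8], "little", signed=True
        )
        cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
        if cr.fetchone()[0]:
            acquired.append(name)
    return acquired  # the databases this runner may listen to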

@simahawk (Contributor) commented:

@guewen @thomaspaulb any update on this?

Just wondering how far this PR is from being "odoo.sh ready" (if possible at all).
I don't have the full context yet, forgive me if I miss some bits 😉

Soon we'll have to provide odoo.sh compatibility for some connector modules, which will likely lead us to remove connector's dependency on queue_job because of this.

CC @sebalix

@guewen (Member Author) commented Feb 22, 2022

Hi @simahawk, I didn't look for a new solution after @amigrave's last message.

If I had time, maybe I would build an external job runner as a service 😉

Maybe a fallback on cron jobs could be used as a poor man's solution.

@guewen (Member Author) commented Feb 22, 2022

And #256 (comment) may be worth trying.

@github-actions commented:

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

@PCatinean (Contributor) commented:

Hi @amigrave, I know this thread is quite old but there have been some new developments for an HA deployment of queue_job that is also compatible with odoo.sh.

This new approach is being discussed here #607

We are now evaluating this solution vs. session-level advisory locks, but the last thing you mentioned in this PR was that advisory locks caused an issue with the database replication.

If you could find a few minutes to give us some feedback there so we can ensure this will also be compatible with odoo.sh we would greatly appreciate it.

Labels
stale (PR/issue without recent activity; it will be closed automatically soon), work in progress