
[13.0] Support multi-nodes with lock on jobrunner #256

Closed
guewen wants to merge 4 commits

Conversation

@guewen (Member) commented Sep 30, 2020

Starting several Odoo (main) processes with "--load=web,queue_job"
was unsupported, as it would start several jobrunners, which would all
listen to PostgreSQL notifications and try to enqueue jobs in concurrent
workers.

This is an issue in several cases:

  • it causes issues on odoo.sh, which uses a hybrid model for workers
    and starts several jobrunners (How to set up queue_job in odoo sh? #169 (comment))
  • it defeats any setup that would use several nodes to keep the service
    available in case of failure of a node/host

The solution implemented here uses a PostgreSQL advisory lock, taken
at session level in a connection on the "postgres" database, which
ensures that 2 job runners are not working on the same set of databases.

At loading, the job runner tries to acquire the lock. If it can, it
initializes the connection and listens for jobs. If the lock is taken
by another job runner, it waits and retries to acquire it every 30
seconds.

Example of the logs when one job runner is already started and another one starts:

INFO ? odoo.addons.queue_job.jobrunner.runner: starting
INFO ? odoo.addons.queue_job.jobrunner.runner: already started on another node

The shared lock identifier is computed from the set of databases
the job runner has to listen to: if a job runner is started with
--database=queue1 and another with --database=queue2, they will
have different locks and thus will be able to work in parallel.
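
A minimal sketch of how such an identifier can be derived (the function name and exact hashing scheme here are illustrative, not necessarily what the PR implements):

import hashlib
import struct


def shared_lock_id(db_names):
    # Sort so that --database=a,b and --database=b,a derive the same lock.
    payload = ",".join(sorted(db_names)).encode()
    digest = hashlib.sha256(payload).digest()
    # PostgreSQL advisory locks take a signed 64-bit integer, so only the
    # first 8 bytes of the hash are kept; this loses information, hence a
    # low risk of collision (discussed later in the review).
    return struct.unpack("<q", digest[:8])[0]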

Important: new databases still require a restart of the job runner.
This was already the case before this change; handling new databases
dynamically would be a great improvement, but is out of scope here.
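
As a rough sketch of the acquisition loop described above (connection handling simplified, psycopg2 assumed; the 30-second retry interval comes from the description):

import time

import psycopg2

TRY_ACQUIRE_INTERVAL = 30  # seconds, per the description above


def acquire_shared_lock(lock_id):
    # Session-level advisory lock on the "postgres" database: it is held
    # until the session ends, so the connection must stay open.
    conn = psycopg2.connect(dbname="postgres")
    conn.autocommit = True
    cr = conn.cursor()
    while True:
        cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
        if cr.fetchone()[0]:
            return conn  # lock acquired: this runner becomes the active one
        # The lock is held by another job runner: wait and retry.
        time.sleep(TRY_ACQUIRE_INTERVAL)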

@guewen requested a review from sbidoul September 30, 2020 15:35
@guewen (Member Author) commented Sep 30, 2020

Tested only locally for now, but I'm taking reviews.
Also, I'd love to hear feedback from someone using odoo.sh since it's a huge issue on this platform currently.

@sbidoul (Member) commented Sep 30, 2020

Hi Guewen, thanks for tackling this topic.

So if I understand correctly, if there are multiple databases, we could in theory have some databases handled by one runner and others by another runner?

@guewen (Member Author) commented Sep 30, 2020

So if I understand correctly, if there are multiple databases, we could in theory have some databases handled by one runner and others by another runner?

Yes, that's a possible scenario. If 2 jobrunners are started at the same time for the same set of databases, they'll each race for the locks and could each end up acquiring only some of them.

@sbidoul (Member) commented Sep 30, 2020

If there is more than one active jobrunner (even on different databases), it could overload the system, because each will have its own root channel and each could potentially start as many concurrent jobs.

I was wondering if we could acquire only one advisory lock on the postgres database?

@StefanRijnhart (Member) left a comment

We run a multi node setup with a similar patch that applies an advisory lock to prevent duplicate jobrunners. Works well.

@guewen (Member Author) commented Sep 30, 2020

If there is more than one active jobrunner, it could overload the system, because each will have its own root channel and each could potentially start as many concurrent jobs.

Right, root channels are global across databases. I overlooked this.

I was wondering if we could acquire only one advisory lock on the postgres database?

It means we would have to keep a new connection open on postgres (or template0?), though.

@sbidoul (Member) commented Sep 30, 2020

Or acquire the lock on the first database in alphabetical order? Dangerous if a database gets added, though.

@sbidoul (Member) commented Oct 1, 2020

postgres (or template0?)

Access to the postgres database is (was?) a requirement of Odoo (to be confirmed for v14), so that could be OK?
Keeping a connection open on template0 might be an issue since, IIRC, createdb --template fails if there are open connections on the template database.

@guewen (Member Author) commented Oct 1, 2020

Access to the postgres database is (was?) a requirement of Odoo (to be confirmed for v14), so that could be OK?

Totally, the bus wouldn't work without access to postgres:
https://github.com/odoo/odoo/blob/73a671050c97f6f65f2a1b9916e330f0a7b30dc6/addons/bus/models/bus.py#L63
(same in 14.0).

For the record, I considered using a single connection to postgres, with all notifications going there and the database name carried in the notification payload; but on the side of the worker that delays a job, NOTIFY must be issued within the transaction of the current database, so that's not an option.
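
A small illustration of that constraint (psycopg2 sketch; the channel name "queue_job" is assumed): NOTIFY is transactional and database-local, so it must be sent on the connection of the database where the job is created.

import psycopg2

conn = psycopg2.connect(dbname="project1")  # the *current* database
with conn:  # commits on success
    with conn.cursor() as cr:
        # ... INSERT the queue_job record here, in the same transaction ...
        cr.execute("NOTIFY queue_job")
# Listeners only receive the notification after COMMIT, so they never
# see a job whose transaction was rolled back.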

@StefanRijnhart (Member) commented:

Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same PostgreSQL instance?

@sbidoul (Member) commented Oct 1, 2020

Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same PostgreSQL instance?

If we make the advisory lock name configurable, that may work.

@sbidoul (Member) commented Oct 1, 2020

That additional connection to postgres would be used for acquiring the advisory lock and nothing else.

@guewen (Member Author) commented Oct 1, 2020

Also, is a single connection to Postgres not a problem for setups with multiple Odoo instances querying the same PostgreSQL instance?

I was getting to the same point. We have a PostgreSQL cluster with several Odoo instances and one jobrunner per database (--database=myproject1); with a lock in postgres, I would be very annoyed to have a global lock :)

If we make the advisory lock name configurable, that may work.

By default, when we start 2 jobrunners, one with --database=project1 and a second one with --database=project2, we should not require customizing options, IMO. It's too easy to miss.
What about generating the advisory lock name from the list of databases? If I start 2 jobrunners with --database=project1,project2, they'll both generate the same advisory lock, but if I start 2 with different database names, they'll have different locks. (It won't work well if I start --database=p1,p2 and --database=p2,p3, but this could be documented...)

@StefanRijnhart (Member) commented:

Yes, and you could normalize (i.e. sort) the list of databases.

@guewen (Member Author) commented Oct 1, 2020

I pushed a commit to be squashed (:warning: let me squash before any merge, I updated the description :)) implementing the postgres shared lock approach.

@StefanRijnhart self-requested a review October 1, 2020 11:51
@lilee115 commented Oct 1, 2020

Tested only locally for now, but I'm taking reviews.
Also, I'd love to hear feedback from someone using odoo.sh since it's a huge issue on this platform currently.

@guewen Hi Guewen, thanks for your work on this issue.

I had a try with this commit on odoo.sh; unfortunately it didn't work in our project. Here is what I did:

  • We have a branch that works locally with the old version of queue_job

  • I replaced the queue_job folder with this commit

  • Pushed to GitHub and made the branch a staging branch on odoo.sh

  • Step 1: triggered a job; the new job stayed pending

  • Step 2: ran odoo-bin --load=web,queue_job in a shell on odoo.sh; the pending job was then executed

  • Step 3: continued to trigger some jobs; the new jobs stayed pending again

  • Step 4: terminated odoo-bin and re-ran odoo-bin --load=web,queue_job; one job was executed

  • Step 5: terminated and re-ran again; one more job, and only one, was executed

  • It seems only one job was executed every time I did a re-run.

I hope this info helps you with this case. If I did something wrong, please let me know too; I would love to try it again. Thanks!

@guewen (Member Author) commented Oct 5, 2020

Thanks @lilee115, I guess I should finally try to run it on odoo.sh to investigate.

@guewen (Member Author) commented Oct 9, 2020

@lilee115 it seems the issue you mentioned was caused by the anonymous session, which could not be retrieved. After merging #252 and rebasing my branch on top of it, I could test my branch on odoo.sh, with jobs executing properly.

I pushed a new commit for the required configuration on odoo.sh.

I couldn't test with more than one Odoo worker, though, because I am on a trial project; still, I don't see why the lock wouldn't work.

@Cedric-Pigeon (Contributor) commented:

Hello, I am facing the same problem as @lilee115, but not on odoo.sh. The job runner only runs once and after that jobs remain pending.

@guewen (Member Author) commented Oct 18, 2020

Hello, I am facing the same problem as @lilee115, but not on odoo.sh. The job runner only runs once and after that jobs remain pending.

Do you have logs using the debug level?

@Cedric-Pigeon (Contributor) commented:

It seems the problem only appears on one given server, not elsewhere, so it is probably a wrong parameter setting.

@nilshamerlinck (Contributor) commented:

This would be a great built-in feature for HA support!

2 remarks:

  • I think we need a SET idle_in_transaction_session_timeout = 60000; to make sure that the pg server closes the session and frees the lock if the host is unreachable
  • this implementation assumes a stable list of databases; it could be the default case, but it would be nice to have an alternative option to explicitly define a custom lock_name; for example ODOO_QUEUE_JOB_JOBRUNNER_LOCK_NAME=static

@guewen (Member Author) commented Dec 11, 2020

* I think we need a `SET idle_in_transaction_session_timeout = 60000;` to make sure that the pg server closes the session and frees the lock if the host is unreachable

Very good point!

  • this implementation assumes a stable list of databases; it could be the default case, but it would be nice to have an alternative option to explicitly define a custom lock_name; for example ODOO_QUEUE_JOB_JOBRUNNER_LOCK_NAME=static

What's your idea behind the custom lock name? Is it to complement #267? If so, wouldn't releasing the lock and taking a new one with the new list of databases be better? Or could a custom lock name be better for another reason?

@guewen (Member Author) commented Mar 28, 2021

I added

cr.execute("SET idle_in_transaction_session_timeout = 60000;")

before taking the lock.

This comment:

this implementation assumes a stable list of databases; it could be the default case, but it would be nice to have an alternative option to explicitly define a custom lock_name; for example ODOO_QUEUE_JOB_JOBRUNNER_LOCK_NAME=static

Could be part of a follow-up pull request.

@nilshamerlinck (Contributor) commented:

Hi @guewen

I did some tests. We need to keep a transaction open if we want to leverage idle_in_transaction_session_timeout.
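
A sketch of that keep-alive mechanism (illustrative; conn is assumed to be the psycopg2 connection holding the advisory lock, and SHARED_LOCK_KEEP_ALIVE is named after the constant appearing in the scenarios below):

import time

SHARED_LOCK_KEEP_ALIVE = 30  # seconds; the value here is illustrative


def hold_lock_and_keep_alive(conn):
    conn.autocommit = False  # the statements below open and keep a transaction
    with conn.cursor() as cr:
        cr.execute("SET idle_in_transaction_session_timeout = 60000")
        while True:
            # Each statement resets the idle-in-transaction timer. If this
            # host becomes unreachable, the server terminates the session
            # after 60s, releasing the session-level advisory lock so a
            # standby runner can take over.
            cr.execute("SELECT 1")
            time.sleep(SHARED_LOCK_KEEP_ALIVE)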

With these changes, I confirm that these scenarios are supported:

  1. multi nodes (on different hosts)
  • instance1 and instance2 start
  • instance1 acquires the lock, and becomes the active jobrunner
  • instance2 becomes a standby jobrunner, and retries to acquire the lock every TRY_ACQUIRE_INTERVAL seconds
  • somehow instance1 becomes unreachable (network issue)
  • after at most 2*SHARED_LOCK_KEEP_ALIVE seconds, the pg server terminates the idle connection
  • instance2 takes over as the new active jobrunner
  • instance1 connectivity is back to normal, an exception is raised (psycopg2.DatabaseError: SSL SYSCALL error: Connection timed out or psycopg2.OperationalError: server closed the connection unexpectedly)
  • instance1 becomes a standby jobrunner
  • somehow instance2 becomes unreachable (network issue)
  • instance1 takes over as the new active jobrunner
  2. odoo.sh "hybrid" workers
  • worker1 starts, becomes the active jobrunner
  • worker2 starts, becomes a standby jobrunner
  • odoo.sh stops worker1 (graceful stop, so the sharedlock connection is properly closed)
  • worker2 becomes the active jobrunner

My only concern in the context of odoo.sh is that, in the absence of HTTP traffic, all HTTP workers will be stopped, and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next HTTP request (which will trigger the start of an HTTP worker). One workaround is to use an external service (e.g. Pingdom or a custom Prometheus) to request the instance periodically.

@guewen (Member Author) commented Apr 6, 2021

Excellent! Many thanks @nilshamerlinck for the test, fix and details!

My only concern in the context of odoo.sh is that, in the absence of HTTP traffic, all HTTP workers will be stopped, and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next HTTP request (which will trigger the start of an HTTP worker). One workaround is to use an external service (e.g. Pingdom or a custom Prometheus) to request the instance periodically.

Yes, I suppose you are right. Same story for jobs with an ETA in the future.

@yelizariev (Member) commented:

Can't you use an odoo cron for that?

My only concern in the context of odoo.sh is that, in the absence of HTTP traffic, all HTTP workers will be stopped, and thus all jobrunners. If you have a scheduled action that creates queue jobs, they will stay pending until the next HTTP request (which will trigger the start of an HTTP worker). One workaround is to use an external service (e.g. Pingdom or a custom Prometheus) to request the instance periodically.

@nilshamerlinck (Contributor) commented:

Hi @yelizariev

Can't you use an odoo cron for that?

A cron could help to keep workers warm indeed, but:

StefanRijnhart added a commit to EssentNovaTeam/queue that referenced this pull request May 17, 2021
[14.0] Support multi-nodes with lock on jobrunner (port of OCA#256)
@hoangtrann commented:

Hi guys, is this good to run on a production instance on Odoo.sh?

@amigrave commented Aug 9, 2021

Hi guys, is this good to run on a production instance on Odoo.sh?

Sorry to hijack the thread again, but unfortunately this solution will still cause issues on Odoo.sh.
The advisory lock breaks our server consolidation system by preventing PostgreSQL replication from occurring.
As more and more instances using queue_job were causing issues, we had to document it as incompatible with Odoo.sh.
We will check whether an alternative locking mechanism is possible and advise accordingly (I will keep you posted here but can't provide an ETA for now).

self.initialize_databases()
_logger.info("database connections ready")

last_keep_alive = None

# inner loop does the normal processing
while not self._stop:
    self.process_notifications()
    self.run_jobs()
    self.wait_notification()


I'm not 100% sure what the exact problem is that @amigrave has with locking vs. replication, but if it is the fact that the advisory lock is held for a long time, maybe a solution would be to try to re-acquire the lock at the beginning of this loop and release it at the end? Then the lock would not be held for so long.

If, by chance, another instance of the job runner is 'first' to acquire the lock, it will just change the master job runner to that one, but there will still only be one running at a time.
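
A sketch of that idea (hypothetical; cr and self.lock_id are assumed names from the runner's context, the other names come from the loop quoted above):

while not self._stop:
    cr.execute("SELECT pg_try_advisory_lock(%s)", (self.lock_id,))
    if cr.fetchone()[0]:
        try:
            self.process_notifications()
            self.run_jobs()
        finally:
            # Release immediately so the lock is never held across the
            # wait below, keeping lock hold times short.
            cr.execute("SELECT pg_advisory_unlock(%s)", (self.lock_id,))
    self.wait_notification()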

* PostgreSQL advisory locks are based on an integer: the list of database names
is sorted, hashed and converted to an int64, so we lose information in the
identifier. There is a low risk of collision. If it happens some day, we
should add an option for a custom lock identifier.


Could it be a solution to try to lock the channel records using:

SELECT name FROM queue_job_channel FOR UPDATE;

instead of going for the integer lock?

That could solve both above issues.

@guewen (Member Author) replied:

Long locks on a table will lead to vacuum issues (and probably replication issues as well).

* If 2 job runners have a database in common but a different list (e.g.
``db_name=project1,project2`` and ``db_name=project2,project3``), both job
runners will work and listen to ``project2``, which will lead to unexpected
behavior.


Perhaps hold a lock for each of the database names separately?
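
A sketch of that suggestion (hypothetical, not what the PR implements; cr is assumed to be a cursor on the shared "postgres" connection):

import hashlib


def acquire_per_database_locks(cr, db_names):
    # One advisory lock per database name: two runners with overlapping
    # lists (p1,p2 and p2,p3) would then conflict on the shared name
    # instead of both listening to it.
    acquired = []
    for name in sorted(db_names):
        lock_id = int.from_bytes(
            hashlib.sha256(name.encode()).digest()[:8], "little", signed=True
        )
        cr.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
        if cr.fetchone()[0]:
            acquired.append(name)
    return acquired  # the databases this runner may listen to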

@simahawk (Contributor) commented:

@guewen @thomaspaulb any update on this?

Just wondering how far this PR is from being "odoo.sh ready" (if possible at all).
I don't have the full context yet, forgive me if I miss some bits 😉

Soon we'll have to provide odoo.sh compatibility for some connector modules, which will likely lead us to remove connector's dependency on queue_job because of this.

CC @sebalix

@guewen (Member Author) commented Feb 22, 2022

Hi @simahawk, I didn't look for a new solution after @amigrave's last message.

If I had time, maybe I would build an external job runner as a service 😉

Maybe a fallback on cron jobs could be used as a poor man's solution.

@guewen (Member Author) commented Feb 22, 2022

And #256 (comment) may be worth trying.

@github-actions commented:

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

@PCatinean (Contributor) commented:

Hi @amigrave, I know this thread is quite old but there have been some new developments for an HA deployment of queue_job that is also compatible with odoo.sh.

This new approach is being discussed here #607

We are now evaluating this solution vs. session-level advisory locks, but the last thing you mentioned in this PR was that advisory locks caused an issue with the database replication.

If you could find a few minutes to give us some feedback there so we can ensure this will also be compatible with odoo.sh we would greatly appreciate it.

Labels
stale (PR/issue without recent activity; it will be closed automatically soon), work in progress