Support multi-nodes with lock on jobrunner
Starting several Odoo (main) processes with "--load=web,queue_job"
was unsupported, as it would start several job runners, which would all
listen to PostgreSQL notifications and try to enqueue jobs in concurrent
workers.

This is an issue in several cases:

* it causes issues on odoo.sh, which uses a hybrid model for workers
  and starts several job runners [0]
* it defeats any setup that uses several nodes to keep the service
  available in case of failure of a node/host

The solution implemented here uses a PostgreSQL advisory lock, taken
at session level on a connection to the "postgres" database, which
ensures that 2 job runners never work on the same set of databases.
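
For illustration, a minimal sketch of the session-level advisory lock
semantics relied upon here (the connection parameters and lock
identifier are placeholders, not the actual values used by the runner):

    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

    LOCK_IDENT = 42  # placeholder; the runner derives it from the db list

    # Two connections simulate two job runner nodes.
    conn_a = psycopg2.connect(dbname="postgres")
    conn_b = psycopg2.connect(dbname="postgres")
    for conn in (conn_a, conn_b):
        conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)

    with conn_a.cursor() as cr:
        cr.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_IDENT,))
        print(cr.fetchone()[0])  # True: the first node wins the race

    with conn_b.cursor() as cr:
        cr.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_IDENT,))
        print(cr.fetchone()[0])  # False: held by the other session

    conn_a.close()  # session-level locks are released when the session ends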

At loading, the job runner tries to acquire the lock. If it succeeds,
it initializes the connection and listens for jobs. If the lock is held
by another job runner, it waits and retries to acquire it every 30
seconds.

Example of the log output when a job runner is already running and a second one starts:

    INFO ? odoo.addons.queue_job.jobrunner.runner: starting
    INFO ? odoo.addons.queue_job.jobrunner.runner: already started on another node
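
To see which PostgreSQL session currently holds the lock, the
``pg_locks`` view can be queried. This diagnostic snippet is not part
of the commit, just an illustration (a 64-bit advisory lock shows up
split into ``classid`` and ``objid``):

    import psycopg2

    with psycopg2.connect(dbname="postgres") as conn:
        with conn.cursor() as cr:
            cr.execute(
                "SELECT pid, classid, objid, granted "
                "FROM pg_locks WHERE locktype = 'advisory'"
            )
            for row in cr.fetchall():
                print(row)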

The shared lock identifier is computed from the set of databases the
job runner has to listen to: if one job runner is started with
``--database=queue1`` and another with ``--database=queue2``, they will
hold different locks and thus be able to work in parallel.
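
The computation below mirrors the ``name_to_int64`` and ``_lock_ident``
helpers added by this commit (standalone sketch for illustration):

    import hashlib

    def lock_ident(db_names):
        # Sort so that the same set of databases always yields the same
        # lock, regardless of the order given on the command line.
        name = "qj:{}".format("-".join(sorted(db_names)))
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        # Advisory lock identifiers are signed 64-bit integers, so only
        # the first 8 bytes of the hash are kept.
        return int.from_bytes(digest[:8], byteorder="big", signed=True)

    assert lock_ident(["queue1"]) != lock_ident(["queue2"])
    assert lock_ident(["a", "b"]) == lock_ident(["b", "a"])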

Important: new databases still require a restart of the job runner.
This was already the case before this change; detecting them
dynamically would be a great improvement, but is out of scope here.

[0] #169 (comment)
guewen committed Oct 9, 2020
1 parent a089cb1 commit 5aad591
Showing 1 changed file with 129 additions and 5 deletions: queue_job/jobrunner/runner.py
@@ -18,6 +18,35 @@
* It does not run jobs itself, but asks Odoo to run them through an
  anonymous ``/queue_job/runjob`` HTTP request. [1]_

How do concurrent job runners work?
-----------------------------------

If several nodes (on different hosts or not) of job runners are started,
a shared lock ensures that only one job runner works on a database at
a time. These rules are to be taken into consideration:

* The identifier of the shared lock is based on the database list provided,
  so either ``--database``/``db_name`` or all the databases in PostgreSQL.
* When 2 job runners with the exact same list of databases are started,
  only the first one will work. The second one will wait and take over
  if the first one is stopped.

Caveats:

* If 2 job runners have a database in common but a different list (e.g.
  ``db_name=project1,project2`` and ``db_name=project2,project3``), both job
  runners will work and listen to ``project2``, which will lead to unexpected
  behavior.
* The same applies when no database is specified and all the cluster's
  databases are used: if a job runner is started on the cluster's databases,
  then a new database is created and a second job runner is started, both
  runners will work on overlapping sets of databases, with unexpected
  behavior.
* PostgreSQL advisory locks are based on an integer: the list of database
  names is sorted, hashed and converted to an int64, so some information is
  lost in the identifier. A low risk of collision is possible. If it ever
  happens, an option for a custom lock identifier should be added.

How to use it?
--------------
@@ -134,6 +163,7 @@
"""

import datetime
import hashlib
import logging
import os
import select
@@ -152,6 +182,8 @@
from .channels import ENQUEUED, NOT_DONE, PENDING, ChannelManager

SELECT_TIMEOUT = 60
TRY_ACQUIRE_INTERVAL = 30 # seconds
SHARED_LOCK_KEEP_ALIVE = 60 # seconds
ERROR_RECOVERY_DELAY = 5

_logger = logging.getLogger(__name__)
@@ -251,7 +283,51 @@ def urlopen():
    thread.start()


-class Database(object):
+class SharedLockDatabase(object):
    def __init__(self, db_name, lock_name):
        self.db_name = db_name
        self.lock_ident = self.name_to_int64(lock_name)
        connection_info = _connection_info_for(db_name)
        self.conn = psycopg2.connect(**connection_info)
        self.conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
        self.acquired = False
        self.try_acquire()

    @staticmethod
    def name_to_int64(lock_name):
        hasher = hashlib.sha256()
        hasher.update(lock_name.encode("utf-8"))
        # pg_try_advisory_lock is limited to an 8-byte (64bit) signed integer
        return int.from_bytes(hasher.digest()[:8], byteorder="big", signed=True)

    def try_acquire(self):
        self.acquired = self._acquire()

    def _acquire(self):
        with closing(self.conn.cursor()) as cr:
            # session level lock
            cr.execute("SELECT pg_try_advisory_lock(%s);", (self.lock_ident,))
            acquired = cr.fetchone()[0]
        return acquired

    def keep_alive(self):
        query = "SELECT 1"
        with closing(self.conn.cursor()) as cr:
            cr.execute(query)

    def close(self):
        # pylint: disable=except-pass
        # If close fails for any reason, it is either because the connection
        # is already closed (and we don't care), or for another reason, but
        # it will be closed on del anyway.
        try:
            self.conn.close()
        except Exception:
            pass
        self.conn = None


class QueueDatabase(object):
    def __init__(self, db_name):
        self.db_name = db_name
        connection_info = _connection_info_for(db_name)
@@ -339,6 +415,11 @@ def __init__(
        if channel_config_string is None:
            channel_config_string = _channels()
        self.channel_manager.simple_configure(channel_config_string)

        self.shared_lock_db = None
        # TODO: how to detect new databases or databases
        # on which queue_job is installed after server start?
        self.list_db_names = self.get_db_names()
        self.db_by_name = {}
        self._stop = False
        self._stop_pipe = os.pipe()
@@ -388,11 +469,22 @@ def close_databases(self, remove_jobs=True):
                db.close()
            except Exception:
                _logger.warning("error closing database %s", db_name, exc_info=True)

        self.db_by_name = {}

        if self.shared_lock_db:
            try:
                self.shared_lock_db.close()
            except Exception:
                _logger.warning(
                    "error closing database %s",
                    self.shared_lock_db.db_name,
                    exc_info=True,
                )

    def initialize_databases(self):
-        for db_name in self.get_db_names():
-            db = Database(db_name)
+        for db_name in self.list_db_names:
+            db = QueueDatabase(db_name)
            if not db.has_queue_job:
                _logger.debug("queue_job is not installed for db %s", db_name)
            else:
@@ -470,6 +562,12 @@ def wait_notification(self):
                for conn in conns:
                    conn.poll()

    def keep_alive_shared_lock(self):
        self.shared_lock_db.keep_alive()

    def _lock_ident(self):
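        # e.g. with --database=db1,db2 the lock name is "qj:db1-db2"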
return "qj:{}".format("-".join(sorted(self.list_db_names)))

    def stop(self):
        _logger.info("graceful stop requested")
        self._stop = True
@@ -481,16 +579,42 @@ def run(self):
        while not self._stop:
            # outer loop does exception recovery
            try:
                # When concurrent jobrunners are started, the first to win the
                # race acquires an advisory lock on PostgreSQL and gets to
                # work. When a jobrunner is stopped, the lock is released, and
                # another node can take over.
                self.shared_lock_db = SharedLockDatabase(
                    "postgres", self._lock_ident()
                )
                if not self.shared_lock_db.acquired:
                    self.close_databases()
                    _logger.info("already started on another node")
                    # no database to work with... retry later in case a
                    # concurrent node is stopped
                    time.sleep(TRY_ACQUIRE_INTERVAL)
                    continue

                _logger.info("initializing database connections")
                # TODO: how to detect new databases or databases
                # on which queue_job is installed after server start?
                self.initialize_databases()
                _logger.info("database connections ready")

                last_keep_alive = None

                # inner loop does the normal processing
                while not self._stop:
                    self.process_notifications()
                    self.run_jobs()
                    self.wait_notification()
                    if (
                        not last_keep_alive
                        or time.time() >= last_keep_alive + SHARED_LOCK_KEEP_ALIVE
                    ):
                        last_keep_alive = time.time()
                        # send a keepalive on the shared lock connection at
                        # most every 60 seconds
                        self.keep_alive_shared_lock()
                    # TODO here, when we have no "db_name", we could list again
                    # the databases and if the list changed, try to acquire a
                    # new lock

            except KeyboardInterrupt:
                self.stop()
            except InterruptedError:
