30 second Failure Detection and Process Failover #474

Closed
bmbouter opened this issue Jan 12, 2017 · 2 comments

bmbouter commented Jan 12, 2017

This is to test a rather large story: https://pulp.plan.io/issues/2509

Sanity Checking

Start Pulp normally and perform a basic sync to sanity check Pulp.
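
For example, a minimal sync sketch, assuming the RPM plugin is installed; the repo id and feed URL are illustrative only:

pulp-admin login -u admin -p admin
pulp-admin rpm repo create --repo-id=zoo --feed=https://repos.fedorapeople.org/pulp/pulp/demo_repos/zoo/
pulp-admin rpm repo sync run --repo-id=zoo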

Start Pulp normally, allow the system to idle for 40 seconds, and observe the logs. Verify that no errors are reported while Pulp idles over the 40 seconds.
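
One way to watch the logs while idling, assuming Pulp is logging to the systemd journal (the default on EL7); the filter pattern is just an example:

sudo journalctl -f | grep -iE 'pulp|celery|error'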

Upgrade test

This is a one-time manual test on an RPM-based installation on EL7.
This is not a dev test and likely should not be automated.

Upgrade to this release from an earlier release of Pulp.
Verify that the /usr/lib/systemd/system/pulp_resource_manager.service contains the line: --heartbeat-interval=5
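
A quick way to check this, for example:

grep -- '--heartbeat-interval=5' /usr/lib/systemd/system/pulp_resource_manager.service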

Worker Failure Testing

With Pulp normally started, look at the output of pulp-admin status and verify all expected workers are present. Then kill -9 a specific worker, for example reserved_resource_worker-0 with sudo pkill -9 -f reserved_resource_worker-0.

After 30 seconds have passed, check pulp-admin status.
Verify that the worker is no longer shown. The status API is expected to stop showing a killed worker 30 seconds after the kill occurs.
Verify that the logs contain an error stating that the worker has gone missing (see the log check below).
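
A possible log and status check, assuming journald logging on EL7; the exact wording of the missing-worker message may differ:

sudo journalctl --since "5 minutes ago" | grep -i "gone missing"
pulp-admin status | grep reserved_resource_worker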

Resource Manager Failover Testing

Testing normal concurrent operation

Start Pulp normally, but keep the resource manager stopped. Then, in one terminal on the box, run resource manager A with:

sudo -u apache /bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@boxA -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_managerA.pid --heartbeat-interval=5

On a second terminal run resource manager B with:

sudo -u apache /bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@boxB -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_managerB.pid --heartbeat-interval=5

Verify that you see the full Celery "banner" with stars in resource manager A.
Verify that you see a log statement that resource manager A has acquired the lock.
Verify that you do not see any output from resource manager B.
Verify that you see a log statement that resource manager B is a hot spare.
Verify that both resource managers are reported via pulp-admin status (see the check below)
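
One way to check the last item, assuming the node names used in the commands above (resource_manager@boxA and resource_manager@boxB should both appear):

pulp-admin status | grep resource_manager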

Failover due to graceful shutdown

Start both resource_manager processes as described above.

Ctrl+C from resource manager A
Verify that within 5 seconds the logs emit a statement like new lock acquired by 'resource_manager@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and resource manager B is the new primary resource manager
Verify that resource manager B displays the celery banner with stars within 5 seconds of the Ctrl-C
Verify that only resource manager B is shown in pulp-admin status

Failover due to killing

Start both resource_manager processes as described above.

kill -9 the resource_manager that has acquired the lock
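
For example, assuming resource manager A holds the lock (it typically does if it was started first), it can be killed via the pidfile given on its command line:

sudo kill -9 $(cat /var/run/pulp/resource_managerA.pid)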

Verify that resource manager B displays the celery banner with stars within 30 seconds of the kill
Verify that within 5 seconds the logs emit a statement like new lock acquired by 'resource_manager@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and resource manager B is the new primary resource manager
Verify that only resource manager B is shown in pulp-admin status

Celerybeat Failover Testing

This can't be done on one computer without changing Pulp code, because both celerybeats would have the same name. You can do the same test on two machines, with one celerybeat on each. I'm going to apply this diff to the code to allow me to do it on one machine in two separate terminals.

diff --git a/server/pulp/server/async/scheduler.py b/server/pulp/server/async/scheduler.py
index d745406..b158386 100644
--- a/server/pulp/server/async/scheduler.py
+++ b/server/pulp/server/async/scheduler.py
@@ -28,8 +28,9 @@ import pulp.server.logs  # noqa
 
 _logger = logging.getLogger(__name__)
 
+import random
 # setting the celerybeat name
-CELERYBEAT_NAME = constants.SCHEDULER_WORKER_NAME + "@" + platform.node()
+CELERYBEAT_NAME = constants.SCHEDULER_WORKER_NAME + "@" + str(random.randint(1,10000))
 
 
 class EventMonitor(threading.Thread):

Testing normal concurrent operation

Start Pulp normally, but keep celerybeat stopped. Then in one terminal on the box run celerybeat A with:

sudo -u apache /bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler --pidfile=/var/run/pulp/celerybeatA.pid

On a second terminal run celerybeat B with:

sudo -u apache /bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler --pidfile=/var/run/pulp/celerybeatB.pid

Verify that pulp-admin status shows both scheduler@ entries
Verify that the logs contain a statement like 'New lock acquired by scheduler@xxxxxxxx'
Verify that a log statement exists identifying the extra celerybeat as a hot spare.

Failover due to graceful shutdown

Start both celerybeat processes as described above.

Ctrl-C the celerybeat that has acquired the lock.

Verify that within 5 seconds the logs emit a statement like New lock acquired by 'scheduler@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and that the other celerybeat is the new primary celerybeat instance
Verify that pulp-admin status shows only one scheduler@ entry
Verify that no Errors are shown in the logs

Failover due to killing

Start both celerybeat processes as described above.

kill -9 the PID of the celerybeat that holds the lock. If you started celerybeat A first, it will have acquired the lock.
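
For example, assuming celerybeat A holds the lock, kill it via the pidfile given above:

sudo kill -9 $(cat /var/run/pulp/celerybeatA.pid)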

Verify that within 30 seconds the logs emit a statement like New lock acquired by 'scheduler@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and that the other celerybeat is the new primary celerybeat instance
Verify that an Error is emitted saying something like: Worker 'scheduler@xxxx' has gone missing
Verify that pulp-admin status shows only one scheduler@ entry

@preethit

I will do this manually.


nixocio commented Oct 31, 2017

This issue was manually tested. It relates to the old Pulp 2.12 release series. Re-open this issue if needed.

nixocio closed this as completed Oct 31, 2017