30 second Failure Detection and Process Failover #474

Closed
bmbouter opened this issue Jan 12, 2017 · 2 comments

bmbouter commented Jan 12, 2017

This is to test a rather large story: https://pulp.plan.io/issues/2509

Sanity Checking

Start Pulp normally and perform a basic sync to sanity check Pulp.
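
For example, a minimal sync sketch, assuming the RPM plugin is installed; the repo id and feed URL are illustrative only:

pulp-admin login -u admin -p admin
pulp-admin rpm repo create --repo-id=zoo --feed=https://repos.fedorapeople.org/pulp/pulp/demo_repos/zoo/
pulp-admin rpm repo sync run --repo-id=zoo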

Start Pulp normally, allow the system to idle for 40 seconds, and observe the logs. Verify that no errors are reported while Pulp idles over the 40 seconds.
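
One way to watch the logs while idling, assuming Pulp is logging to the systemd journal (the default on EL7); the filter pattern is just an example:

sudo journalctl -f | grep -iE 'pulp|celery|error'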

Upgrade test

This is a one-time manual test on an RPM-based installation on EL7.
This is not a dev test and likely should not be automated.

Upgrade to this release from an earlier release of Pulp.
Verify that the /usr/lib/systemd/system/pulp_resource_manager.service contains the line: --heartbeat-interval=5
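
A quick way to check this, for example:

grep -- '--heartbeat-interval=5' /usr/lib/systemd/system/pulp_resource_manager.service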

Worker Failure Testing

With Pulp normally started, look at the output of pulp-admin status and verify all expected workers are present. Then kill -9 a specific worker, for example reserved_resource_worker-0 with sudo pkill -9 -f reserved_resource_worker-0.

After 30 seconds have passed, check pulp-admin status.
Verify that the worker is no longer shown. The status API is expected to stop showing a killed worker 30 seconds after the kill occurs.
Verify that the logs contain an error stating that the worker has gone missing (see the log check below).
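
A possible log and status check, assuming journald logging on EL7; the exact wording of the missing-worker message may differ:

sudo journalctl --since "5 minutes ago" | grep -i "gone missing"
pulp-admin status | grep reserved_resource_worker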

Resource Manager Failover Testing

Testing normal concurrent operation

Start Pulp normally, but keep the resource manager stopped. Then, in one terminal on the box, run resource manager A with:

sudo -u apache /bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@boxA -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_managerA.pid --heartbeat-interval=5

On a second terminal run resource manager B with:

sudo -u apache /bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@boxB -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_managerB.pid --heartbeat-interval=5

Verify that you see the full Celery "banner" with stars in resource manager A.
Verify that you see a log statement that resource manager A has acquired the lock.
Verify that you do not see any output from resource manager B.
Verify that you see a log statement that resource manager B is a hot spare.
Verify that both resource managers are reported via pulp-admin status (see the check below)
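
One way to check the last item, assuming the node names used in the commands above (resource_manager@boxA and resource_manager@boxB should both appear):

pulp-admin status | grep resource_manager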

Failover due to graceful shutdown

Start both resource_manager processes as described above.

Ctrl+C from resource manager A
Verify that within 5 seconds the logs emit a statement like new lock acquired by 'resource_manager@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and resource manager B is the new primary resource manager
Verify that resource manager B displays the celery banner with stars within 5 seconds of the Ctrl-C
Verify that only resource manager B is shown in pulp-admin status

Failover due to killing

Start both resource_manager processes as described above.

kill -9 the resource_manager that has acquired the lock
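
For example, assuming resource manager A holds the lock (it typically does if it was started first), it can be killed via the pidfile given on its command line:

sudo kill -9 $(cat /var/run/pulp/resource_managerA.pid)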

Verify that resource manager B displays the celery banner with stars within 30 seconds of the kill
Verify that within 5 seconds the logs emit a statement like new lock acquired by 'resource_manager@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and resource manager B is the new primary resource manager
Verify that only resource manager B is shown in pulp-admin status

Celerybeat Failover Testing

This can't be done on one computer without changing Pulp code, because both celerybeats would have the same name. You can do the same test on two machines, with one celerybeat on each. I'm going to apply this diff to the code to allow me to do it on one machine in two separate terminals.

diff --git a/server/pulp/server/async/scheduler.py b/server/pulp/server/async/scheduler.py
index d745406..b158386 100644
--- a/server/pulp/server/async/scheduler.py
+++ b/server/pulp/server/async/scheduler.py
@@ -28,8 +28,9 @@ import pulp.server.logs  # noqa
 
 _logger = logging.getLogger(__name__)
 
+import random
 # setting the celerybeat name
-CELERYBEAT_NAME = constants.SCHEDULER_WORKER_NAME + "@" + platform.node()
+CELERYBEAT_NAME = constants.SCHEDULER_WORKER_NAME + "@" + str(random.randint(1,10000))
 
 
 class EventMonitor(threading.Thread):

Testing normal concurrent operation

Start Pulp normally, but keep celerybeat stopped. Then in one terminal on the box run celerybeat A with:

sudo -u apache /bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler --pidfile=/var/run/pulp/celerybeatA.pid

On a second terminal run celerybeat B with:

sudo -u apache /bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler --pidfile=/var/run/pulp/celerybeatB.pid

Verify that pulp-admin status shows both scheduler@ entries
Verify that the logs contain a statement like 'New lock acquired by scheduler@xxxxxxxx'
Verify that a log statement exists identifying the extra celerybeat as a hot spare.

Failover due to graceful shutdown

Start both celerybeat processes as described above.

Ctrl-C the celerybeat that has acquired the lock.

Verify that within 5 seconds the logs emit a statement like New lock acquired by 'scheduler@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and that the other celerybeat is the new primary celerybeat instance
Verify that pulp-admin status shows only one scheduler@ entry
Verify that no Errors are shown in the logs

Failover due to killing

Start both celerybeat processes as described above.

kill -9 the PID of the celerybeat that holds the lock. If you started celerybeat A first, it will have acquired the lock.
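
For example, assuming celerybeat A holds the lock, kill it via the pidfile given above:

sudo kill -9 $(cat /var/run/pulp/celerybeatA.pid)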

Verify that within 30 seconds the logs emit a statement like New lock acquired by 'scheduler@yyyyyyy'
Verify that you see a log statement stating that failover has occurred, and that the other celerybeat is the new primary celerybeat instance
Verify that an Error is emitted saying something like: Worker 'scheduler@xxxx' has gone missing
Verify that pulp-admin status shows only one scheduler@ entry

@preethit

I will do this manually.


nixocio commented Oct 31, 2017

This issue was manually tested. It relates to the old Pulp 2.12 release series. Re-open this issue if needed.

nixocio closed this as completed Oct 31, 2017