
rbd-mirror A/A: InstanceWatcher watch/notify stub for leader/follower RPC #13312

Merged
merged 3 commits into ceph:master from wip-18783 on Feb 24, 2017

Conversation

@trociny (Contributor) commented Feb 8, 2017

Fixes: http://tracker.ceph.com/issues/18783
Signed-off-by: Mykola Golub mgolub@mirantis.com

@trociny (Contributor, Author) commented Feb 8, 2017

@dillaman Implemented according to [1], though there are some questions to discuss:

  1. We could use the "rbd_mirror_leader" object for the purpose we introduced "rbd_mirror_instances" for.

  2. Instead of adding a record to "rbd_mirror_instances", the instance could send a notification to the leader, so only the leader would manage instance records.

  3. Not sure how useful the up/down state is. The purpose I can imagine is to handle a situation where the leader crashes (is blacklisted) while removing the instance, so a new leader is aware of the unfinished removal. But I suppose the new leader would eventually figure out that the instance needs to be removed, if it really misbehaves. Also, the fact that the old leader crashed (was blacklisted) may be an indication that we should not trust its decision to remove the instance.

[1] http://tracker.ceph.com/issues/18783

@trociny (Contributor, Author) commented Feb 10, 2017

@dillaman Yet another question. Working on the next subtask [1] I need to blacklist an instance that does not respond to heartbeats. The problem is learning that instance's address. I could use list_watchers on the rbd_mirror_instance.<client instance id> object and find the address matching the watcher's instance id, but this does not look very nice to me.

Maybe when an instance creates and starts watching the "rbd_mirror_instance." object it should also acquire the lock on it? Then I could just use break_lock to blacklist it.

[1] http://tracker.ceph.com/issues/18784
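
For reference, a minimal sketch of the list_watchers approach described above (not code from this PR; the object name and the watcher-id matching are assumptions based on the description):

#include <cerrno>
#include <list>
#include <string>
#include "include/rados/librados.hpp"

// Look up the network address of the rbd-mirror instance watching its
// per-instance object, so the leader could blacklist that address.
int find_instance_address(librados::IoCtx &io_ctx,
                          const std::string &instance_id,
                          std::string *address) {
  std::list<obj_watch_t> watchers;
  int r = io_ctx.list_watchers("rbd_mirror_instance." + instance_id, &watchers);
  if (r < 0) {
    return r;
  }
  for (auto &watcher : watchers) {
    // assumption: the rados client (global) id of the watcher matches the
    // rbd-mirror instance id
    if (std::to_string(watcher.watcher_id) == instance_id) {
      *address = watcher.addr;
      return 0;
    }
  }
  return -ENOENT;
}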

@dillaman commented Feb 13, 2017

@trociny

We could use the "rbd_mirror_leader" object for the purpose we introduced "rbd_mirror_instances" for.

I'm fine w/ that -- I just suggested a new object to keep its purpose clear.

Instead of adding a record to "rbd_mirror_instances", the instance could send a notification to the leader, so only the leader would manage instance records.

The trouble is that we need to ensure that the client id state mapping table has a record for a client before said client creates its watch/notify object. Otherwise, we could have uncontrolled growth of orphaned objects. If the slave needs to send a message to the leader saying "I'm alive" (or the leader automatically adds the up entry when it receives the ack), we need a new broadcast message on the "rbd_mirror_leader" object from the leader to the slave saying "watch your object now" (or vice versa, where the slave tells the leader that it has now started listening to its RPC object). Perhaps it's just a state payload as part of the leader's heartbeat, where the slave includes its current state (i.e. waiting for leader, ready, etc.).

Not sure how useful the up/down state is. The purpose I can imagine is to handle a situation where the leader crashes (is blacklisted) while removing the instance, so a new leader is aware of the unfinished removal. But I suppose the new leader would eventually figure out that the instance needs to be removed, if it really misbehaves. Also, the fact that the old leader crashed (was blacklisted) may be an indication that we should not trust its decision to remove the instance.

The purpose of tracking the clients is to allow for the cleanup of orphaned "rbd_mirror_instance.XYZ" objects. Once the leader has confirmed that it has removed the "rbd_mirror_instance.XYZ" object for the blacklisted slave, it can remove the entry from the table. Therefore, perhaps "down" is not the best term -- perhaps "blacklisting" or "pruning".

@dillaman

@trociny

Maybe when an instance creates and starts watching the "rbd_mirror_instance." object it should also acquire the lock on it? Then I could just use break_lock to blacklist it.

Sorry -- yes, that is the goal. The original design had the rbd-mirror daemons taking new "rbd-mirror" ManagedLocks on each image, but I (in my head at least) simplified this to just the instance-specific object.

bufferlist bl;
::encode(state, bl);

int r = cls_cxx_map_set_val(hctx, instance_key(instance_id), &bl);

Assuming the current approach where the slave writes its record before watching the "rbd_mirror_object" is kept, we should probably disallow the transition from down -> up state since that shouldn't be possible.
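
A minimal sketch of such a guard inside the cls method (illustrative only; the exact state type handling and the enum constant names are assumptions, while hctx, instance_id, and state come from the surrounding method shown above):

// reject a down -> up transition by checking any existing record first
bufferlist existing_bl;
int r = cls_cxx_map_get_val(hctx, instance_key(instance_id), &existing_bl);
if (r == 0) {
  cls::rbd::MirrorInstanceState existing_state;
  auto it = existing_bl.begin();
  ::decode(existing_state, it);
  if (existing_state == cls::rbd::MIRROR_INSTANCE_STATE_DOWN &&
      state == cls::rbd::MIRROR_INSTANCE_STATE_UP) {
    return -EINVAL;  // a "down" instance should not silently come back "up"
  }
} else if (r != -ENOENT) {
  return r;
}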

@trociny (Contributor, Author) replied

Actually, I was thinking about providing functions like the ones below instead of instances_set:

  • instances_add/register -- add a record in up state
  • instances_down/pruning -- change a record state to down
  • instances_remove/unregister -- remove a record

I can do it either way: either add a check to instances_set or introduce add/down functions instead of set. Which do you like more?
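
A rough sketch of what the proposed cls_client wrappers could look like (hedged: the names follow the suggestion above and the mirror_ prefix already used by mirror_instances_set; the final signatures in the PR may differ, and the down/pruning variant is omitted here since the later discussion drops the "down" state):

#include <string>
#include "include/rados/librados.hpp"

namespace librbd {
namespace cls_client {

// queue an omap update that records the instance (implicitly in the "up" state)
void mirror_instances_add(librados::ObjectWriteOperation *op,
                          const std::string &instance_id);

// queue removal of the instance record once its object has been cleaned up
void mirror_instances_remove(librados::ObjectWriteOperation *op,
                             const std::string &instance_id);

} // namespace cls_client
} // namespace librbd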

uint64_t notifier_id, bufferlist &bl);

private:
std::string m_instance_id;

Nit: can you create a state diagram?

assert(m_lock.is_locked());

librados::ObjectWriteOperation op;
librbd::cls_client::mirror_instances_set(&op, m_instance_id,

The instance should be recorded before the "rbd_instance.XYZ" object is created to prevent untracked, orphaned objects in the case of failure.
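
Sketched ordering implied by this comment (the function names below are placeholders, not the PR's API):

#include <functional>

using Callback = std::function<void(int)>;

// placeholder hooks standing in for the real cls call and watch/notify setup
void register_instance(Callback on_done);       // add record to rbd_mirror_instances
void create_instance_object(Callback on_done);  // create + watch rbd_mirror_instance.<id>

// Record first, object second: a crash in between leaves only a table entry
// (cheap for the leader to prune) instead of an untracked, orphaned object.
void bootstrap_instance(Callback on_finish) {
  register_instance([on_finish](int r) {
    if (r < 0) {
      on_finish(r);
      return;
    }
    create_instance_object(on_finish);
  });
}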

@trociny (Contributor, Author) commented Feb 13, 2017

The purpose of tracking the clients is to allow for the cleanup of orphaned "rbd_mirror_instance.XYZ" objects. Once the leader has confirmed that it has removed the "rbd_mirror_instance.XYZ" object for the blacklisted slave, it can remove the entry from the table.

Agree that the record should be removed only after the object is removed. But the "down" state still does not look necessary. The record could remain in the up state while pruning -- if pruning is interrupted and a new leader reads the records, it will assume the instance is up but will eventually prune it after not receiving responses to heartbeats. So the difference with a "down" state is just that it would do this a little faster. Or am I missing some other scenario where a "down" state would be useful?

Actually I am fine with a "down" state, I just want to make this clear.

@trociny (Contributor, Author) commented Feb 13, 2017

@dillaman Agree that an instance adding a record about itself instead of sending a message to the leader may be simpler (no need to wait for an ack from the leader), so I think I will do it this way. But we still need a way to notify the leader when a new record is added, so it can refresh its instances table (or just add the instance from the notification). Without this notification it will learn about the new instance only after sending a heartbeat and receiving the response, but the instance might already be dead by that time. A periodic refresh could work, though it does not look very nice. What do you suggest?

@dillaman

@trociny

Agree that the record should be removed only after the object is removed. But the "down" state still does not look necessary. The record could remain in the up state while pruning -- if pruning is interrupted and a new leader reads the records, it will assume the instance is up but will eventually prune it after not receiving responses to heartbeats. So the difference with a "down" state is just that it would do this a little faster. Or am I missing some other scenario where a "down" state would be useful?

The failure case was the rationale -- but you are correct that even with a leader failure it should eventually figure things out. If you leave out the state now and a better use case is determined, we have plenty of time to add it back in before the Luminous release. For a first pass of A/A support, leaving it out results in fewer code paths to test.

Agree that an instance adding a record about itself instead of sending a message to the leader may be simpler (no need to wait for an ack from the leader), so I think I will do it this way. But we still need a way to notify the leader when a new record is added, so it can refresh its instances table (or just add the instance from the notification). Without this notification it will learn about the new instance only after sending a heartbeat and receiving the response, but the instance might already be dead by that time. A periodic refresh could work, though it does not look very nice. What do you suggest?

I was just trying to keep it simple. Note that the leader only has to read the table once to initialize its peer instance table. Right now the heartbeat is sent out pretty regularly (5 secs?) so I don't think it's too long of a period to wait to have the leader detect a new peer. The peer could die at any time and the leader would need to handle it regardless after X number of missed heartbeat acks from the client.

@trociny (Contributor, Author) commented Feb 14, 2017

@dillaman

I was just trying to keep it simple. Note that the leader only has to read the table once to initialize its peer instance table. Right now the heartbeat is sent out pretty regularly (5 secs?) so I don't think it's too long of a period to wait to have the leader detect a new peer. The peer could die at any time and the leader would need to handle it regardless after X number of missed heartbeat acks from the client.

Let's consider a case:

  1. the leader reads the table on start
  2. a new instance is started, registers itself in the table and creates its object
  3. this new instance crashes just after starting, before the leader sends the next heartbeat

As a result the instance object will exist until the leader is restarted and the table is reread (and it may remain forever if the user decides not to start rbd-mirror for this pool any more).

Not a big deal, but it would still be nice to handle this.

E.g. before adding the record, the instance would send heartbeat_ack (though the message name would be a little confusing) so the leader adds it to its in-memory table.

@dillaman

@trociny I think I am comfortable w/ that possibility since it eliminates extra messages and provides a consistent "rbd_instance.XYZ" bootstrap path for both the leader and the slaves. The (temporarily) orphaned object issue could be avoided by periodically transitioning the leader (i.e. every X hrs, the current leader automatically releases the lock so all other instances can attempt to acquire it). That would clean up the orphans and would continuously exercise the leader failover logic.

@trociny (Contributor, Author) commented Feb 15, 2017

@dillaman updated

@trociny (Contributor, Author) commented Feb 15, 2017

@dillaman Sorry, I forgot to add the acquire/release lock steps on the "rbd_mirror_instance.XYZ" object. Working on it.

@trociny (Contributor, Author) commented Feb 15, 2017

@dillaman updated

@trociny force-pushed the wip-18783 branch 2 times, most recently from e5c3599 to eab475b on February 16, 2017 13:06
@trociny (Contributor, Author) commented Feb 16, 2017

@dillaman test_InstanceWatcher.cc updated (I had to fix a test that worked on the librados test stub but failed on a real cluster).

* none
*
* Output:
* @param std::map<std::string, cls::rbd::MirrorInstanceState>: instance id to state map

Nit: update output

/**
* Input:
* @param instance_id (std::string)
* @param state (cls::rbd::MirrorInstanceState)

Nit: update input

assert(!m_instance_lock);

m_instance_lock.reset(
new librbd::ManagedLock<I>(m_ioctx, m_work_queue, m_oid, this,

Nit: can you add a static ManagedLock::create method and avoid the direct allocation? That would allow the mock test to combine the specialized ManagedLock and MockManagedLock. If you switch from using a unique_ptr, you can also add a ManagedLock::destroy method (like ImageCtx::destroy) to mock out the deallocation.
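
A minimal sketch of the create()/destroy() pattern being requested (the class, template parameter, and constructor below are simplified placeholders, not librbd's actual ManagedLock interface):

#include <string>

template <typename WatcherT>
class ExampleLock {
public:
  // static factory: a mock test can substitute a MockManagedLock by mocking
  // create() instead of intercepting operator new in the production code
  static ExampleLock *create(const std::string &oid, WatcherT *watcher) {
    return new ExampleLock(oid, watcher);
  }

  // mirrors ImageCtx::destroy so deallocation can be mocked out as well
  void destroy() {
    delete this;
  }

private:
  ExampleLock(const std::string &oid, WatcherT *watcher)
    : m_oid(oid), m_watcher(watcher) {
  }
  ~ExampleLock() = default;

  std::string m_oid;
  WatcherT *m_watcher;
};

With this shape the owning class keeps a raw pointer and calls destroy() during shutdown, instead of holding a unique_ptr that would bypass the mockable deallocation path.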

#ifndef CEPH_RBD_MIRROR_INSTANCE_WATCHER_H
#define CEPH_RBD_MIRROR_INSTANCE_WATCHER_H

#include "librbd/ManagedLock.h"

Nit: can this be forward declared since it's just keeping a pointer?
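
Sketch of the forward declaration suggested here (class names other than ManagedLock are illustrative), since the header only stores a pointer:

namespace librbd {
template <typename ImageCtxT> class ManagedLock;  // forward declaration only
}

template <typename ImageCtxT>
class ExampleInstanceWatcher {
  // pointer member -- the full definition is only needed in the .cc file,
  // which can include "librbd/ManagedLock.h" itself
  librbd::ManagedLock<ImageCtxT> *m_instance_lock = nullptr;
};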

@trociny (Contributor, Author) commented Feb 20, 2017

@dillaman
Updated according to your comments. I have also extended this for a "not owner" InstanceWatcher. This mode is going to be used by the leader to send InstanceWatcher RPC messages. This "slave" instance is also planned to be used to break the owner lock and unregister the instance if it is dead. For this I planned to add an InstanceWatcher take_instance_ownership method. After the method returns, the previous owner is supposed to be blacklisted if it was still alive. After the instance is owned, shut_down will do the necessary instance cleanup.

But evaluating this I ran into the following issues:

  1. It is possible that the "owner" and "not owner" instances are started almost simultaneously, and then the "not owner" instance watcher init might fail because the instance object has not been created yet.

  2. When the "owner" instance shuts down, no notifications are planned to be sent to the leader. So the leader will learn the instance is down only after missing heartbeat acks. By this time the instance object (which the leader is still listening on) is normally already removed, so I expect some error messages from the watcher (which is not very nice). And actually I can't see how the leader can tell whether it was a normal instance shutdown or the instance got stuck.

(1) is not difficult to resolve -- if the "not owner" instance watcher init fails, and the instance is alive, it will send a heartbeat ack soon, and we can re-init the watcher on this event. If no heartbeat ack is received, we believe it is dead and start the instance removal procedure.

For (2) it is more complicated. I can imagine the following states when take_instance_ownership is called:

  • the previous owner has already shut down in a clean way, removing the object and the record
  • the previous owner got stuck and the object is locked
  • the previous owner crashed, leaving only the record (or both the object and the record)
  • the previous owner is just shutting down...

I suppose the take_instance_ownership procedure could be:

  1. try to acquire the lock; if it succeeds or fails with ENOENT (object removed), we are the new owner.
  2. get the locker; if it fails with ENOENT, we are the new owner.
  3. break the lock and go to (1)

Not sure what to do if getting the locker in step (2) returns an error other than ENOENT, though.
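
A synchronous sketch of that loop using plain librados calls (the real implementation would be an async state machine; the lock name and cookie below are placeholders):

#include <cerrno>
#include <list>
#include <string>
#include "include/rados/librados.hpp"

int take_instance_ownership(librados::IoCtx &io_ctx, const std::string &oid) {
  while (true) {
    // (1) try to acquire the lock; success or ENOENT (object already removed)
    //     means we can consider ourselves the new owner
    int r = io_ctx.lock_exclusive(oid, "instance_lock", "cookie", "", nullptr, 0);
    if (r == 0 || r == -ENOENT) {
      return 0;
    }

    // (2) find the current locker; ENOENT or no lockers means the previous
    //     owner finished (or released) in the meantime -- retry the acquire
    int exclusive;
    std::string tag;
    std::list<librados::locker_t> lockers;
    r = io_ctx.list_lockers(oid, "instance_lock", &exclusive, &tag, &lockers);
    if (r == -ENOENT) {
      continue;
    } else if (r < 0) {
      return r;  // the open question above: some other error
    } else if (lockers.empty()) {
      continue;
    }

    // (3) break the stale lock (blacklisting would happen around here) and retry
    r = io_ctx.break_lock(oid, "instance_lock", lockers.front().client,
                          lockers.front().cookie);
    if (r < 0 && r != -ENOENT) {
      return r;
    }
  }
}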

@trociny (Contributor, Author) commented Feb 20, 2017

@dillaman If my reasoning above is not quite clear, I can push the PR for subtask [1] (which is currently in a messy state, so I need some time to clean it up).

[1] http://tracker.ceph.com/issues/18784

@dillaman

@trociny

Updated according to your comments. I have also extended this for a "not owner" InstanceWatcher. This mode is going to be used by the leader to send InstanceWatcher RPC messages. This "slave" instance is also planned to be used to break the owner lock and unregister the instance if it is dead. For this I planned to add an InstanceWatcher take_instance_ownership method. After the method returns, the previous owner is supposed to be blacklisted if it was still alive. After the instance is owned, shut_down will do the necessary instance cleanup.

Why would the leader need to watch on the instance? I would think we only need to expose a "break lock" helper that invokes a state machine that retrieves the current lock owner for the instance object and invokes the break lock state machine. It shouldn't need to go through watching the instance object or attempting to lock the instance object. If the instance needs to send a direct message to the leader, it should send it to the instance object of the leader. I guess I was envisioning that regardless of leader or non-leader status, the instances would bootstrap up an instance watcher for themselves.

When the "owner" instance shuts down, no notifications are planned to be sent to the leader. So the leader will learn the instance is down only after missing heartbeat acks. By this time the instance object (which the leader is still listening on) is normally already removed, so I expect some error messages from the watcher (which is not very nice). And actually I can't see how the leader can tell whether it was a normal instance shutdown or the instance got stuck.

We could add a nice "I am nicely shutting down" message that could be sent to the leader's instance watcher. Perhaps we just need an instance "state" message that can be sent out for "ready" / "not ready" for image assignments -- and in an orderly shutdown the instance can send "not ready" and can send "image XYZ released" messages to the leader so that the images can be quickly re-assigned to another instance (or we just have a "ready for images", "shutting down", and "shut down" state to prevent assignment of new images when in the "shutting down" state and the assumption that all replayers have been properly shut down when the "shut down" state is reached).
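
Illustrative only -- one way the per-instance state payload suggested above could be modeled (the enum and its constant names are not from this PR):

enum MirrorInstanceNotifyState {
  INSTANCE_STATE_READY = 0,      // accepting image assignments
  INSTANCE_STATE_SHUTTING_DOWN,  // no new assignments; releasing current images
  INSTANCE_STATE_SHUT_DOWN       // all replayers stopped; images safe to reassign
};

On seeing SHUTTING_DOWN the leader would stop handing out new images to that instance, and on SHUT_DOWN it could reassign the remaining ones immediately instead of waiting for missed heartbeat acks.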

@trociny (Contributor, Author) commented Feb 21, 2017

@dillaman

Why would the leader need to watch on the instance? I would think we only need to expose a "break lock" helper that invokes a state machine that retrieves the current lock owner for the instance object and invokes the break lock state machine. It shouldn't need to go through watching the instance object or attempting to lock the instance object. If the instance needs to send a direct message to the leader, it should send it to the instance object of the leader. I guess I was envisioning that regardless of leader or non-leader status, the instances would bootstrap up an instance watcher for themselves.

I thought watcher RPC messages would use the same instance object. I.e., if the leader needs to send an "Image Acquire" RPC message to an instance, it sends it to that instance's object, and the instance sends an "Image Acquired" ack back to the leader through the same instance object.

Now I see you want ack messages sent through the leader's instance object. I will try it this way. It looks doable, though I'm not sure yet about complications (races) that might appear on a leader change or an instance shutdown (crash). Intuitively, using one object would make handling this easier.

@trociny (Contributor, Author) commented Feb 21, 2017

@dillaman updated

@dillaman left a comment

lgtm

std::string m_instance_id;
std::string m_oid;

virtual void SetUp() {

Nit: we are switching over to using the "override" keyword everywhere in derived classes in place of "virtual".
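
The requested form, in a self-contained gtest example (the fixture name is illustrative):

#include "gtest/gtest.h"

class TestInstanceWatcher : public ::testing::Test {
public:
  // "override" in place of "virtual": same behavior, but the compiler now
  // rejects the method if it does not actually override a base-class virtual.
  void SetUp() override {
    // per-test setup goes here
  }
};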

std::string m_instance_id;
std::string m_oid;

virtual void SetUp() {

Nit: switch to "override" keyword

: instance_ids(instance_ids), on_finish(on_finish) {
}

virtual void finish(int r) {

Nit: switch to "override" keyword

instance_watcher.remove(this);
}

virtual void finish(int r) {

Nit: switch to "override" keyword


InstanceWatcher(librados::IoCtx &io_ctx, ContextWQ *work_queue,
const boost::optional<std::string> &id = boost::none);
virtual ~InstanceWatcher();

Nit: switch to "override" keyword

void remove(Context *on_finish);

protected:
virtual void handle_notify(uint64_t notify_id, uint64_t handle,

Nit: switch to "override" keyword

@@ -292,6 +295,14 @@ int Replayer::init()
return r;
}

m_instance_watcher.reset(
new InstanceWatcher<>(m_local_io_ctx, m_threads->work_queue));

Nit: use a new static create method to reduce the workload when we get mock tests for Replayer
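
A sketch of how the call site could look once such a factory exists (assuming an InstanceWatcher<>::create taking the same arguments as the constructor shown above):

m_instance_watcher.reset(
    InstanceWatcher<>::create(m_local_io_ctx, m_threads->work_queue));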

Mykola Golub added 3 commits February 23, 2017 12:57
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <mgolub@mirantis.com>

@trociny (Contributor, Author) commented Feb 23, 2017

@dillaman updated

@dillaman merged commit 38405b8 into ceph:master on Feb 24, 2017
@trociny deleted the wip-18783 branch on February 28, 2017 07:30