
rbd-mirror A/A: InstanceWatcher watch/notify stub for leader/follower RPC #13312

Merged
merged 3 commits into ceph:master from wip-18783 on Feb 24, 2017

Conversation

@trociny (Contributor) commented Feb 8, 2017

Fixes: http://tracker.ceph.com/issues/18783
Signed-off-by: Mykola Golub mgolub@mirantis.com

@trociny (Contributor, Author) commented Feb 8, 2017

@dillaman Implemented according to [1], though there are some questions to discuss:

  1. We could use the "rbd_mirror_leader" object for the purpose we introduced "rbd_mirror_instances" for.

  2. Instead of adding a record to "rbd_mirror_instances", the instance could send a notification to the leader, so only the leader would manage instance records.

  3. Not sure how useful the up/down state is. The purpose I can imagine is to handle a situation where the leader crashes (is blacklisted) while removing the instance, so a new leader is aware of the unfinished removal. But I suppose the new leader would eventually figure out that the instance needs to be removed, if it really misbehaves. Also, the fact that the old leader crashed (was blacklisted) may be an indication that we should not trust its decision to remove the instance.

[1] http://tracker.ceph.com/issues/18783

@trociny (Contributor, Author) commented Feb 10, 2017

@dillaman Yet another question. Working on the next subtask [1] I need to blacklist an instance that does not respond to heartbeats. The problem is learning that instance's address. I could use list_watchers on the rbd_mirror_instance.<client instance id> object and find the address matching the watcher's instance id, but this does not look very nice to me.

Maybe when an instance creates and starts watching the "rbd_mirror_instance." object it should also acquire the lock on it? Then I could just use break_lock to blacklist it.

[1] http://tracker.ceph.com/issues/18784
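
For reference, a minimal sketch of the list_watchers approach described above (not code from this PR; the object name and the watcher-id matching are assumptions based on the description):

#include <cerrno>
#include <list>
#include <string>
#include "include/rados/librados.hpp"

// Look up the network address of the rbd-mirror instance watching its
// per-instance object, so the leader could blacklist that address.
int find_instance_address(librados::IoCtx &io_ctx,
                          const std::string &instance_id,
                          std::string *address) {
  std::list<obj_watch_t> watchers;
  int r = io_ctx.list_watchers("rbd_mirror_instance." + instance_id, &watchers);
  if (r < 0) {
    return r;
  }
  for (auto &watcher : watchers) {
    // assumption: the rados client (global) id of the watcher matches the
    // rbd-mirror instance id
    if (std::to_string(watcher.watcher_id) == instance_id) {
      *address = watcher.addr;
      return 0;
    }
  }
  return -ENOENT;
}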

@dillaman commented Feb 13, 2017

@trociny

We could use the "rbd_mirror_leader" object for the purpose we introduced "rbd_mirror_instances" for.

I'm fine w/ that -- I just suggested a new object to keep its purpose clear.

Instead of adding a record to "rbd_mirror_instances", the instance could send a notification to the leader, so only the leader would manage instance records.

The trouble is that we need to ensure that the client id state mapping table has a record for a client before said client creates its watch/notify object. Otherwise, we could have uncontrolled growth of orphaned objects. If the slave needs to send a message to the leader saying "I'm alive" (or the leader automatically adds the up entry when it receives the ack), we need a new broadcast message on the "rbd_mirror_leader" object from the leader to the slave saying "watch your object now" (or vice versa, where the slave tells the leader that it has now started listening to its RPC object). Perhaps it's just a state payload as part of the leader's heartbeat, where the slave includes its current state (i.e. waiting for leader, ready, etc.).

Not sure how useful the up/down state is. The purpose I can imagine is to handle a situation where the leader crashes (is blacklisted) while removing the instance, so a new leader is aware of the unfinished removal. But I suppose the new leader would eventually figure out that the instance needs to be removed, if it really misbehaves. Also, the fact that the old leader crashed (was blacklisted) may be an indication that we should not trust its decision to remove the instance.

The purpose of tracking the clients is to allow for the cleanup of orphaned "rbd_mirror_instance.XYZ" objects. Once the leader has confirmed that it has removed the "rbd_mirror_instance.XYZ" object for the blacklisted slave, it can remove the entry from the table. Therefore, perhaps "down" is not the best term -- perhaps "blacklisting" or "pruning".

@dillaman

@trociny

Maybe when an instance creates and starts watching the "rbd_mirror_instance." object it should also acquire the lock on it? Then I could just use break_lock to blacklist it.

Sorry -- yes, that is the goal. The original design had the rbd-mirror daemons taking new "rbd-mirror" ManagedLocks on each image, but I (in my head at least) simplified this to just the instance-specific object.

bufferlist bl;
::encode(state, bl);

int r = cls_cxx_map_set_val(hctx, instance_key(instance_id), &bl);

Assuming the current approach where the slave writes its record before watching the "rbd_mirror_object" is kept, we should probably disallow the transition from down -> up state since that shouldn't be possible.
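
A minimal sketch of such a guard inside the cls method (illustrative only; the exact state type handling and the enum constant names are assumptions, while hctx, instance_id, and state come from the surrounding method shown above):

// reject a down -> up transition by checking any existing record first
bufferlist existing_bl;
int r = cls_cxx_map_get_val(hctx, instance_key(instance_id), &existing_bl);
if (r == 0) {
  cls::rbd::MirrorInstanceState existing_state;
  auto it = existing_bl.begin();
  ::decode(existing_state, it);
  if (existing_state == cls::rbd::MIRROR_INSTANCE_STATE_DOWN &&
      state == cls::rbd::MIRROR_INSTANCE_STATE_UP) {
    return -EINVAL;  // a "down" instance should not silently come back "up"
  }
} else if (r != -ENOENT) {
  return r;
}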

@trociny (Contributor, Author) replied

Actually, I was thinking about providing functions like the ones below instead of instances_set:

  • instances_add/register -- add a record in up state
  • instances_down/pruning -- change a record state to down
  • instances_remove/unregister -- remove a record

I can do it either way: either add a check to instances_set or introduce add/down functions instead of set. Which do you like more?
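
A rough sketch of what the proposed cls_client wrappers could look like (hedged: the names follow the suggestion above and the mirror_ prefix already used by mirror_instances_set; the final signatures in the PR may differ, and the down/pruning variant is omitted here since the later discussion drops the "down" state):

#include <string>
#include "include/rados/librados.hpp"

namespace librbd {
namespace cls_client {

// queue an omap update that records the instance (implicitly in the "up" state)
void mirror_instances_add(librados::ObjectWriteOperation *op,
                          const std::string &instance_id);

// queue removal of the instance record once its object has been cleaned up
void mirror_instances_remove(librados::ObjectWriteOperation *op,
                             const std::string &instance_id);

} // namespace cls_client
} // namespace librbd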

uint64_t notifier_id, bufferlist &bl);

private:
std::string m_instance_id;

Nit: can you create a state diagram?

assert(m_lock.is_locked());

librados::ObjectWriteOperation op;
librbd::cls_client::mirror_instances_set(&op, m_instance_id,

The instance should be recorded before the "rbd_instance.XYZ" object is created to prevent untracked, orphaned objects in the case of failure.
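
Sketched ordering implied by this comment (the function names below are placeholders, not the PR's API):

#include <functional>

using Callback = std::function<void(int)>;

// placeholder hooks standing in for the real cls call and watch/notify setup
void register_instance(Callback on_done);       // add record to rbd_mirror_instances
void create_instance_object(Callback on_done);  // create + watch rbd_mirror_instance.<id>

// Record first, object second: a crash in between leaves only a table entry
// (cheap for the leader to prune) instead of an untracked, orphaned object.
void bootstrap_instance(Callback on_finish) {
  register_instance([on_finish](int r) {
    if (r < 0) {
      on_finish(r);
      return;
    }
    create_instance_object(on_finish);
  });
}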

@trociny (Contributor, Author) commented Feb 13, 2017

The purpose of tracking the clients is to allow for the cleanup of orphaned "rbd_mirror_instance.XYZ" objects. Once the leader has confirmed that it has removed the "rbd_mirror_instance.XYZ" object for the blacklisted slave, it can remove the entry from the table.

Agree that the record should be removed only after the object is removed. But the "down" state still does not look necessary. The record could remain in the up state while pruning -- if pruning is interrupted and a new leader reads the records, it will assume the instance is up but will eventually prune it after not receiving responses to heartbeats. So the difference with a "down" state is just that it would do this a little faster. Or am I missing some other scenario where a "down" state would be useful?

Actually I am fine with a "down" state, I just want to make this clear.

@trociny (Contributor, Author) commented Feb 13, 2017

@dillaman Agree that an instance adding a record about itself instead of sending a message to the leader may be simpler (no need to wait for an ack from the leader), so I think I will do it this way. But we still need a way to notify the leader when a new record is added, so it can refresh its instances table (or just add the instance from the notification). Without this notification it will learn about the new instance only after sending a heartbeat and receiving the response, but the instance might already be dead by that time. A periodic refresh could work, though it does not look very nice. What do you suggest?

@dillaman

@trociny

Agree that the record should be removed only after the object is removed. But the "down" state still does not look necessary. The record could remain in the up state while pruning -- if pruning is interrupted and a new leader reads the records, it will assume the instance is up but will eventually prune it after not receiving responses to heartbeats. So the difference with a "down" state is just that it would do this a little faster. Or am I missing some other scenario where a "down" state would be useful?

The failure case was the rationale -- but you are correct that even with a leader failure it should eventually figure things out. If you leave out the state now and a better use case is determined, we have plenty of time to add it back in before the Luminous release. For a first pass of A/A support, leaving it out results in fewer code paths to test.

Agree that an instance adding a record about itself instead of sending a message to the leader may be simpler (no need to wait for an ack from the leader), so I think I will do it this way. But we still need a way to notify the leader when a new record is added, so it can refresh its instances table (or just add the instance from the notification). Without this notification it will learn about the new instance only after sending a heartbeat and receiving the response, but the instance might already be dead by that time. A periodic refresh could work, though it does not look very nice. What do you suggest?

I was just trying to keep it simple. Note that the leader only has to read the table once to initialize its peer instance table. Right now the heartbeat is sent out pretty regularly (5 secs?) so I don't think it's too long of a period to wait to have the leader detect a new peer. The peer could die at any time and the leader would need to handle it regardless after X number of missed heartbeat acks from the client.

@trociny (Contributor, Author) commented Feb 14, 2017

@dillaman

I was just trying to keep it simple. Note that the leader only has to read the table once to initialize its peer instance table. Right now the heartbeat is sent out pretty regularly (5 secs?) so I don't think it's too long of a period to wait to have the leader detect a new peer. The peer could die at any time and the leader would need to handle it regardless after X number of missed heartbeat acks from the client.

Let's consider a case:

  1. the leader reads the table on start
  2. a new instance is started, registers itself in the table and creates its object
  3. this new instance crashes just after starting, before the leader sends the next heartbeat

As a result the instance object will exist until the leader is restarted and the table is reread (and it may remain forever if the user decides not to start rbd-mirror for this pool any more).

Not a big deal, but it would still be nice to handle this.

E.g. before adding the record, the instance would send heartbeat_ack (though the message name would be a little confusing) so the leader adds it to its in-memory table.

@dillaman

@trociny I think I am comfortable w/ that possibility since it eliminates extra messages and provides a consistent "rbd_instance.XYZ" bootstrap path for both the leader and the slaves. The (temporarily) orphaned object issue could be avoided by periodically transitioning the leader (i.e. every X hrs, the current leader automatically releases the lock so all other instances can attempt to acquire it). That would clean up the orphans and would continuously exercise the leader failover logic.

@trociny (Contributor, Author) commented Feb 15, 2017

@dillaman updated

@trociny (Contributor, Author) commented Feb 15, 2017

@dillaman Sorry, I forgot to add the acquire/release lock steps on the "rbd_mirror_instance.XYZ" object. Working on it.

@trociny (Contributor, Author) commented Feb 15, 2017

@dillaman updated

@trociny force-pushed the wip-18783 branch 2 times, most recently from e5c3599 to eab475b on February 16, 2017 13:06
@trociny (Contributor, Author) commented Feb 16, 2017

@dillaman test_InstanceWatcher.cc updated (I had to fix a test that worked on the librados test stub but failed on a real cluster).

* none
*
* Output:
* @param std::map<std::string, cls::rbd::MirrorInstanceState>: instance id to state map

Nit: update output

/**
* Input:
* @param instance_id (std::string)
* @param state (cls::rbd::MirrorInstanceState)

Nit: update input

assert(!m_instance_lock);

m_instance_lock.reset(
new librbd::ManagedLock<I>(m_ioctx, m_work_queue, m_oid, this,

Nit: can you add a static ManagedLock::create method and avoid the direct allocation? That would allow the mock test to combine the specialized ManagedLock and MockManagedLock. If you switch from using a unique_ptr, you can also add a ManagedLock::destroy method (like ImageCtx::destroy) to mock out the deallocation.
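
A minimal sketch of the create()/destroy() pattern being requested (the class, template parameter, and constructor below are simplified placeholders, not librbd's actual ManagedLock interface):

#include <string>

template <typename WatcherT>
class ExampleLock {
public:
  // static factory: a mock test can substitute a MockManagedLock by mocking
  // create() instead of intercepting operator new in the production code
  static ExampleLock *create(const std::string &oid, WatcherT *watcher) {
    return new ExampleLock(oid, watcher);
  }

  // mirrors ImageCtx::destroy so deallocation can be mocked out as well
  void destroy() {
    delete this;
  }

private:
  ExampleLock(const std::string &oid, WatcherT *watcher)
    : m_oid(oid), m_watcher(watcher) {
  }
  ~ExampleLock() = default;

  std::string m_oid;
  WatcherT *m_watcher;
};

With this shape the owning class keeps a raw pointer and calls destroy() during shutdown, instead of holding a unique_ptr that would bypass the mockable deallocation path.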

#ifndef CEPH_RBD_MIRROR_INSTANCE_WATCHER_H
#define CEPH_RBD_MIRROR_INSTANCE_WATCHER_H

#include "librbd/ManagedLock.h"

Nit: can this be forward declared since it's just keeping a pointer?
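
Sketch of the forward declaration suggested here (class names other than ManagedLock are illustrative), since the header only stores a pointer:

namespace librbd {
template <typename ImageCtxT> class ManagedLock;  // forward declaration only
}

template <typename ImageCtxT>
class ExampleInstanceWatcher {
  // pointer member -- the full definition is only needed in the .cc file,
  // which can include "librbd/ManagedLock.h" itself
  librbd::ManagedLock<ImageCtxT> *m_instance_lock = nullptr;
};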

@trociny (Contributor, Author) commented Feb 20, 2017

@dillaman
Updated according to your comments. I have also extended this for a "not owner" InstanceWatcher. This mode is going to be used by the leader to send InstanceWatcher RPC messages. This "slave" instance is also planned to be used to break the owner lock and unregister the instance if it is dead. For this I planned to add an InstanceWatcher take_instance_ownership method. After the method returns, the previous owner is supposed to be blacklisted if it was still alive. After the instance is owned, shut_down will do the necessary instance cleanup.

But evaluating this I ran into the following issues:

  1. It is possible that the "owner" and "not owner" instances are started almost simultaneously, and then the "not owner" instance watcher init might fail because the instance object has not been created yet.

  2. When the "owner" instance shuts down, no notifications are planned to be sent to the leader. So the leader will learn the instance is down only after missing heartbeat acks. By this time the instance object (which the leader is still listening on) is normally already removed, so I expect some error messages from the watcher (which is not very nice). And actually I can't see how the leader can tell whether it was a normal instance shutdown or the instance got stuck.

(1) is not difficult to resolve -- if the "not owner" instance watcher init fails, and the instance is alive, it will send a heartbeat ack soon, and we can re-init the watcher on this event. If no heartbeat ack is received, we believe it is dead and start the instance removal procedure.

For (2) it is more complicated. I can imagine the following states when take_instance_ownership is called:

  • the previous owner has already shut down in a clean way, removing the object and the record
  • the previous owner got stuck and the object is locked
  • the previous owner crashed, leaving only the record (or both the object and the record)
  • the previous owner is just shutting down...

I suppose the take_instance_ownership procedure could be:

  1. try to acquire the lock; if it succeeds or fails with ENOENT (object removed), we are the new owner.
  2. get the locker; if it fails with ENOENT, we are the new owner.
  3. break the lock and go to (1)

Not sure what to do if getting the locker in step (2) returns an error other than ENOENT, though.
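
A synchronous sketch of that loop using plain librados calls (the real implementation would be an async state machine; the lock name and cookie below are placeholders):

#include <cerrno>
#include <list>
#include <string>
#include "include/rados/librados.hpp"

int take_instance_ownership(librados::IoCtx &io_ctx, const std::string &oid) {
  while (true) {
    // (1) try to acquire the lock; success or ENOENT (object already removed)
    //     means we can consider ourselves the new owner
    int r = io_ctx.lock_exclusive(oid, "instance_lock", "cookie", "", nullptr, 0);
    if (r == 0 || r == -ENOENT) {
      return 0;
    }

    // (2) find the current locker; ENOENT or no lockers means the previous
    //     owner finished (or released) in the meantime -- retry the acquire
    int exclusive;
    std::string tag;
    std::list<librados::locker_t> lockers;
    r = io_ctx.list_lockers(oid, "instance_lock", &exclusive, &tag, &lockers);
    if (r == -ENOENT) {
      continue;
    } else if (r < 0) {
      return r;  // the open question above: some other error
    } else if (lockers.empty()) {
      continue;
    }

    // (3) break the stale lock (blacklisting would happen around here) and retry
    r = io_ctx.break_lock(oid, "instance_lock", lockers.front().client,
                          lockers.front().cookie);
    if (r < 0 && r != -ENOENT) {
      return r;
    }
  }
}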

@trociny (Contributor, Author) commented Feb 20, 2017

@dillaman If my reasoning above is not quite clear, I can push the PR for subtask [1] (which is currently in a messy state, so I need some time to clean it up).

[1] http://tracker.ceph.com/issues/18784

@dillaman

@trociny

Updated according to your comments. I have also extended this for a "not owner" InstanceWatcher. This mode is going to be used by the leader to send InstanceWatcher RPC messages. This "slave" instance is also planned to be used to break the owner lock and unregister the instance if it is dead. For this I planned to add an InstanceWatcher take_instance_ownership method. After the method returns, the previous owner is supposed to be blacklisted if it was still alive. After the instance is owned, shut_down will do the necessary instance cleanup.

Why would the leader need to watch on the instance? I would think we only need to expose a "break lock" helper that invokes a state machine that retrieves the current lock owner for the instance object and invokes the break lock state machine. It shouldn't need to go through watching the instance object or attempting to lock the instance object. If the instance needs to send a direct message to the leader, it should send it to the instance object of the leader. I guess I was envisioning that regardless of leader or non-leader status, the instances would bootstrap up an instance watcher for themselves.

When the "owner" instance shuts down, no notifications are planned to be sent to the leader. So the leader will learn the instance is down only after missing heartbeat acks. By this time the instance object (which the leader is still listening on) is normally already removed, so I expect some error messages from the watcher (which is not very nice). And actually I can't see how the leader can tell whether it was a normal instance shutdown or the instance got stuck.

We could add a nice "I am nicely shutting down" message that could be sent to the leader's instance watcher. Perhaps we just need an instance "state" message that can be sent out for "ready" / "not ready" for image assignments -- and in an orderly shutdown the instance can send "not ready" and can send "image XYZ released" messages to the leader so that the images can be quickly re-assigned to another instance (or we just have a "ready for images", "shutting down", and "shut down" state to prevent assignment of new images when in the "shutting down" state and the assumption that all replayers have been properly shut down when the "shut down" state is reached).
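
Illustrative only -- one way the per-instance state payload suggested above could be modeled (the enum and its constant names are not from this PR):

enum MirrorInstanceNotifyState {
  INSTANCE_STATE_READY = 0,      // accepting image assignments
  INSTANCE_STATE_SHUTTING_DOWN,  // no new assignments; releasing current images
  INSTANCE_STATE_SHUT_DOWN       // all replayers stopped; images safe to reassign
};

On seeing SHUTTING_DOWN the leader would stop handing out new images to that instance, and on SHUT_DOWN it could reassign the remaining ones immediately instead of waiting for missed heartbeat acks.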

@trociny (Contributor, Author) commented Feb 21, 2017

@dillaman

Why would the leader need to watch on the instance? I would think we only need to expose a "break lock" helper that invokes a state machine that retrieves the current lock owner for the instance object and invokes the break lock state machine. It shouldn't need to go through watching the instance object or attempting to lock the instance object. If the instance needs to send a direct message to the leader, it should send it to the instance object of the leader. I guess I was envisioning that regardless of leader or non-leader status, the instances would bootstrap up an instance watcher for themselves.

I thought watcher RPC messages would use the same instance object. I.e., if the leader needs to send an "Image Acquire" RPC message to an instance, it sends it to that instance's object, and the instance sends an "Image Acquired" ack back to the leader through the same instance object.

Now I see you want ack messages sent through the leader's instance object. I will try it this way. It looks doable, though I'm not sure yet about complications (races) that might appear on a leader change or an instance shutdown (crash). Intuitively, using one object would make handling this easier.

@trociny (Contributor, Author) commented Feb 21, 2017

@dillaman updated

@dillaman left a comment

lgtm

std::string m_instance_id;
std::string m_oid;

virtual void SetUp() {

Nit: we are switching over to using the "override" keyword everywhere in derived classes in place of "virtual".
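
The requested form, in a self-contained gtest example (the fixture name is illustrative):

#include "gtest/gtest.h"

class TestInstanceWatcher : public ::testing::Test {
public:
  // "override" in place of "virtual": same behavior, but the compiler now
  // rejects the method if it does not actually override a base-class virtual.
  void SetUp() override {
    // per-test setup goes here
  }
};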

std::string m_instance_id;
std::string m_oid;

virtual void SetUp() {

Nit: switch to "override" keyword

: instance_ids(instance_ids), on_finish(on_finish) {
}

virtual void finish(int r) {

Nit: switch to "override" keyword

instance_watcher.remove(this);
}

virtual void finish(int r) {

Nit: switch to "override" keyword


InstanceWatcher(librados::IoCtx &io_ctx, ContextWQ *work_queue,
const boost::optional<std::string> &id = boost::none);
virtual ~InstanceWatcher();

Nit: switch to "override" keyword

void remove(Context *on_finish);

protected:
virtual void handle_notify(uint64_t notify_id, uint64_t handle,

Nit: switch to "override" keyword

@@ -292,6 +295,14 @@ int Replayer::init()
return r;
}

m_instance_watcher.reset(
new InstanceWatcher<>(m_local_io_ctx, m_threads->work_queue));

Nit: use a new static create method to reduce the workload when we get mock tests for Replayer
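
A sketch of how the call site could look once such a factory exists (assuming an InstanceWatcher<>::create taking the same arguments as the constructor shown above):

m_instance_watcher.reset(
    InstanceWatcher<>::create(m_local_io_ctx, m_threads->work_queue));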

Mykola Golub added 3 commits February 23, 2017 12:57
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <mgolub@mirantis.com>

@trociny (Contributor, Author) commented Feb 23, 2017

@dillaman updated

@dillaman merged commit 38405b8 into ceph:master on Feb 24, 2017
@trociny deleted the wip-18783 branch on February 28, 2017 07:30