rbd-mirror A/A: InstanceWatcher watch/notify stub for leader/follower RPC #13312
Conversation
@dillaman Implemented according to [1], though there are some questions to discuss:
@dillaman Yet another question. Working on the next subtask [1] I need to blacklist an instance that does not respond to heartbeats. The problem is learning that instance's address. I could use list_watchers on … Maybe when an instance creates and starts watching the "rbd_mirror_instance." object it should also acquire the lock on it? Then I could just use …
I'm fine w/ that -- I just suggested a new object to keep its purpose clear.
The trouble is that we need to ensure that the client id state mapping table has a record for a client before said client creates its watch/notify object. Otherwise, we could have uncontrolled growth of orphaned objects. If the slave needs to send a message to the leader saying "i'm alive" (or the leader automatically adds the up entry when it receives the ack), we need a new broadcast message on the "rbd_mirror_leader" object from the leader to the slave saying "watch your object now" (or vice-versa, where the slave tells the leader that it has now started listening on its RPC object). Perhaps it's just a state payload as part of the leader's heartbeat, where the slave includes its current state (i.e. waiting for leader, ready, etc.).
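To illustrate the heartbeat-payload idea above, here is a minimal sketch of a slave state piggybacked on the heartbeat ack, so the leader only registers an instance's RPC object once the slave reports it is watching it. All names (SlaveState, HeartbeatAck, should_register_instance) are hypothetical, not the actual rbd-mirror protocol.

```cpp
#include <cassert>
#include <string>

// Hypothetical slave states carried in the heartbeat ack payload.
enum class SlaveState { WAITING_FOR_LEADER, READY };

struct HeartbeatAck {
  std::string instance_id;
  SlaveState state;
};

// Leader-side check: only add the instance to the in-memory peer table
// once the slave reports READY (i.e. it is watching its own instance
// object), avoiding messages to an object nobody is listening on.
bool should_register_instance(const HeartbeatAck &ack) {
  return ack.state == SlaveState::READY;
}
```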
The purpose of tracking the clients is to allow for the cleanup of orphaned "rbd_mirror_instance.XYZ" objects. Once the leader has confirmed that it has removed the "rbd_mirror_instance.XYZ" object for the blacklisted slave, it can remove the entry from the table. Therefore, perhaps "down" is not the best term -- perhaps "blacklisting" or "pruning".
Sorry -- yes, that is the goal. The original design had the rbd-mirror daemons taking new "rbd-mirror" ManagedLocks on each image, but I (in my head at least) simplified this to just the instance-specific object.
src/cls/rbd/cls_rbd.cc
Outdated
bufferlist bl;
::encode(state, bl);

int r = cls_cxx_map_set_val(hctx, instance_key(instance_id), &bl);
Assuming the current approach where the slave writes its record before watching the "rbd_mirror_object" is kept, we should probably disallow the transition from down -> up state since that shouldn't be possible.
Actually, I was thinking about providing functions like below instead of instances_set:
- instances_add/register -- add a record in up state
- instances_down/pruning -- change a record state to down
- instances_remove/unregister -- remove a record
I can do it either way: add a check to instances_set, or introduce add/down functions instead of set. Which do you like more?
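A minimal sketch of the instances_add/down/remove variant proposed above, with the down -> up transition disallowed as suggested. A std::map stands in for the object's omap, and the function names and return codes are illustrative only, not the final cls_rbd API.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

enum class MirrorInstanceState { UP, DOWN };
using InstanceTable = std::map<std::string, MirrorInstanceState>;

// instances_add: create a record in the UP state; refuse resurrecting a
// record that the leader has already marked DOWN (being pruned).
int instances_add(InstanceTable &table, const std::string &instance_id) {
  auto it = table.find(instance_id);
  if (it != table.end() && it->second == MirrorInstanceState::DOWN) {
    return -EINVAL;  // down -> up transition should be impossible
  }
  table[instance_id] = MirrorInstanceState::UP;
  return 0;
}

// instances_down: leader marks a record as being pruned.
int instances_down(InstanceTable &table, const std::string &instance_id) {
  auto it = table.find(instance_id);
  if (it == table.end()) {
    return -ENOENT;
  }
  it->second = MirrorInstanceState::DOWN;
  return 0;
}

// instances_remove: drop the record once the instance object is gone.
int instances_remove(InstanceTable &table, const std::string &instance_id) {
  return table.erase(instance_id) == 1 ? 0 : -ENOENT;
}
```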
uint64_t notifier_id, bufferlist &bl);

private:
std::string m_instance_id;
Nit: can you create a state diagram?
assert(m_lock.is_locked());

librados::ObjectWriteOperation op;
librbd::cls_client::mirror_instances_set(&op, m_instance_id,
The instance should be recorded before the "rbd_instance.XYZ" object is created to prevent untracked, orphaned objects in the case of failure.
Agree that the record should be removed only after the object is removed. But the "down" state still does not look necessary. The record could remain in the up state while pruning -- if pruning is interrupted and a new leader reads the records, it will assume the instance is up but will eventually prune it after not receiving responses to heartbeats. So the only difference the "down" state makes is that this would happen a little faster. Or am I missing some other scenario where the "down" state would be useful? Actually, I am fine with the "down" state, I just want to make this clear.
@dillaman Agree that the instance adding a record about itself, instead of sending a message to the leader, may be simpler (no need to wait for an ack from the leader), so I think I will do it this way. But we still need a way to notify the leader when a new record is added, so it can refresh the instances table (or just add the instance from the notification). Without this notification it will only learn about the new instance after sending a heartbeat and receiving the response, but the instance might already be dead by that time. A periodic refresh alone could work, though it does not look very nice. What do you suggest?
The failure case was the rationale -- but you are correct that even with a leader failure it should eventually figure things out. If you leave out the state now and a better use-case is determined, we have plenty of time to add it back in before the Luminous release. For a first-pass of A/A support, leaving it out results in fewer code paths to test.
I was just trying to keep it simple. Note that the leader only has to read the table once to initialize its peer instance table. Right now the heartbeat is sent out pretty regularly (5 secs?) so I don't think it's too long a period to wait for the leader to detect a new peer. The peer could die at any time and the leader would need to handle that regardless, after X number of missed heartbeat acks from the client.
Let's consider a case:
As a result the instance object will exist until the leader is restarted and the table is reread (it may remain forever if the user decides not to start rbd-mirror for this pool any more). Not a big deal, but it would still be nice to handle this. E.g. before adding the record the instance could send heartbeat_ack (a little confusing message name, though) so the leader adds it to its in-memory table.
@trociny I think I am comfortable with that possibility since it eliminates extra messages and provides a consistent "rbd_instance.XYZ" bootstrap path for both the leader and the slaves. The (temporarily) orphaned object issue could be avoided by periodically transitioning the leader (i.e. every X hrs, the current leader automatically releases the lock so all other instances can attempt to acquire it). That would clean up the orphans and would continuously exercise the leader failover logic.
@dillaman updated
@dillaman sorry I forgot to add acquire/release lock steps on the "rbd_mirror_instance.XYZ" object. Working on it
@dillaman updated
Force-pushed from e5c3599 to eab475b
@dillaman test_InstanceWatcher.cc updated (I had to fix a test that worked on the librados test stub but failed on a real cluster).
src/cls/rbd/cls_rbd.cc
Outdated
* none
*
* Output:
* @param std::map<std::string, cls::rbd::MirrorInstanceState>: instance id to state map
Nit: update output
src/cls/rbd/cls_rbd.cc
Outdated
/**
* Input:
* @param instance_id (std::string)
* @param state (cls::rbd::MirrorInstanceState)
Nit: update input
assert(!m_instance_lock);

m_instance_lock.reset(
  new librbd::ManagedLock<I>(m_ioctx, m_work_queue, m_oid, this,
Nit: can you add a static ManagedLock::create method and avoid the direct allocation? That would allow the mock test to combine the specialized ManagedLock and MockManagedLock. If you switch from using a unique_ptr, you can also add a ManagedLock::destroy method (like ImageCtx::destroy) to mock out the deallocation.
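The create/destroy factory pattern suggested here can be sketched as follows. This is a simplified stand-in (the real ManagedLock takes an IoCtx, work queue, oid, and watcher; most of that is elided), meant only to show why the static factory plus destroy() lets a mock shadow both allocation and deallocation.

```cpp
#include <cassert>
#include <string>

// Simplified sketch of the create/destroy idiom (modeled on
// ImageCtx::create/destroy). A mock test can specialize this template for
// its mock watcher type and redirect create()/destroy() to a singleton.
template <typename Watcher>
class ManagedLock {
public:
  // Static factory replaces direct `new` at call sites.
  static ManagedLock *create(const std::string &oid, Watcher *watcher) {
    return new ManagedLock(oid, watcher);
  }
  // Matching destroy() replaces `delete` (and unique_ptr ownership), so
  // deallocation can be mocked out as well.
  void destroy() {
    delete this;
  }
  const std::string &oid() const { return m_oid; }
private:
  ManagedLock(const std::string &oid, Watcher *watcher)
    : m_oid(oid), m_watcher(watcher) {}
  ~ManagedLock() = default;  // private: callers must go through destroy()
  std::string m_oid;
  Watcher *m_watcher;
};
```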
#ifndef CEPH_RBD_MIRROR_INSTANCE_WATCHER_H
#define CEPH_RBD_MIRROR_INSTANCE_WATCHER_H

#include "librbd/ManagedLock.h"
Nit: can this be forward declared since it's just keeping a pointer?
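The forward-declaration nit can be sketched like this: since the header only holds a pointer, the full ManagedLock definition can move to the .cc file. Both "files" are shown in one snippet for illustration; the class bodies are hypothetical stand-ins, not the real librbd types.

```cpp
#include <memory>

// --- InstanceWatcher.h (sketch): forward-declare ManagedLock instead of
// including librbd/ManagedLock.h, since we only keep a pointer to it.
namespace librbd {
template <typename> class ManagedLock;
}

class InstanceWatcher {
public:
  InstanceWatcher();
  ~InstanceWatcher();  // declared here, defined where ManagedLock is complete
  // Pointer-only inspection does not need the complete type.
  bool has_lock() const { return m_instance_lock != nullptr; }
private:
  std::unique_ptr<librbd::ManagedLock<InstanceWatcher>> m_instance_lock;
};

// --- InstanceWatcher.cc (sketch): the full definition is only needed here,
// where the constructor allocates and the destructor destroys the lock.
namespace librbd {
template <typename T> class ManagedLock {};
}

InstanceWatcher::InstanceWatcher()
  : m_instance_lock(new librbd::ManagedLock<InstanceWatcher>()) {}
InstanceWatcher::~InstanceWatcher() {}
```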
@dillaman But evaluating this I ran into the following issues:
(1) is not difficult to resolve -- if the "not owner" instance watcher init failed, and the instance is alive, it will send a heartbeat ack soon, and we can re-init the watcher on this event. If no heartbeat ack is received, we believe it is dead and start the instance removal procedure. (2) is more complicated. I can imagine the following states when
I suppose the
Not sure what to do if get locker on step (2) returns an error other than ENOENT, though
@dillaman If my reasoning above is not quite clear, I can push the PR for the [1] subtask (which is currently in a messy state, so I need some time to clean it up)
Why would the leader need to watch on the instance? I would think we only need to expose a "break lock" helper that invokes a state machine that retrieves the current lock owner for the instance object and invokes the break lock state machine. It shouldn't need to go through watching the instance object or attempting to lock the instance object. If the instance needs to send a direct message to the leader, it should send it to the instance object of the leader. I guess I was envisioning that regardless of leader or non-leader status, the instances would bootstrap up an instance watcher for themselves.
We could add a nice "I am nicely shutting down" message that could be sent to the leader's instance watcher. Perhaps we just need an instance "state" message that can be sent out as "ready" / "not ready" for image assignments -- in an orderly shutdown the instance can send "not ready" and can send "image XYZ released" messages to the leader so that the images can be quickly re-assigned to another instance (or we just have "ready for images", "shutting down", and "shut down" states to prevent assignment of new images in the "shutting down" state, with the assumption that all replayers have been properly shut down when the "shut down" state is reached).
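The three-state orderly-shutdown idea above can be sketched as a tiny state machine; the leader would stop assigning images as soon as the instance leaves READY. The enum and struct names are illustrative assumptions, not the final protocol.

```cpp
#include <cassert>

// Hypothetical instance states, mirroring the "ready for images" /
// "shutting down" / "shut down" proposal.
enum class InstanceState { READY, SHUTTING_DOWN, SHUT_DOWN };

struct Instance {
  InstanceState state = InstanceState::READY;

  // Leader-side check: new image assignments only while READY.
  bool can_assign_image() const {
    return state == InstanceState::READY;
  }

  // Orderly shutdown: announce SHUTTING_DOWN first so the leader stops
  // assigning and can start re-assigning released images elsewhere...
  void begin_shutdown() { state = InstanceState::SHUTTING_DOWN; }

  // ...then SHUT_DOWN once all replayers have been properly stopped.
  void finish_shutdown() {
    assert(state == InstanceState::SHUTTING_DOWN);
    state = InstanceState::SHUT_DOWN;
  }
};
```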
I thought Watcher RPC messages would use the same instance object. I.e. if the leader needs to send an "Image Acquire" RPC message to an instance, it sends it to this instance's object, and the instance sends an "Image Acquired" ACK back to the leader through the same instance object. Now I see you want ACK messages sent through the leader's instance object. I will try it this way. It looks doable, though I'm not sure yet about complications (races) that might appear on a leader change or an instance shutdown (crash). Intuitively, using one object would make handling this easier.
@dillaman updated
lgtm
std::string m_instance_id;
std::string m_oid;

virtual void SetUp() {
Nit: we are switching over to use the "override" keyword everywhere in derived classes in-place of "virtual".
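The "virtual" -> "override" switch requested here looks like this in a sketch (a minimal fixture stand-in, not the actual test code): override documents the intent and makes the compiler reject a signature that fails to override anything.

```cpp
// Minimal stand-in for a test fixture base class.
struct TestFixture {
  virtual ~TestFixture() = default;
  virtual void SetUp() {}
  virtual int value() const { return 0; }
};

struct TestInstanceWatcher : TestFixture {
  // was: virtual void SetUp() { ... }
  // With "override", a typo like SetUp(int) would fail to compile instead
  // of silently declaring a new, unrelated virtual function.
  void SetUp() override { m_ready = true; }
  int value() const override { return m_ready ? 1 : 0; }
  bool m_ready = false;
};
```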
std::string m_instance_id;
std::string m_oid;

virtual void SetUp() {
Nit: switch to "override" keyword
: instance_ids(instance_ids), on_finish(on_finish) {
}

virtual void finish(int r) {
Nit: switch to "override" keyword
instance_watcher.remove(this);
}

virtual void finish(int r) {
Nit: switch to "override" keyword
InstanceWatcher(librados::IoCtx &io_ctx, ContextWQ *work_queue,
               const boost::optional<std::string> &id = boost::none);
virtual ~InstanceWatcher();
Nit: switch to "override" keyword
void remove(Context *on_finish);

protected:
virtual void handle_notify(uint64_t notify_id, uint64_t handle,
Nit: switch to "override" keyword
src/tools/rbd_mirror/Replayer.cc
Outdated
@@ -292,6 +295,14 @@ int Replayer::init()
    return r;
  }

  m_instance_watcher.reset(
    new InstanceWatcher<>(m_local_io_ctx, m_threads->work_queue));
Nit: use a new static create method to reduce the workload when we get mock tests for Replayer
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
… RPC

Fixes: http://tracker.ceph.com/issues/18783
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
@dillaman updated
Fixes: http://tracker.ceph.com/issues/18783
Signed-off-by: Mykola Golub <mgolub@mirantis.com>