rbd-mirror A/A: leader should track up/down rbd-mirror instances #13571
assert(m_lock.is_locked());
assert(!m_instances);

m_instances.reset(new Instances<I>(m_threads, m_ioctx));
Nit: use a ::create method so the mocks can be tested
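A minimal sketch of the ::create pattern this nit refers to, assuming a templated class whose mock is supplied through an explicit specialization. `RealImageCtx`, `MockImageCtx`, `s_instance`, `init_instances`, and the `int threads` parameter are hypothetical stand-ins for illustration, not the actual rbd-mirror types or signatures.

```cpp
#include <cassert>

// Hypothetical context tags standing in for the real and mock image contexts.
struct RealImageCtx {};
struct MockImageCtx {};

// Production template: create() simply forwards to the constructor.
template <typename I>
struct Instances {
  static Instances *create(int threads) { return new Instances(threads); }
  explicit Instances(int threads) : threads(threads) {}
  bool is_mock() const { return false; }
  int threads;
};

// Test-only specialization: create() hands back an object the test installed,
// so expectations can be set on it before the code under test runs.
template <>
struct Instances<MockImageCtx> {
  static Instances *s_instance;
  static Instances *create(int /*threads*/) {
    assert(s_instance != nullptr);
    return s_instance;
  }
  bool is_mock() const { return true; }
};
Instances<MockImageCtx> *Instances<MockImageCtx>::s_instance = nullptr;

// Code under test calls create() instead of `new`, so it compiles unchanged
// against either the production template or the mock specialization.
template <typename I>
Instances<I> *init_instances(int threads) {
  return Instances<I>::create(threads);
}
```

With a direct `new Instances<I>(...)` the test has no way to substitute its own object; routing construction through a static factory gives it that hook.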
@@ -32,6 +33,17 @@ struct HeartbeatPayload {
void dump(Formatter *f) const;
};

struct HeartbeatAckPayload {
I was thinking that you could directly use the ack payload sent in response to the HeartbeatPayload message. That ensures the acks are sent to the leader that actually initiated the message and eliminates any potential race during a leader transition.
https://github.com/ceph/ceph/blob/master/src/include/rados/librados.hpp#L1078
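The idea can be illustrated with a simplified model, which is not the librados API: a notify call delivers a message to every watcher and returns each watcher's reply only to the caller that issued the notify. The `Watcher`, `WatcherId`, `AckMap`, and `notify` names are hypothetical; the (notifier id, cookie) key is an assumption loosely modelled on how notify replies are keyed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical key for a reply: (notifier id, cookie).
using WatcherId = std::pair<uint64_t, uint64_t>;
using AckMap = std::map<WatcherId, std::string>;

struct Watcher {
  WatcherId id;
  // The reply travels back with the notify response; no separate message.
  std::string handle_notify(const std::string &msg) const {
    return "ack:" + msg;
  }
};

// Deliver msg to all watchers and hand the collected acks to the caller only.
// Whoever initiated the heartbeat is therefore the one who sees the acks.
AckMap notify(const std::vector<Watcher> &watchers, const std::string &msg) {
  AckMap acks;
  for (const auto &w : watchers) {
    acks[w.id] = w.handle_notify(msg);
  }
  return acks;
}
```

Because the acks arrive as the return value of the leader's own notify, a leader change between heartbeat and ack cannot route the ack to the wrong instance.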
@@ -762,6 +835,24 @@ void LeaderWatcher<I>::handle_heartbeat(Context *on_notify_ack) {
m_acquire_attempts = 0;
cancel_timer_task();
get_locker();
notify_heartbeat_ack();
Nit: suggest using the implicit ack generated below by on_notify_ack->complete(0)
@dillaman updated
@dillaman some issues have been found and fixed after testing:
I am still observing some oddities in tests that I need to investigate. I updated the Replayer admin socket to output the current list of instances. Also, I am wondering whether just looking at notifier_id when parsing the heartbeat ack is enough. A random client could be watching the object at that time; it would be added to the instances table and later removed. This should not cause any issues, but maybe it is still a good idea to add a payload to the ack?
@dillaman I believe I have fixed the test instabilities I observed. They were due to a too-small "remove after" timeout, set initially to I set it to
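A rough sketch of how a "remove after" timeout can work, assuming the leader records the last time each instance acked a heartbeat and drops instances that have been silent for longer than the timeout. `InstanceTable`, `handle_ack`, and `remove_stale` are hypothetical names, and the timeout value is a constructor argument rather than any value from this PR.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

class InstanceTable {
public:
  explicit InstanceTable(std::chrono::seconds remove_after)
    : m_remove_after(remove_after) {}

  // Record (or refresh) the time an instance last acked a heartbeat.
  void handle_ack(const std::string &instance_id, Clock::time_point now) {
    m_last_seen[instance_id] = now;
  }

  // Drop every instance that has been silent for longer than the timeout
  // and return the ids that were removed.
  std::vector<std::string> remove_stale(Clock::time_point now) {
    std::vector<std::string> removed;
    for (auto it = m_last_seen.begin(); it != m_last_seen.end();) {
      if (now - it->second > m_remove_after) {
        removed.push_back(it->first);
        it = m_last_seen.erase(it);
      } else {
        ++it;
      }
    }
    return removed;
  }

  std::size_t size() const { return m_last_seen.size(); }

private:
  std::chrono::seconds m_remove_after;
  std::map<std::string, Clock::time_point> m_last_seen;
};
```

If the timeout is shorter than the gap between two heartbeats plus scheduling jitter, a healthy instance can be dropped and then re-added, which is the kind of flapping a too-small value would produce in tests.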
@dillaman As for the Jenkins failure, I was able to reproduce this running It fails here
when the test is waiting on
But right now I have no idea why it shows up only sporadically, or whether it is a sign of a bug in the LeaderWatcher.
So, to me it looks like sometimes, when the second MirrorStatusWatcher is starting after acquiring the leader lock, the first watcher (which should already have been unregistered) is still returned by list_watchers.
@dillaman I created a PR for I added "skip" for this test on librados test stub.
lgtm
MockManagedLock::get_instance().construct();
}

virtual ~ManagedLock() {
Nit: override vs virtual
@dillaman ManagedLock is a base (not derived) class. I think "virtual" is correct in this case?
Yup -- sorry. Thought it was a derived class from the quick scan.
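For reference, a small sketch of the distinction discussed in this thread: a base class introduces its virtual destructor with `virtual`, while a derived class marks its destructor `override` so the compiler verifies that the base destructor is in fact virtual. `Base` and `Derived` are hypothetical names, not classes from this PR.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Base class: nothing to override, so `virtual` is the right keyword here.
struct Base {
  virtual ~Base() { ++destroyed; }
  static int destroyed;
};
int Base::destroyed = 0;

// Derived class: `override` fails to compile if ~Base() were not virtual.
struct Derived : Base {
  ~Derived() override { ++destroyed_derived; }
  static int destroyed_derived;
};
int Derived::destroyed_derived = 0;

static_assert(std::has_virtual_destructor<Derived>::value,
              "Derived inherits a virtual destructor from Base");
```

Deleting a `Derived` through a `Base` pointer runs both destructors, which is the behaviour the `virtual` on the base destructor guarantees.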
src/tools/rbd_mirror/LeaderWatcher.h
Outdated
@@ -33,6 +34,7 @@ class LeaderWatcher : protected librbd::Watcher {
};

LeaderWatcher(Threads *threads, librados::IoCtx &io_ctx, Listener *listener);
virtual ~LeaderWatcher();
Nit: override vs virtual
MirrorStatusWatcher(librados::IoCtx &io_ctx, ContextWQ *work_queue);
virtual ~MirrorStatusWatcher();
Nit: override vs virtual
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Fixes: http://tracker.ceph.com/issues/18784 Signed-off-by: Mykola Golub <mgolub@mirantis.com>
…work Signed-off-by: Mykola Golub <mgolub@mirantis.com>
@dillaman I have addressed all your comments except the ManagedLock mock class.
@dillaman I am observing some strange test failures after the rebase and need some time to investigate the root cause.
I observed a crash running ceph_test_rbd_mirror, but it looks like something has been broken in master recently: I see the same crash on pure master when running e.g.
retest this please
Fixes: http://tracker.ceph.com/issues/18784
Signed-off-by: Mykola Golub <mgolub@mirantis.com>