
rbd-mirror: add support for active/passive daemon instances #12948

Merged: 14 commits merged into ceph:master from wip-17020 on Feb 1, 2017

Conversation


@trociny trociny commented Jan 16, 2017

No description provided.

@trociny trociny changed the title from "rbd-mirror: add support for active/passive daemon instances" to "[DNM] rbd-mirror: add support for active/passive daemon instances" on Jan 16, 2017

trociny commented Jan 16, 2017

@dillaman updated according to your comments.

I am still not sure about my approach for the LeaderWatcher state machine. I use the variables "acquiring" and "releasing" to make sure we don't start a new state transition while we are still executing the previous one. As a result, if we are in a transition state, leader watcher messages (which may trigger a transition) are ignored, which I suppose is wrong.

I am not sure what is correct to do in this case. Maybe I could use ManagedLock hooks to serialize events properly?
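
A minimal self-contained sketch of the transition guard described above, with hypothetical names and std::mutex standing in for the Ceph primitives (this is not the PR's actual code): notifications that arrive while a previous acquire/release transition is still executing are dropped, which is exactly the behavior being questioned.

  #include <mutex>

  // Simplified stand-in for the LeaderWatcher transition guard described above.
  class LeaderWatcherSketch {
  public:
    void handle_notify_lock_released() {
      std::lock_guard<std::mutex> locker(m_lock);
      if (m_acquiring || m_releasing) {
        // A transition is already in flight, so this event (which could trigger
        // a new transition) is simply ignored.
        return;
      }
      m_acquiring = true;
      // ... start the asynchronous leader-lock acquire; clear m_acquiring on completion ...
    }

  private:
    std::mutex m_lock;
    bool m_acquiring = false;
    bool m_releasing = false;
  };

(The ManagedLock hook approach mentioned above would replace this ad hoc flag handling.)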

void notify_lock_acquired(Context *on_finish);
void notify_lock_released(Context *on_finish);

virtual void handle_heartbeat(Context *on_ack) = 0;


Nit: these methods are unused in this commit and refactored into protected, non-virtual methods in another commit


namespace {

static const uint64_t NOTIFY_TIMEOUT_MS = 5000;


Nit: unused variable

Context *on_notify_ack) {
dout(20) << "heartbeat" << dendl;

handle_heartbeat(on_notify_ack);


Question: why do you have these overloaded methods just invoke other methods instead of combining the two?


namespace {

static const uint64_t NOTIFY_TIMEOUT_MS = 5000;


Nit: unused variable

#include "librbd/Watcher.h"

namespace librbd {
class ImageCtx;


Nit: not used in this header

Mutex::Locker locker(m_lock);

if (r < 0) {
derr << "error acquiring leader lock: " << cpp_strerror(r)


Nit: if -EAGAIN, no need to output an error message -- just a debug message

Mutex::Locker locker(m_lock);

if (r < 0) {
derr << "error retrieving leader lock owner: " << cpp_strerror(r) << dendl;


Nit: if -ENOENT, no need to output error message. Should you immediately retry to acquire the lock in that case?

}

void check_leader_alive(utime_t &now, int heartbeat_interval);

void notify_heartbeat(Context *on_finish);
void notify_lock_acquired(Context *on_finish);


Do these two methods (notify_lock_acquired and notify_lock_released) need to be public? You have overloaded versions that interact w/ the internal state machine and these stand-alone ones.

Mutex::Locker locker(m_lock);
if (!m_stopping && !m_releasing && !m_acquiring) {
if (m_leader_lock->is_lock_owner()) {
dout(0) << "got another leader heartbeat, releasing lock" << dendl;


I don't think you want to do that here (or below). You can have a transient network hiccup and the Objecter could replay old messages from before this instance acquired the lock. I'd just ignore it since you should have been blacklisted if the lock was broken.

* REGISTER_WATCH * lock_released received |
* | /-------------------------\
* v v |
* ACQUIRE_LEADER_LOCK * *> GET_LOCKER -> <peon>


Nit: peon -> passive or secondary


trociny commented Jan 23, 2017

@dillaman significantly modified:

  1. All possible logic is moved from Replayer to LeaderWatcher class. The Replayer only uses LeaderWatcher's is_leader method to check the current status and post_acquire/pre_release hooks to start/stop image replayers on the leader lock state change.

  2. A timer is used to send heartbeat messages and to initiate breaking the lock.

  3. ManagedLock's post/pre acquire/release hooks are used to change the replayer state (stop image replayers, init/shut down the status watcher) when acquiring/releasing the leader lock. This way I hope to avoid issues when a leader state change is requested while we are still in a transition state.

  4. A "release leader" asok command is added (useful for testing, but users might find it useful too, e.g. to release the leader role for some pools so another instance can take over them).

  5. I need to track in-flight break_lock and get_locker operations for a proper shut down. My implementation does not look very nice though. Do you have any suggestions? I suppose I could use AsyncOpTracker, but actually I think maybe ManagedLock should be taught to track these requests and wait for them on shut down?
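
A self-contained sketch of the in-flight request tracking described in (5), with invented names and std::mutex/std::condition_variable standing in for the Ceph primitives; it mirrors roughly what an op tracker needs to provide (start/finish an operation, wait for all operations to drain on shut down):

  #include <condition_variable>
  #include <mutex>

  // Toy tracker: break_lock/get_locker would call start_op()/finish_op(), and
  // shut_down() would call wait_for_ops() before tearing anything down.
  class PendingOpTracker {
  public:
    void start_op() {
      std::lock_guard<std::mutex> locker(m_lock);
      ++m_pending;
    }

    void finish_op() {
      std::lock_guard<std::mutex> locker(m_lock);
      if (--m_pending == 0) {
        m_cond.notify_all();
      }
    }

    void wait_for_ops() {
      std::unique_lock<std::mutex> locker(m_lock);
      m_cond.wait(locker, [this] { return m_pending == 0; });
    }

  private:
    std::mutex m_lock;
    std::condition_variable m_cond;
    int m_pending = 0;
  };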

@dillaman

@trociny For (5), I think it would be good for ManagedLock to track these async requests and delay shutdown until they are completed.

@trociny trociny force-pushed the wip-17020 branch 2 times, most recently from 810a7a1 to 0eb941a on January 26, 2017 07:56
@@ -0,0 +1,270 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-

@dillaman dillaman Jan 26, 2017


Can you create a basic mock test case for the leader watcher? It would require templatizing the LeaderWatcher class.

@trociny (Contributor Author):

Yes, this is in progress.

return 0;
}

void LeaderWatcher::shut_down() {


Might be better to make this an async method w/ helper methods, now that you have to unlock/relock over and over to work around the fact that it is internally calling async methods.
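
A rough sketch of what the suggested asynchronous shut_down could look like, using hypothetical helper and member names on top of the PR's existing Context/FunctionContext types; each step completes via a callback instead of unlocking and relocking around internally asynchronous calls:

  void LeaderWatcher::shut_down(Context *on_finish) {
    Mutex::Locker locker(m_lock);
    assert(m_on_shut_down_finish == nullptr);
    m_on_shut_down_finish = on_finish;   // hypothetical member holding the shutdown context

    // Step 1: release the leader lock asynchronously (hypothetical overload taking a callback).
    release_leader_lock(new FunctionContext(
      [this](int r) { handle_shut_down_release(r); }));
  }

  void LeaderWatcher::handle_shut_down_release(int r) {
    // Step 2: unregister the watch and complete the saved shutdown context when done.
    Mutex::Locker locker(m_lock);
    unregister_watch(m_on_shut_down_finish);
  }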

release_leader_lock();
}

Context *LeaderWatcher::async(Context * ctx) {


Nit: rename to create_async_context? or update the existing librbd::util::create_async_context to be able to handle either an ImageCtx-like param or a ContextWQ-like param.

m_locker = locker;
}

schedule_timer_task("break leader", 2, false,


Two missed heartbeats result in a break lock? This should probably be configurable and be more aligned to 30 seconds like the watch timeout.
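
A hypothetical illustration of that suggestion; the option names below are invented for the sketch (not asserted to be actual Ceph options), showing the break-lock delay derived from configuration rather than a hard-coded two heartbeats:

  // Hypothetical config-driven delay computed where the timer task is scheduled.
  int heartbeat_interval = m_cct->_conf->rbd_mirror_leader_heartbeat_interval;
  int max_missed = m_cct->_conf->rbd_mirror_leader_max_missed_heartbeats;

  // This value would replace the literal 2 passed to schedule_timer_task("break leader", ...).
  int break_delay = heartbeat_interval * max_missed;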


assert(m_lock.is_locked());

std::shared_ptr<librbd::managed_lock::Locker> locker(


Why not just use a member variable for this in-flight request storage? You shouldn't have concurrent get locker requests in-flight, correct?

@trociny (Contributor Author):

Actually, I expect concurrent get locker requests. They are generated by these events:

  • lock acquire failed,
  • lock break failed,
  • lock released,
  • "heartbeat" message received,
  • "lock acquired" message received.

As these are both internal and external events, some serialization is needed if we want to avoid concurrent requests. In my previous version I had a check in get_locker to skip it if there was already an in-flight request. This requires two additional member variables (the locker and an "in-flight" boolean). Using a shared pointer looked a little nicer to me, but I don't have a strong opinion here.

@dillaman:

You could also just create a custom struct context wrapper that has its own locker member variable. That would combine the FunctionContext and Locker allocations into a single wrapper.
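
A sketch of the suggested wrapper, built on the PR's own types (Context, librbd::managed_lock::Locker and the handle_get_locker callback); the struct name is hypothetical:

  // One allocation carrying both the completion callback and the Locker result,
  // replacing the separate FunctionContext plus shared_ptr<Locker> pair.
  struct C_GetLocker : public Context {
    LeaderWatcher *leader_watcher;
    librbd::managed_lock::Locker locker;

    explicit C_GetLocker(LeaderWatcher *leader_watcher)
      : leader_watcher(leader_watcher) {
    }

    void finish(int r) override {
      leader_watcher->handle_get_locker(r, locker);
    }
  };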

@@ -147,6 +160,13 @@ class MirrorAdminSocketHook : public AdminSocketHook {
if (r == 0) {
commands[command] = new FlushCommand(mirror);
}

command = "rbd mirror release leader";


Nit: perhaps "rbd mirror leader release" in case new leader asok commands are added in the future

@@ -152,6 +166,13 @@ class ReplayerAdminSocketHook : public AdminSocketHook {
if (r == 0) {
commands[command] = new FlushCommand(replayer);
}

command = "rbd mirror release leader " + name;


Nit: same comment here

set_sources(ImageIds());
if (!m_image_replayers.empty()) {
Mutex::Locker timer_locker(m_threads->timer_lock);
Context *task = new FunctionContext(


Can this be cleaned up? The nested FunctionContexts are hard to read.

bool is_leader();
void release_leader();

virtual void post_acquire_leader_handler(Context *on_finish) = 0;


Perhaps just create a listener interface to avoid having to inherit from this class just to receive callbacks.
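
A sketch of the suggested listener interface (method names are hypothetical); the Replayer would implement it and register with the LeaderWatcher instead of inheriting from it:

  struct Listener {
    virtual ~Listener() {
    }

    // Invoked after the leader lock has been acquired; on_finish is completed once
    // the listener has started its work (e.g. the image replayers).
    virtual void post_acquire_handler(Context *on_finish) = 0;

    // Invoked before the leader lock is released; on_finish is completed once the
    // listener has quiesced.
    virtual void pre_release_handler(Context *on_finish) = 0;
  };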


trociny commented Jan 30, 2017

@dillaman It looks ready for review now.

@trociny trociny changed the title from "[DNM] rbd-mirror: add support for active/passive daemon instances" to "rbd-mirror: add support for active/passive daemon instances" on Jan 30, 2017
@dillaman

@trociny Thanks -- I'll start a deep review today. Did you see the build failure?


LeaderWatcher(Threads *threads, librados::IoCtx &io_ctx);

void add_listener(Listener *listener);


Is there a reason to need the add/remove listener methods and not just pass in the listener in the constructor? Is there a one-to-many relationship somewhere?

@trociny (Contributor Author):

There is no particular reason so far. It just looked more generic and could be useful for testing (to add test hooks). I can also imagine some useful applications like a CLI watcher, though it is unlikely we will have such a thing, and this can always be changed. So if you prefer, I will update this to pass the listener in the constructor.

@dillaman:

My only concern was the added complexity w/ handling multiple observers and how to properly handle adding / removing them if a callback is in-flight. The code becomes cleaner w/o support for multiple.


assert(m_lock.is_locked());

Context *ctx = create_async_context_callback(


Nit: librbd::Watcher::register_watch is guaranteed to be an async callback already. Did you hit some oddity here that required an extra async wrapper?

@trociny (Contributor Author):

No, I didn't. I did hit it in unregister_watch, when the client was blacklisted, and just changed this one for safety too (even if it is guaranteed to be an async callback now, someone might change this in the future and forget to update LeaderWatcher).

If you don't like unnecessary async calls, I will update this. Though I would like us to have some clear convention (markers in the header file?) to know whether a function is guaranteed to complete via an async callback, without needing to look at the code. Or update functions like unregister_watch to provide this guarantee.

@dillaman:

I'm fine w/ leaving it -- no harm and this isn't a high performance code path. It was more of a curiosity.

void handle_get_locker(int r, librbd::managed_lock::Locker& locker);

void acquire_leader_lock(bool reset_attempt_counter);
void acquire_leader_lock();


Nit: I'd just update the call sites to this override to use acquire_leader_lock(false) directly

@trociny (Contributor Author):

I need to pass it to schedule_timer_task, which expects void(*callback)().

@dillaman:

Ack

if (is_leader(m_lock)) {
dout(5) << "got another leader lock_acquired, ignoring" << dendl;
} else {
cancel_timer_task();


Should "acquire" and "release" notifications reset m_acquire_attempts?

@trociny (Contributor Author):

For the "release" case it is already reset, but it is missing for the "acquire" case. Thanks, I will update.


namespace {

static const uint64_t NOTIFY_TIMEOUT_MS = 5000;


Nit: comment still open

void Replayer::set_sources(const ImageIds &image_ids)
{
dout(20) << "enter" << dendl;

assert(m_lock.is_locked());

if (!m_init_images.empty()) {
if (!m_init_images.empty() && !m_stopping.read()


Nit: move && to end of this line


trociny commented Jan 31, 2017

@dillaman updated

@@ -0,0 +1,207 @@
#!/bin/sh


@trociny Just realized while running the tests that this needs an associated suite test case to execute it under teuthology

@trociny (Contributor Author):

@dillaman The test starts/stops rbd-mirror daemons, while under teuthology that is outside the script's control.

I suppose I can modify the script so that, when it is run under teuthology, it executes only a subset of tests (maybe using "leader release" instead of stop in some cases). I will investigate this tomorrow. Maybe you already have some hints or suggestions?

@dillaman:

@trociny Hmm -- perhaps what we really just need (long term) is a "thrasher" task like the ones for OSDs/MDSs. I'll create a separate tracker ticket for that.


dillaman commented Feb 1, 2017

@trociny Running the rbd_mirror_ha.sh script locally has failed for me 3 out of 4 times. I will try to investigate more tomorrow.

Mykola Golub added 6 commits February 1, 2017 10:55, each Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub added 7 commits February 1, 2017 10:55
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Fixes: http://tracker.ceph.com/issues/17019
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
…ired/released

Fixes: http://tracker.ceph.com/issues/17020
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
by optionally specifying a daemon instance after the cluster name and
colon, like:

  start_mirror ${cluster}:${instance}

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
This var is mostly used when running rbd_mirror test scripts on
teuthology. It can be used locally though to speed up re-running the
tests:

Set a test temp directory:

  export RBD_MIRROR_TEMDIR=/tmp/tmp.rbd_mirror

Run the tests the first time with NOCLEANUP flag (the cluster and
daemons are not stopped on finish):

  RBD_MIRROR_NOCLEANUP=1 ../qa/workunits/rbd/rbd_mirror.sh

Now, to re-run the test without restarting the cluster, run cleanup
with USE_EXISTING_CLUSTER flag:

  RBD_MIRROR_USE_EXISTING_CLUSTER=1 \
    ../qa/workunits/rbd/rbd_mirror_ha.sh cleanup

and then run the tests:

  RBD_MIRROR_USE_EXISTING_CLUSTER=1 \
    ../qa/workunits/rbd/rbd_mirror_ha.sh

Signed-off-by: Mykola Golub <mgolub@mirantis.com>

trociny commented Feb 1, 2017

@dillaman I was able to reproduce the rbd_mirror_ha.sh test failure after several runs (not sure it was your case though). In my case it failed on the "crash leader" test due to a too short timeout in rbd_mirror_ha.sh wait_for_leader (I forgot to update the test after I added acquire attempts to the LeaderWatcher). I updated the PR.

Also, I have added a suite test, trying to run this on teuthology...

Signed-off-by: Mykola Golub <mgolub@mirantis.com>

trociny commented Feb 1, 2017

@dillaman

> Also, I have added a suite test, trying to run this on teuthology

But I like your idea with a "thrasher" task too.


dillaman commented Feb 1, 2017

@trociny Thanks -- works for me now

@dillaman dillaman merged commit 8b8023c into ceph:master Feb 1, 2017
@trociny trociny deleted the wip-17020 branch February 2, 2017 14:22

runsisi commented Mar 9, 2017

@dillaman after this PR, we cannot add more than one peer for a local pool anymore?


dillaman commented Mar 9, 2017

@runsisi That has never been supported (yet) -- not sure why you think this PR removed that capability.


runsisi commented Mar 10, 2017

@dillaman I did some tests on the latest Jewel release, i.e. 10.2.6; the results are shown below. They show that the images in cluster cs4 and cluster cs5 were synced to cluster cs6 successfully.

So I think the N remote pools -> 1 local pool feature works as expected.

Do you mean that bidirectional mirroring, i.e. the N remote pools <-> 1 local pool feature, has never been supported, and that the N remote pools -> 1 local pool feature is not something RBD mirroring is supposed to support?

[root@cs6 ~]# rbd info cs4i1 --cluster cs4
rbd image 'cs4i1':
        ...
        mirroring global id: 2b1be645-c753-40da-ac1f-103711cae407
        mirroring primary: true

[root@cs6 ~]# rbd info cs5i1 --cluster cs5 
rbd image 'cs5i1':
        ...
        mirroring global id: e59d1217-e29a-4d51-ae52-c6a573085d7f
        mirroring primary: true
[root@cs6 ~]# rbd mirror pool info rbd
Mode: image
Peers: 
  UUID                                 NAME CLIENT       
  6ccc4f09-ea5c-4a12-8f7f-ebc976b09131 cs5  client.admin 
  a5e58893-a4a8-4413-917c-53602ad4c411 cs4  client.admin

[root@cs6 ~]# rbd ls
cs4i1
cs5i1

[root@cs6 ~]# rbd info cs4i1
rbd image 'cs4i1':
        ...
        mirroring global id: 2b1be645-c753-40da-ac1f-103711cae407
        mirroring primary: false

[root@cs6 ~]# rbd info cs5i1
rbd image 'cs5i1':
        ...
        mirroring global id: e59d1217-e29a-4d51-ae52-c6a573085d7f
        mirroring primary: false

@dillaman

@runsisi N remote peers syncing to 1 local cluster is definitely a scenario we have never tested nor designed for. Feel free to submit a PR to fix whatever got broken, but N clusters -> 1 cluster is a very odd scenario that could run into other issues in the future (e.g. name collisions).


runsisi commented Mar 10, 2017

@dillaman yeah, thank you for your explanation.

But then I am wondering why the rbd CLI supports adding more than one peer for a local pool?

@dillaman

@runsisi It's definitely a long-term goal to support N-to-N replication. We would like to have a better understanding of the required use cases in the real world before we dive into such an endeavor. It probably would be a good idea to prevent the rbd CLI from adding more than one peer to a pool in the meantime.


runsisi commented Mar 10, 2017

@dillaman I see. Jason, thank you very much!

@dillaman

@runsisi In the meantime, if you have cluster cs4 and cluster cs5 use different pool names, they could replicate to the associated two pools in cluster cs6.


runsisi commented Mar 10, 2017

@dillaman yes, I suppose this is the designed behavior :)
