rbd-mirror: replace remote pool polling with add/remove notifications #12364

Merged
merged 4 commits into from Mar 17, 2017

Conversation

dillaman

@dillaman dillaman commented Dec 7, 2016

No description provided.

trociny pushed a commit that referenced this pull request Dec 8, 2016
[DNM] rbd-mirror: replace remote pool polling with add/remove notifications #12364
@trociny
Contributor

trociny commented Dec 9, 2016

@dillaman I'm observing crashes when running rbd_mirror(_stress).sh on teuthology:

http://pulpito.ceph.com/trociny-2016-12-09_06:43:36-rbd-wip-mgolub-testing---basic-mira/

and locally:

2016-12-09 11:09:33.119818 7f783754cc40 -1 rbd::mirror::Mirror: 0x7f7840a47140 update_replayers: removing blacklisted replayer for uuid: 91c12cc7-d872-4cef-a12e-8b03c00b5aad cluster: cluster2 client: client.admin
2016-12-09 11:09:33.119843 7f783754cc40  5 rbd::mirror::PoolWatcher: 0x7f7840c6e830 shut_down: 
2016-12-09 11:09:33.119845 7f783754cc40  5 rbd::mirror::PoolWatcher: 0x7f7840c6e830 unregister_watcher: 
2016-12-09 11:09:33.119849 7f783754cc40  5 rbd::mirror::PoolWatcher: 0x7f7840c6e830 operator(): unregister_watcher: r=0
2016-12-09 11:09:33.120949 7f783754cc40 -1 /home/mgolub/ceph/ceph.upstream/src/librbd/Watcher.cc: In function 'virtual librbd::Watcher::~Watcher()' thread 7f783754cc40 time 2016-12-09 11:09:33.119857
/home/mgolub/ceph/ceph.upstream/src/librbd/Watcher.cc: 81: FAILED assert(m_watch_state != WATCH_STATE_REGISTERED)

 ceph version 11.0.2-2357-g9e1803f (9e1803fbaf6c82b47ac9ccf56a65a07fb4bcd8b6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7f7837173a32]
 2: (librbd::Watcher::~Watcher()+0x337) [0x7f7837010937]
 3: (rbd::mirror::MirrorStatusWatchCtx::Watcher::~Watcher()+0x17) [0x7f7836f15947]
 4: (rbd::mirror::Replayer::~Replayer()+0x349) [0x7f7836f11b69]
 5: (std::_Rb_tree<std::pair<long, rbd::mirror::peer_t>, std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > >, std::_Select1st<std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > > >, std::less<std::pair<long, rbd::mirror::peer_t> >, std::allocator<std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > > >)+0x3b) [0x7f7836f0c1ab]
 6: (rbd::mirror::Mirror::update_replayers(std::map<long, std::set<rbd::mirror::peer_t, std::less<rbd::mirror::peer_t>, std::allocator<rbd::mirror::peer_t> >, std::less<long>, std::allocator<std::pair<long const, std::set<rbd::mirror::peer_t, std::less<rbd::mirror::peer_t>, std::allocator<rbd::mirror::peer_t> > > > > const&)+0x763) [0x7f7836f06253]
 7: (rbd::mirror::Mirror::run()+0xde) [0x7f7836f06dee]
 8: (main()+0x20e) [0x7f7836efbfde]
 9: (__libc_start_main()+0xf5) [0x7f782b946b45]
 10: (()+0x21e246) [0x7f7836f04246]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
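
For readers following the first trace: a minimal sketch of the invariant behind that assertion, under the assumption that it can be reduced to a simple state flag. `SketchWatcher` and `SketchOwner` are hypothetical stand-ins, not the librbd or rbd-mirror classes, and the real unregister path is asynchronous rather than a plain method call. The idea is only that the destructor refuses to run while the watch is still registered, so the owner has to unregister first.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, simplified stand-in for a watcher; not the librbd class.
enum class WatchState { Unregistered, Registered };

class SketchWatcher {
public:
  ~SketchWatcher() {
    // Same spirit as the failed assert in the trace: destroying a
    // still-registered watcher is treated as a programming error.
    assert(m_state != WatchState::Registered);
  }
  void register_watch() { m_state = WatchState::Registered; }
  void unregister_watch() { m_state = WatchState::Unregistered; }
  bool is_registered() const { return m_state == WatchState::Registered; }

private:
  WatchState m_state = WatchState::Unregistered;
};

// Hypothetical owner that always unregisters before releasing the watcher.
class SketchOwner {
public:
  SketchOwner() : m_watcher(new SketchWatcher()) {
    m_watcher->register_watch();
  }
  ~SketchOwner() { shut_down(); }

  void shut_down() {
    if (m_watcher) {
      m_watcher->unregister_watch();  // must complete before destruction
      m_watcher.reset();
    }
  }
  bool has_watcher() const { return static_cast<bool>(m_watcher); }

private:
  std::unique_ptr<SketchWatcher> m_watcher;
};
```

In this sketch, calling `shut_down()` twice is harmless because the second call finds no watcher to release.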

2016-12-09 11:32:35.302769 7f4f80ff9700 -1 /home/mgolub/ceph/ceph.upstream/src/common/RWLock.h: In function 'void RWLock::get_read() const' thread 7f4f80ff9700 time 2016-12-09 11:32:35.301393
/home/mgolub/ceph/ceph.upstream/src/common/RWLock.h: 105: FAILED assert(r == 0)

 ceph version 11.0.2-2357-g9e1803f (9e1803fbaf6c82b47ac9ccf56a65a07fb4bcd8b6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7f4fc1eb7a32]
 2: (()+0x2128c2) [0x7f4fc1c3c8c2]
 3: (librbd::ExclusiveLock<librbd::ImageCtx>::send_reacquire_lock()+0x4c8) [0x7f4fc1cd4738]
 4: (librbd::ExclusiveLock<librbd::ImageCtx>::reacquire_lock(Context*)+0x121) [0x7f4fc1cd56c1]
 5: (librbd::ImageWatcher<librbd::ImageCtx>::handle_rewatch_complete(int)+0x144) [0x7f4fc1cf6754]
 6: (librbd::Watcher::handle_rewatch(int)+0x437) [0x7f4fc1d55c97]
 7: (librbd::watcher::RewatchRequest::finish(int)+0x8d) [0x7f4fc1d57aed]
 8: (librbd::watcher::RewatchRequest::handle_rewatch(int)+0x105) [0x7f4fc1d58405]
 9: (librados::C_AioSafe::finish(int)+0x1d) [0x7f4fb8fa377d]
 10: (Context::complete(int)+0x9) [0x7f4fc1c597f9]
 11: (Finisher::finisher_thread_entry()+0x1f4) [0x7f4fc1eb6cd4]
 12: (()+0x80a4) [0x7f4fb8cf00a4]
 13: (clone()+0x6d) [0x7f4fb674f04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

@dillaman
Author

@trociny Pushed a fix for that crash

@trociny
Contributor

trociny commented Dec 14, 2016

@dillaman this looks related:

[ RUN      ] TestLibRBD.FlushCacheWithCopyupOnExternalSnapshot
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:4853: Failure
      Expected: 0
To be equal to: rbd.clone(ioctx, name.c_str(), "one", ioctx, clone_name.c_str(), (1ULL<<0), &order)
      Which is: -2
[  FAILED  ] TestLibRBD.FlushCacheWithCopyupOnExternalSnapshot (10 ms)

@dillaman dillaman force-pushed the wip-rbd-mirror-notifications branch 2 times, most recently from 6eb4019 to 8b071d7 Compare December 15, 2016 16:51
}

void expect_mirroring_watcher_is_unregister(MockMirroringWatcher &mock_mirroring_watcher,
bool unregistered) {
Contributor

@dillaman Maybe the function name was supposed to be expect_mirroring_watcher_is_unregistered?

@dillaman
Author

dillaman commented Jan 7, 2017

Note: there is a race condition with the ImageDeleter that cannot be fixed until #10896 is merged

@trociny
Contributor

trociny commented Jan 11, 2017

Outdated

@trociny trociny closed this Jan 11, 2017
@trociny
Contributor

trociny commented Jan 11, 2017

Sorry, wrong.

@trociny trociny reopened this Jan 11, 2017
@trociny
Contributor

trociny commented Feb 13, 2017

@dillaman Maybe a subset of this PR (namely, "utilize global image id as internal unique key" and "preliminary support to track multiple remote peer image sources") could be merged as a separate PR, without waiting for #10896?

I am just starting work on InstanceReplayerInterface [1], and it seems to me it would be good if your patches were merged first.

[1] http://tracker.ceph.com/issues/18785

  if (r < 0) {
- derr << "error resolving remote pool " << m_remote_pool_id
+ derr << "error resolving remote pool " << m_local_pool_id
Contributor

@dillaman I think this error message is confusing now -- I suppose users would think the ID in the message is from the remote cluster.

Author

Agreed -- probably should just pass in the pool name string since there really isn't a need to look it up.
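
A rough sketch of what passing the name in could look like, purely as an illustration. `format_pool_error` and its parameters are hypothetical and not taken from the PR; the assumption is that the caller already holds the pool name, so the message can state which cluster the pool belongs to without any ID lookup.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not from the PR: builds the error text from a pool
// name the caller already has, and spells out which cluster it refers to.
std::string format_pool_error(const std::string &cluster_role,
                              const std::string &pool_name,
                              int r) {
  std::ostringstream os;
  os << "error accessing " << cluster_role << " pool " << pool_name
     << ": r=" << r;
  return os.str();
}
```

With this shape, a call such as `format_pool_error("local", pool_name, r)` leaves no room for a reader to mistake a local pool ID for a remote one.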

@dillaman
Author

@trociny Sure -- I'll pull out the parts I can into a cleanup PR

@dillaman dillaman force-pushed the wip-rbd-mirror-notifications branch 3 times, most recently from 1f5ec42 to d49b4dc Compare March 15, 2017 15:41
@dillaman dillaman changed the title from "[DNM] rbd-mirror: replace remote pool polling with add/remove notifications" to "rbd-mirror: replace remote pool polling with add/remove notifications" Mar 15, 2017
@trociny trociny self-assigned this Mar 15, 2017
stop_image_replayers(on_finish);
});
ctx = create_async_context_callback(m_threads->work_queue, ctx);
m_threads->timer->add_event_after(1, ctx);
Contributor

@dillaman Observing teuthology failures [1]

It looks like all cases are due to the timer lock not being held here:

/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.0-1436-gb9f5564/rpm/el7/BUILD/ceph-12.0.0-1436-gb9f5564/src/common/Timer.cc: 127: FAILED assert(lock.is_locked())
 ceph version 12.0.0-1436-gb9f5564 (b9f556438d3c7612e9c0a042c8d8f0959cabd3a0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f351bd9c1a0]
 2: (()+0x2b840d) [0x7f351bd9440d]
 3: (rbd::mirror::Replayer::stop_image_replayers(Context*)+0x179) [0x7f3524d19119]
 4: (rbd::mirror::Replayer::handle_shut_down_pool_watcher(int, Context*)+0xb6) [0x7f3524d194a6]
 5: (FunctionContext::finish(int)+0x2a) [0x7f3524d1dc0a]
 6: (Context::complete(int)+0x9) [0x7f3524d1cc29]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7f351bda4649]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f351bda5660]
 9: (()+0x7dc5) [0x7f351a686dc5]
 10: (clone()+0x6d) [0x7f351914573d]

[1] http://pulpito.ceph.com/trociny-2017-03-16_07:53:00-rbd-wip-mgolub-testing---basic-smithi/
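
To make the precondition in that assert concrete, here is a minimal sketch of a timer that expects its caller to hold an externally owned lock before scheduling an event. `CheckedMutex` and `SketchTimer` are hypothetical names, not the Ceph `Mutex` or `SafeTimer` classes, and the sketch returns `false` where the real code asserts.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical mutex wrapper that remembers which thread owns it.
class CheckedMutex {
public:
  void lock() {
    m_mutex.lock();
    m_owner.store(std::this_thread::get_id());
  }
  void unlock() {
    m_owner.store(std::thread::id());
    m_mutex.unlock();
  }
  bool is_locked_by_me() const {
    return m_owner.load() == std::this_thread::get_id();
  }

private:
  std::mutex m_mutex;
  std::atomic<std::thread::id> m_owner{std::thread::id()};
};

// Hypothetical timer that shares a lock with its caller.
class SketchTimer {
public:
  explicit SketchTimer(CheckedMutex &lock) : m_lock(lock) {}

  // Precondition: the caller holds the lock passed to the constructor.
  bool add_event_after(double seconds, std::function<void()> callback) {
    if (!m_lock.is_locked_by_me()) {
      return false;  // the code in the trace asserts at this point
    }
    m_events.emplace(seconds, std::move(callback));
    return true;
  }
  std::size_t pending() const { return m_events.size(); }

private:
  CheckedMutex &m_lock;
  std::multimap<double, std::function<void()>> m_events;
};
```

A caller would wrap the scheduling call in something like `std::lock_guard<CheckedMutex> locker(lock);` so the precondition holds for the duration of the call.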

@dillaman dillaman force-pushed the wip-rbd-mirror-notifications branch from d49b4dc to 2bdf25a Compare March 16, 2017 14:17
}

virtual void handle_rewatch_complete(int r) override {
m_pool_watcher->handle_rewatch_complete(r);
Contributor

Is using both virtual and override intentional?

Author

It's just very old code from before the switch-over -- I'll correct it
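
For context on why the two keywords together are redundant, a small sketch with hypothetical listener types (not the classes in this PR): `override` already implies the function is virtual, and it additionally makes the compiler reject the declaration if no matching base-class virtual exists.

```cpp
#include <string>

// Hypothetical base interface for illustration only.
struct BaseListener {
  virtual ~BaseListener() = default;
  virtual std::string handle_rewatch_complete(int r) {
    return "base:" + std::to_string(r);
  }
};

struct DerivedListener : BaseListener {
  // `override` alone is sufficient; repeating `virtual` adds nothing.
  // If the signature drifted from the base (say, `long r`), this line
  // would fail to compile instead of silently declaring a new function.
  std::string handle_rewatch_complete(int r) override {
    return "derived:" + std::to_string(r);
  }
};
```

Calling `handle_rewatch_complete` through a `BaseListener` reference to a `DerivedListener` dispatches to the derived implementation.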

m_pool_watcher->handle_rewatch_complete(r);
}

virtual void handle_mode_updated(cls::rbd::MirrorMode mirror_mode) {
Contributor

override


virtual void handle_image_updated(cls::rbd::MirrorImageState state,
const std::string &remote_image_id,
const std::string &global_image_id) {
Contributor

override

}
for (auto &updated_image : m_updated_images) {
updated_image.invalid = true;
}
Contributor

Do we need this for loop, given that after the code above we should have a single (invalid) in-flight request?

Author

Agreed -- missed this during a refactor

}

virtual void handle_update(const ImageIds &added_image_ids,
const ImageIds &removed_image_ids) override {
Contributor

Both virtual and override: is it intentional?

}

virtual void handle_update(const ImageIds &added_image_ids,
const ImageIds &removed_image_ids) {
Contributor

override
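
As a rough illustration of the add/remove notification shape that the `handle_update` signatures above suggest, here is a sketch of a listener that keeps its own image ID set current from incremental updates rather than re-listing everything. `ImageIdSet` and `SketchListener` are hypothetical simplifications; the PR's actual `ImageIds` type and listener classes are not shown in this thread.

```cpp
#include <set>
#include <string>

// Hypothetical simplification of the PR's ImageIds type.
using ImageIdSet = std::set<std::string>;

// Hypothetical listener that applies incremental updates to a local view.
struct SketchListener {
  ImageIdSet images;

  void handle_update(const ImageIdSet &added_image_ids,
                     const ImageIdSet &removed_image_ids) {
    // Drop removed IDs first, then merge in the newly added ones.
    for (const auto &id : removed_image_ids) {
      images.erase(id);
    }
    images.insert(added_image_ids.begin(), added_image_ids.end());
  }
};
```

Each notification only carries the delta, so the cost of an update scales with the number of changed images rather than the size of the pool.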

Jason Dillaman added 4 commits March 16, 2017 16:45
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The local image id set should be up-to-date when attempting to
determine which images need to be deleted.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@dillaman dillaman force-pushed the wip-rbd-mirror-notifications branch from 2bdf25a to b8e70d5 Compare March 16, 2017 21:06
@trociny
Contributor

trociny commented Mar 17, 2017

@ceph-jenkins retest this please

Contributor

@trociny trociny left a comment


LGTM

@trociny
Contributor

trociny commented Mar 17, 2017

@ceph-jenkins try again: retest this please

@trociny trociny merged commit 3cea3ac into ceph:master Mar 17, 2017
@dillaman dillaman deleted the wip-rbd-mirror-notifications branch March 17, 2017 17:04