rbd-mirror: track images via global image id #13416

dillaman · 2017-02-14T17:28:17Z

No description provided.

trociny · 2017-02-14T20:50:05Z

@dillaman Don't you think we could hit the false alarms you fixed in the "qa/workunits/rbd" commits when testing our stable branches? I mean, it might make sense to backport these, and a separate PR would make that easier.

Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.

dillaman · 2017-02-14T20:52:22Z

@trociny

> Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.

lol -- indeed. I'll put both of those fixes into a backport PR.

trociny · 2017-02-14T21:37:05Z

src/tools/rbd_mirror/Replayer.h

-  };
-
-  std::set<InitImageInfo> m_init_images;
+  std::set<ImageId> m_init_images;
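
A minimal sketch of what keying the tracked set by global image id could look like. The `ImageId` layout, its field names, and the `track_image` helper are assumptions made for illustration; they are not copied from this PR.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the key type; field names are assumptions.
struct ImageId {
  std::string global_id;  // mirror-wide image id shared by both clusters
  std::string local_id;   // pool-local image id (may be unknown/empty)

  // Order by global id only, so one mirrored image maps to one set entry
  // regardless of which local id is attached to it.
  bool operator<(const ImageId &rhs) const {
    return global_id < rhs.global_id;
  }
};

// Returns true if the image was newly tracked, false if an entry with the
// same global id was already present.
inline bool track_image(std::set<ImageId> &images, const ImageId &id) {
  assert(!id.global_id.empty());
  return images.insert(id).second;
}
```

With this ordering, inserting the same global id twice (even with a different or empty local id) leaves a single entry in the set.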


trociny · 2017-02-14T21:43:28Z

src/librbd/Watcher.cc

+
+    std::swap(unregister_watch_ctx, m_unregister_watch_ctx);
+    if (r < 0) {
+      lderr(m_cct) << ": failed to register watch: " << cpp_strerror(r)


While here, it would be nice to fix the error message prefix (add "this" and remove the ':').

trociny · 2017-02-14T21:58:03Z

src/tools/rbd_mirror/pool_watcher/RefreshImagesRequest.cc

+
+template <typename I>
+void RefreshImagesRequest<I>::handle_mirror_image_list(int r) {
+  dout(10) << ": r=" << r << dendl;


redundant ": " here and in other messages below.
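
A small sketch of why the leading ": " is redundant, assuming the per-file log prefix already ends with "<function>: ". The `log_prefix` helper below is a hypothetical stand-in for such a prefix macro, not the actual Ceph definition.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical prefix helper: it already terminates with "<function>: ".
inline std::string log_prefix(const std::string &cls, const std::string &func) {
  assert(!func.empty());
  return cls + " " + func + ": ";
}

// Message body that starts with its own ": " (the pattern flagged above).
inline std::string with_redundant_colon(int r) {
  std::ostringstream os;
  os << log_prefix("RefreshImagesRequest", "handle_mirror_image_list")
     << ": r=" << r;
  return os.str();
}

// Same message with the extra ": " dropped from the body.
inline std::string without_redundant_colon(int r) {
  std::ostringstream os;
  os << log_prefix("RefreshImagesRequest", "handle_mirror_image_list")
     << "r=" << r;
  return os.str();
}
```

Under that assumption the first form produces a doubled separator ("handle_mirror_image_list: : r=0"), while the second reads cleanly.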

dillaman · 2017-02-15T00:21:46Z

@trociny updates pushed

trociny

@dillaman LGTM
Though I observed one test failure [1]. It failed in the "create many images" stress test, because the replayers for many images had not been started.

I failed to reproduce this after rerunning the test many times, and I have no idea whether your changes are the cause. It looks like a pre-existing issue.

[1] http://qa-proxy.ceph.com/teuthology/trociny-2017-02-14_22:14:05-rbd-wip-mgolub-testing---basic-smithi/815813/teuthology.log

dillaman · 2017-02-15T15:42:29Z

@trociny The ImageReplayer class instances for the down images are logging "start: already running" every minute, so I don't think it's related to any changes in this PR. I'll see if I can reproduce it locally to figure out how it got stuck in a pseudo-started state for the primary images. Most likely it is some odd race in the ImageReplayer shutdown (triggered because the image is primary).

trociny · 2017-02-15T15:54:09Z

@dillaman I saw these "already started" messages, but to me they looked like they were for "normally started" replayers? Here are a couple of examples:

2017-02-14 23:00:41.990556 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:00:41.990570 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8019700 [2/9c5abc40-0734-4d0e-a50a-9b471a103884] start: already running
2017-02-14 23:00:41.990575 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8022130 [2/9f912876-1044-47bc-bbb1-9f357a136781] start: already running
...
2017-02-14 23:01:11.997393 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:11.997405 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8019700 [2/9c5abc40-0734-4d0e-a50a-9b471a103884] start: already running
2017-02-14 23:01:11.997410 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8022130 [2/9f912876-1044-47bc-bbb1-9f357a136781] start: already running

and the reported status for these images on cluster1 is the expected up+stopped:

2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_2:
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   619e3619-260d-48be-9712-3785f0c49e44
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

2017-02-14T23:17:48.842 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_6:
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   9c5abc40-0734-4d0e-a50a-9b471a103884
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_7:
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   9f912876-1044-47bc-bbb1-9f357a136781
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.845 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

I interpreted the "already started" errors to mean that the daemon was, for some reason, so slow that the previous "try to start, detect it is primary" iteration was still in progress when the new one started.

dillaman · 2017-02-15T16:08:04Z

@trociny In that case -- I'd hold off on merging this until I can repeat and ensure this isn't a regression.

dillaman · 2017-02-16T01:10:06Z

@trociny I've had the stress test running in a constant loop on my machine all day without failure, until just now when the rbd CLI hung attempting to acquire the image status. It looks like it was a low-level objecter issue, since it had a hung request that required restarting the OSD to recover. If the rbd-mirror daemon hit an osdc / osd bug similar to the one I hit, that could explain the "already started" error messages: it would have been attempting to shut down while librados was effectively non-responsive.

TL;DR: I think this PR is functional and I need to see if I can track down the osd / osdc issue that caused a permanently hung request.

trociny · 2017-02-16T07:34:13Z

@dillaman Was the hang permanent in your case? In that teuthology test failure it looked more like a slowdown than a permanent hang:

log$ zcat *mirr*gz |grep '619e3619-260d-48be-9712-3785f0c49e44.*start: already running' |sort
2017-02-14 22:56:11.928297 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:00:41.990556 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:11.997393 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:42.003361 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:03:12.018426 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:03:42.023971 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:04:12.029526 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:04:42.033766 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:15:12.340505 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running

Note that the errors were not observed continuously. To me it looked like there were periods, lasting up to 1.5 minutes, when the image replayer was too slow in its periodic run (start -> check the image -> detect it is primary -> stop), and during these periods the next attempts (at a 30 sec interval) logged 'already running' errors. I thought the cause of this slowdown might have been the mess introduced by the injected socket failures.
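
The overlap described above can be modelled with a small, purely illustrative simulation. It assumes one start attempt every `interval_secs` and a fixed cycle duration; `count_already_running` is a hypothetical helper and does not reflect the actual rbd-mirror scheduling code.

```cpp
#include <cassert>

// Counts how many periodic start attempts find the previous
// start -> check -> stop cycle still in progress.
// cycle_secs:    assumed duration of one full cycle
// interval_secs: period between start attempts (30 in the log above)
// attempts:      number of attempts, made at t = 0, interval, 2*interval, ...
inline int count_already_running(int cycle_secs, int interval_secs,
                                 int attempts) {
  assert(cycle_secs >= 0 && interval_secs > 0 && attempts >= 0);
  int busy_until = 0;  // time at which the in-flight cycle finishes
  int rejected = 0;    // attempts that would log "start: already running"
  for (int i = 0; i < attempts; ++i) {
    const int now = i * interval_secs;
    if (now < busy_until) {
      ++rejected;                     // previous cycle still running
    } else {
      busy_until = now + cycle_secs;  // begin a new cycle
    }
  }
  return rejected;
}
```

For example, with a 90 second cycle and a 30 second interval, two out of every three attempts are rejected, whereas a 10 second cycle produces no rejections at all. That matches the intermittent pattern expected from a slowdown rather than a permanent hang.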

dillaman added cleanup rbd labels Feb 14, 2017

dillaman requested a review from trociny February 14, 2017 18:08

trociny self-assigned this Feb 14, 2017

trociny added the wip-mgolub-testing label Feb 14, 2017

dillaman force-pushed the wip-rbd-mirror-cleanup branch from 16ddb47 to d6b1328 Compare February 14, 2017 21:00

trociny reviewed Feb 14, 2017

Jason Dillaman added 7 commits February 14, 2017 17:41

test: added missing IoCtx copy/assignment methods in librados_test_stub

c35d307

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

rbd-mirror: utilize global image id as internal unique key

ced7187

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

rbd-mirror: preliminary support to track multiple remote peer image sources

7565acc

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

librbd: improved state tracking within object watcher

77d7d8f

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

cls/rbd: async versions for dir_list and mirror_image_list

4b97611

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

rbd-mirror: async mirror image refresh state machine

7c45b13

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

test: fix unused function warnings due to explicit template instantiation

34016f7

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

dillaman force-pushed the wip-rbd-mirror-cleanup branch from d6b1328 to 34016f7 Compare February 15, 2017 00:21

trociny approved these changes Feb 15, 2017

trociny merged commit 321dc61 into ceph:master Feb 16, 2017

dillaman deleted the wip-rbd-mirror-cleanup branch February 16, 2017 13:03
