Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd-mirror: track images via global image id #13416

Merged
merged 7 commits into from Feb 16, 2017

Conversation

dillaman
Copy link

No description provided.

@trociny
Copy link
Contributor

trociny commented Feb 14, 2017

@dillaman Don't you think we can observe the false alarms you fixed in "qa/workunits/rbd" commits when testing our stable branches? I mean that it might make sense to backport these and a separate PR would make it easier?

Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.

@dillaman
Copy link
Author

@trociny

Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.

lol -- indeed. I'll put both of those fixes into a backport PR.

};

std::set<InitImageInfo> m_init_images;
std::set<ImageId> m_init_images;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImageIds?


std::swap(unregister_watch_ctx, m_unregister_watch_ctx);
if (r < 0) {
lderr(m_cct) << ": failed to register watch: " << cpp_strerror(r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While here it would be nice to fix the error message prefix (add this and remove ':').


template <typename I>
void RefreshImagesRequest<I>::handle_mirror_image_list(int r) {
dout(10) << ": r=" << r << dendl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant ": " here and in other messages below.

Jason Dillaman added 7 commits February 14, 2017 17:41
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
…ources

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
…tion

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@dillaman
Copy link
Author

@trociny updates pushed

Copy link
Contributor

@trociny trociny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dillaman LGTM
Though I observed one test failure [1]. It failed in "create many images" stress test, because the replayers for many images had not been started.

I have failed to reproduce this rerunning the test many times and don't have an idea if the cause is your changes. It looks like an already existent issue.

[1] http://qa-proxy.ceph.com/teuthology/trociny-2017-02-14_22:14:05-rbd-wip-mgolub-testing---basic-smithi/815813/teuthology.log

@dillaman
Copy link
Author

@trociny The ImageReplayer class instances for the down images are complaining that "start: already running" every minute so I don't think it's related to any changes in this PR. I'll see if I can locally reproduce to figure out how it got stuck in a pseudo-started state for the primary images. Most likely some odd ImageReplayer shut down (due to the image being primary) race.

@trociny
Copy link
Contributor

trociny commented Feb 15, 2017

@dillaman I saw these "already started" messages but for me they looked like were for "normally started" replayers? Here are couple of examples:

2017-02-14 23:00:41.990556 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:00:41.990570 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8019700 [2/9c5abc40-0734-4d0e-a50a-9b471a103884] start: already running
2017-02-14 23:00:41.990575 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8022130 [2/9f912876-1044-47bc-bbb1-9f357a136781] start: already running
...
2017-02-14 23:01:11.997393 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:11.997405 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8019700 [2/9c5abc40-0734-4d0e-a50a-9b471a103884] start: already running
2017-02-14 23:01:11.997410 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8022130 [2/9f912876-1044-47bc-bbb1-9f357a136781] start: already running

and the expected reported status for these images up+stopped on cluster1:

2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_2:
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   619e3619-260d-48be-9712-3785f0c49e44
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

2017-02-14T23:17:48.842 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_6:
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   9c5abc40-0734-4d0e-a50a-9b471a103884
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_7:
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   9f912876-1044-47bc-bbb1-9f357a136781
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.845 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

I interpreted "already started" errors as that the daemon was so slow by some reason that previous "try to start, detect it is primary" iteration was still in progress when the new one was started.

@dillaman
Copy link
Author

@trociny In that case -- I'd hold off on merging this until I can repeat and ensure this isn't a regression.

@dillaman
Copy link
Author

@trociny I've had the stress test running in a constant loop on my machine all day w/o fail until just now where the rbd CLI hung attempting to acquire the image status. It looks like it was a low-level objecter issue since it had a hung request that required restarting the OSD to recover. If the rbd-mirror daemon experienced a similar osdc / osd bug like I hit, that could possibly explain the error messages about "already started" if it was attempting to shut-down but librados was effectively non-responsive.

TL;DR: I think this PR is functional and I need to see if I can track down the osd / osdc issue that caused a permanently hung request.

@trociny
Copy link
Contributor

trociny commented Feb 16, 2017

@dillaman Was that hung permanent in your case? Because in that teuthology test failure it looked like rather a slowdown, not a permanent hung:

log$ zcat *mirr*gz |grep '619e3619-260d-48be-9712-3785f0c49e44.*start: already running' |sort
2017-02-14 22:56:11.928297 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:00:41.990556 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:11.997393 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:42.003361 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:03:12.018426 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:03:42.023971 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:04:12.029526 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:04:42.033766 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:15:12.340505 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running

Note, the errors were not observed permanently. For me it looked like there were periods of time when the image replayer was too slow in its periodic run (start -> check the image -> detect it is primary -> stop), which lasted for up to 1.5 minutes and during these periods 'already running' errors observed from the next (30 sec interval) attempts. I thought the cause of this slowdown might have been a mess introduced by the injected socket failures.

@trociny trociny merged commit 321dc61 into ceph:master Feb 16, 2017
@dillaman dillaman deleted the wip-rbd-mirror-cleanup branch February 16, 2017 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants