rbd-mirror: track images via global image id #13416
Conversation
@dillaman Don't you think we might observe the false alarms you fixed in the "qa/workunits/rbd" commits when testing our stable branches? I mean, it might make sense to backport these, and a separate PR would make that easier. Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.
lol -- indeed. I'll put both of those fixes into a backport PR.
Force-pushed from 16ddb47 to d6b1328
src/tools/rbd_mirror/Replayer.h (outdated)

      };

    -  std::set<InitImageInfo> m_init_images;
    +  std::set<ImageId> m_init_images;
ImageIds?
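The suggestion is to use the existing set typedef instead of spelling out the element type. A minimal sketch, assuming the rbd-mirror type headers define an `ImageIds` typedef along these lines (the struct fields here are illustrative, not the exact upstream definition):

```cpp
#include <set>
#include <string>

// Illustrative stand-in for the rbd-mirror ImageId type; ordering by
// global id matches this PR's "track images via global image id" theme.
struct ImageId {
  std::string global_id;
  std::string id;

  bool operator<(const ImageId &rhs) const {
    return global_id < rhs.global_id;
  }
};
typedef std::set<ImageId> ImageIds;

// The member declaration would then read:
//   ImageIds m_init_images;
```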
src/librbd/Watcher.cc (outdated)

      std::swap(unregister_watch_ctx, m_unregister_watch_ctx);
      if (r < 0) {
        lderr(m_cct) << ": failed to register watch: " << cpp_strerror(r)
While we're here, it would be nice to fix the error message prefix (add `this` and remove the leading ':').
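A hedged sketch of what that fix might look like, keeping the logging style visible in the hunk above (whether this file's convention puts `this` first in the prefix is an assumption):

```cpp
if (r < 0) {
  lderr(m_cct) << this << " failed to register watch: " << cpp_strerror(r)
               << dendl;
}
```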
    template <typename I>
    void RefreshImagesRequest<I>::handle_mirror_image_list(int r) {
      dout(10) << ": r=" << r << dendl;
redundant ": " here and in other messages below.
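For example, the debug line above would become the following (assuming the dout prefix macro in this file already emits the function name, which is why the extra ": " is redundant):

```cpp
dout(10) << "r=" << r << dendl;
```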
…ources Signed-off-by: Jason Dillaman <dillaman@redhat.com>
…tion Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Force-pushed from d6b1328 to 34016f7
@trociny updates pushed
@dillaman LGTM
Though I observed one test failure [1]. It failed in the "create many images" stress test because the replayers for many images had not been started.
I have been unable to reproduce it by rerunning the test many times, and I don't know whether your changes are the cause. It looks like a pre-existing issue.
@trociny The
@dillaman I saw these "already started" messages, but to me they looked like they were for "normally started" replayers. Here are a couple of examples:
and the expected reported status for these images is up+stopped on cluster1:
I interpreted the "already started" errors as meaning the daemon was, for some reason, so slow that the previous "try to start, detect it is primary" iteration was still in progress when the new one was started.
@trociny In that case, I'd hold off on merging this until I can reproduce it and make sure this isn't a regression.
@trociny I've had the stress test running in a constant loop on my machine all day without failure, until just now, when the rbd CLI hung attempting to acquire the image status. It looks like it was a low-level objecter issue, since it had a hung request that required restarting the OSD to recover. If the rbd-mirror daemon experienced an osdc/osd bug similar to the one I hit, that could explain the "already started" error messages: it would have been attempting to shut down while librados was effectively non-responsive. TL;DR: I think this PR is functional, and I need to see if I can track down the osd/osdc issue that caused a permanently hung request.
@dillaman Was that hang permanent in your case? In the teuthology test failure it looked more like a slowdown than a permanent hang:
Note, the errors were not observed permanently. It looked like there were periods when the image replayer was too slow in its periodic run (start -> check the image -> detect it is primary -> stop); these periods lasted up to 1.5 minutes, and during them the next attempts (at 30-second intervals) produced 'already running' errors. I thought the cause of this slowdown might have been a mess introduced by the injected socket failures.
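A toy, self-contained illustration of the overlap described above -- this is not the actual rbd-mirror code, and the mapping of -EALREADY to the "already started" message is an assumption:

```cpp
#include <cerrno>
#include <iostream>
#include <mutex>

// If a new periodic start attempt (fired every 30 seconds) arrives while
// the previous start -> check -> detect-primary -> stop iteration is
// still in flight, the new attempt is rejected.
class ToyImageReplayer {
public:
  int start() {
    std::lock_guard<std::mutex> locker(m_lock);
    if (m_running) {
      return -EALREADY;  // surfaces as an "already started" error
    }
    m_running = true;    // iteration begins; a slow daemon delays finish
    return 0;
  }

  void finish_iteration() {
    std::lock_guard<std::mutex> locker(m_lock);
    m_running = false;
  }

private:
  std::mutex m_lock;
  bool m_running = false;
};

int main() {
  ToyImageReplayer replayer;
  replayer.start();                     // first periodic attempt
  if (replayer.start() == -EALREADY) {  // overlapping attempt
    std::cout << "already started" << std::endl;
  }
  replayer.finish_iteration();
  return 0;
}
```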