New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: fix dump_ops_in_flight races #8044
Conversation
Code consistency Signed-off-by: David Zafman <dzafman@redhat.com>
Haven't written an automated test case, yet. Rebased to jewel so waiting for my own make check to verify. |
// Transition from true -> false without locks being held | ||
// Can never see final_decode_needed == false and partial_decode_needed == true | ||
volatile bool partial_decode_needed; | ||
volatile bool final_decode_needed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just looking up the volatile keyword here, but I'm not sure this works, strictly speaking. http://en.cppreference.com/w/cpp/language/cv and https://en.wikipedia.org/wiki/Volatile_(computer_programming)#In_C_and_C.2B.2B are both pretty clear that volatile
is not "for inter-thread communication"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, can we use atomic<bool>
instead here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that'll get expensive; don't we dereference those bools every time we look at an MOSDOp? I think we're just using those as flags to determine if we need to grab a lock and do the decode before looking at memory, or if we can look at the memory directly, which means presumably with careful coding it's okay if we read a stale value (we'd just grab the lock needlessly). But apparently it's being done wrongly, or maybe it really is just that the memory reads are being reordered incorrectly (which is what volatile
prevents)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, volatile really isn't appropriate here. atomic<bool>
would be cleanest.
Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
Doing this for clarity, since I haven't identified it as a bug. Signed-off-by: David Zafman <dzafman@redhat.com>
Another crash caused by a dump_ops_in_flight similar to ceph#8885 Signed-off-by: David Zafman <dzafman@redhat.com>
{ | ||
RWLock::RLocker l(tracker->lock); | ||
if (tracker->tracking_enabled) { | ||
tracker->register_inflight_op(&xitem); | ||
events.push_back(make_pair(initiated_at, "initiated")); | ||
is_tracked = true; | ||
// Tells tracking_start() whether to set is_tracked or not | ||
is_registered = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this does solve the problem, but I think the better solution would be to move this code into a TrackedOp::register method and call it in OpTracker::create_request. That way, we never have a half-constructed object registered in the first place and can avoid having both is_tracked and is_registered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is, we can have just is_tracked rather than both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the fix for this by moving that constructor code into tracking_start() so that is_registered isn't needed.
Use is_tracked to prevent TrackedOp::dump() from trying to call virtual function while still in OpRequest constructor. Fixes: ceph#8885 Signed-off-by: David Zafman <dzafman@redhat.com>
…op() Make tracking_enabled and the lock private. Signed-off-by: David Zafman <dzafman@redhat.com>
lgtm |
passed my rados run |
Fixes: #8885