Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: fix dump_ops_in_flight races #8044

Merged
merged 7 commits into from Mar 25, 2016
Merged

osd: fix dump_ops_in_flight races #8044

merged 7 commits into from Mar 25, 2016

Conversation

dzafman
Copy link
Contributor

@dzafman dzafman commented Mar 11, 2016

Fixes: #8885

Code consistency

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman
Copy link
Contributor Author

dzafman commented Mar 11, 2016

Haven't written an automated test case, yet. Rebased to jewel so waiting for my own make check to verify.

// Transition from true -> false without locks being held
// Can never see final_decode_needed == false and partial_decode_needed == true
volatile bool partial_decode_needed;
volatile bool final_decode_needed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just looking up the volatile keyword here, but I'm not sure this works, strictly speaking. http://en.cppreference.com/w/cpp/language/cv and https://en.wikipedia.org/wiki/Volatile_(computer_programming)#In_C_and_C.2B.2B are both pretty clear that volatile is not "for inter-thread communication"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, can we use atomic<bool> instead here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'll get expensive; don't we dereference those bools every time we look at an MOSDOp? I think we're just using those as flags to determine if we need to grab a lock and do the decode before looking at memory, or if we can look at the memory directly, which means presumably with careful coding it's okay if we read a stale value (we'd just grab the lock needlessly). But apparently it's being done wrongly, or maybe it really is just that the memory reads are being reordered incorrectly (which is what volatile prevents)...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, volatile really isn't appropriate here. atomic<bool> would be cleanest.

Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
Doing this for clarity, since I haven't identified it as a bug.

Signed-off-by: David Zafman <dzafman@redhat.com>
Another crash caused by a dump_ops_in_flight similar to ceph#8885

Signed-off-by: David Zafman <dzafman@redhat.com>
{
RWLock::RLocker l(tracker->lock);
if (tracker->tracking_enabled) {
tracker->register_inflight_op(&xitem);
events.push_back(make_pair(initiated_at, "initiated"));
is_tracked = true;
// Tells tracking_start() whether to set is_tracked or not
is_registered = true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this does solve the problem, but I think the better solution would be to move this code into a TrackedOp::register method and call it in OpTracker::create_request. That way, we never have a half-constructed object registered in the first place and can avoid having both is_tracked and is_registered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, we can have just is_tracked rather than both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the fix for this by moving that constructor code into tracking_start() so that is_registered isn't needed.

Use is_tracked to prevent TrackedOp::dump() from trying to call
virtual function while still in OpRequest constructor.

Fixes: ceph#8885

Signed-off-by: David Zafman <dzafman@redhat.com>
…op()

Make tracking_enabled and the lock private.

Signed-off-by: David Zafman <dzafman@redhat.com>
@athanatos
Copy link
Contributor

lgtm

@liewegas
Copy link
Member

passed my rados run

@liewegas liewegas changed the title Fix dump_ops_in_flight races osd: fix dump_ops_in_flight races Mar 25, 2016
@liewegas liewegas merged commit bc3dcd2 into ceph:jewel Mar 25, 2016
@dzafman dzafman deleted the wip-8885 branch March 29, 2016 00:53
@dzafman dzafman restored the wip-8885 branch March 29, 2016 01:01
@dzafman dzafman deleted the wip-8885 branch March 29, 2016 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants