jewel: rgw: multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors #12738

Merged: 1 commit into ceph:jewel on Feb 1, 2017

smithfarm

RGWCoroutinesManager::run() was setting ret = -ECANCELED to break out of
the loop when it saw going_down. Coroutines that failed with -ECANCELED
were confusing this logic and leading to the coroutine deadlock assertions
below. When we hit the going_down case, set a 'canceled' flag, and check
that flag when deciding whether to break out of the loop.

Fixes: http://tracker.ceph.com/issues/17465

Signed-off-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 73cd8df)
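
The fix described in the commit message above can be illustrated with a minimal C++ sketch. This is not the Ceph code: `run_loop`, `RunResult`, `op_results`, and `use_canceled_flag` are hypothetical names, each queued integer stands in for the return value of one coroutine operation, and `going_down` is a plain `bool` fixed for the whole call. The sketch only shows why testing the return value for -ECANCELED conflates "an operation failed with -ECANCELED" with "the manager is shutting down", and how a separate flag keeps the two apart.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <deque>

// Outcome of one run of the toy loop (hypothetical type, not from Ceph).
struct RunResult {
  int ret;                // last return value seen by the loop
  bool stopped_early;     // true if the loop hit its break statement
  std::size_t remaining;  // operations still queued when the loop exited
};

// op_results: return values the queued operations will produce, in order.
// going_down: stands in for the manager's shutdown indicator.
// use_canceled_flag: false = old check on ret, true = check a dedicated flag.
RunResult run_loop(std::deque<int> op_results, bool going_down,
                   bool use_canceled_flag) {
  int ret = 0;
  bool canceled = false;
  bool stopped_early = false;

  while (!op_results.empty()) {
    ret = op_results.front();
    op_results.pop_front();

    if (going_down) {
      // Shutdown requested: report -ECANCELED and remember why.
      ret = -ECANCELED;
      canceled = true;
    }

    // Old check: any -ECANCELED result looks like a shutdown.
    // New check: only the dedicated flag ends the loop.
    const bool stop = use_canceled_flag ? canceled : (ret == -ECANCELED);
    if (stop) {
      stopped_early = true;
      break;
    }
  }
  return RunResult{ret, stopped_early, op_results.size()};
}
```

In this sketch, with queued results `{0, -ECANCELED, 0}` and no shutdown requested, the old check exits after the second operation and leaves one operation queued, which is the kind of leftover state a deadlock check would flag. The flag-based check drains the queue and exits early only when `going_down` is set.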
smithfarm self-assigned this on Jan 2, 2017
smithfarm added this to the jewel milestone on Jan 2, 2017
smithfarm changed the title from "jewel: multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors" to "jewel: rgw: multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors" on Jan 2, 2017
smithfarm merged commit 5834732 into ceph:jewel on Feb 1, 2017
smithfarm deleted the wip-18286-jewel branch on February 1, 2017 22:01
smithfarm

(11:46:45 AM) smithfarm: owasserm: thanks. For jewel integration rgw, then, what it comes down to is verifying that these 6 valgrind failures are all libtcmalloc-related: http://pulpito.front.sepia.ceph.com/smithfarm-2017-01-31_12:35:14-rgw-wip-jewel-backports-distro-basic-smithi/
(11:46:58 AM) smithfarm: owasserm: I will do that now
(11:47:05 AM) owasserm: smithfarm, thanks
(11:47:33 AM) smithfarm: owasserm: and assuming they are tcmalloc related, you said I can directly merge all the rgw PRs? Or do you want me to ask you for review in the PRs first?
(11:47:53 AM) owasserm: smithfarm, yes you can merge them
(11:48:19 AM) smithfarm: ok, will merge and do at least one or two more rgw runs before passing 10.2.6 to QE

smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Feb 2, 2017
…operly

Catch decode errors so osd doesn't crash on corrupt OI_ATTR or SS_ATTR
Use boost::optional<> to make current state clearer
Create next_clone as needed using head/curclone
Add equivalent logic after getting to end of scrubmap.objects

Fixes: ceph#12738

Signed-off-by: David Zafman <dzafman@redhat.com>
(cherry picked from commit a23036c)

Conflicts:
	src/osd/ReplicatedPG.cc (no num_objects_pinned in hammer)
	src/osd/ReplicatedPG.h (no get_temp_recovery_object() in hammer)