Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: use explicit flag to cancel RGWCoroutinesManager::run() #12452

Merged
merged 1 commit into from Dec 14, 2016

Conversation

cbodley
Copy link
Contributor

@cbodley cbodley commented Dec 12, 2016

RGWCoroutinesManager::run() was setting ret = -ECANCELED to break out of
the loop when it sees going_down. coroutines that failed with -ECANCELED
were confusing this logic and leading to coroutine deadlock assertions
below. when we hit the going_down case, set a 'canceled' flag, and check
that flag when deciding whether to break out of the loop

Fixes: http://tracker.ceph.com/issues/17465

RGWCoroutinesManager::run() was setting ret = -ECANCELED to break out of
the loop when it sees going_down. coroutines that failed with -ECANCELED
were confusing this logic and leading to coroutine deadlock assertions
below. when we hit the going_down case, set a 'canceled' flag, and check
that flag when deciding whether to break out of the loop

Fixes: http://tracker.ceph.com/issues/17465

Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley
Copy link
Contributor Author

cbodley commented Dec 12, 2016

reproduced the deadlock with test_multi.py by injecting ECANCELED errors:

diff --git a/src/rgw/rgw_sync.cc b/src/rgw/rgw_sync.cc
index c44305e..c9761ac 100644
--- a/src/rgw/rgw_sync.cc
+++ b/src/rgw/rgw_sync.cc
@@ -1159,7 +1159,12 @@ int RGWMetaSyncSingleEntryCR::operate() {
     retcode = 0;
     for (tries = 0; tries < NUM_TRANSIENT_ERROR_RETRIES; tries++) {
       if (sync_status != -ENOENT) {
+        retcode = -ECANCELED;
-          yield call(new RGWMetaStoreEntryCR(sync_env, raw_key, md_bl));
       } else {
           yield call(new RGWMetaRemoveEntryCR(sync_env, raw_key));
       }

with this PR applied, the deadlock is avoided. will schedule for teuthology

Copy link
Member

@yehudasa yehudasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cbodley
Copy link
Contributor Author

cbodley commented Dec 14, 2016

http://pulpito.ceph.com/cbodley-2016-12-13_10:08:00-rgw-wip-rgw-fix-cr-deadlock---basic-mira/ completed with only valgrind failures and selinux denials

@yehudasa yehudasa merged commit a5bc1f4 into master Dec 14, 2016
@cbodley cbodley deleted the wip-rgw-fix-cr-deadlock branch December 14, 2016 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants