Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw multisite: fix for assertion in RGWMetaSyncCR #10743

Merged
merged 1 commit into from Aug 23, 2016

Conversation

cbodley
Copy link
Contributor

@cbodley cbodley commented Aug 16, 2016

while testing with multiple gateways per zone, i started seeing
omap-get-keys requests fail with EIO. this led to a failed
'assert(next)' in RGWMetaSyncCR due to a bug in error handling

Fixes: http://tracker.ceph.com/issues/17044

@@ -1743,7 +1743,10 @@ class RGWMetaSyncCR : public RGWCoroutine {
}
}
// wait for each shard to complete
collect(&ret, NULL);
while (ret == 0 && num_spawned() > 1) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, bad copy/paste. this should be num_spawned() > 0

while testing with multiple gateways per zone, i started seeing
omap-get-keys requests fail with EIO. this led to a failed
'assert(next)' in RGWMetaSyncCR due to a bug in error handling

Fixes: http://tracker.ceph.com/issues/17044

Signed-off-by: Casey Bodley <cbodley@redhat.com>
collect(&ret, NULL);
while (ret == 0 && num_spawned() > 0) {
yield wait_for_child();
collect(&ret, nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbodley what's the point of collect() here if we don't do anything with ret? could just drain_all(), or do something with the error (e.g., log it somewhere).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use ret a few lines down:

         if (ret < 0) {
           return set_cr_error(ret);
         }
         // advance to the next period
         assert(next);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbodley maybe still add some log message there anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point, there's already a long chain of error messages. i don't think RGWMetaSyncCR has anything useful to add:

-- 10.17.151.111:0/3466678136 <== osd.0 10.17.151.111:6804/29151 215 ==== osd_op_reply(193 meta.full-sync.index.0 [omap-get-vals 0~16] v0'0 uv0 ack = -2 ((2) No such file or directory)) v7 ==== 142+0+0 (2605646485 0 0) 0xb808340 con 0xb716900
cr:s=0xbc20a50:op=0xbb1b000:21RGWRadosGetOmapKeysCR: operate() returned r=-5
rgw meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-5
rgw meta sync: sync: full_sync: shard_id=0 r=-5
cr:s=0xbc20a50:op=0xb6e9e00:18RGWMetaSyncShardCR: operate() returned r=-5
cr:s=0xbc20a50:op=0xb7e4a00:25RGWMetaSyncShardControlCR: operate() returned r=-5

(though maybe you can help me understand where that -EIO is coming from? that omap-get-vals reply says -ENOENT)

@yehudasa
Copy link
Member

@cbodley just that log maybe, other than that lgtm

@yehudasa yehudasa merged commit d8a02b9 into ceph:master Aug 23, 2016
@cbodley cbodley deleted the wip-rgw-meta-sync-assert branch August 23, 2016 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants