New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rgw multisite: fix for assertion in RGWMetaSyncCR #10743
Conversation
@@ -1743,7 +1743,10 @@ class RGWMetaSyncCR : public RGWCoroutine { | |||
} | |||
} | |||
// wait for each shard to complete | |||
collect(&ret, NULL); | |||
while (ret == 0 && num_spawned() > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, bad copy/paste. this should be num_spawned() > 0
while testing with multiple gateways per zone, i started seeing omap-get-keys requests fail with EIO. this led to a failed 'assert(next)' in RGWMetaSyncCR due to a bug in error handling Fixes: http://tracker.ceph.com/issues/17044 Signed-off-by: Casey Bodley <cbodley@redhat.com>
6663408
to
ced2eb2
Compare
collect(&ret, NULL); | ||
while (ret == 0 && num_spawned() > 0) { | ||
yield wait_for_child(); | ||
collect(&ret, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbodley what's the point of collect() here if we don't do anything with ret
? could just drain_all(), or do something with the error (e.g., log it somewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use ret
a few lines down:
if (ret < 0) {
return set_cr_error(ret);
}
// advance to the next period
assert(next);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbodley maybe still add some log message there anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point, there's already a long chain of error messages. i don't think RGWMetaSyncCR has anything useful to add:
-- 10.17.151.111:0/3466678136 <== osd.0 10.17.151.111:6804/29151 215 ==== osd_op_reply(193 meta.full-sync.index.0 [omap-get-vals 0~16] v0'0 uv0 ack = -2 ((2) No such file or directory)) v7 ==== 142+0+0 (2605646485 0 0) 0xb808340 con 0xb716900
cr:s=0xbc20a50:op=0xbb1b000:21RGWRadosGetOmapKeysCR: operate() returned r=-5
rgw meta sync: ERROR: full_sync(): RGWRadosGetOmapKeysCR() returned ret=-5
rgw meta sync: sync: full_sync: shard_id=0 r=-5
cr:s=0xbc20a50:op=0xb6e9e00:18RGWMetaSyncShardCR: operate() returned r=-5
cr:s=0xbc20a50:op=0xb7e4a00:25RGWMetaSyncShardControlCR: operate() returned r=-5
(though maybe you can help me understand where that -EIO is coming from? that omap-get-vals reply says -ENOENT)
@cbodley just that log maybe, other than that lgtm |
while testing with multiple gateways per zone, i started seeing
omap-get-keys requests fail with EIO. this led to a failed
'assert(next)' in RGWMetaSyncCR due to a bug in error handling
Fixes: http://tracker.ceph.com/issues/17044