New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mon: MonClient may hang on pinging an unresponsive monitor #9259
Conversation
lgtm.
|
@xiexingguo we see "TestRados.test_ping_monitor ... ERROR" in run http://pulpito.ceph.com/yuriw-2016-05-24_20:27:36-rados-wip-yuri-testing-distro-basic-smithi/ Suspecting this PR is at fault per IRC with @athanatos . |
@yuriw Thanks, yuri. I'll check and verify it locally. |
@tchaikov I retested this above test case that failed these changes and it passed(http://daisycloud.org:9091/xxg-2016-06-13_15:24:56-rados-wip-xxg-testing---basic-plana/468/). Could you review (and retest, if possible) this pr for me? Thanks! |
@xiexingguo It sounds like the failure is non-deterministic. You need to change the test so that we don't get false failures. |
On timedout, WaitUntil() method return a positive ETIMEDOUT error number. It never returns -ETIMEDOUT. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
cf2d42b
to
9f9a8aa
Compare
Fixed test case and retested here: |
@@ -143,8 +143,10 @@ def test_ping_monitor(self): | |||
cmd = {'prefix': 'mon dump', 'format':'json'} | |||
ret, buf, out = self.rados.mon_command(json.dumps(cmd), b'') | |||
for mon in json.loads(buf.decode('utf8'))['mons']: | |||
buf = json.loads(self.rados.ping_monitor(mon['name'])) | |||
assert buf.get('health') | |||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you elaborate a little bit on this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below:
I retested this above test case that failed these changes and it passed(http://daisycloud.org:9091/xxg-2016-06-13_15:24:56-rados-wip-xxg-testing---basic-plana/468/).
So the problem I guess is it happened that one of the monitors was unable to up and respond in 300s, which wouldn't be a problem before this change as we would wait forever(so perhaps this test case was going to die under this case?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tchaikov To be more specific:
before this change this test case won't be a problem as we'll ping and wait for response forever.
But this change allow monitor to return error on ETIMEDOUT which will fail the test...
On timedout, WaitUntil() method return a positive ETIMEDOUT error number.
It never returns -ETIMEDOUT.
Signed-off-by: xie xingguo xie.xingguo@zte.com.cn