mon: MonClient may hang on pinging an unresponsive monitor #9259

Merged
merged 2 commits into from Aug 5, 2016

xiexingguo
Member

On timeout, the WaitUntil() method returns a positive ETIMEDOUT error number.
It never returns -ETIMEDOUT.

Signed-off-by: xie xingguo xie.xingguo@zte.com.cn
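
To illustrate the sign mismatch described above, here is a minimal Python simulation. It is not Ceph's C++ code: `wait_until`, `ping_buggy`, and `ping_fixed` are hypothetical names, and the `max_rounds` cap exists only so the sketch terminates where an uncapped loop would keep waiting.

```python
import errno
import threading
import time


def wait_until(cond: threading.Condition, deadline: float) -> int:
    """Hypothetical stand-in for a WaitUntil()-style call.

    Returns 0 if notified before the deadline, otherwise the POSITIVE
    errno.ETIMEDOUT, following the convention described above.
    """
    remaining = max(0.0, deadline - time.monotonic())
    with cond:
        notified = cond.wait(timeout=remaining)
    return 0 if notified else errno.ETIMEDOUT


def ping_buggy(cond, timeout, max_rounds=3):
    """Compares against -ETIMEDOUT, which never matches a positive value."""
    rounds = 0
    while rounds < max_rounds:  # cap is an assumption so the sketch ends
        rounds += 1
        ret = wait_until(cond, time.monotonic() + timeout)
        if ret == 0:
            return 0, rounds
        if ret == -errno.ETIMEDOUT:  # wrong sign: never true
            return ret, rounds
    return None, rounds  # without the cap this loop would not exit


def ping_fixed(cond, timeout):
    """Compares against the positive value and reports a negative errno."""
    ret = wait_until(cond, time.monotonic() + timeout)
    if ret == errno.ETIMEDOUT:
        return -errno.ETIMEDOUT, 1
    return ret, 1
```

With nobody notifying the condition, `ping_buggy` exhausts its rounds without ever seeing a timeout, while `ping_fixed` reports one after a single wait.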

@tchaikov tchaikov changed the title mon: MonClient may hang on pinging monitor forever mon: MonClient may hang on pinging a dead monitor May 23, 2016
@tchaikov tchaikov changed the title mon: MonClient may hang on pinging a dead monitor mon: MonClient may hang on pinging an unresponsive monitor May 23, 2016
@tchaikov
Contributor

lgtm.

ping_monitor() is only used by rados_ping_monitor() in librados and by ceph's ping command, so it's not necessary to backport this fix.

@yuriw
Contributor

yuriw commented May 25, 2016

@xiexingguo
Member Author

@yuriw Thanks, yuri. I'll check and verify it locally.

@xiexingguo
Member Author

@tchaikov I retested the test case above that failed with these changes, and it passed (http://daisycloud.org:9091/xxg-2016-06-13_15:24:56-rados-wip-xxg-testing---basic-plana/468/).
My guess is that one of the monitors happened to be unable to come up and respond within 300s. That wasn't a problem before this change because we would wait forever (so perhaps this test case was bound to die in that situation?).

Could you review (and retest, if possible) this pr for me? Thanks!

@athanatos
Contributor

@xiexingguo It sounds like the failure is non-deterministic. You need to change the test so that we don't get false failures.

@xiexingguo
Member Author

@@ -143,8 +143,10 @@ def test_ping_monitor(self):
         cmd = {'prefix': 'mon dump', 'format':'json'}
         ret, buf, out = self.rados.mon_command(json.dumps(cmd), b'')
         for mon in json.loads(buf.decode('utf8'))['mons']:
             buf = json.loads(self.rados.ping_monitor(mon['name']))
             assert buf.get('health')
             while True:
Contributor

could you elaborate a little bit on this change?

Member Author

See below:
I retested the test case above that failed with these changes, and it passed (http://daisycloud.org:9091/xxg-2016-06-13_15:24:56-rados-wip-xxg-testing---basic-plana/468/).
My guess is that one of the monitors happened to be unable to come up and respond within 300s. That wasn't a problem before this change because we would wait forever (so perhaps this test case was bound to die in that situation?).

Member Author

@tchaikov To be more specific:
before this change, this test case wasn't a problem because we would ping and wait for a response forever.
But this change allows the monitor ping to return an error on ETIMEDOUT, which will fail the test...
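
One way to make the test tolerate a timed-out ping is to retry until a reply with a `health` field arrives. The sketch below is only an illustration of that idea, not the code merged in this PR: `FakeRados` is a hypothetical stand-in for the rados handle, the assumption that a timed-out ping yields `None` is mine, the `max_attempts` bound is added so the loop cannot spin forever, and the reply payload is made up.

```python
import json


class FakeRados:
    """Hypothetical stand-in for a rados handle; not the real binding."""

    def __init__(self, replies):
        self._replies = list(replies)

    def ping_monitor(self, name):
        # Assumption: a timed-out ping yields None instead of a JSON string.
        return self._replies.pop(0)


def ping_until_healthy(rados, mon_name, max_attempts=10):
    """Retry the ping until a reply carrying a 'health' field arrives."""
    for attempt in range(1, max_attempts + 1):
        output = rados.ping_monitor(mon_name)
        if output is None:
            continue  # treated as a timeout: try again
        reply = json.loads(output)
        if reply.get('health'):
            return reply, attempt
    raise AssertionError(
        "no healthy reply from mon.%s after %d attempts" % (mon_name, max_attempts)
    )
```

Bounding the retries trades the old "wait forever" behavior for a clear failure message if a monitor never answers.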

@yuriw yuriw merged commit 23e7318 into ceph:master Aug 5, 2016
@tchaikov tchaikov self-assigned this Aug 5, 2016
@xiexingguo xiexingguo deleted the xxg-wip-fix-monclientpinger branch August 5, 2016 22:31
5 participants