msg/async: set nonce before starting the workers #12390

Merged
merged 1 commit into from Dec 12, 2016

Conversation

@tchaikov tchaikov commented Dec 8, 2016

otherwise workers will respond with different nonces to peers.
also remove nonce from Processor, as there is only one nonce for each
Messenger at a given time.

Signed-off-by: Kefu Chai kchai@redhat.com
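
A minimal sketch of the ordering this description asks for, assuming a single nonce owned by the messenger and read by every worker. `ToyMessenger`, `set_nonce`, `start_workers`, and `workers_agree` are hypothetical names for illustration, not the AsyncMessenger code touched by this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a messenger; not Ceph's actual class.
struct ToyMessenger {
  std::uint64_t nonce = 0;
  bool started = false;
  std::vector<std::uint64_t> advertised;  // nonce each worker reports to peers

  void set_nonce(std::uint64_t n) {
    // The nonce has to be final before any worker can answer a peer.
    assert(!started);
    nonce = n;
  }

  void start_workers(int count) {
    started = true;
    for (int i = 0; i < count; ++i)
      advertised.push_back(nonce);  // every worker reads the one shared nonce
  }

  // True when every worker advertises the messenger's nonce.
  bool workers_agree() const {
    for (std::uint64_t n : advertised)
      if (n != nonce) return false;
    return true;
  }
};
```

Calling `set_nonce()` first and `start_workers()` second leaves all workers with the same value; the assertion in `set_nonce()` flags the reversed order in a debug build.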

ghost commented Dec 8, 2016

jenkins test this please (jenkins long paths hit again http://tracker.ceph.com/issues/15249)

p->start();
} else {
ldout(cct, 10) << __func__ << " will try " << bind_addr
<< " and avoid ports " << new_avoid << dendl;
Member

move this out of loop?

Contributor Author

done

@tchaikov tchaikov force-pushed the wip-start-after-setting-nonce branch from 3aeefb2 to ff37bca Compare December 8, 2016 15:22
@wjwithagen
Contributor

@tchaikov
I will do a test run tonight.

tchaikov commented Dec 8, 2016

@wjwithagen please hold on until the tests pass. i am still fixing this PR, it fails 2 tests.

@tchaikov tchaikov force-pushed the wip-start-after-setting-nonce branch from ff37bca to ac9e2fc Compare December 8, 2016 18:54
tchaikov commented Dec 8, 2016

@wjwithagen fixed. good to test.

tchaikov commented Dec 8, 2016

changelog

  • should be if (r) in AsyncMessenger::bind() and AsyncMessenger::rebind(), not if (!r).
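
A small sketch of why the check reads `if (r)`, assuming the common convention that a bind-style call returns 0 on success and a negative errno on failure. `toy_bind` and `bind_succeeded` are hypothetical helpers, not the functions named in the changelog.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical bind-like call following the "0 on success, negative errno
// on failure" convention assumed here.
int toy_bind(int port) {
  return (port > 0 && port < 65536) ? 0 : -EINVAL;
}

// Returns true when it is safe to go on and start the workers.
bool bind_succeeded(int port) {
  int r = toy_bind(port);
  assert(r <= 0);  // convention assumed in this sketch: never positive
  if (r) {
    // Non-zero means failure. Writing `if (!r)` here would take this
    // branch on success instead, inverting the check.
    return false;
  }
  return true;
}
```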

@wjwithagen
Contributor

@tchaikov
Is one of them "socket path too long"?
I have felt for quite some time that these sockets would be better off in /tmp/var/run, /tmp/td/...., or some place like that.

otherwise workers will respond with different nonces to peers.
also remove nonce from Processor, as there is only one nonce for each
Messenger at a given time.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov tchaikov force-pushed the wip-start-after-setting-nonce branch from ac9e2fc to aac1a3e Compare December 8, 2016 19:03
tchaikov commented Dec 8, 2016

changelog

  • start() all bound processors instead of the processors[0..i). they are equivalent, but the new version is easier to understand.
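
One way to see the equivalence claimed here, sketched under the assumption that any bind failure returns before anything is started. `ToyProcessor` and `bind_all_then_start` are hypothetical names; the real bind path may handle failures differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Processor; not Ceph's actual class.
struct ToyProcessor {
  bool bound = false;
  bool started = false;
  void start() {
    assert(bound);  // starting an unbound processor would be a bug
    started = true;
  }
};

// Binds every processor, then starts them. `fail_at` simulates a bind
// failure at that index (pass procs.size() or larger for "no failure").
int bind_all_then_start(std::vector<ToyProcessor>& procs, std::size_t fail_at) {
  for (std::size_t i = 0; i < procs.size(); ++i) {
    if (i == fail_at)
      return -1;  // early return: nothing has been started yet
    procs[i].bound = true;
  }
  // Reaching this point means the loop ran to completion, so "all bound
  // processors" and processors[0..i) describe the same set.
  for (auto& p : procs)
    p.start();
  return 0;
}
```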

tchaikov commented Dec 8, 2016

@wjwithagen no, i didn't take that one into account. the two failed tests were the osd-osd-markdown.sh and cephtool-test-mon.sh. but i believe that they should pass now.

@tchaikov tchaikov self-assigned this Dec 8, 2016
@wjwithagen
Contributor

@tchaikov
Okay, I will grab the code and test.

@wjwithagen
Contributor

Still the same problem:
cephtool-test-mon.sh freezes on
ceph-master/build/bin/ceph tell osd.0 version
repeating over and over in the log:

2016-12-08 21:44:04.450362 b9fad00  1 -- 127.0.0.1:6800/96528 --> 127.0.0.1:7204/0 -- pg_stats(0 pgs tid 97 v 0) v1 -- 0xc29b180 con 0x0
2016-12-08 21:44:04.655291 bb9df80  1 -- 127.0.0.1:6800/96528 <== mon.2 127.0.0.1:7204/0 367 ==== pg_stats_ack(0 pgs tid 97) v1 ==== 4+0+0 (0 0 0) 0xc61e400 con 0xbbce000
2016-12-08 21:44:06.325820 b9f8d80  1 -- 127.0.0.1:6800/1096528 >> - conn(0xc65e800 :6800 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=16 -
2016-12-08 21:44:06.326674 b9f8d80  1 -- 127.0.0.1:6800/1096528 >> - conn(0xc65e800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_bulk peer close file descriptor 16
2016-12-08 21:44:06.326773 b9f8d80  1 -- 127.0.0.1:6800/1096528 >> - conn(0xc65e800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_until read failed
2016-12-08 21:44:06.326803 b9f8d80  1 -- 127.0.0.1:6800/1096528 >> - conn(0xc65e800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection read peer banner and addr failed
2016-12-08 21:44:06.326920 b9f8d80  0 -- 127.0.0.1:6800/1096528 >> - conn(0xc65e800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2016-12-08 21:44:06.327034 b9f8d80  1 -- 127.0.0.1:6800/1096528 reap_dead start

So the pid is 96528, as seen from talking to mon.2.
But the other connection has its nonce set 1000000 higher from a rebind.
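
The arithmetic behind that observation, as a sketch. The step of 1000000 per rebind is taken from the two addresses in the log (96528 and 1096528) and is an assumption here; `kRebindNonceStep`, `nonce_after_rebinds`, and `looks_like_rebound` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Step observed in the logs above (96528 -> 1096528); an assumption here.
constexpr std::uint64_t kRebindNonceStep = 1000000;

// Nonce a messenger would advertise after `rebinds` rebinds.
std::uint64_t nonce_after_rebinds(std::uint64_t initial, unsigned rebinds) {
  // Guard against wrap-around before doing the addition.
  assert(rebinds <= (UINT64_MAX - initial) / kRebindNonceStep);
  return initial + kRebindNonceStep * rebinds;
}

// True when `seen` could be `initial` bumped by one or more rebinds.
bool looks_like_rebound(std::uint64_t initial, std::uint64_t seen) {
  return seen > initial && (seen - initial) % kRebindNonceStep == 0;
}
```

With the values from the log, `nonce_after_rebinds(96528, 1)` gives 1096528, and `looks_like_rebound(96528, 1096528)` is true.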

However, if I run ceph again from the command line against a similar setup, it produces the following error:

2016-12-08 21:51:24.829097 812c6fe00  0 -- 127.0.0.1:0/2757266831 >> 127.0.0.1:6800/96528 conn(0x815426000 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1096528 not 127.0.0.1:6800/96528 - wrong node!
2016-12-08 21:51:25.043623 812c6fe00  0 -- 127.0.0.1:0/2757266831 >> 127.0.0.1:6800/96528 conn(0x815426000 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1096528 not 127.0.0.1:6800/96528 - wrong node!
2016-12-08 21:51:25.470788 812c6fe00  0 -- 127.0.0.1:0/2757266831 >> 127.0.0.1:6800/96528 conn(0x815426000 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1096528 not 127.0.0.1:6800/96528 - wrong node!
2016-12-08 21:51:26.272885 812c6fe00  0 -- 127.0.0.1:0/2757266831 >> 127.0.0.1:6800/96528 conn(0x815426000 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1096528 not 127.0.0.1:6800/96528 - wrong node!
2016-12-08 21:51:27.876517 812c6fe00  0 -- 127.0.0.1:0/2757266831 >> 127.0.0.1:6800/96528 conn(0x815426000 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1096528 not 127.0.0.1:6800/96528 - wrong node!
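
A sketch of the kind of identity check that would produce a "wrong node" message like the one in these log lines, assuming the client compares IP, port, and nonce of the address it dialed against the address the peer claims. `ToyAddr`, `is_expected_peer`, and `demo_wrong_node` are hypothetical, not the messenger's actual types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical address triple: ip:port/nonce, as printed in the logs.
struct ToyAddr {
  std::string ip;
  std::uint16_t port;
  std::uint64_t nonce;
};

// The peer is accepted only if all three fields match what was dialed.
bool is_expected_peer(const ToyAddr& expected, const ToyAddr& claimed) {
  return expected.ip == claimed.ip &&
         expected.port == claimed.port &&
         expected.nonce == claimed.nonce;
}

// Usage sketch with the two addresses from the log lines above.
void demo_wrong_node() {
  ToyAddr expected{"127.0.0.1", 6800, 96528};
  ToyAddr claimed{"127.0.0.1", 6800, 1096528};
  assert(!is_expected_peer(expected, claimed));  // nonce differs
  assert(is_expected_peer(expected, expected));
}
```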

The pid of the ceph-osd (96529) is actually 1 higher than the nonce (96528)?
and these are its sockets:

jenkins  ceph-osd   96529 14 tcp4   *:6800                *:*
jenkins  ceph-osd   96529 15 tcp4   127.0.0.1:6800        *:*
jenkins  ceph-osd   96529 21 stream /home/jenkins/workspace/ceph-master/build/src/test/td/t-7202/out/osd.0.asok
jenkins  ceph-osd   96529 32 tcp4   127.0.0.1:62506       127.0.0.1:7204
jenkins  ceph-osd   96529 41 tcp4   127.0.0.1:52840       127.0.0.1:6806
jenkins  ceph-osd   96529 43 tcp4   127.0.0.1:62581       127.0.0.1:6808
jenkins  ceph-osd   96529 45 tcp4   127.0.0.1:43692       127.0.0.1:6807
jenkins  ceph-osd   96529 46 tcp4   127.0.0.1:6804        *:*
jenkins  ceph-osd   96529 47 tcp4   127.0.0.1:34029       127.0.0.1:6810
jenkins  ceph-osd   96529 48 tcp4   127.0.0.1:6805        *:*
jenkins  ceph-osd   96529 50 tcp4   127.0.0.1:59044       127.0.0.1:6811
jenkins  ceph-osd   96529 57 tcp4   127.0.0.1:6804        127.0.0.1:33938
jenkins  ceph-osd   96529 58 tcp4   127.0.0.1:6805        127.0.0.1:24705
jenkins  ceph-osd   96529 59 tcp4   127.0.0.1:6800        127.0.0.1:13204
jenkins  ceph-osd   96529 62 tcp4   127.0.0.1:6805        127.0.0.1:40651
jenkins  ceph-osd   96529 64 tcp4   127.0.0.1:6804        127.0.0.1:25652
jenkins  ceph-osd   96529 65 tcp4   127.0.0.1:6800        127.0.0.1:35355

Haven't looked into the new code, but it suffers from exactly the same 'feature' ....

tchaikov commented Dec 9, 2016

@wjwithagen yeah, your issue is not root-caused yet. and again, i think the nonce of the external messenger should not change once the ceph-osd is up and running.

@tchaikov tchaikov merged commit 9418e07 into ceph:master Dec 12, 2016
@tchaikov tchaikov deleted the wip-start-after-setting-nonce branch December 12, 2016 16:03
@tchaikov tchaikov assigned yuyuyu101 and unassigned tchaikov Dec 12, 2016