
msg/async/rdm: fix leak when existing failure in ip network #13435

Merged
merged 1 commit into ceph:master from wip-rdma-leak Feb 18, 2017

Conversation

yuyuyu101
Member

Signed-off-by: Haomai Wang <haomai@xsky.com>
@yuyuyu101
Member Author

@Adirl @orendu I found the leaking QPs: they are created, but then the "try_connect" call fails and they are never freed. So this only happens in an unhealthy cluster; a healthy cluster has no problem.
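For illustration only, here is a minimal C++ sketch of the kind of error path being described, not the actual Ceph RDMA code; `create_queue_pair`, `tcp_connect`, and `connect_with_cleanup` are made-up stand-ins. The point is that the QP is allocated before the IP-level connect, so the failure path must release it:

```cpp
// Illustrative sketch only (not the actual Ceph code): a QP is allocated before
// the IP-level connect; if the connect fails, the QP must be released on that
// error path, otherwise every unreachable peer leaks one QP.
#include <iostream>

struct QueuePair {};                        // stand-in for an RDMA queue pair

QueuePair* create_queue_pair() { return new QueuePair(); }
void destroy_queue_pair(QueuePair* qp) { delete qp; }

// Stand-in for the IP-level connect; returns -1 on failure (e.g. peer unreachable).
int tcp_connect(bool peer_reachable) { return peer_reachable ? 0 : -1; }

int connect_with_cleanup(bool peer_reachable) {
  QueuePair* qp = create_queue_pair();      // QP allocated up front
  int r = tcp_connect(peer_reachable);
  if (r < 0) {
    destroy_queue_pair(qp);                 // the missing cleanup: without this,
    return r;                               // every failed connect leaks one QP
  }
  // ... RDMA handshake would continue here ...
  destroy_queue_pair(qp);                   // simplified teardown for the sketch
  return 0;
}

int main() {
  std::cout << "healthy peer: " << connect_with_cleanup(true) << "\n";
  std::cout << "unreachable peer: " << connect_with_cleanup(false) << "\n";
}
```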

@yuyuyu101
Member Author

@Adirl please test this PR in your cluster to verify the current QP count and see whether it matches our assumption.

@Adirl

Adirl commented Feb 15, 2017

Great! Thanks,
the patch looks good.
I need to test it and will send an update.

yuyuyu101 merged commit 6dcd79c into ceph:master Feb 18, 2017
yuyuyu101 deleted the wip-rdma-leak branch February 18, 2017 06:14
@DanielBar-On
Contributor

@yuyuyu101
Confirming the QP leak issue is resolved. Checked on 3 nodes with 13 OSDs.

With 1 OSD up out of 13 in, we got
1528 QPs created, 9 QPs active, and 16 QPs according to kernel system information.
After a few minutes the results stayed pretty much the same:
1580 QPs created, 9 QPs active, and 14 QPs according to kernel system information.

With 6 OSDs up out of 13 in, we got
2314 QPs created, 164 QPs active, and 182 QPs according to kernel system information.
Again, after a few minutes the results didn't change:
2407 QPs created, 164 QPs active, and 182 QPs according to kernel system information.

With 13 OSDs up out of 13 in, we got
2287 QPs created, 712 QPs active, and 737 QPs according to kernel system information.
And again, after a few minutes:
2616 QPs created, 712 QPs active, and 733 QPs according to kernel system information.

@Adirl

Adirl commented Feb 28, 2017

@DanielBo @yuyuyu101
QP numbers look good!

@DanielBar-On
Contributor

@yuyuyu101
Encountered a new issue:
On a setup of 3 nodes, 1 mon, and 13 OSDs, we found that after killing OSDs and bringing them back up a few times, a machine will start showing a constantly increasing number of QPs according to the kernel system information. The created-QPs counter increases as well, but the active-QPs counter stays the same.

Following the logs from the problematic node, this is what happens:
we get a "wrong node!" error, which triggers the destructor ~RDMAConnectedSocketImpl; then the polling thread, which is busy, never destroys the dead QPs (usually, polling would destroy them).
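A rough sketch of the deferred-destruction pattern being described, with invented names (`Poller`, `defer_destroy`, `poll_once`), not the actual RDMAStack code: the socket destructor only hands its dead QP to the poller, so if polling stays busy with completions the reap step never runs and dead QPs accumulate, matching the ever-growing kernel QP count.

```cpp
#include <mutex>
#include <vector>

struct QueuePair { bool dead = false; };   // stand-in for an RDMA queue pair

class Poller {
  std::mutex lock;
  std::vector<QueuePair*> dead_qps;
public:
  // Called from the socket destructor: the QP is not freed here, only queued.
  void defer_destroy(QueuePair* qp) {
    std::lock_guard<std::mutex> l(lock);
    qp->dead = true;
    dead_qps.push_back(qp);
  }
  // One polling iteration. If it is always busy with completions, the reap
  // branch is never reached and dead QPs keep piling up.
  void poll_once(bool busy_with_completions) {
    if (busy_with_completions)
      return;
    std::lock_guard<std::mutex> l(lock);
    for (QueuePair* qp : dead_qps)
      delete qp;                           // the reap step that normally frees QPs
    dead_qps.clear();
  }
};

int main() {
  Poller p;
  p.defer_destroy(new QueuePair());
  p.poll_once(true);    // busy: nothing is reaped
  p.poll_once(false);   // idle: dead QPs are finally destroyed
}
```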

The log is also attached.

2017-02-28 12:06:25.992135 7f9c3cc5f700 25 -- 11.0.0.4:6800/17441 >> 11.0.0.2:6800/5645 conn(0x7f9c4e324800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).read_until read_bulk recv_end is 0 left is 281 got 281
2017-02-28 12:06:25.992145 7f9c3cc5f700 20 -- 11.0.0.4:6800/17441 >> 11.0.0.2:6800/5645 conn(0x7f9c4e324800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect read peer addr 11.0.0.2:6800/24196 on socket 47
2017-02-28 12:06:25.992154 7f9c3cc5f700 0 -- 11.0.0.4:6800/17441 >> 11.0.0.2:6800/5645 conn(0x7f9c4e324800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 11.0.0.2:6800/24196 not 11.0.0.2:6800/5645 - wrong node!
2017-02-28 12:06:25.992164 7f9c3cc5f700 20 Event(0x7f9c4d12c340 nevent=5000 time_id=783).delete_file_event delete event started fd=47 mask=3 original mask is 3
2017-02-28 12:06:25.992167 7f9c3cc5f700 20 EpollDriver.del_event del event fd=47 cur_mask=3 delmask=3 to 4
2017-02-28 12:06:25.992172 7f9c3cc5f700 10 Event(0x7f9c4d12c340 nevent=5000 time_id=783).delete_file_event delete event end fd=47 mask=3 original mask is 0
2017-02-28 12:06:25.992176 7f9c3cc5f700 20 RDMAConnectedSocketImpl ~RDMAConnectedSocketImpl destruct.
2017-02-28 12:06:25.992179 7f9c3d460700 20 RDMAStack polling pool completion queue got 1 responses.
2017-02-28 12:06:25.992179 7f9c3cc5f700 20 Event(0x7f9c4d12c340 nevent=5000 time_id=783).delete_file_event delete event started fd=50 mask=1 original mask is 1
2017-02-28 12:06:25.992181 7f9c3d460700 25 RDMAStack got a tx cqe, bytes:281

osd9log.txt

@DanielBar-On
Contributor

Hey @yuyuyu101, any idea on why this is happening?

@yuyuyu101
Copy link
Member Author

@DanielBar-On see #13810
