
msg/async/rdma: destroy QueuePair if needed #13810

Merged
merged 1 commit into ceph:master from wip-rdma-inflight on Mar 7, 2017

Conversation

@yuyuyu101
Member

Signed-off-by: Haomai Wang <haomai@xsky.com>
@Adirl

Adirl commented Mar 6, 2017

@yuyuyu101
reviewing

Mutex::Locker l(lock); // FIXME reuse dead qp because creating one qp costs 1 ms
while (!dead_queue_pairs.empty()) {
  ldout(cct, 10) << __func__ << " finally delete qp=" << dead_queue_pairs.back() << dendl;
  delete dead_queue_pairs.back();
  perf_logger->dec(l_msgr_rdma_active_queue_pair);
  dead_queue_pairs.pop_back();
  --num_dead_queue_pair;
}

we might want to move this to ~QueuePair() to cover other cases of deleting QPs

@yuyuyu101
Member Author

no, we only trace the queued QPs here..
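
For readers outside this thread, here is a minimal sketch of the deferred-reclamation pattern the quoted hunk implements: dead QPs are queued on the dispatcher and destroyed later under its lock, rather than being freed from ~QueuePair() itself. This is a simplified, hypothetical illustration; Dispatcher and QueuePair below are stand-ins, not the actual Ceph classes.

// Hypothetical, simplified illustration of the pattern under discussion;
// Dispatcher/QueuePair here are stand-ins, not the real Ceph classes.
#include <mutex>
#include <vector>

struct QueuePair {
  // in the real class, ibv_destroy_qp() would happen here
  ~QueuePair() {}
};

class Dispatcher {
  std::mutex lock;
  std::vector<QueuePair*> dead_queue_pairs;  // QPs waiting to be reclaimed

 public:
  // connection teardown path: hand the dead QP to the poller
  void schedule_qp_destroy(QueuePair* qp) {
    std::lock_guard<std::mutex> l(lock);
    dead_queue_pairs.push_back(qp);
  }

  // polling loop: reclaim once it is safe to do so
  void reap_dead_qps() {
    std::lock_guard<std::mutex> l(lock);
    while (!dead_queue_pairs.empty()) {
      delete dead_queue_pairs.back();  // actual QP teardown happens here
      dead_queue_pairs.pop_back();
    }
  }
};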

@@ -202,13 +203,14 @@ void RDMADispatcher::polling()
// Additionally, don't delete qp while outstanding_buffers isn't empty,

since we don't check inflight value, is this still true?

@yuyuyu101
Member Author

yes, we don't need to care about in-flight tx messages

@yuyuyu101 yuyuyu101 merged commit d124e6f into ceph:master Mar 7, 2017
@yuyuyu101 yuyuyu101 deleted the wip-rdma-inflight branch March 7, 2017 07:25
@Adirl

Adirl commented Mar 9, 2017

@yuyuyu101

[cephuser@clx-ssp-055 ~]$ ceph -s
    cluster 68e56c22-d9d3-4680-872e-e547ab7fdf80
     health HEALTH_OK
     monmap e5: 4 mons at {clx-ssp-055=110.168.1.55:6789/0,clx-ssp-060=110.168.1.60:6789/0,clx-ssp-065=110.168.1.65:6789/0,clx-ssp-070=110.168.1.70:6789/0}
            election epoch 36, quorum 0,1,2,3 clx-ssp-055,clx-ssp-060,clx-ssp-065,clx-ssp-070
        mgr active: clx-ssp-060 standbys: clx-ssp-065, clx-ssp-070, clx-ssp-055
     osdmap e1404: 256 osds: 256 up, 256 in
            flags sortbitwise,require_jewel_osds,require_kraken_osds,require_luminous_osds
      pgmap v84918: 8192 pgs, 1 pools, 16000 GB data, 4000 kobjects
            48051 GB used, 45854 GB / 93905 GB avail
                8192 active+clean
/mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/msg/async/rdma/RDMAStack.cc: In function 'virtual RDMADispatcher::~RDMADispatcher()' thread 7fb71b472700 time 2017-03-09 16:33:55.296822
/mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/msg/async/rdma/RDMAStack.cc: 39: FAILED assert(dead_queue_pairs.empty())
 ceph version 12.0.0-1037-gf7e0f57 (f7e0f57f797e1bf6a80a7226bcc21765024e7e8a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7fb721668bc0]
 2: (RDMADispatcher::~RDMADispatcher()+0x336) [0x7fb7217a3116]
 3: (RDMADispatcher::~RDMADispatcher()+0x9) [0x7fb7217a3149]
 4: (RDMAStack::~RDMAStack()+0x38) [0x7fb7217a1728]
 5: (CephContext::TypedSingletonWrapper<StackSingleton>::~TypedSingletonWrapper()+0x8a) [0x7fb72178311a]
 6: (CephContext::~CephContext()+0x47) [0x7fb72181b597]
 7: (CephContext::put()+0x17c) [0x7fb72181bc4c]
 8: (librados::RadosClient::~RadosClient()+0x1b0) [0x7fb729f78520]
 9: (librados::RadosClient::~RadosClient()+0x9) [0x7fb729f78579]
 10: (rados_shutdown()+0x2e) [0x7fb729f2b3ce]
 11: (()+0x17cc2) [0x7fb72a236cc2]
 12: (PyEval_EvalFrameEx()+0x730a) [0x7fb72c7cd00a]
 13: (PyEval_EvalCodeEx()+0x7ed) [0x7fb72c7cee3d]
 14: (PyEval_EvalFrameEx()+0x663c) [0x7fb72c7cc33c]
 15: (PyEval_EvalFrameEx()+0x67bd) [0x7fb72c7cc4bd]
 16: (PyEval_EvalCodeEx()+0x7ed) [0x7fb72c7cee3d]
 17: (()+0x70798) [0x7fb72c758798]
 18: (PyObject_Call()+0x43) [0x7fb72c7338e3]
 19: (()+0x5a8d5) [0x7fb72c7428d5]
 20: (PyObject_Call()+0x43) [0x7fb72c7338e3]
 21: (PyEval_CallObjectWithKeywords()+0x47) [0x7fb72c7c56f7]
 22: (()+0x1155c2) [0x7fb72c7fd5c2]
 23: (()+0x7dc5) [0x7fb72c4d3dc5]
 24: (clone()+0x6d) [0x7fb72baf821d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
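
For context, a simplified, hypothetical reproduction of the failure mode (not the actual Ceph code, and not the fix later made in #13905): the dispatcher's destructor asserts that every dead QP has already been reclaimed by the polling thread, so if teardown happens while a QP is still queued, the assert fires exactly as in the trace above.

// Hypothetical, minimal reproduction of the shutdown race reported above;
// not the real RDMADispatcher, just the shape of the assertion.
#include <cassert>
#include <vector>

struct QueuePair {};

struct Dispatcher {
  std::vector<QueuePair*> dead_queue_pairs;
  ~Dispatcher() {
    // mirrors RDMAStack.cc:39: teardown expects the poller to have drained
    assert(dead_queue_pairs.empty());
  }
};

int main() {
  Dispatcher d;
  d.dead_queue_pairs.push_back(new QueuePair);  // queued but never reaped
  return 0;  // ~Dispatcher() runs here and trips the assert
}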

@Adirl

Adirl commented Mar 12, 2017

@yuyuyu101 have you seen this?

@Adirl

Adirl commented Mar 12, 2017

@yuyuyu101
here's another one on a different cluster


    -2> 2017-03-12 10:12:01.131113 7f14bb7ff700  1 RDMAStack handle_async_event it's not forwardly stopped by us, reenable=0x7f14ca196a80
    -1> 2017-03-12 10:12:01.131124 7f14bb7ff700  1  RDMAConnectedSocketImpl fault tcp fd 23
     0> 2017-03-12 10:12:01.133464 7f14bb7ff700 -1 /mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f14bb7ff700 time 2017-03-12 10:12:01.131139
/mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/common/Mutex.cc: 113: FAILED assert(r == 0)



 ceph version 12.0.0-1037-gf7e0f57 (f7e0f57f797e1bf6a80a7226bcc21765024e7e8a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f14c08c50e0]
 2: (Mutex::Lock(bool)+0x194) [0x7f14c088adb4]
 3: (RDMADispatcher::erase_qpn(unsigned int)+0x1d) [0x7f14c096c1ed]
 4: (RDMADispatcher::handle_async_event()+0x49c) [0x7f14c096c7ec]
 5: (RDMADispatcher::polling()+0x566) [0x7f14c096eff6]
 6: (()+0xb5220) [0x7f14bdc91220]
 7: (()+0x7dc5) [0x7f14be319dc5]
 8: (clone()+0x6d) [0x7f14bd3f921d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

@yuyuyu101
Member Author

@Adirl fixed here #13905

@DanielBar-On
Contributor

Couldn't replicate the crash. Regardless, the qp leak from #13435 is resolved.
