common: release g_ceph_context before returns #11733
LGTM
testing manually with the reproducer to verify it fixes the issue, but unfortunately it crashes
This looks right to me...
Multiple acks, looks good.
@@ -433,6 +440,10 @@ void Log::start()

void Log::stop()
{
  // the caller cannot tell if Log is stopped or not in this case
  if (m_indirect_this && *m_indirect_this == nullptr && !is_started()) {
why !is_started()? I'm under the impression that this makes the above call to pl->stop() a no-op, as this condition will always be true.
@liewegas i am changing this PR so the

using Deleter = std::function<void(CephContext*)>;
return std::unique_ptr<CephContext, Deleter>{g_ceph_context,
                                             [](CephContext *p) {p->put();}};
since CephContext is a ref-counted type, i think boost::intrusive_ptr is a more natural fit than std::unique_ptr - see src/rgw/rgw_main.cc for an example
agreed, will do. i thought std::unique_ptr was the more restricted smart pointer for this purpose: all we need to do is release the CephContext once we are done with it, and we are not using boost::intrusive_ptr<CephContext> elsewhere yet, so there was no need for the other semantics offered by boost::intrusive_ptr.
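To make the trade-off discussed above concrete, here is a minimal sketch, not Ceph code. `RefCounted` is a hypothetical stand-in for a ref-counted type such as `CephContext`, and `IntrusiveRef` is a simplified, hand-rolled stand-in for the copy semantics that `boost::intrusive_ptr` offers (it only mirrors the `(T*, bool add_ref)` constructor shape). The `std::unique_ptr` variant can only drop one reference on scope exit; the intrusive-style handle can also be copied, taking one reference per copy.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for a ref-counted type such as CephContext.
struct RefCounted {
  int nref = 1;      // the creator owns the first reference
  bool* destroyed;   // lets the caller observe destruction
  explicit RefCounted(bool* d) : destroyed(d) {}
  ~RefCounted() { *destroyed = true; }
  void get() { ++nref; }
  void put() {
    assert(nref > 0);
    if (--nref == 0) delete this;
  }
};

// Option 1: unique_ptr whose deleter drops exactly one reference.
using Deleter = std::function<void(RefCounted*)>;
std::unique_ptr<RefCounted, Deleter> adopt_unique(RefCounted* p) {
  return std::unique_ptr<RefCounted, Deleter>{
      p, [](RefCounted* q) { q->put(); }};
}

// Option 2: simplified intrusive-style handle (assumption: illustrative
// only, not the real boost::intrusive_ptr implementation).
class IntrusiveRef {
  RefCounted* p_;
public:
  // add_ref = false adopts the reference the caller already holds.
  explicit IntrusiveRef(RefCounted* p, bool add_ref = true) : p_(p) {
    if (p_ && add_ref) p_->get();
  }
  IntrusiveRef(const IntrusiveRef& o) : p_(o.p_) { if (p_) p_->get(); }
  IntrusiveRef& operator=(IntrusiveRef o) { std::swap(p_, o.p_); return *this; }
  ~IntrusiveRef() { if (p_) p_->put(); }
  RefCounted* get() const { return p_; }
};

// True if the object is destroyed once the unique_ptr goes out of scope.
bool unique_releases() {
  bool destroyed = false;
  { auto h = adopt_unique(new RefCounted(&destroyed)); }
  return destroyed;
}

// Reference count observed while two intrusive-style handles are alive.
int intrusive_copy_count() {
  bool destroyed = false;
  IntrusiveRef a(new RefCounted(&destroyed), false);
  IntrusiveRef b = a;  // copying takes a second reference
  return a.get()->nref;
}
```

Both options release the object at scope exit; the difference is only whether the handle can be shared, which is why either choice works when the sole goal is "release before returning".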
@cbodley comment addressed and repushed.
@athanatos mind taking a look?
I like it!
LGTM! much better
@dachary will do.
changelog
./bin/unittest_erasure_code_plugin_jerasure segfaulted. rerunning.
@@ -61,24 +60,6 @@ TEST(ErasureCodePlugin, factory)
  }
}

int main(int argc, char **argv)
this is different from test/unit.h and needs to be adapted
my bad, it was an autotools leftover
changelog
@liewegas, sorry! we need to reschedule a rados qa test.
crushtool times out.
@@ -274,6 +274,9 @@ int main(int argc, const char **argv)
  auto cct = global_init(NULL, env_args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
  // crushtool times out occasionally when quits. so do not
  // release the g_ceph_context.
  cct->get();
  common_init_finish(g_ceph_context);
It would be nice to comment on the root cause of this timeout. A leak there is harmless, but knowing the root cause would help check whether the problem is shared by other parts of the code, maybe with side effects that are more subtle and more difficult to diagnose afterwards. Am I making sense?
@dachary i tried to reproduce this problem on a heavily loaded machine: crushtool timed out twice out of 200 runs. it could take more than 6 seconds or 16 seconds to complete the check. i also tried to catch this by wrapping crushtool with a python script which launches gdb to print out the backtrace of all threads if it times out, but gdb reported that the debugged process was a zombie and could not be attached.
in short: 1) the root cause is still unknown. 2) the leak is not a regression, i just made it more obvious.
Ok
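For readers following the `cct->get()` workaround in this thread, the sketch below shows why one extra reference keeps the context alive past the end of the scope. It is an illustration under assumptions, not Ceph code: `PinnedContext` is a hypothetical stand-in for `CephContext`, and `std::unique_ptr` with a custom deleter stands in for the handle returned by `global_init()`.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for CephContext; counts live instances.
struct PinnedContext {
  static int live;
  int nref = 1;  // the creator owns the first reference
  PinnedContext() { ++live; }
  ~PinnedContext() { --live; }
  void get() { ++nref; }
  void put() {
    assert(nref > 0);
    if (--nref == 0) delete this;
  }
};
int PinnedContext::live = 0;

// Deleter that drops one reference instead of deleting directly.
struct PutOnScopeExit {
  void operator()(PinnedContext* c) const { c->put(); }
};

// Returns how many contexts are still alive after the guard is gone.
int live_after_scope(bool pin) {
  {
    std::unique_ptr<PinnedContext, PutOnScopeExit> cct(new PinnedContext);
    if (pin) cct->get();  // extra reference that is never released
  }  // the guard calls put() here, dropping exactly one reference
  return PinnedContext::live;
}
```

Without the extra `get()`, the guard's `put()` drops the count to zero and the destructor runs; with it, the count stays at one, so the destructor (and whatever teardown it would trigger) is skipped and the object is intentionally leaked until the process exits.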
prior to this change, global_init() could create a new CephContext and assign it to g_ceph_context. it's our responsibility to release the CephContext explicitly using cct->put() before the application quits, but sometimes we fail to do so. in this change, global_init() will return an intrusive_ptr<CephContext>, which calls `g_ceph_context->put()` in its dtor. this ensures that the CephContext is always destroyed before main() returns, so the log is flushed before _log_exp_length is destroyed.

there are two cases where global_pre_init() is called directly:
- ceph_conf.cc: g_ceph_context->put() will be called by an intrusive_ptr<> deleter.
- rgw_main.cc: global_init() is called later on the success code path, so it will be taken care of.

Fixes: http://tracker.ceph.com/issues/17762
Signed-off-by: Kefu Chai <kchai@redhat.com>
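The guarantee described in this commit message, that the context is released on every path out of main(), can be sketched as follows. This is a minimal illustration under assumptions, not the actual change: `ToolContext`, `fake_global_init()` and `run_tool()` are hypothetical names, and `std::unique_ptr` with a custom deleter is used as a std-only stand-in for the `intrusive_ptr<CephContext>` the commit returns.

```cpp
#include <memory>
#include <string>
#include <vector>

// Records the order of operations so the teardown order is observable.
std::vector<std::string> events;

// Hypothetical stand-in for CephContext: destruction "flushes the log".
struct ToolContext {
  int nref = 1;
  ~ToolContext() { events.push_back("log flushed"); }
  void put() { if (--nref == 0) delete this; }
};

// Deleter that releases one reference when the handle is destroyed.
struct PutOnExit {
  void operator()(ToolContext* c) const { c->put(); }
};
using ContextGuard = std::unique_ptr<ToolContext, PutOnExit>;

// Hypothetical analogue of global_init(): returns an owning handle
// instead of leaving the release to the caller.
ContextGuard fake_global_init() { return ContextGuard(new ToolContext); }

// Stand-in for a tool's main(): has an early-return error path.
int run_tool(bool bad_args) {
  auto cct = fake_global_init();
  if (bad_args) {
    events.push_back("usage error");
    return 1;  // early return: the guard still releases the context
  }
  events.push_back("work done");
  return 0;
}
```

Because the handle is a local variable, its destructor runs on both the error path and the success path, so no return statement needs an explicit `put()`.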
this is only a workaround for the occasional timeout.

Signed-off-by: Kefu Chai <kchai@redhat.com>
changelog
@@ -32,10 +34,19 @@
 * initialization for you.
 */
int main(int argc, char **argv) {
  std::vector<const char*> args;
  global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY,
              CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
CINIT_FLAG_NO_DEFAULT_CONFIG_FILE may be here so that tests are not influenced by an existing ceph installation on the developer machine (i.e. no attempt to read /etc/ceph/ceph.conf even if it's available). This is just a thought: I did not actually verify that's going to happen but ... that's what I thought.
prior to this change, the tests using unit.h are
- ceph_crypto.cc // unittest
- dns_message.h // disabled
- dns_resolve.cc // disabled
- crypto.cc // unittest
- daemon_config.cc // unittest
- formatter.cc // unittest
- gather.cc // unittest
- heartbeat_map.cc // unittest
- MonMap.cc // disabled
- signals.cc // unittest

the ones marked with "unittest" are unit tests run by "make check", so if jenkins is happy, we are good.
the tests which start using "unit.h" with this change were all calling global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY, 0) before this change, so their behavior is not changed.
so this change is safe, right?
right
tested at
FYI -- this change caused rbd-nbd to immediately seg fault. I am fixing under http://tracker.ceph.com/issues/18070
http://tracker.ceph.com/issues/17762