Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: release g_ceph_context before returns #11733

Merged
merged 5 commits into from Nov 24, 2016
Merged

Conversation

tchaikov
Copy link
Contributor

@tchaikov tchaikov commented Nov 2, 2016

@badone
Copy link
Contributor

badone commented Nov 2, 2016

LGTM

@ghost
Copy link

ghost commented Nov 2, 2016

testing manually with the reproducer to verify it fixes the issue, but unfortunately it crashes

@liewegas liewegas modified the milestone: kraken Nov 3, 2016
@tchaikov tchaikov changed the title log: do not call e->hint_size() if in log_on_exit() [DNM] log: do not call e->hint_size() if in log_on_exit() Nov 5, 2016
@tchaikov
Copy link
Contributor Author

tchaikov commented Nov 5, 2016

Log::entry() is still flushing when the program terminates, that's why the producer crashes.

@tchaikov tchaikov unassigned ghost and badone Nov 8, 2016
@liewegas
Copy link
Member

This looks right to me...

Copy link
Contributor

@badone badone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple acks, looks good.

@tchaikov tchaikov changed the title [DNM] log: do not call e->hint_size() if in log_on_exit() log: do not call e->hint_size() if in log_on_exit() Nov 14, 2016
@@ -433,6 +440,10 @@ void Log::start()

void Log::stop()
{
// the caller cannot tell if Log is stopped or not in this case
if (m_indirect_this && *m_indirect_this == nullptr && !is_started()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why !is_started() ? I'm under the impression that this makes the above call to pl->stop() a noop as this condition will always be true.

@tchaikov
Copy link
Contributor Author

@liewegas i am changing this PR so the global_init() will return a unique_ptr which calls g_ceph_context->put() in its dtor. and also fixes some other g_ceph_context leakage.

@tchaikov tchaikov changed the title log: do not call e->hint_size() if in log_on_exit() common: release g_ceph_context before returns Nov 15, 2016
@tchaikov tchaikov changed the title common: release g_ceph_context before returns [DNM] common: release g_ceph_context before returns Nov 15, 2016

using Deleter = std::function<void(CephContext*)>;
return std::unique_ptr<CephContext, Deleter>{g_ceph_context,
[](CephContext *p) {p->put();}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since CephContext is a ref-counted type, i think boost::intrusive_ptr is a more natural fit than std::unique_ptr - see src/rgw/rgw_main.cc for an example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, will do. i thought the more restricted smart ptr for this purpose is std::unique_ptr as what we need to do is but release the CephContext after done with it as we are not using the boost::intrusive_ptr<CephContext> else where yet, so there is no need for the other semantic offered by boost::intrusive_ptr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbodley comment addressed and repushed.

@tchaikov
Copy link
Contributor Author

@athanatos mind taking a look?

@athanatos
Copy link
Contributor

I like it!

@liewegas
Copy link
Member

LGTM! much better

@ghost
Copy link

ghost commented Nov 23, 2016

@tchaikov excellent :-) In the commit message for f4c4f42 s/quites/quits/ + s/and ./and /. If at all possible it would be good to split f4c4f42 into several so that it's more readable (the unit.h change for instance).

@tchaikov
Copy link
Contributor Author

@dachary will do.

@tchaikov
Copy link
Contributor Author

changelog

  • just fixed the typos in commit message of f4c4f42

@tchaikov
Copy link
Contributor Author

./bin/unittest_erasure_code_plugin_jerasure' segfault. rerunning.

@@ -61,24 +60,6 @@ TEST(ErasureCodePlugin, factory)
}
}

int main(int argc, char **argv)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is different from test/unit.h and needs to be adapted

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, it was an autotools leftover

@tchaikov
Copy link
Contributor Author

changelog

  • do not take a refcount when constructing the intrusive_ptr<>.

@liewegas , sorry! we need to reschedule a rados qa test.

@tchaikov
Copy link
Contributor Author

crushtool times out.

@@ -274,6 +274,9 @@ int main(int argc, const char **argv)
auto cct = global_init(NULL, env_args, CEPH_ENTITY_TYPE_CLIENT,
CODE_ENVIRONMENT_UTILITY,
CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
// crushtool times out occasionally when quits. so do not
// release the g_ceph_context.
cct->get();
common_init_finish(g_ceph_context);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to comment on the root cause of this timeout. A leak there is harmless but knowing the root cause would be good to check if that's because of a problem that may be shared by other parts of the code, maybe with side effects that are more subtle and more difficult to diagnose afterwards. Am I making sense ?

Copy link
Contributor Author

@tchaikov tchaikov Nov 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dachary i tried to reproduce this problem in a heavy loaded machine, the crushtool times out twice out of 200 runs. it could take more than 6 seconds or 16 seconds to complete the check. and i tried to catch this by wrapping the crushtool with a python script which launches gdb to print out the backtrace of all threads if it times out but the gdb reports that the debugged process is a zombie and cannot be attached.

in short, 1) the root cause is still unknown. 2) the leak is not a regression. i just make it more obvious.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

prior to this change, global_init() could create a new CephContext
and assign it to g_ceph_context. it's our responsibilty to release
the CephContext explicitly using cct->put() before the application
quits. but sometimes, we fail to do so.

in this change, global_init() will return an intrusive_ptr<CephContext>,
which calls `g_ceph_context->put()` in its dtor. this ensures that
the CephContext is always destroyed before main() returns. so the
log is flushed before _log_exp_length is destroyed.

there are two cases where global_pre_init() is called directly.
- ceph_conf.cc: g_ceph_context->put() will be called by an intrusive_ptr<>
  deleter.
- rgw_main.cc: global_init() is called later on on the success code
  path, so it will be taken care of.

Fixes: http://tracker.ceph.com/issues/17762
Signed-off-by: Kefu Chai <kchai@redhat.com>
it is but a work around of occasionally timeout.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Copy link
Contributor Author

changelog

  • unit.h: pass 0 instead of CINIT_FLAG_NO_DEFAULT_CONFIG_FILE. as we are also using the ceph.conf to customize the behavior of unit tests.

@@ -32,10 +34,19 @@
* initialization for you.
*/
int main(int argc, char **argv) {
std::vector<const char*> args;
global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY,
CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CINIT_FLAG_NO_DEFAULT_CONFIG_FILE may be here so that tests are not influenced by an existing ceph installation on the developer machine (i.e. no attempt to read /etc/ceph/ceph.conf even if it's available). This is just a thought: I did not actually verify that's going to happen but ... that's what I thought.

Copy link
Contributor Author

@tchaikov tchaikov Nov 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dachary

prior this change, the tests using unit.h are

  • ceph_crypto.cc // unittest
  • dns_message.h // disabled
  • dns_resolve.cc // disabled
  • crypto.cc // unittest
  • daemon_config.cc // unittest
  • formatter.cc // unittest
  • gather.cc // unittest
  • heartbeat_map.cc // unittest
  • MonMap.cc // disabled
  • signals.cc // unittest

the ones marked with "unittests" are unittests tested by "make check", so if jenkins is happy, we are good.

the tests which start using "unit.h" with this change, are all calling global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY, 0) before this change, so their behavior is not changed.

so this change is safe, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

@tchaikov
Copy link
Contributor Author

tchaikov commented Nov 24, 2016

tested at

@tchaikov tchaikov merged commit 44aaeb7 into ceph:master Nov 24, 2016
@tchaikov tchaikov deleted the wip-17762 branch November 24, 2016 17:27
@dillaman
Copy link

FYI -- this change caused rbd-nbd to immediately seg fault. I am fixing under http://tracker.ceph.com/issues/18070

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants