common: release g_ceph_context before returns #11733

tchaikov · 2016-11-02T06:09:44Z

http://tracker.ceph.com/issues/17762

badone · 2016-11-02T07:31:17Z

LGTM

ghost · 2016-11-02T08:58:52Z

testing manually with the reproducer to verify it fixes the issue, but unfortunately it crashes

tchaikov · 2016-11-05T17:27:58Z

failed tests are caused by #10049

tchaikov · 2016-11-05T17:49:15Z

Log::entry() is still flushing when the program terminates, which is why the reproducer crashes.

liewegas · 2016-11-10T23:23:04Z

This looks right to me...

badone

Multiple acks, looks good.

ghost · 2016-11-02T07:48:16Z

src/log/Log.cc

@@ -433,6 +440,10 @@ void Log::start()

 void Log::stop()
 {
+  // the caller cannot tell if Log is stopped or not in this case
+  if (m_indirect_this && *m_indirect_this == nullptr && !is_started()) {


why !is_started() ? I'm under the impression that this makes the above call to pl->stop() a noop as this condition will always be true.

tchaikov · 2016-11-15T07:51:35Z

@liewegas i am changing this PR so that global_init() returns a unique_ptr which calls g_ceph_context->put() in its dtor. it also fixes some other g_ceph_context leaks.

cbodley · 2016-11-15T14:55:30Z

src/global/global_init.cc

+
+  using Deleter = std::function<void(CephContext*)>;
+  return std::unique_ptr<CephContext, Deleter>{g_ceph_context,
+                                               [](CephContext *p) {p->put();}};


since CephContext is a ref-counted type, i think boost::intrusive_ptr is a more natural fit than std::unique_ptr - see src/rgw/rgw_main.cc for an example

agreed, will do. i thought std::unique_ptr was the more restricted smart pointer for this purpose: all we need is to release the CephContext once we are done with it, and since we are not using boost::intrusive_ptr<CephContext> elsewhere yet, there is no need for the other semantics offered by boost::intrusive_ptr.
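For readers following the exchange above, here is a minimal, standard-library-only sketch of the unique_ptr-with-deleter approach from the quoted hunk. `FakeContext` and `fake_global_init()` are hypothetical stand-ins, not Ceph's CephContext or global_init(); the only point is that the handle calls put() when it leaves scope.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a ref-counted context; not Ceph's CephContext.
struct FakeContext {
  int nref = 1;        // born holding one reference, owned by the creator
  bool* destroyed;     // set to true when the last reference is dropped
  explicit FakeContext(bool* flag) : destroyed(flag) {}
  void get() { ++nref; }
  void put() {
    if (--nref == 0) {
      *destroyed = true;
      delete this;
    }
  }
};

using Deleter = std::function<void(FakeContext*)>;
using ContextHandle = std::unique_ptr<FakeContext, Deleter>;

// Hypothetical factory playing the role of global_init(): the returned
// handle drops the initial reference when it goes out of scope.
ContextHandle fake_global_init(bool* destroyed) {
  return ContextHandle{new FakeContext(destroyed),
                       [](FakeContext* p) { p->put(); }};
}

// Returns true if the context was released by the time the scope ended.
bool released_at_scope_exit() {
  bool destroyed = false;
  {
    auto cct = fake_global_init(&destroyed);
    assert(cct->nref == 1);  // the handle adopted the initial reference
  }                          // deleter runs here and calls put()
  return destroyed;
}
```

A unique_ptr handle like this cannot be copied, which is the restriction mentioned above; a ref-counted handle would instead bump the count on each copy.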

@cbodley comment addressed and repushed.

tchaikov · 2016-11-22T18:47:48Z

@athanatos mind taking a look?

athanatos · 2016-11-22T18:58:27Z

I like it!

liewegas · 2016-11-22T22:15:36Z

LGTM! much better

ghost · 2016-11-23T06:04:50Z

@tchaikov excellent :-) In the commit message for f4c4f42 s/quites/quits/ + s/and ./and /. If at all possible it would be good to split f4c4f42 into several so that it's more readable (the unit.h change for instance).

tchaikov · 2016-11-23T06:27:05Z

@dachary will do.

tchaikov · 2016-11-23T07:16:14Z

changelog

just fixed the typos in commit message of f4c4f42

tchaikov · 2016-11-23T09:29:48Z

./bin/unittest_erasure_code_plugin_jerasure segfaulted. rerunning.

ghost · 2016-11-23T09:42:42Z

src/test/erasure-code/TestErasureCodePluginJerasure.cc

@@ -61,24 +60,6 @@ TEST(ErasureCodePlugin, factory)
  }
 }

-int main(int argc, char **argv)


this is different from test/unit.h and needs to be adapted

my bad, it was an autotools leftover

tchaikov · 2016-11-23T10:45:42Z

changelog

do not take a refcount when constructing the intrusive_ptr<>.
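The changelog entry above can be illustrated with a small sketch. It assumes the object is created already holding one reference, and it uses a hand-rolled `Handle` class that only mirrors the `(T* p, bool add_ref = true)` constructor shape of boost::intrusive_ptr; `RefCounted`, `Handle` and `refs_left_after_scope()` are hypothetical names, not Ceph code.

```cpp
#include <cassert>

// Hypothetical ref-counted type; it is born holding one reference.
struct RefCounted {
  int nref = 1;
  void get() { ++nref; }
  void put() { --nref; }  // a real type would destroy itself at zero
};

// Hand-rolled stand-in mirroring the (T*, bool add_ref) constructor
// shape of boost::intrusive_ptr. Illustrative only.
class Handle {
  RefCounted* p_;
public:
  explicit Handle(RefCounted* p, bool add_ref = true) : p_(p) {
    if (p_ && add_ref) p_->get();
  }
  ~Handle() { if (p_) p_->put(); }
  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;
};

// Returns the reference count left after a handle has come and gone.
int refs_left_after_scope(bool add_ref) {
  RefCounted obj;             // nref == 1: the creator's reference
  {
    Handle h(&obj, add_ref);  // adopt (false) or take a new reference (true)
  }                           // the handle drops exactly one reference
  return obj.nref;
}
```

With `add_ref` set to true the count goes 1, 2, 1 and the creator's reference is never dropped; with false the handle adopts that reference and the count reaches 0.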

@liewegas , sorry! we need to reschedule a rados qa test.

tchaikov · 2016-11-23T11:44:37Z

crushtool times out.

ghost · 2016-11-23T13:52:18Z

src/tools/crushtool.cc

@@ -274,6 +274,9 @@ int main(int argc, const char **argv)
  auto cct = global_init(NULL, env_args, CEPH_ENTITY_TYPE_CLIENT,
 			 CODE_ENVIRONMENT_UTILITY,
 			 CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
+  // crushtool times out occasionally when quits. so do not
+  // release the g_ceph_context.
+  cct->get();
  common_init_finish(g_ceph_context);


It would be nice to comment on the root cause of this timeout. A leak there is harmless but knowing the root cause would be good to check if that's because of a problem that may be shared by other parts of the code, maybe with side effects that are more subtle and more difficult to diagnose afterwards. Am I making sense ?

@dachary i tried to reproduce this problem on a heavily loaded machine: crushtool timed out twice out of 200 runs. it could take more than 6 or even 16 seconds to complete the check. i also tried to catch this by wrapping crushtool with a python script which launches gdb to print the backtrace of all threads when it times out, but gdb reports that the debugged process is a zombie and cannot be attached.

in short, 1) the root cause is still unknown. 2) the leak is not a regression. i just made it more obvious.
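As a rough illustration of why the extra `cct->get()` in the hunk above turns into a deliberate leak, the sketch below uses a hypothetical `FakeContext` and a scoped handle whose deleter calls put(). It is an assumption-laden model of the workaround, not the crushtool code itself.

```cpp
#include <cassert>
#include <memory>

// Hypothetical ref-counted context, not Ceph's CephContext.
struct FakeContext {
  int nref = 1;
  bool torn_down = false;
  void get() { ++nref; }
  void put() { if (--nref == 0) torn_down = true; }  // teardown stand-in
};

// Deleter that releases one reference instead of deleting the object.
struct PutDeleter {
  void operator()(FakeContext* p) const { p->put(); }
};

// Returns whether teardown ran when the handle left scope.
bool teardown_runs(bool take_extra_ref) {
  FakeContext ctx;
  {
    std::unique_ptr<FakeContext, PutDeleter> cct(&ctx);
    if (take_extra_ref) cct->get();  // the workaround: never balanced by a put()
  }                                  // the handle calls put() exactly once
  return ctx.torn_down;
}
```

The unbalanced get() leaves one reference outstanding, so teardown is skipped at exit and the process relies on the OS to reclaim the memory.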

prior to this change, global_init() could create a new CephContext and assign it to g_ceph_context. it's our responsibility to release the CephContext explicitly using cct->put() before the application quits. but sometimes, we fail to do so.

in this change, global_init() will return an intrusive_ptr<CephContext>, which calls `g_ceph_context->put()` in its dtor. this ensures that the CephContext is always destroyed before main() returns, so the log is flushed before _log_exp_length is destroyed.

there are two cases where global_pre_init() is called directly:

- ceph_conf.cc: g_ceph_context->put() will be called by an intrusive_ptr<> deleter.
- rgw_main.cc: global_init() is called later on the success code path, so it will be taken care of.

Fixes: http://tracker.ceph.com/issues/17762
Signed-off-by: Kefu Chai <kchai@redhat.com>
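The ordering argument in this commit message can be sketched as follows. `fake_main()` stands in for main(), `FakeContext` is a hypothetical type whose destructor represents the log flush, and the "statics destroyed" entry is only a placeholder for whatever exit-time teardown follows main(). None of these names come from Ceph.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Records the order of events so the sketch can be checked.
std::vector<std::string> events;

// Hypothetical context whose teardown flushes a log.
struct FakeContext {
  ~FakeContext() { events.push_back("log flushed"); }
};

// Stand-in for main(): owns the context through a scoped handle.
int fake_main(bool fail_early) {
  std::unique_ptr<FakeContext> cct(new FakeContext);
  if (fail_early) {
    events.push_back("early return");
    return 1;  // the handle still releases the context on this path
  }
  events.push_back("normal return");
  return 0;
}

// Runs fake_main and then records what would happen after main() returns.
std::vector<std::string> run(bool fail_early) {
  events.clear();
  fake_main(fail_early);
  events.push_back("statics destroyed");  // placeholder for exit-time teardown
  return events;
}
```

Because the handle is a local of the main-like function, every return path releases the context before anything that runs after main() can observe it.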

it is but a workaround for the occasional timeout. Signed-off-by: Kefu Chai <kchai@redhat.com>

tchaikov · 2016-11-24T14:41:20Z

changelog

unit.h: pass 0 instead of CINIT_FLAG_NO_DEFAULT_CONFIG_FILE, as we are also using ceph.conf to customize the behavior of unit tests.

ghost · 2016-11-24T15:36:10Z

src/test/unit.h

@@ -32,10 +34,19 @@
 * initialization for you.
 */
 int main(int argc, char **argv) {
-  std::vector<const char*> args;
-  global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY,
-	      CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);


CINIT_FLAG_NO_DEFAULT_CONFIG_FILE may be here so that tests are not influenced by an existing ceph installation on the developer machine (i.e. no attempt to read /etc/ceph/ceph.conf even if it's available). This is just a thought: I did not actually verify that's going to happen but ... that's what I thought.

@dachary

prior to this change, the tests using unit.h are

- ceph_crypto.cc // unittest
- dns_message.h // disabled
- dns_resolve.cc // disabled
- crypto.cc // unittest
- daemon_config.cc // unittest
- formatter.cc // unittest
- gather.cc // unittest
- heartbeat_map.cc // unittest
- MonMap.cc // disabled
- signals.cc // unittest

the ones marked with "unittest" are unit tests run by "make check", so if jenkins is happy, we are good.

the tests which start using "unit.h" with this change were all calling global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY, 0) before this change, so their behavior is not changed.

so this change is safe, right?

tchaikov · 2016-11-24T16:41:11Z

tested at

http://pulpito.ceph.com/kchai-2016-11-24_01:21:40-rados-wip-kefu-testing---basic-mira/ without the "unit.h: pass 0 instead of CINIT_FLAG_NO_DEFAULT_CONFIG_FILE" change
http://pulpito.ceph.com/kchai-2016-11-24_16:47:41-rados-wip-kefu-testing---basic-mira/ with the "unit.h: pass 0 instead of CINIT_FLAG_NO_DEFAULT_CONFIG_FILE" change

dillaman · 2016-11-29T16:43:15Z

FYI -- this change caused rbd-nbd to immediately seg fault. I am fixing under http://tracker.ceph.com/issues/18070

tchaikov added bug-fix common labels Nov 2, 2016

tchaikov assigned ghost and badone Nov 2, 2016

tchaikov force-pushed the wip-17762 branch from 0f56b66 to 93cfb3d Compare November 2, 2016 06:55

badone added the needs-qa label Nov 2, 2016

tchaikov mentioned this pull request Nov 2, 2016

[OSD] Wip log alloc predictor (improvement) #6641

Merged

tchaikov added the wip-kefu-testing label Nov 2, 2016

liewegas modified the milestone: kraken Nov 3, 2016

tchaikov changed the title ~~log: do not call e->hint_size() if in log_on_exit()~~ [DNM] log: do not call e->hint_size() if in log_on_exit() Nov 5, 2016

tchaikov removed needs-qa wip-kefu-testing labels Nov 5, 2016

tchaikov unassigned ghost and badone Nov 8, 2016

badone approved these changes Nov 10, 2016

tchaikov changed the title ~~[DNM] log: do not call e->hint_size() if in log_on_exit()~~ log: do not call e->hint_size() if in log_on_exit() Nov 14, 2016

ghost reviewed Nov 14, 2016

liewegas added the wip-sage-testing label Nov 14, 2016

tchaikov force-pushed the wip-17762 branch from 93cfb3d to f9fafaa Compare November 15, 2016 07:49

tchaikov changed the title ~~log: do not call e->hint_size() if in log_on_exit()~~ common: release g_ceph_context before returns Nov 15, 2016

tchaikov force-pushed the wip-17762 branch from f9fafaa to f1fbd2b Compare November 15, 2016 08:44

tchaikov changed the title ~~common: release g_ceph_context before returns~~ [DNM] common: release g_ceph_context before returns Nov 15, 2016

cbodley reviewed Nov 15, 2016

tchaikov force-pushed the wip-17762 branch from 36f8fda to f4c4f42 Compare November 22, 2016 19:13

tchaikov added the needs-qa label Nov 22, 2016

liewegas approved these changes Nov 22, 2016

liewegas added the wip-sage-testing label Nov 23, 2016

tchaikov force-pushed the wip-17762 branch from f4c4f42 to 2b73f41 Compare November 23, 2016 07:15

ghost reviewed Nov 23, 2016

tchaikov force-pushed the wip-17762 branch from 2b73f41 to bed4750 Compare November 23, 2016 10:43

ghost reviewed Nov 23, 2016

ghost mentioned this pull request Nov 23, 2016

test: WRITE_CEPH_UNITTEST_MAIN #11980

Closed

liewegas removed the wip-sage-testing label Nov 23, 2016

ghost approved these changes Nov 23, 2016

tchaikov added 2 commits November 24, 2016 22:38

crushtool: do not release g_ceph_context at exit

d305cc5

it is but a workaround for the occasional timeout. Signed-off-by: Kefu Chai <kchai@redhat.com>

tchaikov force-pushed the wip-17762 branch from 3d48749 to d305cc5 Compare November 24, 2016 14:39

tchaikov added the wip-kefu-testing label Nov 24, 2016

ghost reviewed Nov 24, 2016

tchaikov merged commit 44aaeb7 into ceph:master Nov 24, 2016

tchaikov deleted the wip-17762 branch November 24, 2016 17:27
