global: do not start two daemons with a single pid-file #7075

shun-s · 2015-12-29T11:16:57Z

Add a function named test_pidFile_in_use that checks whether the pid-file is in use by another OSD or monitor, to avoid the pid-file being deleted.

Fixes: #13422
Signed-off-by: shun song <song.shun3@zte.com.cn>

shun-s · 2015-12-29T11:19:46Z

@tchaikov I messed up #6763 and have reopened it here; please review. Thanks

shun-s · 2015-12-31T03:07:43Z

@tchaikov

fixed the empty pid_file problem, which led to "unexpected process terminate"
just put ::open(conf->pid_file.c_str(), ...)
please review, many thanks

tchaikov · 2016-01-04T05:46:27Z

src/global/pidfile.cc

+    return pidfile_remove();
+  }
+
+  fd = ::open(conf->pid_file.c_str(), O_CREAT|O_WRONLY, 0644);


fd is leaked?

yeah, fd is leaked on purpose, because the pid-file needs to stay open for the whole lifecycle of the daemon, and if we let it leak, a global variable can be omitted.

trociny · 2016-01-04T14:28:53Z

Why does test_pid_file_in_use() call exit when it fails to open the file but return an error when it fails to lock? I am fine with either way, but it should be the same in both cases.

The name 'test_pid_file_in_use' looks confusing, as the function does not just test for the pid file, but also locks it (and keeps it open).

I would suggest:

rename test_pid_file_in_use() to pidfile_open();
keep the opened fd in static variable (similarly to pid_file variable);
in pidfile_remove(), close the opened fd after unlinking the file.

Also, it might be worth modifying pidfile_write() not to open the file again but to use the stored fd, but we should be careful here not to overwrite some other file if the pidfile was closed and the descriptor reused. The same goes for pidfile_remove().

I suggest looking at FreeBSD pidfile(3) interface and implementation:

https://www.freebsd.org/cgi/man.cgi?query=pidfile&sektion=3
https://github.com/freebsd/freebsd/blob/master/lib/libutil/pidfile.c

I would really like to have something like this. At least we could reuse their pidfile_verify() trick to be sure we don't overwrite some other file if the descriptor was reused.
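The open/verify/remove flow suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this pull request: the `PidFileSketch` struct and the `sketch_pidfile_*` function names are hypothetical, and `flock(2)` is used for locking purely for illustration (the actual patch may lock differently). The verify step follows the FreeBSD idea of comparing the device and inode recorded at open time against what `fstat()` reports for the stored descriptor.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical handle, loosely modelled on FreeBSD's struct pidfh.
struct PidFileSketch {
  int fd = -1;       // -1 means "not open"
  dev_t dev = 0;     // device of the file we opened
  ino_t ino = 0;     // inode of the file we opened
  std::string path;  // path the descriptor was opened from
};

// Open the pid file and take an exclusive, non-blocking lock on it.
// Returns 0 on success (or when no path is configured), -errno on failure.
int sketch_pidfile_open(PidFileSketch& h, const std::string& path) {
  assert(h.fd == -1);  // the handle must not already own a descriptor
  if (path.empty())
    return 0;  // no pid file requested: run without one
  int fd = ::open(path.c_str(), O_CREAT | O_RDWR, 0644);
  if (fd < 0)
    return -errno;
  struct stat st;
  if (::flock(fd, LOCK_EX | LOCK_NB) < 0 || ::fstat(fd, &st) < 0) {
    int err = errno;  // save errno before close() can change it
    ::close(fd);
    return -err;  // EWOULDBLOCK means another holder has the lock
  }
  h.fd = fd;
  h.dev = st.st_dev;
  h.ino = st.st_ino;
  h.path = path;
  return 0;
}

// Check that h.fd still refers to the file that was originally opened.
int sketch_pidfile_verify(const PidFileSketch& h) {
  if (h.fd == -1)
    return -EINVAL;
  struct stat st;
  if (::fstat(h.fd, &st) < 0)
    return -errno;
  if (st.st_dev != h.dev || st.st_ino != h.ino)
    return -ESTALE;  // descriptor was closed and reused for another file
  return 0;
}

// Unlink the pid file, then close the descriptor and reset the handle.
int sketch_pidfile_remove(PidFileSketch& h) {
  int r = sketch_pidfile_verify(h);
  if (r < 0)
    return r;  // refuse to touch anything we cannot verify
  int ret = (::unlink(h.path.c_str()) < 0) ? -errno : 0;
  ::close(h.fd);
  h = PidFileSketch();  // fd goes back to -1
  return ret;
}
```

With this shape, a second attempt to open the same path while the first handle is still held fails at the lock step, which is the condition the daemon would treat as "pid file already in use".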

shun-s · 2016-01-05T11:14:51Z

@trociny @tchaikov
Following the FreeBSD implementation, a new version has been completed. It mainly addresses these points:

keep the opened fd in a static variable
modify pidfile_write and pidfile_remove accordingly
rename test_pid_file_in_use() to pidfile_open() and add a new function, pidfile_verify, to avoid overwriting the pidfile
fix other deficiencies, such as the pidfile fd leak and the inconsistency where an open failure exits but a lock failure returns

shun-s · 2016-01-06T04:14:36Z

@trociny two changes have been made, please review, many thanks

use 'static struct pidfh pfh' (not a pointer) now
when pid_file is empty, return 0 instead of an error, so a daemon can run without a pidfile if someone specifies not to use one.
But how do you specify not to use a pidfile on the command line when running a daemon?

trociny · 2016-01-06T07:15:33Z

It looks much better to me now. Still, given that it is C++, I would add a constructor to struct pidfh, which would initialize pf_fd to -1, and other variables to 0.

Also, right now you don't set pf_fd to -1 after close, and this may lead to strange bugs when the descriptor is reused. I would suggest adding a close() method to pidfh, which would close pf_fd and set it to -1.
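A minimal sketch of what such a struct might look like is given here. It is an assumption for illustration, not the code from this pull request: `pidfh_sketch` and `is_open()` are hypothetical names, and the `pf_dev` and `pf_ino` members are borrowed from the FreeBSD layout, whereas only `pf_fd` and `pf_path` are named in this thread.

```cpp
#include <cassert>
#include <string>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical pid-file handle with a constructor and a close() method.
struct pidfh_sketch {
  int pf_fd;            // descriptor of the open pid file, -1 when closed
  std::string pf_path;  // path the descriptor was opened from
  dev_t pf_dev;         // device recorded at open time (assumed member)
  ino_t pf_ino;         // inode recorded at open time (assumed member)

  // Start out in the "closed" state: fd is -1, everything else is zeroed.
  pidfh_sketch() : pf_fd(-1), pf_dev(0), pf_ino(0) {}

  bool is_open() const { return pf_fd != -1; }

  // Close the descriptor and reset the handle so that a later caller
  // can never act on a stale, possibly reused descriptor number.
  void close() {
    assert(pf_fd >= -1);  // either the sentinel or a plausible descriptor
    if (pf_fd == -1)
      return;  // already closed: calling close() twice is harmless
    ::close(pf_fd);
    pf_fd = -1;
    pf_path.clear();
    pf_dev = 0;
    pf_ino = 0;
  }
};
```

Because close() resets pf_fd to -1, every other function can check for -1 first and return an error instead of writing to, or unlinking through, whatever file the kernel later assigned the same descriptor number to.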

Also, at some point you will need to squash your commits into one (google for "git interactive rebase" if it is new to you). You will need to push with the '-f' flag after the rebase.

As for your question about how to specify not to use a pidfile: by default the pid_file config option is empty, and a daemon is started without a pid file. You can redefine it either by specifying pid_file in ceph.conf or via the command line (--pid-file /path). If pid_file is defined in ceph.conf but you want to start without a pid file, you can run it with --pid-file ''.

shun-s · 2016-01-07T14:16:40Z

@trociny @tchaikov, thanks for your careful guidance. A constructor and a close function have been added, please review again.

trociny · 2016-01-07T16:09:52Z

It looks good to me. I have not tested it though.

BTW, if you added unit tests (as a separate commit) it would be really nice! See src/test/test_*.cc for examples.

xiexingguo · 2016-01-08T02:12:50Z

src/global/pidfile.cc

  if (ret < 0) {
    derr << "write_pid_file: failed to write to pid file '"
-	 << pid_file << "': " << cpp_strerror(ret) << dendl;
-    VOID_TEMP_FAILURE_RETRY(::close(fd));
+	 << pfh.pf_path << "': " << cpp_strerror(ret) << dendl;


Potential fd leak here.
We should put an extra
VOID_TEMP_FAILURE_RETRY(::close(pfh.pf_fd));
here.

@xiexingguo all cleanup work will be done by pidfile_remove, which is called when the daemon shuts down, so it does not seem to cause an fd leak. am i right?

tchaikov · 2016-01-26T03:48:15Z

src/test/test_pidfile.sh

+   RUNID=`uuidgen`
+   run_mon $dir a --pid-file= --daemonize=$RUNID || { teardown_unexist_pidfile $dir; return 1; } 
+   run_osd $dir 0 --pid-file= --daemonize=$RUNID || { teardown_unexist_pidfile $dir; return 1; }
+   teardown_unexist_pidfile $dir || return 1


teardown_unexist_pidfile() never fails, i think. because rm -fr $dir can hardly fail. unless you set a flag when kill_complete is "1" in teardown_unexist_pidfile(). and return 1 if the flag is not set.

tchaikov · 2016-01-26T04:07:51Z

there is a little confusion about what man bash said when setting -e option

see #7075 (comment). i quoted the bash's man page.

The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test following the if or elif reserved words, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command's return value is being inverted with !.

your function call is "part of the test following the if or elif reserved words", so the shell does not exit if it fails.

this time, i didn't rebase the fix commit, so as to avoid silly mistakes like having different versions on github and on my personal computer. if all is ok, i will then rebase everything. have you ever met this kind of problem?

you need to squash your commits for adding the tests into a single one, i am not sure what sort of problem you are referencing? but you might want to take a look at https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#preparing-and-sending-patches first.

tchaikov · 2016-01-26T04:09:47Z

src/test/test_pidfile.sh

@@ -0,0 +1,111 @@
+#!/bin/bash -e


as we discussed in the comments, -e does not help in our case.

i tested this: if we remove -e, then run_mon $dir a "fail" will never fail even though mon actually fails.
in this situation, we have to add return 1 after every run_mon or run_osd. so keeping -e may look more beautiful than seeing return 1 everywhere.

oh, you are right!

add functions named pidfile_open and pidfile_verify to avoid starting two daemons by a single pid-file Fixes: ceph#13422 Signed-off-by: shun song <song.shun3@zte.com.cn>

shun-s · 2016-01-26T08:38:01Z

@tchaikov
the run_osd_again and run_mon_again functions have been removed and run_mon is used.
please review, thanks

shun-s · 2016-01-26T08:41:16Z

@tchaikov only run_osd and run_mon are called in the test_pidfile.sh unittest now, so you won't need to consider whether to add run_osd_again or run_mon_again functions any more.
please review, thanks

tchaikov · 2016-01-26T10:00:37Z

src/test/test_pidfile.sh

+            if kill -9 $i 2> /dev/null ; then
+                kill_complete="0"
+                count=$((count+1))
+                sleep ${delays[$count]}


why would you "sleep" if you just killed a pid successfully?

i need a kill failure to tell that the pid daemon had been successfully killed.

@shun-s

i mean, we don't need to sleep if kill returns successfully.

if it returns successfully, you wait for a while just for killing another pid. i don't see the point here.

if "kill" fails, the only reason would be that the process is dead somehow, and the subsequent "kill"s would fail for the same reason. assuming kill -9 fails 9 times, then count would be greater than "9", and ${delays[$count]} is empty. so sleep returns immediately with an error message, and count would keep increasing in this never-ending loop.

i'd suggest we use -TERM for killing the daemons and take this into your consideration.

shun-s · 2016-01-27T00:57:28Z

@tchaikov
TEST_normal has been removed,
please review, thanks

tchaikov · 2016-01-27T03:58:57Z

@shun-s we are close, just some nits regarding the teardown_unexist_pidfile().

in short

i don't think delay helps with "kill -9",
the last kill_complete does not help to tell if teardown_unexist_pidfile() is successful or not, not to mention it can hardly be "1"
probably you can follow the model of kill_daemons() in ceph-helper.sh to design your teardown helper function?

Fixes: ceph#13422 Signed-off-by: shun song <song.shun3@zte.com.cn>

shun-s · 2016-01-27T06:49:48Z

@tchaikov
i hope this would be the last commit of this pull request, hehe

about the test, i have followed the model of kill_daemons in ceph-helpers.sh to design the teardown helper function.
please take a look again, thanks

tchaikov · 2016-01-27T08:31:45Z

lgtm with the qa run.

shun-s · 2016-01-27T14:10:03Z

@tchaikov thanks a lot

global: do not start two daemons with a single pid-file Reviewed-by: Kefu Chai <kchai@redhat.com>

liewegas · 2016-01-30T13:48:17Z

I think this introduced http://tracker.ceph.com/issues/14575 ... can you take a look please?

tchaikov · 2016-01-30T16:06:38Z

@liewegas ack. will do.

shun-s force-pushed the shun-fix branch from 8474bdb to 17cb583 Compare December 29, 2015 11:17

shun-s force-pushed the shun-fix branch from 17cb583 to e881777 Compare December 29, 2015 11:37

shun-s changed the title ~~Global: test pid_file whether or not in use~~ Global: do not start two daemons with a single pid-file Dec 29, 2015

tchaikov added bug-fix core labels Dec 30, 2015

shun-s force-pushed the shun-fix branch from f6b88b2 to 59fda60 Compare December 31, 2015 03:04

tchaikov reviewed Jan 4, 2016

tchaikov self-assigned this Jan 4, 2016

shun-s force-pushed the shun-fix branch from 59fda60 to 096bc05 Compare January 4, 2016 13:18

shun-s force-pushed the shun-fix branch 2 times, most recently from 25b31c6 to a4eae40 Compare January 7, 2016 06:56

xiexingguo reviewed Jan 8, 2016

shun-s force-pushed the shun-fix branch from a87fd69 to af7b0b0 Compare January 8, 2016 17:17

tchaikov reviewed Jan 26, 2016

global/pidfile: do not start two daemons with a single pid-file

f2c0ef4

add functions named pidfile_open and pidfile_verify to avoid starting two daemons by a single pid-file Fixes: ceph#13422 Signed-off-by: shun song <song.shun3@zte.com.cn>

shun-s force-pushed the shun-fix branch 2 times, most recently from 39983b8 to bf68bd5 Compare January 26, 2016 08:36

tchaikov reviewed Jan 26, 2016

shun-s force-pushed the shun-fix branch from bf68bd5 to 3cb78b4 Compare January 27, 2016 00:55

shun-s force-pushed the shun-fix branch from 3cb78b4 to 12649ef Compare January 27, 2016 06:37

test: add unitest test_pidfile.sh

12649ef

Fixes: ceph#13422 Signed-off-by: shun song <song.shun3@zte.com.cn>

tchaikov added the needs-qa label Jan 27, 2016

liewegas added the wip-sage-testing label Jan 27, 2016

liewegas added a commit that referenced this pull request Jan 29, 2016

Merge pull request #7075 from shun-s/shun-fix

71501a3

global: do not start two daemons with a single pid-file Reviewed-by: Kefu Chai <kchai@redhat.com>

liewegas merged commit 71501a3 into ceph:master Jan 29, 2016

shun-s deleted the shun-fix branch March 29, 2016 01:07
