New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
global: do not start two daemons with a single pid-file #7075
Conversation
|
return pidfile_remove(); | ||
} | ||
|
||
fd = ::open(conf->pid_file.c_str(), O_CREAT|O_WRONLY, 0644); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fd is leaked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, fd is leaked on purpose, because pid-file needs to keep open whole lifecycle, and if let it be leaked, a global varible can be omitted.
Why does test_pid_file_in_use() call exit when it fails to open file but return error when it fails to lock? I do either way but the same in both cases. 'test_pid_file_in_use' name looks confusing, as it does not just tests for the pid file, but also locks it (and keeps open). I would suggest:
Also, it might be worth to modify pidfile_write() not to open the file again but use the stored fd, but we should be careful here not to overwrite some other file if pidfile was closed and descriptor reused. The same is for pidfile_remove(). I suggest looking at FreeBSD pidfile(3) interface and implementation: https://www.freebsd.org/cgi/man.cgi?query=pidfile&sektion=3 I would really like to have something like this. At least we could reuse their pidfile_verify() trick to be sure we don't overwrite some other file if the descriptor was reused. |
@trociny @tchaikov
|
@trociny two changes have made, please review, many thanks
|
It looks much better to me now. Still, taking it is a C++, I would add a constructor to struct pidfh, which would initialize pf_fd to -1, and other variables to 0. Also, right now you don't set pf_fd to -1 after close and it may lead to strange bugs, when the descriptor is reused. I would suggest to add a close() method to pidfh, which whould close the pf_fd and set it to -1. Also, at some point you will need to squash your commits into one (google for "git interactive rebase" if it is new for your). You will need to push with '-f' flag after the rebase. As for you question about how to specify not to use pidfile, by default pid_file config option is empty, and a daemon is started without pid file. You can redefine it either by specifying pid_file in ceph.conf or via command line (--pid-file /path). If pid_file is defined in ceph.conf but you want to start witout pid file, you can run it with --pid-file ''. |
25b31c6
to
a4eae40
Compare
It looks good to me. I have not tested it though. BTW, if you added unit tests (as a separate commit) it would be really nice! See src/test/test_*.cc for examples. |
if (ret < 0) { | ||
derr << "write_pid_file: failed to write to pid file '" | ||
<< pid_file << "': " << cpp_strerror(ret) << dendl; | ||
VOID_TEMP_FAILURE_RETRY(::close(fd)); | ||
<< pfh.pf_path << "': " << cpp_strerror(ret) << dendl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential fd leak here.
Shall put an extra
VOID_TEMP_FAILURE_RETRY(::close(pfh.pf_fd));
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiexingguo all clean work will be done by pidfile_remove, which is called when daemon shutdown. so it seems not to cause fd leak, am i right?
RUNID=`uuidgen` | ||
run_mon $dir a --pid-file= --daemonize=$RUNID || { teardown_unexist_pidfile $dir; return 1; } | ||
run_osd $dir 0 --pid-file= --daemonize=$RUNID || { teardown_unexist_pidfile $dir; return 1; } | ||
teardown_unexist_pidfile $dir || return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
teardown_unexist_pidfile() never fails, i think. because rm -fr $dir
can hardly fail. unless you set a flag when kill_complete is "1" in teardown_unexist_pidfile()
. and return 1
if the flag is not set.
see #7075 (comment). i quoted the bash's man page.
your function call is "part of the test following the if or elif reserved words", so the shell does exit if it fails.
you need to squash your commits for adding the tests into a single one, i am not sure what sort of problem you are referencing? but you might want to take a look at https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#preparing-and-sending-patches first. |
@@ -0,0 +1,111 @@ | |||
#!/bin/bash -e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we discussed in the comments, -e
does not help in our case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i test if we remove -e, then run_mon $dir a "fail" will never fail even though mon actually fails.
at this situation, we have to add return 1 after every run_mon or run_osd. so keeping -e may look morebeautiful then seeing return 1 everwhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, you are right!
add functions named pidfile_open and pidfile_verify to avoid starting two daemons by a single pid-file Fixes: ceph#13422 Signed-off-by: shun song <song.shun3@zte.com.cn>
39983b8
to
bf68bd5
Compare
@tchaikov |
@tchaikov only run_osd and run_mon have been called in test_pidfile.sh unittest, after this, you won't need to consider whether to add run_osd_again or run_mon_again funcs any more. |
if kill -9 $i 2> /dev/null ; then | ||
kill_complete="0" | ||
count=$((count+1)) | ||
sleep ${delays[$count]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would you "sleep" if you just killed a pid successfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need a kill failure to tell that the pid daemon had been successfully killed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mean, we don't need to sleep if kill returns successfully.
- if it returns successfully, you wait for a while just for killing another pid. i don't see the point here.
- if "kill" fails, the only reason would be that the process is dead somehow, and the succeeding "kill"s would fail for the same reason. assuming
kill -9
fails for 9 times, then count would be greater than "9", and${delays[$count]}
is empty. sosleep
returns immediately with an error message, count would keep increasing in this never-ends loop.
i'd suggest we use -TERM
for killing the daemons and take this into your consideration.
@tchaikov |
@shun-s we are close, just some nits regarding to the in short
|
Fixes: ceph#13422 Signed-off-by: shun song <song.shun3@zte.com.cn>
@tchaikov about test, i have followed the model of kill_daemons in ceph-helpers.sh to design teardown helper function. |
lgtm with the qa run. |
@tchaikov 3q a lot |
global: do not start two daemons with a single pid-file Reviewed-by: Kefu Chai <kchai@redhat.com>
I think this introduced http://tracker.ceph.com/issues/14575 ... can you take a look please? |
@liewegas ack. will do. |
add a function named test_pidFile_in_use for checking whether pid-file is in using by another osd or monitor to advoid pid-file from deleted
Fixes: #13422
Signed-off-by: shun song song.shun3@zte.com.cn