Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global: do not start two daemons with a single pid-file #7075

Merged
merged 2 commits into from Jan 29, 2016

Conversation

shun-s
Copy link
Contributor

@shun-s shun-s commented Dec 29, 2015

add a function named test_pidFile_in_use for checking whether pid-file is in using by another osd or monitor to advoid pid-file from deleted

Fixes: #13422
Signed-off-by: shun song song.shun3@zte.com.cn

@shun-s
Copy link
Contributor Author

shun-s commented Dec 29, 2015

@tchaikov I mess #6763 up and reopen it here, please review. Thanks

@shun-s shun-s changed the title Global: test pid_file whether or not in use Global: do not start two daemons with a single pid-file Dec 29, 2015
@shun-s
Copy link
Contributor Author

shun-s commented Dec 31, 2015

@tchaikov

  1. fixed the pid_file empty problem which leads to "unexpected process terminate"
  2. just put ::open(conf->pid_file.c_str(), ...)
    please review, many thanks

return pidfile_remove();
}

fd = ::open(conf->pid_file.c_str(), O_CREAT|O_WRONLY, 0644);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fd is leaked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, fd is leaked on purpose, because pid-file needs to keep open whole lifecycle, and if let it be leaked, a global varible can be omitted.

@trociny
Copy link
Contributor

trociny commented Jan 4, 2016

Why does test_pid_file_in_use() call exit when it fails to open file but return error when it fails to lock? I do either way but the same in both cases.

'test_pid_file_in_use' name looks confusing, as it does not just tests for the pid file, but also locks it (and keeps open).

I would suggest:

  • rename test_pid_file_in_use() to pidfile_open();
  • keep the opened fd in static variable (similarly to pid_file variable);
  • in pidfile_remove(), close the opened fd after unlinking the file.

Also, it might be worth to modify pidfile_write() not to open the file again but use the stored fd, but we should be careful here not to overwrite some other file if pidfile was closed and descriptor reused. The same is for pidfile_remove().

I suggest looking at FreeBSD pidfile(3) interface and implementation:

https://www.freebsd.org/cgi/man.cgi?query=pidfile&sektion=3
https://github.com/freebsd/freebsd/blob/master/lib/libutil/pidfile.c

I would really like to have something like this. At least we could reuse their pidfile_verify() trick to be sure we don't overwrite some other file if the descriptor was reused.

@shun-s
Copy link
Contributor Author

shun-s commented Jan 5, 2016

@trociny @tchaikov
refer to freeBSD, a new version has been completed, mainly repair these:

  1. keep the opened fd in static variable
  2. properly modify pidfile_write and pidfile_remove
  3. ename test_pid_file_in_use() to pidfile_open() and add a new func pidfile_verify for avoiding pidfile overwrite.
  4. other deficiency, like pidfile leak, open-pidfile-failure exit but lock failure retruns

@shun-s
Copy link
Contributor Author

shun-s commented Jan 6, 2016

@trociny two changes have made, please review, many thanks

  1. use 'static struct pidfh pfh' (not pointer) now
  2. when pid_file is empty, return 0 instead of an error, so daemon can run without pidfile if someone specify not to use pidfile.
    but how to specify not to use pidfile in cli when run daemon?

@trociny
Copy link
Contributor

trociny commented Jan 6, 2016

It looks much better to me now. Still, taking it is a C++, I would add a constructor to struct pidfh, which would initialize pf_fd to -1, and other variables to 0.

Also, right now you don't set pf_fd to -1 after close and it may lead to strange bugs, when the descriptor is reused. I would suggest to add a close() method to pidfh, which whould close the pf_fd and set it to -1.

Also, at some point you will need to squash your commits into one (google for "git interactive rebase" if it is new for your). You will need to push with '-f' flag after the rebase.

As for you question about how to specify not to use pidfile, by default pid_file config option is empty, and a daemon is started without pid file. You can redefine it either by specifying pid_file in ceph.conf or via command line (--pid-file /path). If pid_file is defined in ceph.conf but you want to start witout pid file, you can run it with --pid-file ''.

@shun-s shun-s force-pushed the shun-fix branch 2 times, most recently from 25b31c6 to a4eae40 Compare January 7, 2016 06:56
@shun-s
Copy link
Contributor Author

shun-s commented Jan 7, 2016

@trociny @tchaikov , thanks for your careful direction, a constuctor and a close func have been added, please review again.

@trociny
Copy link
Contributor

trociny commented Jan 7, 2016

It looks good to me. I have not tested it though.

BTW, if you added unit tests (as a separate commit) it would be really nice! See src/test/test_*.cc for examples.

if (ret < 0) {
derr << "write_pid_file: failed to write to pid file '"
<< pid_file << "': " << cpp_strerror(ret) << dendl;
VOID_TEMP_FAILURE_RETRY(::close(fd));
<< pfh.pf_path << "': " << cpp_strerror(ret) << dendl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential fd leak here.
Shall put an extra
VOID_TEMP_FAILURE_RETRY(::close(pfh.pf_fd));
here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiexingguo all clean work will be done by pidfile_remove, which is called when daemon shutdown. so it seems not to cause fd leak, am i right?

RUNID=`uuidgen`
run_mon $dir a --pid-file= --daemonize=$RUNID || { teardown_unexist_pidfile $dir; return 1; }
run_osd $dir 0 --pid-file= --daemonize=$RUNID || { teardown_unexist_pidfile $dir; return 1; }
teardown_unexist_pidfile $dir || return 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

teardown_unexist_pidfile() never fails, i think. because rm -fr $dir can hardly fail. unless you set a flag when kill_complete is "1" in teardown_unexist_pidfile(). and return 1 if the flag is not set.

@tchaikov
Copy link
Contributor

there is a little confusion about what man bash said when setting -e option

see #7075 (comment). i quoted the bash's man page.

The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test following the if or elif reserved words, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command's return value is being inverted with !.

your function call is "part of the test following the if or elif reserved words", so the shell does exit if it fails.

this time, i don't rebase the fix commit as to avoid low mistakes like different versions bettween github and my personal computer. if all is ok, then rebase all. does you ever meet this kind of problem?

you need to squash your commits for adding the tests into a single one, i am not sure what sort of problem you are referencing? but you might want to take a look at https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#preparing-and-sending-patches first.

@@ -0,0 +1,111 @@
#!/bin/bash -e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed in the comments, -e does not help in our case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i test if we remove -e, then run_mon $dir a "fail" will never fail even though mon actually fails.
at this situation, we have to add return 1 after every run_mon or run_osd. so keeping -e may look morebeautiful then seeing return 1 everwhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, you are right!

add functions named pidfile_open and pidfile_verify to avoid starting two daemons by a single pid-file

Fixes: ceph#13422
Signed-off-by: shun song <song.shun3@zte.com.cn>
@shun-s shun-s force-pushed the shun-fix branch 2 times, most recently from 39983b8 to bf68bd5 Compare January 26, 2016 08:36
@shun-s
Copy link
Contributor Author

shun-s commented Jan 26, 2016

@tchaikov
run_osd_again and run_mon_again func has been removed and run_mon is used.
please review, thanks

@shun-s
Copy link
Contributor Author

shun-s commented Jan 26, 2016

@tchaikov only run_osd and run_mon have been called in test_pidfile.sh unittest, after this, you won't need to consider whether to add run_osd_again or run_mon_again funcs any more.
please review, thanks

if kill -9 $i 2> /dev/null ; then
kill_complete="0"
count=$((count+1))
sleep ${delays[$count]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would you "sleep" if you just killed a pid successfully?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need a kill failure to tell that the pid daemon had been successfully killed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shun-s

i mean, we don't need to sleep if kill returns successfully.

  • if it returns successfully, you wait for a while just for killing another pid. i don't see the point here.
  • if "kill" fails, the only reason would be that the process is dead somehow, and the succeeding "kill"s would fail for the same reason. assuming kill -9 fails for 9 times, then count would be greater than "9", and ${delays[$count]} is empty. so sleep returns immediately with an error message, count would keep increasing in this never-ends loop.

i'd suggest we use -TERM for killing the daemons and take this into your consideration.

@shun-s
Copy link
Contributor Author

shun-s commented Jan 27, 2016

@tchaikov
TEST_normal has been removed,
please review, thanks

@tchaikov
Copy link
Contributor

@shun-s we are close, just some nits regarding to the teardown_unexist_pidfile().

in short

  • i don't think delay helps with "kill -9",
  • the last kill_complete does not help to tell if teardown_unexist_pidfile() is successful or not, not to mention it can hardly be "1"
  • probably you can follow the model of kill_daemons() in ceph-helper.sh to design your teardown helper function?

Fixes: ceph#13422
Signed-off-by: shun song <song.shun3@zte.com.cn>
@shun-s
Copy link
Contributor Author

shun-s commented Jan 27, 2016

@tchaikov
i hope this would be the last commit of this pull request, hehe

about test, i have followed the model of kill_daemons in ceph-helpers.sh to design teardown helper function.
please take a look again, thanks

@tchaikov
Copy link
Contributor

lgtm with the qa run.

@shun-s
Copy link
Contributor Author

shun-s commented Jan 27, 2016

@tchaikov 3q a lot

liewegas added a commit that referenced this pull request Jan 29, 2016
global: do not start two daemons with a single pid-file

Reviewed-by: Kefu Chai <kchai@redhat.com>
@liewegas liewegas merged commit 71501a3 into ceph:master Jan 29, 2016
@liewegas
Copy link
Member

I think this introduced http://tracker.ceph.com/issues/14575 ... can you take a look please?

@tchaikov
Copy link
Contributor

@liewegas ack. will do.

@shun-s shun-s deleted the shun-fix branch March 29, 2016 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants