Snapshot on signal #2253
Conversation
jyegerlehner
changed the title from
Add signal handler and early exit/snapshot to Solver. to Snapshot on signal
Apr 4, 2015
shelhamer
added the
enhancement
label
Apr 9, 2015
|
Looks cool, but I don't think you can safely lock a mutex in a signal handler, nor try to perform I/O. |
|
Thanks @flx42, I should have researched signals a bit more before I implemented this. |
|
@flx42 do you see any problem with this new implementation? |
|
The type should be "sig_atomic_t volatile". After that, I think it will be fine :) |
|
Thanks for the reply.
Hmm, in all the examples I see, the qualifier precedes the type name, same as with "const" or "mutable". Please let me know if I'm missing something.
true and false are promoted to 1 and 0 per the language spec I'm pretty sure. Since they better convey the meaning, I'm inclined to leave the true and false in there. |
|
On Sat, Apr 18, 2015 at 12:48 PM, jyegerlehner notifications@github.com
No, you're right. That's what I meant.
It was really a nitpicking comment about style. But both ways will be fine. |
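For readers following the exchange above, here is a minimal sketch of the flag-only handler pattern being discussed. It is not the code from this PR: the names HandleSigint, InstallSigintFlagHandler and ConsumeSigint are hypothetical, and it assumes a platform where std::signal keeps the handler installed after delivery (as glibc does). The handler takes no lock and performs no I/O; it only stores to a volatile sig_atomic_t, and the qualifier order does not change the type.

```cpp
#include <csignal>

// Written by the handler, read by normal code. "volatile std::sig_atomic_t"
// and "std::sig_atomic_t volatile" declare exactly the same type.
static volatile std::sig_atomic_t got_sigint = 0;

// Safe to run in signal context: no mutex, no I/O, only a store to the flag.
// (Assigning true would also work, since it converts to 1.)
static void HandleSigint(int /*signum*/) { got_sigint = 1; }

// Hypothetical helper: installs the handler, returns false on failure.
bool InstallSigintFlagHandler() {
  return std::signal(SIGINT, HandleSigint) != SIG_ERR;
}

// Hypothetical helper: called from ordinary code (never from the handler).
// Reports whether SIGINT arrived since the last call and clears the flag.
bool ConsumeSigint() {
  if (!got_sigint) return false;
  got_sigint = 0;
  return true;
}
```

The heavy work (locking, snapshotting, logging) then happens in whatever normal code calls ConsumeSigint, outside the signal context.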
|
Another nitpick: you should avoid unnecessary empty lines changes in your patch. |
lukeyeager
referenced
this pull request
in NVIDIA/DIGITS
May 13, 2015
Closed
Upgrades to scheduler module #104
|
Any update on this? I want it. |
|
Hi Luke, I'm not aware of anything deficient about it. |
shelhamer
added the
JL
label
May 13, 2015
|
Well there is one thing about this that is debatable. Solver doesn't check to see if it should exit when it is doing test, only when it is training. This means if you send it SIGINT and it happens to be in the middle of test (not train), caffe keeps going until the test is finished, and only then does it exit (or snapshot). So you might have to wait a bit if you try to request it to stop when it's testing. I thought that was a good thing since it allows the test to run to completion and produce a valid test result. However, one might prefer that it be more responsive and stop right away regardless of whether it is testing, and throw away the test that is in-progress. We could make it behave that way with a bit of extra complexity. |
That sounds better to me. I would expect Ctrl+C to kill the process within a second or two. If you wait for testing to finish, you might have to wait several minutes. |
If anyone objects to Luke's preferred behaviour please speak up. Otherwise I'll plan to make that change. |
|
I can see an argument for either finishing the testing or quitting. Thanks for the signal handling!
|
|
It would be nice to have both, if that's not too much complexity for you. |
|
Behaviour modified to respond to signals during testing or training, whereas before it just responded during training. And rebased off master. I tested these latest changes on a couple scenarios manually. But some of this code is so new if anyone else can test it that could give more confidence. |
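As a rough illustration of what "respond during testing" can mean, the sketch below polls a callback once per test batch and abandons the test when a stop is requested, throwing away the partial result. Everything here is an assumption made for illustration: RunTest, the Action enum and the float scores are hypothetical stand-ins, not the Solver code in this PR, and snapshot requests during testing are simply ignored in this sketch.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for the actions a client can request.
enum class Action { kNone, kStop, kSnapshot };

// Averages per-batch scores, polling get_action before each batch.
// Returns false if a stop was requested; *mean_score is then left untouched,
// i.e. the in-progress test result is discarded.
bool RunTest(const std::vector<float>& batch_scores,
             const std::function<Action()>& get_action,
             float* mean_score) {
  assert(mean_score != nullptr);
  float sum = 0.0f;
  for (float score : batch_scores) {
    if (get_action && get_action() == Action::kStop) return false;
    sum += score;  // stand-in for running one test batch
  }
  *mean_score = batch_scores.empty() ? 0.0f : sum / batch_scores.size();
  return true;
}
```

With a check per batch, the wait after Ctrl+C is bounded by one batch rather than by the whole test pass.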
Makes sense to me.
Anything for the cause, comrade! |
If we find we need or want that extra sophistication, I think my preference is to add it in a separate PR so that we may thereby proceed more incrementally. The changes to Solver started out very simple, and got more complex with the latest change and feels to me like pushing the limits of added risk in one PR. That said, if everyone really really wants it right now, I can do it. |
|
No, I agree, this should be done in a separate patch. |
|
I think this could use a "Ready for Review" label, if those make any difference. |
shelhamer
added focus RH
labels
Aug 5, 2015
jyegerlehner
commented on an outdated diff
Aug 15, 2015
jyegerlehner
commented on an outdated diff
Aug 15, 2015
| @@ -0,0 +1,91 @@ | ||
| +#include <boost/bind.hpp> | ||
| +#include <boost/thread/mutex.hpp> | ||
| +#include <boost/thread/thread.hpp> |
jyegerlehner
Contributor
|
jyegerlehner
commented on an outdated diff
Aug 15, 2015
|
LGTM. I shall review this PR within next week. |
|
OK thanks. I will squash the commits once the review is done and we know there aren't any more changes required. |
|
@jyegerlehner I just did a quick review today, and tested on my machine. Looks good to me! I'll finish reviewing this PR by tomorrow. (As far as I know, this PR won't apply to Windows. But since we are not officially supporting Windows at the moment, I'm not too worried about that.) Another potential enhancement related to this PR is that we may also support learning rate adjustment on signaling in the future (in a separate PR), so that one may adjust it during training based on e.g. the learning curve from the log, similar to some other deep learning tools. @jeffdonahue @longjon what do you think? |
ronghanghu
and 1 other
commented on an outdated diff
Aug 20, 2015
| +#ifndef INCLUDE_CAFFE_UTIL_SIGNAL_HANDLER_H_ | ||
| +#define INCLUDE_CAFFE_UTIL_SIGNAL_HANDLER_H_ | ||
| + | ||
| +#include "caffe/proto/caffe.pb.h" | ||
| +#include "caffe/solver.hpp" | ||
| + | ||
| +namespace caffe { | ||
| + | ||
| +class SignalHandler { | ||
| + public: | ||
| + // Constructor. Specify what action to take when a signal is received. | ||
| + SignalHandler(SolverParameter_Action SIGINT_action, | ||
| + SolverParameter_Action SIGHUP_action); | ||
| + ActionCallback GetActionFunction(); | ||
| + private: | ||
| + SignalHandler(); // Not implemented. |
ronghanghu
Member
|
ronghanghu
and 1 other
commented on an outdated diff
Aug 20, 2015
| @@ -234,6 +234,19 @@ message SolverParameter { | ||
| // If false, don't save a snapshot after training finishes. | ||
| optional bool snapshot_after_train = 28 [default = true]; | ||
| + | ||
| + // Enumeration of actions that a client of the Solver may request by | ||
| + // implementing the Solver's action request function, which a | ||
| + // client may optionally provide in order to request early termination | ||
| + // or saving a snapshot without exiting. In the executable caffe, this | ||
| + // mechanism is used to allow the snapshot to be saved when stopping | ||
| + // execution with a SIGINT (Ctrl-C). | ||
| + enum Action { | ||
| + NONE = 0; // Take no special action. | ||
| + STOP = 1; // Stop training. snapshot_after_train controls whether a snapshot | ||
| + // is created. | ||
| + SNAPSHOT = 2; // Take a snapshot, and keep training. | ||
| + } |
ronghanghu
Member
|
|
@willyd How will this impact your Windows plans? |
|
@bhack On Windows we would need to call SetConsoleCtrlHandler to handle SIGINT, but I don't think there is an equivalent to SIGHUP. A cross-platform implementation is available in boost.asio. |
|
If there is still interest in #2537 I will avoid asio solution. |
ronghanghu
commented on the diff
Aug 20, 2015
| + | ||
| + struct sigaction sa; | ||
| + // Set up the sighup handler | ||
| + sa.sa_handler = &handle_signal; | ||
| + // Restart the system call, if at all possible | ||
| + sa.sa_flags = SA_RESTART; | ||
| + // Block every signal during the handler | ||
| + sigfillset(&sa.sa_mask); | ||
| + // Intercept SIGHUP and SIGINT | ||
| + if (sigaction(SIGHUP, &sa, NULL) == -1) { | ||
| + LOG(FATAL) << "Cannot install SIGHUP handler."; | ||
| + } | ||
| + if (sigaction(SIGINT, &sa, NULL) == -1) { | ||
| + LOG(FATAL) << "Cannot install SIGINT handler."; | ||
| + } | ||
| + } |
Completed a thorough pass today. This PR seems in good shape to me: it handles signals via POSIX sigaction and addresses actions from the solver's train & test loop in a callback fashion. A side effect: this PR is platform-specific and may impact community Windows ports. |
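A condensed sketch of the approach summarized in this review, for anyone skimming the thread: one handler is installed for both signals with sigaction, it only records which signal arrived, and ordinary code polls for it. This is a simplified illustration and not the merged file; HookSignals, PollSignals and the Pending enum are hypothetical names, and the sketch is POSIX-only.

```cpp
#include <signal.h>  // POSIX sigaction, sigfillset, SA_RESTART

static volatile sig_atomic_t got_sighup = 0;
static volatile sig_atomic_t got_sigint = 0;

// The handler only records which signal arrived.
static void HandleSignal(int signum) {
  if (signum == SIGHUP) {
    got_sighup = 1;
  } else if (signum == SIGINT) {
    got_sigint = 1;
  }
}

// Installs one handler for SIGHUP and SIGINT; returns false on failure.
bool HookSignals() {
  struct sigaction sa = {};      // zero-initialize every field first
  sa.sa_handler = &HandleSignal;
  sa.sa_flags = SA_RESTART;      // restart interrupted system calls if possible
  sigfillset(&sa.sa_mask);       // block every signal while the handler runs
  return sigaction(SIGHUP, &sa, nullptr) == 0 &&
         sigaction(SIGINT, &sa, nullptr) == 0;
}

enum class Pending { kNone, kSighup, kSigint };

// Polled from the training loop; SIGINT is reported before SIGHUP.
Pending PollSignals() {
  if (got_sigint) { got_sigint = 0; return Pending::kSigint; }
  if (got_sighup) { got_sighup = 0; return Pending::kSighup; }
  return Pending::kNone;
}
```

A caller would map each Pending value to the configured action (stop, snapshot, or none) before handing it to the solver.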
|
@ronghanghu The latest commits are intended to resolve the review issues you raised. Please let us know if they are not satisfactory. As for the failing Travis build: it looks like it happened due to an error installing a package. Does anyone know whether I should make a dummy commit to provoke it to build again? Or am I missing an actual problem that I need to fix? |
|
I restarted Travis CI and now all tests pass. I'll try to take a look today. |
ronghanghu
added the
ready for review
label
Aug 22, 2015
|
Seems ready to me :) Please squash into one commit. |
|
@jeffdonahue @longjon I would like to merge this PR if you don't have other concerns. Community windows ports can perhaps simply strip this feature with |
|
OK, do you prefer commits to be squashed? |
|
Yes, please squash into one commit, so that I can merge in this weekend. |
ronghanghu
added a commit
that referenced
this pull request
Aug 22, 2015
|
|
ronghanghu |
12e1432
|
ronghanghu
merged commit 12e1432
into
BVLC:master
Aug 22, 2015
1 check passed
|
this is great. thanks for the effort @jyegerlehner |
|
Sure thing @erogol. Glad to hear it's helpful. |
jyegerlehner
deleted the
jyegerlehner:snapshot_on_signal branch
Aug 31, 2015
This was referenced Sep 16, 2015
Coderx7
commented
May 23, 2016
|
Can you add an option to the solver so that users can take snap-shots at any given time by pressing a key combination like Ctrl-S for example? |
|
@Coderx7 I think that'd add too much complexity to the feature. If you want to snapshot at an arbitrary time without stopping training, just send SIGHUP ( |
Coderx7
commented
May 24, 2016
@ajtulloch yes, but at the same time it provides a very convenient and good feature to have, |
jyegerlehner commented Apr 3, 2015
This is an implementation of the feature discussed in Issue 2012.
When you hit Ctrl-C to kill caffe (while training), it will now save a snapshot before exiting. Actually, the Solver just stops training, and a snapshot is only saved if snapshot_after_train is true. This is the default behavior, which is configurable via the sigint_effect and sighup_effect command line options. Also by default, the SIGHUP signal causes caffe to save a snapshot and continue training. So you can make caffe save a snapshot by sending it a SIGHUP signal, e.g.:
kill -SIGHUP PID
where PID is the process id of caffe, which you can find by doing
ps -ef | grep caffe
The design has two pieces to it.
1. Solver is modified slightly so that after each iteration it checks to see if its client wants it to either snapshot or exit. It does this via a callback function that a client can set on the Solver instance. If the callback hasn't been set, it just carries on as usual. So there is no breaking change to the Solver interface and the behavior of existing code shouldn't change.
2. The caffe executable provides the callback function to the Solver, and the callback is implemented on a SignalHandler that intercepts SIGINT and SIGHUP.
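The two-piece design described above can be sketched roughly as follows. This is a toy model written for illustration, not Caffe's Solver: ToySolver, its members, and the counter standing in for real snapshots are all hypothetical. It assumes the callback is polled once after every iteration and that an unset callback means "no request", so existing callers see no change.

```cpp
#include <cassert>
#include <functional>

// Mirrors the NONE / STOP / SNAPSHOT actions from the proto enum.
enum class Action { kNone, kStop, kSnapshot };
using ActionCallback = std::function<Action()>;

// Hypothetical toy solver: counts snapshots instead of writing files.
class ToySolver {
 public:
  explicit ToySolver(bool snapshot_after_train = true)
      : snapshot_after_train_(snapshot_after_train) {}

  // Optional: a client (e.g. a signal handler wrapper) supplies the callback.
  void SetActionFunction(ActionCallback cb) { callback_ = cb; }

  // Runs up to max_iter iterations and returns how many were completed.
  int Solve(int max_iter) {
    assert(max_iter >= 0);
    int iter = 0;
    while (iter < max_iter) {
      ++iter;  // stand-in for one training step
      // With no callback set, behave exactly as before.
      Action request = callback_ ? callback_() : Action::kNone;
      if (request == Action::kSnapshot) ++snapshots_;  // snapshot, keep going
      if (request == Action::kStop) break;             // stop training early
    }
    if (snapshot_after_train_) ++snapshots_;  // final snapshot is optional
    return iter;
  }

  int snapshots() const { return snapshots_; }

 private:
  ActionCallback callback_;
  bool snapshot_after_train_;
  int snapshots_ = 0;
};
```

Keeping the solver ignorant of signals means the same hook could be driven by something other than SIGINT and SIGHUP, which is one way a non-POSIX port could plug in its own mechanism.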