Skip to content

Added logic to handle kill signals in the bash script#205

Merged
tgianos merged 4 commits intoNetflix:developfrom
amitsharma1981:develop
Mar 23, 2016
Merged

Added logic to handle kill signals in the bash script#205
tgianos merged 4 commits intoNetflix:developfrom
amitsharma1981:develop

Conversation

@amitsharma1981
Copy link
Contributor

The run.sh script for each job now has a handle_kill_request method which gets triggered once it traps a SIGTERM (kill or kill -15) signal. This initiates a kill sequence as follows:

  1. Writes the Kill exit code (999) to genie.done file.
  2. The base process (run.sh) sends a kill to its children
  3. Then it checks every second for the next 30 seconds if the children have been killed
  4. If all children are dead it exits.
  5. If all children are not dead it sends a kill -9 to all children

The behavior of kill is slightly different between Linux and Non-Linux OS.

For a linux OS , the system launches the run.sh script using setsid. This results in run.sh, its children and all its grandchildren to launch in the same process group which is the pid of run.sh. Once this is done the kill logic uses the "-g" flag to send the kill and kill -9 signals to all at the processes in the process group. This signal is send to run.sh itself in this case. The kill signal is handled by run.sh and ignored, but the kill -9 kills the run.sh as well, which is fine as at this point there is nothing else it can do.

For a Non-Linux OS we simply launch the run.sh as a normal process and use its pid to send kill and kill -9 signals to its immediate children only, and hope that they handle the killing of their own children.

… for the job. The Kill(DELETE) HTTP request not just returns the status Accepted notifying user of its asynchronous behavior. Also added conditional logic for Linux vs Non-Linux operating systems.
Conflicts:
	genie-core/src/main/java/com/netflix/genie/core/jobs/JobConstants.java
…it kill and timeout. Fixed a bug where the timeout value for the job was getting lost before being persisted to the database
// is launched in process group id which is the same as the pid of the parent process
if (SystemUtils.IS_OS_LINUX) {
command.add("setsid");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to start having lots of OS level differences I'd say we should create subclasses for LinuxJobKickoffTask, OSXJobKickoffTask, etc... that are created as beans based on what OS it is. That may well be overkill for right now but it's something to keep in mind.

…ess groups instead of immediate children only on Linux OS.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 92.964% when pulling f9ee7dd on amitsharma1981:develop into 05268af on Netflix:develop.

@tgianos tgianos self-assigned this Mar 23, 2016
@tgianos tgianos added this to the 3.0.0 milestone Mar 23, 2016
@tgianos tgianos merged commit 2433f6d into Netflix:develop Mar 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants