Add support for failure management infrastructure

Adds the slurmctld/nonstop plugin, the libsmdns.so library, the slurm/smd_ns.h
header, and the smd command. Also adds regression tests for the smd command.
1 parent 8a2a66d commit db7b2f7bdee701aabe28d48028d429b2cf02c508
jette committed Feb 14, 2014
Showing with 11,188 additions and 53 deletions.
  1. +1 −0 Makefile.am
  2. +1 −0 Makefile.in
  3. +15 −13 configure
  4. +14 −12 configure.ac
  5. +5 −1 doc/html/man_index.shtml
  6. +2 −0 doc/man/man1/Makefile.am
  7. +2 −0 doc/man/man1/Makefile.in
  8. +1 −1 doc/man/man1/sdiag.1
  9. +114 −0 doc/man/man1/smd.1
  10. +2 −0 doc/man/man5/Makefile.am
  11. +2 −0 doc/man/man5/Makefile.in
  12. +147 −0 doc/man/man5/nonstop.conf.5
  13. +487 −0 slurm/smd_ns.h
  14. +29 −5 src/Makefile.am
  15. +8 −8 src/Makefile.in
  16. +20 −3 src/api/Makefile.am
  17. +38 −5 src/api/Makefile.in
  18. +2,684 −0 src/api/smd_ns.c
  19. +1 −1 src/plugins/slurmctld/Makefile.am
  20. +1 −1 src/plugins/slurmctld/Makefile.in
  21. +26 −0 src/plugins/slurmctld/nonstop/Makefile.am
  22. +813 −0 src/plugins/slurmctld/nonstop/Makefile.in
  23. +1,808 −0 src/plugins/slurmctld/nonstop/do_work.c
  24. +127 −0 src/plugins/slurmctld/nonstop/do_work.h
  25. +387 −0 src/plugins/slurmctld/nonstop/msg.c
  26. +47 −0 src/plugins/slurmctld/nonstop/msg.h
  27. +97 −0 src/plugins/slurmctld/nonstop/nonstop.c
  28. +385 −0 src/plugins/slurmctld/nonstop/read_config.c
  29. +105 −0 src/plugins/slurmctld/nonstop/read_config.h
  30. +19 −0 src/smd/Makefile.am
  31. +793 −0 src/smd/Makefile.in
  32. +643 −0 src/smd/automatic.c
  33. +331 −0 src/smd/manual.c
  34. +480 −0 src/smd/opt.c
  35. +55 −0 src/smd/smd.c
  36. +59 −0 src/smd/smd.h
  37. +7 −0 testsuite/expect/Makefile.am
  38. +9 −2 testsuite/expect/README
  39. +40 −1 testsuite/expect/globals
  40. +57 −0 testsuite/expect/test29.1
  41. +75 −0 testsuite/expect/test29.2
  42. +201 −0 testsuite/expect/test29.3
  43. +183 −0 testsuite/expect/test29.4
  44. +202 −0 testsuite/expect/test29.5
  45. +246 −0 testsuite/expect/test29.6
  46. +198 −0 testsuite/expect/test29.7
  47. +221 −0 testsuite/expect/test29.8
Makefile.am
@@ -32,6 +32,7 @@ pkginclude_HEADERS = \
slurm/slurm.h \
slurm/slurmdb.h \
slurm/slurm_errno.h \
+ slurm/smd_ns.h \
slurm/spank.h
MAINTAINERCLEANFILES = \
Makefile.in
@@ -543,6 +543,7 @@ pkginclude_HEADERS = \
slurm/slurm.h \
slurm/slurmdb.h \
slurm/slurm_errno.h \
+ slurm/smd_ns.h \
slurm/spank.h
MAINTAINERCLEANFILES = \
configure

Large diffs are not rendered by default.

configure.ac
@@ -447,28 +447,29 @@ AC_CONFIG_FILES([Makefile
src/sacct/Makefile
src/sacctmgr/Makefile
src/sreport/Makefile
- src/sstat/Makefile
- src/sshare/Makefile
src/salloc/Makefile
src/sbatch/Makefile
+ src/sbcast/Makefile
src/sattach/Makefile
+ src/scancel/Makefile
+ src/scontrol/Makefile
src/sdiag/Makefile
- src/sprio/Makefile
- src/srun/Makefile
- src/srun/libsrun/Makefile
- src/srun_cr/Makefile
+ src/sinfo/Makefile
+ src/slurmctld/Makefile
src/slurmd/Makefile
src/slurmd/common/Makefile
src/slurmd/slurmd/Makefile
src/slurmd/slurmstepd/Makefile
src/slurmdbd/Makefile
- src/slurmctld/Makefile
- src/sbcast/Makefile
- src/scontrol/Makefile
- src/scancel/Makefile
- src/squeue/Makefile
- src/sinfo/Makefile
src/smap/Makefile
+ src/smd/Makefile
+ src/sprio/Makefile
+ src/squeue/Makefile
+ src/srun/Makefile
+ src/srun/libsrun/Makefile
+ src/srun_cr/Makefile
+ src/sshare/Makefile
+ src/sstat/Makefile
src/strigger/Makefile
src/sview/Makefile
src/plugins/Makefile
@@ -585,6 +586,7 @@ AC_CONFIG_FILES([Makefile
src/plugins/select/serial/Makefile
src/plugins/slurmctld/Makefile
src/plugins/slurmctld/dynalloc/Makefile
+ src/plugins/slurmctld/nonstop/Makefile
src/plugins/slurmd/Makefile
src/plugins/switch/Makefile
src/plugins/switch/cray/Makefile
doc/html/man_index.shtml
@@ -14,9 +14,11 @@ Documentation for other versions of Slurm is distributed with the code</b></p>
<tr><td><a href="sbcast.html">sbcast</a></td><td>transmit a file to the nodes allocated to a SLURM job.</td></tr>
<tr><td><a href="scancel.html">scancel</a></td><td>Used to signal jobs or job steps that are under the control of Slurm.</td></tr>
<tr><td><a href="scontrol.html">scontrol</a></td><td>Used view and modify Slurm configuration and state.</td></tr>
+<tr><td><a href="sdiag.html">sdiag</a></td><td>scheduling diagnostic tool.</td></tr>
<tr><td><a href="sinfo.html">sinfo</a></td><td>view information about SLURM nodes and partitions.</td></tr>
<tr><td><a href="slurm.html">slurm</a></td><td>SLURM system overview.</td></tr>
<tr><td><a href="smap.html">smap</a></td><td>graphically view information about SLURM jobs, partitions, and set configurations parameters.</td></tr>
+<tr><td><a href="smd.html">smd</a></td><td>Used to manage failures in a resource allocation.</td></tr>
<tr><td><a href="sprio.html">sprio</a></td><td>view the factors that comprise a job's scheduling priority</td></tr>
<tr><td><a href="sh5util.html">sh5util</a></td><td>merge utility for acct_gather_profile plugin.</td></tr>
<tr><td><a href="squeue.html">squeue</a></td><td>view information about jobs located in the SLURM scheduling queue.</td></tr>
@@ -30,8 +32,10 @@ Documentation for other versions of Slurm is distributed with the code</b></p>
<tr><td><a href="acct_gather.conf.html">acct_gather.conf</a></td><td>Slurm configuration file for the acct_gather plugins</td></tr>
<tr><td><a href="bluegene.conf.html">bluegene.conf</a></td><td>Slurm configuration file for BlueGene systems</td></tr>
<tr><td><a href="cgroup.conf.html">cgroup.conf</a></td><td>Slurm configuration file for the cgroup support</td></tr>
+<tr><td><a href="cray.conf.html">cray.conf</a></td><td>Slurm configuration file for Cray systems.</td></tr>
<tr><td><a href="ext_sensors.conf.html">ext_sensors.conf</a></td><td>Slurm configuration file for the external sensor support</td></tr>
<tr><td><a href="gres.conf.html">gres.conf</a></td><td>Slurm configuration file for generic resource management.</td></tr>
+<tr><td><a href="nonstop.conf.html">nonstop.conf</a></td><td>Slurm configuration file for failure management.</td></tr>
<tr><td><a href="slurm.conf.html">slurm.conf</a></td><td>Slurm configuration file</td></tr>
<tr><td><a href="slurmdbd.conf.html">slurmdbd.conf</a></td><td>Slurm Database Daemon (SlurmDBD) configuration file</td></tr>
<tr><td><a href="topology.conf.html">topology.conf</a></td><td>Slurm configuration file for defining the network topology</td></tr>
@@ -44,6 +48,6 @@ Documentation for other versions of Slurm is distributed with the code</b></p>
</table>
-<p style="text-align:center;">Last modified 10 July 2013</p>
+<p style="text-align:center;">Last modified 14 February 2014</p>
<!--#include virtual="footer.txt"-->
doc/man/man1/Makefile.am
@@ -13,6 +13,7 @@ man1_MANS = \
sinfo.1 \
slurm.1 \
smap.1 \
+ smd.1 \
sprio.1 \
sh5util.1 \
squeue.1 \
@@ -40,6 +41,7 @@ html_DATA = \
sdiag.html \
sinfo.html \
smap.html \
+ smd.html \
sprio.html \
sh5util.html \
squeue.html \
doc/man/man1/Makefile.in
@@ -431,6 +431,7 @@ man1_MANS = \
sinfo.1 \
slurm.1 \
smap.1 \
+ smd.1 \
sprio.1 \
sh5util.1 \
squeue.1 \
@@ -455,6 +456,7 @@ EXTRA_DIST = $(man1_MANS) $(am__append_1)
@HAVE_MAN2HTML_TRUE@ sdiag.html \
@HAVE_MAN2HTML_TRUE@ sinfo.html \
@HAVE_MAN2HTML_TRUE@ smap.html \
+@HAVE_MAN2HTML_TRUE@ smd.html \
@HAVE_MAN2HTML_TRUE@ sprio.html \
@HAVE_MAN2HTML_TRUE@ sh5util.html \
@HAVE_MAN2HTML_TRUE@ squeue.html \
doc/man/man1/sdiag.1
@@ -1,7 +1,7 @@
.TH "sdiag" "1" "SLURM 2.4" "December 2011" "SLURM Commands"
.SH "NAME"
.LP
-sdiag \- Diagnostic tool for SLURM
+sdiag \- Scheduling diagnostic tool for SLURM
.SH "SYNOPSIS"
.LP
doc/man/man1/smd.1
@@ -0,0 +1,114 @@
+.TH SMD "1" "February 2014" "smd 14.03" "Slurm components"
+
+.SH "NAME"
+smd \- Used to manage failures in a resource allocation.
+
+.SH "SYNOPSIS"
+\fBsmd\fR [\fIOPTIONS\fR...] [\fIjob_id\fR]
+
+.SH "DESCRIPTION"
+.LP
+Slurm command used to manage failures in a resource allocation.
+
+.SH "OPTIONS"
+.TP
+\fB\-c\fR, \fB\-\-show-config\fR
+Shows the configuration of smd.
+.TP
+\fB\-d\fR, \fB\-\-drain-node\fR \fInode_name\fR
+Drains the specified host of the job (Note: a reason must be given with \fB\-R\fR).
+.TP
+\fB\-D\fR, \fB\-\-drop_node\fR \fInode_name\fR
+Drops the failed or failing host.
+.TP
+\fB\-e\fR, \fB\-\-extend-time\fR
+Extends the runtime of the job.
+.TP
+\fB\-f\fR, \fB\-\-faulty-nodes\fR \fInode_name\fR
+Gets the hosts that have failed or are failing.
+.TP
+\fB\-j\fR, \fB\-\-job_info\fR
+Gets information about the specified job ID.
+.TP
+\fB\-r\fR, \fB\-\-replace-node\fR \fInode_name\fR
+Replaces the drained host with a new one.
+.TP
+\fB\-v\fR, \fB\-\-verbose\fR
+Prints detailed event logging. Multiple \fB\-v\fR's will further
+increase the verbosity of logging. By default only errors will display.
+
+.SH "EXAMPLES"
+Show the smd configuration.
+.nf
+ $ smd \-c
+ System Configuration:
+ ConfigurationFile: /etc/nonstop.conf
+ ControllerAddress: localhost
+ LibraryDebug: 0
+ ControllerPort: 9114
+ ReadTimeout: 10000
+ WriteTimeout: 10000
+ HotSpareCount: "debug:0"
+ MaxSpareNodeCount: 10
+ TimeLimitDelay: 600
+ TimeLimitDrop: 0
+ TimeLimitExtend: 2
+ UserDrainAllow: "alan,brenda"
+ UserDrainDeny: "none"
+.fi
+
+.PP
+Replace a failed node in a job allocation and extend its time limit.
+.nf
+ $ salloc \-N4 \-\-no\-kill bash
+ salloc: Granted job allocation 67
+ $ squeue
+ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
+ 67 debug bash jette R 0:48 4 tux[0\-3]
+ salloc: error: Node failure on tux2
+ $ smd \-f $SLURM_JOBID
+ Job 67 has 1 failed or failing hosts:
+ node tux2 cpu_count 1 state FAILED
+ $ smd \-r tux2 $SLURM_JOBID
+ Job 67 got node tux2 replaced with node tux4
+ $ squeue
+ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
+ 67 debug bash jette R 0:48 4 tux[0\-1,3\-4]
+ $ smd \-e 2 $SLURM_JOBID
+ Job 67 run time increased by 2min successfully
+.fi
+
+.PP
+Identify a failing node in a job allocation, drop it from the job allocation,
+and extend the job time limit.
+.nf
+ $ salloc \-N4 \-\-no\-kill bash
+ salloc: Granted job allocation 69
+ $ squeue
+ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
+ 69 debug bash jette R 0:48 4 tux[0\-3]
+ $ smd \-d tux2 \-R "Application X hangs" $SLURM_JOBID
+ Job 69 node tux2 is being drained
+ $ smd \-f
+ Job 69 has 1 failed or failing hosts:
+ node tux2 cpu_count 1 state FAILING
+ $ smd \-D tux2 $SLURM_JOBID
+ Job 69 node tux2 dropped successfully
+ $ squeue
+ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
+ 69 debug bash jette R 0:48 3 tux[0\-1,3]
+ $ smd \-e 2 $SLURM_JOBID
+ Job 69 run time increased by 2min successfully
+.fi
+
+.SH "COPYING"
+Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.
+.LP
+Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
+details.
+
+.SH "SEE ALSO"
+.LP
+nonstop.conf(5)
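
The replace-and-extend workflow shown in the smd.1 EXAMPLES section could be scripted roughly as follows. This is a sketch and not part of the commit. It assumes that `smd -f` prints lines of the form `node <name> cpu_count <n> state <STATE>`, as in the man page example, and the helper name `failed_nodes` is hypothetical. Only the `-f`, `-r`, and `-e` options documented in the man page are used.

```shell
#!/bin/sh
# Hypothetical helper: read `smd -f` output on stdin and print the name of
# every node reported in state FAILED. The line format is an assumption
# taken from the man page example, not a documented interface.
failed_nodes() {
    awk '$1 == "node" && $(NF-1) == "state" && $NF == "FAILED" { print $2 }'
}

# Replace each failed node of the current job, then extend the job's time
# limit by 2 minutes. The guard skips this part when no job ID is set or
# smd is not installed, so the helper can be sourced on its own.
if [ -n "${SLURM_JOBID:-}" ] && command -v smd >/dev/null 2>&1; then
    for node in $(smd -f "$SLURM_JOBID" | failed_nodes); do
        smd -r "$node" "$SLURM_JOBID"
    done
    smd -e 2 "$SLURM_JOBID"
fi
```

Nodes in state FAILING are deliberately left alone here, since the second man page example handles them by draining and dropping rather than replacing.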
doc/man/man5/Makefile.am
@@ -7,6 +7,7 @@ man5_MANS = \
cray.conf.5 \
ext_sensors.conf.5 \
gres.conf.5 \
+ nonstop.conf.5 \
slurm.conf.5 \
slurmdbd.conf.5 \
topology.conf.5 \
@@ -23,6 +24,7 @@ html_DATA = \
cray.conf.html \
ext_sensors.conf.html \
gres.conf.html \
+ nonstop.conf.html \
slurm.conf.html \
slurmdbd.conf.html \
topology.conf.html \
doc/man/man5/Makefile.in
@@ -425,6 +425,7 @@ man5_MANS = \
cray.conf.5 \
ext_sensors.conf.5 \
gres.conf.5 \
+ nonstop.conf.5 \
slurm.conf.5 \
slurmdbd.conf.5 \
topology.conf.5 \
@@ -438,6 +439,7 @@ EXTRA_DIST = $(man5_MANS) $(am__append_1)
@HAVE_MAN2HTML_TRUE@ cray.conf.html \
@HAVE_MAN2HTML_TRUE@ ext_sensors.conf.html \
@HAVE_MAN2HTML_TRUE@ gres.conf.html \
+@HAVE_MAN2HTML_TRUE@ nonstop.conf.html \
@HAVE_MAN2HTML_TRUE@ slurm.conf.html \
@HAVE_MAN2HTML_TRUE@ slurmdbd.conf.html \
@HAVE_MAN2HTML_TRUE@ topology.conf.html \