SJM: a simple job manager
Phil Lacroute
April 2008
Copyright Notice
================
Copyright (c) 2008-2012, The Board of Trustees of The Leland Stanford
Junior University. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* Neither the name of Stanford University nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL STANFORD
UNIVERSITY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Overview
========
The purpose of sjm is to manage groups of related jobs running on a
compute cluster. The input to the program is a file containing a
description of the jobs that need to be run, the resources they
require and the dependencies between them. sjm can do the following:
+ sjm dispatches each job to the batch queuing system once all of its
dependencies have been satisfied
+ sjm monitors the progress of all running jobs and produces both a
detailed log and a concise summary indicating any jobs that failed
+ sjm makes it easy to rerun any jobs that failed and the jobs that
depend on them, without rerunning jobs that finished successfully
+ sjm can run jobs directly on the submission host as well as through
the batch queuing system; this is useful if some tasks require
resources only available on the head node of a cluster, such as
network access to remote hosts
+ sjm can submit jobs using Platform LSF or Sun Grid Engine
Job Description File
====================
1. Example
The input to sjm is a job description file. Here is a simple example:
# example sjm input file
job_begin
name jobA
time 4h
memory 3G
queue standard
project sequencing
cmd /home/lacroute/project/jobA.sh
job_end
job_begin
name jobB
time 2d
memory 1G
queue extended
cmd_begin
/home/lacroute/project/jobB_prolog.sh;
/home/lacroute/project/jobB.sh;
/home/lacroute/project/jobB_epilog.sh
cmd_end
job_end
order jobA before jobB
log_dir /home/lacroute/project/log
This description defines two jobs named jobA and jobB. Each job has
an optional time limit, an optional memory requirement, an optional
submission queue, and a command to run. The "order" statement
specifies that jobA must finish before jobB can start. Finally, the
"log_dir" statement specifies a directory where the output of each
command will be written.
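Because a job description file is plain text, it can be generated by a
script when a pipeline contains many similar jobs. The Python sketch
below shows one possible way to do this. It is not part of sjm: the
helper names render_job and render_pipeline are hypothetical, the
sketch supports only the single-line "cmd" form, and it does not
validate keywords or values.

```python
def render_job(name, cmd, **options):
    """Return one job_begin/job_end block as text.

    `options` maps statement keywords (time, memory, queue, ...) to
    their values. Hypothetical helper: nothing is validated here.
    """
    lines = ["job_begin", f"name {name}"]
    for keyword, value in options.items():
        lines.append(f"{keyword} {value}")
    lines.append(f"cmd {cmd}")  # single-line command form only
    lines.append("job_end")
    return "\n".join(lines)


def render_pipeline(jobs, orders, log_dir=None):
    """Join job blocks, "order" statements and an optional log_dir."""
    parts = [render_job(**job) for job in jobs]
    parts += [f"order {first} before {second}" for first, second in orders]
    if log_dir is not None:
        parts.append(f"log_dir {log_dir}")
    return "\n".join(parts) + "\n"


text = render_pipeline(
    jobs=[
        {"name": "jobA", "cmd": "/home/lacroute/project/jobA.sh",
         "time": "4h", "memory": "3G"},
        {"name": "jobB", "cmd": "/home/lacroute/project/jobB.sh",
         "time": "2d", "memory": "1G"},
    ],
    orders=[("jobA", "jobB")],
    log_dir="/home/lacroute/project/log",
)
print(text)
```

The printed text can be saved to a file and passed to sjm as the
JOB_FILE argument.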
2. Job Specification Blocks
A job specification block has the form:
job_begin
STATEMENT1
STATEMENT2
...
job_end
Each statement must be on a single line, except for the
cmd_begin/cmd_end form described below. The following statements are
allowed within the job specification block:
name STRING
[REQUIRED]
The name of the job. This name will be used in the log files and
the batch queuing system. Each job must have a unique name.
slots NUMBER
[OPTIONAL]
The number of processor slots (i.e. CPU cores) the job will use.
parallel_env NAME
[OPTIONAL]
The name of the parallel environment that defines the slot-allocation
policy.
memory AMOUNT
[OPTIONAL]
The maximum amount of memory the job will use. The job may be
killed if it exceeds this amount. The AMOUNT consists of a number
followed by "b" (bytes), "k" (kilobytes), "m" (megabytes)
or "g" (gigabytes). The unit letter may be upper or lower case, and
spaces may separate the number from the unit.
For example:
memory 2G (2 gigabytes)
memory 1500 m (1500 megabytes)
time AMOUNT
[OPTIONAL]
The maximum amount of wallclock time the job will use. The job may
be killed if it exceeds this amount. The AMOUNT consists of a
number followed by "s" (seconds), "m" (minutes), "h" (hours)
or "d" (days). The unit letter may be upper or lower case, and
spaces may separate the number from the unit.
For example:
time 2d (2 days)
time 4 H (4 hours)
queue STRING
[OPTIONAL]
The name of the queue for job submission.
project STRING
[OPTIONAL]
The name of the project the job is associated with. Generally
the name must be a project that has been registered with the job
submission system.
cmd STRING
[REQUIRED unless the cmd_begin/cmd_end form is used]
The command to run. The command must be specified on a single
line. The command line is interpreted by the shell, so it can
contain file redirection characters and other shell syntax. For
multi-line commands there is an alternate form (see the next
statement type).
cmd_begin
CMD_LINE1
CMD_LINE2
...
cmd_end
[REQUIRED unless "cmd" is used]
The command to run. Use this form instead of "cmd" for multi-line
commands. The command will be interpreted by the shell. Newlines
are treated as spaces, so you can spread a single command over
multiple lines. If there are multiple commands, separate them
with semicolons.
export NAME
export NAME=VALUE
[OPTIONAL]
An environment variable to pass to the command. The first form
copies the current value of the specified environment variable. The
second form sets the value explicitly.
module NAME
[OPTIONAL]
An environment module that will be loaded when running the command.
directory PATHNAME
[OPTIONAL]
A directory (specified as a full path) that will be used as the
current working directory when running the command.
host localhost
[OPTIONAL]
Run this job on the local host instead of submitting it to the batch
system.
sched_options STRING ...
[OPTIONAL]
Specify additional scheduler-specific options. For Sun Grid Engine
you may include qsub options that are not available through the
other directives.
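To sanity-check "memory" and "time" values before submitting a job
file, the AMOUNT syntax described above can be parsed as in the Python
sketch below. This is an illustration, not sjm's own parser: the
function name parse_amount is hypothetical, the number is assumed to
be a whole number, and a kilobyte is assumed to be 1024 bytes (this
document does not state which convention sjm uses).

```python
import re

# Assumption: binary multiples; sjm's actual convention is not documented here.
MEMORY_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
TIME_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# A whole number, optional spaces, then a single unit letter.
_AMOUNT = re.compile(r"^\s*(\d+)\s*([A-Za-z])\s*$")


def parse_amount(text, units):
    """Convert an AMOUNT such as "2G" or "4 H" to bytes or seconds."""
    match = _AMOUNT.match(text)
    if match is None:
        raise ValueError(f"not a valid amount: {text!r}")
    number, unit = match.groups()
    unit = unit.lower()  # capital letters are allowed
    if unit not in units:
        raise ValueError(f"unknown unit {unit!r} in {text!r}")
    return int(number) * units[unit]


print(parse_amount("2G", MEMORY_UNITS))      # 2147483648
print(parse_amount("1500 m", MEMORY_UNITS))  # 1572864000
print(parse_amount("2d", TIME_UNITS))        # 172800
print(parse_amount("4 H", TIME_UNITS))       # 14400
```

Note that "m" means megabytes for "memory" and minutes for "time", so
the unit table must match the statement being checked.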
3. Other Statements
Outside of a job specification block (not within the job_begin/job_end
keywords) the following statements are allowed:
order JOB1 before JOB2
[OPTIONAL]
JOB1 and JOB2 are job names. JOB1 must finish before JOB2 can start.
order JOB2 after JOB1
[OPTIONAL]
JOB1 and JOB2 are job names. JOB1 must finish before JOB2 can start.
log_dir STRING
[OPTIONAL]
Specifies a directory where all of the output from each job will be
stored. Standard output will go into a file named JOBNAME_oTIME.txt
and standard error will go into a file named JOBNAME_eTIME.txt where
JOBNAME is the name of the job and TIME is the time when sjm started
running. Both of these files will be in the directory specified by
this statement. Note that this statement only applies to jobs
submitted through the batch system and not to jobs run on the local
host (use command-line I/O redirection in the job description for local
host jobs).
The job description file may also contain comment lines beginning
with a "#".
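The "order" statements define a dependency graph between jobs. The
Python sketch below reads such statements and prints one order in
which the jobs could run, which is a quick way to catch circular
dependencies before submitting anything. It does not describe how sjm
itself schedules jobs: the function name parse_orders is hypothetical,
and the sketch relies on the graphlib module from the Python standard
library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def parse_orders(lines):
    """Return {job: set of jobs that must finish before it starts}."""
    deps = {}
    for line in lines:
        words = line.split()
        if len(words) != 4 or words[0] != "order":
            continue  # not an "order" statement; ignored by this sketch
        _, left, relation, right = words
        if relation == "before":
            first, second = left, right
        elif relation == "after":
            first, second = right, left
        else:
            raise ValueError(f"unrecognized order statement: {line!r}")
        deps.setdefault(first, set())
        deps.setdefault(second, set()).add(first)
    return deps


deps = parse_orders([
    "order jobA before jobB",
    "order jobC after jobB",
])
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)  # ['jobA', 'jobB', 'jobC']
```

If the statements contain a cycle (for example jobA before jobB and
jobB before jobA), static_order() raises graphlib.CycleError.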
Running sjm
===========
Run sjm as follows:
sjm [options] JOB_FILE
JOB_FILE is the job description file. For a description of all the
options, run "sjm --help". Most of them allow you to adjust
parameters but usually the defaults are fine. The most useful options
are:
-i run sjm in the foreground and print log messages to the terminal;
the default is to run in the background and store log messages in
a file named JOB_FILE.status.log
-r instead of dispatching the jobs, output a graphical representation
of the job dependency graph in "dot" format. See www.graphviz.org
for programs to display this format.
sjm produces three output files:
1) The log file (JOB_FILE.status.log by default, or standard output with
the -i option) contains a running log of all the jobs sjm submits and
has a summary at the end indicating which jobs completed
successfully.
2) The status file (JOB_FILE.status by default) contains detailed
information about the status of each job. It can be used to rerun
sjm if some of the jobs fail.
3) The backup status file (JOB_FILE.status.bak by default) is a
slightly older backup copy of the status file. Use this file if
sjm crashes while updating the status file and leaves it in a
corrupted state, which should be very rare.
If the log file indicates that some jobs failed, first determine the
cause and fix whatever needs to be modified. You can then rerun the
failed jobs by running sjm on the status file or the backup status
file:
cp JOB_FILE.status JOB_FILE2
sjm JOB_FILE2
The job summary in the log file consists of lines that look like:
job0 (5714): 1:23:06 (513 sec, 120/167 MB)
The fields are:
job name (job0)
batch system job ID (5714)
wallclock time (1 hour, 23 minutes, 6 seconds)
CPU time (513 sec)
maximum virtual memory usage (120 MB)
maximum swap usage (167 MB)
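To collect these fields for many jobs, the summary lines can be parsed
with a regular expression, as in the Python sketch below. The pattern
is derived only from the sample line shown above, so it is an
assumption that every summary line has exactly this shape; lines that
do not match are returned as None. The names SUMMARY_LINE and
parse_summary_line are hypothetical.

```python
import re

# Pattern modeled on the sample line: name (id): H:MM:SS (N sec, M/S MB)
SUMMARY_LINE = re.compile(
    r"^(?P<name>\S+) \((?P<job_id>\d+)\): "
    r"(?P<hours>\d+):(?P<minutes>\d{2}):(?P<seconds>\d{2}) "
    r"\((?P<cpu_sec>\d+) sec, (?P<mem_mb>\d+)/(?P<swap_mb>\d+) MB\)$"
)


def parse_summary_line(line):
    """Return the summary fields as a dict, or None if the line differs."""
    match = SUMMARY_LINE.match(line.strip())
    if match is None:
        return None
    f = match.groupdict()
    wallclock = int(f["hours"]) * 3600 + int(f["minutes"]) * 60 + int(f["seconds"])
    return {
        "name": f["name"],
        "job_id": int(f["job_id"]),
        "wallclock_sec": wallclock,
        "cpu_sec": int(f["cpu_sec"]),
        "mem_mb": int(f["mem_mb"]),
        "swap_mb": int(f["swap_mb"]),
    }


record = parse_summary_line("job0 (5714): 1:23:06 (513 sec, 120/167 MB)")
print(record["wallclock_sec"])  # 4986
```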
While sjm is running you can kill all the currently-running jobs by
sending an interrupt to the sjm process. If it was run with the -i
(interactive) flag, just hit CTRL-C. Otherwise, find the process id
(e.g. "ps | fgrep sjm") and then run "kill -INT process_id".