Permalink
Switch branches/tags
start slurm-17-11-0-0pre2 slurm-17-11-0-0pre1 slurm-17-02-7-1 slurm-17-02-6-1 slurm-17-02-5-1 slurm-17-02-4-1 slurm-17-02-3-1 slurm-17-02-2-1 slurm-17-02-1-2 slurm-17-02-1-1 slurm-17-02-0-1 slurm-17-02-0-0rc1 slurm-17-02-0-0pre4 slurm-17-02-0-0pre3 slurm-17-02-0-0pre2 slurm-17-02-0-0pre1 slurm-16-05-10-2 slurm-16-05-10-1 slurm-16-05-9-1 slurm-16-05-8-1 slurm-16-05-7-1 slurm-16-05-6-1 slurm-16-05-5-1 slurm-16-05-4-1 slurm-16-05-3-1 slurm-16-05-2-1 slurm-16-05-1-1 slurm-16-05-0-1 slurm-16-05-0-0rc2 slurm-16-05-0-0rc1 slurm-16-05-0-0pre2 slurm-16-05-0-0pre1 slurm-15-08-13-1 slurm-15-08-12-1 slurm-15-08-11-1 slurm-15-08-10-1 slurm-15-08-9-1 slurm-15-08-8-1 slurm-15-08-7-1 slurm-15-08-6-1 slurm-15-08-5-1 slurm-15-08-4-1 slurm-15-08-3-1 slurm-15-08-2-1 slurm-15-08-1-1 slurm-15-08-0-1 slurm-15-08-0-0rc1 slurm-15-08-0-0pre6 slurm-15-08-0-0pre5 slurm-15-08-0-0pre4 slurm-15-08-0-0pre3 slurm-15-08-0-0pre2 slurm-15-08-0-0pre1 slurm-14-11-11-1 slurm-14-11-10-1 slurm-14-11-9-1 slurm-14-11-8-1 slurm-14-11-7-1 slurm-14-11-6-1 slurm-14-11-5-1 slurm-14-11-4-1 slurm-14-11-3-1 slurm-14-11-2-1 slurm-14-11-1-1 slurm-14-11-0-1 slurm-14-11-0-0rc3 slurm-14-11-0-0rc2 slurm-14-11-0-0rc1 slurm-14-11-0-0pre5 slurm-14-11-0-0pre4 slurm-14-11-0-0pre3 slurm-14-11-0-0pre2 slurm-14-11-0-0pre1 slurm-14-03-11-1 slurm-14-03-10-1 slurm-14-03-9-1 slurm-14-03-8-1 slurm-14-03-7-1 slurm-14-03-6-1 slurm-14-03-5-1 slurm-14-03-4-2 slurm-14-03-4-1 slurm-14-03-3-2 slurm-14-03-3-1 slurm-14-03-2-1 slurm-14-03-1-2 slurm-14-03-1-1 slurm-14-03-0-1 slurm-14-03-0-0rc1 slurm-14-03-0-0pre6 slurm-14-03-0-0pre5 slurm-13-12-0-0pre4 slurm-13-12-0-0pre3 slurm-13-12-0-0pre2 slurm-13-12-0-0pre1 slurm-2-6-9-1 slurm-2-6-8-1 slurm-2-6-7-1 slurm-2-6-6-2
Nothing to show
Find file
8528 lines (8320 sloc) 482 KB
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 18.08.0pre1
==============================
* Changes in Slurm 17.11.0pre3
==============================
-- Added the following jobcomp/script environment variables: CLUSTER,
DEPENDENCY, DERIVED_EC, EXITCODE, GROUPNAME, QOS, RESERVATION, USERNAME.
The format of LIMIT (job time limit) has been modified to D-HH:MM:SS.
-- Fix QOS usage factor applying to individual TRES run minute usage.
-- Print numbers using exponential format if required to fit in allocated
field width. The sacctmgr and sshare commands are impacted.
-- Make it so a backup DBD doesn't attempt to create database tables and
relies on the primary to do so.
-- By default have Slurm dynamically link to libslurm.so instead of static
linking. If static linking is desired configure with
--without-shared-libslurm.
-- Change --workdir in sbatch to be --chdir as in all other commands (salloc,
srun).
-- Add WorkDir to the job record in the database.
-- Make the UsageFactor of a QOS work when a qos has the nodecay flag.
-- Add MaxQueryTimeRange option to slurmdbd.conf to limit accounting query
ranges when fetching job records.
-- Add LaunchParameters=batch_step_set_cpu_freq to allow the setting of the cpu
frequency on the batch step.
-- CRAY - Fix statically linked applications to CRAY's PMI.
-- Fix - Raise an error back to the user when trying to update currently
unsupported core-based reservations.
-- Do not print TmpDisk space as part of 'slurmd -C' line.
-- Fix to test MaxMemPerCPU/Node partition limits when scheduling, previously
only checked on submit.
-- Work for heterogeneous job support (complete solution in v17.11):
* Set SLURM_PROCID environment variable to reflect global task rank (needed
by MPI).
* Set SLURM_NTASKS environment variable to reflect global task count (needed
by MPI).
* In srun, if only some steps are allocated and one step allocation fails,
then delete all allocated steps.
* Get SPANK plungins working with heterogeneous jobs. The
spank_init_post_opt() function is executed once per job component.
* Modify sbcast command and srun's --bcast option to support heterogeneous
jobs.
* Set more environment variables for MPI: SLURM_GTIDS and SLURM_NODEID.
* Prevent a heterogeneous job allocation from including the same nodes in
multiple components (required by MPI jobs spanning components).
* Modify step create logic so that call components of a heterogeneous job
launched by a single srun command have the same step ID value.
-- Modify output of "--mpi=list" to avoid duplicates for version numbers in
mpi/pmix plugin names.
-- Allow nodes to be rebooted while in a maintenance reservation.
-- Show nodes as down even when nodes are in a maintenance reservation.
-- Harden the slurmctld HA stack to mitigate certain split-brain issues.
-- Work for heterogeneous job support (complete solution in v17.11):
* Add burst buffer support.
* Remove srun's --mpi-combine option (always combined).
* Add SchedulerParameters configuration option "enable_hetero_steps" to
enable job steps that span multiple components of a heterogeneous job.
Disabled by default as most MPI implementations and Slurm configurations
are not currently supported. Limitation to be removed in Slurm version
18.08.
* Synchronize application launch across multiple components with debugger.
* Modify slurm_kill_job_step() to cancel all components of a heterogeneous
job step (used by MPI).
* Set SLURM_JOB_NUM_NODES environment variable as needed by MVAPICH.
* Base time limit upon the time that the latest job component is available
(after all nodes in all components booted and ready for use).
-- Add cluster name to smail tool email header.
-- Speedup arbitrary distribution algorithm.
-- Modify "srun --mpi=list" output to match valid option input by removing the
"mpi/" prefix on each line of output.
-- Automatically set the reservation's partition for the job if not the
cluster default.
-- mpi/pmi2 plugin - vestigial pointer could be referenced at shutdown with
invalid memory reference resulting.
-- Fix to _is_gres_cnt_zero() return false for improper input string
-- Cleanup all pthread_create calls and replace with new slurm_thread_create
macro.
-- Removed obsolete MPI plugins. Remaining options are openmpi, pmi2, pmix.
-- Removed obsolete checkpoint/poe plugin.
-- Process spank environment variable options before processing spank command
line options. Spank plugins should be able to handle option callbacks being
called multiple times.
-- Add support for specialized cores with task/affinity plugin (previously
only supported with task/cgroup plugin).
-- Add "TaskPluginParam=SlurmdOffSpec" option that will prevent the Slurm
compute node daemons (slurmd and slurmstepd) from executing on specialized
cores.
-- CRAY - Make native mode default, use --disable-native-cray to use ALPS
instead of native Slurm.
-- Add ability to prevent suspension of some count of nodes in a specified
range using the SuspendExcNodes configuration parameter.
-- Add SLURM_WCKEY to PrologSlurmctld and EpilogSlurmctld environment.
-- Return user response string in response to successful job allocation request
not only on failure. Set in LUA using function 'slurm.user_msg("STRING")'.
-- Add 'scontrol write batch_script <jobid>' command to retrieve the batch
script for a given job.
-- Remove option to display the batch script as part of 'scontrol show job'.
-- On native Cray system the configured RebootProgram is executed on on the
head node by the slurmctld daemon rather than by the slurmd daemons on the
compute nodes. The "capmc_resume" program from "contribs/cray" can be used.
-- Modify "scontrol top" command to accept a comma separated list of job IDs
as an argument rather than a single job ID.
-- Add MemorySwappiness value to cgroup.conf.
-- Add new "billing" TRES which allows jobs to be limited based on the job's
billable TRES calculated by the job's partition's TRESBillingWeights.
-- sbatch - force line-buffered output so 'sbatch -W' returns the jobid
over a piped output immediately.
-- Regular user use of "scontrol top" command is now diabled. Use the
configuration parameter "SchedulerParameters=enable_user_top" to enable
that functionality. The configuration parameter
"SchedulerParameters=disable_user_top" will be silently ignored.
-- Add -TALL to sreport.
-- Removed unused SlurmdPlugstack option and associated framework.
-- Correct logic for line continuation in srun --multi-prog file.
-- Add DBD Agent queue size to sdiag output.
-- Add running job count to sdiag output.
-- Print unix timestamps next to ASCII timestamps in sdiag output.
-- In a job allocation spanning KNL and non-KNL nodes and requiring a reboot,
do not attempt to set default NUMA or MCDRAM modes on non-KNL nodes.
-- Change default to let pending jobs run outside of reservation after
reservation is gone to put jobs in held state. Added NO_HOLD_JOBS_AFTER_END
reservation flag to use old default.
-- When creating a reservation, validate the CoreCnt specification matches
the number of nodes listed.
-- When creating a reservation, correct logic to ignoring job allocations on
request.
* Changes in Slurm 17.11.0pre2
==============================
-- Initial work for heterogeneous job support (complete solution in v17.11):
* Modified salloc, sbatch and srun commands to parse command line, job
script and environment variables to recognize requests for heterogeneous
jobs. Same commands also modified to set environment variables describing
each component of the heterogeneous job.
* Modified job allocate, batch job submit and job "will-run" requests to
pass a list of job specifications and get a list of responses.
* Modify slurmctld daemon to process a heterogeneous job request and create
multiple job records as needed.
* Added new fields to job record: pack_job_id, pack_job_offset and
pack_job_set (set of job IDs). Added to slurmctld state save/restore
logic and job information reported.
* Display new job fields in "scontrol show job" output.
* Modify squeue command to display heterogeneous job records using "#+#"
format. The squeue --job=# output lists all components of a heterogeneous
job.
* Modify scancel logic to cancel all components of a heterogeneous job with
a single request/RPC.
* Configuration parameter DebugFlags value of "HeteroJobs" added.
* Job requeue and suspend/resume modified to operate on all components of
a heterogeneous job with a single request/RPC.
* New web page added to describe heterogeneous jobs.
* Descriptions of new API added to man pages.
* Modified email notifications to only operate on the first job component.
* Purge heterogeneous job records at the same time and not by individual
components.
* Modified logic for heterogeneous jobs submitted to multiple clusters
("--clusters=...") so the job will be routed to the cluster that is
expected to start all components earliest.
* Modified srun to create multiple job steps for heterogeneous job
allocations.
* Modified launch plugin to accept a pointer to job step options structure
rather than work from a single/common data structure.
-- Improve backfill scheduling algorithm with respect to starting jobs as soon
as possible while avoiding advanced reservations.
-- Add URG as an option to 'scancel --signal'.
-- Check if the buffer returned from slurm_persist_msg_pack() isn't NULL.
-- Modify all daemons to re-open log files on receipt of SIGUSR2 signal. This
is much than using SIGHUP to re-read the configuration file and rebuild
various tables.
-- Add PrivateData=events configuration parameter
-- Work for heterogeneous job support (complete solution in v17.11):
* Add pointer to job option structure to job_step_create_allocation()
function used by srun.
* Parallelize task launch for heterogeneous job allocations (initial work).
* Make packjobid, packjoboffset, and packjobidset fields available in squeue
output.
* Modify smap command to display heterogeneous job records using "#+#"
format.
* Add srun --pack-group and --mpi-combine options to control job step
launch behaviour (not fully implemented).
* Add pack job component ID to srun --label output (e.g. "P0 1:" for
job component 0 and task 1).
* jobcomp/elasticsearch: Add pack_job_id and pack_job_offset fields.
* sview: Modified to display pack job information.
* Major re-write of task state container logic to support for list of
containers rather than one container per srun command.
* Add some regression tests.
* Add srun pack job environment variables when performing job allocation.
-- Set Reason=dependency over Reason=JobArrayTaskLimit for pending jobs.
-- Add slurm.conf configuration parameters SlurmctldSyslogDebug and
SlurmdSyslogDebug to control which messages from the slurmctld and slurmd
daemons get written to syslog.
-- Add slurmdbd.conf configuration parameter DebugLevelSyslog to control which
messages from the slurmdbd daemon get written to syslog.
-- Fix handling of GroupUpdateForce option.
-- Work for heterogeneous job support (complete solution in v17.11):
* Add support to sched/backfill for concurrent allocation of all pack job
components including support of --time-min option.
* Defer initiation of a heterogeneous job until a components can be started
at the same time, taking into consideration association and QOS limits
for the job as a whole.
* Perform limit check on heterogeneous job as a whole at submit time to
reject jobs that will never be able to run.
* Add pack_job_id and pack_job_offset to accounting database.
* Modified sacct to accept pack job ID specification using "#+#" notation.
* Modified sstat to accept pack job ID specification using "#+#" notation.
-- Clear a job's "wait reason" value of BeginTime" after that time has passed.
Previously a readon of "BeginTime" could be reported long after the job's
requested begin time had passed.
-- Split group_info in slurm_ctl_conf_t into group_force and group_time.
-- Work for heterogeneous job support (complete solution in v17.11):
* Fix I/O race condition on step termination for srun launching multiple
pack job groups.
* If prolog is running when attempting to signal a step, then return EAGAIN
and retry rather than simply returning SLURM_ERROR and aborting.
* Modify launch/slurm plugin to signal all components of a pack job rather
than just the one (modify to use a list of step context records).
* Add logic to support srun --mpi-combine option.
* Set up debugger data structures.
* Disable cancellation of individual component while the job is pending.
* Modify scontrol job hold/release and update to operate with heterogeneous
job id specification (e.g. "scontrol hold 123+4").
* If srun lacks application specification for some component, the next one
specified will be used for earlier components.
* Changes in Slurm 17.11.0pre1
==============================
-- Interpet all format options in output/error file to log prolog errors. Prior
logic only supported "%j" (job ID) option.
-- Add the configure option --with-shared-libslurm which will link to
libslurm.so instead of libslurm.o thus reducing the footprint of all the
binaries.
-- In switch plugin, added plugin_id symbol to plugins and wrapped
switch_jobinfo_t with dynamic_plugin_data_t in interface calls in
order to pass switch information between clusters with different switch
types.
-- Switch naming of acct_gather_infiniband to acct_gather_interconnect
-- Make it so you can "stack" the interconnect plugins.
-- Add a last_sched_eval timestamp to record when a job was last evaluated
by the main scheduler or backfill.
-- Add scancel "--hurry" option to avoid staging out any burst buffer data.
-- Simplify the sched plugin interface.
-- Add new advanced reservation flags of "weekday" (repeat on each weekday;
Monday through Friday) and "weekend" (repeat on each weekend day; Saturday
and Sunday).
-- Add new advanced reservation flag of "flex", which permits jobs requesting
the reservation to begin prior to the reservation's start time and use
resources inside or outside of the reservation. A typical use case is to
prevent jobs not explicitly requesting the reservation from using those
reserved resources rather than forcing jobs requesting the reservation to
use those resources in the time frame reserved.
-- Add NoDecay flag to QOS.
-- Node "OS" field expanded from "sysname" to "sysname release version" (e.g.
change from "Linux" to
"Linux 4.8.0-28-generic #28-Ubuntu SMP Sat Feb 8 09:15:00 UTC 2017").
-- jobcomp/elasticsearch - Add "job_name" and "wc_key" fields to stored
information.
-- jobcomp/filetxt - Add ArrayJobId, ArrayTaskId, ReservationName, Gres,
Account, QOS, WcKey, Cluster, SubmitTime, EligibleTime, DerivedExitCode and
ExitCode.
-- scontrol modified to report core IDs for reservation containing individual
cores.
-- MYSQL - Get rid of table join during rollup which speeds up the process
dramatically on large job/step tables.
-- Add ability to define features on clusters for directing federated jobs to
different clusters.
-- Add new RPC to process multiple federation RPCs in a single communication.
-- Modify slurm_load_jobs() function to load job information from all clusters
in a federation.
-- Add squeue --local and --sibling options to modify filtering of jobs on
federated clusters.
-- Add SchedulerParameters option of bf_max_job_user_part to specifiy the
maximum number of jobs per user for any single partition. This differs from
bf_max_job_user in that a separate counter is applied to each partition
rather than having a single counter per user applied to all partitions.
-- Modify backfill logic so that bf_max_job_user, bf_max_job_part and
bf_max_job_user_part options can all be used independently of each other.
-- Add sprio -p/--partition option to filter jobs by partition name.
-- Add partition name to job priority factor response message.
-- Add sprio --local and --sibling options for use in federation of clusters.
-- Add sprio "%c" format to print cluster name in federation mode.
-- Modify sinfo logic to provided unified view of all nodes and partitions
in a federation, add --local option to only report local state information
even in a cluster, print cluster name with "%V" format option, and
optionally sort by cluster name.
-- If a task in a parallel job fails and it was launched with the
--kill-on-bad-exit option then terminate the remaining tasks using the
SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL.
-- Include submit_time when doing the sort for job scheduling.
-- Modify sacct to report all jobs in federation by default. Also add --local
option.
-- Modify sacct to accept "--cluster all" option (in addition to the old
"--cluster -1", which is still accepted).
-- Modify sreport to report all jobs in federation by default. Also add --local
option.
-- sched/backfill: Improve assoc_limit_stop configuration parameter support.
-- KNL features: Always keep active and available features in the same order:
first site-specific features, next MCDRAM modes, last NUMA modes.
-- Changed default ProctrackType to cgroup.
-- Add "cluster_name" field to node_info_t and partition_info_t data structure.
It is filled in only when the cluster is part of a federation and
SHOW_FEDERATION flag used.
-- Functions slurm_load_node() slurm_load_partitions() modified to show all
nodes/partitions in a federation when the SHOW_FEDERATION flag is used.
-- Add federated views to sview.
-- Add --federation option to sacct, scontrol, sinfo, sprio, squeue, sreport to
show a federated view. Will show local view by default.
-- Add FederationParameters=fed_display slurm.conf option to configure status
commands to display a federated view by default if the cluster is a member
of a federation.
-- Log the down nodes whenever slurmctld restarts.
-- Report that "CPUs" plus "Boards" in node configuration invalid only if the
CPUs value is not equal to the total thread count.
-- Extend the output of the seff utility to also include the job's wall-clock
time.
-- Add bf_max_time to SchedulerParameters.
-- Add bf_max_job_assoc to SchedulerParameters.
-- Add new SchedulerParameters option bf_window_linear to control the rate at
which the backfill test window expands. This can be used on a system with
a modest number of running jobs (hundreds of jobs) to help prevent expected
start times of pending jobs to get pushed forward in time. On systems with
large numbers of running jobs, performance of the backfill scheduler will
suffer and fewer jobs will be evaluated.
-- Improve scheduling logic with respect to license use and node reboots.
-- CRAY - Alter algorithm to come up with the SLURM_ID_HASH.
-- Implement federated scheduling and federated status outputs.
-- The '-q' option to srun has changed from being the short form of
'--quit-on-interrupt' to '--qos'.
-- Change sched_min_interval default from 0 to 2 microseconds.
* Changes in Slurm 17.02.8
==========================
-- Add 'slurmdbd:' to the accounting plugin to notify message is from dbd
instead of local.
-- mpi/mvapich - Buffer being only partially cleared. No failures observed.
-- Fix for job --switch option on dragonfly network.
-- In salloc with --uid option, drop supplementary groups before changing UID.
-- jobcomp/elasticsearch - strip any trailing slashes from JobCompLoc.
-- jobcomp/elasticsearch - fix memory leak when transferring generated buffer.
-- Prevent slurmstepd ABRT when parsing gres.conf CPUs.
-- Fix sbatch --signal to signal all MPI ranks in a step instead of just those
on node 0.
-- Check multiple partition limits when scheduling a job that were previously
only checked on submit.
-- Cray: Avoid running application/step Node Health Check on the external
job step.
-- Optimization enhancements for partition based job preemption.
-- Address some build warnings from GCC 7.1, and one possible memory leak if
/proc is inaccessible.
-- If creating/altering a core based reservation with scontrol/sview on a
remote cluster correctly determine the select type.
-- Fix autoconf test for libcurl when clang is used.
-- Fix default location for cgroup_allowed_devices_file.conf to use correct
default path.
-- Document NewName option to sacctmgr.
-- Reject a second PMI2_Init call within a single step to prevent slurmstepd
from hanging.
-- Handle old 32bit values stored in the database for requested memory
correctly in sacct.
-- Fix memory leaks in the task/cgroup plugin when constraining devices.
-- Make extremely verbose info messages debug2 messages in the task/cgroup
plugin when constraining devices.
-- Fix issue that would deny the stepd access to /dev/null where GRES has a
'type' but no file defined.
-- Fix issue where the slurmstepd would fatal on job launch if you have no
gres listed in your slurm.conf but some in gres.conf.
-- Fix validating time spec to correctly validate various time formats.
-- Make scontrol work correctly with job update timelimit [+|-]=.
-- Reduce the visibily of a number of warnings in _part_access_check.
-- Prevent segfault in sacctmgr if no association name is specified for
an update command.
-- burst_buffer/cray plugin modified to work with changes in Cray UP06
software release.
-- Fix job reasons for jobs that are violating assoc MaxTRESPerNode limits.
-- Fix segfault when unpacking a 16.05 slurm_cred in a 17.02 daemon.
-- Fix setting TRES limits with case insensitive TRES names.
-- Add alias for xstrncmp() -- slurm_xstrncmp().
-- Fix sorting of case insensitive strings when using xstrcasecmp().
-- Gracefully handle race condition when reading /proc as process exits.
-- Avoid error on Cray duplicate setup of core specialization.
-- Skip over undefined (hidden in Slurm) nodes in pbsnodes.
-- Add empty hashes in perl api's slurm_load_node() for hidden nodes.
-- CRAY - Add rpath logic to work for the alpscomm libs.
-- Fixes for administrator extended TimeLimit (job reason & time limit reset).
-- Fix gres selection on systems running select/linear.
-- sview: Added window decorator for maximize,minimize,close buttons for all
systems.
-- squeue: interpret negative length format specifiers as a request to
delimit values with spaces.
* Changes in Slurm 17.02.7
==========================
-- Fix deadlock if requesting to create more than 10000 reservations.
-- Fix potential memory leak when creating partition name.
-- Execute the HealthCheckProgram once when the slurmd daemon starts rather
than executing repeatedly until an exit code of 0 is returned.
-- Set job/step start and end times to 0 when using --truncate and start > end.
-- Make srun --pty option ignore EINTR allowing windows to resize.
-- When resuming node only send one message to the slurmdbd.
-- Modify srun --pty option to use configured SrunPortRange range.
-- Fix issue with whole gres not being printed out with Slurm tools.
-- Fix issue with multiple jobs from an array are prevented from starting.
-- Fix for possible slurmctld abort with use of salloc/sbatch/srun
--gres-flags=enforce-binding option.
-- Fix race condition when using jobacct_gather/cgroup where the memory of the
step wasn't always gathered correctly.
-- Better debug when slurmdbd queue is filling up in the slurmctld.
-- Fixed truncation on scontrol show config output.
-- Serialize updates from from the dbd to the slurmctld.
-- Fix memory leak in slurmctld when agent queue to the DBD has filled up.
-- CRAY - Throttle step creation if trying to create too many steps at once.
-- If failing after switch_g_job_init happened make sure switch_g_job_fini is
called.
-- Fix minor memory leak if launch fails in the slurmstepd.
-- Fix issue where UnkillableStepProgram if step was in an ending state.
-- Fix bug when tracking multiple simultaneous spawned ping cycles.
-- jobcomp/elasticsearch plugin now saves state of pending requests on
slurmctld daemon shutdown so then can be recovered on restart.
-- Fix issue when an alternate munge key when communicating on a persistent
connection.
-- Document inconsistent behavior of GroupUpdateForce option.
-- Fix bug in selection of GRES bound to specific CPUs where the GRES count
is 2 or more. Previous logic could allocate CPUs not available to the job.
-- Increase buffer to handle long /proc/<pid>/stat output so that Slurm can
read correct RSS value and take action on jobs using more memory than
requested.
-- Fix srun job jobs that can run immediately to run in the highest priority
partion when multiple partitions are listed. scontrol show jobs can
potentially show the partition list in priority order.
-- Fix starting controller if StateSaveLocation path didn't exist.
-- Fix inherited association 'max' TRES limits combining multiple limits in
the tree.
-- Sort TRES id's on limits when getting them from the database.
-- Fix issue with pmi[2|x] when TreeWidth=1.
-- Correct buffer size used in determining specialized cores to avoid possible
truncation of core specification and not reserving the specified cores.
-- Close race condition on Slurm structures when setting DebugFlags.
-- Make it so the cray/switch plugin grabs new DebugFlags on a reconfigure.
-- Fix incorrect lock levels when creating or updating a reservation.
-- Fix overlapping reservation resize.
-- Add logic to help support Dell KNL systems where syscfg is different than
the normal Intel syscfg.
-- CRAY - Fix BB to handle type= correctly, regression in 17.02.6.
* Changes in Slurm 17.02.6
==========================
-- Fix configurator.easy.html to output the SelectTypeParameters line.
-- If a job requests a specific memory requirement then gets something else
from the slurmctld make sure the step allocation is made aware of it.
-- Fix missing initialization in slurmd.
-- Fix potential degradation when running HTC (> 100 jobs a sec) like
workflows through the slurmd.
-- Fix race condition which could leave a stepd hung on shutdown.
-- CRAY - Add configuration for ATP to the ansible play script.
-- Fix potential to corrupt DBD message.
-- burst_buffer logic modified to support sizes in both SI and EIC size units
(e.g. M/MiB for powers of 1024, MB for powers of 1000).
* Changes in Slurm 17.02.5
==========================
-- Prevent segfault if a job was blocked from running by a QOS that is then
deleted.
-- Improve selection of jobs to preempt when there are multiple partitions
with jobs subject to preemption.
-- Only set kmem limit when ConstrainKmemSpace=yes is set in cgroup.conf.
-- Fix bug in task/affinity that could result in slurmd fatal error.
-- Increase number of jobs that are tracked in the slurmd as finishing at one
time.
-- Note when a job finishes in the slurmd to avoid a race when launching a
batch job takes longer than it takes to finish.
-- Improve slurmd startup on large systems (> 10000 nodes)
-- Add LaunchParameters option of cray_net_exclusive to control whether all
jobs on the cluster have exclusive access to their assigned nodes.
-- Make sure srun inside an allocation gets --ntasks-per-[core|socket]
set correctly.
-- Only make the extern step at job creation.
-- Fix for job step task layout with --cpus-per-task option.
-- Fix --ntasks-per-core option/environment variable parsing to set
the requested value, instead of always setting one (srun).
-- Correct error message when ClusterName in configuration files does not match
the name in the slurmctld daemon's state save file.
-- Better checking when a job is finishing to avoid underflow on job's
submitted to a QOS/association.
-- Handle partition QOS submit limits correctly when a job is submitted to
more than 1 partition or when the partition is changed with scontrol.
-- Performance boost for when Slurm is dealing with credentials.
-- Fix race condition which could leave a stepd hung on shutdown.
-- Add lua support for opensuse.
* Changes in Slurm 17.02.4
==========================
-- Do not attempt to schedule jobs after changing the power cap if there are
already many active threads.
-- Job expansion example in FAQ enhanced to demonstrate operation in
heterogeneous environments.
-- Prevent scontrol crash when operating on array and no-array jobs at once.
-- knl_cray plugin: Log incomplete capmc output for a node.
-- knl_cray plugin: Change capmc parsing of mcdram_pct from string to number.
-- Remove log files from test20.12.
-- When rebooting a node and using the PrologFlags=alloc make sure the
prolog is ran after the reboot.
-- node_features/knl_generic - If a node is rebooted for a pending job, but
fails to enter the desired NUMA and/or MCDRAM mode then drain the node and
requeue the job.
-- node_features/knl_generic disable mode change unless RebootProgram
configured.
-- Add new burst_buffer function bb_g_job_revoke_alloc() to be executed
if there was a failure after the initial resource allocation. Does not
release previously allocated resources.
-- Test if the node_bitmap on a job is NULL when testing if the job's nodes
are ready. This will be NULL is a job was revoked while beginning.
-- Fix incorrect lock levels when testing when job will run or updating a job.
-- Add missing locks to job_submit/pbs plugin when updating a jobs
dependencies.
-- Add support for lua5.3
-- Add min_memory_per_node|cpu to the job_submit/lua plugin to deal with lua
not being able to deal with pn_min_memory being a uint64_t. Scripts are
urged to change to these new variables avoid issue. If not set the
variables will be 'nil'.
-- Calculate priority correctly when 'nice' is given.
-- Fix minor typos in the documentation.
-- node_features/knl_cray: Preserve non-KNL active features if slurmctld
reconfigured while node boot in progress.
-- node_features/knl_generic: Do not repeatedly log errors when trying to read
KNL modes if not KNL system.
-- Add missing QOS read lock to backfill scheduler.
-- When doing a dlopen on liblua only attempt the version compiled against.
-- Fix null-dereference in sreport cluster ulitization when configured with
memory-leak-debug.
-- Fix Partition info in 'scontrol show node'. Previously duplicate partition
names, or Partitions the node did not belong to could be displayed.
-- Fix it so the backup slurmdbd will take control correctly.
-- Fix unsafe use of MAX() macro, which could result in problems cleaning up
accounting plugins in slurmd, or repeat job cancellation attempts in
scancel.
-- Fix 'scontrol update reservation duration=unlimited' to set the duration
to 365-days (as is done elsewhere), rather than 49710 days.
-- Check if variable given to scontrol show job is a valid jobid.
-- Fix WithSubAccounts option to not include WithDeleted unless requested.
-- Prevent a job tested on multiple partitions from being marked
WHOLE_NODE_USER.
-- Prevent a race between completing jobs on a user-exclusive node from
leaving the node owned.
-- When scheduling take the nodes in completing jobs out of the mix to reduce
fragmentation. SchedulerParameters=reduce_completing_frag
-- For jobs submited to multiple partitions, report the job's earliest start
time for any partition.
-- Backfill partitions that use QOS Grp limits to "float" better.
-- node_features/knl_cray: don't clear configured GRES from non-KNL node.
-- sacctmgr - prevent segfault in command when a request is denied due
to a insufficient priviledges.
-- Add warning about libcurl-devel not being installed during configure.
-- Streamline job purge by handling file deletion on a separate thread.
-- Always set RLIMIT_CORE to the maximum permitted for slurmd, to ensure
core files are created even on non-developer builds.
-- Fix --ntasks-per-core option/environment variable parsing to set
the requested value, instead of always setting one.
-- If trying to cancel a step that hasn't started yet for some reason return
a good return code.
-- Fix issue with sacctmgr show where user=''
* Changes in Slurm 17.02.3
==========================
-- Increase --cpu_bind and --mem_bind field length limits.
-- Fix segfault when using AdminComment field with job arrays.
-- Clear Dependency field when all dependencies are satisfied.
-- Add --array-unique to squeue which will display one unique pending job
array element per line.
-- Reset backfill timers correctly without skipping over them in certain
circumstances.
-- When running the "scontrol top" command, make sure that all of the user's
jobs have a priority that is lower than the selected job. Previous logic
would permit other jobs with equal priority (no jobs with higher priority).
-- Fix perl api so we always get an allocation when calling Slurm::new().
-- Fix issue with cleaning up cpuset and devices cgroups when multiple steps
end at the same time.
-- Document that PriorityFlags option of DEPTH_OBLIVIOUS precludes the use of
FAIR_TREE.
-- Fix issue if an invalid message came in a Slurm daemon/command may abort.
-- Make it impossible to use CR_CPU* along with CR_ONE_TASK_PER_CORE. The
options are mutually exclusive.
-- ALPS - Fix scheduling when ALPS doesn't agree with Slurm on what nodes
are free.
-- When removing a partition make sure it isn't part of a reservation.
-- Fix seg fault if loading attempting to load non-existent burstbuffer plugin.
-- Fix to backfill scheduling with respect to QOS and association limits. Jobs
submitted to multiple partitions are most likley to be effected.
-- sched/backfill: Improve assoc_limit_stop configuration parameter support.
-- CRAY - Add ansible play and README.
-- sched/backfill: Fix bug related to advanced reservations and the need to
reboot nodes to change KNL mode.
-- Preempt plugins - fix check for 'preempt_youngest_first' option.
-- Preempt plugins - fix incorrect casts in preempt_youngest_first mode.
-- Preempt/job_prio - fix incorrect casts in sort function.
-- Fix to make task/affinity work with ldoms where there are more than 64
cpus on the node.
-- When using node_features/knl_generic make it so the slurmd doesn't segfault
when shutting down.
-- Fix potential double-xfree() when using job arrays that can lead to
slurmctld crashing.
-- Fix priority/multifactor priorities on a slurmctld restart if not using
accounting_storage/[mysql|slurmdbd].
-- Fix NULL dereference reported by CLANG.
-- Update proctrack documentation to strongly encourage use of
proctrack/cgroup.
-- Fix potential memory leak if job fails to begin after nodes have been
selected for a job.
-- Handle a job that made it out of the select plugin without a job_resrcs
pointer.
-- Fix potential race condition when persistent connections are being closed at
shutdown.
-- Fix incorrect locks levels when submitting a batch job or updating a job
in general.
-- CRAY - Move delay waiting for job cleanup to after we check once.
-- MYSQL - Fix memory leak when loading archived jobs into the database.
-- Fix potential race condition when starting the priority/multifactor plugin's
decay thread.
-- Sanity check to make sure we have started a job in acct_policy.c before we
clear it as started.
-- Allow reboot program to use arguments.
-- Message Aggr - Remove race condition on slurmd shutdown with respects to
destroying a mutex.
-- Fix updating job priority on multiple partitions to be correct.
-- Don't remove admin comment when updating a job.
-- Return error when bad separator is given for scontrol update job licenses.
* Changes in Slurm 17.02.2
==========================
-- Update hyperlink to LBNL Node Health Check program.
-- burst_buffer/cray - Add support for line continuation.
-- If a job is cancelled by the user while it's allocated nodes are being
reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
and the node reconfiguration fails (i.e. the reboot fails), then don't
requeue the job but leave it in a cancelled state.
-- capmc_resume (Cray resume node script) - Do not disable changing a node's
active features if SyscfgPath is configured in the knl.conf file.
-- Improve the srun documentation for the --resv-ports option.
-- burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
allocation of "20,22" must be expressed as "20\n22".
-- Fix rare segfault when shutting down slurmctld and still sending data to
the database.
-- Fix gres output of a job if it is updated while pending to be displayed
correctly with Slurm tools.
-- Fix pam_slurm_adopt.
-- Fix missing unlock when job_list doesn't exist when starting priority/
multifactor.
-- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was
in the middle of setting db_indexes.
-- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated
because the dbd is setting a db_index.
-- Fix possible double insertion into database when a job is updated at the
moment the dbd is assigning a db_index.
-- Fix memory error when updating a job's licenses.
-- Fix seff to work correctly with non-standard perl installs.
-- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work.
-- Fix sacct from returning way more than requested when querying against a job
array task id.
-- Fix double read lock of tres when updating gres or licenses on a job.
-- Make sure locks are always in place when calling
assoc_mgr_make_tres_str_from_array.
-- Prevent slurmctld SEGV when creating reservation with duplicated name.
-- Consider QOS flags Partition[Min|Max]Nodes when doing backfill.
-- Fix slurmdbd_defs.c to not have half symbols go to libslurm.so and the
other half go to libslurmdb.so.
-- Fix 'scontrol show jobs' to remove an errant newline when 'Switches' is
printed.
-- Better code for handling memory required by a task on a heterogeneous
system.
-- Fix regression in 17.02.0 with respects to GrpTresMins on a QOS or
Association.
-- Cleanup to make make dist work.
-- Schedule interactive jobs quicker.
-- Perl API - correct value of MEM_PER_CPU constant to correctly handle
memory values.
-- Fix 'flags' variable to be 32 bit from the old 16 bit value in the perl api.
-- Export sched_nodes for a job in the perl api.
-- Improve error output when updating a reservation that has already started.
-- Fix --ntasks-per-node issue with srun so DenyOnLimit would work correctly.
-- node_features/knl_cray plugin - Fix memory leak.
-- Fix wrong cpu_per_task count issue on heterogeneous system when dealing with
steps.
-- Fix double free issue when removing usage from an association with sacctmgr.
-- Fix issue with SPANK plugins attempting to set null values as environment
variables, which leads to the command segfaulting on newer glibc versions.
-- Fix race condition on slurmctld startup when plugins have not gone through
init() ahead of the rpc_manager processing incoming messages.
-- job_submit/lua - expose admin_comment field.
-- Allow AdminComment field to be set by the job_submit plugin.
-- Allow AdminComment field to be changed by any Administrator.
-- Fix key words in jobcomp select.
-- MYSQL - Streamline job flush sql when doing a clean start on the slurmctld.
-- Fix potential infinite loop when talking to the DBD when shutting down
the slurmctld.
-- Fix MCS filter.
-- Make it so pmix can be included in the plugin rpm without having to
specify --with-pmix.
-- MYSQL - Fix initial load when not using he DBD.
-- Fix scontrol top to not make jobs priority 0 (held).
-- Downgrade info message about exceeding partition time limit to a debug2.
* Changes in Slurm 17.02.1-2
============================
-- Replace clock_gettime with time(NULL) for very old systems without the call.
* Changes in Slurm 17.02.1
==========================
-- Modify pam module to work when configured NodeName and NodeHostname differ.
-- Update to sbatch/srun man pages to explain the "filename pattern" clearer
-- Add %x to sbatch/srun filename pattern to represent the job name.
-- job_submit/lua - Add job "bitflags" field.
-- Update slurm.spec file to note obsolete RPMs.
-- Fix deadlock scenario when dumping configuration in the slurmctld.
-- Remove unneeded job lock when running assoc_mgr cache. This lock could
cause potential deadlock when/if TRES changed in the database and the
slurmctld wasn't made aware of the change. This would be very rare.
-- Fix missing locks in gres logic to avoid potential memory race.
-- If gres is NULL on a job don't try to process it when returning detailed
information about a job to scontrol.
-- Fix print of consumed energy in sstat when no energy is being collected.
-- Print formatted tres string when creating/updating a reservation.
-- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
-- Prevent manipulation of the cpu frequency and governor for batch or
extern steps. This addresses an issue where the batch step would
inadvertently set the cpu frequency maximum to the minimum value
supported on the node.
-- Convert a slurmctd power management data structure from array to list in
order to eliminate the possibility of zombie child suspend/resume
processes.
-- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
fails. Now job will be held. Update job update time when held.
-- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
-- Refactor slurmctld agent logic to eliminate some pthreads.
-- Added "SyscfgTimeout" parameter to knl.conf configuration file.
-- Fix for CPU binding for job steps run under a batch job.
* Changes in Slurm 17.02.0
==========================
-- job_submit/lua - Make "immediate" parameter available.
-- Fix srun I/O race condtion to eliminate a error message that might be
generated if the application exits with outstanding stdin.
-- Fix regression when purging/archiving jobs/events.
-- Add new job state JOB_OOM indicating Out Of Memory condition as detected
by task/cgroup plugin.
-- If QOS has been added to the system go refigure out Deny/AllowQOS on
partitions.
-- Deny job with duplicate GRES requested.
-- Fix loading super old assoc_mgr usage without segfaulting.
-- CRAY systems: Restore TaskPlugins order of task/cray before task/cgroup.
-- Task/cray: Treat missing "mems" cgroup with "debug" messages rather than
"error" messages. The file may be missing at step termination due to a
change in how cgroups are released at job/step end.
-- Fix for job constraint specification with counts, --ntasks-per-node value,
and no node count.
-- Fix ordering of step task allocation to fill in a socket before going into
another one.
-- Fix configure to not require C++
-- job_submit/lua - Remove access to slurmctld internal reservation fields of
job_pend_cnt and job_run_cnt.
-- Prevent job_time_limit enforcement from blocking other internal operations
if a large number of jobs need to be cancelled.
-- Add 'preempt_youngest_order' option to preempt/partition_prio plugin.
-- Fix controller being able to talk to a pre-released DBD.
-- Added ability to override the invoking uid for "scontrol update job"
by specifying "--uid=<uid>|-u <uid>".
-- Changed file broadcast "offset" from 32 to 64 bits in order to support files
over 2 GB.
-- slurm.spec - do not install init scripts alongside systemd service files.
* Changes in Slurm 17.02.0rc1
==============================
-- Add port info to 'sinfo' and 'scontrol show node'.
-- Fix errant definition of USE_64BIT_BITSTR which can lead to core dumps.
-- Move BatchScript to end of each job's information when using
"scontrol -dd show job" to make it more readable.
-- Add SchedulerParameters configuration parameter of "default_gbytes", which
treats numeric only (no suffix) value for memory and tmp disk space as being
in units of Gigabytes. Mostly for compatability with LSF.
-- Fix race condtion in srun/sattach logic which would prevent srun from
terminating.
-- Bitstring operations are now 64bit instead of 32bit.
-- Replace hweight() function in bitstring with faster version.
-- scancel would treat a non-numeric argument as the name of jobs to be
cancelled (a non-documented feature). Cancelling jobs by name now require
the "--jobname=" command line argument.
-- scancel modified to note that no jobs satisfy the filter options when the
--verbose option is used along with one or more job filters (e.g. "--qos=").
-- Change _pack_cred to use pack_bit_str_hex instead of pack_bit_fmt for
better scalability and performance.
-- Add BootTime configuration parameter to knl.conf file to optimize resource
allocations with respect to required node reboots.
-- Add node_features_p_boot_time() to node_features plugin to optimize
scheduling with respect to node reboots.
-- Avoid allocating resources to a job in the event that its run time plus boot
time (if needed) extent into an advanced reservation.
-- Burst_buffer/cray - Avoid stage-out operation if job never started.
-- node_features/knl_cray - Add capability to detected Uncorrectable Memory
Errors (UME) and if detected then log the event in all job and step stderr
with a message of the form:
error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
Similar logic added to node_features/knl_generic in version 17.02.0pre4.
-- If job is allocated nodes which are powered down, then reset job start time
when the nodes are ready and do not charge the job for power up time.
-- Add the ability to purge transactions from the database.
-- Add support for requeue'ing of federated jobs (BETA).
-- Add support for interactive federated jobs (BETA).
-- Add the ability to purge rolled up usage from the database.
-- Properly set SLURM_JOB_GPUS environment variable for Prolog.
* Changes in Slurm 17.02.0pre4
==============================
-- Add support for per-partitiion OverTimeLimit configuration.
-- Add --mem_bind option of "sort" to run zonesort on KNL nodes at step start.
-- Add LaunchParameters=mem_sort option to configure running of zonesort
by default at step startup.
-- Add "FreeSpace" information for each pool to the "scontrol show burstbuffer"
output. Required changes to the burst_buffer_info_t data structure.
-- Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
"scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,
which could lead to inconsistencies. Add "ASAP" option to "scontrol reboot"
command that will drain a node in order to reboot it as soon as possible,
then return it to service.
-- Allow unit conversion routine to convert 1024M to 1G.
-- switch/cray plugin - change legacy spool directory location.
-- Add new PriorityFlags option of INCR_ONLY, which prevents a job's priority
from being decremented.
-- Make it so we don't purge job start messages until after we purge step
messages. Hopefully this will reduce the number of messages lost when
filling up memory when the database/DBD is down.
-- Added SchedulingParameters option of "bf_job_part_count_reserve". Jobs below
the specified threshold will not have resources reserved for them.
-- If GRES are configured with file IDs, then "scontrol -d show node" will
not only identify the count of currently allocated GRES, but their specific
index numbers (e.g. "GresUsed=gpu:alpha:2(IDX:0,2),gpu:beta:0(IDX:N/A)").
Ditto for job information with "scontrol -d show job".
-- Add new mcs/account plugin.
-- Add "GresEnforceBind=Yes" to "scontrol show job" output if so configured.
-- Add support for SALLOC_CONSTRAINT, SBATCH_CONSTRAINT and SLURM_CONSTRAINT
environment variables to set default constraints for salloc, sbatch and
srun commands respectively.
-- Provide limited support for the MemSpecLimit configuration parameter without
the task/cgroup plugin.
-- node_features/knl_generic - Add capability to detected Uncorrectable Memory
Errors (UME) and if detected then log the event in all job and step stderr
with a message of the form:
error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
-- Add SLURM_JOB_GID to TaskProlog environment.
-- burst_buffer/cray - Remove leading zeros from node ID lists passed to
dw_wlm_cli program.
-- Add "Partitions" field to "scontrol show node" output.
-- Remove sched/wiki and sched/wiki2 plugins and associated code.
-- Remove SchedulerRootFilter option and slurm_get_root_filter() API call.
-- Add SchedulerParameters option of spec_cores_first to select specialized
cores from the lowest rather than highest number cores and sockets.
-- Add PrologFlags option of Serial to disable concurrent launch of
Prolog and Epilog scripts.
-- Fix security issue caused by insecure file path handling triggered by the
failure of a Prolog script. To exploit this a user needs to anticipate or
cause the Prolog to fail for their job. CVE-2016-10030.
* Changes in Slurm 17.02.0pre3
==============================
-- Add srun host & PID to job step data structures.
-- Avoid creating duplicate pending step records for the same srun command.
-- Rewrite srun's logic for pending steps for better efficiency (fewer RPCs).
-- Added new SchedulerParameters options step_retry_count and step_retry_time
to control scheduling behaviour of job steps waiting for resources.
-- Optimize resource allocation logic for --spread-job job option.
-- Modify cpu_bind and mem_bind map and mask options to accept a repetition
count to better support large task count. For example:
"mask_mem:0x0f*2,0xf0*2" is equivalent to "mask_mem:0x0f,0x0f,0xf0,0xf0".
-- Add support for --mem_bind=prefer option to prefer, but not restrict memory
use to the identified NUMA node.
-- Add mechanism to constrain kernel memory allocation using cgroups. New
cgroup.conf parameters added: ConstrainKmemSpace, MaxKmemPercent, and
MinKmemSpace.
-- Correct invokation of man2html, which previously could cause FreeBSD builds
to hang.
-- MYSQL - Unconditionally remove 'ignore' clause from 'alter ignore'.
-- Modify service files to not start Slurm daemons until after Munge has been
started.
NOTE: If you are not using Munge, but are using the "service" scripts to
start Slurm daemons, then you will need to remove this check from the
etc/slurm*service scripts.
-- Do not process SALLOC_HINT, SBATCH_HINT or SLURM_HINT environment variables
if any of the following salloc, sbatch or srun command line options are
specified: -B, --cpu_bind, --hint, --ntasks-per-core, or --threads-per-core.
-- burst_buffer/cray: Accept new jobs on backup slurmctld daemon without access
to dw_wlm_cli command. No burst buffer actions will take place.
-- Do not include SLURM_JOB_DERIVED_EC, SLURM_JOB_EXIT_CODE, or
SLURM_JOB_EXIT_CODE in PrologSlurmctld environment (not available yet).
-- Cray - set task plugin to fatal() if task/cgroup is not loaded after
task/cray in the TaskPlugin settings.
-- Remove separate slurm_blcr package. If Slurm is built with BLCR support,
the files will now be part of the main Slurm packages.
-- Replace sjstat, seff and sjobexit RPM packages with a single "contribs"
package.
-- Remove long since defunct slurmdb-direct scripts.
-- Add SbcastParameters configuration option to control default file
destination directory and compression algorithm.
-- Add new SchedulerParameter (max_array_tasks) to limit the maximum number of
tasks in a job array independently from the maximum task ID (MaxArraySize).
-- Fix issue where number of nodes is not properly allocated when sbatch and
salloc are requested with -n tasks < hosts from -w hostlist or from -N.
-- Add infrastructure for submitting federated jobs.
* Changes in Slurm 17.02.0pre2
==============================
-- Add new RPC (REQUEST_EVENT_LOG) so that slurmd and slurmstepd can log events
through the slurmctld daemon.
-- Remove sbatch --bb option. That option was never supported.
-- Automatically clean up task/cgroup cpuset and devices cgroups after steps
are completed.
-- Add federation read/write locks.
-- Limit job purge run time to 1 second at a time.
-- The database index for jobs is now 64 bits. If you happen to be close to
4 billion jobs in your database you will want to update your slurmctld at
the same time as your slurmdbd to prevent roll over of this variable as
it is 32 bit previous versions of Slurm.
-- Optionally lock slurmstepd in memory for performance reasons and to avoid
possible SIGBUS if the daemon is paged out at the time of a Slurm upgrade
(changing plugins). Controlled via new LaunchParameters options of
slurmstepd_memlock and slurmstepd_memlock_all.
-- Add event trigger on burst buffer errors (see strigger man page,
--burst_buffer option).
-- Add job AdminComment field which can only be set by a Slurm administrator.
-- Add salloc, sbatch and srun option of --delay-boot=<time>, which will
temporarily delay booting nodes into the desired state for a job in the
hope of using nodes already in the proper state which will be available at
a later time.
-- Add job burst_buffer_state and delay_boot fields to scontrol and squeue
output. Also add ability to modify delay_boot from scontrol.
-- Fix for node's available TRES array getting filled in with configured GRES
model types.
-- Log if job --bb option contains any unrecognized content.
-- Display configured and allocated TRES for nodes in scontrol show nodes.
-- Change all memory values (in MB) to uint64_t to accommodate > 2TB per node.
-- Add MailDomain configuration parameter to qualify email addresses.
-- Refactor the persistent connections within the federation code to use
the same logic that was found in the slurmdbd. Now both functionalities
share the same code.
-- Remove BlueGene/L and BlueGene/P support.
-- Add "flag" field to launch_tasks_request_msg. Remove the following fields
(moved into flags): multi_prog, task_flags, user_managed_io, pty,
buffered_stdio, and labelio.
-- Add protocol version to slurmd startup communications for slurmstepd to
permit changes in the protocol.
* Changes in Slurm 17.02.0pre1
==============================
-- burst_buffer/cray - Add support for rounding up the size of a buffer reqeust
if the DataWarp configuration "equalize_fragments" is used.
-- Remove AIX support.
-- Rename "in" to "input" in slurm_step_io_fds data structure defined in
slurm.h. This is needed to avoid breaking Python with by using one of its
keywords in a Slurm data structure.
-- Remove eligible_time from jobcomp/elasticsearch.
-- Enable the deletion of a QOS, even if no clusters have been added to the
database.
-- SlurmDBD - change all timestamps to bigint from int to solve Y2038 problem.
-- Add salloc/sbatch/srun --spread-job option to distribute tasks over as many
nodes as possible. This also treats the --ntasks-per-node option as a
maximum value.
-- Add ConstrainKmemSpace to cgroup.conf, defaulting to yes, to allow
cgroup Kmem enforcement to be disabled while still using ConstrainRAMSpace.
-- Add support for sbatch --bbf option to specify a burst buffer input file.
-- Added burst buffer support for job arrays. Add new SchedulerParameters
configuration parameter of bb_array_stage_cnt=# to indicate how many pending
tasks of a job array should be made available for burst buffer resource
allocation.
-- Fix small memory leak when a job fails to load from state save.
-- Fix invalid read when attempting to delete clusters from database with
running jobs.
-- Fix small memory leak when deleting clusters from database.
-- Add SLURM_ARRAY_TASK_COUNT environment variable. Total number of tasks in a
job array (e.g. "--array=2,4,8" will set SLURM_ARRAY_TASK_COUNT=3).
-- Add new sacctmgr commands: "shutdown" (shutdown the server), "list stats"
(get server statistics) "clear stats" (clear server statistics).
-- Restructure job accounting query to use 'id_job in (1, 2, .. )' format
instead of logically equivalent 'id_job = 1 || id_job = 2 || ..' .
-- Added start_delay field to jobcomp/elasticsearch.
-- In order to support federated jobs, the MaxJobID configuration parameter
default value has been reduced from 2,147,418,112 to 67,043,328 and its
maximum value is now 67,108,863. Upon upgrading, any pre-existing jobs that
have a job ID above the new range will continue to run and new jobs will get
job IDs in the new range.
-- Added infrastructure for setting up federations in database and establishing
connections between federation clusters.
* Changes in Slurm 16.05.11
===========================
-- burst_buffer/cray - Add support for line continuation.
-- If a job is cancelled by the user while it's allocated nodes are being
reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
and the node reconfiguration fails (i.e. the reboot fails), then don't
requeue the job but leave it in a cancelled state.
-- capmc_resume (Cray resume node script) - Do not disable changing a node's
active features if SyscfgPath is configured in the knl.conf file.
-- Fix memory error when updating a job's licenses.
-- Fix double read lock of tres when updating gres or licenses on a job.
-- Fix regression in 16.05.10 with respects to GrpTresMins on a QOS or
Association.
-- ALPS - Fix scheduling when ALPS doesn't agree with Slurm on what nodes
are free.
-- Fix seg fault if loading attempting to load non-existent burstbuffer plugin.
-- Fix to backfill scheduling with respect to QOS and association limits. Jobs
submitted to multiple partitions are most likley to be effected.
* Changes in Slurm 16.05.10-2
=============================
-- Replace clock_gettime with time(NULL) for very old systems without the call.
* Changes in Slurm 16.05.10
===========================
-- Record job state as PREEMPTED instead of TIMEOUT when GraceTime is reached.
-- task/cgroup - print warnings to stderr when --cpu_bind=verbose is enabled
and the requested processor affinity cannot be set.
-- power/cray - Disable power cap get and set operations on DOWN nodes.
-- Jobs preempted with PreemptMode=REQUEUE were incorrectly recorded as
REQUEUED in the accounting.
-- PMIX - Use volatile specifier to avoid flag caching and lock the flag to
make sure it is protected.
-- PMIX/PMI2 - Make it possible to use %n or %h in a spool dir.
-- burst_buffer/cray - Support default pool which is not the first pool
reported by DataWarp and log in Slurm when pools that are added or removed
from DataWarp.
-- Insure job does not start running before PrologSlurmctld is complete and
node is booted (all nodes for interactive job, at least first node for batch
job without burst buffers).
-- Fix minor memory leak in the slurmctld when removing a QOS.
-- burst_buffer/cray - Do not execute "pre_run" operation until after all nodes
are booted and ready for use.
-- scontrol - return an error when attempting to use the +=/-+ syntax to
update a field where this is not appropriate.
-- Fix task/affinity to work correctly with --ntasks-per-socket.
-- Honor --ntasks-per-node and --ntasks option when used with job constraints
that contain node counts.
-- Prevent deadlocked slurmstepd processes due to unsafe use of regcomp with
older glibc versions.
-- Fix squeue when SLURM_BITSTR_LEN=0 is set in the user environment.
-- Fix comments in acct_policy.c to reflect actual variables instead of
old ones.
-- Fix correct variables when validating GrpTresMins on a QOS.
-- Better debug output when a job is being held because of a GrpTRES[Run]Min
limits.
-- Fix correct state reason when job can't run 'safely' because of an
association GrpWall limit.
-- Squeue always loads new data if user_id option specified
-- Fix for possible job ID parsing failure and abort.
-- If node boot in progress when slurmctld daemon is restarted, then allow
sufficient time for reboot to complete and not prematurely DOWN the node as
"Not responding".
-- For job resize, correct logic to build "resize" script with new values.
Previously the scripts were based upon the original job size.
-- Fix squeue to not limit the size of partition, burst_buffer, exec_host, or
reason to 32 chars.
-- Fix potential packing error when packing a NULL slurmdb_clus_res_rec_t.
-- Fix potential packing errors when packing a NULL slurmdb_reservation_cond_t.
-- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
fails. Now job will be held. Update job update time when held.
-- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
-- Increase number of ResumePrograms that can be managed without leaving
zombie/orphan processes from 10 to 100.
-- Refactor slurmctld agent logic to eliminate some pthreads.
* Changes in Slurm 16.05.9
==========================
-- Fix parsing of SBCAST_COMPRESS environment variable in sbcast.
-- Change some debug messages to errors in task/cgroup plugin.
-- backfill scheduler: Stop trying to determine expected start time for a job
after 2 seconds of wall time. This can happen if there are many running jobs
and a pending job can not be started soon.
-- Improve performance of cr_sort_part_rows() in cons_res plugin.
-- CRAY - Fix dealock issue when updating accounting in the slurmctld and
scheduling a Datawarp job.
-- Correct the job state accounting information for jobs requeued due to burst
buffer errors.
-- burst_buffer/cray - Avoid "pre_run" operation if not using buffer (i.e.
just creating or deleting a persistent burst buffer).
-- Fix slurm.spec file support for BlueGene builds.
-- Fix missing TRES read lock in acct_policy_job_runnable_pre_select() code.
-- Fix debug2 message printing value using wrong array index in
_qos_job_runnable_post_select().
-- Prevent job timeout on node power up.
-- MYSQL - Fix minor memory leak when querying steps and the sql fails.
-- Make it so sacctmgr accepts column headers like MaxTRESPU and not MaxTRESP.
-- Only look at SLURM_STEP_KILLED_MSG_NODE_ID on startup, to avoid race
condition later when looking at a steps env.
-- Make backfill scheduler behave like regular scheduler in respect to
'assoc_limit_stop'.
-- Allow a lower version client command to talk to a higher version contoller
using the multi-cluster options (e.g. squeue -M<clsuter>).
-- slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld
daemon is running or node boot in progress.
-- MYSQL - Fix a few other minor memory leaks when uncommon failures occur.
-- burst_buffer/cray - Fix race condition that could cause multiple batch job
launch requests resulting in drained nodes.
-- Correct logic to purge old reservations.
-- Fix DBD cache restore from previous versions.
-- Fix to logic for getting expected start time of existing job ID with
explicit begin time that is in the past.
-- Clear job's reason of "BeginTime" in a more timely fashion and/or prevents
them from being stuck in a PENDING state.
-- Make sure acct policy limits imposed on a job are correct after requeue.
* Changes in Slurm 16.05.8
==========================
-- Remove StoragePass from being printed out in the slurmdbd log at debug2
level.
-- Defer PATH search for task program until launch in slurmstepd.
-- Modify regression test1.89 to avoid leaving vestigial job. Also reduce
logging to reduce likelyhood of Expect buffer overflow.
-- Do not PATH search for mult-prog launches if LaunchParamters=test_exec is
enabled.
-- Fix for possible infinite loop in select/cons_res plugin when trying to
satisfy a job's ntasks_per_core or socket specification.
-- If job is held for bad constraints make it so once updated the job doesn't
go into JobAdminHeld.
-- sched/backfill - Fix logic to reserve resources for jobs that require a
node reboot (i.e. to change KNL mode) in order to start.
-- When unpacking a node or front_end record from state and the protocol
version is lower than the min version, set it to the min.
-- Remove redundant lookup for part_ptr when updating a reservation's nodes.
-- Fix memory and file descriptor leaks in slurmd daemon's sbcast logic.
-- Do not allocate specialized cores to jobs using the --exclusive option.
-- Cancel interactive job if Prolog failure with "PrologFlags=contain" or
"PrologFlags=alloc" configured. Send new error prolog failure message to
the salloc or srun command as needed.
-- Prevent possible out-of-bounds read in slurmstepd on an invalid #! line.
-- Fix check for PluginDir within slurmctld to work with multiple directories.
-- Cancel interactive jobs automatically on communication error to launching
srun/salloc process.
-- Fix security issue caused by insecure file path handling triggered by the
failure of a Prolog script. To exploit this a user needs to anticipate or
cause the Prolog to fail for their job. CVE-2016-10030.
* Changes in Slurm 16.05.7
==========================
-- Fix issue in the priority/multifactor plugin where on a slurmctld restart,
where more time is accounted for than should be allowed.
-- cray/busrt_buffer - If total_space in a pool decreases, reset used_space
rather than trying to account for buffer allocations in progress.
-- cray/busrt_buffer - Fix for double counting of used_space at slurmctld
startup.
-- Fix regression in 16.05.6 where if you request multiple cpus per task (-c2)
and request --ntasks-per-core=1 and only 1 task on the node
the slurmd would abort on an infinite loop fatal.
-- cray/busrt_buffer - Internally track both allocated and unusable space.
The reported UsedSpace in a pool is now the allocated space (previously was
unusable space). Base available space on whichever value leaves least free
space.
-- cray/burst_buffer - Preserve job ID and don't translate to job array ID.
-- cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli
output.
-- sched/backfill - Insure we don't try to start a job that was already started
and requeued by the main scheduling logic.
-- job_submit/lua - add access to the job features field in job_record.
-- select/linear plugin modified to better support heterogeneous clusters when
topology/none is also configured.
-- Permit cancellation of jobs in configuring state.
-- acct_gather_energy/rapl - prevent segfault in slurmd from race to gather
data at slurmd startup.
-- Integrate node_feature/knl_generic with "hbm" GRES information.
-- Fix output routines to prevent rounding the TRES values for memory or BB.
-- switch/cray plugin - fix use after free error.
-- docs - elaborate on how way to clear TRES limits in sacctmgr.
-- knl_cray plugin - Avoid abort from backup slurmctld at start time.
-- cgroup plugins - fix two minor memory leaks.
-- If a node is booting for some job, don't allocate additional jobs to the
node until the boot completes.
-- testsuite - fix job id output in test17.39.
-- Modify backfill algorithm to improve performance with large numbers of
running jobs. Group running jobs that end in a "similar" time frame using a
time window that grows exponentially rather than linearly. After one second
of wall time, simulate the termination of all remaining running jobs in
order to respond in a reasonable time frame.
-- Fix slurm_job_cpus_allocated_str_on_node_id() API call.
-- sched/backfill plugin: Make malloc match data type (defined as uint32_t and
allocated as int).
-- srun - prevent segfault when terminating job step before step has launched.
-- sacctmgr - prevent segfault when trying to reset usage for an invalid
account name.
-- Make the openssl crypto plugin compile with openssl >= 1.1.
-- Fix SuspendExcNodes and SuspendExcParts on slurmctld reconfiguration.
-- sbcast - prevent segfault in slurmd due to race condition between file
transfers from separate jobs using zlib compression
-- cray/burst_buffer - Increase time to synchronize operations between threads
from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
-- node_features/knl_cray - Fix possible race condition when changing node
state that could result in old KNL mode as an active features.
-- Make sure if a job can't run because of resources we also check accounting
limits after the node selection to make sure it doesn't violate those limits
and if it does change the reason for waiting so we don't reserve resources
on jobs violating accounting limits.
-- NRT - Make it so a system running against IBM's PE will work with PE
version 1.3.
-- NRT - Make it so protocols pgas and test are allowed to be used.
-- NRT - Make it so you can have more than 1 protocol listed in MP_MSG_API.
-- cray/burst_buffer - If slurmctld daemon restarts with pending job and burst
buffer having unknown file stage-in status, teardown the buffer, defer the
job, and start stage-in over again.
-- On state restore in the slurmctld don't overwrite the mem_spec_limit given
from the slurm.conf when using FastSchedule=0.
-- Recognize a KNL's proper NUMA count (rather than setting it to the value
in slurm.conf) when using FastSchedule=0.
-- Fix parsing in regression test1.92 for some prompts.
-- sbcast - use slurmd's gid cache rather than a separate lookup.
-- slurmd - return error if setgroups() call fails in _drop_privileges().
-- Remove error messages about gres counts changing when a job is resized on
a slurmctld restart or reconfig, as they aren't really error messages.
-- Fix possible memory corruption if a job is using GRES and changing size.
-- jobcomp/elasticsearch - fix printf format for a value on 32-bit builds.
-- task/cgroup - Change error message if CPU binding can not take place to
better identify the root cause of the problem.
-- Fix issue where task/cgroup would not always honor --cpu_bind=threads.
-- Fix race condition in with getgrouplist() in slurmd that can lead to
user accounts being granted access to incorrect group memberships during
job launch.
* Changes in Slurm 16.05.6
==========================
-- Docs - the correct default value for GroupUpdateForce is 0.
-- mpi/pmix - improve point to point communication performance.
-- SlurmDB - include pending jobs in search during 'sacctmgr show runawayjobs'.
-- Add client side out-of-range checks to --nice flag.
-- Fix support for sbatch "-W" option, previously eeded to use "--wait".
-- node_features/knl_cray plugin and capmc_suspend/resume programs modified to
sleep and retry capmc operations if the Cray State Manager is down. Added
CapmcRetries configuration parameter to knl_cray.conf.
-- node_features/knl_cray plugin: Remove any KNL MCDRAM or NUMA features from
node's configuration if capmc does NOT report the node as being KNL.
-- node_features/knl_cray plugin: drain any node not reported by
"capmc node_status" on startup or reconfig.
-- node_features/knl_cray plugin: Substantially streamline and speed up logic
to load current node state on reconfigure failure or unexpected node boot.
-- node_features/knl_cray plugin: Add separate thread to interact with capmc
in response to unexpected node reboots.
-- node_features plugin - Add "mode" argument to node_features_p_node_xlate()
function to fix some bugs updating a node's features using the node update
RPC.
-- node_features/knl_cray plugin: If the reconfiguration of nodes for an
interactive job fails, kill the job (it can't be requeued like a batch job).
-- Testsuite - Added srun/salloc/sbatch tests with --use-min-nodes option.
-- Fix typo when an error occurs when discovering pmix version on
configure.
-- Fix configuring pmix support when you have your lib dir symlinked to lib64.
-- Fix waiting reason if a job is waiting for a specific limit instead of
always just AccountingPolicy.
-- Correct SchedulerParameters=bf_busy_nodes logic with respect to the job's
minimum node count. Previous logic would not decremement counter in some
locations and reject valid job request for not reaching minimum node count.
-- Fix FreeBSD-11 build by using llabs() function in place of abs().
-- Cray: The slurmd can manipulate the socket/core/thread values reported based
upon the configuration. The logic failed to consider select/cray with
SelectTypeParameters=other_cons_res as equivalent to select/cons_res.
-- If a node's socket or core count are changed at registration time (e.g. a
KNL node's NUMA mode is changed), change it's board count to match.
-- Prevent possible divide by zero in select/cons_res if a node's board count
is higher than it's socket count.
-- Allow an advanced reservation to contain a license count of zero.
-- Preserve non-KNL node features when updating the KNL node features for a
multi-node job in which the non-KNL node features vary by node.
-- task/affinity plugin: Honor a job's --ntasks-per-socket and
--ntasks-per-core options in task binding.
-- slurmd - do not print ClusterName when using 'slurmd -C'.
-- Correct a bitmap test function (used only by the select/bluegene plugin).
-- Do not propagate SLURM_UMASK environment variable to batch script.
-- Added node_features/knl_generic plugin for KNL support on non-Cray systems.
-- Cray: Prevent abort in backfill scheduling logic for requeued job that has
been cancelled while NHC is running.
-- Improve reported estimates of start and end times for pending jobs.
-- pbsnodes: Show OS value as "unknown" for down nodes.
-- BlueGene - correctly scale node counts when enforcing MaxNodes limit take 2.
-- Fix "sbatch --hold" to set JobHeldUser correctly instead of JobHeldAdmin.
-- Cray - print warning that task/cgroup is required, and must be after
task/cray in the TaskPlugin settings.
-- Document that node Weight takes precedence over load with LLN scheduling.
-- Fix issue where gang scheduling could happen even with OverSubscribe=NO.
-- Expose JOB_SHARED_* values to job_submit/lua plugin.
-- Fix issue where number of nodes is not properly allocated when srun is
requested with -n tasks < hosts from -w hostlist.
-- Update srun documentation for -N, -w and -m arbitrary.
-- Fix bug that was clearing MAINT mode on nodes scheduled for reboot (bug
introduced in version 16.05.5 to address bug in overlapping reservations).
-- Add logging of node reboot requests.
-- Docs - remove recommendation for ReleaseAgent setting in cgroup.conf.
-- Make sure a job cleans up completely if it has a node fail. Mostly an
issue with gang scheduling.
* Changes in Slurm 16.05.5
==========================
-- Fix accounting for jobs requeued after the previous job was finished.
-- slurmstepd modified to pre-load all relevant plugins at startup to avoid
the possibility of modified plugins later resulting in inconsistent API
or data structures and a failure of slurmstepd.
-- Export functions from parse_time.c in libslurm.so.
-- Export unit convert functions from slurm_protocol_api.c in libslurm.so.
-- Fix scancel to allow multiple steps from a job to be cancelled at once.
-- Update and expand upgrade guide (in Quick Start Administrator web page).
-- burst_buffer/cray: Requeue, but do not hold a job which fails the pre_run
operation.
-- Insure reported expected job start time is not in the past for pending jobs.
-- Add support for PMIx v2.
-- mpi/pmix: support for passing TMPDIR path through info key
-- Cray: update slurmconfgen_smw.py script to correctly identify service nodes
versus compute nodes.
-- FreeBSD - fix build issue in knl_cray plugin.
-- Corrections to gres.conf parsing logic.
-- Make partition State independent of EnforcePartLimits value.
-- Fix multipart srun submission with EnforcePartLimits=NO and job violating
the partition limits.
-- Fix problem updating job state_reason.
-- pmix - Provide HWLOC topology in the job-data if Slurm was configured
with hwloc.
-- Cray - Fix issue restoring jobs when blade count increases due to hardware
reconfiguration.
-- burst_buffer/cray - Hold job after 3 failed pre-run operations.
-- sched/backfill - Check that a user's QOS is allowed to use a partition
before trying to schedule resources on that partition for the job.
-- sacctmgr - Fix displaying nodenames when printing out events or
reservations.
-- Fix mpiexec wrapper to accept task count with more than one digit.
-- Add mpiexec man page to the script.
-- Add salloc_wait_nodes option to the SchedulerParameters parameter in the
slurm.conf file controlling when the salloc command returns in relation to
when nodes are ready for use (i.e. booted).
-- Handle case when slurmctld daemon restart while compute node reboot in
progress. Return node to service rather than setting DOWN.
-- Preserve node "RESERVATION" state when one of multiple overlapping
reservations ends.
-- Restructure srun command locking for task_exit processing logic for improved
parallelism.
-- Modify srun task completion handling to only build the task/node string for
logging purposes if it is needed. Modified for performance purposes.
-- Docs - update salloc/sbatch/srun man pages to mention corresponding
environment variables for --mem/--mem-per-cpu and allowed suffixes.
-- Silence srun warning when overriding the job ntasks-per-node count
with a lower task count for the step.
-- Docs - assorted spelling fixes.
-- node_features/knl_cray: Fix bug where MCDRAM state could be taken from
capmc rather than cnselect.
-- node_features/knl_cray: If a node is rebooted outside of Slurm's direction,
update it's active features with current MCDRAM and NUMA mode information.
-- Restore ability to manually power down nodes, broken in 15.08.12.
-- Don't log error for job end_time being zero if node health check is still
running.
-- When powering up a node to change it's state (e.g. KNL NUMA or MCDRAM mode)
then pass to the ResumeProgram the job ID assigned to the nodes in the
SLURM_JOB_ID environment variable.
-- Allow a node's PowerUp state flag to be cleared using update_node RPC.
-- capmc_suspend/resume - If a request modify NUMA or MCDRAM state on a set of
nodes or reboot a set of nodes fails then just requeue the job and abort the
entire operation rather than trying to operate on individual nodes.
-- node_features/knl_cray plugin: Increase default CapmcTimeout parameter from
10 to 60 seconds.
-- Fix squeue filter by job license when a job has requested more than 1
license of a certain type.
-- Fix bug in PMIX_Ring in the pmi2 plugin so that it supports singleton mode.
It also updates the testpmixring.c test program so it can be used to check
singleton runs.
-- Automically clean up task/cgroup cpuset and devices cgroups after steps are
completed.
-- Testsuite - Fix test1.83 to handle gaps in node names properly.
-- BlueGene - correctly scale node counts when enforcing MaxNodes limit.
-- Make sure no attempt is made to schedule a requeued job until all steps are
cleaned (Node Health Check completes for all steps on a Cray).
-- KNL: Correct task affinity logic for some NUMA modes.
-- Add salloc/sbatch/srun --priority option of "TOP" to set job priority to
the highest possible value. This option is only available to Slurm operators
and administrators.
-- Add salloc/sbatch/srun option --use-min-nodes to prefer smaller node counts
when a range of node counts is specified (e.g. "-N 2-4").
-- Validate salloc/sbatch --wait-all-nodes argument.
-- Add "sbatch_wait_nodes" to SchedulerParameters to control default sbatch
behaviour with respect to waiting for all allocated nodes to be ready for
use. Job can override the configuration option using the --wait-all-nodes=#
option.
-- Prevent partition group access updates from resetting last_part_update when
no changes have been made. Prevents backfill scheduler from restarting
mid-cycle unnecessarily.
-- Cray - add NHC_ABSOLUTELY_NO to never run NHC, even on certain edge cases
that it would otherwise be run on with NHC_NO.
-- Ignore GRES/QOS updates that maintain the same value as before.
-- mpi/pmix - prepare temp directory for application.
-- Fix display for the nice and priority values in sprio/scontrol/squeue.
* Changes in Slurm 16.05.4
==========================
-- Fix potential deadlock if running with message aggregation.
-- Streamline when schedule() is called when running with message aggregation
on batch script completes.
-- Fix incorrect casting when [un]packing derived_ec on slurmdb_job_rec_t.
-- Document that persistent burst buffers can not be created or destroyed using
the salloc or srun --bb options.
-- Add support for setting the SLURM_JOB_ACCOUNT, SLURM_JOB_QOS and
SLURM_JOB_RESERVAION environment variables are set for the salloc command.
Document the same environment variables for the salloc, sbatch and srun
commands in their man pages.
-- Fix issue where sacctmgr load cluster.cfg wouldn't load associations
that had a partition in them.
-- Don't return the extern step from sstat by default.
-- In sstat print 'extern' instead of 4294967295 for the extern step.
-- Make advanced reservations work properly with core specialization.
-- Fix race condition in the account_gather plugin that could result in job
stuck in COMPLETING state.
-- Regression test fixes if SelectTypePlugin not managing memory and no node
memory size set (defaults to 1 MB per node).
-- Add missing partition write locks to _slurm_rpc_dump_nodes/node_single to
prevent a race condition leading to inconsistent sinfo results.
-- Fix task:CPU binding logic for some processors. This bug was introduced
in version 16.05.1 to address KNL bunding problem.
-- Fix two minor memory leaks in slurmctld.
-- Improve partition-specific limit logging from slurmctld daemon.
-- Fix incorrect access check when using MaxNodes setting on the partition.
-- Fix issue with sacctmgr when specifying a list of clusters to query.
-- Fix issue when calculating future StartTime for a job.
-- Make EnforcePartLimit support logic work with any ordering of partitions
in job submit request.
-- Prevent restoration of wrong CPU governor and frequency when using
multiple task plugins.
-- Prevent slurmd abort if hwloc library fails to populate the "children"
arrays (observed with hwloc version "dev-333-g85ea6e4").
-- burst_buffer/cray: Add "--groupid" to DataWarp "setup" command.
-- Fix lustre profiling putting it in the Filesystem dataset instead of the
Network dataset.
-- Fix profiling documentation and code to match be consistent with
Filesystem instead of Lustre.
-- Correct the way watts is calculated in the rapl plugin when using a poll
frequency other than AcctGatherNodeFreq.
-- Don't about step launch if job reaches expected end time while node is
configuring/booting (NOTE: The job end time will be adjusted after node
becomes ready for use).
-- Fix several print routines to respect a custom output delimiter when
printing NO_VAL or INFINITE.
-- Correct documented configurations where --ntasks-per-core and
--ntasks-per-socket are supported.
-- task/affinity plugin buffer allocated too small, can corrupt memory.
* Changes in Slurm 16.05.3
==========================
-- Make it so the extern step uses a reverse tree when cleaning up.
-- If extern step doesn't get added into the proctrack plugin make sure the
sleep is killed.
-- Fix areas the slurmctld can segfault if an extern step is in the system
cleaning up on a restart.
-- Prevent possible incorrect counting of GRES of a given type if a node has
the multiple "types" of a given GRES "name", which could over-subscribe
GRES of a given type.
-- Add web links to Slurm Diamond Collectors (from Harvard University) and
collectd (from EDF).
-- Add job_submit plugin for the "reboot" field.
-- Make some more Slurm constants (INFINITE, NO_VAL64, etc.) available to
job_submit/lua plugins.
-- Send in a -1 for a taskid into spank_task_post_fork for the extern_step.
-- MYSQL - Sightly better logic if a job completion comes in with an end time
of 0.
-- task/cgroup plugin is configured with ConstrainRAMSpace=yes, then set soft
memory limit to allocated memory limit (previously no soft limit was set).
-- Document limitations in burst buffer use by the salloc command (possible
access problems from a login node).
-- Fix proctrack plugin to only add the pid of a process once
(regression in 16.05.2).
-- Fix for sstat to print correct info when requesting jobid.batch as part of
a comma-separated list.
-- CRAY - Fix issue if pid has already been added to another job container.
-- CRAY - Fix add of extern step to AELD.
-- burstbufer/cray: avoid batch submit error condition if waiting for stagein.
-- CRAY - Fix for reporting steps lingering after they are already finished.
-- Testsuite - fix test1.29 / 17.15 for limits with values above 32-bits.
-- CRAY - Simplify when a NHC is called on a step that has unkillable
processes.
-- CRAY - If trying to kill a step and you have NHC_NO_STEPS set run NHC
anyway to attempt to log the backtraces of the potential
unkillable processes.
-- Fix gang scheduling and license release logic if single node job killed on
bad node.
-- Make scontrol show steps show the extern step correctly.
-- Do not scheduled powered down nodes in FAILED state.
-- Do not start slurmctld power_save thread until partition information is read
in order to prevent race condition that can result invalid pointer when
trying to resolve configured SuspendExcParts.
-- Add SLURM_PENDING_STEP id so it won't be confused with SLURM_EXTERN_CONT.
-- Fix for core selection with job --gres-flags=enforce-binding option.
Previous logic would in some cases allocate a job zero cores, resulting in
slurmctld abort.
-- Minimize preempted jobs for configurations with multiple jobs per node.
-- Improve partition AllowGroups caching. Update the table of UIDs permitted to
use a partition based upon it's AllowGroups configuration parameter as new
valid UIDs are found rather than looking up that user's group information
for every job they submit. If the user is now allowed to use the partition,
then do not check that user's group access again for 5 seconds.
-- Add routing queue information to Slurm FAQ web page.
-- Do not select_g_step_finish() a SLURM_PENDING_STEP step, as nothing has
been allocated for the step yet.
-- Fixed race condition in PMIx Fence logic.
-- Prevent slurmctld abort if job is killed or requeued while waiting for
reboot of its allocated compute nodes.
-- Treat invalid user ID in AllowUserBoot option of knl.conf file as error
rather than fatal (log and do not exit).
-- qsub - When doing the default output files for an array in qsub style
make them using the master job ID instead of the normal job ID.
-- Create the extern step while creating the job instead of waiting until the
end of the job to do it.
-- Always report a 0 exit code for the extern step instead of being canceled
or failed based on the signal that would always be killing it.
-- Fix to allow users to update QOS of pending jobs.
-- CRAY - Fix minor memory leak in switch plugin.
-- CRAY - Change slurmconfgen_smw.py to skip over disabled nodes.
-- Fix eligible_time for elasticsearch as well as add queue_wait
(difference between start of job and when it was eligible).
* Changes in Slurm 16.05.2
==========================
-- CRAY - Fix issue where the proctrack plugin could hang if the container
id wasn't able to be made.
-- Move test for job wait reason value of BurstBufferResources and
BurstBufferStageIn later in the scheduling logic.
-- Document which srun options apply to only job, only step, or job and step
allocations.
-- Use more compatible function to get thread name (>= 2.6.11).
-- Fix order of job then step id when noting cleaning flag being set.
-- Make it so the extern step sends a message with accounting information
back to the slurmctld.
-- Make it so the extern step calls the select_g_step_start|finish functions.
-- Don't print error when extern step is canceled because job is ending.
-- Handle a few error codes when dealing with the extern step to make sure
we have the pids added to the system correctly.
-- Add support for job dependencies with job array expressions. Previous logic
required listing each task of job array individually.
-- Make sure tres_cnt is set before creating a slurmdb_assoc_usage_t.
-- Prevent backfill scheduler from starting a second "singleton" job if another
one started during a backfill sleep.
-- Fix for invalid array pointer when creating advanced reservation when job
allocations span heterogeneous nodes (differing core or socket counts).
-- Fix hostlist_ranged_string_xmalloc_dims to correctly not put brackets on
hostlists when brackets == 0.
-- Make sure we don't get brackets when making a range of reserved ports
for a step.
-- Change fatal to an error if port ranges aren't correct when reading state
for steps.
* Changes in Slurm 16.05.1
==========================
-- Fix __cplusplus macro in spank.h to allow compilation with C++.
-- Fix compile issue with older glibc < 2.12
-- Fix for starting batch step with mpi/pmix plugin.
-- Fix for "scontrol -dd show job" with respect to displaying the specific
CPUs allocated to a job on each node. Prior logic would only display
the CPU information for the first node in the job allocation.
-- Print correct return code on failure to update active node features
through sview.
-- Allow QOS timelimit to override partition timelimit when EnforcePartLimits
is set to all/any.
-- Make it so qsub will do a "basename" on a wrapped command for the output
and error files.
-- Fix issue where slurmd could core when running the ipmi energy plugin.
-- Documentation - clean up typos.
-- Add logic so that slurmstepd can be launched under valgrind.
-- Increase buffer size to read /proc/*/stat files.
-- Fix for tracking job resource allocation when slurmctld is reconfigured
while Cray Node Health Check (NHC) is running. Previous logic would fail to
record the job's allocation then perform release operation upon NHC
completion, resulting in underflow error messages.
-- Make "scontrol show daemons" work with long node names.
-- CRAY - Collect energy using a uint64_t instead of uint32_t.
-- Fix incorrect if statements when determining if the user has a default
account or wckey.
-- Prevent job stuck in configuring state if slurmctld daemon restarted while
PrologSlurmctld is running. Also re-issue burst_buffer/pre-load operation
as needed.
-- Correct task affinity support for FreeBSD.
-- Fix for task affinity on KNL in SNC2/Flat mode.
-- Recalculate a job's memory allocation after node reboot if job requests all
of a node's memory and FastSchedule=0 is configured. Intel KNL memory size
can change on reboot with various MCDRAM modes.
-- Fix small memory leak when printing HealthCheckNodeState.
-- Eliminate memory leaks when AuthInfo is configured.
-- Improve sdiag output description in man page.
-- Cray/capmc_resume script modify a node's features (as needed) when the
reinit (reboot) command is issued rather than wait for the nodes to change
to the "on" state.
-- Correctly print ranges when using step values in job arrays.
-- Allow from file names / paths over 256 characters when launching steps,
as well as spaces in the executable name.
-- job_submit.license.lua example modified to send message back to user.
-- Document job --mem=0 option means all memory on a node.
-- Set SLURM_JOB_QOS environment variable to QOS name instead of description.
-- knl_cray.conf file option of CnselectPath added.
-- node_features/knl_cray plugin modified to get current node NUMA and MCDRAM
modes using cnselect command rather than capmc command.
-- liblua - add SLES12 paths to runtime search list.
-- Fix qsub default output and error files for task arrays.
-- Fix qsub to set job_name correctly when wrapping a script (-b y)
-- Cray - set EnforcePartLimits=any in slurm.conf template.
* Changes in Slurm 16.05.0
==========================
-- Update seff to fix warnings with ncpus, and list slurm-perlapi dependency
in spec file.
-- Fix testsuite to consistent use /usr/bin/env {bash,expect} construct.
-- Cray - Ensure that step completion messages get to the database.
-- Fix step cpus_per_task calculation for heterogeneous job allocation.
-- Fix --with-json= configure option to use specified path.
-- Add back thread_id to "thread_id" LogTimeFormat to distinguish between
mutliple threads with the same name. Now displays thread name and id.
-- Change how Slurm determines the NUMA count of a node. Ignore KNL NUMA
that only include memory.
-- Cray - Fix node list parsing in capmc_suspend/resume programs.
-- Fix sbatch #BSUB parsing for -W and -M options.
-- Fix GRES task layout bug that could cause slurmctld to abort.
-- Fix to --gres-flags=enforce-binding logic when multiple sockets needed.
* Changes in Slurm 16.05.0rc2
=============================
-- Cray node shutdown/reboot scripts, perform operations on all nodes in one
capmc command. Only if that fails, issue the operations in parallel on
individual nodes. Required for scalability.
-- Cleanup two minor Coverity warnings.
-- Make it so the tres units in a job's formatted string are converted like
they are in a step.
-- Correct partition's MaxCPUsPerNode enforcement when nodes are shared by
multiple partitions.
-- node_feature/knl_cray - Prevent slurmctld GRES errors for "hbm" references.
-- Display thread name instead of thread id and remove process name in stderr
logging for "thread_id" LogTimeFormat.
-- Log IP address of bad incomming message to slurmctld.
-- If a user requests tasks, nodes and ntasks-per-node and
tasks-per-node/nodes != tasks print warning and ignore ntasks-per-node.
-- Release CPU "owner" file locks.
-- Fix for job step memory allocation: Reject invalid step at submit time
rather than leaving it queued.
-- Whenever possible, avoid allocating nodes that require a reboot.
* Changes in Slurm 16.05.0rc1
==============================
-- Remove the SchedulerParameters option of "assoc_limit_continue", making it
the default value. Add option of "assoc_limit_stop". If "assoc_limit_stop"
is set and a job cannot start due to association limits, then do not attempt
to initiate any lower priority jobs in that partition. Setting this can
decrease system throughput and utlization, but avoid potentially starving
larger jobs by preventing them from launching indefinitely.
-- Update a node's socket and cores per socket counts as needed after a node
boot to reflect configuration changes which can occur on KNL processors.
Note that the node's total core count must not change, only the distribution
of cores across varying socket counts (KNL NUMA nodes treated as sockets by
Slurm).
-- Rename partition configuration from "Shared" to "OverSubscribe". Rename
salloc, sbatch, srun option from "--shared" to "--oversubscribe". The old
options will continue to function. Output field names also changed in
scontrol, sinfo, squeue and sview.
-- Add SLURM_UMASK environment variable to user job.
-- knl_conf: Added new configuration parameter of CapmcPollFreq.
-- squeue: remove errant spaces in column formats for "squeue -o %all".
-- Add ARRAY_TASKS mail option to send emails to each task in a job array.
-- Change default compression library for sbcast to lz4.
-- select/cray - Initiate step node health check at start of step termination
rather than after application completely ends so that NHC can capture
information about hung (non-killable) processes.
-- Add --units=[KMGTP] option to sacct to display values in specific unit type.
-- Modify sacct and sacctmgr to display TRES values in converted units.
-- Modify sacctmgr to accept TRES values with [KMGTP] suffixes.
-- Replace hash function with more modern SipHash functions.
-- Add "--with-cray_dir" build/configure option.
-- BB- Only send stage_out email when stage_out is set in script.
-- Add r/w locking to file_bcast receive functions in slurmd.
-- Add TopologyParam option of "TopoOptional" to optimize network topology
only for jobs requesting it.
-- Fix build on FreeBSD.
-- Configuration parameter "CpuFreqDef" used to set default governor for job
step not specifying --cpu-freq (previously the parameter was unused).
-- Fix sshare -o<format> to correctly display new lengths.
-- Update documentation to rename Shared option to OverSubscribe.
-- Update documentation to rename partition Priority option to PriorityTier.
-- Prevent changing of QOS on running jobs.
-- Update accounting when changing QOS on pending jobs.
-- Add support to ntasks_per_socket in task/affinity.
-- Generate init.d and systemd service scripts in etc/ through Make rather
than at configure time to ensure all variable substitutions happen.
-- Use TaskPluginParam for default task binding if no user specified CPU
binding. User --cpu_bind option takes precident over default. No longer
any error if user --cpu_bind option does not match TaskPluginParam.
-- Make sacct and sattach work with older slurmd versions.
-- Fix protocol handling between 15.08 and 16.05 for 'scontrol show config'.
-- Enable prefixes (e.g. info, debug, etc.) in slurmstepd debugging.
* Changes in Slurm 16.05.0pre2
==============================
-- Split partition's "Priority" field into "PriorityTier" (used to order
partitions for scheduling and preemption) plus "PriorityJobFactor" (used by
priority/multifactor plugin in calculating job priority, which is used to
order jobs within a partition for scheduling).
-- Revert call to getaddrinfo, restoring gethostbyaddr (introduced in Slurm
16.05.0pre1) which was failing on some systems.
-- knl_cray.conf - Added AllowMCDRAM, AllowNUMA and ALlowUserBoot
configuration options.
-- Add node_features_p_user_update() function to node_features plugin.
-- Don't print Weight=1 lines in 'scontrol write config' (its the default).
-- Remove PARAMS macro from slurm.h.
-- Remove BEGIN_C_DECLS and END_C_DECLS macros.
-- Check that PowerSave mode configured for node_features/knl_cray plugin.
It is required to reconfigure and reboot nodes.
-- Update documentation to reflect new cgroup default location change from
/cgroup to /sys/fs/cgroup.
-- If NodeHealthCheckProgram configured HealthCheckInterval is non-zero, then
modify slurmd to run it before registering with slurmctld.
-- Fix for tasks being packed onto cores when the requested --cpus-per-task is
greater than the number of threads on a core and --ntasks-per-core is 1.
-- Make it so jobs/steps track ':' named gres/tres, before hand gres/gpu:tesla
would only track gres/gpu, now it will track both gres/gpu and
gres/gpu:tesla as separate gres if configured like
AccountingStorageTRES=gres/gpu,gres/gpu:tesla
-- Added new job dependency type of "aftercorr" which will start a task of a
job array after the corresponding task of another job array completes.
-- Increase default MaxTasksPerNode configuration parameter from 128 to 512.
-- Enable sbcast data compression logic (compress option previously ignored).
-- Add --compress option to srun command for use with --bcast option.
-- Add TCPTimeout option to slurm[dbd].conf. Decouples MessageTimeout from TCP
connections.
-- Don't call primary controller for every RPC when backup is in control.
-- Add --gres-flags=enforce-binding option to salloc, sbatch and srun commands.
If set, the only CPUs available to the job will be those bound to the
selected GRES (i.e. the CPUs identifed in the gres.conf file will be
strictly enforced rather than advisory).
-- Change how a node's allocated CPU count is calculated to avoid double
counting CPUs allocated to multiple jobs at the same time.
-- Added SchedulingParameters option of "bf_min_prio_reserve". Jobs below
the specified threshold will not have resources reserved for them.
-- Added "sacctmgr show lostjobs" to report any orphaned jobs in the database.
-- When a stepd is about to shutdown and send it's response to srun
make the wait to return data only hit after 500 nodes and configurable
based on the TcpTimeout value.
-- Add functionality to reset the lft and rgt values of the association table
with the slurmdbd.
-- Add SchedulerParameter no_env_cache, if set no env cache will be use when
launching a job, instead the job will fail and drain the node if the env
isn't loaded normally.
* Changes in Slurm 16.05.0pre1
==============================
-- Add sbatch "--wait" option that waits for job completion before exiting.
Exit code will match that of spawned job.
-- Modify advanced reservation save/restore logic for core reservations to
support configuration changes (changes in configured nodes or cores counts).
-- Allow ControlMachine, BackupController, DbdHost and DbdBackupHost to be
either short or long hostname.
-- Job output and error files can now contain "%" character by specifying
a file name with two consecutive "%" characters. For example,
"sbatch -o "slurm.%%.%j" for job ID 123 will generate an output file named
"slurm.%.123".
-- Pass user name in Prolog RPC from controller to slurmd when using
PrologFlags=Alloc. Allows SLURM_JOB_USER env variable to be set when using
Native Slurm on a Cray.
-- Add "NumTasks" to job information visible to Slurm commands.
-- Add mail wrapper script "smail" that will include job statistics in email
notification messages.
-- Remove vestigial "SICP" job option (inter-cluster job option). Completely
different logic will be forthcoming.
-- Fix case where the primary and backup dbds would both be performing rollup.
-- Add an ack reply from slurmd to slurmstepd when job setup is done and the
job is ready to be executed.
-- Removed support for authd. authd has not been developed and supported since
several years.
-- Introduce a new parameter requeue_setup_env_fail in SchedulerParameters.
A job that fails to setup the environment will be requeued and the node
drained.
-- Add ValidateTimeout and OtherTimeout to "scontrol show burst" output.
-- Increase default sbcast buffer size from 512KB to 8MB.
-- Enable the hdf5 profiling of the batch step.
-- Eliminate redundant environment and script files for job arrays.
-- Stop searching sbatch scripts for #PBS directives after 100 lines of
non-comments. Stop parsing #PBS or #SLURM directives after 1024 characters
into a line. Required for decent perforamnce with huge scripts.
-- Add debug flag for timing Cray portions of the code.
-- Remove all *.la files from RPMs.
-- Add Multi-Category Security (MCS) infrastructure to permit nodes to be bound
to specific users or groups.
-- Install the pmi2 unix sockets in slurmd spool directory instead of /tmp.
-- Implement the getaddrinfo and getnameinfo instead of gethostbyaddr and
gethostbyname.
-- Finished PMIx implementation.
-- Implemented the --without=package option for configure.
-- Fix sshare to show each individual cluster with -M,--clusters option.
-- Added --deadline option to salloc, sbatch and srun. Jobs which can not be
completed by the user specified deadline will be terminated with a state of
"Deadline" or "DL".
-- Implemented and documented PMIX protocol which is used to bootstrap an
MPI job. PMIX is an alternative to PMI and PMI2.
-- Change default CgroupMountpoint (in cgroup.conf) from "/cgroup" to
"/sys/fs/cgroup" to match current standard.
-- Add #BSUB options to sbatch to read in from the batch script.
-- HDF: Change group name of node from nodename to nodeid.
-- The partition-specific SelectTypeParameters parameter can now be used to
change the memory allocation tracking specification in the global
SelectTypeParameters configuration parameter. Supported partition-specific
values are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory. If the
global SelectTypeParameters value includes memory allocation management and
the partition-specific value does not, then memory allocation management for
that partition will NOT be supported (i.e. memory can be over-allocated).
Likewise the global SelectTypeParameters might not include memory management
while the partition-specific value does.
-- Burst buffer/cray - Add support for multiple buffer pools including support
for different resource granularity by pool.
-- Burst buffer advanced reservation units treated as bytes (per documentation)
rather than GB.
-- Add an "scontrol top <jobid>" command to re-order the priorities of a user's
pending jobs. May be disabled with the "disable_user_top" option in the
SchedulerParameters configuration parameter.
-- Modify sview to display negative job nice values.
-- Increase job's nice value field from 16 to 32 bits.
-- Remove deprecated job_submit/cnode plugin.
-- Enhance slurm.conf option EnforcePartLimit to include options like "ANY" and
"ALL". "Any" is equivalent to "Yes" and "All" will check all partitions
a job is submitted to and if any partition limit is violated the job will
be rejected even if it could possibly run on another partition.
-- Add "features_act" field (currently active features) to the node
information. Output of scontrol, sinfo, and sview changed accordingly.
The field previously displayed as "Features" is now "AvailableFeatures"
while the new field is displayed as "ActiveFeatures".
-- Remove Sun Constellation, IBM Federation Switches (replaced by NRT switch
plugin) and long-defunct Quadrics Elan support.
-- Add -M<clusters> option to sreport.
-- Rework group caching to work better in environments with
enumeration disabled. Removed CacheGroups config directive, group
membership lists are now always cached, controlled by
GroupUpdateTime parameter. GroupUpdateForce parameter default
value changed to 1.
-- Add reservation flag of "purge_comp" which will purge an advanced
reservation once it has no more active (pending, suspended or running) jobs.
-- Add new configuration parameter "KNLPlugins" and plugin infrastructure.
-- Add optional job "features" to node reboot RPC.
-- Add slurmd "-b" option to report node rebooted at daemon start time. Used
for testing purposes.
-- contribs/cray: Add framework for powering nodes up and down.
-- For job constraint, convert comma separator to "&".
-- Add Max*PerAccount options for QOS.
-- Protect slurm_mutex_* calls with abort() on failure.
* Changes in Slurm 15.08.14
===========================
-- For job resize, correct logic to build "resize" script with new values.
Previously the scripts were based upon the original job size.
* Changes in Slurm 15.08.13
===========================
-- Fix issue where slurmd could core when running the ipmi energy plugin.
-- Print correct return code on failure to update node features through sview.
-- Documentation - cleanup typos.
-- Add logic so that slurmstepd can be launched under valgrind.
-- Increase buffer size to read /proc/*/stat files.
-- MYSQL - Handle ER_HOST_IS_BLOCKED better by failing when it occurs instead
of continuously printing the message over and over as the problem will
most likely not resolve itself.
-- Add --disable-bluegene to configure. This will make it so Slurm
can work on a BGAS node.
-- Prevent job stuck in configuring state if slurmctld daemon restarted while
PrologSlurmctld is running.
-- Handle association correctly if using FAIR_TREE as well as shares=Parent
-- Fix race condition when setting priority of a job and the association
doesn't have a parent.
-- MYSQL - Fix issue with adding a reservation if the name has single quotes in
it.
-- Correctly print ranges when using step values in job arrays.
-- Fix for invalid array pointer when creating advanced reservation when job
allocations span heterogeneous nodes (differing core or socket counts).
-- Fix for sstat to print correct info when requesting jobid.batch as part of
a comma-separated list.
-- Cray - Fix issue restoring jobs when blade count increases due to hardware
reconfiguration.
-- Ignore warnings about depricated functions. This is primarily there for
new glibc 2.24+ that depricates readdir_r.
-- Fix security issue caused by insecure file path handling triggered by the
failure of a Prolog script. To exploit this a user needs to anticipate or
cause the Prolog to fail for their job. CVE-2016-10030.
* Changes in Slurm 15.08.12
===========================
-- Do not attempt to power down a node which has never responded if the
slurmctld daemon restarts without state.
-- Fix for possible slurmstepd segfault on invalid user ID.
-- MySQL - Fix for possible race condition when archiving multiple clusters
at the same time.
-- Fix compile for when you don't have hwloc.
-- Fix issue where daemons would only listen on specific address given in
slurm.conf instead of all. If looking for specific addresses use
TopologyParam options No*InAddrAny.
-- Cray - Better robustness when dealing with the aeld interface.
-- job_submit.lua - add array_inx value for job arrays.
-- Perlapi - Remove unneeded/undefined mutex.
-- Fix issue when TopologyParam=NoInAddrAny is set the responses wouldn't
make it to the slurmctld when using message aggregation.
-- MySQL - Fix potential memory leak when rolling up data.
-- Fix issue with clustername file when running on NFS with root_squash.
-- Fix race condition with respects to cleaning up the profiling threads
when in use.
-- Fix issues when building on NetBSD.
-- Fix jobcomp/elasticsearch build when libcurl is installed in a
non-standard location.
-- Fix MemSpecLimit to explicitly require TaskPlugin=task/cgroup and
ConstrainRAMSpace set in cgroup.conf.
-- MYSQL - Fix order of operations issue where if the database is locked up
and the slurmctld doesn't wait long enough for the response it would give
up leaving the connection open and create a situation where the next message
sent could receive the response of the first one.
-- Fix CFULL_BLOCK distribution type.
-- Prevent sbatch from trying to enable debug messages when using job arrays.
-- Prevent sbcast from enabling "--preserve" when specifying a jobid.
-- Prevent wrong error message from spank plugin stack on GLOB_NOSPACE error.
-- Fix proctrack/lua plugin to prevent possible deadlock.
-- Prevent infinite loop in slurmstepd if execve fails.
-- Prevent multiple responses to REQUEST_UPDATE_JOB_STEP message.
-- Prevent possible deadlock in acct_gather_filesystem/lustre on error.
-- Make it so --mail-type=NONE didn't throw an invalid error.
-- If no default account is given for a user when creating (only a list of
accounts) no default account is printed, previously NULL was printed.
-- Fix for tracking a node's allocated CPUs with gang scheduling.
-- Fix Hidden error during _rpc_forward_data call.
-- Fix bug resulting from wrong order-of-operations in _connect_srun_cr(),
and two others that cause incorrect debug messages.
-- Fix backwards compatibility with sreport going to <= 14.11 coming from
>= 15.08 for some reports.
* Changes in Slurm 15.08.11
===========================
-- Fix for job "--contiguous" option that could cause job allocation/launch
failure or slurmctld crash.
-- Fix to setup logs for single-character program names correctly.
-- Backfill scheduling performance enhancement with large number of running
jobs.
-- Reset job's prolog_running counter on slurmctld restart or reconfigure.
-- burst_buffer/cray - Update job's prolog_running counter if pre_run fails.
-- MYSQL - Make the error message more specific when removing a reservation
and it doesn't meet basic requirements.
-- burst_buffer/cray - Fix for script creating or deleting persistent buffer
would fail "paths" operation and hold the job.
-- power/cray - Prevent possible divide by zero.
-- power/cray - Fix bug introduced in 15.08.10 preventin operation in many
cases.
-- Prevent deadlock for flow of data to the slurmdbd when sending reservation
that wasn't set up correctly.
-- burst_buffer/cray - Don't call Datawarp "paths" function if script includes
only create or destroy of persistent burst buffer. Some versions of Datawarp
software return an error for such scripts, causing the job to be held.
-- Fix potential issue when adding and removing TRES which could result
in the slurmdbd segfaulting.
-- Add cast to memory limit calculation to prevent integer overflow for
very large memory values.
-- Bluegene - Fix issue with reservations resizing under the covers on a
restart of the slurmctld.
-- Avoid error message of "Requested cpu_bind option requires entire node to
be allocated; disabling affinity" being generated in some cases where
task/affinity and task/cgroup plugins used together.
-- Fix version issue when packing GRES information between 2 different versions
of Slurm.
-- Fix for srun hanging with OpenMPI and PMIx
-- Better initialization of node_ptr when dealing with protocol_version.
-- Fix incorrect type when initializing header of a message.
-- MYSQL - Fix incorrect usage of limit and union.
-- MYSQL - Remove 'ignore' from alter ignore when updating a table.
-- Documentation - update prolog_epilog page to reflect current behavior
if the Prolog fails.
-- Documentation - clarify behavior of 'srun --export=NONE' in man page.
-- Fix potential gres underflow on restart of slurmctld.
-- Fix sacctmgr to remove a user who has no associations.
* Changes in Slurm 15.08.10
===========================
-- Fix issue where if a slurmdbd rollup lasted longer than 1 hour the
rollup would effectively never run again.
-- Make error message in the pmi2 code to debug as the issue can be expected
and retries are done making the error message a little misleading.
-- Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are
not in a ready state, the API returned an error for ALL nodes rather than
valid data for nodes in ready state.
-- Fix potential divide by zero when tree_width=1.
-- checkpoint/blcr plugin: Fix memory leak.
-- If using PrologFlags=contain: Don't launch the extern step if a job is
cancelled while launching.
-- Remove duplicates from AccountingStorageTRES
-- Fix backfill scheduler race condition that could cause invalid pointer in
select/cons_res plugin. Bug introduced in 15.08.9.
-- Avoid double calculation on partition QOS if the job is using the same QOS.
-- Do not change a job's time limit when updating unrelated field in a job.
-- Fix situation on a heterogeneous memory cluster where the order of
constraints mattered in a job.
* Changes in Slurm 15.08.9
==========================
-- BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
operation in progress to avoid inconsistent state due to multiple calls
to job termination functions.
-- Fix issue with resizing jobs and limits not be kept track of correctly.
-- BGQ - Remove redeclaration of job_read_lock.
-- BGQ - Tighter locks around structures when nodes/cables change state.
-- Make it possible to change CPUsPerTask with scontrol.
-- Make it so scontrol update part qos= will take away a partition QOS from
a partition.
-- Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS=
was also given.
-- Add note to slurm.conf man page about setting "--cpu_bind=no" as part
of SallocDefaultCommand if a TaskPlugin is in use.
-- Set correct reason when a QOS' MaxTresMins is violated.
-- Insure that a job is completely launched before trying to suspend it.
-- Remove historical presentations and design notes. Only distribute
maintained doc/html and doc/man directories.
-- Remove duplicate xmalloc() in task/cgroup plugin.
-- Backfill scheduler to validate correct job partition for job submitted to
multiple partitions.
-- Force close on exec on first 256 file descriptors when launching a
slurmstepd to close potential open ones.
-- Step GRES value changed from type "int" to "int64_t" to support larger
values.
-- Fix getting reservations to database when database is down.
-- Fix issue with sbcast not doing a correct fanout.
-- Fix issue where steps weren't always getting the gres/tres involved.
-- Fixed double read lock on getting job's gres/tres.
-- Fix display for RoutePlugin parameter to display the correct value.
-- Fix route/topology plugin to prevent segfault in sbcast when in use.
-- Fix Cray slurmconfgen_smw.py script to use nid as nid, not nic.
-- Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.
-- burst_buffer/cray plugin: Prevent a requeued job from being restarted while
file stage-out is still in progress. Previous logic could restart the job
and not perform a new stage-in.
-- Fix job array formatting to allow return [0-100:2] display for arrays with
step functions rather than [0,2,4,6,8,...] .
-- FreeBSD - replace Linux-specific set_oom_adj to avoid errors in slurmd log.
-- Add option for TopologyParam=NoInAddrAnyCtld to make the slurmctld listen
on only one port like TopologyParam=NoInAddrAny does for everything else.
-- Fix burst buffer plugin to prevent corruption of the CPU TRES data when bb
is not set as an AccountingStorageTRES type.
-- Surpress error messages in acct_gather_energy/ipmi plugin after repeated
failures.
-- Change burst buffer use completion email message from
"SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
"SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
-- Generate burst buffer use completion email immediately afer teardown
completes rather than at job purge time (likely minutes later).
-- Fix issue when adding a new TRES to AccountingStorageTRES for the first
time.
-- Update gang scheduling tables when job manually suspended or resumed. Prior
logic could mess up job suspend/resume sequencing.
-- Update gang scheduling data structures when job changes in size.
-- Associations - prevent hash table corruption if uid initially unset for
a user, which can cause slurmctld to crash if that user is deleted.
-- Avoid possibly aborting srun on SIGSTOP while creating the job step due to
threading bug.
-- Fix deadlock issue with burst_buffer/cray when a newly created burst
buffer is found.
-- burst_buffer/cray: Set environment variables just before starting job rather
than at job submission time to reflect persistent buffers created or
modified while the job is pending.
-- Fix check of per-user qos limits on the initial run by a user.
-- Fix gang scheduling resource selection bug which could prevent multiple jobs
from being allocated the same resources. Bug was introduced in 15.08.6.
-- Don't print the Rgt value of an association from the cache as it isn't
kept up to date.
-- burst_buffer/cray - If the pre-run operation fails then don't issue
duplicate job cancel/requeue unless the job is still in run state. Prevents
jobs hung in COMPLETING state.
-- task/cgroup - Fix bug in task binding to CPUs.
* Changes in Slurm 15.08.8
==========================
-- Backfill scheduling properly synchronized with Cray Node Health Check.
Prior logic could result in highest priority job getting improperly
postponed.
-- Make it so daemons also support TopologyParam=NoInAddrAny.
-- If scancel is operating on large number of jobs and RPC responses from
slurmctld daemon are slow then introduce a delay in sending the cancel job
requests from scancel in order to reduce load on slurmctld.
-- Remove redundant logic when updating a job's task count.
-- MySQL - Fix querying jobs with reservations when the id's have rolled.
-- Perl - Fix use of uninitialized variable in slurm_job_step_get_pids.
-- Launch batch job requsting --reboot after the boot completes.
-- Move debug messages like "not the right user" from association manager
to debug3 when trying to find the correct association.
-- Fix incorrect logic when querying assoc_mgr information.
-- Move debug messages to debug3 notifying a gres_bit_alloc was NULL for
gres types without a file.
-- Sanity Check Patch to setup variables for RAPL if in a race for it.
-- GRES - Fix minor typecast issues.
-- burst_buffer/cray - Increase size of intermediate variable used to store
buffer byte size read from DW instance from 32 to 64-bits to avoid overflow
and reporting invalid buffer sizes.
-- Allow an existing reservation with running jobs to be modified without
Flags=IGNORE_JOBS.
-- srun - don't attempt to execve() a directory with a name matching the
requested command
-- Do not automatically relocate an advanced reservation for individual cores
that spans multiple nodes when nodes in that reservation go down (e.g.
a 1 core reservation on node "tux1" will be moved if node "tux1" goes
down, but a reservation containing 2 cores on node "tux1" and 3 cores on
"tux2" will not be moved node "tux1" goes down). Advanced reservations for
whole nodes will be moved by default for down nodes.
-- Avoid possible double free of memory (and likely abort) for slurmctld in
background mode.
-- contribs/cray/csm/slurmconfgen_smw.py - avoid including repurposed compute
nodes in configs.
-- Support AuthInfo in slurmdbd.conf that is different from the value in
slurm.conf.
-- Fix build on FreeBSD 10.
-- Fix hdf5 build on ppc64 by using correct fprintf formatting for types.
-- Fix cosmetic printing of NO_VALs in scontrol show assoc_mgr.
-- Fix perl api for newer perl versions.
-- Fix for jobs requesting cpus-per-task (eg. -c3) that exceed the number of
cpus on a core.
-- Remove unneeded perl files from the .spec file.
-- Flesh out filters for scontrol show assoc_mgr.
-- Add function to remove assoc_mgr_info_request_t members without freeing
structure.
-- Fix build on some non-glibc systems by updating includes.
-- Add new PowerParameters options of get_timeout and set_timeout. The default
set_timeout was increased from 5 seconds to 30 seconds. Also re-read current
power caps periodically or after any failed "set" operation.
-- Fix slurmdbd segfault when listing users with blank user condition.
-- Save the ClusterName to a file in SaveStateLocation, and use that to
verify the state directory belongs to the given cluster at startup to avoid
corruption from multiple clusters attempting to share a state directory.
-- MYSQL - Fix issue when rerolling monthly data to work off correct time
period. This would only hit you if you rerolled a 15.08 prior to this
commit.
-- If FastSchedule=0 is used make sure TRES are set up correctly in accounting.
-- Fix sreport's truncation of columns with large TRES and not using
a parsing option.
-- Make sure count of boards are restored when slurmctld has option -R.
-- When determine if a job can fit into a TRES time limit after resources
have been selected set the time limit appropriately if the job didn't
request one.
-- Fix inadequate locks when updating a partition's TRES.
-- Add new assoc_limit_continue flag to SchedulerParameters.
-- Avoid race in acct_gather_energy_cray if energy requested before available.
-- MYSQL - Avoid having multiple default accounts when a user is added to
a new account and making it a default all at once.
* Changes in Slurm 15.08.7
==========================
-- sched/backfill: If a job can not be started within the configured
backfill_window, set it's start time to 0 (unknown) rather than the end
of the backfill_window.
-- Remove the 1024-character limit on lines in batch scripts.
-- burst_buffer/cray: Round up swap size by configured granularity.
-- select/cray: Log repeated aeld reconnects.
-- task/affinity: Disable core-level task binding if more CPUs required than
available cores.
-- Preemption/gang scheduling: If a job is suspended at slurmctld restart or
reconfiguration time, then leave it suspended rather than resume+suspend.
-- Don't use lower weight nodes for job allocation when topology/tree used.
-- BGQ - If a cable goes into error state remove the under lying block on
a dynamic system and mark the block in error on a static/overlap system.
-- BGQ - Fix regression in 9cc4ae8add7f where blocks would be deleted on
static/overlap systems when some hardware issue happens when restarting
the slurmctld.
-- Log if CLOUD node configured without a resume/suspend program or suspend
time.
-- MYSQL - Better locking around g_qos_count which was previously unprotected.
-- Correct size of buffer used for jobid2str to avoid truncation.
-- Fix allocation/distribution of tasks across multiple nodes when
--hint=nomultithread is requested.
-- If a reservation's nodes value is "all" then track the current nodes in the
system, even if those nodes change.
-- Fix formatting if using "tree" option with sreport.
-- Make it so sreport prints out a line for non-existent TRES instead of
error message.
-- Set job's reason to "Priority" when higher priority job in that partition
(or reservation) can not start rather than leaving the reason set to
"Resources".
-- Fix memory corruption when a new non-generic TRES is added to the
DBD for the first time. The corruption is only noticed at shutdown.
-- burst_buffer/cray - Improve tracking of allocated resources to handle race
condition when reading state while buffer allocation is in progress.
-- If a job is submitted only with -c option and numcpus is updated before
the job starts update the cpus_per_task appropriately.
-- Update salloc/sbatch/srun documentation to mention time granularity.
-- Fixed memory leak when freeing assoc_mgr_info_msg_t.
-- Prevent possible use of empty reservation core bitmap, causing abort.
-- Remove unneeded pack32's from qos_rec when qos_rec is NULL.
-- Make sacctmgr print MaxJobsPerUser when adding/altering a QOS.
-- Correct dependency formatting to print array task ids if set.
-- Update sacctmgr help with current QOS options.
-- Update slurmstepd to initialize authentication before task launch.
-- burst_cray/cray: Eliminate need for dedicated nodes.
-- If no MsgAggregationParams is set don't set the internal string to
anything. The slurmd will process things correctly after the fact.
-- Fix output from api when printing job step not found.
-- Don't allow user specified reservation names to disrupt the normal
reservation sequeuece numbering scheme.
-- Fix scontrol to be able to accept TRES as an option when creating
a reservation.
-- contrib/torque/qstat.pl - return exit code of zero even with no records
printed for 'qstat -u'.
-- When a reservation is created or updated, compress user provided node names
using hostlist functions (e.g. translate user input of "Nodes=tux1,tux2"
into "Nodes=tux[1-2]").
-- Change output routines for scontrol show partition/reservation to handle
unexpectedly large strings.
-- Add more partition fields to "scontrol write config" output file.
-- Backfill scheduling fix: If a job can't be started due to a "group" resource
limit, rather than reserve resources for it when the next job ends, don't
reserve any resources for it.
-- Avoid slurmstepd abort if malloc fails during accounting gather operation.
-- Fix nodes from being overallocated when allocation straddles multiple nodes.
-- Fix memory leak in slurmctld job array logic.
-- Prevent decrementing of TRESRunMins when AccountingStorageEnforce=limits is
not set.
-- Fix backfill scheduling bug which could postpone the scheduling of jobs due
to avoidance of nodes in COMPLETING state.
-- Properly account for memory, CPUs and GRES when slurmctld is reconfigured
while there is a suspended job. Previous logic would add the CPUs, but not
memory or GPUs. This would result in underflow/overflow errors in select
cons_res plugin.
-- Strip flags from a job state in qstat wrapper before evaluating.
-- Add missing job states from the qstat wrapper.
-- Cleanup output routines to reduce number of fixed-sized buffer function
calls and allow for unexpectedly large strings.
* Changes in Slurm 15.08.6
==========================
-- In slurmctld log file, log duplicate job ID found by slurmd. Previously was
being logged as prolog/epilog failure.
-- If a job is requeued while in the process of being launch, remove it's
job ID from slurmd's record of active jobs in order to avoid generating a
duplicate job ID error when launched for the second time (which would
drain the node).
-- Cleanup messages when handling job script and environment variables in
older directory structure formats.
-- Prevent triggering gang scheduling within a partition if configured with
PreemptType=partition_prio and PreemptMode=suspend,gang.
-- Decrease parallelism in job cancel request to prevent denial of service
when cancelling huge numbers of jobs.
-- If all ephemeral ports are in use, try using other port numbers.
-- Revert way lib lua is handled when doing a dlopen, fixing a regression in
15.08.5.
-- Set the debug level of the rmdir message in xcgroup_delete() to debug2.
-- Fix the qstat wrapper when user is removed from the system but still
has running jobs.
-- Log the request to terminate a job at info level if DebugFlags includes
the Steps keyword.
-- Fix potential memory corruption in _slurm_rpc_epilog_complete as well as
_slurm_rpc_complete_job_allocation.
-- Fix cosmetic display of AccountingStorageEnforce option "nosteps" when
in use.
-- If a job can never be started due to unsatisfied job dependencies, report
the full original job dependency specification rather than the dependencies
remaining to be satisfied (typically NULL).
-- Refactor logic to synchronize active batch jobs and their script/environment
files, reducing overhead dramatically for large numbers of active jobs.
-- Avoid hard-link/copy of script/environment files for job arrays. Use the
master job record file for all tasks of the job array.
NOTE: Job arrays submitted to Slurm version 15.08.6 or later will fail if
the slurmctld daemon is downgraded to an earlier version of Slurm.
-- Move slurmctld mail handler to separate thread for improved performance.
-- Fix containment of adopted processes from pam_slurm_adopt.
-- If a pending job array has multiple reasons for being in a pending state,
then print all reasons in a comma separated list.
* Changes in Slurm 15.08.5
==========================
-- Prevent "scontrol update job" from updating jobs that have already finished.
-- Show requested TRES in "squeue -O tres" when job is pending.
-- Backfill scheduler: Test association and QOS node limits before reserving
resources for pending job.
-- burst_buffer/cray: If teardown operations fails, sleep and retry.
-- Clean up the external pids when using the PrologFlags=Contain feature
and the job finishes.
-- burst_buffer/cray: Support file staging when job lacks job-specific buffer
(i.e. only persistent burst buffers).
-- Added srun option of --bcast to copy executable file to compute nodes.
-- Fix for advanced reservation of burst buffer space.
-- BurstBuffer/cray: Add logic to terminate dw_wlm_cli child processes at
shutdown.
-- If job can't be launch or requeued, then terminate it.
-- BurstBuffer/cray: Enable clearing of burst buffer string on completed job
as a means of recovering from a failure mode.
-- Fix wrong memory free when parsing SrunPortRange=0-0 configuration.
-- BurstBuffer/cray: Fix job record purging if cancelled from pending state.
-- BGQ - Handle database throw correctly when syncing users on blocks.
-- MySQL - Make sure we don't have a NULL string returned when not
requesting any specific association.
-- sched/backfill: If max_rpc_cnt is configured and the backlog of RPCs has
not cleared after yielding locks, then continue to sleep.
-- Preserve the job dependency description displayed in 'scontrol show job'
even if the dependee jobs was terminated and cleaned causing the
dependent to never run because of DependencyNeverSatisfied.
-- Correct job task count calculation if only node count and ntasks-per-node
options supplied.
-- Make sure the association manager converts any string to be lower case
as all the associations from the database will be lower case.
-- Sanity check for xcgroup_delete() to verify incoming parameter is valid.
-- Fix formatting for sacct with variables that switched from uint32_t to
uint64_t.
-- Fix a typo in sacct man page.
-- Set up extern step to track any children of an ssh if it leaves anything
else behind.
-- Prevent slurmdbd divide by zero if no associations defined at rollup time.
-- Multifactor - Add sanity check to make sure pending jobs are handled
correctly when PriorityFlags=CALCULATE_RUNNING is set.
-- Add slurmdb_find_tres_count_in_string() to slurm db perl api.
-- Make lua dlopen() conditional on version found at build.
-- sched/backfill - Delay backfill scheduler for completing jobs only if
CompleteWait configuration parameter is set (make code match documentation).
-- Release a job's allocated licenses only after epilog runs on all nodes
rather than at start of termination process.
-- Cray job NHC delayed until after burst buffer released and epilog completes
on all allocated nodes.
-- Fix abort of srun if using PrologFlags=NoHold
-- Let devices step_extern cgroup inherit attributes of job cgroup.
-- Add new hook to Task plugin to be able to put adopted processes in the
step_extern cgroups.
-- Fix AllowUsers documentation in burst_buffer.conf man page. Usernames are
comma separated, not colon delimited.
-- Fix issue with time limit not being set correctly from a QOS when a job
requests no time limit.
-- Various CLANG fixes.
-- In both sched/basic and backfill: If a job can not be started due to some
account/qos limit, then don't start other jobs which could delay jobs. The
old logic would skip the job and start other jobs, which could delay the
higher priority job.
-- select/cray: Prevent NHC from running more than once per job or step.
-- Fix fields not properly printed when adding an account through sacctmgr.
-- Update LBNL Node Health Check (NHC) link on FAQ.
-- Fix multifactor plugin to prevent slurmctld from getting segmentation fault
should the tres_alloc_cnt be NULL.
-- sbatch/salloc - Move nodelist logic before the time min_nodes is used
so we can set it correctly before tasks are set.
* Changes in Slurm 15.08.4
==========================
-- Fix typo for the "devices" cgroup subsystem in pam_slurm_adopt.c
-- Fix TRES_MAX flag to work correctly.
-- Improve the systemd startup files.
-- Added burst_buffer.conf flag parameter of "TeardownFailure" which will
teardown and remove a burst buffer after failed stage-in or stage-out.
By default, the buffer will be preserved for analysis and manual teardown.
-- Prevent a core dump in srun if the signal handler runs during the job
allocation causing the step context to be NULL.
-- Don't fail job if multiple prolog operations in progress at slurmctld
restart time.
-- Burst_buffer/cray: Fix to purge terminated jobs with burst buffer errors.
-- Burst_buffer/cray: Don't stall scheduling of other jobs while a stage-in
is in progress.
-- Make it possible to query 'extern' step with sstat.
-- Make 'extern' step show up in the database.
-- MYSQL - Quote assoc table name in mysql query.
-- Make SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, and SLURM_ARRAY_TASK_STEP
environment variables available to PrologSlurmctld and EpilogSlurmctld.
-- Fix slurmctld bug in which a pending job array could be canceled
by a user different from the owner or the administrator.
-- Support taking node out of FUTURE state with "scontrol reconfig" command.
-- Sched/backfill: Fix to properly enforce SchedulerParameters of
bf_max_job_array_resv.
-- Enable operator to reset sdiag data.
-- jobcomp/elasticsearch plugin: Add array_job_id and array_task_id fields.
-- Remove duplicate #define IS_NODE_POWER_UP.
-- Added SchedulerParameters option of max_script_size.
-- Add REQUEST_ADD_EXTERN_PID option to add pid to the slurmstepd's extern
step.
-- Add unique identifiers to anchor tags in HTML generated from the man pages.
-- Add with_freeipmi option to spec file.
-- Minor elasticsearch code improvements
* Changes in Slurm 15.08.3
==========================
-- Correct Slurm's RPM build if Munge is not installed.
-- Job array termination status email ExitCode based upon highest exit code
from any task in the job array rather than the last task. Also change the
state from "Ended" or "Failed" to "Mixed" where appropriate.
-- Squeue recombines pending job array records only if their name and partition
are identical.
-- Fix some minor leaks in the job info and step info API.
-- Export missing QOS id when filling in association with the association
manager.
-- Fix invalid reference if a lua job_submit plugin references a default qos
when a user doesn't exist in the database.
-- Use association enforcement in the lua plugin.
-- Fix a few spots missing defines of accounting_enforce or acct_db_conn
in the plugins.
-- Show requested TRES in scontrol show jobs when job is pending.
-- Improve sched/backfill support for job features, especially XOR construct.
-- Correct scheduling logic for job features option with XOR construct that
could delay a job's initiation.
-- Remove unneeded frees when creating a tres string.
-- Send a tres_alloc_str for the batch step
-- Fix incorrect check for slurmdb_find_tres_count_in_string in various places,
it needed to check for INFINITE64 instead of zero.
-- Don't allow scontrol to create partitions with the name "DEFAULT".
-- burst_buffer/cray: Change error from "invalid request" to "permssion denied"
if a non-authorized user tries to create/destroy a persistent buffer.
-- PrologFlags work: Setting a flag of "Contain" implicitly sets the "Alloc"
flag. Fix code path which could prevent execution of the Prolog when the
"Alloc" or "Contain" flag were set.
-- Fix for acct_gather_energy/cray|ibmaem to work with missed enum.
-- MYSQL - When inserting a job and begin_time is 0 do not set it to
submit_time. 0 means the job isn't eligible yet so we need to treat it so.
-- MYSQL - Don't display ineligible jobs when querying for a window of time.
-- Fix creation of advanced reservation of cores on nodes which are DOWN.
-- Return permission denied if regular user tries to release job held by an
administrator.
-- MYSQL - Fix rollups for multiple jobs running by the same association
in an hour counting multiple times.
-- Burstbuffer/Cray plugin - Fix for persistent burst buffer use.
Don't call paths if no #DW options.
-- Modifications to pam_slurm_adopt to work correctly for the "extern" step.
-- Alphabetize debugflags when printing them out.
-- Fix systemd's slurmd service from killing slurmstepds on shutdown.
-- Fixed counter of not indexed jobs, error_cnt post-increment changed to
pre-increment.
* Changes in Slurm 15.08.2
==========================
-- Fix for tracking node state when jobs that have been allocated exclusive
access to nodes (i.e. entire nodes) and later relinquish some nodes. Nodes
would previously appear partly allocated and prevent use by other jobs.
-- Correct some cgroup paths ("step_batch" vs. "step_4294967294", "step_exter"
vs. "step_extern", and "step_extern" vs. "step_4294967295").
-- Fix advanced reservation core selection logic with network topology.
-- MYSQL - Remove restriction to have to be at least an operator to query TRES
values.
-- For pending jobs have sacct print 0 for nnodes instead of the bogus 2.
-- Fix for tracking node state when jobs that have been allocated exclusive
access to nodes (i.e. entire nodes) and later relinquish some nodes. Nodes
would previously appear partly allocated and prevent use by other jobs.
-- Fix updating job in db after extending job's timelimit past partition's
timelimit.
-- Fix srun -I<timeout> from flooding the controller with step create requests.
-- Requeue/hold batch job launch request if job already running (possible if
node went to DOWN state, but jobs remained active).
-- If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
then increase it's allocated CPU count in order to enforce CPU limits.
-- Don't mark powered down node as not responding. This could be triggered by
race condition of the node suspend and ping logic, preventing use of the
node.
-- Don't requeue RPC going out from slurmctld to DOWN nodes (can generate
repeating communication errors).
-- Propagate sbatch "--dist=plane=#" option to srun.
-- Add acct_gather_energy/ibmaem plugin for systems with IBM Systems Director
Active Energy Manager.
-- Fix spec file to look for mariadb or mysql devel packages for build
requirements.
-- MySQL - Improve the code with asking for jobs in a suspended state.
-- Fix slurcmtld allowing root to see job steps using squeues -s.
-- Do not send burst buffer stage out email unless the job uses burst buffers.
-- Fix sacct to not return all jobs if the -j option is given with a trailing
','.
-- Permit job_submit plugin to set a job's priority.
-- Fix occasional srun segfault.
-- Fix issue with sacct, printing 0_0 for array's that had finished in the
database but the start record hadn't made it yet.
-- sacctmgr - Don't allow default account associations to be removed
from a user.
-- Fix sacct -j, (nothing but a comma) to not return all jobs.
-- Fixed slurmctld not sending cold-start messages correctly to the database
when a cold-start (-c) happens to the slurmctld.
-- Fix case where if the backup slurmdbd has existing connections when it gives
up control that the it would be killed.
-- Fix task/cgroup affinity to work correctly with multi-socket
single-threaded cores. A regression caused only 1 socket to be used on
this kind of node instead of all that were available.
-- MYSQL - Fix minor issue after an index was added to the database it would
previously take 2 restarts of the slurmdbd to make it stick correctly.
-- Add hv_to_qos_cond() and qos_rec_to_hv() functions to the Perl interface.
-- Add new burst_buffer.conf parameters: ValidateTimeout and OtherTimeout.
See man page for details.
-- Fix burst_buffer/cray support for interactive allocations >4GB.
-- Correct backfill scheduling logic for job with INFINITE time limit.
-- Fix issue on a scontrol reconfig all available GRES/TRES would be zeroed
out.
-- Set SLURM_HINT environment variable when --hint is used with sbatch or
salloc.
-- Add scancel -f/--full option to signal all steps including batch script and
all of its child processes.
-- Fix salloc -I to accept an argument.
-- Avoid reporting more allocated CPUs than exist on a node. This can be
triggered by resuming a previosly suspended job, resulting in
oversubscription of CPUs.
-- Fix the pty window manager in slurmstepd not to retry IO operation with
srun if it read EOF from the connection with it.
-- sbatch --ntasks option to take precedence over --ntasks-per-node plus node
count, as documented. Set SLURM_NTASKS/SLURM_NPROCS environment variables
accordingly.
-- MYSQL - Make sure suspended time is only subtracted from the CPU TRES
as it is the only TRES that can be given to another job while suspended.
-- Clarify how TRESBillingWeights operates on memory and burst buffers.
* Changes in Slurm 15.08.1
==========================
-- Fix test21.30 and 21.34 to check grpwall better.
-- Add time to the partition QOS the job is running on instead of just the
job QOS.
-- Print usage for GrpJobs, GrpSubmitJobs and GrpWall even if there is no
limit.
-- If AccountingEnforce=safe is set make sure a job can finish before going
over the limit with grpwall on a QOS or association.
-- burst_buffer/cray - Major updates based upon recent Cray changes.
-- Improve clean up logic of pmi2 plugin.
-- Improve job state reason string when required nodes not available.
-- Fix missing else when packing an update partition message
-- Fix srun from inheriting the SLURM_CPU_BIND and SLURM_MEM_BIND environment
variables when running in an existing srun (e.g. an srun within an salloc).
-- Fix missing else when packing an update partition message.
-- Use more flexible mechnanism to find json installation.
-- Make sure safe_limits was initialized before processing limits in the
slurmctld.
-- Fix for burst_buffer/cray to parse type option correctly.
-- Fix memory error and version number in the nonstop plugin and reservation
code.
-- When requesting GRES in a step check for correct variable for the count.
-- Fix issue with GRES in steps so that if you have multiple exclusive steps
and you use all the GRES up instead of reporting the configuration isn't
available you hold the requesting step until the GRES is available.
-- MYSQL - Change debug to print out with DebugFlags=DB_Step instead of debug4
-- Simplify code when user is selecting a job/step/array id and removed
anomaly when only asking for 1 (task_id was never set to INFINITE).
-- MYSQL - If user is requesting various task_ids only return requested steps.
-- Fix issue when tres cnt for energy is 0 for total reported.
-- Resolved scalability issues of power adaptive scheduling with layouts.
-- Burst_buffer/cray bug - Fix teardown race condition that can result in
infinite loop.
-- Add support for --mail-type=NONE option.
-- Job "--reboot" option automatically, set's exclusive node mode.
-- Fix memory leak when using PrologFlags=Alloc.
-- Fix truncation of job reason in squeue.
-- If a node is in DOWN or DRAIN state, leave it unavailable for allocation
when powered down.
-- Update the slurm.conf man page documenting better nohold_on_prolog_fail
variable.
-- Don't trucate task ID information in "squeue --array/-r" or "sview".
-- Fix a bug which caused scontrol to core dump when releasing or
holding a job by name.
-- Fix unit conversion bug in slurmd which caused wrong memory calculation
for cgroups.
-- Fix issue with GRES in steps so that if you have multiple exclusive steps
and you use all the GRES up instead of reporting the configuration isn't
available you hold the requesting step until the GRES is available.
-- Fix slurmdbd backup to use DbdAddr when contacting the primary.
-- Fix error in MPI documentation.
-- Fix to handle arrays with respect to number of jobs submitted. Previously
only 1 job was accounted (against MaxSubmitJob) for when an array was
submitted.
-- Correct counting for job array limits, job count limit underflow possible
when master cancellation of master job record.
-- Combine 2 _valid_uid_gid functions into a single function to avoid
diversion.
-- Pending job array records will be combined into single line by default,
even if started and requeued or modified.
-- Fix sacct --format=nnodes to print out correct information for pending
jobs.
-- Make is so 'scontrol update job 1234 qos='' will set the qos back to
the default qos for the association.
-- Add [Alloc|Req]Nodes to sacct to be more like cpus.
-- Fix sacct documentation about [Alloc|Req]TRES
-- Put node count in TRES string for steps.
-- Fix issue with wrong protocol version when using the srun --no-allocate
option.
-- Fix TRES counts on GRES on a clean start of the slurmctld.
-- Add ability to change a job array's maximum running task count:
"scontrol update jobid=# arraytaskthrottle=#"
* Changes in Slurm 15.08.0
==========================
-- Fix issue with frontend systems (outside ALPs or BlueGene) where srun
wouldn't get the correct protocol version to launch a step.
-- Fix for message aggregation return rpcs where none of the messages are
intended for the head of the tree.
-- Fix segfault in sreport when there was no response from the dbd.
-- ALPS - Fix compile to not link against -ljob and -lexpat with every lib
or binary.
-- Fix testing for CR_Memory when CR_Memory and CR_ONE_TASK_PER_CORE are used
with select/linear.
-- When restarting or reconfiging the slurmctld, if job is completing handle
accounting correctly to avoid meaningless errors about overflow.
-- Add AccountingStorageTRES to scontrol show config
-- MySQL - Fix minor memory leak if a connection ever goes away whist using it.
-- ALPS - Make it so srun --hint=nomultithread works correctly.
-- Make MaxTRESPerUser work in sacctmgr.
-- Fix handling of requeued jobs with steps that are still finishing.
-- Cleaner copy for PriorityWeightTRES, it also fixes a core dump when trying
to free it otherwise.
-- Add environment variables SLURM_ARRAY_TASK_MAX, SLURM_ARRAY_TASK_MIN,
SLURM_ARRAY_TASK_STEP for job arrays.
-- Fix srun to use the NoInAddrAny TopologyParam option.
-- Change QOS flag name from PartitionQOS to OverPartQOS to be a better
description.
-- Fix rpmbuild issue on Centos7.
* Changes in Slurm 15.08.0rc1
==============================
-- Added power_cpufreq layout.
-- Make complete_batch_script RPC work with message aggregation.
-- Do not count slurmctld threads waiting in a "throttle" lock against the
daemon's thread limit as they are not contending for resources.
-- Modify slurmctld outgoing RPC logic to support more parallel tasks (up to
85 RPCs and 256 pthreads; the old logic supported up to 21 RPCs and 256
threads). This change can dramatically improve performance for RPCs
operating on small node counts.
-- Increase total backfill scheduler run time in stats_info_response_msg data
structure from 32 to 64 bits in order to prevent overflow.
-- Add NoInAddrAny option to TopologyParam in the slurm.conf which allows to
bind to the interface of return of gethostname instead of any address on
the node which avoid RSIP issues in Cray systems. This is most likely
useful in other systems as well.
-- Fix memory leak in Slurm::load_jobs perl api call.
-- Added --noconvert option to sacct, sstat, squeue and sinfo which allows
values to be displayed in their original unit types (e.g. 2048M won't be
converted to 2G).
-- Fix spelling of node_rescrs to node_resrcs in Perl API.
-- Fix node state race condition, UNKNOWN->IDLE without configuration info.
-- Cray: Disable LDAP references from slurmstepd on job launch due for
improved scalability.
-- Remove srun "read header error" due to application termination race
condition.
-- Optimize sacct queries with additional db indexes.
-- Add SLURM_TOPO_LEN env variable for scontrol show topology.
-- Add free_mem to node information.
-- Fix abort of batch launch if prolog is running, wait for prolog instead.
-- Fix case where job would get the wrong cpu count when using
--ntasks-per-core and --cpus-per-task together.
-- Add TRESBillingWeights to partitions in slurm.conf which allows taking into
consideration any TRES Type when calculating the usage of a job.
-- Add PriorityWeightTRES slurm.conf option to be able to configure priority
factors for TRES types.
* Changes in Slurm 15.08.0pre6
==============================
-- Add scontrol options to view and modify layouts tables.
-- Add MsgAggregationParams which controls a reverse tree to the slurmctld
which can be used to aggregate messages to the slurmctld into a single
message to reduce communication to the slurmctld. Currently only epilog
complete messages and node registration messages use this logic.
-- Add sacct and squeue options to print trackable resources.
-- Add sacctmgr option to display trackable resources.
-- If an salloc or srun command is executed on a "front-end" configuration,
that job will be assigned a slurmd shepherd daemon on the same host as used
to execute the command when possible rather than an slurmd daemon on an
arbitrary front-end node.
-- Add srun --accel-bind option to control how tasks are bound to GPUs and NIC
Generic RESources (GRES).
-- gres/nic plugin modified to set OMPI_MCA_btl_openib_if_include environment
variable based upon allocated devices (usable with OpenMPI and Melanox).
-- Make it so info options for srun/salloc/sbatch print with just 1 -v instead
of 4.
-- Add "no_backup_scheduling" SchedulerParameter to prevent jobs from being
scheduled when the backup takes over. Jobs can be submitted, modified and
cancelled while the backup is in control.
-- Enable native Slurm backup controller to reside on an external Cray node
when the "no_backup_scheduling" SchedulerParameter is used.
-- Removed TICKET_BASED fairshare. Consider using the FAIR_TREE algorithm.
-- Disable advanced reservation "REPLACE" option on IBM Bluegene systems.
-- Add support for control distribution of tasks across cores (in addition
to existing support for nodes and sockets, (e.g. "block", "cyclic" or
"fcyclic" task distribution at 3 levels in the hardware rather than 2).
-- Create db index on <cluster>_assoc_table.acct. Deleting accounts that didn't
have jobs in the job table could take a long time.
-- The performance of Profiling with HDF5 is improved. In addition, internal
structures are changed to make it easier to add new profile types,
particularly energy sensors. sh5util will continue to work with either
format.
-- Add partition information to sshare output if the --partition option
is specified on the sshare command line.
-- Add sreport -T/--tres option to identify Trackable RESources (TRES) to
report.
-- Display job in sacct when single step's cpus are different from the job
allocation.
-- Add association usage information to "scontrol show cache" command output.
-- MPI/MVAPICH plugin now requires Munge for authentication.
-- job_submit/lua: Add default_qos fields. Add job record qos. Add partition
record allow_qos and qos_char fields.
* Changes in Slurm 15.08.0pre5
==============================
-- Add jobcomp/elasticsearch plugin. Libcurl is required for build. Configure
the server as follows: "JobCompLoc=http://YOUR_ELASTICSEARCH_SERVER:9200".
-- Scancel logic large re-written to better support job arrays.
-- Added a slurm.conf parameter PrologEpilogTimeout to control how long
prolog/epilog can run.
-- Added TRES (Trackable resources) to track Mem, GRES, license, etc
utilization.
-- Add re-entrant versions of glibc time functions (e.g. localtime) to Slurm
in order to eliminate rare deadlock of slurmstepd fork and exec calls.
-- Constrain kernel memory (if available) in cgroups.
-- Add PrologFlags option of "Contain" to create a proctrack container at
job resource allocation time.
-- Disable the OOM Killer in slurmd and slurmstepd's memory cgroup when using
MemSpecLimit.
* Changes in Slurm 15.08.0pre4
==============================
-- Burst_buffer/cray - Convert logic to use new commands/API names (e.g.
"dws_setup" rather than "bbs_setup").
-- Remove the MinJobAge size limitation. It can now exceed 65533 as it
is represented using an unsigned integer.
-- Verify that all plugin version numbers are identical to the component
attempting to load them. Without this verification, the plugin can reference
Slurm functions in the caller which differ (e.g. the underlying function's
arguments could have changed between Slurm versions).
NOTE: All plugins (except SPANK) must be built against the identical
version of Slurm in order to be used by any Slurm command or daemon. This
should eliminate some very difficult to diagnose problems due to use of old
plugins.
-- Increase the MAX_PACK_MEM_LEN define to avoid PMI2 failure when fencing
with large amount of ranks (to 1GB).
-- Requests by normal user to reset a job priority (even to lower it) will
result in an error saying to change the job's nice value instead.
-- SPANK naming changes: For environment variables set using the
spank_job_control_setenv() function, the values were available in the
slurm_spank_job_prolog() and slurm_spank_job_epilog() functions using
getenv where the name was given a prefix of "SPANK_". That prefix has
been removed for consistency with the environment variables available in
the Prolog and Epilog scripts.
-- Major additions to the layouts framework code.
-- Add "TopologyParam" configuration parameter. Optional value of "dragonfly"
is supported.
-- Optimize resource allocation for systems with dragonfly networks.
-- Add "--thread-spec" option to salloc, sbatch and srun commands. This is
the count of threads reserved for system use per node.
-- job_submit/lua: Enable reading and writing job environment variables.
For example: if (job_desc.environment.LANGUAGE == "en_US") then ...
-- Added two new APIs slurm_job_cpus_allocated_str_on_node_id()
and slurm_job_cpus_allocated_str_on_node() to print the CPUs id
allocated to a job.
-- Specialized memory (a node's MemSpecLimit configuration parameter) is not
available for allocation to jobs.
-- Modify scontrol update job to allow jobid specification without
the = sign. 'scontrol update job=123 ...' and 'scontrol update job 123 ...'
are both valid syntax.
-- Archive a month at a time when there are lots of records to archive.
-- Introduce new sbatch option '--kill-on-invalid-dep=yes|no' which allows
users to specify which behavior they want if a job dependency is not
satisfied.
-- Add Slurmdb::qos_get() interface to perl api.
-- If a job fails to start set the requeue reason to be:
job requeued in held state.
-- Implemented a new MPI key,value PMIX_RING() exchange algorithm as
an alternative to PMI2.
-- Remove possible deadlocks in the slurmctld when the slurmdbd is busy
archiving/purging.
-- Add DB_ARCHIVE debug flag for filtering out debug messages in the slurmdbd
when the slurmdbd is archiving/purging.
-- Fix some power_save mode issues: Parsing of SuspendTime in slurm.conf was
bad, powered down nodes would get set non-responding if there was an
in-flight message, and permit nodes to be powered down from any state.
-- Initialize variables in consumable resource plugin to prevent core dump.
* Changes in Slurm 15.08.0pre3
==============================
-- CRAY - addition of acct_gather_energy/cray plugin.
-- Add job credential to "Run Prolog" RPC used with a configuration of
PrologFlags=alloc. This allows the Prolog to be passed identification of
GPUs allocated to the job.
-- Add SLURM_JOB_CONSTAINTS to environment variables available to the Prolog.
-- Added "--mail=stage_out" option to job submission commands to notify user
when burst buffer state out is complete.
-- Require a "Reason" when using scontrol to set a node state to DOWN.
-- Mail notifications on job BEGIN, END and FAIL now apply to a job array as a
whole rather than generating individual email messages for each task in the
job array.
-- task/affinity - Fix memory binding to NUMA with cpusets.
-- Display job's estimated NodeCount based off of partition's configured
resources rather than the whole system's.
-- Add AuthInfo option of "cred_expire=#" to specify the lifetime of a job
step credential. The default value was changed from 1200 to 120 seconds.
-- Set the delay time for job requeue to the job credential lifetime (120
seconds by default). This insures that prolog runs on every node when a
job is requeued. (This change will slow down launch of re-queued jobs).
-- Add AuthInfo option of "cred_expire=#" to specify the lifetime of a job
step credential.
-- Remove srun --max-launch-time option. The option has not been functional
since Slurm version 2.0.
-- Add sockets and cores to TaskPluginParams' autobind option.
-- Added LaunchParameters configuration parameter. Have srun command test
locally for the executable file if LaunchParameters=test_exec or the
environment variable SLURM_TEST_EXEC is set. Without this an invalid
command will generate one error message per task launched.
-- Fix the slurm /etc/init.d script to return 0 upon stopping the
daemons and return 1 in case of failure.
-- Add the ability for a compute node to be allocated to multiple jobs, but
restricted to a single user. Added "--exclusive=user" option to salloc,
sbatch and srun commands. Added "owner" field to node record, visible using
the scontrol and sview commands. Added new partition configuration parameter
"ExclusiveUser=yes|no".
* Changes in Slurm 15.08.0pre2
==============================
-- Add the environment variables SLURM_JOB_ACCOUNT, SLURM_JOB_QOS
and SLURM_JOB_RESERVATION in the batch/srun jobs.
-- Add sview burst buffer display.
-- Properly enforce partition Shared=YES option. Previously oversubscribing
resources required gang scheduling to be configured.
-- Enable per-partition gang scheduling resource resolution (e.g. the partition
can have SelectTypeParameters=CR_CORE, while the global value is CR_SOCKET).
-- Make it so a newer version of a slurmstepd can talk to an older srun.
allocation. Nodes could have been added while waiting for an allocation.
-- Expanded --cpu-freq parameters to include min-max:governor specifications.
--cpu-freq now supported on salloc and sbatch.
-- Add support for optimized job allocations with respect to SGI Hypercube
topology.
NOTE: Only supported with select/linear plugin.
NOTE: The program contribs/sgi/netloc_to_topology can be used to build
Slurm's topology.conf file.
-- Remove 64k validation of incoming RPC nodelist size. Validated at 64MB
when unpacking.
-- In slurmstepd() add the user primary group if it is not part of the
groups sent from the client.
-- Added BurstBuffer field to advanced reservations.
-- For advanced reservation, replace flag "License_only" with flag "Any_Nodes".
It can be used to indicate the an advanced reservation resources (licenses
and/or burst buffers) can be used with any compute nodes.
-- Allow users to specify the srun --resv-ports as 0 in which case no ports
will be reserved. The default behaviour is to allocate one port per task.
-- Interpret a partition configuration of "Nodes=ALL" in slurm.conf as
including all nodes defined in the cluster.
-- Added new configuration parameters PowerParameters and PowerPlugin.
-- Added power management plugin infrastructure.
-- If job already exceeded one of its QOS/Accounting limits do not
return error if user modifies QOS unrelated job settings.
-- Added DebugFlags value of "Power".
-- When caching user ids of AllowGroups use both getgrnam_r() and getgrent_r()
then remove eventual duplicate entries.
-- Remove rpm dependency between slurm-pam and slurm-devel.
-- Remove support for the XCPU (cluster management) package.
-- Add Slurmdb::jobs_get() interface to perl api.
-- Performance improvement when sending data from srun to stepds when
processing fencing.
-- Add the feature to specify arbitrary field separator when running
sacct -p or sacct -P. The command line option is --separator.
-- Introduce slurm.conf parameter to use Proportional Set Size (PSS) instead
of RSS to determinate the memory footprint of a job.
Add an slurm.conf option not to kill jobs that is over memory limit.
-- Add job submission command options: --sicp (available for inter-cluster
dependencies) and --power (specify power management options) to salloc,
sbatch, and srun commands.
-- Add DebugFlags option of SICP (inter-cluster option logging).
-- In order to support inter-cluster job dependencies, the MaxJobID
configuration parameter default value has been reduced from 4,294,901,760
to 2,147,418,112 and it's maximum value is now 2,147,463,647.
ANY JOBS WITH A JOB ID ABOVE 2,147,463,647 WILL BE PURGED WHEN SLURM IS
UPGRADED FROM AN OLDER VERSION!
-- Add QOS name to the output of a partition in squeue/scontrol/sview/smap.
* Changes in Slurm 15.08.0pre1
==============================
-- Add sbcast support for file transfer to resources allocated to a job step
rather than a job allocation.
-- Change structures with association in them to assoc to save space.
-- Add support for job dependencies jointed with OR operator (e.g.
"--depend=afterok:123?afternotok:124").
-- Add "--bb" (burst buffer specification) option to salloc, sbatch, and srun.
-- Added configuration parameters BurstBufferParameters and BurstBufferType.
-- Added burst_buffer plugin infrastructure (needs many more functions).
-- Make it so when the fanout logic comes across a node that is down we abandon
the tree to avoid worst case scenarios when the entire branch is down and
we have to try each serially.
-- Add better error reporting of invalid partitions at submission time.
-- Move will-run test for multiple clusters from the sbatch code into the API
so that it can be used with DRMAA.
-- If a non-exclusive allocation requests --hint=nomultithread on a
CR_CORE/SOCKET system lay out tasks correctly.
-- Avoid including unused CPUs in a job's allocation when cores or sockets are
allocated.
-- Added new job state of STOPPED indicating processes have been stopped with a
SIGSTOP (using scancel or sview), but retain its allocated CPUs. Job state
returns to RUNNING when SIGCONT is sent (also using scancel or sview).
-- Added EioTimeout parameter to slurm.conf. It is the number of seconds srun
waits for slurmstepd to close the TCP/IP connection used to relay data
between the user application and srun when the user application terminates.
-- Remove slurmctld/dynalloc plugin as the work was never completed, so it is
not worth the effort of continued support at this time.
-- Remove DynAllocPort configuration parameter.
-- Add advance reservation flag of "replace" that causes allocated resources
to be replaced with idle resources. This maintains a pool of available
resources that maintains a constant size (to the extent possible).
-- Added SchedulerParameters option of "bf_busy_nodes". When selecting
resources for pending jobs to reserve for future execution (i.e. the job
can not be started immediately), then preferentially select nodes that are
in use. This will tend to leave currently idle resources available for
backfilling longer running jobs, but may result in allocations having less
than optimal network topology. This option is currently only supported by
the select/cons_res plugin.
-- Permit "SuspendTime=NONE" as slurm.conf value rather than only a numeric
value to match "scontrol show config" output.
-- Add the 'scontrol show cache' command which displays the associations
in slurmctld.
-- Test more frequently for node boot completion before starting a job.
Provides better responsiveness.
-- Fix PMI2 singleton initialization.
-- Permit PreemptType=qos and PreemptMode=suspend,gang to be used together.
A high-priority QOS job will now oversubscribe resources and gang schedule,
but only if there are insufficient resources for the job to be started
without preemption. NOTE: That with PreemptType=qos, the partition's
Shared=FORCE:# configuration option will permit one job more per resource
to be run than than specified, but only if started by preemption.
-- Remove the CR_ALLOCATE_FULL_SOCKET configuration option. It is now the
default.
-- Fix a race condition in PMI2 when fencing counters can be out of sync.
-- Increase the MAX_PACK_MEM_LEN define to avoid PMI2 failure when fencing
with large amount of ranks.
-- Add QOS option to a partition. This will allow a partition to have
all the limits a QOS has. If a limit is set in both QOS the partition
QOS will override the job's QOS unless the job's QOS has the
OverPartQOS flag set.
-- The task_dist_states variable has been split into "flags" and "base"
components. Added SLURM_DIST_PACK_NODES and SLURM_DIST_NO_PACK_NODES values
to give user greater control over task distribution. The srun --dist options
has been modified to accept a "Pack" and "NoPack" option. These options can
be used to override the CR_PACK_NODE configuration option.
* Changes in Slurm 14.11.12
===========================
-- Correct dependency formatting to print array task ids if set.
-- Fix for configuration of "AuthType=munge" and "AuthInfo=socket=..." with
alternate munge socket path.
-- BGQ - Remove redeclaration of job_read_lock.
-- BGQ - Tighter locks around structures when nodes/cables change state.
-- Fix job array formatting to allow return [0-100:2] display for arrays with
step functions rather than [0,2,4,6,8,...] .
-- Associations - prevent hash table corruption if uid initially unset for
a user, which can cause slurmctld to crash if that user is deleted.
-- Add cast to memory limit calculation to prevent integer overflow for
very large memory values.
-- Fix test cases to have proper int return signature.
* Changes in Slurm 14.11.11
===========================
-- Fix systemd's slurmd service from killing slurmstepds on shutdown.
-- Fix the qstat wrapper when user is removed from the system but still
has running jobs.
-- Log the request to terminate a job at info level if DebugFlags includes
the Steps keyword.
-- Fix potential memory corruption in _slurm_rpc_epilog_complete as well as
_slurm_rpc_complete_job_allocation.
-- Fix incorrectly sized buffer used by jobid2str which will cause buffer
overflow in slurmctld. (Bug 2295.)
* Changes in Slurm 14.11.10
===========================
-- Fix truncation of job reason in squeue.
-- If a node is in DOWN or DRAIN state, leave it unavailable for allocation
when powered down.
-- Update the slurm.conf man page documenting better nohold_on_prolog_fail
variable.
-- Don't trucate task ID information in "squeue --array/-r" or "sview".
-- Fix a bug which caused scontrol to core dump when releasing or
holding a job by name.
-- Fix unit conversion bug in slurmd which caused wrong memory calculation
for cgroups.
-- Fix issue with GRES in steps so that if you have multiple exclusive steps
and you use all the GRES up instead of reporting the configuration isn't
available you hold the requesting step until the GRES is available.
-- Fix slurmdbd backup to use DbdAddr when contacting the primary.
-- Fix error in MPI documentation.
-- Fix to handle arrays with respect to number of jobs submitted. Previously
only 1 job was accounted (against MaxSubmitJob) for when an array was
submitted.
-- Correct counting for job array limits, job count limit underflow possible
when master cancellation of master job record.
-- For pending jobs have sacct print 0 for nnodes instead of the bogus 2.
-- Fix for tracking node state when jobs that have been allocated exclusive
access to nodes (i.e. entire nodes) and later relinquish some nodes. Nodes
would previously appear partly allocated and prevent use by other jobs.
-- Fix updating job in db after extending job's timelimit past partition's
timelimit.
-- Fix srun -I<timeout> from flooding the controller with step create requests.
-- Requeue/hold batch job launch request if job already running (possible if
node went to DOWN state, but jobs remained active).
-- If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
then increase it's allocated CPU count in order to enforce CPU limits.
-- Don't mark powered down node as not responding. This could be triggered by
race condition of the node suspend and ping logic.
-- Don't requeue RPC going out from slurmctld to DOWN nodes (can generate
repeating communication errors).
-- Propagate sbatch "--dist=plane=#" option to srun.
-- Fix sacct to not return all jobs if the -j option is given with a trailing
','.
-- Permit job_submit plugin to set a job's priority.
-- Fix occasional srun segfault.
-- Fix issue with sacct, printing 0_0 for array's that had finished in the
database but the start record hadn't made it yet.
-- Fix sacct -j, (nothing but a comma) to not return all jobs.
-- Prevent slurmstepd from core dumping if /proc/<pid>/stat has
unexpected format.
* Changes in Slurm 14.11.9
==========================
-- Correct "sdiag" backfill cycle time calculation if it yields locks. A
microsecond value was being treated as a second value resulting in an
overflow in the calcuation.
-- Fix segfault when updating timelimit on jobarray task.
-- Fix to job array update logic that can result in a task ID of 4294967294.
-- Fix of job array update, previous logic could fail to update some tasks
of a job array for some fields.
-- CRAY - Fix seg fault if a blade is replaced and slurmctld is restarted.
-- Fix plane distribution to allocate in blocks rather than cyclically.
-- squeue - Remove newline from job array ID value printed.
-- squeue - Enable filtering for job state SPECIAL_EXIT.
-- Prevent job array task ID being inappropriately set to NO_VAL.
-- MYSQL - Make it so you don't have to restart the slurmctld
to gain the correct limit when a parent account is root and you
remove a subaccount's limit which exists on the parent account.
-- MYSQL - Close chance of setting the wrong limit on an association
when removing a limit from an association on multiple clusters
at the same time.
-- MYSQL - Fix minor memory leak when modifying an association but no
change was made.
-- srun command line of either --mem or --mem-per-cpu will override both the
SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE environment variables.
-- Prevent slurmctld abort on update of advanced reservation that contains no
nodes.
-- ALPS - Revert commit 2c95e2d22 which also removes commit 2e2de6a4 allowing
cray with the SubAllocate option to work as it did with 2.5.
-- Properly parse CPU frequency data on POWER systems.
-- Correct sacct.a man pages describing -i option.
-- Capture salloc/srun information in sdiag statistics.
-- Fix bug in node selection with topology optimization.
-- Don't set distribution when srun requests 0 memory.
-- Read in correct number of nodes from SLURM_HOSTFILE when specifying nodes
and --distribution=arbitrary.
-- Fix segfault in Bluegene setups where RebootQOSList is defined in
bluegene.conf and accounting is not setup.
-- MYSQL - Update mod_time when updating a start job record or adding one.
-- MYSQL - Fix issue where if an association id ever changes on at least a
portion of a job array is pending after it's initial start in the
database it could create another row for the remain array instead
of using the already existing row.
-- Fix scheduling anomaly with job arrays submitted to multiple partitions,
jobs could be started out of priority order.
-- If a host has suspened jobs do not reboot it. Reboot only hosts
with no jobs in any state.
-- ALPS - Fix issue when using --exclusive flag on srun to do the correct
thing (-F exclusive) instead of -F share.
-- Fix various memory leaks in the Perl API.
-- Fix a bug in the controller which display jobs in CF state as RUNNING.
-- Preserve advanced _core_ reservation when nodes added/removed/resized on
slurmctld restart. Rebuild core_bitmap as needed.
-- Fix for non-standard Munge port location for srun/pmi use.
-- Fix gang scheduling/preemption issue that could cancel job at startup.
-- Fix a bug in squeue which prevented squeue -tPD to print array jobs.
-- Sort job arrays in job queue according to array_task_id when priorities are
equal.
-- Fix segfault in sreport when there was no response from the dbd.
-- ALPS - Fix compile to not link against -ljob and -lexpat with every lib
or binary.
-- Fix testing for CR_Memory when CR_Memory and CR_ONE_TASK_PER_CORE are used
with select/linear.
-- MySQL - Fix minor memory leak if a connection ever goes away whist using it.
-- ALPS - Make it so srun --hint=nomultithread works correctly.
-- Prevent job array task ID from being reported as NO_VAL if last task in the
array gets requeued.
-- Fix some potential deadlock issues when state files don't exist in the
association manager.
-- Correct RebootProgram logic when executed outside of a maintenance
reservation.
-- Requeue job if possible when slurmstepd aborts.
* Changes in Slurm 14.11.8
==========================
-- Eliminate need for user to set user_id on job_update calls.
-- Correct list of unavailable nodes reported in a job's "reason" field when
that job can not start.
-- Map job --mem-per-cpu=0 to --mem=0.
-- Fix squeue -o %m and %d unit conversion to Megabytes.
-- Fix issue with incorrect time calculation in the priority plugin when
a job runs past it's time limit.
-- Prevent users from setting job's partition to an invalid partition.
-- Fix sreport core dump when requesting
'job SizesByAccount grouping=individual'.
-- select/linear: Correct count of CPUs allocated to job on system with
hyperthreads.
-- Fix race condition where last array task might not get updated in the db.
-- CRAY - Remove libpmi from rpm install
-- Fix squeue -o %X output to correctly handle NO_VAL and suffix.
-- When deleting a job from the system set the job_id to 0 to avoid memory
corruption if thread uses the pointer basing validity off the id.
-- Fix issue where sbatch would set ntasks-per-node to 0 making any srun
afterward cause a divide by zero error.
-- switch/cray: Refine logic to set PMI_CRAY_NO_SMP_ENV environment variable.
-- When sacctmgr loads archives with version less than 14.11 set the array
task id to NO_VAL, so sacct can display the job ids correctly.
-- When using memory cgroup if a task uses more memory than requested
the failures are logged into memory.failcnt count file by cgroup
and the user is notified by slurmstepd about it.
-- Fix scheduling inconsistency with GRES bound to specific CPUs.
-- If user belongs to a group which has split entries in /etc/group
search for its username in all groups.
-- Do not consider nodes explicitly powered up as DOWN with reason of "Node
unexpected rebooted".
-- Use correct slurmd spooldir when creating cpu-frequency locks.
-- Note that TICKET_BASED fairshare will be deprecated in the future. Consider
using the FAIR_TREE algorithm instead.
-- Set job's reason to BadConstaints when job can't run on any node.
-- Prevent abort on update of reservation with no nodes (licenses only).
-- Prevent slurmctld from dumping core if job_resrcs is missing in the
job data structure.
-- Fix squeue to print array task ids according to man page when
SLURM_BITSTR_LEN is defined in the environment.
-- In squeue, sort jobs based on array job ID if available.
-- Fix the calculation of job energy by not including the NO_VAL values.
-- Advanced reservation fixes: enable update of bluegene reservation, avoid
abort on multi-core reservations.
-- Set the totalview_stepid to the value of the job step instead of NO_VAL.
-- Fix slurmdbd core dump if the daemon does not have connection with
the database.
-- Display error message when attempting to modify priority of a held job.
-- Backfill scheduler: The configured backfill_interval value (default 30
seconds) is now interpretted as a maximum run time for the backfill
scheduler. Once reached, the scheduler will build a new job queue and
start over, even if not all jobs have been tested.
-- Backfill scheduler now considers OverTimeLimit and KillWait configuration
parameters to estimate when running jobs will exit.
-- Correct task layout with CR_Pack_Node option and more than 1 CPU per task.
-- Fix the scontrol man page describing the release argument.
-- When job QOS is modified, do so before attempting to change partition in
order to validate the partition's Allow/DenyQOS parameter.
* Changes in Slurm 14.11.7
==========================
-- Initialize some variables used with the srun --no-alloc option that may
cause random failures.
-- Add SchedulerParameters option of sched_min_interval that controls the
minimum time interval between any job scheduling action. The default value
is zero (disabled).
-- Change default SchedulerParameters=max_sched_time from 4 seconds to 2.
-- Refactor scancel so that all pending jobs are cancelled before starting
cancellation of running jobs. Otherwise they happen in parallel and the
pending jobs can be scheduled on resources as the running jobs are being
cancelled.
-- ALPS - Add new cray.conf variable NoAPIDSignalOnKill. When set to yes this
will make it so the slurmctld will not signal the apid's in a batch job.
Instead it relies on the rpc coming from the slurmctld to kill the job to
end things correctly.
-- ALPS - Have the slurmstepd running a batch job wait for an ALPS release
before ending the job.
-- Initialize variables in consumable resource plugin to prevent core dump.
-- Fix scancel bug which could return an error on attempt to signal a job step.
-- In slurmctld communication agent, make the thread timeout be the configured
value of MessageTimeout rather than 30 seconds.
-- sshare -U/--Users only flag was used uninitialized.
-- Cray systems, add "plugstack.conf.template" sample SPANK configuration file.
-- BLUEGENE - Set DB2NOEXITLIST when starting the slurmctld daemon to avoid
random crashing in db2 when the slurmctld is exiting.
-- Make full node reservations display correctly the core count instead of
cpu count.
-- Preserve original errno on execve() failure in task plugin.
-- Add SLURM_JOB_NAME env variable to an salloc's environment.
-- Overwrite SLURM_JOB_NAME in an srun when it gets an allocation.
-- Make sure each job has a wckey if that is something that is tracked.
-- Make sure old step data is cleared when job is requeued.
-- Load libtinfo as needed when building ncurses tools.
-- Fix small memory leak in backup controller.
-- Fix segfault when backup controller takes control for second time.
-- Cray - Fix backup controller running native Slurm.
-- Provide prototypes for init_setproctitle()/fini_setproctitle on NetBSD.
-- Add configuration test to find out the full path to su command.
-- preempt/job_prio plugin: Fix for possible infinite loop when identifying
preemptable jobs.
-- preempt/job_prio plugin: Implement the concept of Warm-up Time here. Use
the QoS GraceTime as the amount of time to wait before preempting.
Basically, skip preemption if your time is not up.
-- Make srun wait KillWait time when a task is cancelled.
-- switch/cray: Revert logic added to 14.11.6 that set "PMI_CRAY_NO_SMP_ENV=1"
if CR_PACK_NODES is configured.
* Changes in Slurm 14.11.6
==========================
-- If SchedulerParameters value of bf_min_age_reserve is configured, then
a newly submitted job can start immediately even if there is a higher
priority non-runnable job which has been waiting for less time than
bf_min_age_reserve.
-- qsub wrapper modified to export "all" with -V option
-- RequeueExit and RequeueExitHold configuration parameters modified to accept
numeric ranges. For example "RequeueExit=1,2,3,4" and "RequeueExit=1-4" are
equivalent.
-- Correct the job array specification parser to accept brackets in job array
expression (e.g. "123_[4,7-9]").
-- Fix for misleading job submit failure errors sent to users. Previous error
could indicate why specific nodes could not be used (e.g. too small memory)
when other nodes could be used, but were not for another reason.
-- Fix squeue --array to display correctly the array elements when the
% separator is specified at the array submission time.
-- Fix priority from not being calculated correctly due to memory issues.
-- Fix a transient pending reason 'JobId=job_id has invalid QOS'.
-- A non-administrator change to job priority will not be persistent except
for holding the job. User's wanting to change a job priority on a persistent
basis should reset it's "nice" value.
-- Print buffer sizes as unsigned values when failed to pack messages.
-- Fix race condition where sprio would print factors without weights applied.
-- Document the sacct option JobIDRaw which for arrays prints the jobid instead
of the arrayTaskId.
-- Allow users to modify MinCPUsNode, MinMemoryNode and MinTmpDiskNode of
their own jobs.
-- Increase the jobid print field in SQUEUE_FORMAT in
opt_modulefiles_slurm.in.
-- Enable compiling without optimizations and with debugging symbols by
default. Disable this by configuring with --disable-debug.
-- job_submit/lua plugin: Add mail_type and mail_user fields.
-- Correct output message from sshare.
-- Use standard statvfs(2) syscall if available, in preference to
non-standard statfs.
-- Add a new option -U/--Users to sshare to display only users
information, parent and ancestors are not printed.
-- Purge 50000 records at a time so that locks can released periodically.
-- Fix potentially uninitialized variables
-- ALPS - Fix issue where a frontend node could become unresponsive and never
added back into the system.
-- Gate epilog complete messages as done with other messages
-- If we have more than a certain number of agents (50) wait longer when gating
rpcs.
-- FrontEnd - ping non-responding or down nodes.
-- switch/cray: If CR_PACK_NODES is configured, then set the environment
variable "PMI_CRAY_NO_SMP_ENV=1"
-- Fix invalid memory reference in SlurmDBD when putting a node up.
-- Allow opening of plugstack.conf even when a symlink.
-- Fix scontrol reboot so that rebooted nodes will not be set down with reason
'Node xyz unexpectedly rebooted' but will be correctly put back to service.
-- CRAY - Throttle the post NHC operations as to not hog the job write lock
if many steps/jobs finish at once.
-- Disable changes to GRES count while jobs are running on the node.
-- CRAY - Fix issue with scontrol reconfig.
-- slurmd: Remove wrong reporting of "Error reading step ... memory limit".
The logic was treating success as an error.
-- Eliminate "Node ping apparently hung" error messages.
-- Fix average CPU frequency calculation.
-- When allocating resources with resolution of sockets, charge the job for all
CPUs on allocated sockets rather than just the CPUs on used cores.
-- Prevent slurmdbd error if cluster added or removed while rollup in progress.
Removing a cluster can cause slurmdbd to abort. Adding a cluster can cause
the slurmdbd rollup to hang.
-- sview - When right clicking on a tab make sure we don't display the page
list, but only the column list.
-- FRONTEND - If doing a clean start make sure the nodes are brought up in the
database.
-- MySQL - Fix issue when using the TrackSlurmctldDown and nodes are down at
the same time, don't double bill the down time.
-- MySQL - Various memory leak fixes.
-- sreport - Fix Energy displays
-- Fix node manager logic to keep unexpectedly rebooted node in state
NODE_STATE_DOWN even if already down when rebooted.
-- Fix for array jobs submitted to multiple partitions not starting.
-- CRAY - Enable ALPs mpp compatibility code in sbatch for native Slurm.
-- ALPS - Move basil_inventory to less confusing function.
-- Add SchedulerParameters option of "sched_max_job_start=" to limit the
number of jobs that can be started in any single execution of the main
scheduling logic.
-- Fixed compiler warnings generated by gcc version >= 4.6.
-- sbatch to stop parsing script for "#SBATCH" directives after first command,
which matches the documentation.
-- Overwrite the SLURM_JOB_NAME in sbatch if already exist in the environment
and use the one specified on the command line --job-name.
-- Remove xmalloc_nz from unpack functions. If the unpack ever failed the
free afterwards would not have zeroed out memory on the variables that
didn't get unpacked.
-- Improve database interaction from controller.
-- Fix for data shift when loading job archives.
-- ALPS - Added new SchedulerParameters=inventory_interval to specify how
often an inventory request is handled.
-- ALPS - Don't run a release on a reservation on the slurmctld for a batch
job. This is already handled on the stepd when the script finishes.
* Changes in Slurm 14.11.5
==========================
-- Correct the squeue command taking into account that a node can
have NULL name if it is not in DNS but still in slurm.conf.
-- Fix slurmdbd regression which would cause a segfault when a node is set
down with no reason.
-- BGQ - Fix issue with job arrays not being handled correctly
in the runjob_mux plugin.
-- Print FAIR_TREE, if configured, in "scontrol show config" output for
PriorityFlags.
-- Add SLURM_JOB_GPUS environment variable to those available in the Prolog.
-- Load lua-5.2 library if using lua5.2 for lua job submit plugin.
-- GRES logic: Prevent bad node_offset due to not preserving no_consume flag.
-- Fix wrong variables used in the wrapper functions needed for systems that
don't support strong_alias
-- Fix code for apple computers SOL_TCP is not defined
-- Cray/BASIL - Check for mysql credentials in /root/.my.cnf.
-- Fix sprio showing wrong priority for job arrays until priority is
recalculated.
-- Account to batch step all CPUs that are allocated to a job not
just one since the batch step has access to all CPUs like other steps.
-- Fix job getting EligibleTime set before meeting dependency requirements.
-- Correct the initialization of QOS MinCPUs per job limit.
-- Set the debug level of information messages in cgroup plugin to debug2.
-- For job running under a debugger, if the exec of the task fails, then
cancel its I/O and abort immediately rather than waiting 60 seconds for
I/O timeout.
-- Fix associations not getting default qos set until after a restart.
-- Set the value of total_cpus not to be zero before invoking
acct_policy_job_runnable_post_select.
-- MySQL - When requesting cluster resources, only return resources for the
cluster(s) requested.
-- Add TaskPluginParam=autobind=threads option to set a default binding in the
case that "auto binding" doesn't find a match.
-- Introduce a new SchedulerParameters variable nohold_on_prolog_fail.
If configured don't requeue jobs on hold is a Prolog fails.
-- Make it so sched_params isn't read over and over when an epilog complete
message comes in
-- Fix squeue -L <licenses> not filtering out jobs with licenses.
-- Changed the implementation of xcpuinfo_abs_to_mac() be identical
_abs_to_mac() to fix CPUs allocation using cpuset cgroup.
-- Improve the explanation of the unbuffered feature in the
srun man page.
-- Make taskplugin=cgroup work for core spec. needed to have task/cgroup
before.
-- Fix reports not using the month usage table.
-- BGQ - Sanity check given for translating small blocks into slurm bg_records.
-- Fix bug preventing the requeue/hold or requeue/special_exit of job from the
completing state.
-- Cray - Fix for launching batch step within an existing job allocation.
-- Cray - Add ALPS_APP_ID_ENV environment variable.
-- Increase maximum MaxArraySize configuration parameter value from 1,000,001
to 4,000,001.
-- Added new SchedulerParameters value of bf_min_age_reserve. The backfill
scheduler will not reserve resources for pending jobs until they have
been pending for at least the specified number of seconds. This can be
valuable if jobs lack time limits or all time limits have the same value.
-- Fix support for --mem=0 (all memory of a node) with select/cons_res plugin.
-- Fix bug that can permit someone to kill job array belonging to another user.
-- Don't set the default partition on a license only reservation.
-- Show a NodeCnt=0, instead of NO_VAL, in "scontrol show res" for a license
only reservation.
-- BGQ - When using static small blocks make sure when clearing the job the
block is set up to it's original state.
-- Start job allocation using lowest numbered sockets for block task
distribution for consistency with cyclic distribution.
* Changes in Slurm 14.11.4
==========================
-- Make sure assoc_mgr locks are initialized correctly.
-- Correct check of enforcement when filling in an association.
-- Make sacctmgr print out classification correctly for clusters.
-- Add array_task_str to the perlapi job info.
-- Fix for slurmctld abort with GRES types configured and no CPU binding.
-- Fix for GRES scheduling where count > 1 per topology type (or GRES types).
-- Make CR_ONE_TASK_PER_CORE work correctly with task/affinity.
-- job_submit/pbs - Fix possible deadlock.
-- job_submit/lua - Add "alloc_node" to job information available.
-- Fix memory leak in mysql accounting when usage rollup happens.
-- If users specify ALL together with other variables using the
--export sbatch/srun command line option, propagate the users'
environ to the execution side.
-- Fix job array scheduling anomaly that can stop scheduling of valid tasks.
-- Fix perl api tests for libslurmdb to work correctly.
-- Remove some misleading logs related to non-consumable GRES.
-- Allow --ignore-pbs to take effect when read as an #SBATCH argument.
-- Fix Slurmdb::clusters_get() in perl api from not returning information.
-- Fix TaskPluginParam=Cpusets from logging error message about not being able
to remove cpuset dir which was already removed by the release_agent.
-- Fix sorting by time left in squeue.
-- Fix the file name substitution for job stderr when %A, %a %j and %u
are specified.
-- Remove minor warning when compiling slurmstepd.
-- Fix database resources so they can add new clusters to them after they have
initially been added.
-- Use the slurm_getpwuid_r wrapper of getpwuid_r to handle possible
interrupts.
-- Correct the scontrol man page and command listing which node states can
be set by the command.
-- Stop sacct from printing non-existent stat information for
Front End systems.
-- Correct srun and acct_gather.conf man pages, mention Filesystem instead
of Lustre.
-- When a job using multiple partition starts send to slurmdbd only
the partition in which the job runs.
-- ALPS - Fix depth for MemoryAllocation in BASIL with CLE 5.2.3.
-- Fix assoc_mgr hash to deal with users that don't have a uid yet when making
reservations.
-- When a job uses multiple partition set the environment variable
SLURM_JOB_PARTITION to be the one in which the job started.
-- Print spurious message about the absence of cgroup.conf at log level debug2
instead of info.
-- Enable CUDA v7.0+ use with a Slurm configuration of TaskPlugin=task/cgroup
ConstrainDevices=yes (in cgroup.conf). With that configuration
CUDA_VISIBLE_DEVICES will start at 0 rather than the device number.
-- Fix job array logic that can cause slurmctld to abort.
-- Report job "shared" field properly in scontrol, squeue, and sview.
-- If a job is requeued because of RequeueExit or RequeueExitHold sent event
REQUEUED to slurmdbd.
-- Fix build if hwloc is in non-standard location.
-- Fix slurmctld job recovery logic which could cause the last task in a job
array to be lost.
-- Fix slurmctld initialization problem which could cause requeue of the last
task in a job array to fail if executed prior to the slurmctld loading
the maximum size of a job array into a variable in the job_mgr.c module.
-- Fix fatal in controller when deleting a user association of a user which
had been previously removed from the system.
-- MySQL - If a node state and reason are the same on a node state change
don't insert a new row in the event table.
-- Fix issue with "sreport cluster AccountUtilizationByUser" when using
PrivateData=users.
-- Fix perlapi tests for libslurm perl module.
-- MySQL - Fix potential issue when PrivateData=Usage and a normal user
runs certain sreport reports.
* Changes in Slurm 14.11.3
==========================
-- Prevent vestigial job record when canceling a pending job array record.
-- Fixed squeue core dump.
-- Fix job array hash table bug, could result in slurmctld infinite loop or
invalid memory reference.
-- In srun honor ntasks_per_node before looking at cpu count when the user
doesn't request a number of tasks.
-- Fix ghost job when submitting job after all jobids are exhausted.
-- MySQL - Enhanced coordinator security checks.
-- Fix for task/affinity if an admin configures a node for having threads
but then sets CPUs to only represent the number of cores on the node.
-- Make it so previous versions of salloc/srun work with newer versions
of Slurm daemons.
-- Avoid delay on commit for PMI rank 0 to improve performance with some
MPI implementations.
-- auth/munge - Correct logic to read old format AccountingStoragePass.
-- Reset node "RESERVED" state as appropriate when deleting a maintenance
reservation.
-- Prevent a job manually suspended from being resumed by gang scheduler once
free resources are available.
-- Prevent invalid job array task ID value if a task is started using gang
scheduling.
-- Fixes for clean build on FreeBSD.
-- Fix documentation bugs in slurm.conf.5. DenyAccount should be DenyAccounts.
-- For backward compatibility with older versions of OMPI not compiled
with --with-pmi restore the SLURM_STEP_RESV_PORTS in the job environment.
-- Update the html documentation describing the integration with openmpi.
-- Fix sacct when searching by nodelist.
-- Fix cosmetic info statements when dealing with a job array task instead of
a normal job.
-- Fix segfault with job arrays.
-- Correct the sbatch pbs parser to process -j.
-- BGQ - Put print statement under a DebugFlag. This was just an oversight.
-- BLUEGENE - Remove check that would erroneously remove the CONFIGURING
flag from a job while the job is waiting for a block to boot.
-- Fix segfault in slurmstepd when job exceeded memory limit.
-- Fix race condition that could start a job that is dependent upon a job array
before all tasks of that job array complete.
-- PMI2 race condition fix.
* Changes in Slurm 14.11.2
==========================
-- Fix Centos5 compile errors.
-- Fix issue with association hash not getting the correct index which
could result in seg fault.
-- Fix salloc/sbatch -B segfault.
-- Avoid huge malloc if GRES configured with "Type" and huge "Count".
-- Fix jobs from starting in overlapping reservations that won't finish before
a "maint" reservation begins.
-- When node gets drained while in state mixed display its status as draining
in sinfo output.
-- Allow priority/multifactor to work with sched/wiki(2) if all priorities
have no weight. This allows for association and QOS decay limits to work.
-- Fix "squeue --start" to override SQUEUE_FORMAT env variable.
-- Fix scancel to be able to cancel multiple jobs that are space delimited.
-- Log Cray MPI job calling exit() without mpi_fini(), but do not treat it as
a fatal error. This partially reverts logic added in version 14.03.9.
-- sview - Fix displaying of suspended steps elapsed times.
-- Increase number of messages that get cached before throwing them away
when the DBD is down.
-- Fix jobs from starting in overlapping reservations that won't finish before
a "maint" reservation begins.
-- Restore GRES functionality with select/linear plugin. It was broken in
version 14.03.10.
-- Fix bug with GRES having multiple types that can cause slurmctld abort.
-- Fix squeue issue with not recognizing "localhost" in --nodelist option.
-- Make sure the bitstrings for a partitions Allow/DenyQOS are up to date
when running from cache.
-- Add smap support for job arrays and larger job ID values.
-- Fix possible race condition when attempting to use QOS on a system running
accounting_storage/filetxt.
-- Fix issue with accounting_storage/filetxt and job arrays not being printed
correctly.
-- In proctrack/linuxproc and proctrack/pgid, check the result of strtol()
for error condition rather than errno, which might have a vestigial error
code.
-- Improve information recording for jobs deferred due to advanced
reservation.
-- Exports eio_new_initial_obj to the plugins and initialize kvs_seq on
mpi/pmi2 setup to support launching.
* Changes in Slurm 14.11.1
==========================
-- Get libs correct when doing the xtree/xhash make check.
-- Update xhash/tree make check to work correctly with current code.
-- Remove the reference 'experimental' for the jobacct_gather/cgroup
plugin.
-- Add QOS manipulation examples to the qos.html documentation page.
-- If 'squeue -w node_name' specifies an unknown host name print
an error message and return 1.
-- Fix race condition in job_submit plugin logic that could cause slurmctld to
deadlock.
-- Job wait reason of "ReqNodeNotAvail" expanded to identify unavailable nodes
(e.g. "ReqNodeNotAvail(Unavailable:tux[3-6])").
* Changes in Slurm 14.11.0
==========================
-- ALPS - Fix issue with core_spec warning.
-- Allow multiple partitions to be specified in sinfo -p.
-- Install the service files in /usr/lib/systemd/system.
-- MYSQL - Add id_array_job and id_resv keys to $CLUSTER_job_table. THIS
COULD TAKE A WHILE TO CREATE THE KEYS SO BE PATIENT.
-- CRAY - Resize bitmaps on a restart and find we have more blades
than before.
-- Add new eio API function for removing unused connections.
-- ALPS - Fix issue where batch allocations weren't correctly confirmed or
released.
-- Define DEFAULT_MAX_TASKS_PER_NODE based on MAX_TASKS_PER_NODE from
slurm.h as per documentation.
-- Update the FAQ about relocating slurmctld.
-- In the memory cgroup enable memory.use_hierarchy in the cgroup root.
-- Export eio.c functions for use by MPI/PMI2.
-- Add SLURM_CLUSTER_NAME to job environment.
* Changes in Slurm 14.11.0rc3
=============================
-- Allow envs to override autotools binaries in autogen.sh
-- Added system services files.
-- If the jobs pends with DependencyNeverSatisfied keep it pending even after
the job which it was depending upon was cleaned.
-- Let operators (in addition to user root and SlurmUser) see job script for
other user's jobs.
-- Perl API modified to return node state of MIXED rather than ALLOCATED if
only some CPUs allocated.
-- Double Munge connect retry timeout from 1 to 2 seconds.
-- sview - Remove unneeded code that was resolved globally in commit
98e24b0dedc.
-- Collect and report the accounting of the batch step and its children.
-- Add configure checks for faccessat and eaccess, and make use of one of
them if available.
-- Make configure --enable-developer also set --enable-debug
-- Introduce a SchedulerParameters variable kill_invalid_depend, if set
then jobs pending with invalid dependency are going to be terminated.
-- Move spank_user_task() call in slurmstepd after the task_g_pre_launch()
so that the task affinity information is available to spank.
-- Make /etc/init.d/slurm script return value 3 when the daemon is
not running. This is required by Linux Standard Base Core
Specification 3.1
* Changes in Slurm 14.11.0rc2
=============================
-- Logs for jobs which are explicitly requeued will say so rather than saying
that a node in their allocation failed.
-- Updated the documentation about the remote licenses served by
the Slurm database.
-- Insure that slurm_spank_exit() is only called once from srun.
-- Change the signature of net_set_low_water() to use 4 bytes instead of 8.
-- Export working_cluster_rec in libslurmdb.so as well as move some function
definitions needed for drmaa.
-- If using cons_res or serial cause a fatal in the plugin instead of causing
the SelectTypeParameters to magically set to CR_CPU.
-- Enhance task/affinity auto binding to consider tasks * cpus-per-task.
-- Fix regression the priority/multifactor which would cause memory corruption.
Issue is only in rc1.
-- Add PrivateData value of "cloud". If set, powered down nodes in the cloud
will be visible.
-- Sched/backfill - Eliminate clearing start_time of running jobs.
-- Fix various backwards compatibility issues.
-- If failed to launch a batch job, requeue it in hold.
* Changes in Slurm 14.11.0rc1
=============================
-- When using cgroup name the batch step as step_batch instead of
batch_4294967294
-- Changed LEVEL_BASED priority to be "Fair_Tree"
-- Port to NetBSD.
-- BGQ - Add cnode based reservations.
-- Alongside totalview_jobid implement totalview_stepid available
to sattach.
-- Add ability to include other files in slurm.conf based upon the ClusterName.
-- Update strlcpy to latest upstream version.
-- Add reservation information in the sacct and sreport output.
-- Add job priority calculation check for overflow and fix memory leak.
-- Add SchedulerParameters option of pack_serial_at_end to put serial jobs at
the end of the available nodes rather than using a best fit algorithm.
-- Allow regular users to view default sinfo output when
privatedata=reservations is set.
-- PrivateData=reservation modified to permit users to view the reservations
which they have access to (rather then preventing them from seeing ANY
reservation).
-- job_submit/lua: Fix job_desc set field logic
* Changes in Slurm 14.11.0pre5
==============================
-- Fix sbatch --export=ALL, it was treated by srun as a request to explicitly
export only the environment variable named "ALL".
-- Improve scheduling of jobs in reservations that overlap other reservations.
-- Modify sgather to make global file systems easier to configure.
-- Added sacctmgr reconfig to reread the slurmdbd.conf in the slurmdbd.
-- Modify scontrol job operations to accept comma delimited list of job IDs.
Applies to job update, hold, release, suspend, resume, requeue, and
requeuehold operations.
-- Refactor job_submit/lua interface. LUA FUNCTIONS NEED TO CHANGE! The
lua script no longer needs to explicitly load meta-tables, but information
is available directly using names slurm.reservations, slurm.jobs,
slurm.log_info, etc. Also, the job_submit.lua script is reloaded when
updated without restarting the slurmctld daemon.
-- Allow users to specify --resv_ports to have value 0.
-- Cray MPMD (Multiple-Program Multiple-Data) support completed.
-- Added ability for "scontrol update" to references jobs by JobName (and
filtered optionally by UserID).
-- Add support for an advanced reservation start time that remains constant
relative to the current time. This can be used to prevent the starting of
longer running jobs on select nodes for maintenance purpose. See the
reservation flag "TIME_FLOAT" for more information.
-- Enlarge the jobid field to 18 characters in squeue output.
-- Added "scontrol write config" option to save a copy of the current
configuration in a file containing a time stamp.
-- Eliminate native Cray specific port management. Native Cray systems must
now use the MpiParams configuration parameter to specify ports to be used
for commmunications. When upgrading Native Cray systems from version 14.03,
all running jobs should be killed and the switch_cray_state file (in
SaveStateLocation of the nodes where the slurmctld daemon runs) must be
explicitly deleted.
* Changes in Slurm 14.11.0pre4
==============================
-- Added job array data structure and removed 64k array size restriction.
-- Added SchedulerParameters options of bf_max_job_array_resv to control how
many tasks of a job array should have resources reserved for them.
-- Added more validity checking of incoming job submit requests.
-- Added srun --export option to set/export specific environment variables.
-- Scontrol modified to print separate error messages for job arrays with
different exit codes on the different tasks of the job array. Applies to
job suspend and resume operations.
-- Fix race condition in CPU frequency set with job preemption.
-- Always call select plugin on step termination, even if the job is also
complete.
-- Srun executable names beginning with "." will be resolved based upon the
working directory and path on the compute node rather than the submit node.
-- Add node state string suffix of "$" to identify nodes in maintenance
reservation or scheduled for reboot. This applies to scontrol, sinfo,
and sview commands.
-- Enable scontrol to clear a nodes's scheduled reboot by setting its state
to "RESUME".
-- As per sbatch and srun documentation when the --signal option is used
signal only the steps and unless, in the case, of a batch job B is
specified in which case signal only the batch script.
-- Modify AuthInfo configuration parameter to accept credential lifetime
option.
-- Modify crypto/munge plugin to use socket and timeout specified in AuthInfo.
-- If we have a state for a step on completion put that in the database
instead of guessing off the exit_code.
-- Added squeue -P/--priority option that can be used to display pending jobs
in the same order as used by the Slurm scheduler even if jobs are submitted
to multiple partitions (job is reported once per usable partition).
-- Improve the pending reason description for various QOS limits. For each
QOS limit that causes a job to be pending print its specific reason.
For example if job pends because of GrpCpus the squeue command will
print QOSGrpCpuLimit as pending reason.
-- sched/backfill - Set expected start time of job submitted to multiple
partitions to the earliest start time on any of the partitions.
-- Introduce a MAX_BATCH_REQUEUE define that indicates how many times a job
can be requeued upon prolog failure. When the number is reached the job
is put on hold with reason JobHoldMaxRequeue.
-- Add sbatch job array option to limit the number of simultaneously running
tasks from a job array (e.g. "--array=0-15%4").
-- Implemented a new QOS limit MinCPUs. Users running under a QOS must
request a minimum number of CPUs which is at least MinCPUs otherwise
their job will pend.
-- Introduced a new pending reason WAIT_QOS_MIN_CPUS to reflect the new QOS
limit.
-- Job array dependency based upon state is now dependent upon the state of
the array as a whole (e.g. afterok requires ALL tasks to complete
sucessfully, afternotok is true if ANY tasks does not complete successfully,
and after requires all tasks to at least be started).
-- The srun -u/--unbuffered options set the stdout of the task launched
by srun to be line buffered.
-- The srun options -/--label and -u/--unbuffered can be specified together.
This limitation has been removed.
-- Provide sacct display of gres accounting information per job.
-- Change the node status size from uin16_t to uint32_t.
* Changes in Slurm 14.11.0pre3
==============================
-- Move xcpuinfo.[c|h] to the slurmd since it isn't needed anywhere else
and will avoid the need for all the daemons to link to libhwloc.
-- Add memory test to job_submit/partition plugin.
-- Added new internal Slurm functions xmalloc_nz() and xrealloc_nz(), which do
not initialize the allocated memory to zero for improved performance.
-- Modify hostlist function to dynamically allocate buffer space for improved
performance.
-- In the job_submit plugin: Remove all slurmctld locks prior to job_submit()
being called for improved performance. If any slurmctld data structures are
read or modified, add locks directly in the plugin.
-- Added PriorityFlag LEVEL_BASED described in doc/html/level_based.shtml
-- If Fairshare=parent is set on an account, that account's children will be
effectively reparented for fairshare calculations to the first parent of
their parent that is not Fairshare=parent. Limits remain the same,
only it's fairshare value is affected.
* Changes in Slurm 14.11.0pre2
==============================
-- Added AllowSpecResourcesUsage configuration parameter in slurm.conf. This
allows jobs to use specialized resources on nodes allocated to them if the
job designates --core-spec=0.
-- Add new SchedulerParameters option of build_queue_timeout to throttle how
much time can be consumed building the job queue for scheduling.
-- Added HealthCheckNodeState option of "cycle" to cycle through the compute
nodes over the course of HealthCheckInterval rather than running all at
the same time.
-- Add job "reboot" option for Linux clusters. This invokes the configured
RebootProgram to reboot nodes allocated to a job before it begins execution.
-- Added squeue -O/--Format option that makes all job and step fields available
for printing.
-- Improve database slurmctld entry speed dramatically.
-- Add "CPUs" count to output of "scontrol show step".
-- Add support for lua5.2
-- scancel -b signals only the batch step neither any other step nor any
children of the shell script.
-- MySQL - enforce NO_ENGINE_SUBSTITUTION
-- Added CpuFreqDef configuration parameter in slurm.conf to specify the
default CPU frequency and governor to be set at job end.
-- Added support for job email triggers: TIME_LIMIT, TIME_LIMIT_90 (reached
90% of time limit), TIME_LIMIT_80 (reached 80% of time limit), and
TIME_LIMIT_50 (reached 50% of time limit). Applies to salloc, sbatch and
srun commands.
-- In slurm.conf add the parameter SrunPortRange=min-max. If this is configured
then srun will use its dynamic ports only from the configured range.
-- Make debug_flags 64 bit to handle more flags.
* Changes in Slurm 14.11.0pre1
==============================
-- Modify etc/cgroup.release_common.example to set specify full path to the
scontrol command. Also find cgroup mount point by reading cgroup.conf file.
-- Improve qsub wrapper support for passing environment variables.
-- Modify sdiag to report Slurm RPC traffic by user, type, count and time
consumed.
-- In select plugins, stop triggering extra logging based upon the debug flag
CPU_Bind and use SelectType instead.
-- Added SchedulerParameters options of bf_yield_interval and bf_yield_sleep
to control how frequently and for how long the backfill scheduler will
relinquish its locks.
-- To support larger numbers of jobs when the StateSaveDirectory is on a
file system that supports a limited number of files in a directory, add a
subdirectory called "hash.#" based upon the last digit of the job ID.
-- More gracefully handle missing batch script file. Just kill the job and do
not drain the compute node.
-- Add support for allocation of GRES by model type for heterogeneous systems
(e.g. request a Kepler GPU, a Tesla GPU, or a GPU of any type).
-- Record and enable display of nodes anticipated to be used for pending jobs.
<