Releases: SchedMD/slurm

v25.11.6

14 May 20:01

Changes in 25.11.6

  • scontrol - Allow updating InstanceId for batches of nodes as is possible for updating NodeAddr and NodeHosts.
  • scontrol - Allow updating InstanceType for batches of nodes as is possible for updating NodeAddr and NodeHosts.
  • Fix problem when using sacctmgr to remove a default account for a user when more than one is set.
  • Fix sacctmgr silently ignoring trailing characters in numeric options.
  • Fix sbcast with auth/slurm when user doesn't exist on slurmctld.
  • Fix stepmgr crash when using sbcast with auth/slurm.
  • Fix memory leak in stepmgr stepd.
  • Reject untrusted REQUEST_COMPLETE_PROLOG.
  • Fix jobs getting stuck in COMPLETING state when PrologFlags=RunInJob is configured by passing EpilogMsgTime to slurmstepd.
  • Fix external nodes incorrectly marked as not responding after state transitions such as drain/undrain or resume.
  • slurmstepd - Prevent crash when UnkillableStepTimeout is reached and Slurm is configured with --enable-memory-leak-debug.
  • slurmctld - Fix possible hang during reconfigure caused by slow client I/O when the timeout was not enforced.
  • slurmctld - Fix possible hang during shutdown caused by slow client I/O when the timeout was not enforced.
  • slurmctld - Avoid race condition during shutdown that could cause a crash while attempting to read from a connection.
  • Fix parsing issue for GRES resources that contain a hyphen ("-") in their name when using sacctmgr.
  • Ensure that a request for zero licenses does not prevent a job from running when all licenses are in-use or reserved.
  • slurmctld - Fix crash on startup due to race condition when I/O is processed before the connection (conn) plugin finishes initialization.
  • slurmdbd - Fix crash from race condition during shutdown when a persistent connection closes its database connection after the accounting_storage plugin has already unloaded.
  • slurmrestd - Fixed memory leak resulting from specifying an empty node_list in the request body of the following endpoints: 'POST /slurm/v0.0.4[3-5]/reservation' 'POST /slurm/v0.0.4[3-5]/reservations'
  • Prevent deadlock when replacing nodes in reservations.
  • Fix slow scheduling for multi-segment jobs with topology/block when blocks have fewer available nodes than the requested segment size.
  • serializer/url-encoded - Allow non-NULL terminated strings to be passed to serialize_p_string_to_data().
  • serializer/yaml - Prevent fataling if the size of a yaml configuration file is a multiple of 4096 bytes.
  • Fix archive dump jobs "No records archived...but some found"
  • Fix gcc-16 build errors.
  • Fix slurmstepd crash in jobacctinfo_aggregate() handling when SlurmctldParameters=enable_stepmgr and JobAcctGatherType=jobacct_gather/none are set.
  • Fix slurmd >= 25.05 crash on HetJob step launches from srun <= 24.11.
  • Set the in-memory QOS priority to 0 after INFINITY is handled by slurmdbd.
  • Do not allocate maintenance nodes to new reservations.
  • slurmd - fix a potential crash during message forwarding
  • Fix out-of-bounds array errors by resizing leaf_usage when tres_cnt changes.
  • All features will be tested before jobs are preempted.
  • slurmstepd - when a node fails on which the batch step is running, don't deallocate the batch step until after the job completes or is requeued.
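
To illustrate the sacctmgr numeric-option fix listed above, here is a minimal sketch contrasting prefix-style parsing, which silently drops trailing characters, with strict parsing. This is not sacctmgr's code; both function names are hypothetical and the regular expressions are an assumption about what counts as a valid number.

```python
import re


def parse_prefix(text: str) -> int:
    """Prefix-style parse: read leading digits and ignore whatever follows."""
    match = re.match(r"\s*[+-]?\d+", text)
    if match is None:
        raise ValueError(f"no number in {text!r}")
    return int(match.group())


def parse_strict(text: str) -> int:
    """Strict parse: the whole string must be a number, or it is rejected."""
    stripped = text.strip()
    if re.fullmatch(r"[+-]?\d+", stripped) is None:
        raise ValueError(f"invalid numeric value {text!r}")
    return int(stripped)
```

With these definitions, parse_prefix("100abc") quietly yields 100, while parse_strict("100abc") raises ValueError so the typo is reported rather than ignored.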

v25.05.8

14 May 20:01

Changes in 25.05.8

  • slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
  • slurmctld - Reject all job submissions as reserved user or group nobody(99).
  • sbatch,srun,salloc - Reject arg --uid=99.
  • sbatch,srun,salloc - Reject arg --gid=99.
  • slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
  • cons_tres - Prevent slurmctld SIGFPE during node selection.
  • slurmctld - Fix possible hang during reconfigure caused by slow client I/O when the timeout was not enforced.
  • slurmctld - Fix possible hang during shutdown caused by slow client I/O when the timeout was not enforced.
  • slurmctld - Avoid race condition during shutdown that could cause a crash while attempting to read from a connection.
  • slurmctld - Fix crash on startup due to race condition when I/O is processed before the connection (conn) plugin finishes initialization.
  • slurmdbd - Fix crash from race condition during shutdown when a persistent connection closes its database connection after the accounting_storage plugin has already unloaded.
  • Prevent deadlock when replacing nodes in reservations.
  • Fix gcc-16 build errors.
  • Fix build errors with recent versions of libcurl (8.16+).
  • Fix catching invalid gpu-freq numbered values.
  • Fix slurmd >= 25.05 crash on HetJob step launches from srun <= 24.11.

v26.05.0rc1

07 May 21:24

Pre-release

Changes in 26.05.0rc1

  • Add SLURM_JOB_QOS to Prolog/Epilog environment.
  • data_parser/v0.0.45 - Prevent memory leaks when freeing parsed lists.
  • Return an xstring from slurm_create_reservation() instead of one created with strdup().
  • scontrol - If a step terminates while its pids are being queried, 'scontrol listpids' will now print all successfully found pids instead of only logging an error.
  • Prevent stepd_connect() from overriding the connect call's errno on error.
  • slurmctld - Support 'verbose' query parameter in 'GET /readyz' endpoint.
  • slurmd - Support 'verbose' query parameter in 'GET /readyz' endpoint.
  • sacctmgr - In interactive mode, quiet/verbose will now apply to logging messages that are printed.
  • sacctmgr - Quiet (--quiet/-Q) and verbose (--verbose/-v) command line options are now mutually exclusive. sacctmgr will immediately exit if both options are specified.
  • sacctmgr - Quiet option (--quiet/-Q) is now applied to all logging messages, ensuring that it is enforced in all cases (e.g. logging from 'dump' previously would not honor --quiet)
  • NO_NORMAL_ALL will only be printed if all NO_NORMAL_* flags are set.
  • job_submit/lua - Log Lua stacktrace on runtime errors when calling slurm_job_submit() in job_submit.lua when 'debugflags=script' is set in slurm.conf or via environment SLURM_DEBUG_FLAGS=script.
  • job_submit/lua - Log Lua stacktrace on runtime errors when calling slurm_job_modify() in job_submit.lua when 'debugflags=script' is set in slurm.conf or via environment SLURM_DEBUG_FLAGS=script.
  • Added error handling and logging when a malformed RESPONSE_CONFIG RPC is received.
  • Reject QOS creation requests that use nonuser flags
  • Do not print nonuser QOS flags as valid flags
  • Add "thread" as possible flag to "debugflags=" in slurm.conf and slurmdbd.conf.
  • Do not allow clearing the partition from a reservation (e.g. scontrol update ReservationName=<res_name> PartitionName=''). Attempts to clear the partition from a reservation will be rejected by slurmctld. This change also fixes several potential slurmctld crashes.
  • Add DebugFlag=SelectType log for when a node is skipped during job scheduling attempts because it is in COMPLETING state.
  • slurmrestd - Add POWER_DOWN_ASAP and POWER_DOWN_FORCE as valid node states in REST.
  • slurmctld - Remove Slurmctld job state cache including support for SchedulerParameters=enable_job_state_cache in slurm.conf.
  • slurmctld - Log error when saving to StateSaveLocation is too slow.
  • slurmctld - Include StateSaveLocation statistics with /readyz endpoint.
  • Fix error reading /proc/0/* when calling the api outside the step namespace.
  • Alter sh5util -j to not allow array or het job ids.
  • slurmctld - Improve ability to process RPCs in parallel by removing the need for the node write lock to process REQUEST_NODE_INFO, "metrics/partitions", and "metrics/nodes" requests, as well as when spawning the node health check agent.
  • slurmctld - No longer acquire the job write lock when spawning the node health check agent.
  • Fix long slurmd stop time when waiting on the slurmd to register.
  • Fix slurmstepd memleak when initializing cgroup plugins.
  • scrun - Update scrun.lua example in man 1 scrun removing requirement to compile Lua with JSON support.
  • Fix not applying constraints if CpuSpecList string is larger than 1024 chars.
  • slurmrestd - Return 200 when querying a non existing partition. This affects the following endpoints: 'GET /slurm/v0.0.45/partition'
  • slurmctld - Preserve intermediate job scheduling values to provide consistent scontrol show job output before and after reconfiguring or restarting the controller.
  • Increase precision of time reported when timers issue warnings.
  • scontrol - Print 'Job 12_23 not found' errors on stderr instead of stdout.
  • stepmgr - Handle the case where a step's requested ThreadsPerCore does not equal a node's configured ThreadsPerCore.
  • Fix bug where requests from denied uids (i.e. "Users=-") to skip, delete or view (if using PrivateData) reservations were not rejected properly. This bug only existed for clusters not using AccountingStorageEnforce=associations (including other options that imply enforcing associations)
  • Fix rare potential race condition in x11 forwarding that could result in a double free.
  • salloc/scrun/srun/slurmstepd - Move setting of SLURM_TASKS_PER_NODE to the controller.
  • gpu/nvml - The --gpu-freq job submission options will now set the actual Memory/GPU clock frequencies rather than the "Applications clocks" frequencies if the installed version of NVML supports it. This affects CUDA 11.3+ and prevents build errors in CUDA 13.0+ where the "Applications clocks" interface has been deprecated.
  • gpu/nvml - Fix bug that prevented clock frequencies being reset on all GPUs at job completion when cgroups is constraining devices and there are multiple GPUs on the node.
  • gpu/nvml - Fix bug that prevented --gpu-freq from being applied to the GPU clock frequency without specifying a memory clock frequency.
  • Fixed SLURM_CLUSTER_NAME to be set to correct cluster when multiple clusters are available in a batch job.
  • Respect arbitrary task distribution and return ESLURM_NOT_SUPPORTED if it is set together with an incompatible setting, namely topology/block, --spread-job, CR_LLN, pack_serial_at_end or bf_busy_nodes.
  • slurmctld,slurmdbd: Avoid segfault when persistent connections fail to establish fully.
  • Avoid unnecessary numeric UID to user name translation when dumping information for a node with no reason set for its current state. The following slurmrestd endpoints have changed: GET /slurm/v0.0.45/nodes, GET /slurm/v0.0.45/node/{node_name}. The following CLI commands have changed: scontrol show node {node_name} (--json|--yaml), scontrol show nodes (--json|--yaml).
  • sinfo - Avoid unnecessary numeric UID to user name translation when dumping information for a node with no reason set for its current state. The following command has changed: sinfo (--json|--yaml).
  • slurmrestd - Add cores_per_socket to job submission to the following endpoints: GET /slurm/v0.0.45/job/submit GET /slurm/v0.0.45/job/allocate POST /slurm/v0.0.45/job/{job_id}
  • slurmctld - Refuse RESPONSE_PING_SLURMD from incorrect nodes
  • slurmctld - Skip MODE_3 HRes specific logic in backfill for jobs that do not request MODE_3 HRes.
  • select/cons_tres - fix use-after-free of node_usage[].jobs
  • Add status field to scontrol ping --json and scontrol ping --yaml.
  • Add status field to '.components.schemas."v0.0.45_controller_ping"' to following endpoint: GET /slurm/v0.0.45/ping
  • Add status field to sacctmgr ping --json and sacctmgr ping --yaml.
  • Add status field to '.components.schemas."v0.0.45_slurmdbd_ping"' to following endpoint: GET /slurmdb/v0.0.45/ping
  • slurmctld - Require authentication for the 'GET /readyz?verbose' endpoint, restricting access to only root and SlurmUser.
  • slurmctld - Add a thread pool to avoid the overhead of creating new threads, during which the kernel freezes the entire process. This can be enabled with SlurmctldParameters=threadpool=enabled.
  • Fix building with --with-jwt in a non-standard location.
  • sacct - Add '.jobs[].sluid' field to the following commands: 'sacct --json', 'sacct --yaml'
  • slurmrestd - Add '.jobs[].sluid' field to the following endpoints: 'GET slurmdb/v0.0.45/job', 'GET slurmdb/v0.0.45/jobs'
  • slurmrestd - Add 'GET /healthz', 'GET /readyz', and 'GET /livez' endpoints.
  • Fix potential glibc deadlock when tearing down the extern step when x11 forwarding is enabled.
  • Fix FreeBSD build for --format=binary files, which are currently used for command help and usage text.
  • Packaging - MUNGE is now a weak dependency to Slurm RPM and DEB packages, and can now be optionally installed or removed (installed by default).
  • Add SuspendTime as a NodeName parameter in slurm.conf, enabling per-node power save configuration.
  • slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.42/nodes/ POST /slurm/v0.0.42/node/{node_name}
  • slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.43/nodes/ POST /slurm/v0.0.43/node/{node_name}
  • slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.44/nodes/ POST /slurm/v0.0.44/node/{node_name}
  • slurmrestd - Deprecate ignored reason_uid field from the following endpoints: POST /slurm/v0.0.45/nodes/ POST /slurm/v0.0.45/node/{node_name}
  • Adding new archive/purge options to allow for explicit archiving of job_scripts and job_env without jobs.
  • When the url_parser plugin does not load, change the log from an error to a warning. This plugin is optional and may not always be built.
  • Fix rpmbuild slurm.spec --with selinux.
  • Use internal dependency generator in slurm.spec.
  • Switch to pkgconfig detection of many packages in slurm.spec.
  • Add reqTRES components to the clonensscript and clonensepilog environment variables.
  • Name all process POSIX threads consistently with format "worker[{index}]" when threads are not otherwise given a special name.
  • slurmctld - Fix unresponsive nodes not being marked DOWN in clusters with frequent reconfigurations, as each reconfigure was updating the SlurmdTimeout countdown.
  • slurmctld - If a node is replaced in a reservation, mark the reservation state as changed. With bf_continue enabled, this fixes potentially incorrect backfill planning if a reservation node is replaced mid-cycle.
  • Cover rare edge case in job queue sorting.
  • Add job priority value to SLURM_RESUME_FILE.
  • sbatch/srun/salloc - Make --gres=gpu:N and --gpus-per-node mutually exclusive.
  • switch/hpe_slingshot - Add SwitchParameters=fm_authdir_ctld option.
  • slurmd - Support POSIX signal SIGPROF to log debug state.
  • slurmd - Increase default conmgr_max_connections from 50 to 512 to avoid connections being deferred on nodes with high...
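
The sacctmgr change above makes --quiet/-Q and --verbose/-v mutually exclusive and exits immediately when both are given. The same command-line behavior can be sketched with Python's argparse; this is only an analogy, not how sacctmgr (a C program) implements it, and the program name "demo" is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where -Q/--quiet and -v/--verbose cannot be combined."""
    parser = argparse.ArgumentParser(prog="demo")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-Q", "--quiet", action="store_true",
                       help="suppress log messages")
    group.add_argument("-v", "--verbose", action="count", default=0,
                       help="increase log verbosity")
    return parser
```

Calling build_parser().parse_args(["-Q", "-v"]) prints a usage error and raises SystemExit, while either option on its own parses normally.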

v25.11.5

14 Apr 21:19

Changes in 25.11.5

  • slurmctld - Prevent crash when deleting the only node in the cluster which also belongs to an inactive reservation.
  • Fix assoc corruption on account add race condition.
  • slurmctld - Re-enforce accounting policy limits when updating a job's QOS/assoc/partition.
  • Prevent double call to requeue logic when PrologSlurmctld fails leading to extra records in database.
  • Fix backfill to honor partition OverSubscribe=EXCLUSIVE
  • stepmgr - Avoid leaking MPI ports when jobs that use the stepmgr are allocated nonconsecutive ports.
  • Fix always showing 0 for slurm_cpus_alloc, slurm_nodes_alloc and slurm_memory_alloc in the metrics/jobs endpoint.
  • Fix BPF token support compilation on systems with glibc >= 2.36 by using <sys/mount.h> where available instead of <linux/mount.h>.
  • Fix a regression in 25.11.0 that could cause a bounded hang after hitting conmgr_max_connections.
  • Fix Insufficient Size error in NVML library call for long gpu names.
  • slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
  • slurmctld - Reject all job submissions as reserved user or group nobody(99).
  • sbatch,srun,salloc - Reject arg --uid=99.
  • sbatch,srun,salloc - Reject arg --gid=99.
  • Jobs that complete quickly will not be marked as runaway.
  • Correctly identify whether a job is in the DB.
  • slurmctld - Avoid possible race condition during shutdown that could cause a crash in the HTTP handling logic.
  • slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • Fix race condition with cgroups not migrating slurmd process quickly, which caused EBUSY errors on startup.
  • Fix slurmd reconfigure failure with cgroup/v2.
  • Fix a regression added in 25.05.0 concerning how the slurmctld inherits /run/slurmctld/sack.socket when using AuthType=auth/slurm to prevent clients that connected during a reconfigure from hanging indefinitely.
  • slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
  • Avoid failure for spank options that do not require arguments.
  • Allow archive load of qos_usage tables
  • namespace/linux - fix memory leak in slurmstepd when namespace_p_recv_stepd() fails.
  • namespace/linux - Fix potential crash on failure if mmap() or sem_init() fails during namespace construction.
  • namespace/linux - fix unlikely error that could cause sigkill to be sent to a job during shutdown.
  • namespace/linux - fix failure to detect namespace setup problems when launching a job.
  • Fix slurmctld crash when querying the metrics endpoint after a partition is deleted with finished jobs still present.
  • reservations - Fix creation with NodeCnt and Flags=IGNORE_JOBS failing when partition nodes are occupied.
  • cons_tres - Prevent slurmctld SIGFPE during node selection.

v25.11.4

12 Mar 20:59

Changes in 25.11.4

  • slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
  • Prevent "period_start should already be set" errors when purging slurmdbd data and fix file names for archives of purged slurmdbd data.
  • Skip x11 shutdown when x11 functionality was not requested.
  • Fix build errors with recent versions of libcurl (8.16+).
  • Fix scrun segfault with step_mgr and if environment is set.
  • Fix two memory leaks located in the job info struct.
  • Fix sacct not accepting -R flag.
  • switch/nvidia_imex - Fix parsing of --network=unique-channel-per-segment option.
  • topology/block - Fix parsing of --network=unique-channel-per-segment option.
  • Fix compile errors building against glibc-2.43
  • Prevent potential race that could cause process/script completion to go undetected. In the case of prolog/epilog, this would leave jobs stuck in CG state on nodes running many concurrent jobs. In the case of --get-user-env, it may time out resulting in jobs being requeued and held.
  • switch/nvidia_imex - fix use-after-free when switch plugin debug logging is enabled.
  • Fix bad umask() if switch/nvidia_imex fails to initialize.
  • switch/nvidia_imex - fix memory leak if imex_dev_major is set.
  • switch/nvidia_imex - fix potential memory leaks when unpacking the jobinfo structure.
  • switch/nvidia_imex - prevent job from starting when imex channel allocation fails.
  • When bf_continue is set, prevent backfill from potentially ending its cycle early due to the reason "System state changed" because of a node state change.
  • Fix underflow in GRES selection when RestrictedCoresPerGPU is configured and the job is exclusive.
  • Fix race on reconfigure that caused slurmctld to crash.
  • Docs - Update the version constraints for libjwt to reflect the fact that only 1.x may be used with Slurm.
  • Fix case when using sacctmgr where user assoc failed to be removed when removing an account with parent specified.
  • cgroup/v2 - Fix issue which caused memory.peak to be inconsistently used.
  • Prevent flex reservations from taking nodes from other reservations if those reservations do not request full nodes.
  • Fix slurmctld crash situation with srun --overcommit.
  • Add log message to notify the user of queries that are too large.
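
The slurmrestd ExecReload removal above follows from ordinary POSIX signal behavior: a process that installs no SIGHUP handler is terminated by the signal's default action. The sketch below demonstrates this with a throwaway Python child process standing in for the daemon. It assumes a POSIX system, and the helper names are hypothetical.

```python
import signal
import subprocess
import sys


def _default_sighup() -> None:
    # Runs in the child before exec: force the default SIGHUP action even if
    # the parent was started with SIGHUP ignored (for example under nohup).
    signal.signal(signal.SIGHUP, signal.SIG_DFL)


def exit_status_after_sighup() -> int:
    """Start a child that has no SIGHUP handler, signal it, return its status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        preexec_fn=_default_sighup,
    )
    child.send_signal(signal.SIGHUP)
    # A negative return code means the child was killed by that signal number.
    return child.wait(timeout=10)
```

The returned status equals -signal.SIGHUP, meaning the child was killed by the signal rather than reloading anything, which is why a reload action that sends SIGHUP to such a daemon would stop the service.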

v25.05.7

12 Mar 20:58

Changes in 25.05.7

  • Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
  • slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
  • Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
  • Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
  • slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
  • Fix compile errors building against glibc-2.43
  • Fix race on reconfigure that caused slurmctld to crash

v25.11.3

19 Feb 22:13

Changes in 25.11.3

  • Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
  • Fixed issue where RestrictedCoresPerGPU with shared gres are limited to using restricted cores on one job per sharing gres.
  • slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
  • Fix "sacctmgr show conf" to properly display CommitDelay in seconds instead of as a boolean.
  • Fix cron/requeued jobs being incorrectly reported as runaway
  • slurmctld - Prevent the double-removal of accounting usage for jobs being requeued that are in the COMPLETED or COMPLETING state.
  • When deleting a QOS from the DB, also remove it from partition QOS, AllowQOS and DenyQOS fields.
  • Fixed bug that could cause the detected CPU count to be lower than actual available CPU count. This bug could have resulted in the default value for conmgr_threads being lower than the number of available CPUs in sackd, scrun, slurmctld, slurmscriptd, slurmd, slurmstepd, slurmdbd, and slurmrestd when the assigned CPUs are not sequential.
  • slurmdbd - Prevent the following slurmdbd.conf options from overriding the default values of any in the list not specified: AllowNoDefAcct, AllResourcesAbsolute, DisableCoordDBD, DisableArchiveCommands.
  • salloc/sbatch - Nesting a non-stepmgr salloc or sbatch inside an existing job allocation that enabled the stepmgr will no longer result in the inner job's steps failing to launch.
  • Prevent slurmd -G from initializing sack processing thread.
  • Added SLURM_CLUSTER_NAME, SLURM_JOB_ACCOUNT and SLURM_JOB_GROUP environment variables when a step is launched.
  • slurmctld - Prevent marking external nodes as being unresponsive when reconfiguring if SlurmctldParameters=enable_configless is used.
  • Fix potential segfault when attempting to look up the controller address via DNS in configless mode.
  • Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
  • slurmrestd - Avoid memory leak on authentication failures with invalid bearer tokens.
  • Fix potential deadlock in _x11_signal_handler() during stepd_cleanup().
  • slurmctld - Fix reservations AllowedPartitions logic leading to incorrect purge of valid reservations in some use-cases.
  • slurmctld - Avoid persistent connection hangs when enable_async_reply is configured.
  • Prevent potential controller segfault when reconfiguring after gres file updates.
  • Reparent slurmd to a subcgroup to avoid conflicting with systemd.
  • Fix sprio regression not handling comma separated list of jobids.
  • slurmctld,slurmd - Fix memory leak when container ID is populated.
  • slurmd - Fix P-core detection on processors with varying P-core frequencies and in cpuset-restricted environments.
  • namespace/linux - add disable_bpf_token option.
  • slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero.
  • slurmctld - Avoid expedited requeue of jobs while waiting for job epilog script to complete.
  • slurmctld - Prevent removing cloud nodes from the topology when putting them in the POWERED_DOWN state if they are present in topology.conf or topology.yaml and their node configuration did not specify the Topology option.
  • interfaces/topology - When modifying a node's topology with the Topology option in slurm.conf or the slurmd --conf Topology, change the topology to fully match the new topology.
  • slurmctld - Allow changes to topology.conf or topology.yaml, and slurm.conf node configuration Topology option to take effect on a reconfigure or restart when power saving is enabled.
  • slurmctld - Prevent backfill from combining future timeslots if they have different license reservations.
  • Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
  • slurmdbd - Avoid race condition that could cause a hang during shutdown when incoming connection fails.
  • slurmdbd - Avoid crash during shutdown due to sacctmgr shutdown request.
  • Fix slurmctld assertion when using "enable_async_reply" and certmgr is used for a TLS enabled cluster.
  • Fix potential slurmd process leak when handling --get-user-env.
  • slurmctld - Avoid race condition that could cause the StateSaveLocation updates to be missed during shutdown.
  • slurmctld - Avoid race condition that could cause slurmctld to hang during shutdown before updating StateSaveLocation.
  • slurmctld - Avoid race condition that could cause shutdown to wait on the wrong thread.
  • Fix handling of 0 node test allocations in topology/block.
  • slurmctld - In backfill, prevent unnecessarily testing jobs at future times using the select plugin if it is guaranteed to fail.
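
The CPU-count fix above concerns processes whose assigned CPUs are not sequential. The notes do not describe the faulty logic, so the sketch below uses a hypothetical stand-in: a counter that walks CPU ids upward from 0 and stops at the first gap, compared with simply counting the assigned ids. On Linux, the assigned set could come from os.sched_getaffinity(0); a fixed sample is used here so the example is portable.

```python
from typing import Iterable


def count_contiguous_from_zero(cpus: Iterable[int]) -> int:
    """Hypothetical undercounting approach: stop at the first missing CPU id."""
    cpu_set = set(cpus)
    count = 0
    while count in cpu_set:
        count += 1
    return count


def count_assigned(cpus: Iterable[int]) -> int:
    """Count every assigned CPU id, regardless of gaps in the numbering."""
    return len(set(cpus))


# Sample affinity mask with a gap: CPUs 0, 1, 4 and 5 are assigned.
sample_cpus = [0, 1, 4, 5]
```

For the sample mask, the gap-sensitive walk reports 2 CPUs while the plain count reports 4. A default such as conmgr_threads derived from the smaller number would be lower than the CPUs actually available.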

v25.11.2

26 Jan 19:20

Changes in 25.11.2

  • slurmstepd - Revert regression that would apply job environment to container runtime invocation.
  • Fix issue where reservations may start while required GRES resources are still being used by jobs.
  • Fix slurmctld segfault when using --consolidate-segments.
  • Expose slurm.CONSOLIDATE_SEGMENTS flag in lua.
  • Expose the job record's segment_size in lua.
  • job_submit/lua - Expose the job_desc's segment_size in lua.
  • Prevent PMIx 5.0.8 and 5.0.9 clients from hanging when connecting to the PMIx server.
  • Clarify warning when BPF tokens are not supported.
  • slurmctld - Ensure we close already accepted conn before RPC flush check
  • slurmctld - Fix rpc_queue feature causing statesave corruption during shutdown
  • slurmctld - Ensure backfill has finished before saving state.
  • slurmctld - Ensure main scheduler has finished before saving state.
  • slurmctld - Fix error message while shutting down and state cannot be saved.
  • Fix slurmctld double free that occurs when purging array jobs from memory only when using the topology/block plugin.
  • Fix steps being rejected inside a batch job when using --cpus-per-task and --mem-per-cpu, and the job was submitted to multiple partitions, but not all of them had the same MaxMemPerCPU limit in place.
  • slurmctld - Fix crash after failed reconfiguration while running jobs and priority/multifactor enabled.
  • slurmctld - Fix jobs' QOS/association usage leading to potential underflow errors after a failed reconfiguration attempt.
  • Guess NodeName with gethostname instead of gethostname_short
  • Fix allowing job submissions when EnforcePartLimits=NO and the requested minimum number of nodes exceeds the total nodes in the specified partition(s).
  • Fix double unlock issue in _slurm_rpc_job_sbcast_cred()
  • srun - fix bug where some input/output/error filename format identifiers were not expanded.
  • Fix detecting restricted cores with SlurmdSpecOverride in nodes with more than one socket.
  • slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
  • Fix average calculation in latency timers to show more accurate timing logs.
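
The NodeName change above swaps a short hostname for the full one. The notes give no rationale, so the following is only a sketch of how the two forms differ, assuming "short" means everything before the first dot. The helper name and the sample hostname are hypothetical.

```python
import socket


def short_hostname(name: str) -> str:
    """Return the part of a hostname before the first dot."""
    return name.split(".", 1)[0]


# Full name as reported by the operating system for this machine.
full_name = socket.gethostname()

# Hypothetical fully qualified name, used to show the difference.
example_full = "node01.cluster.example"
example_short = short_hostname(example_full)
```

Here example_short is "node01". If node names in a configuration were written in their fully qualified form, a guess based on the short form would not match them, whereas the unmodified name would.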

v25.05.6

26 Jan 19:18

Changes in 25.05.6

  • Updating a job's qos will always replace the previous timelimit with the new qos' timelimit, unless another time limit is explicitly specified in the update command.
  • slurmctld - Prevent memory corruption when fanning out messages to the slurmds if TreeWidth is greater than or equal to 46341 and the number of nodes in the cluster is greater than or equal to (TreeWidth + 1).
  • Fix slurmctld potential deadlock when trying to schedule jobs starting many years in the future. Slurm only supports one year time limits.
  • Fix accounting for memory on steps without pids, like the extern step, which caused them to be killed if OvermemoryKill was set.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.42/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.43/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.44/job/submit'.
  • slurmctld - Fixed segfault when running configless and a malformed REQUEST_CONFIG RPC is received.
  • slurmctld - Fixed segfault when using newly added remote licenses.
  • Fix memory leak on slurmctld for jobs that use --exclusive=topo
  • Fix double unlock issue in _slurm_rpc_job_sbcast_cred()
  • slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
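
The TreeWidth threshold of 46341 in the memory-corruption fix above is the smallest integer whose square no longer fits in a signed 32-bit integer. The notes do not state the root cause, so linking the bug to such an overflow is an assumption. The sketch below only verifies the arithmetic, and wrap_int32 is a hypothetical helper that mimics two's-complement wraparound.

```python
import math

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def wrap_int32(value: int) -> int:
    """Reduce an integer to the value a signed 32-bit variable would hold."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


# Largest width whose square still fits, and the first one that does not.
largest_safe = math.isqrt(INT32_MAX)
first_unsafe = largest_safe + 1

safe_square = wrap_int32(largest_safe * largest_safe)
wrapped_square = wrap_int32(first_unsafe * first_unsafe)
```

Here largest_safe is 46340 and first_unsafe is 46341. The square of 46340 is 2147395600 and survives unchanged, while the square of 46341 is 2147488281 and wraps to -2147479015.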

v25.11.1

26 Jan 19:19

Changes in 25.11.1

  • data_parser/v0.0.41 - Prevent memory leaks when freeing parsed lists.
  • data_parser/v0.0.42 - Prevent memory leaks when freeing parsed lists.
  • data_parser/v0.0.43 - Prevent memory leaks when freeing parsed lists.
  • data_parser/v0.0.44 - Prevent memory leaks when freeing parsed lists.
  • slurmctld - Prevent a fatal when min_exempt_priority is not the last option listed in PreemptParameters.
  • Updating a job's qos will always replace the previous timelimit with the new qos' timelimit, unless another time limit is explicitly specified in the update command.
  • When debugflags=script is set in slurm.conf, Lua runtime error message will be logged with backtrace.
  • slurmctld - Prevent memory corruption when fanning out messages to the slurmds if TreeWidth is greater than or equal to 46341 and the number of nodes in the cluster is greater than or equal to (TreeWidth + 1).
  • When GrpTRES and MaxTRESPU are set on different QOSes and both QOSes are applied to a job, ensure that both limits are honored.
  • Fix issue where a cli command or process could get stuck indefinitely when trying to retrieve a slurm.conf from slurmctld.
  • Fix slurmctld potential deadlock when trying to schedule jobs starting many years in the future. Slurm only supports one year time limits.
  • Fix pam_slurm_adopt when using namespace/linux plugin.
  • topology/tree - Prevent overflow error when calculating fanout depth.
  • The state string for nodes in the MIXED+FAIL state will now appear as "FAILING" rather than just "FAIL", similar to what is already done for nodes in the ALLOCATED+FAIL state.
  • slurmctld - Prevent a divide by zero crash by fataling if the following SlurmctldParameters have a value of less than or equal to 0: rl_table_size, rl_bucket_size, rl_refill_rate, and rl_refill_period.
  • Fix missing updates to reservation TRES and accounting when node(s) are replaced due to REPLACE or REPLACE_DOWN flags.
  • slurmctld - Cancel interactive job if prolog RPC never reaches its receiver.
  • slurmctld - Cancel interactive jobs that never ran the prolog in the purge jobs logic.
  • Fix accounting for memory on steps without pids, like the extern step, which caused them to be killed if OvermemoryKill was set.
  • NO_NORMAL_ALL will only be printed if all NO_NORMAL_* flags are set.
  • slurmctld - Prevent the controller from believing it has a job's federation cluster lock when it does not.
  • Fix jobs incorrectly stuck waiting for resources when launched with specific client flag combinations containing "--hint=nomultithread".
  • Fix allocated licenses still showing after removing all allocated licenses.
  • accounting_storage/mysql - Disallow creating users if requested user list is empty or usernames are empty strings.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.42/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.43/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.44/job/submit'.
  • slurmrestd - Revert regression that changed the error from "Authentication failure" to "Authentication does not apply to request" when an HTTP request lacks any authentication credentials.
  • When a job requests multiple partitions and cannot run in one of them due to topology, allow the main scheduler to evaluate jobs in the other requested partitions.
  • slurmctld - Acquire the node write lock instead of the node read lock when querying 'GET /metrics/nodes' and 'GET /metrics/partitions' endpoints.
  • slurmctld - Fixed segfault when running configless and a malformed REQUEST_CONFIG RPC is received.
  • Remove error output for missing optional spank plugin.
  • slurmctld - When unable to schedule a job with preferred node features, don't exclude the partition from further scheduling attempts in the same iteration.
  • Fix issue with RestrictedCoresPerGPU with shared gres.
  • Fix rpmbuild --with libcurl option.
  • Add new JobAcctGatherParams=no_file_cache to change how memory usage (RSS) is reported when using cgroup/v2. With this flag set, active_file and inactive_file are subtracted from the value reported in memory.current to avoid counting the file cache. memory.peak is then not used to obtain MaxRSS, so capturing memory spikes depends on the JobAcctGatherFrequency parameter.
  • namespace/linux - Fix bug that could leave defunct processes in the job's namespace.
  • namespace/linux - kill and reap the namespace process during job teardown.
  • namespace/linux - Fix issue with user_ns_script that may result in STDIN closing, which may result in 'Unable to receive "ok ack"' error on slurmstepd or other undefined behavior.
  • Fix error reading /proc/0/* when calling the API outside the step namespace.
  • slurmctld - Fixed segfault when using newly added remote licenses.
  • Fix SIGCHLD not being sent to tasks.
  • Fix bitmap2node_name() not being cleaned up properly when reservation logging is enabled.
  • Fix issue with jobs running on slurmds with version 25.05.x or older getting aborted when slurmd re-registers with slurmctld.
  • Fix memory leak in slurmctld for jobs that use --exclusive=topo.
  • Prevent jobs that cannot fit in the reservation's time limit from being attracted to a magnetic reservation.
  • Fix slurmstepd segfault for older versioned batch jobs (25.05 and older) submitted without using -o/--output on submission.
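
The JobAcctGatherParams=no_file_cache entry above describes a subtraction over cgroup/v2 counters. The following is a minimal Python sketch of that arithmetic only, not Slurm's implementation: the helper names and sample values are hypothetical, and clamping the result at zero is an assumption made here for illustration.

```python
# Hypothetical helpers, not part of Slurm: they only illustrate the arithmetic
# described for JobAcctGatherParams=no_file_cache.


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.stat content (one "key value" pair per line)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def rss_without_file_cache(memory_current: int, memory_stat_text: str) -> int:
    """Return memory.current minus active_file and inactive_file, in bytes."""
    stats = parse_memory_stat(memory_stat_text)
    file_cache = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    # Assumption: clamp at zero in case the counters were sampled at
    # slightly different moments.
    return max(memory_current - file_cache, 0)


# Made-up sample values, in bytes.
sample_stat = """anon 1048576
file 3145728
active_file 1048576
inactive_file 2097152
"""
sample_current = 4500000

print(rss_without_file_cache(sample_current, sample_stat))  # 1354272
```

Because memory.peak is not used in this mode, a value computed this way reflects only the moment it was sampled, which is why the entry notes that catching memory spikes depends on JobAcctGatherFrequency.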