# SLURM scheduler conveniences
Dask jobqueue can currently scale a cluster within exactly one batch partition. For work sessions to run as smoothly as possible, scientific users need to be able to make an informed decision about which batch partition to use. This Jupyter notebook collects a few tools that are, or might be, useful for exactly that.

## Idle compute nodes

A simple SLURM command that filters the reported node state information down to the number of currently idle nodes per partition.

Problem: On JUWELS, I did not find this very helpful for judging whether nodes can actually be obtained from, e.g., the `batch` partition. In my experiments it was sufficiently reliable only for the `esm` and `devel` partitions. I have not tried to get a structured overview of every partition, though.

In [1]:
sinfo -t idle --format="%9P %.5a %.5D %.5t"

PARTITION AVAIL NODES STATE
batch*       up  1276  idle
devel        up    19  idle
mem192       up   236  idle
esm          up     5  idle
large      down  1276  idle
gpus         up    15  idle
develgpus    up    10  idle
maint        up  1325  idle
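
The table above can be reduced to a ranked list with a small filter. The following is only a sketch: the function name `idle_nodes_by_partition` is made up for this notebook, and it assumes the four-column `sinfo` format used in the cell above. Note that a partition being `up` does not mean the current account is allowed to submit to it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SLURM): reads sinfo-style rows on stdin and
# prints "<partition> <idle nodes>" for partitions that are up, largest first.
idle_nodes_by_partition() {
    # Column 1: partition (a trailing "*" marks the default), 2: availability,
    # 3: node count. The header row is dropped because its column 2 is "AVAIL".
    awk '$2 == "up" { sub(/\*$/, "", $1); print $1, $3 }' | sort -k2,2nr
}

# Usage on a SLURM login node:
# sinfo -t idle --format="%9P %.5a %.5D %.5t" | idle_nodes_by_partition
```

Partitions that are `down` (like `large` above) are dropped, so the list only shows candidates that could accept jobs in principle.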


## Job start-up time estimate
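With `--test-only`, `sbatch` validates the job script and reports an estimate of when the job would start, without actually submitting it.

Problem: This does not take the scheduler's backfilling mechanism into account and is therefore not very accurate either. It is certainly more specific than the idle node count above, though.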

In [2]:
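# Write a minimal job script. Quoting 'EOF' keeps the shell from expanding
# anything inside the here-document.
cat > scheduling.sh << 'EOF'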
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH --cpus-per-task=96
#SBATCH --mem=79G
#SBATCH --test-only
hostname
EOF

In [3]:
date && scontrol show partition | grep PartitionName | cut -f2 -d"=" | \
xargs -I {} bash -c "echo '##### {} ##### '  && \
sbatch --account esmtst --time 00:15:00 --nodes 3 --partition {} scheduling.sh"
printf "" # prevent displaying of xargs non-zero exit codes

Sun Nov  8 20:48:23 CET 2020
##### batch ##### 
sbatch: Job 2950027 to start at 2020-11-08T22:45:23 using 288 processors on nodes jwc01n[016-018] in partition batch
##### devel ##### 
sbatch: Job 2950028 to start at 2020-11-08T20:48:23 using 288 processors on nodes jwc00n[017-019] in partition devel
##### mem192 ##### 
sbatch: Job 2950029 to start at 2020-11-08T20:48:23 using 288 processors on nodes jwc08n[280-282] in partition mem192
##### esm ##### 
sbatch: Job 2950030 to start at 2020-11-08T20:48:24 using 288 processors on nodes jwc00n[000,003,006] in partition esm
##### large ##### 
sbatch: Job 2950031 to start at 2020-11-08T20:48:24 using 288 processors on nodes jwc01n[016-018] in partition large
##### gpus ##### 
allocation failure: Invalid generic resource (gres) specification
##### develgpus ##### 
allocation failure: Invalid generic resource (gres) specification
##### maint ##### 
allocation failure: Invalid account or account/partition combination specified
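
To compare partitions at a glance, the estimates can be extracted and sorted by start time. This is a sketch under assumptions: `estimated_starts` is a hypothetical helper name, and the pattern matches the message wording shown in the output above, which may differ between SLURM versions. The `2>&1` in the usage comment is there in case the messages arrive on stderr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `sbatch --test-only` messages on stdin and prints
# "<estimated start> <partition>", earliest first. Lines that do not match
# (for example "allocation failure: ...") are silently dropped.
estimated_starts() {
    sed -n 's/^sbatch: Job [0-9]* to start at \([^ ]*\) using .* in partition \(.*\)$/\1 \2/p' \
        | sort  # ISO 8601 timestamps sort correctly as plain text
}

# Usage, reusing the loop from the cell above:
# scontrol show partition | grep PartitionName | cut -f2 -d"=" | \
#   xargs -I {} sbatch --account esmtst --time 00:15:00 --nodes 3 \
#     --partition {} scheduling.sh 2>&1 | estimated_starts
```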


## LLview for JUWELS

Problem: Not all batch partitions are covered, and I have sometimes experienced connectivity problems with [the JUWELS LLview web page](https://llview.fz-juelich.de/LLweb/juwels/jobreport/login.php). Also, at least on JUWELS, the remote client approach described [in the docs](https://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html) is currently not working.

## Idle node projection

WIP: Can we tell users in more detail what to expect from any partition during their workday? Of course, the accuracy would depend heavily on the behaviour of other users submitting batch jobs to the system, but it would still be a useful way to get a feeling for system occupancy.
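
One possible first step, sketched here with a hypothetical helper name (`pending_nodes_by_partition`), is to sum the nodes requested by pending jobs per partition. It assumes `squeue` output with the partition in column 1, the job state in column 2 and the node count in column 4, and it says nothing about when those jobs will actually start.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads squeue rows on stdin and prints
# "<partition> <nodes requested by pending jobs>", sorted by partition name.
pending_nodes_by_partition() {
    # Only rows in state PD (pending) are counted; the header row and running
    # jobs are skipped because their column 2 is not "PD".
    awk '$2 == "PD" { nodes[$1] += $4 }
         END { for (p in nodes) print p, nodes[p] }' | sort
}

# Usage on a SLURM login node:
# squeue -o "%.9P %.2t %.10M %.6D %R %e" | pending_nodes_by_partition
```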

In [4]:
# this is only a starting point...
squeue -o "%.9P %.2t %.10M %.6D %R %e" | grep batch

    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependency) N/A
    batch PD       0:00     12 (Dependen