# Slurm
Slurm is a widely used cluster manager and job scheduler. You use it to submit and monitor jobs on an HPC system.

## First contact with Slurm
The basic commands for talking to the cluster are:
- `srun` to run a command directly on a compute node. This is typically used for interactive sessions.
- `squeue` to get info on jobs in the queue (selected by job ID, user, etc.). Useful for monitoring your jobs.
- `sinfo` to get info on the partitions and nodes of the cluster.
- `sbatch` to submit jobs to the cluster. This is what you use to run scripts as batch jobs.

## What does the cluster look like?
We can use `sinfo` to get a glimpse of the cluster structure:
```bash
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
icb_cpu*           up 7-00:00:00     15    mix ibis216-010-[022-023,034-035,051,064,071],ibis216-224-[010-011],icb-neu-[001-003],icb-rsrv[05-06,08]
icb_cpu*           up 7-00:00:00     22  alloc ibis-ceph-[002-006,008-019],ibis216-010-[011-012,020-021,033]
icb_cpu*           up 7-00:00:00     19   idle ibis216-010-[001-004,007,024-032,036-037,068-070]
icb_gpu            up 7-00:00:00      9    mix icb-gpusrv[02-08],supergpu02pxe,supergpu03pxe
icb_gpu            up 7-00:00:00      1   idle icb-gpusrv01
icb_interactive    up   12:00:00      9  down* clara,fonsi,heidi,hias,icb-lisa,icb-mona,icb-sarah,sepp,wastl
icb_interactive    up   12:00:00      1    mix icb-iris
icb_rstrct         up 5-00:00:00      1    mix icb-neu-003
bcf                up 12-00:00:0      1    mix ibis216-010-005
bcf                up 12-00:00:0      1   idle ibis216-010-006
```
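To pull a single number out of that table, for example how many nodes are idle in each partition, the output can be piped through `awk`. This is only a sketch: it assumes the default column layout shown above (partition in column 1, node count in column 4, state in column 5), and the function name `idle_nodes` is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: sum idle nodes per partition from sinfo's default table.
# Assumes the columns PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
idle_nodes() {
  awk 'NR > 1 && $5 == "idle" {
         sub(/\*$/, "", $1)   # the default partition is marked with "*"
         idle[$1] += $4
       }
       END { for (p in idle) print p, idle[p] }' | sort
}

# Only query the scheduler when the Slurm client tools are installed.
if command -v sinfo >/dev/null 2>&1; then
  sinfo | idle_nodes
fi
```

With the table above as input, this would report 19 idle nodes for `icb_cpu` and one each for `icb_gpu` and `bcf`.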

## What are the running jobs?
We can use `squeue` to list them:
```bash
            535882   icb_cpu nf-Veloc thomas.w  R 1-00:59:00      1 ibis216-224-010
            538003   icb_cpu rhapsody emilio.d  R   22:16:26      1 ibis216-010-071
            541083   icb_gpu EMBEDDIN leander.  R      51:45      1 supergpu03pxe
            541090   icb_gpu EMBEDDIN leander.  R      42:29      1 supergpu03pxe
            541091   icb_gpu EMBEDDIN leander.  R      41:46      1 supergpu03pxe
```
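On a busy cluster the full queue is long, so it usually helps to filter it. The sketch below uses the standard `squeue` options `-u` (user), `-p` (partition), `-t` (state), `-h` (no header) and `-o` (output format); the helper `jobs_per_user` is a hypothetical name, not part of Slurm.

```bash
#!/bin/bash
# Hypothetical helper: count how often each user name appears on stdin,
# most frequent first.
jobs_per_user() {
  awk '{ n[$1]++ } END { for (u in n) print n[u], u }' | sort -rn
}

# Only query the scheduler when the Slurm client tools are installed.
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER"               # only my own jobs
  squeue -p icb_gpu -t RUNNING    # running jobs in one partition
  # -h drops the header, -o '%u' prints just the user name of each job
  squeue -h -t RUNNING -o '%u' | jobs_per_user
fi
```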

## How about a specific job?
We can inspect a specific job with `scontrol show jobid [JOBID]`:
```bash
(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sbatch submit_interactive.sh
Submitted batch job 543650
(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sq
            543650   icb_cpu interact giovanni  R       0:00      1 ibis216-010-051
(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ scontrol show jobid 543650
JobId=543650 JobName=interactive
   UserId=giovanni.palla(138707) GroupId=OG-ICB-User(20000) MCS_label=N/A
   Priority=4294048901 Nice=1000 Account=icb-user QOS=icb_stndrd
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2020-09-10T12:01:00 EligibleTime=2020-09-10T12:01:00
   AccrueTime=2020-09-10T12:01:01
   StartTime=2020-09-10T12:01:01 EndTime=2020-09-10T22:01:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-10T12:01:01
   Partition=icb_cpu AllocNode:Sid=vicb-submit-02.scidom.de:24925
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=ibis216-010-051
   BatchHost=ibis216-010-051
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=8G,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=8G MinTmpDiskNode=0
   Features=xeon_6126|opteron_6234|opteron_6376|opteron_6378 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive/submit_interactive.sh
   WorkDir=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive
   StdErr=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive/interactive_543650.err
   StdIn=/dev/null
   StdOut=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive/interactive_543650.out
   Power=
  ```
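The output is a list of `Key=Value` pairs, so a single field is easy to extract in a script. A minimal sketch, assuming the value contains no spaces (true for fields such as `JobState` or `TimeLimit`); `job_field` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value field from
# `scontrol show job` output read on stdin (values without spaces only).
job_field() {
  local key="$1"
  # Put every Key=Value pair on its own line, then print the first match.
  tr -s ' ' '\n' | awk -F= -v k="$key" \
    '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Inside a running job Slurm sets SLURM_JOB_ID, so a job can inspect itself.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
  scontrol show job "$SLURM_JOB_ID" | job_field TimeLimit
fi
```

From the submit node the same helper can be used as `scontrol show job 543650 | job_field JobState`.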

## Establish an interactive slurm session
```bash
srun -p icb_interactive -w ibis216-010-022 -c 1 -t 00:15:00 --mem=200 --pty bash
```

The `--pty` flag runs the command in a pseudo-terminal, which is what makes the session interactive; the command to run comes right after it. In this case, we just want a bash terminal. Another way I often use this is:
```bash
(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ srun -p icb_gpu -w icb-gpusrv03 --pty nvidia-smi
Thu Sep 10 13:11:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31       Driver Version: 440.31       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:65:00.0 Off |                  N/A |
| 61%   83C    P2   147W / 250W |  12005MiB / 12066MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:B3:00.0 Off |                  N/A |
| 62%   83C    P2   140W / 250W |  12005MiB / 12066MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     60099      C   python                                     11993MiB |
|    1     60098      C   python                                     11993MiB |
+-----------------------------------------------------------------------------+
```
This is very useful for a quick look at how efficiently the GPUs are being used.

## Guidelines for srun
In general, always be explicit with the arguments, even though Slurm installations usually have sound default values.
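
One way to make this a habit is a small wrapper that always passes partition, CPUs, memory and time. This is a sketch only: the function name `interactive_shell`, the `DRY_RUN` variable and the default values are assumptions for illustration, not cluster policy.

```bash
#!/bin/bash
# Hypothetical wrapper: request an interactive shell with every resource
# spelled out. The defaults are illustrative only.
interactive_shell() {
  local partition="${1:-icb_interactive}"
  local cpus="${2:-1}"
  local mem="${3:-200M}"
  local time="${4:-00:15:00}"
  local cmd=(srun -p "$partition" -c "$cpus" --mem="$mem" -t "$time" --pty bash)
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"   # show the command instead of running it
  else
    "${cmd[@]}"
  fi
}
```

Calling `DRY_RUN=1 interactive_shell icb_cpu 4 8G 02:00:00` prints the `srun` line without submitting anything, which is handy for checking the request first.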

## More on sbatch
`sbatch` is used to submit scripts as jobs. Usually, you use interactive sessions with `srun` for prototyping, and then switch to `sbatch` for the main computation. The arguments are the same as for `srun`, but they are written as `#SBATCH` directives at the top of the script.

```bash
#!/bin/bash

#SBATCH -o slurm_output.txt
#SBATCH -e slurm_error.txt
#SBATCH -J MyFancyJobName
#SBATCH -p icb_cpu
#SBATCH --nodelist=ibis-ceph-002
#SBATCH -c 1
#SBATCH --mem=2G
#SBATCH -t 00:15:00
#SBATCH --nice=10000

echo "Starting stuff at $(date)"
# You can put arbitrary unix commands here, call other scripts, etc...
sleep 10
echo "Computering..."
# Stay below the 15-minute limit requested with -t, or Slurm kills the job
sleep 600
echo "Ending stuff at $(date)"
```
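
After submission, `sbatch` prints a line like `Submitted batch job 543650`. Capturing that ID makes it easy to monitor or cancel the job afterwards. A minimal sketch, assuming the script above was saved as `my_job.sh` (a made-up file name); `job_id_from` is likewise a hypothetical helper. `sbatch --parsable`, which prints only the ID, is an alternative to parsing the message.

```bash
#!/bin/bash
# Hypothetical helper: extract the numeric ID from sbatch's confirmation line.
job_id_from() {
  sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

# Only submit when sbatch is available and the (assumed) script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f my_job.sh ]; then
  jobid=$(sbatch my_job.sh | job_id_from)
  echo "Submitted job $jobid"
  squeue -j "$jobid"      # check its state in the queue
  # scancel "$jobid"      # uncomment to cancel the job
fi
```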

## Interactive session with sbatch
In a typical data science workflow, you might want to start coding in a Jupyter instance. Here is one way to do it.

Create a script `submit_interactive.sh` that looks like this:
```bash
#!/bin/bash

#SBATCH -o "interactive_%j.out"
#SBATCH -e "interactive_%j.err"
#SBATCH -J interactive
#SBATCH -c 8 # default value is 2
#SBATCH --constraint="xeon_6126|opteron_6234|opteron_6376|opteron_6378"
#SBATCH --mem=8GB
#SBATCH -t 10:00:00
#SBATCH --nice=10000

./run_jupyter.bash -e myenv
```  

and another script `run_jupyter.bash`, that looks like this:
```bash
#!/bin/bash

source ~/.bashrc

# -e <name>: conda environment to activate
while getopts ":e:" opt; do
  case $opt in
    e) env="$OPTARG"
    ;;
    \?) echo "Invalid option -$OPTARG" >&2
        exit 1
    ;;
  esac
done

conda activate "$env"
cd /storage/groups/ml01/workspace/giovanni.palla
jupyter lab --no-browser --ip=0.0.0.0
```

After about 30 seconds, you will find the link to the Jupyter session in the `.err` file:
```bash
(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ cat interactive_543650.err
[I 12:01:26.392 LabApp] JupyterLab extension loaded from /home/icb/giovanni.palla/miniconda3/envs/sfaira/lib/python3.8/site-packages/jupyterlab
[I 12:01:26.392 LabApp] JupyterLab application directory is /home/icb/giovanni.palla/miniconda3/envs/sfaira/share/jupyter/lab
[I 12:01:26.401 LabApp] Serving notebooks from local directory: /storage/groups/ml01/workspace/giovanni.palla
[I 12:01:26.401 LabApp] The Jupyter Notebook is running at:
[I 12:01:26.401 LabApp] http://ibis216-010-051.scidom.de:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690
[I 12:01:26.401 LabApp]  or http://127.0.0.1:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690
[I 12:01:26.401 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:01:26.424 LabApp]

    To access the notebook, open this file in a browser:
        file:///mnt/home/icb/giovanni.palla/.local/share/jupyter/runtime/nbserver-1565-open.html
    Or copy and paste one of these URLs:
        http://ibis216-010-051.scidom.de:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690
     or http://127.0.0.1:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690
```
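
Rather than reading the whole log, the link can be pulled out with `grep`. The sketch below assumes the log format shown above and the `interactive_%j.err` naming from the submit script; `jupyter_url` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the first Jupyter URL in a log file that
# points at the compute node rather than at 127.0.0.1.
jupyter_url() {
  grep -oE 'https?://[^[:space:]]+token=[^[:space:]]+' "$1" \
    | grep -v '127\.0\.0\.1' \
    | head -n 1
}

# Check every log written by the submit script in the current directory.
for log in interactive_*.err; do
  if [ -f "$log" ]; then
    echo "$log: $(jupyter_url "$log")"
  fi
done
```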

## Another command: sacct
`sacct` lists your recent jobs, including finished and cancelled ones:

```bash
(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
543650       interacti+    icb_cpu   icb-user          8    RUNNING      0:0
543650.batch      batch              icb-user          8    RUNNING      0:0
543650.exte+     extern              icb-user          8    RUNNING      0:0
543818       nvidia-smi    icb_gpu   icb-user          2  COMPLETED      0:0
543818.exte+     extern              icb-user          2  COMPLETED      0:0
543818.0     nvidia-smi              icb-user          2  COMPLETED      0:0
```