# Getting started with slurm


## Summary 

In this lab you will learn some of the basics of slurm and a few of the more advanced features that you might find helpful in the future. We will be doing this lab on our class cluster, but everything we do will also work on Stanford's Sherlock system (or any other slurm cluster), often with more interesting results because those are bigger, more complex systems.


# Slurm basics


Slurm is a workload manager. Its job is to provide cluster resources efficiently and fairly. A slurm-controlled cluster is usually broken up into partitions, which have different sets of rules on use and are often composed of different hardware.  You use slurm resources by making a request to the slurm controller node.  Your request is assigned a priority based on what are referred to as "fairshare" rules. When there are free resources and you have the top priority, you are granted access.

## sinfo

The sinfo command is a very valuable way to find out about a cluster. Using the command without any options will give you a general idea about the cluster you are using.


In [10]:
!sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite     11  idle~ slurm-gp257-devel-compute-0-[1-11]
debug*       up   infinite      1  alloc slurm-gp257-devel-compute-0-0
cpu          up   infinite      5  idle~ slurm-gp257-devel-compute-1-[0-4]
t4           up   infinite      9  idle~ slurm-gp257-devel-compute-2-[1-9]
t4           up   infinite      1  alloc slurm-gp257-devel-compute-2-0
preempt      up   infinite     49  idle~ slurm-gp257-devel-compute-3-[1-49]
preempt      up   infinite      1  alloc slurm-gp257-devel-compute-3-0


By default the sinfo command gives you the names of the various partitions, the number of nodes in each partition, their state, and the names of the nodes in each partition.  This represents a very small part of the sinfo command's capabilities. For example, I can find the number of cores on a given node using the -n (for a specific node) and -O (for specific information) flags.

In [11]:
!sinfo -n slurm-gp257-devel-compute-0-0 -O CORES -O DISK

TMP_DISK            
0                   
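Several fields can be requested in one call by joining them into a single comma-separated `-O` argument. The sketch below only builds the argument list and does not call Slurm; `build_sinfo_command` is a hypothetical helper name, and the field names are the ones used elsewhere in this lab.

```python
def build_sinfo_command(node_name, fields):
    """Return an argument list for sinfo that requests several -O fields at once.

    build_sinfo_command is an illustrative helper, not part of Slurm.
    """
    # sinfo expects the -O fields as one comma-separated list.
    return ["sinfo", "-n", node_name, "-O", ",".join(fields)]


cmd = build_sinfo_command("slurm-gp257-devel-compute-0-0", ["Cores", "Disk"])
print(" ".join(cmd))
```

On a machine where sinfo is installed, the list could be handed to `subprocess.run(cmd, capture_output=True, text=True)`.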


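Note that the output above contains only the disk column: when -O is given twice, only the last one appears to take effect, so several fields should be passed as one comma-separated list (for example -O Cores,Disk).

Below are some additional useful options.  Generic resources (Gres) refers to GPUs among other things.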


| Option | Description |
| :--- | ------- |
| Cores | Number of cores per socket. |
| Disk | Size of temporary disk space per node in megabytes. |
| FreeMem | The total memory, in MB, currently free on the node as reported by the OS. This value is for informational use only and is not used for scheduling. |
| Gres | Generic resources (gres) associated with the nodes. |
| GresUsed | Generic resources (gres) currently in use on the nodes. |
| Memory | Size of memory per node in megabytes. |
| PreemptMode | Preemption mode. |
| Threads | Number of threads per core. |
| Time | Maximum time for any job in the format "days-hours:minutes:seconds". |


In the cell below, build a table describing the nodes on the cluster using the above command.

In [12]:
def print_cluster_info(cluster_name, node_start, node_end):
    options = ['Cores', 'Disk', 'FreeMem', 'Gres', 'GresUsed', 'Memory', 'Threads', 'Time']
    info = '\t\t\tINFO: '
    for opt in options:
        info += f"\t{opt:8}" 
    print(info)
    
    
    for inode in range(node_start, node_end+1):
        node_name = cluster_name + f'-{inode}'
        node_info = node_name
        for opt in options:
            out = !sinfo -n $node_name -O $opt
            node_info += f"\t{out[1].strip():8}"
        print(node_info)

print_cluster_info('slurm-gp257-devel-compute-0', 0, 11)

			INFO: 	Cores   	Disk    	FreeMem 	Gres    	GresUsed	Memory  	Threads 	Time    
slurm-gp257-devel-compute-0-0	4       	0       	29210   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-1	4       	0       	29726   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-2	4       	0       	30788   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-3	4       	0       	30753   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-4	4       	0       	30752   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-5	4       	0       	30752   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-6	4       	0       	30761   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-7	4       	0       	30752   	(null)  	gpu:0   	31408   	1       	infinite
slurm-gp257-devel-compute-0-8	4       	0       	30748   	(null)  	gpu:0   	31408   	1       	infinite


## squeue

The squeue command gives us information about what is actually running on the cluster.  Often it is useful to use the -p partition option to reduce the amount of information it produces.


In [13]:
! squeue -p debug

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1947     debug     bash haipeng_  R       6:45      1 slurm-gp257-devel-compute-0-0


The output lists all the jobs that have been submitted to the cluster (here restricted to the debug partition). It displays the job ID, the partition the job was submitted to, the job name, the user, the job's status, how long it has been running, the number of nodes, and the nodes the job is running on.  Below is the list of possible statuses.


|Status|Code|Explanation|
| ----- | ------ | ----- |
|COMPLETED|CD|The job has completed successfully.|
|COMPLETING|CG|The job is finishing but some processes are still active.|
|FAILED|F|The job terminated with a non-zero exit code and failed to execute.|
|PENDING|PD|The job is waiting for resource allocation. It will eventually run.|
|PREEMPTED|PR|The job was terminated because of preemption by another job.|
|RUNNING|R|The job currently is allocated to a node and is running.|
|SUSPENDED|S| A running job has been stopped with its cores released to other jobs.|
|STOPPED|ST|A running job has been stopped with its cores retained.|

Together, squeue and sinfo provide a powerful tool if you can't figure out why your job isn't starting (or fails). Some examples include:
- If you are requesting a job that uses a number of nodes, you can see how many nodes are currently in use. Are they all being used by one user?
- If your job will not start, are you requesting more memory or more processors than are available on the partition?
- Are other users currently using up all of the memory or GPUs?
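Questions like "is it all one user?" can be answered by post-processing the squeue listing. The following is a minimal sketch, assuming the default eight-column layout shown in this lab; `parse_squeue` is a hypothetical helper, and the sample text is copied from a squeue call in this notebook.

```python
from collections import Counter

# Short state codes from the table above.
STATE_NAMES = {
    "CD": "COMPLETED", "CG": "COMPLETING", "F": "FAILED", "PD": "PENDING",
    "PR": "PREEMPTED", "R": "RUNNING", "S": "SUSPENDED", "ST": "STOPPED",
}

SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1947     debug     bash haipeng_  R       9:15      1 slurm-gp257-devel-compute-0-0
               881   preempt     bash tculliso  R   18:50:18      1 slurm-gp257-devel-compute-3-0
               529        t4     bash clapp_st  R 1-17:35:35      1 slurm-gp257-devel-compute-2-0
"""


def parse_squeue(text):
    """Parse default squeue output into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    jobs = []
    for ln in lines[1:]:
        # Limit the split so a reason containing spaces stays in the last column.
        values = ln.split(None, len(header) - 1)
        job = dict(zip(header, values))
        job["STATE"] = STATE_NAMES.get(job["ST"], job["ST"])
        jobs.append(job)
    return jobs


jobs = parse_squeue(SAMPLE)
nodes_per_user = Counter()
for job in jobs:
    if job["STATE"] == "RUNNING":
        nodes_per_user[job["USER"]] += int(job["NODES"])
print(dict(nodes_per_user))
```

This assumes job names and user names contain no spaces; a more robust version would ask squeue for an explicit output format.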


## salloc, sbatch, and srun

There are two general ways to request slurm resources: salloc and sbatch.  The easiest way to remember the difference is to use salloc when you want to do something interactively and sbatch when you want to submit a job that runs eventually.  srun is used to launch jobs on these allocated resources but can also be used directly (in which case it requests an allocation behind the scenes).

The command you used to start this lab is a simple example:

srun --pty --x11 /bin/bash

Here we requested a node on the default partition, with the default amount of memory, for the default amount of time, with a single core, said we wanted to run in terminal mode (--pty) with X11 forwarding (--x11), and ran the command bash.

In almost all cases you want to specify the resources you want to use and the time you want to use them for. The primary reason is that the defaults are often not appropriate: you may need more memory, more cores, etc. Almost as important is that it can often give you an advantage with the scheduler. If your job doesn't need much memory while another user is using most of the memory on a node, your job will start, while someone using the default might not. If the scheduler is holding resources for a job that requires many cores and you specify that your job will only run for a few minutes, the scheduler can run your job while waiting for other jobs to finish.





# Specifying job parameters (the basics)

We can specify slurm configuration parameters on the command line or, when using sbatch, in the submission shell file by beginning a line with #SBATCH followed by the option.

We can specify the partition using the --partition=partName flag and the length of a job with --time=days-hours:minutes:seconds.  We can request GPUs with the gres flag --gres=gpu[[:type]:number].
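The time format is easy to misread, so it can help to convert it to seconds. The sketch below handles only the `days-hours:minutes:seconds`, `hours:minutes:seconds`, and `minutes:seconds` forms that appear in this lab; `slurm_time_to_seconds` is a hypothetical helper, and other forms are rejected rather than guessed at.

```python
def slurm_time_to_seconds(text):
    """Convert [days-]hours:minutes:seconds or minutes:seconds to seconds."""
    days = 0
    has_days = "-" in text
    if has_days:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    if len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2 and not has_days:
        hours = 0
        minutes, seconds = parts
    else:
        # Other layouts are deliberately not handled in this sketch.
        raise ValueError(f"unsupported time format: {text!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


for value in ("02:00:00", "1-17:35:35", "6:45"):
    print(value, "->", slurm_time_to_seconds(value), "seconds")
```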


## Specifying tasks

Slurm thinks of a job as having one or more tasks. These task(s) run on one or more nodes.  We specify the number of tasks using --ntasks=ntasks.  If each parallel task uses multiple cores we can use --cpus-per-task=ncpus.  We can also specify the number of tasks per core (--ntasks-per-core=ntasks) or per node (--ntasks-per-node=ntasks).
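As a rough illustration of how these numbers combine, the sketch below computes the total CPU count and the smallest number of nodes a request could fit on. It assumes that all of a task's CPUs sit on one node and that every node offers the same number of cores; the value of 4 cores per node is an assumption for this example, and `cpus_and_nodes` is a hypothetical helper.

```python
import math


def cpus_and_nodes(ntasks, cpus_per_task, cores_per_node):
    """Return (total CPUs requested, minimum number of nodes needed)."""
    total_cpus = ntasks * cpus_per_task
    # A task is assumed to stay on one node, so count whole tasks per node.
    tasks_per_node = cores_per_node // cpus_per_task
    if tasks_per_node == 0:
        raise ValueError("a single task needs more cores than one node has")
    return total_cpus, math.ceil(ntasks / tasks_per_node)


total, nodes_needed = cpus_and_nodes(ntasks=6, cpus_per_task=2, cores_per_node=4)
print(total, "CPUs on at least", nodes_needed, "nodes")
```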

## Memory

There are a number of ways to specify memory. We can specify the memory required per node with --mem=<size>[units]. We can also specify memory per CPU with --mem-per-cpu=<size>[units] and per GPU with --mem-per-gpu=<size>[units].
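To see what a per-CPU request adds up to, the size string can be converted to megabytes and multiplied by the CPU count. This is a sketch under stated assumptions: units are K, M, G, or T with megabytes as the default, and an optional trailing "B" is tolerated because the submission scripts in this lab write `1GB`. `mem_to_mb` is a hypothetical helper.

```python
import re

# Multipliers to megabytes (binary units are assumed here).
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def mem_to_mb(text):
    """Convert a size such as '512', '1G' or '1GB' to megabytes."""
    match = re.fullmatch(r"(\d+)(?:([KMGT])B?)?", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised memory size: {text!r}")
    size, unit = match.groups()
    return int(size) * _UNIT_TO_MB[unit or "M"]


# Total for a job that asks for 2G per CPU on 4 CPUs.
total_mb = mem_to_mb("2G") * 4
print(total_mb, "MB in total")
```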



# Simple submission

In this part of the lab you will write a simple program to calculate pi using Leibniz’s formula.


X = 4 - 4/3 + 4/5 - 4/7 + 4/9 - ....

This series never ends; the more terms it contains, the closer the value of X gets to pi.
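How quickly the series converges can be checked numerically. Because the terms alternate in sign and shrink, the error after n terms is at most the size of the first omitted term, 4/(2n+1). The sketch below prints the estimate, the actual error, and that bound for a few values of n; `leibniz_partial_sum` is an illustrative name.

```python
import math


def leibniz_partial_sum(n_terms):
    """Sum the first n_terms of 4 - 4/3 + 4/5 - 4/7 + ..."""
    total = 0.0
    sign = 1.0
    for k in range(n_terms):
        total += sign * 4.0 / (2 * k + 1)
        sign = -sign
    return total


for n in (10, 1000, 100000):
    estimate = leibniz_partial_sum(n)
    error = abs(math.pi - estimate)
    bound = 4.0 / (2 * n + 1)  # size of the first omitted term
    print(n, estimate, error, bound)
```

The slow convergence (the error shrinks roughly in proportion to 1/n) is why the exercise sums a large number of terms.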

Use the next three cells to

- write and save a python script that calculates pi using the Leibniz formula. You should sum 1000000 terms.
- write an sbatch script that
    - submits a job to the compute partition
    - sets a time limit of 2 hours for the job
    - requests one GB of memory
    - runs on a single core
- submit the job to the cluster


In [14]:
%%writefile leib.py
#!/usr/bin/env python3

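ite_num = 1000000

# Leibniz series: pi = 4 - 4/3 + 4/5 - 4/7 + ...
res = 0
for i in range(ite_num):
    res += 4 * ((-1) ** i) / (2 * i + 1)

# Save the estimate; the with-block closes the file automatically.
with open("result1.txt", "w") as f:
    f.write(f"{res}")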

Overwriting leib.py


In [15]:
%%writefile submit.sh
#!/bin/bash
#SBATCH --job-name=leib
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=1GB

spack load python@3.10.8
python3 ./leib.py

Overwriting submit.sh


In [16]:
!sbatch submit.sh

Submitted batch job 1948


In [23]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1947     debug     bash haipeng_  R       9:15      1 slurm-gp257-devel-compute-0-0
               881   preempt     bash tculliso  R   18:50:18      1 slurm-gp257-devel-compute-3-0
               529        t4     bash clapp_st  R 1-17:35:35      1 slurm-gp257-devel-compute-2-0


# Slurm advanced

Slurm also has a number of advanced options that can be quite useful.  

## Resubmitting

Clusters are always trying to maximize their usage.  As a result they often have a mechanism that allows unused resources to be used for free or at a reduced cost. In general you get to use these resources until someone who owns them, or who has paid or is willing to pay a higher cost, asks for them. At Stanford, the Sherlock cluster's owners queue operates in this manner. If you submit a job to this queue it will use any free resources, but if the owner of a resource requests it your job will be killed. GCP has the concept of preemptible nodes, which operate at a much lower cost, but jobs on them can be killed at any time.

Many applications can take advantage of these resources even though they may be taken away.  They are ideal for jobs that take a short time and don't use many cores. Slurm provides the option --requeue to enable the use of these resources.  A job that is preempted before it completes is automatically put back in the queue.
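A requeued batch job starts its script again from the beginning, so longer jobs benefit from saving their progress. The sketch below shows one possible checkpoint-and-resume pattern on a toy sum; the file name, the JSON layout, and the `stop_after` argument (used here only to simulate a preemption) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoint(n_steps, checkpoint_path, stop_after=None):
    """Sum 0..n_steps-1, saving progress so a requeued run can resume."""
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: continue from it.
        with open(checkpoint_path) as f:
            state = json.load(f)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulate being preempted part-way through
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    first = run_with_checkpoint(100, path, stop_after=40)  # "preempted" run
    second = run_with_checkpoint(100, path)                # "requeued" run
    print(first, second)
```

A real job would checkpoint far less often than every step and would write the file atomically (for example by writing to a temporary name and renaming it).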

## Job arrays

There are many cases where you want to run a series of very similar jobs. Each job could be submitted in sequence, but this can overload the slurm controller.  Slurm job arrays provide an effective way to approach these jobs through the command line option --array. You can submit an array in many forms:

- --array=0-31    # jobs 0 1 2 3 4 5 .. 31
- --array=1,3,5,7 # jobs 1 3 5 7
- --array=1-7:2   # jobs 1 3 5 7

Each task in the array has the environment variable SLURM_ARRAY_TASK_ID set to its corresponding ID.
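To check which task IDs a given specification produces before submitting, the range syntax can be expanded in Python. This is a minimal sketch covering the three forms listed above plus an optional `%limit` suffix, which is ignored; `expand_array_spec` is a hypothetical helper, not a Slurm function.

```python
def expand_array_spec(spec):
    """Expand an --array value such as '0-31', '1,3,5,7' or '1-7:2' into IDs."""
    spec = spec.split("%", 1)[0]  # drop an optional %limit suffix
    ids = []
    for part in spec.split(","):
        step = 1
        if ":" in part:
            part, step_text = part.split(":", 1)
            step = int(step_text)
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        # Slurm ranges include both end points.
        ids.extend(range(start, end + 1, step))
    return ids


for spec in ("0-3", "1,3,5,7", "1-7:2"):
    print(spec, "->", expand_array_spec(spec))
```

Inside a running task, the ID itself would be read with `os.environ["SLURM_ARRAY_TASK_ID"]`.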


## Dependency

Another powerful feature of slurm is the ability to build dependencies, which guarantee that a job doesn't start until a given condition has been met. You can set a dependency using --dependency=<type:job_id[:job_id]> where type is one of the following.


- after:jobid[:jobid...] : job can begin after the specified jobs have started
- afterany:jobid[:jobid...] : job can begin after the specified jobs have terminated
- afternotok:jobid[:jobid...] : job can begin after the specified jobs have failed
- afterok:jobid[:jobid...] : job can begin after the specified jobs have run to completion with an exit code of zero (see the user guide for caveats).
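Chaining jobs by hand means copying the job ID printed by sbatch into the next submission. The sketch below shows how that step could be automated: it pulls the ID out of the "Submitted batch job N" message and builds the matching option string. Both function names are hypothetical, and nothing is submitted here.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def dependency_option(dep_type, job_ids):
    """Build a --dependency option such as --dependency=afterok:1949."""
    return "--dependency=" + dep_type + ":" + ":".join(str(j) for j in job_ids)


first_id = parse_job_id("Submitted batch job 1949")
print(dependency_option("afterok", [first_id]))
```

The resulting string can be placed on the sbatch command line, where it takes precedence over a `#SBATCH --dependency` line inside the script.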



# Complex submission

A second way to calculate pi is to use random numbers. You can estimate pi by
 - choosing two random numbers between 0 and 1
 - checking whether the sum of the squares of those numbers is <= 1
 - repeating this many times; 4 times the fraction of pairs that meet this criterion will be approximately equal to pi
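The procedure works because the points land uniformly in the unit square, and the quarter circle of radius 1 covers pi/4 of it. A minimal sketch of the procedure in a single process, with an explicit seed so the result is repeatable; `estimate_pi` and the sample count are illustrative choices, not part of the assignment.

```python
import random


def estimate_pi(n_samples, seed):
    """Estimate pi from the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)  # private generator, seeded for repeatability
    inside = 0
    for _ in range(n_samples):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


print(estimate_pi(200000, seed=0))
```

The statistical error shrinks only as one over the square root of the number of samples, which is why spreading the samples over many array tasks is attractive.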


Your job is to

- write a program that follows the above procedure to estimate pi using a large number of tests
    - it should seed the random number generator based on the environment variable SLURM_ARRAY_TASK_ID
    - it should write the estimate to a file that uses the SLURM_ARRAY_TASK_ID in its name
- write a second program that reads a series of files with the name pattern above, averages them, and writes the result to result2.txt
- write two slurm submission scripts
    - The first should submit a job array of 10000 to the preempt partition and resubmit if the job fails (make sure you test your code by running it on this node before submitting)
    - The second job should depend on the first job finishing with an exit code of 0 before running the second python script. This job should run on the debug partition

In [24]:
%%writefile random_pi.py
#!/usr/bin/env python3

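import os
import random

# Each array task gets its own ID, which doubles as the random seed.
task_id = os.environ.get('SLURM_ARRAY_TASK_ID')
random.seed(task_id)

# One test per task: is a random point in the unit square inside the quarter circle?
in_circle = 0
x = random.uniform(0, 1)
y = random.uniform(0, 1)
if x**2 + y**2 <= 1:
    in_circle = 1

# Write 0 or 1 to a file named after the task ID.
with open(f"test_{task_id}.txt", "w") as f:
    f.write(f"{in_circle}")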

Overwriting random_pi.py


In [56]:
%%writefile ave_pi.py
#!/usr/bin/env python3

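ntask = 1000
count = 0
# Add up the 0/1 results written by tasks 0 .. ntask-1.
for task_id in range(ntask):
    with open(f"test_{task_id}.txt", "r") as f:
        count += float(f.read())

# Four times the fraction of points inside the quarter circle estimates pi.
res = 4 * (count / ntask)

with open("result2.txt", "w") as f:
    f.write(f"{res}")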

Overwriting ave_pi.py


In [28]:
%%writefile submit1.sh
#!/bin/bash
#SBATCH --job-name=random_pi
#SBATCH --partition=preempt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=1GB
#SBATCH --array=0-1000
#SBATCH --requeue

spack load python@3.10.8
python3 ./random_pi.py  

Overwriting submit1.sh


In [30]:
!sbatch submit1.sh

Submitted batch job 1949


In [57]:
%%writefile submit2.sh
#!/bin/bash
#SBATCH --job-name=random_pi
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=1GB
#SBATCH --dependency=afterok:1949 

spack load python@3.10.8
python3 ./ave_pi.py  

Overwriting submit2.sh


In [58]:
!sbatch submit2.sh

Submitted batch job 2953


In [60]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1947     debug     bash haipeng_  R    1:48:16      1 slurm-gp257-devel-compute-0-0
              2952     debug     bash jdstitt_  R       2:25      1 slurm-gp257-devel-compute-0-2
               529        t4     bash clapp_st  R 1-19:14:36      1 slurm-gp257-devel-compute-2-0


# Finishing up

Add this lab to your class GitHub site after adding all of the files that you have created for this lab.