# Slurm jobs on NeSI GPUs

In this section, we will revisit our first model and try to execute it non-interactively on NeSI's GPUs, using a Slurm job.


## Introduction to HPC

An HPC (high performance computing) platform is a set of nodes linked by a fast interconnect, in order to execute massively parallel computations or many different computations in parallel.

<img src="../images/mahuika_maui_real.png" width=900 />

Each computation is a **job**, which runs non-interactively, for a given period of time, on a set of nodes that provide the amount of CPU, memory and GPU requested by the user.
All jobs are handled by a **scheduler**, which queues them and assigns them to nodes depending on available resources and requests.
NeSI uses **Slurm** (Simple Linux Utility for Resource Management) for this.
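
To make the idea of queueing and assigning jobs more concrete, here is a toy sketch in Python. It is not how Slurm works internally: a real scheduler also accounts for priorities, fair share and time limits. The `Job` and `Node` classes, the node names and the job names are all made up for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    gpus: int


@dataclass
class Node:
    name: str
    free_cpus: int
    free_gpus: int


def schedule(jobs, nodes):
    """Go through the jobs in submission order and place each one on the
    first node with enough free CPUs and GPUs.

    Returns (placements, waiting): placements maps job names to node names,
    waiting lists the jobs that did not fit anywhere.
    """
    queue = deque(jobs)
    placements = {}
    waiting = []
    while queue:
        job = queue.popleft()
        for node in nodes:
            if node.free_cpus >= job.cpus and node.free_gpus >= job.gpus:
                # reserve the resources on this node for the job
                node.free_cpus -= job.cpus
                node.free_gpus -= job.gpus
                placements[job.name] = node.name
                break
        else:
            # no node can host the job right now, so it stays in the queue
            waiting.append(job.name)
    return placements, waiting


# hypothetical cluster and jobs, for illustration only
nodes = [
    Node("cpu-node", free_cpus=4, free_gpus=0),
    Node("gpu-node", free_cpus=4, free_gpus=1),
]
jobs = [
    Job("train-a", cpus=2, gpus=1),
    Job("preprocess", cpus=4, gpus=0),
    Job("train-b", cpus=2, gpus=1),
]

placements, waiting = schedule(jobs, nodes)
print("running:", placements)
print("waiting:", waiting)
```

In this toy example, the second training job has to wait because the only GPU is already in use, which is the situation you may observe later when your job sits in the queue.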

In this workshop, we will submit a job via Slurm to train a model using a node that has a GPU.

## Create a Python script

We will see in a moment how to send a job request to Slurm using a *job submission script*.
Before this, we need to convert our code into a self-contained script that can be run from this job submission script, like a command line executable.
The following cell will write a Python script file containing the code to train the example model from the [Image classification](02_classification.ipynb) notebook.

The last line `model.save("outputs/trained_model_flowers")` ensures that the trained model is saved at the end of the script, here in a folder called `outputs/trained_model_flowers`.

In [None]:
%%writefile scripts/train_model.py
import pathlib
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt

# define the hyper-parameters of the model
batch_size = 32
epochs = 15

# load training dataset and split it in train, validation and test sets
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file("flower_photos", origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

img_height = 180
img_width = 180
num_classes = 5

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take(val_batches // 2)
val_ds = val_ds.skip(val_batches // 2)

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

# model definition, using data augmentation during training
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal", input_shape=(img_height, img_width, 3)),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ]
)

model = Sequential(
    [
        data_augmentation,
        layers.Rescaling(1.0 / 255),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes),
    ]
)

model.summary(line_length=80)

# compile and train the model
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)

# evaluate the model and plot learning curves
test_loss, test_acc = model.evaluate(test_ds, verbose=2)
print(f"test accuracy: {test_acc}")

plt.style.use('seaborn')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
epochs_range = range(epochs)

ax1.plot(epochs_range, history.history["accuracy"], label="Training Accuracy")
ax1.plot(epochs_range, history.history["val_accuracy"], label="Validation Accuracy")
ax1.axhline(test_acc, color="k", label="Test Accuracy")
ax1.legend(loc="lower right")
ax1.set_title("Training and Validation Accuracy")

ax2.plot(epochs_range, history.history["loss"], label="Training Loss")
ax2.plot(epochs_range, history.history["val_loss"], label="Validation Loss")
ax2.axhline(test_loss, color="k", label="Test Loss")
ax2.legend(loc="upper right")
ax2.set_title("Training and Validation Loss")

fig.savefig("outputs/learning_curves.png")

# save the model on disk
model.save("outputs/trained_model_flowers")

If you check the `scripts` folder, you should now see a file called `train_model.py`.

*Note: JupyterLab also provides a text editor that can be used to create and edit text files. Click on this [link](scripts/train_model.py) to open it.*

## Create a Slurm job script

To tell Slurm how to run our code, we need to write a **job submission script** whose role is to:

1. detail the resource requirements (CPUs, RAM, GPU, time limit, etc.),
1. set up the software environment,
1. run our script.

The job submission script is a regular *bash script* with:

1. a header with comments to tell Slurm about the resources we want,
1. some commands using *environment modules* to load the right software stack for us,
1. a bash command to run our script.

<img src="../images/anatomyofslurm_bashscript.png" width=900 />

Execute the following cell to create the job submission script `train_model.sl`:

1. This job will request 2 CPUs, 8 GB of RAM and an A100-1g.5gb GPU (1/7th of an A100) for 10 minutes using the training account `nesi99991`.
1. The software stack for this workshop has been prepared as a conda environment, hence we need to load the conda module and activate the conda environment.
1. The final line of the script runs our training script.

In [None]:
%%writefile scripts/train_model.sl
#!/usr/bin/env bash
#SBATCH --account=nesi99991
#SBATCH --time=00-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8GB
#SBATCH --gpus-per-node=A100-1g.5gb:1
#SBATCH --output=logs/slurm-%j.out

# load required environment modules
module purge
module load Miniconda3 cuDNN/8.1.1.33-CUDA-11.2.0

# activate the conda environment
source $(conda info --base)/etc/profile.d/conda.sh
export PYTHONNOUSERSITE=1
conda deactivate
conda activate /nesi/project/nesi99991/ml102_20220616/jupyter_kernel_env

# execute the script
python scripts/train_model.py

## Submit a Slurm job and check results

Now that we have a training script and a job submission script, we can submit our job to Slurm and wait for the results 🎉.
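
Before submitting, it can be worth checking that the folders the job writes to exist. The job submission script sends its log to `logs/` and the training script saves its figure to `outputs/`; Slurm does not create a missing log folder for you. The sketch below assumes these folders may not exist yet in your working directory (the workshop material may already provide them), and `ensure_job_folders` is a helper name made up for this example.

```python
from pathlib import Path


def ensure_job_folders(base_dir=".", folders=("logs", "outputs")):
    """Create the folders the job writes to, if they are missing.

    Returns the list of folder paths, whether newly created or not.
    """
    paths = []
    for name in folders:
        path = Path(base_dir) / name
        # exist_ok=True makes this safe to call more than once
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # run from the folder you will submit the job from
    for path in ensure_job_folders():
        print(f"ready: {path}")
```

The paths in both scripts are relative, so this should be run from the same folder as the one used to submit the job.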

To interact with Slurm, there is a set of command line tools:

- `sbatch JOBSCRIPT` to submit the job script to Slurm, which returns a **JOB ID** number,
- `squeue --me` to check the position of your jobs in the Slurm queue,
- `scancel JOBID` to cancel your job,
- etc.

<img src="../images/batch_system_flow.png" width=900 />

You can enter them in a terminal running on NeSI, either in JupyterLab or in an SSH session.

Here we will use an alternative: the `!command` syntax, which runs a shell command from a notebook running on NeSI.

First, let's submit our job.

In [None]:
!sbatch scripts/train_model.sl

The `sbatch` command returns a number, the job ID, that is unique to this computation.
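
If you prefer to keep the job ID in a Python variable rather than copying it by hand, a minimal sketch could look like the following. It assumes `sbatch` prints its usual `Submitted batch job <ID>` message; the helper names `parse_job_id` and `submit` are made up for this example, and the job ID in the demo line is not a real one.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the job ID from a message like 'Submitted batch job 123456'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return match.group(1)


def submit(job_script):
    """Submit a job script with sbatch and return its job ID as a string."""
    result = subprocess.run(
        ["sbatch", job_script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


# illustrative message only, the number is made up
print(parse_job_id("Submitted batch job 123456"))
```

On NeSI, `job_id = submit("scripts/train_model.sl")` would then play the same role as the `!sbatch` cell above.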

Next, let's see if your job has already started or is waiting for resources, using the `squeue` command.

In [None]:
!squeue --me

Because the current Jupyter session is also running as a Slurm job, you should see at least one job called `spawner-jupy`.

If it's the only one, then your job has finished.
⚠️ It does not mean that your job has been successful.
To check the final status of your job, you can:

- look at the log file `logs/slurm-JOBID.out`, which captures everything that your Python script would have printed on the command line,
- use the command `sacct -j JOBID` to print the status of your job (FAILED, COMPLETED, etc.).
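
Both checks can also be scripted. The sketch below is one possible way to do it from Python, assuming the log location set in the job submission script (`logs/slurm-JOBID.out`); the helper names `parse_state`, `job_state` and `log_path` are made up for this example. The `sacct` options used here restrict the output to a single state line for the job allocation.

```python
import subprocess
from pathlib import Path


def parse_state(sacct_output):
    """Return the first word of the first non-empty line, e.g. 'COMPLETED'.

    Returns None if sacct printed nothing (job unknown or not recorded yet).
    """
    for line in sacct_output.splitlines():
        if line.strip():
            return line.split()[0]
    return None


def job_state(job_id):
    """Ask sacct for the state of the job allocation."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "--noheader", "--parsable2",
         "--format=State"],
        capture_output=True, text=True, check=True,
    )
    return parse_state(result.stdout)


def log_path(job_id, log_dir="logs"):
    """Build the path of the log file written by the job."""
    return Path(log_dir) / f"slurm-{job_id}.out"


# illustrative values only, the job ID is made up
print(parse_state("COMPLETED\n"))
print(log_path(123456))
```

On NeSI, `job_state(JOBID)` gives the final status once the job has left the queue, and `log_path(JOBID).read_text()` returns the content of its log file.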