# Submitting Python Scripts to an HPC Cluster with SLURM

This lecture explains how to submit Python scripts to an HPC cluster using SLURM, both interactively with `salloc` and using batch scripts.

## Sample Python scripts:

You should have 2 scripts:

### Script 1 (CPU testing):
```bash
vim cpu_script.py
```

```Python
# CPU test
import time
import os

print("Running cpu_script.py")
print(f"Hostname: {os.uname().nodename}")
print("Starting CPU task...")
time.sleep(20)  # Sample code for testing
print("CPU task complete.")
```

### Script 2 (GPU testing):

- `nvidia-smi` (NVIDIA System Management Interface) is a command-line utility provided by NVIDIA. It can be used to monitor and manage NVIDIA GPU devices.


```bash
vim gpu_script.py
```

```Python
# GPU test
import os
import subprocess
import time

print("Running my_gpu_script.py (CUDA Direct)")
print(f"Hostname: {os.uname().nodename}")

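try:
    # Run nvidia-smi to check for GPU availability; its report goes to stdout
    subprocess.run(["nvidia-smi"], check=True)

    print("CUDA device detected.")
    print("Running nvidia-smi")

except (subprocess.CalledProcessError, FileNotFoundError):
    # CalledProcessError: nvidia-smi ran but failed (e.g., no usable GPU or driver)
    # FileNotFoundError: nvidia-smi is not installed (e.g., a laptop without an NVIDIA GPU)
    print("No CUDA device found. Running on CPU.")
    time.sleep(20)  # Sample code for testing

print("Task complete.")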
```


First, test them on your laptop. Then copy them to the HPC cluster with ```scp```.

## Method 1: Interactive Job Submission with `salloc`

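`salloc` allocates resources for an interactive session in which you can run Python scripts and other commands directly. Depending on the cluster configuration, it either opens a shell on the allocated compute node or leaves you on the login node, from which you connect to the allocated node (with `ssh` in the examples below).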

### Basic `salloc` Usage on CPU partitions:

#### Example 1:

Either of the following SLURM commands requests 1 CPU core and 1 GB of memory for 1 hour in the cpu partition.

```bash
salloc --partition cpu --cpus-per-task=1 --time=1:00:00 --mem=1G
```

```bash
salloc --partition cpu -n 1 --time=1:00:00 --mem=1G
```

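```-n``` (short for ```--ntasks```) sets the number of tasks. By default SLURM allocates one CPU core per task, so ```-n 1``` usually yields one core, but the exact interpretation can depend on how your cluster's SLURM configuration is set up.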


#### Script run output:

```bash
salloc --partition cpu --cpus-per-task=1 --time=1:00:00 --mem=1G

ssh dgx-node-0-x

conda activate py310

python cpu_script.py 
Running cpu_script.py
Hostname: dgx-node-0-0.cedia.edu.ec
Starting CPU task...
CPU task complete.
```
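To confirm what SLURM actually granted, a script can read the environment variables that SLURM exports inside an allocation. The following is a minimal sketch: `describe_allocation` is a hypothetical helper name, and which variables are defined depends on the options you passed (for example, `SLURM_CPUS_PER_TASK` is normally set only when `--cpus-per-task` was given), so missing ones are reported as `unset`. Run it in the shell that `salloc` opened or through `srun`; a separate `ssh` session may not inherit these variables.

```python
import os

# Environment variables SLURM commonly exports inside a job or allocation.
SLURM_VARS = [
    "SLURM_JOB_ID",
    "SLURM_JOB_PARTITION",
    "SLURM_JOB_NODELIST",
    "SLURM_NTASKS",
    "SLURM_CPUS_PER_TASK",
]


def describe_allocation(env=None):
    """Return a dict of the variables above, using 'unset' for missing ones."""
    env = os.environ if env is None else env
    return {name: env.get(name, "unset") for name in SLURM_VARS}


if __name__ == "__main__":
    for name, value in describe_allocation().items():
        print(f"{name}={value}")
```

On a laptop every line prints `unset`, which is a quick way to tell whether a script is running inside a SLURM allocation.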


#### **Example 2: (this is NOT possible in the HPC Cedia cluster)**

This SLURM command requests 2 nodes with 16 tasks per node and 32 GB of memory in the cpu partition.

```bash
salloc -p cpu -N 2 --ntasks-per-node=16 --mem=32G
```

Here:

```-N 2```: Requests 2 nodes

```--ntasks-per-node=16```: Requests 16 tasks per node. With the default of one CPU core per task and 2 nodes, this results in a total of 32 cores (16 cores/node * 2 nodes = 32 cores).

```--mem=32G```: Requests 32 gigabytes of memory per node. In SLURM, ```--mem``` is a per-node value, so this request reserves 32 GB on each of the two nodes (64 GB in total), not 32 GB shared between them.
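The arithmetic behind this request can be checked with a short sketch. `allocation_totals` is a hypothetical helper (not part of SLURM); it assumes one CPU per task unless `cpus_per_task` is given, which is SLURM's default, and it treats `--mem` as a per-node amount, as the SLURM documentation defines it.

```python
def allocation_totals(nodes, ntasks_per_node, mem_per_node_gb, cpus_per_task=1):
    """Return (total_tasks, total_cpus, total_mem_gb) for a multi-node request."""
    total_tasks = nodes * ntasks_per_node
    total_cpus = total_tasks * cpus_per_task
    total_mem_gb = nodes * mem_per_node_gb
    return total_tasks, total_cpus, total_mem_gb


# Corresponds to: salloc -p cpu -N 2 --ntasks-per-node=16 --mem=32G
tasks, cpus, mem_gb = allocation_totals(nodes=2, ntasks_per_node=16, mem_per_node_gb=32)
print(f"{tasks} tasks, {cpus} CPU cores, {mem_gb} GB of memory in total")
# prints: 32 tasks, 32 CPU cores, 64 GB of memory in total
```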




### Basic `salloc` Usage on GPU partitions:

#### Example:

```bash
salloc -p gpu -n 1 -c 16  --mem=1G --gres=gpu:a100_2g.10gb:1 --time=00:10:00
```

Here:

```-p gpu```: This specifies that the job should be submitted to the "gpu" partition (or queue). This means that the resources allocated will be on nodes that have GPUs available.

```-n 1```: This requests 1 task. In SLURM, a task typically corresponds to a single process.

```-c 16```: This specifies that 16 CPU cores should be allocated per task. In this case, since there is only one task, the total number of CPU cores reserved is 16.

```--mem=1G```: This requests 1 gigabyte of memory.

```--gres=gpu:a100_2g.10gb:1```: This requests a generic resource (GRES), specifically a GPU of type **a100_2g.10gb**. The name denotes a MIG (Multi-Instance GPU) slice of an NVIDIA A100 with **2** compute slices and **10 GB** of GPU memory, not a full A100.

```:1```: This specifies that one instance of the requested GPU resource should be allocated.

```--time=00:10:00```: This sets the time limit for the job to 10 minutes (0 hours, 10 minutes, 0 seconds).
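To make the `--time` and `--gres` formats concrete, here is a small sketch that decodes both strings. The function names `slurm_time_to_seconds` and `parse_gres` are hypothetical helpers written for this lecture, not SLURM APIs. The time parser assumes the forms listed in the SLURM documentation (`M`, `M:S`, `H:M:S`, `D-H`, `D-H:M`, `D-H:M:S`), and the GRES parser assumes the common `name[:type][:count]` layout.

```python
def slurm_time_to_seconds(limit):
    """Convert a SLURM time limit string to seconds."""
    days = 0
    if "-" in limit:
        # Forms with a day count: D-H, D-H:M, D-H:M:S
        day_part, rest = limit.split("-", 1)
        days = int(day_part)
        fields = [int(x) for x in rest.split(":")]
        fields += [0] * (3 - len(fields))  # pad missing minutes/seconds
        hours, minutes, seconds = fields
    else:
        fields = [int(x) for x in limit.split(":")]
        if len(fields) == 1:      # M
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:    # M:S
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:                     # H:M:S
            hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_gres(spec):
    """Split 'name[:type][:count]' into a dict; count defaults to 1."""
    parts = spec.split(":")
    gres_type, count = None, 1
    if len(parts) == 2:
        # A numeric second field is a count, anything else is a type.
        if parts[1].isdigit():
            count = int(parts[1])
        else:
            gres_type = parts[1]
    elif len(parts) == 3:
        gres_type, count = parts[1], int(parts[2])
    return {"name": parts[0], "type": gres_type, "count": count}


print(slurm_time_to_seconds("00:10:00"))   # 600
print(parse_gres("gpu:a100_2g.10gb:1"))
# {'name': 'gpu', 'type': 'a100_2g.10gb', 'count': 1}
```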


#### Script run output:

```bash
ssh dgx-node-0-x

conda activate py310

python gpu_script.py 

Running my_gpu_script.py (CUDA Direct)
Hostname: dgx-node-0-0.cedia.edu.ec
Wed Mar 26 15:37:28 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                   On |
| N/A   29C    P0              44W / 400W |     75MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    5   0   0  |              25MiB /  9856MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
CUDA device detected.
Running nvidia-smi
Task complete.
```


## Method 2: Standalone Job Submission with Scripts: Batch jobs

Batch jobs are used for longer or more complex tasks that do not require direct interaction with the cluster: you submit a script, SLURM queues it, and it runs when the requested resources become available.


### Test Batch Script on CPUs:

We will create a shell script (e.g., ```job1.sh```) containing your SLURM directives and commands:

```bash
vim job1.sh
```

```bash
#!/bin/bash
#SBATCH --job-name=my_job1
#SBATCH --partition=cpu
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=my_job1.out
#SBATCH --error=my_job1.err

# Your commands go here
echo "Starting job..."
date
sleep 60
echo "Job finished."
date
```

#### Submit job and check:

```bash
sbatch job1.sh
```
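If you prefer to drive submissions from Python, the job ID can be captured from the message that `sbatch` prints on success (`Submitted batch job <id>`). This is a minimal sketch: `parse_job_id` and `submit` are hypothetical helper names, and `submit` only works where the `sbatch` command is available, so it is defined here but not called.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's success message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def submit(script_path):
    """Run sbatch on script_path and return the job ID (requires SLURM)."""
    result = subprocess.run(
        ["sbatch", script_path],
        check=True,           # raise if sbatch exits with an error
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str
    )
    return parse_job_id(result.stdout)


# The parsing step can be tried without a cluster:
print(parse_job_id("Submitted batch job 39170"))  # 39170
```

On the cluster, `submit("job1.sh")` would return the ID that later appears in the `JOBID` column of `squeue`.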

```bash
squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             39170       cpu   my_job wladimir  R       0:39      1 dgx-node-0-2
```
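The `squeue` table is plain whitespace-separated text, so it can be turned into records for scripted checks. The sketch below assumes the default column layout shown above and job names without spaces; `parse_squeue` is a hypothetical helper, not part of SLURM.

```python
def parse_squeue(text):
    """Parse default-format squeue output into a list of dicts keyed by column."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limit the split so a multi-word last column stays in one field.
        fields = line.strip().split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             39170       cpu   my_job wladimir  R       0:39      1 dgx-node-0-2
"""

for job in parse_squeue(sample):
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
# prints: 39170 R dgx-node-0-2
```

A state of `R` means the job is running, while `PD` means it is still pending in the queue.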

```bash
ls -ltr my_job1.*

-rw-rw-r-- 1 wladimir.banda wladimir.banda  0 mar 26 16:15 my_job1.err
-rw-rw-r-- 1 wladimir.banda wladimir.banda 88 mar 26 16:16 my_job1.out
```


### Test Batch Script on GPUs:

We will create a shell script (e.g., ```job2.sh```) containing your SLURM directives and commands:

```bash
vim job2.sh
```

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --time=0:05:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=1G
#SBATCH --gres=gpu:a100_2g.10gb:1
#SBATCH --partition=gpu
#SBATCH --output=gpu_job.out
#SBATCH --error=gpu_job.err

# Your GPU enabled commands here
echo "Starting GPU job..."
date
nvidia-smi
sleep 120
echo "GPU Job finished."

```

#### Submit job and check:

```bash
sbatch job2.sh
```

```bash
squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             39171       gpu  gpu_job wladimir  R       0:29      1 dgx-node-0-0
```

```bash
ls -ltr gpu_job.*

-rw-rw-r-- 1 wladimir.banda wladimir.banda    0 mar 26 16:22 gpu_job.err
-rw-rw-r-- 1 wladimir.banda wladimir.banda 2739 mar 26 16:24 gpu_job.out
```
