
# HPC and SLURM Basics
* Now that we're set up on Anvil, let's go over a couple of HPC / Slurm basics
* For reference, here's the Anvil [user guide](https://www.rcac.purdue.edu/knowledge/anvil)





## Linux Preliminaries

**Navigation**

| Command | Description | Example |
|----------|--------------|----------|
| `pwd` | Print current working directory | `pwd` → `/home/username` |
| `ls` | List files and directories | `ls -l`, `ls -lh` |
| `cd` | Change directory | `cd /projects/cis250773` |
| `cd ..` | Move up one directory |  |
| `cd ~` | Go to home directory |  |
| `tree` | Display directory tree | `tree -L 2` |
| `du -sh *` | Show size of each folder | Useful for checking quotas |


**Managing Files and Directories**
| Command | Description | Example |
|----------|--------------|----------|
| `mkdir` | Create a new directory | `mkdir data`, `mkdir -p results/run1` |
| `cp` | Copy files or directories | `cp input.txt /scratch/$USER/` |
| `mv` | Move or rename files | `mv old.txt new.txt` |
| `rm` | Remove a file | `rm temp.txt` |
| `rm -r` | Remove directory recursively | `rm -r old_results/` ⚠️ irreversible |
| `touch` | Create an empty file | `touch notes.txt` |
| `cat` | Print file contents | `cat job.o12345` |
| `head` / `tail` | Show first/last lines of a file | `head -n 10 log.txt`, `tail -f log.txt` |
| `nano`, `vi`, `vim` | Edit files in terminal | `nano test.slurm` |
| `chmod` | Change file permissions | `chmod +x script.sh` |

**Checking Disk Usage / Quotas**
| Command | Description | Example |
|----------|--------------|----------|
| `df -h` | Show filesystem disk usage |  |
| `du -sh .` | Show total size of current directory |  |
| `du -sh *` | Show size of all subdirectories |  |
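
If you'd rather check usage from Python, a rough equivalent of `du -s` can be sketched with the standard library. The helper names `dir_size_bytes` and `human` are made up for this example. The sketch sums apparent file sizes, so totals may differ somewhat from what `du` reports.

```python
import os
import shutil

def dir_size_bytes(path="."):
    """Sum the apparent sizes of all regular files under path (roughly `du -s`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):   # skip symlinks so targets aren't counted
                total += os.path.getsize(full)
    return total

def human(nbytes):
    """Format a byte count with binary units, similar in spirit to the -h flag."""
    size = float(nbytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024

if __name__ == "__main__":
    print("Current directory:", human(dir_size_bytes(".")))
    usage = shutil.disk_usage(".")         # filesystem totals, like `df`
    print("Filesystem free  :", human(usage.free))
```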


## HPC Cluster Preliminaries
* HPC clusters will typically include a few specialized types of nodes
* Login Node
  * Where you log in, set up environments, submit jobs
* Compute nodes
  * Where jobs actually run
  * This is where the heavy computation actually happens
  * **Don't do heavy computation on login nodes!**
* Storage nodes
  * Where shared filesystems live like your home directory and projects

## Slurm
* Slurm is the Simple Linux Utility for Resource Management
* It's the job scheduler on the cluster that coordinates how resources are used
* It controls
  * Who runs based on queue order, priority, and allocation
  * Where they run (which nodes)
  * When they run (once resources are free)
* Hence, you don't run something directly on a compute node; instead, you submit a request for resources to Slurm
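
One way to avoid accidentally running heavy work on a login node is to have a script check whether it is inside a Slurm job before it starts. The sketch below relies on the `SLURM_JOB_ID` environment variable, which Slurm sets inside a job. The function names are hypothetical, and the check is only a heuristic, not an official Anvil mechanism.

```python
import os

def in_slurm_job(env=None):
    """Return True if SLURM_JOB_ID is set, i.e. we appear to be inside a Slurm job."""
    env = os.environ if env is None else env
    return bool(env.get("SLURM_JOB_ID"))

def require_slurm_job(env=None):
    """Raise instead of starting heavy work outside of a Slurm job."""
    if not in_slurm_job(env):
        raise RuntimeError("Not inside a Slurm job; submit with sbatch or use salloc.")

if __name__ == "__main__":
    print("Inside a Slurm job:", in_slurm_job())
```

Calling `require_slurm_job()` at the top of an expensive script makes it fail fast when launched by hand on a login node.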

## Managing environments
* To do computation on a compute node, you'll use Slurm with commands like `sbatch` and `salloc`
* To support a wide variety of software, HPC systems often use **modules** that allow you to set up a particular software environment
* Compute nodes start with a minimal clean Linux environment
* Don't rely on anything you loaded or activated interactively on the login node carrying over; set the environment up explicitly in the job



### Modules
* Modules on an HPC cluster are configuration files that dynamically set up your shell environment to use a particular version of a software package
* Modules are dynamic, so they can be loaded and unloaded each session
* Some examples of module related commands:
```
module avail            # list available modules
module load gcc/11.2.0  # load GCC 11.2.0
module list             # show currently loaded modules
module unload gcc/11.2.0
module purge    # unload everything
```
* Note that modules modify your environment only for the current session
* When you log out and back in, the loaded modules are gone
* The module system enables many different users to work with different software versions on the same cluster
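
To make the idea concrete, here is a toy model of what loading and unloading a module does to `PATH`. It is only a sketch: real module systems also adjust other variables, and the install directory used below is a hypothetical path, not one taken from Anvil.

```python
import os

def load_module(env, bin_dir):
    """Return a copy of env with bin_dir prepended to PATH (toy `module load`)."""
    new_env = dict(env)
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep) if p]
    new_env["PATH"] = os.pathsep.join([bin_dir] + parts)
    return new_env

def unload_module(env, bin_dir):
    """Return a copy of env with bin_dir removed from PATH (toy `module unload`)."""
    new_env = dict(env)
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep) if p and p != bin_dir]
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env

if __name__ == "__main__":
    base = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
    gcc_dir = "/apps/gcc/11.2.0/bin"   # hypothetical install location
    loaded = load_module(base, gcc_dir)
    print("after load  :", loaded["PATH"])
    print("after unload:", unload_module(loaded, gcc_dir)["PATH"])
```

Because the directory is prepended, the module's executables are found before any system-wide versions for as long as the module stays loaded.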

### Setting up Environments
* Often you will set up an environment on the login node, then explicitly recreate the environment in your job script
* Hence, on the login node, you might:
  * Create an Anaconda environment or virtual environment
  * Compile software
  * Manage files
* To run code on a compute node, your job script needs to recreate the same environment there
* Compute nodes have access to the shared file system
  * For example, in Anvil, compute nodes can access your home directory, scratch, and the projects directory

### Example: Numba-CUDA Environment
* You can set up an environment to run Numba CUDA on Anvil by first setting up an Anaconda environment on the login node in your home directory

```
module purge
module load modtree/gpu
module load anaconda
conda create -y -n numba-cuda -c conda-forge python=3.11 numba cudatoolkit
```

* As an example, let's say we want to run this Numba CUDA code for vector addition on Anvil, saved as `numba_cuda_test.py`
```python
import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, c):
    i = cuda.grid(1)
    if i < a.size:
        c[i] = a[i] + b[i]

def main():
    # Print device info
    dev = cuda.get_current_device()
    print(f"Using GPU: {dev.name.decode() if hasattr(dev.name, 'decode') else dev.name}")
    print(f"Compute capability: {dev.compute_capability}")
    print(f"Max threads per block: {dev.MAX_THREADS_PER_BLOCK}")

    # Problem size
    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    c = np.empty_like(a)

    # Configure kernel launch
    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block

    # Move data to device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.device_array_like(a)

    # Launch kernel
    vec_add[blocks, threads_per_block](d_a, d_b, d_c)

    # Copy result back and verify
    d_c.copy_to_host(c)
    max_err = np.max(np.abs(c - (a + b)))
    print(f"Max error: {max_err:.3e}")
    print("SUCCESS" if max_err < 1e-6 else "CHECK FAILED")

if __name__ == "__main__":
    main()
```
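
The launch configuration in the script uses ceiling division so that every element gets a thread. The small CPU-only sketch below (the helper name `launch_config` is made up here) shows the arithmetic and why the `if i < a.size` guard in the kernel is needed: the last block usually contains threads with no element to process.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, total_threads, idle_threads) for a 1D launch over n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block   # ceiling division
    total_threads = blocks * threads_per_block
    idle_threads = total_threads - n    # threads that must be masked by the bounds check
    return blocks, total_threads, idle_threads

if __name__ == "__main__":
    blocks, total, idle = launch_config(1_000_000, 256)
    print(f"blocks={blocks}, total threads={total}, idle threads={idle}")
    # prints: blocks=3907, total threads=1000192, idle threads=192
```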

* To run this test script, we write a Slurm batch script that activates the environment we set up and executes the script

```
#!/bin/bash
# FILENAME: numba_cuda.slurm

#SBATCH -A cis250773-gpu          # your GPU allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:05:00
#SBATCH -J numba_cuda
#SBATCH -o numba_cuda.o%j
#SBATCH -e numba_cuda.e%j
#SBATCH -p gpu

module purge
module load modtree/gpu
module load anaconda              

# Activate the environment you created on the login node
conda activate numba-cuda

# Run the program
python numba_cuda_test.py
```
* Note that the compute node has access to the same file system as the login node, so the job can use the conda environment we set up on the login node
* Loading the same modules ensures that everything is consistent

* This job can be submitted with the command
```
sbatch numba_cuda.slurm
```
* Note that you'll most likely want to keep your job scripts in your home directory
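
If you drive submissions from Python, a minimal sketch might look like the following. It assumes `sbatch` prints its usual `Submitted batch job <id>` line on success. The helper names are hypothetical, and `submit` only works on a machine where `sbatch` is available.

```python
import re
import subprocess

def parse_job_id(sbatch_stdout):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Could not find a job ID in: {sbatch_stdout!r}")
    return int(match.group(1))

def submit(script_path):
    """Submit a batch script with sbatch and return its job ID (needs Slurm installed)."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)

if __name__ == "__main__":
    # Parsing demo only; call submit("numba_cuda.slurm") on the cluster itself.
    print(parse_job_id("Submitted batch job 123456\n"))
```

The returned ID is what you would pass to commands such as `scancel` or `sacct`.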

### Ways to use Slurm
1. As shown above, the main way of using Slurm is by submitting a job with
```
sbatch my_job.slurm
```
* This will cause Slurm to queue the job and eventually run it
2. The second way of using Slurm is in interactive mode with a command like
```
salloc -A your-account -p gpu --gpus=1 --time=30:00
```
* This allows you to run commands manually, which can be useful for debugging / experimenting to make sure your environment is properly set up on a compute node
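
The `--time` value can be written in several forms, which is easy to misread: `30:00` means 30 minutes, not 30 hours. The sketch below converts the formats listed in the `sbatch` manual (minutes, minutes:seconds, hours:minutes:seconds, and the `days-` variants) into seconds. The function name is made up for this example and it does no validation beyond the field count.

```python
def slurm_time_to_seconds(spec):
    """Convert a Slurm time string such as '30:00' or '1-12:00:00' to seconds."""
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(x) for x in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"Unrecognized time format: {spec!r}")
        hours, minutes, seconds = fields + [0] * (3 - len(fields))
    else:
        fields = [int(x) for x in spec.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"Unrecognized time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

if __name__ == "__main__":
    for spec in ("30:00", "00:05:00", "02:00:00", "1-12"):
        print(f"{spec:>10} -> {slurm_time_to_seconds(spec)} s")
```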

## Batch Scripts
* There are some common flags that will often appear in batch scripts for Slurm jobs

| Category | Flag | Description | Example |
|-----------|------|--------------|----------|
| **Job Identification** | `#SBATCH -J <name>` | Sets a short, descriptive job name | `#SBATCH -J gpu_test` |
|  | `#SBATCH -o <file>` | File for standard output (`stdout`) | `#SBATCH -o job.o%j` |
|  | `#SBATCH -e <file>` | File for standard error (`stderr`) | `#SBATCH -e job.e%j` |
|  | `#SBATCH -D <path>` | Set working directory for the job | `#SBATCH -D /home/username/jobs` |
| **Resources** | `#SBATCH -A <account>` | Allocation or project account to charge | `#SBATCH -A cis250773-gpu` |
|  | `#SBATCH -p <partition>` | Queue/partition to submit to | `#SBATCH -p gpu` |
|  | `#SBATCH --nodes=<n>` | Number of nodes requested | `#SBATCH --nodes=1` |
|  | `#SBATCH --ntasks-per-node=<n>` | Number of tasks (processes) per node | `#SBATCH --ntasks-per-node=1` |
|  | `#SBATCH --cpus-per-task=<n>` | Number of CPU cores per task | `#SBATCH --cpus-per-task=4` |
|  | `#SBATCH --gpus-per-node=<n>` | Number of GPUs per node | `#SBATCH --gpus-per-node=1` |
|  | `#SBATCH --mem=<size>` | Memory required per node | `#SBATCH --mem=16G` |
| **Time Management** | `#SBATCH --time=<hh:mm:ss>` | Walltime limit for the job | `#SBATCH --time=02:00:00` |
|  | `#SBATCH --begin=<time>` | Delay start until specific time | `#SBATCH --begin=tomorrow` |
| **Email Notifications** | `#SBATCH --mail-user=<email>` | Email address for job notifications | `#SBATCH --mail-user=username@purdue.edu` |
|  | `#SBATCH --mail-type=<type>` | When to send emails (`BEGIN`, `END`, `FAIL`, etc.) | `#SBATCH --mail-type=END,FAIL` |
| **Job Arrays** | `#SBATCH --array=<range>` | Submit multiple similar jobs as an array | `#SBATCH --array=1-10` |
| **Dependencies & Advanced** | `#SBATCH --dependency=<type:jobid>` | Start job only after another finishes | `#SBATCH --dependency=afterok:12345` |
|  | `#SBATCH --requeue` | Allow job to be requeued if preempted | `#SBATCH --requeue` |
| **Miscellaneous** | `#SBATCH --export=<vars>` | Export environment variables to the job | `#SBATCH --export=ALL` |
|  | `#SBATCH --chdir=<path>` | Long form of `-D`: set the working directory before running the job | `#SBATCH --chdir=/projects/cis250773/` |
|  | `#SBATCH --constraint=<feature>` | Request nodes with a specific feature | `#SBATCH --constraint=zen3` |


> **Notes:**
> - `%j` in filenames (like `job.o%j`) is replaced by the **job ID** automatically.  
> - You can check available partitions and resources with `sinfo` or the cluster’s documentation.  
> - Set `--time` realistically, with some headroom: Slurm will **kill the job** when the limit expires.  
> - Email notifications only work if the cluster’s mail system is configured for users.  
> - For interactive testing, use `salloc` or `srun` instead of a batch script.
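
When you generate many similar job scripts, it can help to build the `#SBATCH` header programmatically. The following is a minimal sketch, with a made-up helper name `sbatch_header`, that renders a dictionary of the flags from the table as directive lines. It does not check that the flags or values are valid for a given cluster.

```python
def sbatch_header(options):
    """Render a dict of sbatch options as '#SBATCH' directive lines."""
    lines = ["#!/bin/bash"]
    for flag, value in options.items():
        if value is None:
            lines.append(f"#SBATCH {flag}")            # bare flag such as --requeue
        elif flag.startswith("--"):
            lines.append(f"#SBATCH {flag}={value}")    # long form: --flag=value
        else:
            lines.append(f"#SBATCH {flag} {value}")    # short form: -f value
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    header = sbatch_header({
        "-A": "cis250773-gpu",
        "-p": "gpu",
        "--nodes": 1,
        "--gpus-per-node": 1,
        "--time": "00:05:00",
        "-J": "gpu_test",
        "-o": "job.o%j",
        "--requeue": None,
    })
    print(header, end="")
```

The commands to run (module loads, `conda activate`, the program itself) would be appended after this header before writing the file.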

* Along with these flags, Slurm provides many command line tools for things like submitting jobs, viewing jobs in the queue, and so on
* These are useful if you primarily want to interact with the cluster from the command line as opposed to the OnDemand interface

| Command | Description | Common Usage Example |
|----------|--------------|----------------------|
| **`sbatch`** | Submit a batch script to the Slurm scheduler | `sbatch job.slurm` |
| **`squeue`** | View the queue of running and pending jobs | `squeue -u $USER` |
| **`scancel`** | Cancel a submitted job | `scancel 123456` |
| **`sacct`** | Display accounting data for completed jobs | `sacct -j 123456` |
| **`sinfo`** | Show partition and node availability | `sinfo -p gpu` |
| **`scontrol`** | View or modify job and node information | `scontrol show job 123456` |
| **`salloc`** | Allocate resources for an interactive job | `salloc -A cis250773-gpu -p gpu --gpus=1 --time=30:00` |
| **`srun`** | Run a command or program under Slurm control | `srun python script.py` |
| **`sstat`** | Show real-time status for a running job | `sstat -j 123456.batch` |
| **`sprio`** | Display job priorities in the queue | `sprio` |
| **`squeue --start`** | Estimate job start times | `squeue --start -u $USER` |
| **`sreport`** | Generate usage and efficiency reports | `sreport cluster utilization` |
| **`sview`** | Open a GUI for job/cluster monitoring (if available) | `sview` |


> **Notes:**
> - `$USER` is a built-in environment variable for your username.  
> - Job states in `squeue`:  
>   - `PD` = Pending  
>   - `R` = Running  
>   - `CG` = Completing  
>   - `CD` = Completed  
>   - `F` = Failed  
> - Use `sacct` after a job finishes to check CPU, memory, and walltime usage.  
> - For quick testing, `salloc` + `srun` gives you an interactive shell or command on a compute node.
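
As a small illustration of working with the state codes, the sketch below tallies jobs by state from text in the shape produced by `squeue -h -u $USER -o "%i %t"` (job ID and compact state, no header). The sample text is made up, and the function name `count_states` is hypothetical.

```python
from collections import Counter

STATE_NAMES = {
    "PD": "Pending",
    "R": "Running",
    "CG": "Completing",
    "CD": "Completed",
    "F": "Failed",
}

def count_states(squeue_text):
    """Count jobs per state code from lines of the form '<jobid> <state>'."""
    counts = Counter()
    for line in squeue_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts

if __name__ == "__main__":
    sample = "101 R\n102 PD\n103 PD\n"   # made-up output for illustration
    for code, n in sorted(count_states(sample).items()):
        print(f"{STATE_NAMES.get(code, code)}: {n}")
```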



## Storing and Managing Files
* Anvil has a few different file systems that you can use for different purposes
* You can see the main ones using the `myquota` command on Anvil
```
myquota

Type       Location             Size    Limit    Use   Files   Limit    Use
===========================================================================
home       x-jdowns            3.0GB   25.0GB  12.0%      -       -      -
scratch    x-jdowns            0.0KB  100.0TB   0.0%      1     1.0M   0.0%
projects   x-cis250773         2.8GB    5.0TB   0.1%   10.5K    1.0M   1.0%
```
* The user's **home** directory can be used to store personal software, scripts, etc.
* **scratch** is temporary user storage for I/O activity
  * It's useful for fast read / write access to datasets
  * However, it's not long-term storage!
  * Files are purged 30 days after they were last accessed
* **projects** is a shared, per-allocation storage space for common datasets and software installations
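
Since the scratch purge is based on access time, it can be useful to see which files are getting old. The sketch below lists files whose last access is more than a given number of days ago. The function name `stale_files` is made up, and the exact purge rules are Anvil's to define, so treat this only as a rough early-warning check.

```python
import os
import time

def stale_files(root, days=30, now=None):
    """Return sorted paths under root whose access time is older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                atime = os.stat(full).st_atime   # stat does not update the access time
            except OSError:
                continue                         # file vanished or is unreadable
            if atime < cutoff:
                stale.append(full)
    return sorted(stale)

if __name__ == "__main__":
    old = stale_files(".", days=30)
    print(f"{len(old)} file(s) not accessed in the last 30 days")
```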


### File Transfer
* There are a few different ways of handling file transfers from your system to a cluster
* One is to use the OnDemand file browser interface
* There are also command line tools like `scp` and `rsync`
* You can even set up VSCode for remote SSH access so you can edit files on the cluster directly in VSCode
* You can also use `git` to clone code repositories


| Command | Description | Example |
|----------|--------------|----------|
| `scp` | Secure copy between local and remote | `scp FDS-6.10.1_SMV-6.10.1_lnx.sh x-jdowns@login07.anvil.rcac.purdue.edu:/home/x-jdowns` |
| `rsync` | Sync directories efficiently | `rsync -av data/ x-jdowns@login07.anvil.rcac.purdue.edu:~/data_backup/` |
| `sftp` | Interactive file transfer | `sftp x-jdowns@login07.anvil.rcac.purdue.edu` |
| (GUI) | File transfer via app | FileZilla, WinSCP, VSCode, OnDemand UI |
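
If you script your transfers, one option is to assemble the `rsync` command as an argument list and hand it to `subprocess.run`. The sketch below only builds and prints the command. The helper name is hypothetical, and the remote path is written relative to the remote home directory, which is how `rsync` over SSH treats relative paths.

```python
import shlex

def rsync_command(local_dir, user, host, remote_dir, dry_run=False):
    """Build an `rsync -av` command that copies the contents of local_dir to a remote directory."""
    flags = ["-av"]
    if dry_run:
        flags.append("--dry-run")            # show what would be transferred, copy nothing
    source = local_dir.rstrip("/") + "/"     # trailing slash: copy contents, not the directory itself
    return ["rsync", *flags, source, f"{user}@{host}:{remote_dir}"]

if __name__ == "__main__":
    cmd = rsync_command(
        "data", "x-jdowns", "login07.anvil.rcac.purdue.edu", "data_backup/", dry_run=True
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Running with `dry_run=True` first is a cheap way to confirm that the source and destination are what you intended.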

# Containers

* Using modules or Conda environments will probably suffice for many use cases
* However, sometimes the software you want to use on the HPC cluster is difficult (or impossible) to set up with the module system
* A useful alternative to modules is to use containers

* **What is a container?**
  * A lightweight, isolated user-space that packages an app and its dependencies
  * A container is a bit like a lightweight virtual machine (VM)
  * Docker is perhaps the most commonly used tool for containerization
  * In HPC, we typically use Apptainer (the open successor to Singularity) to run containers, which are often built as Docker images elsewhere
* Anvil has **Apptainer** installed by default
  * Apptainer can pull Docker images and convert them to .sif files it can use
> Many common software packages have prebuilt containers!
* **Why containers on HPC?**
  * Reproducibility: exact versions of libs/tools
  * Portability: same image runs on laptop, cloud, HPC
  * Stability: avoid conflicts with cluster software

## Example: Fire Dynamics Simulator
* The Fire Dynamics Simulator is a computational fluid dynamics (CFD) model of fire-driven fluid flow
* It is an extremely sophisticated but computationally demanding model
* The easiest way to use FDS is via prebuilt binaries
* However, this leads to conflicts with some of Anvil's system software
* We could alternatively compile FDS from source
* However, it's easier to use an existing container!
> **Note:** While I don't expect most of you are interested in using FDS, this demonstrates how you can set up a container-based workflow.

## Walkthrough
Here we'll walk through how you can run the FDS model on Anvil to demonstrate the use of containers.

1. From a shell on Anvil, you can check the installed version of Apptainer with
```
apptainer --version
```

2. Apptainer can pull Docker images and convert them to its own format automatically. For example, you can pull a container that has FDS installed with
```
apptainer pull fds.sif docker://openbcl/fds
```

> **Note**: This container already exists in /anvil/projects/x-cis250773/containers. You can reuse this container in this walkthrough.


3. You can open an interactive shell in the container using the command
```
apptainer shell /anvil/projects/x-cis250773/containers/fds.sif
```
In the shell, type `fds` to get information about the FDS version installed in the container. You can exit the shell with the `exit` command.

### Input File
Now that we have the container, we need to set up an example simulation to run. You can use the following simulation setup that creates a simple box of fuel that will be burned.

> **Note:** You can find this already in /anvil/projects/x-cis250773/fds_tests/box_burn_away1.fds

```
&HEAD CHID='box_burn_away1', TITLE='Test BURN_AWAY feature' /

The FOAM box is evaporated away by the high thermal radiation
from HOT surfaces. The mass of the box is 0.4^3 m3 * 20 kg/m3 = 1.28 kg.
This should be compared to the final value of fuel density volume integral,
computed by the first DEVC.

&MESH IJK=10,10,10 XB=-0.3,0.7,-0.4,0.6,0.0,1.0, MULT_ID='mesh' /
&MULT ID='mesh', DX=1.0, DY=1.0, I_UPPER=1, J_UPPER=1 /

&TIME T_END=30. DT = 0.01/

&SPEC ID='METHANE' /

&MATL ID                   = 'FOAM'
      HEAT_OF_REACTION     = 800.
      CONDUCTIVITY         = 0.2
      SPECIFIC_HEAT        = 1.0
      DENSITY              = 20.
      NU_SPEC              = 1.
      SPEC_ID              = 'METHANE'
      REFERENCE_TEMPERATURE= 200. /

&SURF ID                   = 'FOAM SLAB'
      COLOR                = 'TOMATO 3'
      VARIABLE_THICKNESS   = T
      BURN_AWAY            = T /

&OBST XB=0.30,0.70,0.30,0.70,0.30,0.70, SURF_ID='FOAM SLAB', BULK_DENSITY=20., MATL_ID='FOAM' /

&SURF ID='HOT', TMP_FRONT=1100., COLOR='RED' /

&VENT PBX=-0.3, SURF_ID='HOT' /
&VENT PBX= 1.7, SURF_ID='HOT' /
&VENT PBY=-0.4, SURF_ID='HOT' /
&VENT PBY= 1.6, SURF_ID='HOT' /
&VENT PBZ= 0.0, SURF_ID='HOT' /
&VENT PBZ= 1.0, SURF_ID='HOT' /

&BNDF QUANTITY='WALL TEMPERATURE' /
&BNDF QUANTITY='MASS FLUX', SPEC_ID='METHANE' /

&SLCF PBZ=0.5, QUANTITY='TEMPERATURE', CELL_CENTERED=T /
&SLCF PBZ=0.5, QUANTITY='DENSITY', CELL_CENTERED=T /
&SLCF PBZ=0.5, QUANTITY='MASS FRACTION', SPEC_ID='METHANE', CELL_CENTERED=T /

&DEVC XB=-0.3,1.7,-0.4,1.6,0,1, QUANTITY='DENSITY', SPEC_ID='METHANE', SPATIAL_STATISTIC='VOLUME INTEGRAL', ID='Mass fuel' /

&TAIL /
```
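
The numbers in this input file can be checked with a little arithmetic. The sketch below recomputes the box mass quoted in the file's comment and, based on my reading of the `&MESH` and `&MULT` lines (a 2 x 2 arrangement of identical meshes), the grid resolution and total cell count. The variable names are only for illustration.

```python
# Fuel box from the &OBST line: 0.30 to 0.70 m in each direction
side = 0.70 - 0.30                   # m
volume = side ** 3                   # m^3
density = 20.0                       # kg/m^3 (BULK_DENSITY)
mass = volume * density              # kg, about 1.28 as stated in the file's comment

# Base mesh from the &MESH line: 10 x 10 x 10 cells over a 1 m cube
cells_per_mesh = 10 * 10 * 10
cell_size = (0.7 - (-0.3)) / 10      # m per cell in x

# &MULT with I_UPPER=1 and J_UPPER=1 (lower indices start at 0) gives 2 x 2 copies
n_meshes = (1 + 1) * (1 + 1)
total_cells = n_meshes * cells_per_mesh

# Domain extent in x after the copy shifted by DX=1.0: matches the vent at PBX=1.7
x_max = 0.7 + 1.0

print(f"Box mass    : {mass:.2f} kg")
print(f"Cell size   : {cell_size:.2f} m")
print(f"Meshes      : {n_meshes}")
print(f"Total cells : {total_cells}")
print(f"x extent    : -0.3 to {x_max:.1f} m")
```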

### Setting up the Slurm Job
Next, we need to set up the Slurm batch script. The contents of the script will look like this.

> **Note:** You'll need to modify `CASE_DIR` to point to your scratch or user directory, or a custom directory in the shared project folder. The script copies `box_burn_away1.fds` from `CASE_DIR`, so place the input file there first. You can find a version of this file in /anvil/projects/x-cis250773/fds_tests/run_fds.slurm


```
#!/bin/bash
#SBATCH -A cis250773
#SBATCH -p shared
#SBATCH -J fds_box_output
#SBATCH --time=00:10:00
#SBATCH -o fds_box_output.o%j
#SBATCH -e fds_box_output.e%j
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

set -euo pipefail

# --- Configuration ---
IMG="/anvil/projects/x-cis250773/containers/fds.sif"
CASE_DIR="your own personal output directory here!"
INPUT_FILE="box_burn_away1.fds"

# Create a unique output directory
OUT_DIR="${CASE_DIR}/output/run_${SLURM_JOB_ID}"
mkdir -p "$OUT_DIR"

echo "Starting FDS job $SLURM_JOB_ID"
echo "Case directory : $CASE_DIR"
echo "Output directory: $OUT_DIR"

# Copy the input file into the output dir (optional, for reproducibility)
cp "${CASE_DIR}/${INPUT_FILE}" "$OUT_DIR/"

# Run FDS inside the container
apptainer exec --bind "$OUT_DIR:$OUT_DIR" --pwd "$OUT_DIR" "$IMG" \
  fds "$INPUT_FILE"

# --- Post-processing ---
echo "Simulation completed. Output files are in: $OUT_DIR"
echo "Contents:"
ls -lh "$OUT_DIR"

# You can rename/move summary logs if you like
mv "fds_box_output.o${SLURM_JOB_ID}" "$OUT_DIR/" 2>/dev/null || true
mv "fds_box_output.e${SLURM_JOB_ID}" "$OUT_DIR/" 2>/dev/null || true
```

The main command to look at here is
```
apptainer exec --bind "$OUT_DIR:$OUT_DIR" --pwd "$OUT_DIR" "$IMG" \
  fds "$INPUT_FILE"
```

* The `exec` subcommand tells Apptainer to execute a command inside the container.
* The `--bind "$OUT_DIR:$OUT_DIR"` argument mounts a directory from the host file system into the container using the same path for both.
* The argument `--pwd "$OUT_DIR"` sets the working directory in the container.
* `$IMG` is the path to the container image.
* `fds "$INPUT_FILE"` runs the `fds` command in the container with the given input file.
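
The same command can be assembled from Python, which is convenient if you launch many cases from one driver script. This is a minimal sketch that mirrors the flags explained above. The helper name `apptainer_exec` and the output directory in the demo are hypothetical, and the command is only printed here rather than executed.

```python
import shlex

def apptainer_exec(image, workdir, command, binds=None):
    """Build an `apptainer exec` argument list mirroring the batch script above."""
    binds = [workdir] if binds is None else binds
    cmd = ["apptainer", "exec"]
    for path in binds:
        cmd += ["--bind", f"{path}:{path}"]   # same path on the host and in the container
    cmd += ["--pwd", workdir, image, *command]
    return cmd

if __name__ == "__main__":
    cmd = apptainer_exec(
        image="/anvil/projects/x-cis250773/containers/fds.sif",
        workdir="/path/to/output/run_12345",   # hypothetical output directory
        command=["fds", "box_burn_away1.fds"],
    )
    print(shlex.join(cmd))
    # Inside a Slurm job you could run it with: subprocess.run(cmd, check=True)
```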

### Running
Once you've created and modified your own version of this file, navigate to the directory you specified in `$CASE_DIR` and submit the job as normal (the command below assumes you saved the script as `run_fds_container.slurm`):
```
sbatch run_fds_container.slurm
```
Pay attention to the output once the job begins by inspecting the standard output and error files. What is being printed out? What files does FDS write?
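
To get a quick overview of what a run produced, you could group the files in the output directory by extension. The sketch below does this with `pathlib`. The function name is made up, and it makes no assumptions about which file types FDS actually writes.

```python
from collections import Counter
from pathlib import Path

def summarize_outputs(out_dir):
    """Count the files directly inside out_dir, grouped by file extension."""
    counts = Counter()
    for path in Path(out_dir).iterdir():
        if path.is_file():
            counts[path.suffix or "(none)"] += 1
    return dict(counts)

if __name__ == "__main__":
    # Point this at your run directory, e.g. $CASE_DIR/output/run_<jobid>
    for suffix, n in sorted(summarize_outputs(".").items()):
        print(f"{suffix:>10}: {n} file(s)")
```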

### Visualizing
* If you want to see the results of the simulation you ran, you can do so using an application called Smokeview
* You can download Smokeview for different platforms [here](https://pages.nist.gov/fds-smv/downloads.html)
* The easiest way to visualize the output is to install Smokeview on your local machine and transfer the simulation output, including the `box_burn_away1.smv` file, to your machine
* You can then open the `.smv` file in Smokeview