# Testing the Casper Compute Environment and Running a GPU Program

By: Daniel Howard, March 14th, 2022

This is a test notebook for NCAR's GPU Computing Workshop series. It checks that your user account is set up appropriately and demonstrates simple GPU program examples on NCAR's [Casper computing cluster](https://arc.ucar.edu/knowledge_base/70549550). You may also run this notebook on other personal machines or HPC clusters, but only minimal support will be provided in that case. To initialize and run this notebook on Casper:
1. Start a JupyterHub session via the **[NCAR JupyterHub portal](https://jupyterhub.hpc.ucar.edu/stable/)**
2. For this notebook, choose "**Casper login node**" under the "Cluster Selection" pulldown.
3. **Navigate to the folder** where you'd like to save the GPU workshop GitHub repo (the default is your `$HOME` folder)
4. Select the **git icon** (the diamond-shaped square below the Dask icon) on the side panel at the left side of the browser window
5. Select "**Clone a Repository**" 
6. Enter this git repository address **`https://github.com/dphow/GPU_workshop.git`**
7. Navigate into the newly cloned `GPU_workshop` directory and select the file **`TestCasper.ipynb`** in the folder `00_TestCasper`

Once you have this notebook displayed and running under the Bash kernel (check the active kernel in the top right of the window), run each code cell below in order by selecting the cell and pressing SHIFT+ENTER. Please report any issues or concerns to the workshop organizers or reach out via the [NCAR GPU Users Slack](https://ncargpuusers.slack.com).

All registered workshop users should have permission to use the **UCIS0004** project with their provided NCAR CIT account. Please use this project ID to charge your compute jobs when running work on Casper. You may use this ID for small workshop-related learning work on the order of 30 minutes of walltime or less, ideally less than 5 minutes. However, no production-scale jobs should be submitted under this project's allocation, as it is meant to be shared across the full GPU workshop learning community. If you'd like to request your own allocation for more compute-intensive work, please see the [Allocations documentation](https://arc.ucar.edu/knowledge_base/74317835). For university students and early-career faculty, there are [opportunities available](https://arc.ucar.edu/knowledge_base/75694351) for small one-time allocation awards for unsponsored work, typically to enable dissertation research or provide seed grants towards funded research.

Please run the cell below to set the workshop Project ID (or your own Project ID) for later cells.

In [1]:
export PROJECT=UCIS0004
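
If you already export your own project ID elsewhere (for example in a shell startup file), a variant of the cell above can keep that value and only fall back to the workshop project. This is a minimal sketch, assuming you want an existing `PROJECT` value to take priority:

```shell
# Keep an existing PROJECT value; otherwise fall back to the workshop project
export PROJECT="${PROJECT:-UCIS0004}"

# Confirm which project later cells will charge
echo "Jobs will be charged to project: $PROJECT"
```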

## Display Information about the GPU

First, we are going to submit a job to Casper's PBS job scheduler to run some simple work on a GPU node. To submit jobs, we are going to use the `qsub` and `qcmd` commands. You can learn more about `qsub` and other options for submitting compute jobs to Casper's HPC cluster, including GPUs, at the documentation portal at [arc.ucar.edu - Starting Casper Jobs with PBS](https://arc.ucar.edu/knowledge_base/72581396).

We will now display information about the GPUs on Casper using two commands:

* `nvaccelinfo` - Displays static information about all currently connected GPUs. Available under the NVHPC SDK.
* `nvidia-smi` - Displays dynamic information about all currently connected GPUs. More detailed queries of the GPU state are possible with the options listed in the command's help text (`nvidia-smi -h`). Available with both the CUDA Toolkit and the NVHPC SDK.

Typically, jobs are submitted to Casper and Cheyenne via batch scripts like this pre-configured script [batch_accelinfo.sh](batch_accelinfo.sh) and the command `qsub batch_accelinfo.sh`. These scripts contain header information that the PBS job scheduler interprets to determine how to place jobs across the supercomputing cluster. However, for simple executables we will instead use `qcmd`, a custom wrapper script that submits a `qsub` job and redirects its output directly back to the terminal. Try out `qcmd` in the cells below to see output from the `nvaccelinfo` and `nvidia-smi` commands.
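
The contents of `batch_accelinfo.sh` are not reproduced in this notebook, so the following is only a sketch of what a PBS batch script with header directives could look like. The file name `example_accelinfo.sh`, the job name, and the resource values are assumptions chosen to mirror the `qcmd` options used in this notebook, not the workshop's actual script.

```shell
# Write a small PBS batch script; the name and resource values are illustrative
SCRIPT=example_accelinfo.sh

cat > "$SCRIPT" << 'EOF'
#!/bin/bash
#PBS -N accelinfo
#PBS -A UCIS0004
#PBS -q gpudev
#PBS -l select=1:ncpus=1:ngpus=1
#PBS -l walltime=00:01:00
#PBS -j oe

# Report the GPUs visible to this job
nvidia-smi
EOF

echo "Wrote $SCRIPT; submit it with: qsub $SCRIPT"
```

Lines beginning with `#PBS` are read by the scheduler at submission time and are ordinary comments to Bash, which is why the same file can also be run directly on a machine that already has a GPU.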

If you are running on a local GPU-enabled machine or are already running interactively on a GPU node, expand and run the hidden cell (click the ellipsis ...). Otherwise, keep in mind that it is best to run on the `gpudev` queue (jobs of 30 minutes or less for development, profiling, and debugging) on weekdays from 8am to 6:30pm MT. If you are running this outside those hours, change `-q gpudev` to `-q casper`, the queue shared amongst all Casper users and production jobs. Delays in accessing GPUs will depend on current availability and the size of your compute resource request.
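
The queue guidance above can be encoded in a small helper so you do not have to edit `-q` by hand. This is only a sketch: the function name `pick_gpu_queue` and the `QUEUE` variable are hypothetical, the `America/Denver` time zone is assumed to stand in for MT, and the 6:30pm boundary is treated as exclusive.

```shell
# Suggest a queue from the day of week and the local Mountain Time
pick_gpu_queue() {
    # $1 = day of week (1=Mon .. 7=Sun), $2 = time of day as HHMM
    if [ "$1" -le 5 ] && [ "$2" -ge 800 ] && [ "$2" -lt 1830 ]; then
        echo gpudev    # weekday working hours
    else
        echo casper    # evenings and weekends
    fi
}

QUEUE=$(pick_gpu_queue "$(TZ=America/Denver date +%u)" "$(TZ=America/Denver date +%H%M)")
echo "Suggested queue: $QUEUE"
```

You could then pass `-q $QUEUE` in place of `-q gpudev` in the `qcmd` cells.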

In [None]:
qcmd -A $PROJECT -q gpudev -l walltime=60 -- /glade/u/apps/opt/nvhpc/22.2/Linux_x86_64/22.2/compilers/bin/nvaccelinfo

In [None]:
qcmd -A $PROJECT -q gpudev -l walltime=60 -- nvidia-smi

In [None]:
bash batch_accelinfo.sh

As you can see, these two commands give plenty of information about the GPU(s) currently available to a running process, including memory usage, temperature, currently running GPU processes, and hardware details. If you like, try some of the other queries listed under `nvidia-smi -h`. For example, `nvidia-smi -q` provides substantially more real-time information, and additional arguments can narrow the output to the specific information you may want to log from the GPU.
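
As one example of a narrower query, `nvidia-smi` accepts a comma-separated field list through `--query-gpu` and can print it as CSV. The sketch below is illustrative: the `FIELDS` variable and the particular fields chosen are assumptions, and the guard simply skips the query on machines without `nvidia-smi` on the path.

```shell
# Fields to report; run `nvidia-smi --help-query-gpu` for the full list
FIELDS="name,memory.used,memory.total,temperature.gpu,utilization.gpu"

if command -v nvidia-smi > /dev/null 2>&1; then
    # One CSV row per visible GPU
    nvidia-smi --query-gpu="$FIELDS" --format=csv
else
    echo "nvidia-smi not found; run this on a GPU node instead"
fi
```

On Casper, the same `nvidia-smi` arguments can be placed after the `--` of a `qcmd` command so the query runs on a GPU node.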

You may also want to try targeting different GPUs on Casper to see how other devices differ. Here, we compare with the `gp100` GPUs that are available as part of Casper's Data Analysis and Visualization nodes.

In [None]:
qcmd -A $PROJECT -q casper -l select=1:ncpus=1:ngpus=1 -l gpu_type=gp100 -l walltime=60 -- nvidia-smi

This output will likely show some processes already running on the `GP100` GPU, because these nodes frequently run ongoing visualization and desktop virtualization environments for users to connect to. The `V100` GPUs, in contrast, are typically provided for general-purpose GPU computing and are used exclusively by each job that requests them.

## Running a GPU Program
Now we are going to make sure you can compile CPU/GPU programs and run them. We will go over more details about this process in future sessions.

First, the cells below load the needed compiler software and then compile both a CPU and a GPU program that runs a simple Jacobi heat equation solver. Note that the same source files are used in both compilations, but the GPU compilation is asked to honor the OpenACC directives, which are included as comment lines in the source files [jacobi.f90](jacobi.f90) and [laplace2d.f90](laplace2d.f90). Future sessions will use examples from [miniWeather](https://github.com/mrnorman/miniWeather) and [MPAS](https://ncar.ucar.edu/what-we-offer/models/model-prediction-across-scales-mpas).
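
Because OpenACC directives in Fortran are comment lines that start with the `!$acc` sentinel, you can locate them with ordinary text tools before compiling. The helper below is a sketch; the function name `count_acc_directives` is hypothetical, and it only counts lines whose first non-blank text is the sentinel.

```shell
# Count lines that begin with the OpenACC sentinel !$acc (case-insensitive)
count_acc_directives() {
    grep -ci '^[[:space:]]*!\$acc' "$1"
}

# Report on the workshop sources when they are in the current directory
for f in laplace2d.f90 jacobi.f90; do
    if [ -f "$f" ]; then
        echo "$f: $(count_acc_directives "$f") OpenACC directive lines"
    fi
done
```

A compiler run without `-acc` treats these lines as plain comments, which is why the CPU build can use the same files unchanged.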

Note that small compilation tasks and other work with minimal computational load are permitted on the login nodes, as we are doing below. The resulting executables, and any other computationally expensive work, should only be run on batch compute nodes. If a user process runs an expensive application on the login nodes and impacts other users' experience, the program may be automatically terminated, and repeat incidents may cause the user account to be temporarily limited.

In [6]:
module purge
module load ncarenv cuda/11.6 nvhpc/22.2 &> /dev/null
module list


Currently Loaded Modules:
  1) ncarenv/1.3   2) cuda/11.6   3) nvhpc/22.2

 



In [7]:
nvfortran -fast -o laplace_cpu laplace2d.f90 jacobi.f90 && echo 'Compilation for CPU Successful!'
rm -f *.o *.mod

laplace2d.f90:
jacobi.f90:
Compilation for CPU Successful!


In [8]:
nvfortran -fast -gpu=cc70 -acc -Minfo=accel -o laplace_gpu laplace2d.f90 jacobi.f90 && echo 'Compilation for GPU Successful!'
rm -f *.o *.mod

laplace2d.f90:
initialize:
     35, Generating enter data create(anew(:,:))
         Generating enter data copyin(a(:,:))
calcnext:
     48, Generating present(anew(:,:),a(:,:))
         Generating copy(error) [if not already present]
         Generating NVIDIA GPU code
         49, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
             Generating reduction(max:error)
         50,   ! blockidx%x threadidx%x collapsed
swap:
     66, Generating present(a(:,:),anew(:,:))
         Generating NVIDIA GPU code
         67, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         68,   ! blockidx%x threadidx%x collapsed
dealloc:
     79, Generating exit data delete(anew(:,:),a(:,:))
jacobi.f90:
Compilation for GPU Successful!


We can now run the compiled code for the CPU and GPU separately on a Casper batch compute node by running the below cells:

In [9]:
qcmd -l walltime=00:01:00 -l select=1:ncpus=1 -q casper -A $PROJECT -- `pwd`/laplace_cpu

Waiting on job launch; 2337450.casper-pbs with qsub arguments:
    qsub  -q casper@casper-pbs -A UCIS0004 -l walltime=00:01:00 -l select=1:ncpus=1


Jacobi relaxation Calculation: 4096 x 4096 mesh
0  0.250000
100  0.002397
200  0.001204
300  0.000804
400  0.000603
500  0.000483
600  0.000403
700  0.000345
800  0.000302
900  0.000269
completed in     46.051 seconds


In [10]:
qcmd -l walltime=00:01:00 -l select=1:ncpus=1:ngpus=1 -l gpu_type=v100 -q gpudev -A $PROJECT -- `pwd`/laplace_gpu

Waiting on job launch; 2337453.casper-pbs with qsub arguments:
    qsub  -l gpu_type=v100 -q gpudev@casper-pbs -A UCIS0004 -l walltime=00:01:00 -l select=1:ncpus=1:ngpus=1 -l gpu_type=v100


Jacobi relaxation Calculation: 4096 x 4096 mesh
0  0.250000
100  0.002397
200  0.001204
300  0.000804
400  0.000603
500  0.000483
600  0.000403
700  0.000345
800  0.000302
900  0.000269
completed in      0.901 seconds


The CPU program takes ~50 seconds to complete, so the output from that job may take a while to appear. The GPU job should complete much faster, though the wait depends on the availability of GPU devices at the time. The program also tracks and prints its runtime, so you should see a real measurement of the substantially lower runtime for the GPU program. Again, you may need to change `-q gpudev` to `-q casper` if submitting jobs outside of weekdays 8am to 6:30pm MT.
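
To put a number on the difference, you can divide the two reported runtimes. The sketch below uses the `completed in` lines from the sample output shown above; your own timings will differ, and the variable names are only illustrative.

```shell
# Timings copied from the "completed in" lines of the two sample runs
cpu_line="completed in     46.051 seconds"
gpu_line="completed in      0.901 seconds"

# The elapsed time is the third whitespace-separated field
cpu_s=$(echo "$cpu_line" | awk '{ print $3 }')
gpu_s=$(echo "$gpu_line" | awk '{ print $3 }')

# Ratio of CPU time to GPU time, rounded to one decimal place
speedup=$(awk -v c="$cpu_s" -v g="$gpu_s" 'BEGIN { printf "%.1f", c / g }')
echo "GPU speedup over one CPU core: ${speedup}x"
```

For the sample runs this prints a speedup of 51.1x. The comparison is against a single CPU core, since the CPU job requested `ncpus=1`.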

If you have additional time, you may want to try adjusting the size of the Jacobi problem by modifying the `n` and `m` values in [jacobi.f90](jacobi.f90#L20) line 20 (ensure `n=m`). You can then test the scaling performance of this GPU code for different domain sizes.

## Conclusion
If you were able to get through all the above examples with no problems, **CONGRATULATIONS!** You should be ready for future interactive sessions as part of this GPU Computing Workshop series.

If you have any questions, problems running this notebook, or issues accessing the compute cluster, please reach out to workshop organizers over email or the [NCAR GPU Users Slack](https://ncargpuusers.slack.com).