![NCAR UCAR Logo](../NCAR_CISL_NSF_banner.jpeg)
# Optimizing AI/ML Workflows in Python for GPUs

By: Daniel Howard [dhoward@ucar.edu](mailto:dhoward@ucar.edu), Consulting Services Group, CISL & NCAR 

Date: August 25th, 2022

In this notebook we analyse the overall workflow of typical machine learning/deep learning projects, emphasizing how to work towards optimal performance on GPUs. We will NOT cover the theory behind AI-based projects or how to implement them. We will cover:

1. Background on machine learning research in Earth sciences
2. Setting up Python virtual `conda` environments
    * The RAPIDS AI software suite
    * GPU enabled TensorFlow and PyTorch
3. Enabling tuning and profiling with TensorFlow and PyTorch
4. Profiling with DLProf and performance optimizations for NVIDIA Tensor Cores

## Workshop Etiquette
* Please mute yourself and turn off video during the session.
* Questions may be submitted in the chat and will be answered when appropriate. You may also raise your hand, unmute, and ask questions during Q&A at the end of the presentation.
* By participating, you are agreeing to [UCAR’s Code of Conduct](https://www.ucar.edu/who-we-are/ethics-integrity/codes-conduct/participants)
* Recordings & other material will be archived & shared publicly.
* Feel free to follow up with the GPU workshop team via Slack or submit support requests to [rchelp.ucar.edu](https://support.ucar.edu)
    * Office Hours: Asynchronous support via [Slack](https://ncargpuusers.slack.com) or schedule a time with an organizer

## Start a JupyterHub Session

Head to the [NCAR JupyterHub portal](https://jupyterhub.hpc.ucar.edu/stable), __start a JupyterHub session on a Casper PBS Login Node__, and open the notebook at `14_OptimizeAIML/14_OptimizeAIML.ipynb`. Be sure to clone (if needed) and update/pull the NCAR GPU_workshop directory. You are welcome to use an interactive GPU node for the final few cells of this notebook.

```shell
# Use the JupyterHub GitHub GUI on the left panel or the below shell commands
# Clone once; afterwards, update from inside the existing clone
git clone git@github.com:NCAR/GPU_workshop.git
cd GPU_workshop && git pull
```

## Notebook Setup
__The `GPU_TYPE=gp100` nodes do not have Tensor Cores!__ Thus, the `gpuworkshop` queue is not as useful for this session. Instead, please set `GPU_TYPE=v100` and use the `gpudev` or `casper` queue both during the workshop and for independent work. See the [Casper queue documentation](https://arc.ucar.edu/knowledge_base/72581396#StartingCasperjobswithPBS-Concurrentresourcelimits) for more info.

# What is Machine Learning and Deep Learning?

ML and DL are essentially statistical models that are designed to learn and predict behavior from a large amount of input training data.

<img src="img/Venn_AI.jpg" alt="Venn diagram of AI" style="width:600px;"/>

The BAMS article "[Outlook for Exploiting Artificial Intelligence in the Earth and Environmental Sciences](https://journals.ametsoc.org/view/journals/bams/102/5/BAMS-D-20-0031.1.xml)" by Boukabara, et al highlights additional applications of AI in the Earth Sciences.

## Overview of an Earth Science AI Workflow - Remote Sensing

Multiple steps are needed to enable AI for Earth Science. __GPUs are critical in the most expensive step, model building and training__, since they perform well with matrix algebra, foundational to ML methods.

<img src="img/ML_RemoteSensing.png" alt="Conceptual overview of the derivation of digital entities from remotely sensed data using deep-learning techniques" style="width:500px;"/>

Image: [Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review—Part II: Applications](https://www.mdpi.com/2072-4292/12/18/3053/htm) by Hoeser, et al

## Why Use AI for Earth Science?
Earth Science is largely built on physics based theories and dynamical interactions with the biosphere. Today, these models have scaled to enormous sizes, consuming significant computational resources and data storage.

![Left, E3SM on ANL-Theta, 2022; Right, ECMWF on ORNL-Summit, 2021](img/ECMWF_and_Sat.png)

4 km global runs of [E3SM](https://e3sm.org/) (left) over 100 forecast years use 120M core-hours and produce 250 GB per forecast day, or 12 PB. For the 1 km ECMWF runs (right), see [this article](https://climatemodeling.science.energy.gov/news/nils-wedi-1-km-resolution-ecmwf-esm-simulation) and Nils Wedi's [keynote at ESMD 2020](https://acme-climate.atlassian.net/wiki/download/attachments/1929707601/WEDI_Nils_D1S1_2020-1026.pdf?api=v2).

__AI offers an opportunity to reduce the computational resources required__. Feel free to consult [A Review of Earth Artificial Intelligence](https://www.sciencedirect.com/science/article/pii/S0098300422000036) for current "Grand Challenges".

## Surrogate Models

Novel ways are being explored to more efficiently utilize Earth Science data and reduce required computational resources. A __surrogate model__ in machine learning is a __statistical model__ designed to more efficiently approximate the output of a physics based model.

![Surrogate Modeling, Guo](img/Surrogate_Modeling.png)
Image: [Introduction to Surrogate Modeling](https://towardsdatascience.com/an-introduction-to-surrogate-modeling-part-i-fundamentals-84697ce4d241), Shuai Guo. See "[Learning Nonlinear Dynamical Systems from Data Using Scientific Machine Learning](https://anl.app.box.com/s/hvrk3t8qpg2u218ynggjn87jw1864h12)" by Maulik, ANL.
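As a toy illustration of the surrogate idea, the sketch below treats a small Euler integration of Newtonian cooling as a stand-in for an expensive physics-based model, then fits a least-squares line to its log-transformed output. The cooling model, the parameter values, and the names `physics_model` and `fit_surrogate` are hypothetical and chosen only for illustration; real Earth science surrogates typically use neural networks and far larger training sets.

```python
import math

T_ENV, T_0 = 280.0, 300.0  # ambient and initial temperature in K (illustrative values)

def physics_model(k, t_end=10.0, n_steps=10_000):
    """Stand-in for an expensive simulation: Euler integration of dT/dt = -k (T - T_ENV)."""
    dt = t_end / n_steps
    temp = T_0
    for _ in range(n_steps):
        temp += -k * (temp - T_ENV) * dt
    return temp

def fit_surrogate(ks):
    """Least-squares line through (k, log(T - T_ENV)) built from a few full model runs."""
    ys = [math.log(physics_model(k) - T_ENV) for k in ks]
    n = len(ks)
    k_mean, y_mean = sum(ks) / n, sum(ys) / n
    slope = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys)) / sum(
        (k - k_mean) ** 2 for k in ks)
    intercept = y_mean - slope * k_mean
    # The returned closure is the surrogate: one exp() instead of 10,000 time steps
    return lambda k: T_ENV + math.exp(intercept + slope * k)

surrogate = fit_surrogate([0.05, 0.10, 0.15, 0.20, 0.25])  # five training runs
k_test = 0.12                                             # not in the training set
print(f"physics model: {physics_model(k_test):.3f} K")
print(f"surrogate:     {surrogate(k_test):.3f} K")
```

The trade-off is the usual one: the surrogate is only trustworthy inside the parameter range it was trained on, so the training runs should span the conditions you intend to predict.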

## Neural Ordinary Differential Equations

For example, a stabilized neural ODE can be designed to accurately simulate shocks and chaotic dynamics.

![Neural ODE with Neural Networks](img/Neural_ODE.png)

See paper by Linot, et al "[Stabilized Neural Ordinary Differential Equations for Long-Time Forecasting of Dynamical Systems](https://arxiv.org/abs/2203.15706)".

## Physics Informed Neural Networks (PINNs)
Other applications to consider are Physics Informed Neural Networks. PINNs attempt to embed known physics relationships into the design of a machine learning model.

![Physics Informed Neural Networks with Navier Stokes](img/PINN_NS.png)
Image: [Wikipedia](https://en.wikipedia.org/wiki/Physics-informed_neural_networks)

This may include defining the Navier-Stokes conservation laws as conditions to minimize in an ML model's loss function.
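A minimal sketch of such a loss function follows. To keep it short, it swaps Navier-Stokes for a much simpler assumed governing equation, the decay ODE du/dt = -K u, and approximates the derivative with a finite difference where a real PINN would use automatic differentiation. The names `physics_residual` and `pinn_loss`, the weight, and the collocation points are all hypothetical.

```python
import math

K = 0.5  # decay rate in the assumed governing equation du/dt = -K * u

def physics_residual(u, t, h=1e-4):
    """Residual of du/dt + K*u = 0 at time t, using a central finite difference."""
    dudt = (u(t + h) - u(t - h)) / (2 * h)
    return dudt + K * u(t)

def pinn_loss(u, data, collocation_times, weight=1.0):
    """Data misfit plus a weighted penalty for violating the governing equation."""
    data_loss = sum((u(t) - obs) ** 2 for t, obs in data) / len(data)
    physics_loss = sum(physics_residual(u, t) ** 2 for t in collocation_times) / len(
        collocation_times)
    return data_loss + weight * physics_loss

# Two sparse observations of the true solution u(t) = exp(-K t)
data = [(0.0, 1.0), (2.0, math.exp(-K * 2.0))]
collocation_times = [0.25 * i for i in range(1, 8)]  # points where only physics is enforced

def consistent(t):
    return math.exp(-K * t)  # satisfies both the ODE and the data

def data_only(t):
    # Straight line through both observations: fits the data but violates the ODE
    return 1.0 + (math.exp(-K * 2.0) - 1.0) * t / 2.0

print("physics-consistent candidate:", pinn_loss(consistent, data, collocation_times))
print("data-only candidate:         ", pinn_loss(data_only, data, collocation_times))
```

Both candidates match the observations exactly, so only the physics term distinguishes them. That is the point of the approach: the governing equation constrains the model where no data exist.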

## Resources for Engaging and Learning AI in Earth Sciences

Feel free to reach out to [rchelp@ucar.edu](mailto:rchelp@ucar.edu) if you want assistance recreating environments for any code examples.

1. OLCF AI 4 Science Fluid Flow Tutorial ([GitHub](https://github.com/muralikrishnangm/tutorial-ai4science-fluidflow)) - Uses [MiniWeatherML](https://github.com/mrnorman/miniWeatherML)
2. OpenHackathons GPU Bootcamp ([GitHub](https://github.com/openhackathons-org/gpubootcamp/)) - [HPC AI Examples](https://github.com/openhackathons-org/gpubootcamp/tree/master/hpc_ai) for PINNs, CFD, and Climate
3. NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography ([AI2ES.org](https://www.ai2es.org/)) - [Education Materials](https://www.ai2es.org/products/education/) and [2022 Trust-a-thon GitHub](https://github.com/ai2es/tai4es-trustathon-2022)
4. Argonne ALCF
   * [2021 Simulation, Data, and Learning Workshop for AI](https://www.alcf.anl.gov/events/2021-alcf-simulation-data-and-learning-workshop) ([GitHub](https://github.com/argonne-lcf/sdl_ai_workshop))- Has detailed [DL profiling tutorial notebooks](https://github.com/argonne-lcf/sdl_ai_workshop/tree/master/04_profilingDeepLearning) plus [video](https://www.youtube.com/watch?v=cdLIlOUnRCU)
   * 2022 Introduction to AI-driven Science on Supercomputers ([GitHub](https://github.com/argonne-lcf/ai-science-training-series))
5. Data Driven Atmospheric and Water Dynamics Beucler Lab (U. of Lausanne - Switzerland)
   * [Getting Started with Machine Learning](https://wp.unil.ch/dawn/getting-started-with-machine-learning/) curated resource list
6. [NOAA Workshop on Leveraging Artificial Intelligence in Environmental Sciences](https://www.noaa.gov/ai/events/4th-noaa-ai-workshop) - 4th Workshop free to register, virtual Sept 6-9 2022
7. National Academies - 2022 workshop [Machine Learning and Artificial Intelligence to Advance Earth System Science: Opportunities and Challenges](https://www.nationalacademies.org/event/02-07-2022/machine-learning-and-artificial-intelligence-to-advance-earth-system-science-opportunities-and-challenges-a-workshop)
8. [Climate Informatics](http://www.climateinformatics.org/) community - [Conferences](http://www.climateinformatics.org/conferences/) and [Hackathons](http://www.climateinformatics.org/hackathons/)
9. Book - [Deep learning for the Earth Sciences -- A comprehensive approach to remote sensing, climate science and geosciences](https://github.com/DL4ES/DL4ES)
10. [climatechange.ai](https://www.climatechange.ai/) - Global initiative to catalyze impactful work at the intersection of climate change and machine learning.

# How to Manage Python Software for ML and DL Models
The Python ecosystem already provides many robust pre-built software packages and libraries which are continually maintained. __Learning about and employing the Python ecosystem well can simplify the process of using machine learning tools__. 

The kernel `GPU_Workshop` already has many useful packages plus others (notably [Horovod](https://horovod.ai/) for distributed deep learning) which you are welcome to explore on your own beyond this workshop. 

Run the below cell to get a listing of all packages installed in the `GPU_Workshop` conda environment.

In [None]:
!mamba list -p /glade/work/dhoward/conda/envs/GPU_Workshop/

## Setting Up Conda Environments

Since ensuring compatibility and reproducibility is difficult across python package environments, __you are encouraged to maintain your own personalized `conda` virtual environments__. Nonetheless, NCAR provides a base set of commonly used Python packages via the [NCAR Package Library (NPL)](https://arc.ucar.edu/knowledge_base/83853599). NPL does include the faster package management tool `mamba` which uses the same command syntax as `conda`.

If you prefer to install your own and not use `module load conda`, we encourage [Mambaforge](https://github.com/conda-forge/miniforge). In general, `mamba` is a safe drop-in replacement for `conda`. To update all non-pinned packages in an environment, you can use `mamba update --all`.

## Choosing Conda Channels

To source packages, the channel `conda-forge` is recommended and set as priority on Casper but other channels you may consider are `ncar`, `nvidia`, `rapidsai`, `intel`, `pytorch`, and `anaconda` among others. 

* Learn to manage channels [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) using your `$HOME/.condarc` file
* Define pinned packages, i.e. packages that should stay at a specific version or use a specific build type, via the `/path/to/env/conda-meta/pinned` file

## RAPIDS AI Environment

The `rapidsai` channel provides [RAPIDS](https://rapids.ai/about.html), an open source and NVIDIA maintained suite to execute end-to-end data science and analytics pipelines entirely on GPUs. Feel free to explore RAPIDS [Getting Started Notebooks](https://docs.rapids.ai/start).

![Scale Up and Out in Python](img/ScaleUpOut_RAPIDS.png)

__Scale Up with RAPIDS__ tools and __Scale Out with Dask/UCX or Horovod__ tools.

### Python Packages and RAPIDS Equivalents

![Python Packages and RAPIDS Equivalents](img/PythonPackages_CPUtoGPU.png)

After running `conda config --set channel_priority flexible` (which updates `~/.condarc`), follow the install directions [here](https://rapids.ai/start.html) or run:

```shell
conda create -n rapids-22.08 -c rapidsai -c nvidia -c conda-forge  \
rapids=22.08 python=3.9 cudatoolkit=11.5
```

## Installing Customized Python Packages

For more personalized environments, an example process to setup a `conda` environment on Casper is below: 
```shell
module load conda
# Creates environment in /glade/work/$USER/conda-envs/my-env-name or a fully specified path
mamba create -n my-env-name
mamba activate my-env-name

# The Python version installed here will automatically be pinned
# Recommend to not use the latest Python version (3.10+) given compatibility issues
mamba install python=3.9*

# Ensures we get MKL optimized packages to run on Casper's Intel CPUs
mamba install numpy scipy pandas scikit-learn xarray "libblas=*=*mkl"
# Ensures common packages provide MPI support (typically defaults to OpenMPI). 
# Useful to pin packages in `/path/to/env/conda-meta/pinned` file.
# Quote the build specs so the shell does not try to expand the `*` wildcards.
mamba install mpi4py "fftw=*=mpi*" "h5py=*=mpi*" "netcdf4=*=mpi*"
```

## GPU Enabled Python Packages and Tools
ML libraries [pytorch](https://pytorch.org/get-started/locally/) and [tensorflow](https://www.tensorflow.org/install/pip) require additional steps to ensure they are installed with GPU support.

```shell
mamba install cudatoolkit cudnn cupy nvtx
# Make sure the package build string includes *cuda* to verify GPU support
mamba install "pytorch=1.12.1=cuda112*"
# Don't use the tensorflow-gpu package, as the package solver is inconsistent in the conda-forge channel
# TF recommends pip install for latest official version but conda-forge versions also work
mamba install "tensorflow=2.9.1=cuda112*"

# Enables added profiling capabilities, only available via pip and PyPI or NVIDIA's package index
pip install nvidia-pyindex
pip install nvidia-dlprof nvidia-dlprof-pytorch-nvtx
pip install tensorboard_plugin_profile
```

Each library's documentation linked above has more info about installation options. As of this workshop, TensorFlow guarantees support up to CUDA v11.2 and PyTorch up to CUDA v11.6 so we specified builds with `=cuda112*`. Run `mamba search <package>` to view all available packages given available channels.
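After installing, it is worth confirming from Python that each framework was built with CUDA and can see a device. The sketch below is one way to do that; the helper name `gpu_support_report` is hypothetical, and it skips any framework that is not installed in the active environment.

```python
import importlib.util

def gpu_support_report():
    """Report whether installed DL frameworks were built with CUDA and can see a GPU."""
    report = {}
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["pytorch"] = {
            "version": torch.__version__,
            "built_with_cuda": torch.version.cuda is not None,
            "gpu_visible": torch.cuda.is_available(),
        }
    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf
        report["tensorflow"] = {
            "version": tf.__version__,
            "built_with_cuda": tf.test.is_built_with_cuda(),
            "gpu_visible": len(tf.config.list_physical_devices("GPU")) > 0,
        }
    return report

for name, info in gpu_support_report().items():
    print(name, info)
```

Run this on a GPU node. On a login node without GPUs, `gpu_visible` will be `False` even for a correct install, while `built_with_cuda` should still be `True`.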

### Horovod for Distributed Deep Learning

For distributed deep learning with [Horovod](https://horovod.ai/) instead of Dask, see below or the [Horovod install guide](https://horovod.readthedocs.io/en/stable/install_include.html) for how to use pip to install Horovod from PyPI on Casper.
```shell
module load cuda/11.7 gnu/10.1.0
mamba install pip gxx_linux-64 cmake nccl
export HOROVOD_NCCL_HOME=$CONDA_PREFIX
export HOROVOD_CUDA_HOME=$CUDA_HOME
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow,keras,pytorch]
horovodrun --check-build
```

A useful tutorial for Horovod was given as part of the [Argonne Training Program on Extreme-Scale Computing](https://extremecomputingtraining.anl.gov/agenda-2022/) (ATPESC) - [Data Parallel Deep Learning](https://anl.app.box.com/s/ujkvbb8glmq7n6gzjhxza7vx7n3wa86u)

## Sharing Package Environments
Once set up, you can share or give access to your Python virtual environments, an important step towards enabling reproducible science.

1. On a shared cluster, share a path to your environment, see `mamba env list`. Make sure you provide read access plus write access if you want others to be able to modify the environment. Then run `mamba activate /path/to/env`
2. Others may instead clone a readable environment with `mamba create --name cloned_env --clone /path/to/original_env`
3. To distribute your environment, run `mamba env export > my-env.yml`. Others can then install this environment with `mamba env create -f /path/to/yaml-file`

# Running a Profiler on TensorFlow and PyTorch Models
Both `tensorflow` and `pytorch` have built-in tools and a `tensorboard` GUI for DL profiling, most typically used during the training portion of a deep learning model. Base guides for using these built-in tools follow:

* PyTorch
    * [Profiler Tutorial](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
    * [Building a Benchmark Tutorial](https://pytorch.org/tutorials/recipes/recipes/benchmark.html)
    * [PyTorch Profiler with TensorBoard Tutorial](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html)
* TensorFlow
    * [TensorFlow Profiler Guide](https://www.tensorflow.org/guide/profiler) 
    * [TensorBoard Profiler Analysis Guide](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras)
* TensorBoard - [Callbacks API Class](https://keras.io/api/callbacks/tensorboard/)

## Easy Ways to Implement TensorFlow and PyTorch Profilers
__PyTorch__

```python
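import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

model = models.resnet18().cuda()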
inputs = torch.randn(5, 3, 224, 224).cuda()

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

__TensorFlow__ - See [API](https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/start) for additional options

```python
import tensorflow as tf

tf.profiler.experimental.start('/path/to/log/output/')

# ... training loop ...

tf.profiler.experimental.stop()
```

## Using NVIDIA Tools for Profiling DL Models

The tools `nsys` and `ncu` can similarly be run against DL Python codes. The [`dlprof` tool](https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/) was developed to run `nsys` on DL models and present the output in a TensorBoard interface. However, `dlprof` is no longer being developed, in favor of the frameworks' built-in profiling methods described previously.

* PyTorch
    * DNN Layer annotations are disabled by default
    * Use `with torch.autograd.profiler.emit_nvtx():`
    * Manually with `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`
    * TensorRT backend is already annotated
* TensorFlow
    * Annotated by default with NVTX, _only in `nvidia-pyindex` TF 1.X containers_
        * `export TF_DISABLE_NVTX_RANGES=1` to disable for production
    * For TensorFlow 2.X, must manually inline NVTX ranges or use `dlprof --mode=tensorflow2 ...`
    
NVIDIA provides their own guides, such as [NVIDIA Deep Learning Performance](https://docs.nvidia.com/deeplearning/performance/index.html). A small example using the `nsys`/`ncu` tools and `dlprof` with DL models can be found [here](https://github.com/argonne-lcf/sdl_ai_workshop/tree/master/04_profilingDeepLearning/NvidiaProfiler). `dlprof` can still work well in NVIDIA [NGC Containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) but compatibility elsewhere is not well supported.
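For the manual PyTorch annotations mentioned above, a small context manager keeps each `range_push` paired with its `range_pop`. The sketch below is one possible wrapper, not an official API: the name `nvtx_range` is hypothetical, and it falls back to a no-op when PyTorch or a usable CUDA device is missing so the same script still runs on a login node.

```python
from contextlib import contextmanager

try:
    import torch
    # CPU-only builds ship a stub that raises, so only use NVTX when CUDA is usable
    _nvtx = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _nvtx = None

@contextmanager
def nvtx_range(name):
    """Mark a named region on the nsys timeline; a no-op when NVTX is unavailable."""
    if _nvtx is not None:
        _nvtx.range_push(name)
    try:
        yield
    finally:
        if _nvtx is not None:
            _nvtx.range_pop()

# Usage: wrap the phases of a (placeholder) training step
with nvtx_range("forward"):
    activations = [x * 2.0 for x in range(4)]   # stand-in for model(inputs)
with nvtx_range("backward"):
    grads = [a * 0.5 for a in activations]      # stand-in for loss.backward()
```

The named ranges then appear as labeled spans when the script is run under `nsys`.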

## Common Performance Considerations
1. I/O
    * Use designated TF/PT data loaders
        * TensorFlow - [Better Performance with the `tf.data` API](https://www.tensorflow.org/guide/data_performance)
        * PyTorch - [Datasets & Dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
    * Multithreading, eg [Multi-Worker Training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras)
2. CPU to/from GPU data copies
    * Rewrite code with TF/PT tensors or use CuPy, etc
    * Overlap copy and computation
3. Batch size - Increase batch size until the GPU is saturated
4. Precision (Background: See Theo Mary's [Mixed Precision Arithmetic](https://www.youtube.com/watch?v=9ZnwfPvAlHM) talk at London Math Society)
    * Consider mixed precision, [NVIDIA Mixed Precision Training Guide](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html)
    * Automatic Mixed Precision (AMP) settings
        * [PT Guide](https://pytorch.org/docs/stable/notes/amp_examples.html): `scaler = torch.cuda.amp.GradScaler()`
        * [TF Guide](https://www.tensorflow.org/guide/mixed_precision): `policy = mixed_precision.Policy('mixed_float16'); mixed_precision.set_global_policy(policy)`
    * Ensure usage of Tensor Cores with Mixed Precision
    
TensorFlow provides a comprehensive guide, [Optimize TensorFlow GPU performance with the TensorFlow Profiler](https://www.tensorflow.org/guide/gpu_performance_analysis)
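For the batch size consideration above, one practical approach is to sweep a few candidate sizes and stop where throughput no longer improves. The sketch below assumes a hypothetical `fake_train_step` that only sleeps, standing in for a real training step; the helper names and the 10% gain threshold are also illustrative choices.

```python
import time

def measure_throughput(step_fn, batch_size, n_steps=5):
    """Samples per second for `step_fn(batch_size)`, skipping one warm-up step."""
    step_fn(batch_size)                      # warm-up (kernel launch, allocator, caches)
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn(batch_size)
    return n_steps * batch_size / (time.perf_counter() - start)

def pick_batch_size(throughputs, min_gain=0.10):
    """Smallest size whose next larger candidate improves throughput by < min_gain."""
    sizes = sorted(throughputs)
    for small, large in zip(sizes, sizes[1:]):
        if throughputs[large] < throughputs[small] * (1.0 + min_gain):
            return small
    return sizes[-1]

def fake_train_step(batch_size, capacity=64):
    """Placeholder for a real step: constant time until 'capacity' is exceeded."""
    time.sleep(0.002 * max(1.0, batch_size / capacity))

throughputs = {bs: measure_throughput(fake_train_step, bs) for bs in (16, 32, 64, 128, 256)}
print("chosen batch size:", pick_batch_size(throughputs))
```

When timing real GPU work, synchronize the device before reading the timer (for example `torch.cuda.synchronize()` in PyTorch), since kernels are launched asynchronously. Keep in mind that changing the batch size can also affect model convergence.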

### Performance Improvements with Tensor Cores

![Tensor Core Arithmetic and Estimated Throughput](img/Tensorcore.png)
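Tensor Cores reach their highest throughput with reduced precision inputs such as FP16, which is why AMP pairs mixed precision with loss scaling. The sketch below uses only the standard library to show the underlying issue: a small gradient value underflows to zero in FP16 unless it is scaled up first. The gradient value and the scale factor are illustrative assumptions, and the helper `to_fp16` is hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back (the 'e' struct format)."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8            # a small gradient, below FP16's smallest subnormal (about 6e-8)
loss_scale = 1024.0    # illustrative; AMP implementations adjust this dynamically

unscaled = to_fp16(grad)                              # underflows to zero in FP16
# Scale up, store in FP16, then unscale in higher precision
recovered = to_fp16(grad * loss_scale) / loss_scale

print("FP16 without scaling:  ", unscaled)
print("FP16 with loss scaling:", recovered)
```

This is the role of utilities such as `torch.cuda.amp.GradScaler`: scale the loss before the backward pass, then unscale the gradients before the optimizer step.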

# Profiler Runs of a Geomagnetic Field LSTM Model
This Long Short-Term Memory (LSTM) example comes courtesy of the [Trustworthy AI for Environmental Science Trust-a-thon](https://github.com/ai2es/tai4es-trustathon-2022/tree/main/space). You can follow the original example, with data preparation and explanation of how the LSTM model is implemented in the [source notebook](https://github.com/ai2es/tai4es-trustathon-2022/blob/main/space/magnet_lstm_tutorial.ipynb).

To begin, let's first download data to use for training and validation of our LSTM model.

In [None]:
%%capture captured_io
%%bash

# Download data we need. If a directory "data/" already exists, we'll assume the data are already downloaded.
#      The above "magic" statements are used to capture shell in/out and to run the following Bash commands.
if [ ! -d "data" ]; then
  wget --verbose https://ngdc.noaa.gov/geomag/data/geomag/magnet/public.zip
  wget --verbose https://ngdc.noaa.gov/geomag/data/geomag/magnet/private.zip
  unzip public.zip
  unzip private.zip
  mkdir -v data
  mv -v public private data/
  mv -v public.zip private.zip data/
fi
# If you have trouble downloading, debug by running this line in a separate Python cell:
#print(captured_io)

## Profile the `magnet_lstm_tutorial.py` Python Script
The full Geomagnetic Field LSTM model is condensed into the Python file [magnet_lstm_tutorial.py](magnet_lstm_tutorial.py). Recall that profiling does not require analyzing the full runtime of most models. In DL, most operations are highly repetitive, so the profiler only needs to sample a small portion of the runtime. __Minimizing the time for profile runs can speed up the iterative development process__.

* __TODO Line 237__: Adjust parameter `n_epochs=1` in order to minimize profiling time.
* __TODO Line 295/301__: Add TensorBoard callbacks as defined below.

```python
tboard_callback = keras.callbacks.TensorBoard(
    log_dir = "profile_results", histogram_freq = 1, profile_batch = '500,520')

...

model.fit(
    ...,
    callbacks = [tboard_callback]
)
```
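When shortening a run, check that the profiled batch range still falls inside the steps that actually execute; a window such as `'500,520'` may never be reached if the run has fewer than 520 training steps. The helper below is a hypothetical sketch for building a `profile_batch` string that fits. It assumes batches are counted cumulatively across epochs, which you should confirm in the Callbacks API documentation for your TensorFlow version.

```python
def profile_batch_window(steps_per_epoch, n_epochs=1, warmup_steps=10, n_profiled=20):
    """Return a 'start,stop' string for `profile_batch` that fits within the run."""
    total_steps = steps_per_epoch * n_epochs
    if total_steps < 2:
        raise ValueError("need at least two training steps to profile a range")
    start = min(warmup_steps + 1, total_steps - 1)   # skip warm-up steps when possible
    stop = min(start + n_profiled - 1, total_steps)  # clip the window to the run length
    return f"{start},{stop}"

print(profile_batch_window(steps_per_epoch=2000))   # long epoch: full 20-step window
print(profile_batch_window(steps_per_epoch=15))     # short epoch: window is clipped
```

Skipping the first few steps keeps one-time costs such as graph tracing and memory allocation out of the profile.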

__Question__: How else could you minimize runtime of a "profile run" but still maintain model configuration parameters equivalent to production runs?

In [None]:
%%bash
qsub pbs_job.sh

In [None]:
%%bash
# Run this cell if in an interactive GPU node job
#module load cuda/11.7 &> /dev/null
#python magnet_lstm_tutorial.py

## Open the Profile Report in TensorBoard
Typically you'll run this on the login node of Casper and will need to do some `ssh` port forwarding to access the server. You can generally follow these steps:

1. `ssh -L$PORTA:localhost:$PORTB $USER@casper.ucar.edu`
2. `module load conda`
3. `mamba activate /glade/work/dhoward/conda/envs/GPU_Workshop`
4. `cd /path/to/log/output`
5. `tensorboard --port $PORTB --bind_all --logdir </path/to/log/output/>` and wait for the message that says the server has started
6. Open a browser on your laptop and go to `localhost:$PORTA`

In [None]:
# If on a local machine with GPU, use these commands to open the profile.
# Otherwise, port forwarding is needed on Casper
# %load_ext tensorboard
# %tensorboard --logdir=profile_results

## Analyzing Profiles in TensorBoard

![TensorBoard Performance Summary](img/TensorBoard_PerformanceSummary.png)
Performance improvement heuristics are often provided with links to more detailed information.

![TensorBoard Kernel Stats](img/TensorBoard_KernelStats.png)

It is important to identify __where Tensor Core use is eligible__ in your model and determine whether reduced precision is appropriate there.
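Eligibility often depends on tensor shapes as well as precision. NVIDIA's Deep Learning Performance documentation commonly recommends that matrix dimensions be multiples of 8 for FP16 so that Tensor Cores are used efficiently. The sketch below checks a few hypothetical layer sizes against that guidance; the helper name and the sizes are illustrative, and the exact requirement depends on your GPU generation and library versions.

```python
def pad_to_multiple(n, multiple=8):
    """Round a layer dimension up to the next multiple (8 is the usual FP16 guidance)."""
    return ((n + multiple - 1) // multiple) * multiple

# Hypothetical layer sizes: batch, input features, hidden units
for name, size in {"batch": 250, "features": 33, "hidden": 512}.items():
    padded = pad_to_multiple(size)
    note = "already aligned" if padded == size else f"consider {padded}"
    print(f"{name:>8}: {size:4d} -> {note}")
```

Adjusting a batch size or hidden dimension by a few units is usually cheap compared with the throughput lost when an operation falls back to non-Tensor-Core kernels.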

![TensorBoard Trace Timeline](img/TensorBoard_Trace.png)