# Module 4: Project 5 - Introduction to Parallel Programming and Performance Optimization on HPC Systems with Respect to Energy Consumption

In this module, participants will learn the principles of parallel programming and performance optimization techniques for HPC systems. They will explore how to leverage parallelism to improve code execution on distributed memory systems using MPI and on GPUs using CUDA. The module will include three examples, each with an original application code and an optimized version demonstrating improvements in execution time and energy consumption.

## Outline

1. Example 1: Python MPI Optimization
2. Example 2: Python CUDA Optimization
3. Example 3: Python MPI + CUDA Optimization

## Example 1: Python MPI Optimization

### Introduction to MPI and parallel programming concepts

Message Passing Interface (MPI) is a standardized and portable message-passing system designed to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a core set of library routines for writing portable message-passing programs, with official bindings for C and Fortran (the separate C++ bindings were removed in MPI-3.0). The Python examples in this module access MPI through the mpi4py package. There are several well-tested and efficient implementations of MPI, many of which are open source or in the public domain. These have fostered the development of a parallel software industry and encouraged the development of portable, scalable, large-scale parallel applications.

![MPI](https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/MPI_logo.svg/1200px-MPI_logo.svg.png)

### Analyzing the original Python MPI code

We will start by analyzing a simple MPI code written in Python. This code will demonstrate the basic MPI operations such as sending and receiving messages between different processes.

### Identifying bottlenecks and performance issues

After understanding the basic operations, we will use profiling tools to identify the bottlenecks and performance issues in the code. This will give us insight into which parts of the code we need to optimize.
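As a minimal sketch of the profiling step, the standard-library `cProfile` module can rank functions by cumulative time. The `compute` function and the `profile_call` helper below are hypothetical stand-ins for the work done by one MPI rank; they are not part of the module's examples.

```python
import cProfile
import io
import pstats


def compute(n):
    """Hypothetical stand-in for the compute phase of one rank."""
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(compute, 1000)
print(report)
```

In an MPI run, each rank would profile its own work, so the reports can differ from rank to rank; comparing them is one way to spot load imbalance.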

### Implementing MPI communication improvements

Based on the identified bottlenecks, we will implement improvements in the MPI communication. This can include techniques such as non-blocking communication, collective communication, and optimizing the communication topology.
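One of these techniques, collective communication, could look like the sketch below. It assumes mpi4py's pickle-based `comm.scatter` and `comm.reduce`; the helper names `split_evenly` and `parallel_sum` are illustrative and do not come from the module's code. mpi4py is imported inside the function so that the splitting helper can be tried without an MPI installation.

```python
def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def parallel_sum(values):
    """Scatter chunks from rank 0, sum locally, and reduce the partial sums on rank 0.

    Returns the total on rank 0 and None on every other rank.
    """
    from mpi4py import MPI  # imported here so split_evenly works without MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    chunks = split_evenly(values, size) if rank == 0 else None
    local = comm.scatter(chunks, root=0)   # one collective instead of many sends
    return comm.reduce(sum(local), op=MPI.SUM, root=0)
```

A single scatter and a single reduce replace a loop of point-to-point messages, which lets the MPI library choose an efficient communication pattern for the machine.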

### Measuring and comparing the performance of the optimized code

Finally, we will measure the performance of the optimized code and compare it with the original code. This will demonstrate the effectiveness of the optimization techniques.
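The comparison can be reduced to a few numbers. The sketch below uses hypothetical helper names (`best_time`, `speedup`, `parallel_efficiency`) and times a callable with `time.perf_counter`; in an MPI run, `MPI.Wtime()` taken after a barrier would play the same role.

```python
import time


def best_time(func, repeats=5):
    """Return the shortest wall-clock time in seconds over several runs of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(t_original, t_optimized):
    """How many times faster the optimized version is than the original."""
    return t_original / t_optimized


def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Speedup divided by the process count; 1.0 corresponds to ideal scaling."""
    return t_serial / (t_parallel * n_procs)
```

Taking the best of several repeats reduces the influence of one-off effects such as JIT compilation or a cold file-system cache.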

## Example 2: Python CUDA Optimization

### Introduction to CUDA programming and GPU architecture

CUDA is a parallel computing platform and application programming interface (API) created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach known as GPGPU (general-purpose computing on graphics processing units). The CUDA platform is designed to work with programming languages such as C, C++, and Fortran; the Python examples in this module reach it through Numba's `numba.cuda` target. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to earlier APIs such as Direct3D and OpenGL, which required advanced skills in graphics programming.

![CUDA](https://developer.nvidia.com/sites/default/files/pictures/2018/parallel-computing-cuda-software-ecosystem.png)

### Understanding the original Python CUDA code

We will start by understanding a simple CUDA code written in Python. This code will demonstrate the basic CUDA operations such as defining a kernel function, allocating GPU memory, and launching the kernel.

### Exploring memory access patterns and cache optimization

We will then explore the memory access patterns in the code and how they affect the performance. We will also discuss how to optimize the cache usage in CUDA.
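To make the effect of an access pattern concrete, the following sketch counts how many memory segments one warp touches for a given stride. It is a simplified model, not a measurement: it assumes 32 threads per warp, 4-byte elements, and 128-byte segments, and real transaction sizes depend on the GPU architecture. The function name `segments_touched` is hypothetical.

```python
def segments_touched(stride, warp_size=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments read when thread t loads element t * stride."""
    segments = {
        (t * stride * elem_bytes) // segment_bytes for t in range(warp_size)
    }
    return len(segments)


for stride in (1, 2, 32):
    print(f"stride {stride:2d}: {segments_touched(stride)} segment(s) per warp")
```

Under this model a stride of 1 needs a single segment, while a stride of 32 needs 32 segments for the same amount of useful data, which is why consecutive threads should access consecutive elements.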

### Applying loop unrolling and other memory optimization techniques

Based on the memory access patterns, we will apply optimization techniques such as loop unrolling and memory coalescing. These techniques can significantly improve the performance of the code.
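The index pattern behind manual loop unrolling is shown below in plain Python, using a hypothetical `add_unrolled` function. It illustrates the technique only: the Python interpreter gains nothing from it, and in a CUDA kernel the compiler often unrolls short loops on its own.

```python
def add_unrolled(x, y):
    """Element-wise sum of two equal-length sequences, four elements per iteration."""
    n = len(x)
    out = [0.0] * n
    main = n - n % 4  # largest multiple of 4 that does not exceed n
    for i in range(0, main, 4):
        # Four independent additions per iteration reduce loop overhead.
        out[i] = x[i] + y[i]
        out[i + 1] = x[i + 1] + y[i + 1]
        out[i + 2] = x[i + 2] + y[i + 2]
        out[i + 3] = x[i + 3] + y[i + 3]
    for i in range(main, n):  # remainder loop for the last n % 4 elements
        out[i] = x[i] + y[i]
    return out
```

The remainder loop is the part that is easy to forget: without it, any length that is not a multiple of the unroll factor would leave the last elements unprocessed.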

### Benchmarking the performance and energy efficiency of the optimized code

Finally, we will benchmark the performance and energy efficiency of the optimized code and compare it with the original code. This will demonstrate the effectiveness of the optimization techniques.
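Energy is power integrated over time, so a series of power samples can be turned into joules with the trapezoidal rule. The sketch below assumes the samples are already available as two lists (they could come, for instance, from `nvidia-smi --query-gpu=power.draw --format=csv`); the numbers used here are hypothetical, not measurements, and `energy_joules` is an illustrative name.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for k in range(1, len(timestamps_s)):
        dt = timestamps_s[k] - timestamps_s[k - 1]
        total += 0.5 * (power_w[k] + power_w[k - 1]) * dt
    return total


# Hypothetical samples for illustration only.
original = energy_joules([0.0, 1.0, 2.0], [100.0, 200.0, 100.0])
optimized = energy_joules([0.0, 1.0], [150.0, 150.0])
print(f"original: {original} J, optimized: {optimized} J")
```

A shorter run can use less energy even at a higher average power draw, which is why execution time and energy are reported together.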

Note: Running the Example 2 code requires a CUDA-capable NVIDIA GPU, the CUDA libraries, and the Numba package.

## Example 3: Python MPI + CUDA Optimization

### Combining MPI and CUDA for distributed GPU computing

In this section, we will discuss how to combine MPI and CUDA for distributed GPU computing. This approach allows us to leverage the power of multiple GPUs in a distributed system.
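A first practical question is which GPU each MPI process should use. The sketch below assumes that ranks fill each node consecutively, so the rank modulo the number of visible GPUs gives a usable device index; `gpu_for_rank` and `bind_rank_to_gpu` are hypothetical helpers. If the scheduler already exposes exactly one GPU per rank, the result is simply device 0.

```python
def gpu_for_rank(rank, gpus_per_node):
    """Pick a GPU index for an MPI rank, assuming ranks fill each node consecutively."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return rank % gpus_per_node


def bind_rank_to_gpu():
    """Select this rank's GPU through Numba (needs mpi4py, Numba, and a GPU)."""
    from mpi4py import MPI   # imported here so gpu_for_rank works without MPI
    from numba import cuda

    rank = MPI.COMM_WORLD.Get_rank()
    device_id = gpu_for_rank(rank, len(cuda.gpus))
    cuda.select_device(device_id)
    return device_id
```

Calling `bind_rank_to_gpu()` once, before any device allocation, keeps every later `cuda.to_device` call on the selected GPU.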

![MPI+CUDA](https://www.olcf.ornl.gov/wp-content/uploads/2011/08/CUDA_Aware_MPI1.png)

### Reviewing the original Python MPI+CUDA code

We will start by reviewing a Python program that uses both MPI and CUDA. It demonstrates how to perform parallel computation on multiple GPUs, using MPI for communication between the processes that drive the GPUs and CUDA for the computation on each GPU.

### Optimizing MPI communication and GPU memory transfers

We will then optimize the MPI communication and GPU memory transfers in the code. This can involve techniques such as overlapping communication and computation, using CUDA-aware MPI, and optimizing the communication pattern.

### Utilizing asynchronous execution for better performance

We will also discuss how to utilize asynchronous execution in CUDA and MPI to achieve better performance. This can involve techniques such as using CUDA streams and non-blocking MPI communication.
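The non-blocking MPI half of this idea can be sketched as follows: start the transfer, do work that does not depend on it, and only then wait. The names `local_work` and `overlapped_exchange` are hypothetical, and whether the transfer really progresses during the computation depends on the MPI implementation.

```python
def local_work(values):
    """Hypothetical computation that does not depend on the message in flight."""
    return [2 * v for v in values]


def overlapped_exchange(payload, local_values):
    """Rank 0 sends payload to rank 1 while both ranks compute.

    Returns (local_result, received); received is None except on rank 1.
    """
    from mpi4py import MPI  # imported here so local_work runs without MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    req = None
    if rank == 0:
        req = comm.isend(payload, dest=1, tag=11)
    elif rank == 1:
        req = comm.irecv(source=0, tag=11)
    local_result = local_work(local_values)  # runs while the message is in flight
    received = req.wait() if req is not None else None
    return local_result, received
```

The same structure applies on the GPU side, where a kernel launched on one CUDA stream can run while a copy proceeds on another.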

### Evaluating the improvements achieved in terms of performance and energy efficiency

Finally, we will evaluate the improvements achieved in terms of performance and energy efficiency. We will compare the performance and energy consumption of the optimized code with the original code.

Note: Running the Example 3 code requires a multi-GPU setup with the MPI and CUDA libraries installed. On Perlmutter, you also need to choose the exclusive GPU configuration when starting JupyterHub.

In [None]:
# Example 1: Python MPI Optimization
# Original Code

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)

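# This is a simple MPI code where process 0 sends a dictionary to process 1.
# The communication is blocking: comm.send returns only once the send buffer is safe to reuse.
# Depending on the MPI implementation and the message size, that may be before process 1
# has actually received the message.
# Run it with at least two processes; with only one, rank 1 does not exist.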

In [None]:
# Example 1: Python MPI Optimization
# Optimized Code

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

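if rank == 0:
    data = {'a': 7, 'b': 3.14}
    req = comm.isend(data, dest=1, tag=11)
    # Work that does not depend on the send can go here, before the wait.
    req.wait()
elif rank == 1:
    req = comm.irecv(source=0, tag=11)
    # Work that does not depend on the incoming data can go here.
    data = req.wait()

# In the optimized code, we use non-blocking communication.
# isend and irecv return immediately, so each process can do other work before calling wait().
# As written, wait() follows at once, so nothing overlaps yet; the gain appears only when
# independent work is placed between the non-blocking call and the wait.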

In [None]:
# Example 2: Python CUDA Optimization
# Original Code

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
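    tx = cuda.threadIdx.x  # thread index within the block
    bx = cuda.blockIdx.x   # block index within the grid
    bw = cuda.blockDim.x   # number of threads per block
    i = tx + bx * bw       # global index of this thread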

    if i < x.size:
        out[i] = x[i] + y[i]

x = np.arange(100).astype(np.float32)
y = 2 * x
out = np.empty_like(x)

threadsperblock = 32
blockspergrid = (x.size + (threadsperblock - 1)) // threadsperblock
add_kernel[blockspergrid, threadsperblock](x, y, out)

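# This is a simple CUDA code where we define a kernel function to add two arrays.
# The kernel is launched with enough blocks of 32 threads to cover every element.
# Because NumPy arrays are passed directly, Numba copies all three arrays to the GPU
# before the launch and copies all of them back to the host afterwards.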

In [None]:
# Example 2: Python CUDA Optimization
# Optimized Code

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
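    tx = cuda.threadIdx.x  # thread index within the block
    bx = cuda.blockIdx.x   # block index within the grid
    bw = cuda.blockDim.x   # number of threads per block
    i = tx + bx * bw       # global index of this thread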

    if i < x.size:
        out[i] = x[i] + y[i]

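# Same input data as in the original version, so this cell can run on its own.
x = np.arange(100).astype(np.float32)
y = 2 * x

# Copy the inputs to the GPU once and allocate the output directly on the device.
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)
out_device = cuda.device_array_like(x)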

threadsperblock = 32
blockspergrid = (x.size + (threadsperblock - 1)) // threadsperblock
add_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)

out = out_device.copy_to_host()

# In the optimized code, we copy the inputs to the GPU explicitly before launching the kernel
# and copy only the result back to the host after the kernel execution.
# This avoids the implicit transfers Numba performs for host arrays, which copy every
# argument to the device and back at each launch, including inputs that never change.

In [None]:
# Example 3: Python MPI + CUDA Optimization
# Original Code

from mpi4py import MPI
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
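    tx = cuda.threadIdx.x  # thread index within the block
    bx = cuda.blockIdx.x   # block index within the grid
    bw = cuda.blockDim.x   # number of threads per block
    i = tx + bx * bw       # global index of this thread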

    if i < x.size:
        out[i] = x[i] + y[i]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

x = np.arange(100).astype(np.float32)
y = 2 * x
out = np.empty_like(x)

if rank == 0:
    comm.send(x, dest=1, tag=11)
    comm.send(y, dest=1, tag=12)
elif rank == 1:
    x = comm.recv(source=0, tag=11)
    y = comm.recv(source=0, tag=12)

    threadsperblock = 32
    blockspergrid = (x.size + (threadsperblock - 1)) // threadsperblock
    add_kernel[blockspergrid, threadsperblock](x, y, out)

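# This is a simple MPI+CUDA code where process 0 sends two arrays to process 1,
# and process 1 uses CUDA to add the two arrays.
# The lowercase send/recv methods pickle the arrays, and the kernel launch copies the
# host arrays to and from the GPU implicitly.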

In [None]:
# Example 3: Python MPI + CUDA Optimization
# Optimized Code

from mpi4py import MPI
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
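    tx = cuda.threadIdx.x  # thread index within the block
    bx = cuda.blockIdx.x   # block index within the grid
    bw = cuda.blockDim.x   # number of threads per block
    i = tx + bx * bw       # global index of this thread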

    if i < x.size:
        out[i] = x[i] + y[i]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

x = np.arange(100).astype(np.float32)
y = 2 * x
out = np.empty_like(x)

if rank == 0:
    req1 = comm.isend(x, dest=1, tag=11)
    req2 = comm.isend(y, dest=1, tag=12)
    req1.wait()
    req2.wait()
elif rank == 1:
    req1 = comm.irecv(source=0, tag=11)
    req2 = comm.irecv(source=0, tag=12)
    x = req1.wait()
    y = req2.wait()

    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)

    threadsperblock = 32
    blockspergrid = (x.size + (threadsperblock - 1)) // threadsperblock
    add_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)

    out = out_device.copy_to_host()

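# In the optimized code, we use non-blocking MPI communication, so both messages are in
# flight at the same time, and we copy the data to the GPU explicitly before launching the kernel.
# Only the result is copied back to the host after the kernel execution.
# This can reduce the data-transfer overhead; placing independent work between the
# non-blocking calls and wait() would additionally overlap communication with computation.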