# Module 4: Project 4 - Introduction to High-Performance Computing (HPC) and Energy Consumption in Programming
High-Performance Computing (HPC) uses supercomputers and parallel processing techniques to solve complex computational problems quickly and efficiently. In this notebook, you will be introduced to, and explore, the parallel programming paradigms used in HPC.
![HPC Analysis](https://www.nersc.gov/assets/Uploads/Perlmutter-panel-art-5-2021__ResizedImageWzYwMCwyMjVd.jpg)
## Energy Consumption in HPC
Energy consumption is a critical aspect of HPC. With the growing demand for computational power, the energy required to fuel HPC systems has also increased. Understanding and optimizing energy consumption is vital for both economic and environmental reasons.

In this notebook, we will explore the concepts related to HPC with a focus on energy consumption. We will provide examples that demonstrate the use of MPI (Message Passing Interface) and CUDA (Compute Unified Device Architecture) in HPC systems.
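
To make "energy consumption" concrete: for a roughly constant power draw, the energy a job uses is its average power multiplied by its runtime. The sketch below only illustrates that arithmetic. The function name `energy_kwh`, the node count, and the power and runtime figures are hypothetical values chosen for illustration, not measurements from any real system.

```python
def energy_kwh(avg_power_watts, runtime_seconds):
    """Energy in kilowatt-hours for a constant average power draw."""
    joules = avg_power_watts * runtime_seconds  # 1 W for 1 s is 1 J
    return joules / 3.6e6                       # 1 kWh = 3.6e6 J

# Hypothetical job: 4 nodes drawing 500 W each for 2 hours.
nodes = 4
job_kwh = energy_kwh(avg_power_watts=nodes * 500, runtime_seconds=2 * 3600)
print(f"Estimated energy: {job_kwh:.2f} kWh")
```

Because energy depends on both factors, a run that finishes sooner is not automatically more efficient if it draws proportionally more power while it runs.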

# Introduction to Parallel Programming
Parallel programming is a programming model in which an application's work is divided into multiple concurrent tasks (threads or processes) to increase computational speed and solve larger problems. It is widely used in high-performance computing (HPC) to perform complex computations more efficiently.

In parallel programming, tasks are divided into subtasks that are processed simultaneously (in parallel) on different processors or cores in a computer. This approach can significantly reduce the execution time for large-scale computational problems.

![Parallel Programming](https://miro.medium.com/max/700/1*QV1NtKlFP2fngZ3mXCtmOQ.png)

In the image above, the same task is divided into four subtasks that are executed simultaneously, resulting in a significant reduction in total execution time.

Parallel programming is essential in HPC because it allows us to leverage the full potential of modern multi-core and multi-processor systems. It's used in a variety of applications, including simulations, data analysis, machine learning, and many others.
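
The divide, compute, and combine pattern described above can be sketched with Python's standard library alone. This is a minimal illustration, not an HPC-grade implementation: the helper names `chunk_bounds`, `partial_sum_of_squares`, and `parallel_sum_of_squares` are hypothetical, and a thread pool is used only to keep the sketch portable. For CPU-bound pure-Python work, a process pool (or MPI, as later in this notebook) is the usual choice, because CPython threads do not run Python bytecode in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n, parts):
    """Split range(n) into `parts` contiguous (start, end) chunks."""
    return [(i * n // parts, (i + 1) * n // parts) for i in range(parts)]

def partial_sum_of_squares(bounds):
    """One subtask: sum of k*k for k in [start, end)."""
    start, end = bounds
    return sum(k * k for k in range(start, end))

def parallel_sum_of_squares(n, workers=4):
    """Divide the task into subtasks, run them concurrently, combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunk_bounds(n, workers)))
    return sum(partials)

total = parallel_sum_of_squares(100_000, workers=4)
# Compare against the same computation done as a single task.
print(total == sum(k * k for k in range(100_000)))
```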

# Introduction to MPI (Message Passing Interface)
MPI is a standardized and portable message-passing system designed to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several well-tested and efficient implementations of MPI, including some that are free or in the public domain.

MPI programs are parallel applications that typically follow an SPMD (single program, multiple data) model. This means that the same program is executed by every process, and each process uses its rank to decide which part of the data or work it handles, so the parts are processed simultaneously.

![MPI](https://www.hpcwire.com/wp-content/uploads/2017/05/MPIlogo2.gif)

In the image above, each processor is running the same program but is working on a different part of the data. The processors communicate with each other by sending and receiving messages, which are coordinated by the MPI library.
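
The SPMD idea can be previewed without an MPI installation. In the sketch below, an ordinary loop stands in for launching several processes, and a plain `sum` stands in for a reduction to a root process. The function name `spmd_program` and the sample data are hypothetical; no real message passing takes place.

```python
def spmd_program(rank, size, data):
    """The single program every rank runs; the rank selects the slice it owns."""
    start = rank * len(data) // size
    end = (rank + 1) * len(data) // size
    return sum(data[start:end])  # local result for this rank

data = list(range(1, 11))  # the values 1..10
size = 4                   # pretend there are 4 processes

# Stand-in for launching `size` processes: call the same function once per rank.
local_results = [spmd_program(rank, size, data) for rank in range(size)]

# Stand-in for a sum reduction to a root process.
global_result = sum(local_results)
print(local_results, global_result)
```

Every rank executes identical code; only the value of `rank` differs, which is what determines the slice each one processes.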

## Example 1: MPI (Message Passing Interface) in Python
This example uses `mpi4py`, the Python bindings for MPI, so that processes can cooperate by sending and receiving messages.

### Calculating π (pi) using MPI and the Monte Carlo Method
We'll demonstrate a simple MPI program that calculates the value of π (pi) using the Monte Carlo method. This method utilizes random sampling to obtain numerical results for mathematical problems.
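
The factor of 4 in the estimate can be justified as follows, assuming the points are drawn uniformly from the unit square: the fraction that lands inside the quarter disk of radius 1 approaches the ratio of the two areas. The symbols $N_{\text{inside}}$ and $N_{\text{total}}$ are notation introduced here for the hit count and the sample count.

$$
\frac{N_{\text{inside}}}{N_{\text{total}}} \approx \frac{\text{area of quarter disk}}{\text{area of unit square}} = \frac{\pi \cdot 1^2 / 4}{1^2} = \frac{\pi}{4}
\qquad\Longrightarrow\qquad
\pi \approx 4 \, \frac{N_{\text{inside}}}{N_{\text{total}}}
$$

The statistical error of this estimate shrinks roughly in proportion to $1/\sqrt{N_{\text{total}}}$, so each additional digit of accuracy needs about 100 times more samples. That cost is what makes spreading the sampling over many processes attractive.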

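Note: To run the following code, you need an MPI environment and the `mpi4py` package, and the code must be launched with the `mpiexec` command, for example `mpiexec -n 4 python pi_mpi.py` after saving the cell to a file (the file name `pi_mpi.py` is only an illustration).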

In [None]:
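from mpi4py import MPI
import random

def calculate_pi(rank, size):
    # Seed with the rank so each process draws a different random sequence.
    random.seed(rank)
    inside_circle = 0
    total_points = 1000000
    # Each process handles an equal share of the samples
    # (any remainder from the integer division is dropped).
    points_per_process = total_points // size

    for _ in range(points_per_process):
        # Sample a point uniformly from the unit square.
        x, y = random.random(), random.random()
        # Count it if it falls inside the quarter circle of radius 1.
        if x**2 + y**2 <= 1:
            inside_circle += 1

    # Local estimate of pi from this process's samples.
    return 4 * inside_circle / points_per_process

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # ID of this process
size = comm.Get_size()  # total number of processes

local_pi = calculate_pi(rank, size)
# Sum the local estimates on rank 0 (the other ranks receive None).
global_pi = comm.reduce(local_pi, op=MPI.SUM, root=0)

if rank == 0:
    # Every process used the same number of samples, so the mean of the
    # local estimates equals the estimate over all samples.
    print('Estimated value of π:', global_pi / size)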

# Introduction to CUDA (Compute Unified Device Architecture)
CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.

![CUDA](https://assets.nvidiagrid.net/ngc/logos/Cuda.png)


In a CUDA program, the CPU (referred to as the host) and the GPU (referred to as the device) work together. The sequential part of the application still runs on the CPU, and the computationally-intensive part is offloaded to the GPU.



## Example 2: Using CUDA in Python
This example uses Numba's `cuda.jit` decorator to write a CUDA kernel directly in Python and run it on an NVIDIA GPU.

### Matrix Multiplication using CUDA
In this example, we'll demonstrate how to use CUDA in Python to perform matrix multiplication. This operation is a common task in scientific computing and can be parallelized efficiently on GPUs.

Note: To run the following code, you'll need to have a compatible NVIDIA GPU and the required CUDA libraries installed.

In [None]:
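import numpy as np
from numba import cuda

@cuda.jit
def matrix_mul_cuda(A, B, C):
    # Each thread computes one element C[row, col].
    row, col = cuda.grid(2)
    # Guard against threads that fall outside the output matrix.
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

# Host (CPU) inputs: A is 24x12 and B is 12x22, so C is 24x22.
A = np.random.rand(24, 12)
B = np.random.rand(12, 22)
C = np.zeros((24, 22))

# Copy the inputs to the device (GPU) and allocate the output there.
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.device_array(C.shape, np.float64)

# Launch enough 16x16 thread blocks to cover every element of C.
threadsperblock = (16, 16)
blockspergrid_x = int(np.ceil(A.shape[0] / threadsperblock[0]))
blockspergrid_y = int(np.ceil(B.shape[1] / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)

matrix_mul_cuda[blockspergrid, threadsperblock](d_A, d_B, d_C)
# Copy the result back to the host.
C = d_C.copy_to_host()

print('C =', C)
# Sanity check against NumPy's CPU matrix product.
print('Matches NumPy:', np.allclose(C, A @ B))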
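
The launch configuration in the cell above rounds the grid size up, so more threads are launched than there are output elements. The following plain-Python sketch (no GPU required) reproduces that arithmetic for the same 24x22 output and 16x16 blocks, which shows why the kernel needs its bounds check. The helper name `launch_config` is hypothetical.

```python
import math

def launch_config(n_rows, n_cols, threads_per_block=(16, 16)):
    """Blocks per grid needed to cover an n_rows x n_cols output."""
    blocks_x = math.ceil(n_rows / threads_per_block[0])
    blocks_y = math.ceil(n_cols / threads_per_block[1])
    return blocks_x, blocks_y

blocks = launch_config(24, 22)
launched = (blocks[0] * 16) * (blocks[1] * 16)  # total threads launched
active = 24 * 22                                # threads that pass the bounds check
print(blocks, launched, active)
```

The threads beyond `active` have no element to compute, so the `if row < C.shape[0] and col < C.shape[1]` guard keeps them from writing out of bounds.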

## Example 3: Combining MPI and CUDA for Parallel Computing
In this example, we'll demonstrate how to combine MPI and CUDA to perform parallel computing across multiple GPUs. We'll use MPI to distribute the work among different processes, and within each process, we'll use CUDA to perform computations on the GPU.

### Vector Addition using MPI and CUDA
We'll create a simple program that performs vector addition across multiple GPUs using both MPI and CUDA.

Note: To run the following code, you'll need to have a multi-GPU setup and the required MPI and CUDA libraries installed. You also need to choose the exclusive GPU configuration when starting JupyterHub on Perlmutter.

### How MPI and CUDA Divide the Work
MPI distributes the data among processes, and each process uses CUDA to run its share of the computation on a GPU.

This approach allows us to leverage the full potential of multi-GPU setups, where each GPU can be working on a different part of the data simultaneously. This can lead to significant reductions in execution time for large-scale computational problems.

![MPI and CUDA](https://www.olcf.ornl.gov/wp-content/uploads/2011/08/CUDA_Aware_MPI_1.png)

In the image above, each MPI process is associated with a different GPU. Each process performs computations on its GPU, and the processes communicate with each other via MPI.
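
One common convention for associating each process with a GPU is a round-robin mapping from rank to local device index. The sketch below shows only that mapping in plain Python. The function name `gpu_for_rank` and the layout of 8 ranks on 4 GPUs are hypothetical, and real launchers may instead assign GPUs for you, for example by setting `CUDA_VISIBLE_DEVICES` per process.

```python
def gpu_for_rank(rank, gpus_per_node):
    """Round-robin mapping from an MPI rank to a local GPU index."""
    return rank % gpus_per_node

# Hypothetical layout: 8 ranks sharing a node with 4 GPUs.
assignment = {rank: gpu_for_rank(rank, 4) for rank in range(8)}
print(assignment)
```

With more ranks than GPUs, several ranks share a device, which still works but means those ranks compete for the same GPU.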

In [None]:
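from mpi4py import MPI
import numpy as np
from numba import cuda

@cuda.jit
def vector_add_cuda(A, B, C):
    # Each thread adds one pair of elements.
    i = cuda.grid(1)
    if i < C.size:
        C[i] = A[i] + B[i]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Bind each rank to a GPU (round-robin). This assumes a single node where
# every rank can see all GPUs; adjust if your launcher already assigns GPUs.
cuda.select_device(rank % len(cuda.gpus))

N = 1000000
# Use the same seed on every rank so all processes hold identical A and B.
np.random.seed(0)
A = np.random.rand(N).astype(np.float32)
B = np.random.rand(N).astype(np.float32)
C = np.zeros_like(A)

# Contiguous slice of the vectors owned by this rank.
start = rank * N // size
end = (rank + 1) * N // size

d_A = cuda.to_device(A[start:end])
d_B = cuda.to_device(B[start:end])
d_C = cuda.device_array_like(d_A)

threadsperblock = 1024
# Ceiling division: enough blocks to cover the local slice.
blockspergrid = (d_A.size + (threadsperblock - 1)) // threadsperblock

vector_add_cuda[blockspergrid, threadsperblock](d_A, d_B, d_C)
C[start:end] = d_C.copy_to_host()

# C is zero outside this rank's slice, so an element-wise sum across
# ranks assembles the full result on every process.
comm.Allreduce(MPI.IN_PLACE, C, op=MPI.SUM)

if rank == 0:
    print('C =', C)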
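
The final `Allreduce` step works because each rank's copy of `C` is zero everywhere except its own slice. The plain-Python sketch below simulates that on a tiny vector without MPI or a GPU. The function name `simulate_rank` and the sample values are hypothetical, and an element-wise `sum` stands in for `Allreduce` with `MPI.SUM`.

```python
def simulate_rank(rank, size, a, b):
    """Return this rank's full-length result: its slice filled in, zeros elsewhere."""
    n = len(a)
    start, end = rank * n // size, (rank + 1) * n // size
    c = [0.0] * n
    for i in range(start, end):
        c[i] = a[i] + b[i]
    return c

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
size = 2  # pretend there are 2 processes

per_rank = [simulate_rank(r, size, a, b) for r in range(size)]
# Element-wise sum across ranks, standing in for Allreduce with MPI.SUM.
combined = [sum(vals) for vals in zip(*per_rank)]
print(per_rank)
print(combined)
```

Summing zero-padded arrays is simple but sends the full vector from every rank. A gather-style collective such as `Allgatherv` is an alternative that transfers only each rank's slice.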