

# Ultrasound Signal Processing with GPUs – Introduction to Parallel Programming

**NVIDIA CUDA** 



### **License / Attribution**



- Materials for the short course "Ultrasound Signal Processing with GPUs – Introduction to Parallel Programming" are licensed by us4us Ltd. and IPPT PAN under the <u>Creative Commons Attribution-NonCommercial 4.0 International License</u>.
- Some slides and examples are borrowed from the course "The GPU Teaching Kit", which is licensed by NVIDIA and the University of Illinois under the <u>Creative Commons Attribution-NonCommercial 4.0 International License</u>.
  - All the borrowed slides are marked with the NVIDIA and ILLINOIS logos.






# **GPU Architecture**



# Flynn's Taxonomy of Parallelism

- Two dimensions:
  - Number of <u>instruction streams</u>: single vs. multiple
  - Number of <u>data streams</u>: single vs. multiple
- SISD: single instruction, single data
  - Pipelining and ILP (instruction-level parallelism) on a uniprocessor
- SIMD: single instruction, multiple data (aka vector processor)
  - DLP (data-level parallelism) on a vector processor
- MIMD: multiple instruction, multiple data
  - DLP, TLP (thread-level parallelism) on a parallel processor
  - SPMD: single program, multiple data


# **Architecture CPU vs. GPU**



Source: https://docs.nvidia.com/cuda/

8/20/2021 5

# **NVIDIA Ampere Architecture**



Source: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

# **NVIDIA Streaming Multiprocessor (SM)**





Source: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

# Just a few numbers ...

| Data Center GPU       | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA A100   |
|-----------------------|-------------------|-------------------|---------------|
| GPU Codename          | GP100             | GV100             | GA100         |
| GPU Architecture      | NVIDIA Pascal     | NVIDIA Volta      | NVIDIA Ampere |
| GPU Board Form Factor | SXM               | SXM2              | SXM4          |
| SMs                   | 56                | 80                | 108           |
| TPCs                  | 28                | 40                | 54            |
| FP32 Cores / SM       | 64                | 64                | 64            |
| FP32 Cores / GPU      | 3584              | 5120              | 6912          |
| FP64 Cores / SM       | 32                | 32                | 32            |
| FP64 Cores / GPU      | 1792              | 2560              | 3456          |
| INT32 Cores / SM      | NA                | 64                | 64            |
| INT32 Cores / GPU     | NA                | 5120              | 6912          |
| Tensor Cores / SM     | NA                | 8                 | 4             |
| Tensor Cores / GPU    | NA                | 640               | 432           |

| Data Center GPU                    | NVIDIA Tesla P100 | NVIDIA Tesla V100        | NVIDIA A100               |
|------------------------------------|-------------------|--------------------------|---------------------------|
| GPU Codename                       | GP100             | GV100                    | GA100                     |
| GPU Architecture                   | NVIDIA Pascal     | NVIDIA Volta             | NVIDIA Ampere             |
| Compute Capability                 | 6.0               | 7.0                      | 8.0                       |
| Threads / Warp                     | 32                | 32                       | 32                        |
| Max Warps / SM                     | 64                | 64                       | 64                        |
| Max Threads / SM                   | 2048              | 2048                     | 2048                      |
| Max Thread Blocks / SM             | 32                | 32                       | 32                        |
| Max 32-bit Registers / SM          | 65536             | 65536                    | 65536                     |
| Max Registers / Block              | 65536             | 65536                    | 65536                     |
| Max Registers / Thread             | 255               | 255                      | 255                       |
| Max Thread Block Size              | 1024              | 1024                     | 1024                      |
| FP32 Cores / SM                    | 64                | 64                       | 64                        |
| Ratio of SM Registers to FP32 Cores | 1024             | 1024                     | 1024                      |
| Shared Memory Size / SM            | 64 KB             | Configurable up to 96 KB | Configurable up to 164 KB |


Source: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
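The per-GPU rows in the tables above follow from the per-SM rows by simple multiplication. The sketch below redoes that arithmetic in plain Python. The dictionary layout and all variable names are illustrative assumptions; the numbers themselves are copied from the tables.

```python
# Per-SM figures copied from the tables above; names are illustrative only.
gpus = {
    "P100": {"sms": 56, "fp32_per_sm": 64, "fp64_per_sm": 32},
    "V100": {"sms": 80, "fp32_per_sm": 64, "fp64_per_sm": 32},
    "A100": {"sms": 108, "fp32_per_sm": 64, "fp64_per_sm": 32},
}

THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536
FP32_CORES_PER_SM = 64


def totals(spec):
    """Return (FP32 cores per GPU, FP64 cores per GPU) for one table column."""
    return spec["sms"] * spec["fp32_per_sm"], spec["sms"] * spec["fp64_per_sm"]


for name, spec in gpus.items():
    fp32, fp64 = totals(spec)
    print(f"{name}: {fp32} FP32 cores, {fp64} FP64 cores")

# Resident threads per SM = warps per SM * threads per warp.
max_threads_per_sm = MAX_WARPS_PER_SM * THREADS_PER_WARP

# Registers available per FP32 core on one SM.
regs_per_core = REGISTERS_PER_SM // FP32_CORES_PER_SM

# Registers left for each thread when an SM holds its maximum number of threads.
regs_per_thread_full = REGISTERS_PER_SM // max_threads_per_sm
```

With all 2048 threads resident, each thread can use 32 registers, far below the per-thread maximum of 255. A kernel that needs more registers per thread therefore leaves fewer threads resident on the SM.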

# Generation to generation ...

### More on Scalability

- Performance growth with HW generations
  - Increasing number of compute units (cores)
  - Increasing number of threads
  - Increasing vector length
  - Increasing pipeline depth
  - Increasing DRAM burst size
  - Increasing number of DRAM channels
  - Increasing data movement latency

NVIDIA / ILLINOIS

# **CUDA Architecture**



# What is CUDA® (Compute Unified Device Architecture)?

- NVIDIA CUDA is a general-purpose parallel computing platform and programming model.
- Introduced by NVIDIA® in 2006 to leverage the parallel compute engine in NVIDIA GPUs.
- CUDA is designed to support various languages and application programming interfaces.
- Toolchains for other hardware platforms now offer ways to run or migrate CUDA code (e.g. FPGAs, Intel<sup>®</sup> oneAPI).



Source: https://docs.nvidia.com/cuda/

### A Thread as a Von-Neumann Processor

A thread is a "virtualized" or "abstracted" Von-Neumann Processor



# **CUDA – Grid of Cooperative Thread Arrays**



Source: https://docs.nvidia.com/cuda/

# **CUDA / GPU Automatic Scalability**

Grid of Thread Blocks





Source: https://docs.nvidia.com/cuda/

IUS GPU short-course


# Arrays of Parallel Threads

- A CUDA kernel is executed by a grid (array) of threads
  - All threads in a grid run the same kernel code (Single Program Multiple Data)
  - Each thread has indexes that it uses to compute memory addresses and make control decisions
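To make the index arithmetic concrete, here is a minimal sketch that emulates a 1D grid in plain Python, with no GPU involved. The function names `vec_add_kernel` and `launch` are hypothetical helpers invented for this illustration. On a real device the threads run in parallel, whereas this emulation visits them one after another.

```python
def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    """Body executed by one emulated thread; every thread runs this same code."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(c):                          # control decision: padding threads do nothing
        c[i] = a[i] + b[i]                  # memory address derived from the index


def launch(grid_dim, block_dim, kernel, *args):
    """Sequentially emulate a 1D grid of grid_dim blocks with block_dim threads each."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: 3 blocks for 10 elements
launch(grid_dim, block_dim, vec_add_kernel, a, b, c)
```

The grid has 12 threads for 10 elements, so the bounds check `i < len(c)` keeps the last two threads from touching memory outside the arrays.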



NVIDIA / ILLINOIS

### blockIdx and threadIdx

- Each thread uses indices to decide what data to work on
  - blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  - threadIdx: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - ...
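As an illustration of 2D indexing, the sketch below emulates a 2D grid in plain Python and inverts a small row-major 8-bit image. The helpers `invert_kernel` and `launch_2d`, the image size, and the block shape are all assumptions chosen for this example, and the emulation is sequential rather than parallel.

```python
def invert_kernel(block_idx, thread_idx, block_dim, image, width, height):
    """One emulated thread inverts one pixel of a row-major 8-bit image."""
    col = block_idx[0] * block_dim[0] + thread_idx[0]  # x direction
    row = block_idx[1] * block_dim[1] + thread_idx[1]  # y direction
    if col < width and row < height:                   # skip threads outside the image
        image[row * width + col] = 255 - image[row * width + col]


def launch_2d(grid_dim, block_dim, kernel, *args):
    """Sequentially emulate a 2D grid of 2D blocks; indices are (x, y) tuples."""
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    kernel((bx, by), (tx, ty), block_dim, *args)


width, height = 5, 3
image = list(range(width * height))  # pixel values 0..14
block_dim = (4, 2)                   # 4 threads in x, 2 threads in y
grid_dim = (
    (width + block_dim[0] - 1) // block_dim[0],
    (height + block_dim[1] - 1) // block_dim[1],
)
launch_2d(grid_dim, block_dim, invert_kernel, image, width, height)
```

Each thread derives its own `(row, col)` pair directly from its block and thread indices, so no thread needs to convert a flat index back into image coordinates.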



NVIDIA / ILLINOIS

# Thread Blocks: Scalable Cooperation



- Divide thread array into multiple blocks
  - Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  - Threads in different blocks do not interact
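The cooperation pattern can be sketched with a block-wise sum emulated in plain Python. Each block reduces its own slice in a block-local list that stands in for shared memory, and the comments mark where a real kernel would need a barrier. The function `block_sum` and the chosen sizes are hypothetical; this is a sequential model of the idea, not GPU code.

```python
def block_sum(block_idx, block_dim, data):
    """Emulate one thread block reducing its slice of data to a single sum."""
    # Phase 1: every thread loads one element into block-local "shared memory".
    shared = [0.0] * block_dim
    for t in range(block_dim):
        i = block_idx * block_dim + t
        shared[t] = data[i] if i < len(data) else 0.0
    # (barrier: all loads must finish before any thread reads a neighbour's value)

    # Phase 2: tree reduction; the number of active threads halves at every step.
    stride = block_dim // 2
    while stride > 0:
        for t in range(stride):
            shared[t] += shared[t + stride]
        # (barrier after every step)
        stride //= 2
    return shared[0]


data = [float(i) for i in range(20)]
block_dim = 8  # this sketch assumes a power of two
grid_dim = (len(data) + block_dim - 1) // block_dim

# Blocks never read each other's shared list, so they can run in any order.
partial = [block_sum(b, block_dim, data) for b in range(grid_dim)]
total = sum(partial)  # partial results are combined outside the blocks
```

Because blocks do not interact, the per-block results are combined in a separate step, here on the host side of the emulation.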

NVIDIA / ILLINOIS

### **Thread Blocks / Grids**

- Each CUDA block is executed by one streaming multiprocessor (SM) and cannot migrate to another SM in the GPU.
- One SM can run several CUDA blocks concurrently, depending on the resources the blocks need.
- Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at the same time.

- CUDA defines built-in 3D variables for threads and blocks.
- The CUDA architecture limits the number of threads per block (1024 threads per block).
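A launch configuration has to respect the per-block thread limit and still cover the whole problem. The following plain-Python sketch checks a 3D block shape against the 1024-thread limit and computes how many blocks are needed. The helper names and the example sizes are assumptions for illustration, and only the total-thread limit is checked here.

```python
MAX_THREADS_PER_BLOCK = 1024  # limit quoted in the list above


def threads_in_block(block_dim):
    """Total threads in a 3D block; raises if the 1024-thread limit is exceeded."""
    x, y, z = block_dim
    total = x * y * z
    if total < 1 or total > MAX_THREADS_PER_BLOCK:
        raise ValueError(f"block {block_dim} has {total} threads")
    return total


def blocks_needed(n, threads_per_block):
    """Ceiling division: smallest number of blocks whose threads cover n items."""
    return (n + threads_per_block - 1) // threads_per_block


threads = threads_in_block((16, 16, 1))     # 256 threads per block
blocks = blocks_needed(1_000_000, threads)  # 3907 blocks for one million items
```

The limit applies to the product of the three block dimensions, so a block of shape `(32, 32, 1)` is allowed while `(32, 32, 2)` is rejected.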



Source: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

### **CUDA Execution Model**

- Heterogeneous host (CPU) + device (GPU) application C program
  - Serial parts in host C code
  - Parallel parts in device SPMD kernel code



### Dynamic Parallelism



Source: https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/

# Partial Overview of CUDA Memories



- Device code can:
  - R/W per-thread registers
  - R/W global memory shared by all threads
- Host code can:
  - Transfer data to/from per-grid global memory

We will cover more memory types and more sophisticated memory models later.

NVIDIA / ILLINOIS

# **Host/Device Memory**

### **CUDA Device Memory Management API functions**



- cudaMalloc()
  - Allocates an object in the device global memory
  - Two parameters
    - Address of a pointer to the allocated object
    - Size of allocated object in terms of bytes
- cudaFree()
  - Frees object from device global memory
  - One parameter
    - Pointer to freed object

### Host-Device Data Transfer API functions



- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type/Direction of transfer
  - Transfer to device is synchronous with respect to the host
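The allocate, copy, copy back, free sequence can be modelled without a GPU. The sketch below is a toy stand-in written in plain Python: `ToyDevice` and its methods are invented for this illustration and are not the CUDA API. They only mirror the parameter order described above (destination, source, byte count, direction).

```python
import struct

HOST_TO_DEVICE = "HostToDevice"
DEVICE_TO_HOST = "DeviceToHost"


class ToyDevice:
    """Toy model of device global memory; an illustration, not the CUDA API."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 1

    def malloc(self, size):
        """Reserve size bytes and return an opaque handle (plays the device pointer)."""
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(size)
        return handle

    def free(self, handle):
        """Release the buffer behind a handle."""
        del self._buffers[handle]

    def memcpy(self, dst, src, count, kind):
        """Copy count bytes; kind tells which argument is the device handle."""
        if kind == HOST_TO_DEVICE:
            self._buffers[dst][:count] = src[:count]
        elif kind == DEVICE_TO_HOST:
            dst[:count] = self._buffers[src][:count]
        else:
            raise ValueError(f"unsupported direction: {kind}")


dev = ToyDevice()
h_a = bytearray(struct.pack("4f", 1.0, 2.0, 3.0, 4.0))  # host buffer with 4 floats
size = len(h_a)                                         # size in bytes, not elements

d_a = dev.malloc(size)                        # allocate on the "device"
dev.memcpy(d_a, h_a, size, HOST_TO_DEVICE)    # host -> device
h_back = bytearray(size)
dev.memcpy(h_back, d_a, size, DEVICE_TO_HOST) # device -> host
dev.free(d_a)                                 # release device memory
```

Note that sizes are given in bytes, which is why host code typically multiplies the element count by the element size before allocating or copying.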

NVIDIA / ILLINOIS

### Vector Addition – Traditional C Code

```
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    vecAdd(h_A, h_B, h_C, N);
}
```



NVIDIA / ILLINOIS

# Vector Addition, Explicit Memory Management

```
// ... Allocate h_A, h_B, h_C ...
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device memory and copy the inputs to the device
    cudaMalloc((void **) &d_A, size);
    cudaMalloc((void **) &d_B, size);
    cudaMalloc((void **) &d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel invocation code - to be shown later

    // Copy the result back to the host and release device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
// ... Free h_A, h_B, h_C ...
```



### Heterogeneous Computing vecAdd CUDA Host Code



```
#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1
    // Allocate device memory for A, B, and C
    // Copy A and B to device memory

    // Part 2
    // Kernel launch code - the device performs the actual vector addition

    // Part 3
    // Copy C from the device memory
    // Free device vectors
}
```

NVIDIA / ILLINOIS

# **GPU Acceleration**



# 3 Ways to Accelerate Applications

**Applications**

- Libraries: easy to use, most performance
- Compiler Directives: easy to use, portable code
- Programming Languages: most performance, most flexibility

# **NVIDIA GPU Accelerated Libraries**

- Deep Learning
- Linear Algebra
- Signal, Image, Video
- Parallel Algorithms

### **Developer Tools - Profilers**





https://developer.nvidia.com/performance-analysis-tools

### **Profiling Tools**



See lecture 2-4 for an overview of all tools

NVIDIA / ILLINOIS



# Optimization



# Why Python!?

### Why is Python so Popular?

1. Easy to learn and use.
2. Mature and supportive Python community.
3. Support from big players (corporate sponsors).
4. "Batteries included": hundreds of libraries and frameworks available.
5. Versatility, efficiency, reliability, and speed.
6. Big data, machine learning, and cloud computing.
7. First-choice language (see the ranking!).
8. The flexibility of the Python language.
9. Use of Python in academia.
10. Automation.

| Rank | Language   | Score |
|------|------------|-------|
| 1    | Python     | 100.0 |
| 2    | Java       | 95.3  |
| 3    | C          | 94.6  |
| 4    | C++        | 87.0  |
| 5    | JavaScript | 79.5  |
| 6    | R          | 78.6  |
| 7    | Arduino    | 73.2  |
| 8    | Go         | 73.1  |
| 9    | Swift      | 70.5  |
| 10   | Matlab     | 68.4  |
source: https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2020

Language Ranking: IEEE Spectrum, 2020

# There should be one – and preferably only one – obvious way to do it (The Zen of Python)

- There are many solutions for GPU acceleration in Python ...
- Python GPU programming:
  - Numba, PyCUDA, PyOpenCL
- Libraries:
  - NumPy on the GPU: CuPy
  - NumPy on the GPU (again): JAX
  - Pandas on the GPU: RAPIDS cuDF
  - Scikit-Learn on the GPU: RAPIDS cuML
- Frameworks:
  - Deep learning frameworks such as PyTorch, TensorFlow, Caffe, MXNet



### **NUMBA**

- Accelerate Python Functions
- Numba translates Python functions to optimized machine code at runtime using the industry-standard <u>LLVM</u> compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
- You don't need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.



### Numba makes Python code fast

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.


```
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    # Count random points in the unit square that fall inside the quarter circle
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    # The fraction of hits approximates pi / 4
    return 4.0 * acc / nsamples
```

# Compiling A CUDA Program



# **HOW-TO run Python with CUDA?**

### **OPTIONS:**

- Install a Python interpreter and CUDA locally.
- Without installation, you can use Python in the cloud:
  - For Google Colab, start here: https://colab.research.google.com/notebooks/intro.ipynb
  - Other options in the cloud: https://developer.nvidia.com/how-to-cuda-python
- Many other options and good developer tools are available.


# And NOW ...

- Take a 10 min break ...
- Then start the 1<sup>st</sup> Exercise: "CUDA Programming Model"
