# General Purpose Computing on Graphical Processing Units

May 11, 2011

#### What are GPUs?

- A GPU (on a graphics card) offloads/accelerates graphics rendering from a CPU.
- Modern GPU functions:
  - rendering polygons
  - texture mapping
  - coordinate transformations
  - accelerated video decoding
- Manufacturers
  - NVIDIA
  - ATI

GPGPU 2/24

#### **GPGPU**

- General Purpose Computing on Graphical Processing Units
  - using a GPU for applications traditionally handled by a  $\ensuremath{\mathsf{CPU}}$
- Stream Processing
  - stream of data
  - a series of operations applied to that stream—the kernel functions
- SPMD
  - single program, multiple data
  - related to, but not the same as SIMD

GPGPU 3/24

#### **Programming GPUs**

- Approach is to make use of the GPU AND the CPU
- CUDA
  - Compute Unified Device Architecture
  - Developed and distributed by NVIDIA
- OpenCL
  - tedious and not as good performance as CUDA (according to NVIDIA)

Now lets get started...

GPGPU 4/

#### **Setting Up CUDA**

#### See http:

//www.r-tutor.com/gpu-computing/cuda-installation/cuda3.2-ubuntu

- Make sure you have a graphics card, install Ubuntu.
- Disable the nouveau nvidia driver that comes with Ubuntu. Reboot in safe graphics mode (hold down shift key)
- Install the linux developer tools and the OpenGL development driver.
- Install the CUDA development driver (after downloading from CUDA download site). Switch to console mode for this (ctrl-alt-f2).
- Download and install the CUDA toolkit. Usually in /usr/local/cuda
- Download and install the CUDA SDK samples. Usually in your personal home directory.

GPGPU 5/24

#### Check out your system

- Run the device query sample program from your CUDA SDK samples:
  - \$ cd ~/CUDASDK3.2/C/bin/linux/release/
  - \$ ./deviceQuery
- Look at the output:
  - How many graphics devices are there?
  - How many multiprocessors and cores?
  - How much global memory?

**-** . . .

GPGPU 6/24

## **Device Query Screenshot**

```
File Edit View Terminal Help
iabond@it047333:~/CUDASDK3.2/C/bin/linux/release$ ./deviceOuerv
./deviceOuerv Starting...
CUDA Device Ouery (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA
Device 0: "GeForce GTX 480"
 CUDA Driver Version:
                                                  3 2A
 CUDA Runtime Version:
                                                 3.20
 CUDA Capability Major/Minor version number:
                                                 2.0
 Total amount of global memory:
                                                 1610285056 bytes
 Multiprocessors x Cores/MP = Cores:
                                                 15 (MP) x 32 (Cores/MP) = 480 (Cores)
 Total amount of constant memory:
                                                 65536 bytes
 Total amount of shared memory per block:
                                                 49152 bytes
 Total number of registers available per block: 32768
  Warp size:
 Maximum number of threads per block:
                                                  1024
  Maximum sizes of each dimension of a block:
                                                  1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:
                                                 65535 x 65535 x 1
 Maximum memory pitch:
                                                 2147483647 bytes
 Texture alignment:
                                                 512 bytes
 Clock rate:
                                                  1.40 GHz
 Concurrent copy and execution:
                                                  Yes
 Run time limit on kernels:
                                                 Nο
  Integrated:
                                                 Nο
 Support host page-locked memory mapping:
                                                  Yes
 Compute mode:
                                                 Default (multiple host threads can use this devic
e simultaneously)
 Concurrent kernel execution:
                                                  Yes
 Device has ECC support enabled:
                                                 Nο
 Device is using TCC driver mode:
                                                 No
Device 1: "GeForce 21A"
```

#### **GPU** architecture

Physical layout varies among GPU makes and models, but follows these general ideas:

- GPU divided into blocks
- Each block contains one or more streaming multiprocessors
- Each SM has a number of streaming processors
  - $\mbox{-}$  all share the same control logic and instruction cache within an SM
- All SPs from all SMs have access to up to 4 GB GDDR DRAM global memory
  - GDDR: graphics double data rate
  - DRAM: dynamic random access memory

| GPGPU

#### **NVIDIA GeForce GTX 480**



#### **CUDA** processing flow

Need to identify those parts of the program that operate on the host (CPU) and the device (GPU).



GPGPU 10/24

#### First CUDA Program

Perform element-wise vector addition, with each vector element being handled by one thread

```
// Import the cuda headers, along with any other required C headers
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device. This is executed by ONE
// stream processor
__global__ void vec_add(float* A, float* B, float* C, int N)
{
    // What element of the array does this thread work on
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}</pre>
```

| GPGPU 11/24

```
// main routine that executes on the host
int main(void)
 int n;
  int N = 1000000;
  size t size = N * sizeof(float):
 // Allocate in HOST memory
  float* h_A = (float*)malloc(size);
  float* h_B = (float*)malloc(size);
  float* h_C = (float*)malloc(size);
  // Initialize vectors
  for (n = 0; n < N; ++n) {
    h_A[n] = 3.2333 * n;
    h_B[n] = 8.09287 * n;
```

| GPGPU 12/24

```
// Allocate in DEVICE memory (note the address of pointer argument)
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

// Copy vectors from host to device memory
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
```

| GPGPU 13/24

```
// Invoke kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vec_add<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device memory into host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
```

| GPGPU 14/24

// Free device memory

```
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

// Free host memory
free(h_A);
free(h_B);
free(h_C);

Build using cuda compiler and linker
$ nvcc -o testprog1 testprog1.cu
$ ./testprog1
```

| GPGPU 15/24

#### Run a Profile Analysis

- \$ cd /usr/local/cuda/computeprof/bin
- \$ ./computeprof



#### **Important Constructs**

■ Important Functions

```
cudaMalloc(device_address, size);
cudaMemcpy(dest, source, size, cudaMemcpyHostToDevice)
cudaMemcpy(dest, source, size, cudaMemcpyDeviceToHost)
```

■ Function modifier keywords

```
__global__: called by host, executed on device
__device__: called by and executed on device
__host__: called by and executed on host
```

■ Kernel Functions

```
Code to be run on an SP mykernel<<<ble>blocks_per_grid, threads_per_block>>>
```

GPGPU 17/24

## **Programming Hardware Abstractions**

- host (CPU) and device (GPU)
- thread
  - concurrent code executed on an SP
  - fine grain unit of parallelism
- warp
  - group of threads executed in parallel (up to a maximum number)
- block
  - group of threads executed together and form the unit of resource assignment
- grid
  - group of thread blocks that must all complete before control is returned to the host

## **Organizing threads**

- Threads in a warp can share instruction stream
- Each thread has its own registers and local memory
- Threads in a block can communicate by shared memory
- All threads in a grid can access the same global memory (but 200–600 cycle penalty)
- Need to decide how many blocks in the grid, and how many threads in each block.
- Can arrange blocks and threads in 1, 2, or 3 dimensions

| GPGPU 19/24

### **Example: matrix multiplication**

```
// Matrix multiplication kernel: C = A * B
__qlobal__ void mx_mult(float* A. float* B. float* C. int width)
  // What is the matrix element for this thread?
  int col = blockDim.x * blockIdx.x + threadIdx.x:
  int row = blockDim.y * blockIdx.y + threadIdx.y;
  float sum = \theta:
  for (int k = 0; k < width; ++k) {
    float elementA = A[row * width + k];
    float elementB = B[k * width + coll:
    sum += elementA * elementB:
  C[row * width + coll = sum:
int main(void) {
  // 2 dimensional arrangement of threads and blocks
  int blockWidth = 30:
  int gridWidth = 15;
  dim3 dimBlock(blockWidth, blockWidth):
  dim3 dimGrid(gridWidth, gridWidth);
  mx_mult<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, width);
```

### **CUDA Device Memory Types**

#### GlobalMemory

- largest memory on GPU and accessible by all threads
- slowest access time  $\sim$ 200–600 clock cycles
- lifetime: application

#### Registers

- fastest memory, used to store local variables of a single thread
- lifetime: thread

#### Local memory

- section of device memory used when variables of a thread do not fit the registers available
- lifetime: thread

#### ■ Shared memory

- fast on chip memory shared between all threads of a single block
- lifetime: block
- Texture memory
  - a cached region of global memory
  - each SM has its own texture memory cache on chip
  - lifetime: application
- Constant memory
  - a cached read-only region of deice memory on each SM
  - lifetime: application

#### **Performance**

- GPU bundles several threads together for execution into "warps"
- Thread index values within a warp are contiguous. For warp size of 32 (eg GTX480) we have threadIdx.x 0 → 31 in warp 0 threadIdx.x 32 → 63 in warp 1
- Bit more complicated for multidimensional thread organization

| GPGPU 22/24

### **Branching**

- Single Instruction Multiple Thread
  - executes instruction for all threads in the warp, before moving onto next instruction
- Divergence occurs when threads in a warp follow different control flows. Sequential passes are then needed which can affect performance
- Also be careful of conditionals based on thread ID

| GPGPU 23/24

#### **Memory Access**

- DRAM memory access patterns
  - Fast: accessing data from multiple and contiguous locations
  - Slow: truly random access
- Ideal access pattern in GPUs
  - all threads in a warp access consecutive global memory locations
- Coalescing memory access
  - hardware can combine all of these accesses into a single request
- Non-coalescing memory access affects performance

GPGPU 24/24