# Parallel Programming Language CUDA (C extension)

Jiwon Seo

# GPU Architecture GTX 1080





GPU memory
DDR5 DRAM

# GPU Architecture GTX 1080



Stream Multiprocessor (SM)

320 GB/sec (256 bit interface)

GPU memory
DDR5 DRAM

#### GPU Architecture GTX 1080



#### Running a Thread Block on SM



- Groups of 32 threads share instruction stream → Warp
- SM can have up to 64 runnable warps at a time
- SM at each clock:
  - Select up to 4 runnable warps from 64 runnable ones
  - Select up to 2 (runnable) instructions per warp

64개 워프?? 워프에 대한 context가 64개?

## Scheduling Thread Blocks



# Independency of Thread Blocks

- Any possible interleaving of blocks should be valid
  - presumed to run to completion without pre-emption
  - can run in any order
  - can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - shared queue pointer: OK
  - shared lock: BAD ... can easily deadlock
- Independence requirement gives scalability

#### More on Warp

- A warp is a group of 32 threads
  - Warp is CUDA implementation detail on NVIDIA GPUs
  - Not a PL (programming language) abstraction
- A warp (32 thread group) shares an instruction stream
- In a thread block, threads 0-31 are in a same warp (so do therads 32-63, etc)
- SM can schedule and interleave execution of up to 64 warps (in GTX 1080)
- SM can run multiple thread block concurrently

#### Warps

# Multi-core CPU Control Control Control Control Control Control ALU ALU ALU ALU ALU ALU ALU



- A warp = 32 threads launched together
  - Usually, execute together as well

# Thread Scheduling Example

- SM implements zero-overhead warp scheduling
  - At any time, a subset of the warps is executed by SM \*
  - Warps whose next instruction has its inputs ready are eligible for execution
  - Eligible Warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction



#### More on GTX 1080



1

320 GB/sec

GPU memory (DDR5 DRAM)

1.6 GHz clock

20 SM cores per chip

20 x 128 = 2,560 SIMD mul-add ALUs = 8.1 TFLOPs

Up to 20 x 64 = 1280 interleaved warps per chip (40,960 CUDA threads/chip)

**TDP: 180 watts** (thermal design power)

Kernel execution requirements:

Each thread block has 128 threads

Each thread block requires 520 bytes of shared memory

Assume Kernel runs with large number of thread blocks



Kernel execution requirements:

Each thread block has 128 threads Each thread block requires 520 bytes of shared memory

Step 1: host sends CUDA device (GPU) a command ("execute this kernel")



**EXECUTE:** kernel

ARGS: N, input\_array, output\_array

NUM BLOCKS: 1000

**GPU Work Scheduler** 





Kernel execution requirements:

Each thread block has 128 threads

Each thread block requires 520 bytes of shared memory

Step 2: scheduler maps block 0 to core 0 (reserves execution contexts for 128 threads

and 520 bytes of shared storage)

EXECUTE: kernel

ARGS: N, input\_array, output\_array

NUM\_BLOCKS: 1000

NEXT = 1 GPU Work Scheduler

TOTAL = 1000





Kernel execution requirements:

Each thread block has 128 threads
Each thread block requires 520 bytes of shared memory

Step 3: scheduler continues to map blocks to available execution contexts (interleaved mapping shown)

TOTAL = 1000

EXECUTE: kernel
ARGS: N, input\_array, output\_array
NUM\_BLOCKS: 1000

NEXT = 2 GPU Work Scheduler





Kernel execution requirements:

Each thread block has 128 threads Each thread block requires 520 bytes of shared memory

Step 3: scheduler continues to map blocks to available execution contexts

(interleaved mapping shown)

EXECUTE: kernel

ARGS: N, input\_array, output\_array

NUM\_BLOCKS: 1000

NEXT = 3 GPU Work Scheduler

TOTAL = 1000





Kernel execution requirements:

Each thread block has 128 threads

Each thread block requires 520 bytes of shared memory

Step 3: scheduler continues to map blocks to available execution contexts (interleaved mapping shown).

Only two thread blocks fit on a core

(third block won't fit due to insufficient shared storage 3 x 520 bytes > 1.5 KB)

EXECUTE: kernel
ARGS: N, input\_array, output\_array
NUM\_BLOCKS: 1000

NEXT = 4 GPU Work Scheduler
TOTAL = 1000





Kernel execution requirements:

Each thread block has 128 threads Each thread block requires 520 bytes of shared memory

#### Step 4: thread block 0 completes on core 0







Kernel execution requirements:

Each thread block has 128 threads Each thread block requires 520 bytes of shared memory

Step 5: block 4 is scheduled on core 0 (mapped to execution contexts 0-127)







Kernel execution requirements:

Each thread block has 128 threads
Each thread block requires 520 bytes of shared memory

#### Step 6: thread block 2 completes on core 0



**EXECUTE:** kernel

ARGS: N, input\_array, output\_array

NUM\_BLOCKS: 1000

NEXT = 5 GPU Work Scheduler

TOTAL = 1000





Kernel execution requirements:

Each thread block has 128 threads

Each thread block requires 520 bytes of shared memory

Step 7: thread block 5 is scheduled on core 0 (mapped to execution contexts 128-255)



**EXECUTE:** kernel

ARGS: N, input\_array, output\_array

NUM BLOCKS: 1000

NEXT = 6 GPU Work Scheduler

TOTAL = 1000





#### Grid, Blocks, Threads

#### GPU abstraction: Grid, Block, Threads

- Threads:
  - 3D id, unique within a block
- Blocks:
  - 2D id, unique within a grid
- Dimensions set at launch
  - Can be unique for each grid
- Built-in variables:
  - threadIdx, blockIdx
  - blockDim, gridDim



#### CUDA Grid



#### Grid, Blocks, Threads

Our Array Additon Kernel: 1D block, 1D threads

```
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}</pre>
```

#### Grid, Blocks, Threads

Our Array Additon Kernel: 1D block, 1D threads



## Some Example Kernels

```
Assume blockDim.x = 4
global__ void kernel( int *a )
                                             Output:
int idx = blockldx.x*blockDim.x + threadldx.x;
a[idx] = 7;
global__ void kernel( int *a )
                                             Output:
int idx = blockldx.x*blockDim.x + threadldx.x;
a[idx] = blockldx.x;
global__ void kernel( int *a )
                                             Output:
int idx = blockldx.x*blockDim.x + threadldx.x;
a[idx] = threadIdx.x;
```

## Some Example Kernels

```
Assume blockDim.x = 4
global void kernel( int *a )
                                           Output: 7777777777777777
int idx = blockldx.x*blockDim.x + threadldx.x;
a[idx] = 7;
global__ void kernel( int *a )
                                           Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
int idx = blockldx.x*blockDim.x + threadldx.x;
a[idx] = blockldx.x;
global__ void kernel( int *a )
                                            Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
int idx = blockldx.x*blockDim.x + threadldx.x;
a[idx] = threadIdx.x;
```

# More Example – Shuffling Data

```
// Reorder values based on indices
// Each thread moves one element
 global void shuffle(int* prev array, int* new array, int* indices)
  int i = threadldx.x + blockDim.x * blockldx.x;
  if (i < N)
    new_array[i] = prev_array[indices[i]];
                                                                     Host Code
}
int main()
  // Run grid of (N+255)/256 blocks of 256 threads each
  shuffle<<< (N+255)/256, 256>>>(d old, d new, d ind);
```

## More Example Kernel (2D)

#### More Example Kernel (2D) cont'd

```
__global__ void kernel( int *a, int dimx, int dimy )
{
  int ix = blockldx.x*blockDim.x + threadldx.x;
  int iy = blockldx.y*blockDim.y + threadldx.y;
  int idx = ______

  a[idx] = a[idx]+1;
}
```

```
int main() {
  int dimx = 16, dimy = 16;
  int num bytes = dimx*dimy*sizeof(int);
  int *d a=0, *h a=0; // device and host pointers
  h a = (int*)malloc(num bytes);
  cudaMalloc( (void**)&d a, num bytes );
  if(0==h a || 0==d a) {
    printf("couldn't allocate memory\n");
    return 1;
  cudaMemset( d_a, 0, num_bytes );
  dim3 grid, block;
  block.x = 4; block.y = 4;
  grid.x = dimx / block.x;
  grid.y = dimy / block.y;
  kernel << grid, block >>> ( d a, dimx, dimy );
 cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
  free( h_a );
  cudaFree( d a );
  return 0;
```

# More Example Kernel (2D) cont'd

#### CUDA Grid



# More Example Kernel (2D) cont'd

```
__global__ void kernel( int *a, int dimx, int dimy )
{
   int ix = blockldx.x*blockDim.x + threadldx.x;
   int iy = blockldx.y*blockDim.y + threadldx.y;
   int idx = iy*dimx + ix;

a[idx] = a[idx]+1;
}
```

#### CUDA Grid



# Communication Among Threads

- How do you do global communication?
- Finish a grid and start a new one

#### Global Communication

- Finish a kernel and start a new one
- All writes from all threads complete before a kernel finishes

```
step1<<<grid1,blk1>>>(...);
// The system ensures that all
// writes from step1 complete.
step2<<<grid2,blk2>>>(...);
```

- Or, write to a predefined memory location
  - Race condition! Updates can be lost

- What is the value of a in thread 0?
- What is the value of a in thread 1917?

- Thread 0 could have finished execution before 1917 started
- Or the other way around
- Or both are executing at the same time

 Answer: not defined by the programming model, can be arbitrary

#### **Atomics**

 CUDA provides atomic operations to deal with this problem

#### **Atomics**

- An atomic operation guarantees that only a single thread access a memory address
- No race condition, but ordering is still arbitrary
- Different types of atomic instructions
- atomic{Add, Sub, Exch, Min, Max,
  Inc, Dec, CAS, And, Or, Xor}
- More types in Fermi

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

#### Example: Histogram

```
// Determine frequency of colors in a picture
// Each thread looks at one pixel and increments
// a counter atomically
 global void histogram(int* color,
                          int* buckets) {
  int i = threadIdx.x
        + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
```

#### Example: Workqueue

```
// each thread gets a task from a shared queue
global
void workq(int* work q, int* q counter,
           int* output, int queue max) {
  int i = threadIdx.x
        + blockDim.x * blockIdx.x;
  int q index =
    atomicInc(q counter, queue max);
  int result = do work(work q[q index]);
  output[i] = result;
```

#### **Atomics**

- Atomics are slower than normal load/store
- You can have the whole machine queuing on a single location in memory
- Atomics unavailable on (very) old GPUs (G80)!

#### Example: Global Min/Max (Naive)

```
// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow
global
void global max(int* values, int* gl max) {
  int i = threadIdx.x
        + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl max,val);
```

# Example: Global Min/Max (Better)

```
// uses intermediate/local maximum per thread
global
void global max(int* values, int* max,
                 int *regional maxes,
                 int num regions) {
  // i and val as before ...
  int region = i % num regions;
  if(atomicMax(&reg max[region],val) < val)</pre>
    atomicMax(max,val);
```

#### Global Min/Max

- Single value causes serial bottleneck
- Create hierarchy of values for more parallelism
- Performance will still be slow, so use judiciously
- Even better version in the future!

# Summary on Atomic Operations

 Can't use normal load/store for inter-thread communication because of race conditions

- Use atomic instructions for sparse and/or unpredictable global communication
  - See next lectures for shared memory and scan for other communication patterns
- Decompose data (very limited use of single global sum/max/min/etc.) for more parallelism