Introduction to Parallel Programming - Part I
Course from Udacity by NVIDIA; GitHub repo here.
1 The GPU Programming Model
Typical GPU program:
- cudaMalloc: allocate memory on the device
- cudaMemcpy: copy input data from the host to the device
- launch kernel(s) to compute on the device
- cudaMemcpy: copy results from the device back to the host
💡 Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on many threads.
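A minimal sketch of this flow (the `square` kernel, the sizes, and the launch configuration are illustrative assumptions, not from the course):

```cuda
#include <cuda_runtime.h>

// Kernel: written as if it runs on a single thread.
__global__ void square(float *d_out, const float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_in[i] * d_in[i];
}

int main()
{
    const int N = 1024;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));                               // 1. allocate device memory
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);   // 2. copy input to the device

    square<<<(N + 255) / 256, 256>>>(d_out, d_in, N);                    // 3. launch the kernel on many threads

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost); // 4. copy results back to the host
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```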
GPU is good for efficiently launching a large number of threads and running them in parallel.
The GPU efficiently runs many blocks, and each block has a maximum # of threads. (Images from Wikipedia thread-block.)
2 GPU Hardware & Parallel Communication Model
Abstraction & Communication Patterns (the first two are sketched in code right after this list)
- Map (1-to-1): tasks read from and write to specific data elements (memory); map(element, function)
- Gather (n-to-1): tasks compute where to read data
- Scatter (1-to-n): tasks compute where to write data
- Stencil (n-to-1): tasks read from a fixed neighborhood in an array (data re-use)
- Transpose (1-to-1): tasks re-order data elements in memory
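Minimal kernel sketches of the first two patterns (kernel names and the averaging example are illustrative assumptions):

```cuda
// Map (1-to-1): each thread transforms the element it "owns".
__global__ void map_square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// Gather (n-to-1): each thread computes where to read from.
// Reading a fixed neighborhood like this is also the stencil pattern.
__global__ void gather_average3(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
```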
Hardware & Model
- Thread blocks <-> SMs (streaming multiprocessors): CUDA leaves the scheduling of blocks onto SMs to the hardware -> efficiency; scalability
- No communication between blocks
- CUDA guarantees: all threads in a block run on the same SM at the same time, and all blocks in a kernel finish before any block from the next kernel runs
Synchronization
- __syncthreads(): wait till all threads in the block arrive, then proceed
Strategies
- Maximize arithmetic intensity: math/memory (local > shared >> global >> host; coalesced >> strided)
- Avoid thread divergence (branches & loops)
💡 How to use shared memory (static & dynamic)
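A minimal sketch of both ways to obtain shared memory, which also shows the `__syncthreads()` barrier from the Synchronization section (kernel names and the block-reversal example are illustrative assumptions):

```cuda
// Static shared memory: size fixed at compile time (this kernel assumes blockDim.x == 128).
__global__ void reverse_block_static(const float *in, float *out)
{
    __shared__ float buf[128];
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[tid] = in[base + tid];
    __syncthreads();                               // all writes to buf must finish before any read
    out[base + tid] = buf[blockDim.x - 1 - tid];   // reverse the elements within the block
}

// Dynamic shared memory: size supplied as the third launch parameter.
__global__ void reverse_block_dynamic(const float *in, float *out)
{
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[tid] = in[base + tid];
    __syncthreads();
    out[base + tid] = buf[blockDim.x - 1 - tid];
}

// launch (assumed sizes): reverse_block_dynamic<<<blocks, 128, 128 * sizeof(float)>>>(d_in, d_out);
```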
UPDATE: Cooperative Groups in CUDA 9 features
- define groups of threads explicitly at sub-block and multiblock granularities
- enable new patterns such as producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire grid
- provide an abstraction for scalable code across different GPU architectures
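The accompanying kernel sketch from these notes, reformatted; the `...` parameter list is left as in the original, and the include/namespace lines are added here as the usual CUDA 9 boilerplate:

```cuda
#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_kernel(...)
{
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        // ...
        my_tile.sync();
    }
}
```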
3 & 4 Fundamental GPU Algorithms
Step complexity vs. work complexity (total amount of operations)
Reduce
- Input:
  - set of elements
  - reduction operator
    - binary
    - associative: (a op b) op c = a op (b op c)
- Serial reduce:

      a b
      \ /
       + c
       \ /
        + d
        \ /
         +

- Parallel (tree) reduce (sketched in code below):

      a b c d
      \ / \ /
       +   +
        \ /
         +

- log n step complexity - what if we only have p processors but n > p inputs? Brent's theorem
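A minimal sketch of the tree reduce (kernel name and float data are assumptions; the block size is assumed to be a power of two). Each block reduces its chunk in shared memory and writes one partial sum, which a second launch or the host then reduces:

```cuda
// Each block loads blockDim.x elements into shared memory and tree-reduces them.
// Pads the last block with the identity (0) when n is not a multiple of the block size.
__global__ void block_reduce_sum(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // halve the number of active threads each step: log2(blockDim.x) steps
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];            // one partial sum per block
}

// launch (assumed): block_reduce_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
```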
Scan
- Input:
  - set of elements
  - binary associative operator
  - identity element I s.t. I op a = a
- Two types of scan:
  - Exclusive scan - each output is the reduction of all elements before, but not including, the current element
  - Inclusive scan - each output is the reduction of all elements before and including the current element

| Algorithm | Description | Work | Step | Notes |
| --- | --- | --- | --- | --- |
| Serial scan | - | O(n) | n | |
| Hillis & Steele scan | Starting with step 0, on step i, op yourself to your 2^i left neighbor (if no such neighbor, copy yourself) | O(n*logn) | logn | sketched in code below |
| Blelloch scan | reduce -> downsweep (paper, wiki) | O(n) | 2*logn | |
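A minimal single-block sketch of the Hillis & Steele inclusive scan from the table (kernel name, double-buffered shared memory, and the n <= blockDim.x restriction are assumptions):

```cuda
// Inclusive Hillis & Steele scan of n elements within one block.
__global__ void hillis_steele_scan(const float *in, float *out, int n)
{
    extern __shared__ float temp[];              // double buffer: 2 * n floats
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    if (tid < n)
        temp[pout * n + tid] = in[tid];          // inclusive: start from the input itself
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                         // swap which buffer is read and which is written
        pin  = 1 - pout;
        if (tid < n) {
            if (tid >= offset)
                temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
            else
                temp[pout * n + tid] = temp[pin * n + tid];   // no left neighbor: copy yourself
        }
        __syncthreads();
    }
    if (tid < n)
        out[tid] = temp[pout * n + tid];
}

// launch (assumed): hillis_steele_scan<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n);
```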
Sparse matrix/dense vector multiplication (SpMv) & Segmented scan
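The notes only name this topic; as a reference point, a per-row SpMV kernel for a CSR-format matrix might look like the sketch below (the CSR layout and all names are assumptions):

```cuda
// One thread per row; CSR arrays: row_ptr (num_rows + 1 entries), col_idx and val (one entry per nonzero).
__global__ void spmv_csr(const int *row_ptr, const int *col_idx, const float *val,
                         const float *x, float *y, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];       // gather the dense-vector entries this row needs
        y[row] = sum;
    }
}
```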
Histogram
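Likewise only named here; a minimal global-memory histogram sketch using atomicAdd (the binning function and all names are assumptions):

```cuda
// bins must be zero-initialized (e.g. with cudaMemset) before the launch.
__global__ void histogram(const unsigned int *in, unsigned int *bins, int n, int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int bin = in[i] % num_bins;     // trivial binning function
        atomicAdd(&bins[bin], 1u);               // atomics serialize concurrent updates to the same bin
    }
}
```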
Compact
- Compact is most useful when we compact away a large number of elements and the computation on each surviving element is expensive.
- Steps:
  - compute a predicate array A (0 for false and 1 for true)
  - exclusive-sum-scan A -> scatter addresses (dense)
  - scatter the surviving elements to their addresses (sketched in code below)
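A minimal sketch of the final scatter step (names are assumptions); the predicate array and its exclusive scan are assumed to have been computed already, e.g. with a scan kernel like the one above:

```cuda
// predicate[i] is 1 if element i survives; scan[i] is the exclusive sum of predicate,
// i.e. the number of surviving elements before i = the output address of element i.
__global__ void compact_scatter(const float *in, const int *predicate, const int *scan,
                                float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && predicate[i])
        out[scan[i]] = in[i];
}
```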
Allocate
- Using scan (good strategy)
Sort
- Odd-even transposition (brick) sort: O(n) steps & O(n^2) work (sketched in code below)
- Radix sort: O(kn) work, where k is the # of bits of the representation (quite brute-force)
- Oblivious - behavior is independent of some aspects of the problem
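A minimal single-block sketch of odd-even transposition (brick) sort (kernel name and the single-block restriction are assumptions):

```cuda
// Sorts n elements with one block; launch with at least n/2 threads, e.g. <<<1, (n + 1) / 2>>>.
// In even phases threads compare pairs (0,1), (2,3), ...; in odd phases pairs (1,2), (3,4), ...
__global__ void brick_sort(int *data, int n)
{
    int tid = threadIdx.x;
    for (int phase = 0; phase < n; ++phase) {
        int i = 2 * tid + (phase & 1);
        if (i + 1 < n && data[i] > data[i + 1]) {
            int tmp = data[i];                   // swap out-of-order neighbors
            data[i] = data[i + 1];
            data[i + 1] = tmp;
        }
        __syncthreads();                         // finish this phase before starting the next
    }
}
```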