Introduction to Parallel Programming - Part I

A course from Udacity by NVIDIA; the GitHub repo is here.

1 The GPU Programming Model

GPUs favor throughput over latency, and are optimized accordingly.

Typical GPU program

  1. CPU allocates storage on the GPU (`cudaMalloc`)
  2. CPU copies input data to the GPU (`cudaMemcpy`)
  3. CPU launches kernel(s) on the GPU to process the data
  4. CPU copies results back from the GPU (`cudaMemcpy`)
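
A minimal host-side sketch of these four steps (the `square` kernel, the sizes, and the omission of error checking are illustrative assumptions, not from the course):

```cuda
// Device code: each thread squares one element.
__global__ void square(float *d_out, const float *d_in) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_out[i] = d_in[i] * d_in[i];
}

int main() {
    const int N = 1024;
    const size_t bytes = N * sizeof(float);
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                                // 1. allocate on GPU
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // 2. copy input to GPU
    square<<<N / 256, 256>>>(d_out, d_in);                   // 3. launch kernel
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // 4. copy result back
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```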

💡 Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on many threads.

The GPU is good at

  • launching a large number of threads efficiently
  • running a large number of threads in parallel

```cuda
kernel<<<gridSize, blockSize, shmem>>>(...)
kernel<<<dim3(bx, by, bz), dim3(tx, ty, tz)>>>(...)
```

The GPU runs many blocks efficiently, and each block has a maximum number of threads.
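
A sketch of how a kernel typically recovers a unique global index from the launch configuration (1D case; the kernel name and sizes are illustrative):

```cuda
// Each thread computes a unique global index from its block and thread IDs.
__global__ void add_one(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)          // guard: gridSize * blockSize may exceed n
        data[idx] += 1;
}

// Launch: enough 256-thread blocks to cover n elements.
// add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```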

(Images from the Wikipedia article on thread blocks.)

2 GPU Hardware & Parallel Communication Model

Abstraction & Communication Patterns

Map 1-to-1: tasks read from and write to specific data elements (memory)

map(element, function)

Gather n-to-1: tasks compute where to read data

Scatter 1-to-n: tasks compute where to write data

Stencil n-to-1: tasks read from a fixed neighborhood in an array (data re-use)

Transpose 1-to-1: tasks re-order data elements in memory
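
A sketch contrasting gather and scatter as kernels (the index arrays are assumed inputs, purely for illustration):

```cuda
// Gather: each output element computes where to READ from.
__global__ void gather(float *out, const float *in, const int *src_idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[src_idx[i]];
}

// Scatter: each input element computes where to WRITE to.
__global__ void scatter(float *out, const float *in, const int *dst_idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[dst_idx[i]] = in[i];
}
```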

Hardware & Model

The GPU is responsible for allocating blocks to SMs (streaming multiprocessors)

  • a thread block contains many threads
  • an SM may run more than one block
  • all the threads in one block may cooperate to solve a subproblem, but they cannot communicate with threads in other blocks (even on the same SM)

CUDA makes few guarantees about when and where thread blocks will run

  • Pros: flexibility -> efficiency; scalability
  • Cons: no assumptions about which block runs on which SM; no communication between blocks

CUDA guarantees:

  1. all threads in a block run on the same SM at the same time
  2. all blocks in a kernel finish before any blocks from the next kernel run

Synchronization

  • Barrier __syncthreads(): wait until all threads arrive, then proceed
  • Atomic ops (e.g., atomicAdd()): only certain ops & data types are supported, and there is still no ordering guarantee
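
A small sketch of both primitives, assuming a single 128-thread block for the barrier example and an illustrative predicate for the atomic one:

```cuda
// Barrier: no thread passes __syncthreads() until all threads in the
// block have reached it, so every write to s[] below is visible afterwards.
__global__ void shift_left(int *g_data) {
    __shared__ int s[128];
    int i = threadIdx.x;               // assumes a single 128-thread block
    s[i] = g_data[i];
    __syncthreads();                   // all writes to s[] are done here
    if (i < 127) g_data[i] = s[i + 1];
}

// Atomics: concurrent increments are serialized by the hardware,
// but the order in which they happen is unspecified.
__global__ void count_evens(const int *in, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0) atomicAdd(count, 1);
}
```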

Strategies

Maximize arithmetic intensity (ratio of math ops to memory ops)

  • maximize compute ops per thread
  • minimize time spent on memory
    • move frequently-accessed data to fast memory: local > shared >> global >> host
    • coalesce global memory accesses (coalesced >> strided)

Avoid thread divergence (branches & loops)
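
A sketch of the access patterns in question (names and the stride parameter are illustrative; the strided and divergent kernels show what to avoid):

```cuda
// Coalesced: adjacent threads touch adjacent addresses, so a warp's
// 32 accesses combine into a few memory transactions.
__global__ void copy_coalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads touch addresses `stride` apart, so each
// access may become its own transaction (much slower).
__global__ void copy_strided(float *out, const float *in, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Divergence: threads in the same warp take different branches and the
// warp executes both paths serially; avoid where possible.
__global__ void divergent(float *data) {
    if (threadIdx.x % 2 == 0) data[threadIdx.x] *= 2.0f;
    else                      data[threadIdx.x] += 1.0f;
}
```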

💡 How to use shared memory (static & dynamic)
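
A sketch of both flavors; the dynamic size is passed as the third launch parameter (the `shmem` slot in the launch syntax above). Kernel names and the block-reverse example are illustrative:

```cuda
// Static: size fixed at compile time.
__global__ void static_shared(float *g) {
    __shared__ float s[256];
    int i = threadIdx.x;               // assumes one 256-thread block
    s[i] = g[i];
    __syncthreads();
    g[i] = s[255 - i];                 // e.g., reverse the block's elements
}

// Dynamic: size chosen at launch time via the third <<<...>>> parameter.
__global__ void dynamic_shared(float *g, int n) {
    extern __shared__ float s_dyn[];   // size = shmem bytes from the launch
    int i = threadIdx.x;
    if (i < n) s_dyn[i] = g[i];
    __syncthreads();
    if (i < n) g[i] = s_dyn[n - 1 - i];
}

// Launches (illustrative):
// static_shared<<<1, 256>>>(d_data);
// dynamic_shared<<<1, n, n * sizeof(float)>>>(d_data, n);
```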

UPDATE: Cooperative Groups in CUDA 9:

  • define groups of threads explicitly at sub-block and multiblock granularities
  • enable new patterns such as producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire Grid
  • provide an abstraction for scalable code across different GPU architectures
```cuda
#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_kernel(...)
{
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        …
        my_tile.sync();
    }
}
```

3 & 4 Fundamental GPU Algorithms

Step complexity vs. work complexity (total amount of operations)

Reduce

Input:

  • set of elements
  • reduction operator
    • binary
    • associative ((a op b) op c = a op (b op c))
Serial reduce, n - 1 steps:

```
a b
\ /
 + c
 \ /
  + d
  \ /
   +
```

Parallel (tree) reduce, log n steps:

```
a b c d
\ / \ /
 +   +
  \ /
   +
```

log n step complexity. But what if we only have p processors and n > p inputs? See Brent's theorem.
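
A hedged sketch of a per-block tree reduce in shared memory (sum operator; assumes a power-of-two block size):

```cuda
// Tree reduce of one block's elements in shared memory.
// Assumes blockDim.x is a power of two and n == gridDim.x * blockDim.x.
__global__ void block_sum(const float *in, float *block_sums) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Halve the number of active threads each step: log(blockDim.x) steps.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];  // one partial sum per block
}

// Launch once per level, e.g.:
// block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial);
// then reduce d_partial again (or on the CPU) to a single value.
```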

Scan

Input:

  • set of elements
  • binary associative operator
  • identity element I s.t. I op a = a

Two types of scan:

  • Exclusive scan - output combines all elements before, but not including, the current element
  • Inclusive scan - output combines all elements before and including the current element

| Algorithm | Description | Work | Step | Notes |
| --- | --- | --- | --- | --- |
| Serial scan | - | O(n) | n | |
| Hillis & Steele scan | Starting with step 0, on step i, apply the op to yourself and your 2^i-left neighbor (if no such neighbor, copy yourself) | O(n log n) | log n | step-efficient (more processors than work) |
| Blelloch scan | reduce -> downsweep (paper, wiki) | O(n) | 2 log n | work-efficient (more work than processors) |
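
A sketch of the Hillis & Steele inclusive scan for a single block (sum operator; the double buffer avoids read/write races between steps):

```cuda
// Hillis & Steele inclusive scan of one block: O(n log n) work, log n steps.
// Double buffering in shared memory prevents reading a slot that another
// thread is overwriting in the same step.
__global__ void hillis_steele_scan(const float *in, float *out, int n) {
    extern __shared__ float s[];              // 2 * blockDim.x floats
    float *src = s, *dst = s + blockDim.x;
    int tid = threadIdx.x;
    src[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        dst[tid] = (tid >= offset) ? src[tid] + src[tid - offset] : src[tid];
        __syncthreads();
        float *tmp = src; src = dst; dst = tmp;   // swap buffers
    }
    if (tid < n) out[tid] = src[tid];
}

// Launch (single block, illustrative):
// hillis_steele_scan<<<1, 1024, 2 * 1024 * sizeof(float)>>>(d_in, d_out, n);
```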

Sparse matrix / dense vector multiplication (SpMV) & segmented scan

Histogram

  • Accumulate using atomics
  • Per-thread local histograms, then reduce
  • Sort, then reduce by key
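
A sketch of the first variant, accumulation with atomics (the binning function is an illustrative assumption):

```cuda
// Histogram via atomics: correct, but contended when many threads
// land in the same bin; the per-thread and sort-based variants above
// trade extra work for less contention.
__global__ void histogram_atomic(const int *in, int *bins,
                                 int n, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int b = in[i] % num_bins;     // illustrative binning function
        atomicAdd(&bins[b], 1);
    }
}
```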

Compact

Compact is most useful when we compact away a large number of elements and the computation on each surviving element is expensive.

  1. Predicate
  2. Build scan-in array A (0 for false, 1 for true)
  3. Exclusive-sum-scan array A -> scatter addresses (dense)
  4. Scatter using those addresses
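
A sketch of steps 1, 2, and 4 as kernels; step 3's exclusive scan is assumed to come from elsewhere (e.g., a scan like the sketch above), and the predicate is illustrative:

```cuda
// Steps 1-2: evaluate the predicate and build the 0/1 scan-in array.
__global__ void predicate_kernel(const float *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] > 0.0f) ? 1 : 0;  // illustrative predicate
}

// Step 3 (exclusive sum scan of flags -> addr) is assumed done elsewhere.

// Step 4: scatter each surviving element to its compacted address.
__global__ void scatter_compact(const float *in, const int *flags,
                                const int *addr, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[addr[i]] = in[i];
}
```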

Allocate

Using scan (good strategy)

  • Input: allocation request per input element
  • Output: location in array to write your thread's output

Sort

  • Odd-even sort - O(n) steps & O(n^2) work
  • (Parallel) Merge sort
    1. Tons of tasks (each task small) - task per thread
    2. Bunches of tasks (each task medium) - task per block
    3. One task (big) - split task across SMs
  • Sorting network (e.g., bitonic sorter; image from Wikipedia)

  • Radix sort - O(kn) where k is #bits of the representation (quite brute-force)
  • Quick sort

Oblivious - behavior is independent of some aspects of the problem (e.g., a sorting network performs the same comparisons regardless of the input values), which maps well to GPUs.
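
A sketch of one oblivious algorithm, odd-even transposition sort for a single block: the compare-and-swap pattern is fixed regardless of the data (the kernel name and an even n are assumptions):

```cuda
// Odd-even transposition sort of one block's data: n steps, O(n^2) work,
// but the comparison pattern never depends on the values (oblivious),
// so every thread does the same thing every step.
__global__ void odd_even_sort(int *data, int n) {
    int tid = threadIdx.x;               // assumes one block of n/2 threads
    for (int phase = 0; phase < n; ++phase) {
        int i = 2 * tid + (phase % 2);   // even phase: (0,1),(2,3)...; odd: (1,2),(3,4)...
        if (i + 1 < n && data[i] > data[i + 1]) {
            int t = data[i]; data[i] = data[i + 1]; data[i + 1] = t;
        }
        __syncthreads();                 // finish this phase's swaps first
    }
}

// Launch (illustrative, n even): odd_even_sort<<<1, n / 2>>>(d_data, n);
```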
