Introduction to Parallel Programming - Part I
Course from Udacity by NVIDIA; GitHub repo here.
1 The GPU Programming Model
Typical GPU program:
- cudaMalloc: allocate memory on the device
- cudaMemcpy: copy input data from the host to the device
- launch kernel(s) to compute on the device
- cudaMemcpy: copy results from the device back to the host
💡 Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on many threads.
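A minimal sketch of this flow (the `square` kernel, the sizes, and the launch configuration are illustrative assumptions, not from the course):

```cuda
#include <cuda_runtime.h>

// Kernel: written as if it runs on a single thread.
__global__ void square(float *d_out, const float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_in[i] * d_in[i];
}

int main()
{
    const int N = 1024;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));                               // 1. allocate device memory
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);   // 2. copy input to the device

    square<<<(N + 255) / 256, 256>>>(d_out, d_in, N);                    // 3. launch the kernel on many threads

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost); // 4. copy results back to the host
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```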
GPU is good for efficiently launching a large number of threads and running them in parallel.
The GPU efficiently runs many blocks, and each block has a maximum # of threads. (Images from Wikipedia thread-block.)
2 GPU Hardware & Parallel Communication Model
Abstraction & Communication Patterns (the first two are sketched in code right after this list)
- Map (1-to-1): tasks read from and write to specific data elements (memory); map(element, function)
- Gather (n-to-1): tasks compute where to read data
- Scatter (1-to-n): tasks compute where to write data
- Stencil (n-to-1): tasks read from a fixed neighborhood in an array (data re-use)
- Transpose (1-to-1): tasks re-order data elements in memory
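Minimal kernel sketches of the first two patterns (kernel names and the averaging example are illustrative assumptions):

```cuda
// Map (1-to-1): each thread transforms the element it "owns".
__global__ void map_square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// Gather (n-to-1): each thread computes where to read from.
// Reading a fixed neighborhood like this is also the stencil pattern.
__global__ void gather_average3(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
```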
Hardware & Model
- Thread blocks <-> SMs (streaming multiprocessors): CUDA leaves the scheduling of blocks onto SMs to the hardware -> efficiency; scalability
- No communication between blocks
- CUDA guarantees: all threads in a block run on the same SM at the same time, and all blocks in a kernel finish before any block from the next kernel runs
Synchronization
- __syncthreads(): wait till all threads in the block arrive, then proceed
Strategies
- Maximize arithmetic intensity: math/memory (local > shared >> global >> host; coalesced >> strided)
- Avoid thread divergence (branches & loops)
💡 How to use shared memory (static & dynamic)
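A minimal sketch of both ways to obtain shared memory, which also shows the `__syncthreads()` barrier from the Synchronization section (kernel names and the block-reversal example are illustrative assumptions):

```cuda
// Static shared memory: size fixed at compile time (this kernel assumes blockDim.x == 128).
__global__ void reverse_block_static(const float *in, float *out)
{
    __shared__ float buf[128];
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[tid] = in[base + tid];
    __syncthreads();                               // all writes to buf must finish before any read
    out[base + tid] = buf[blockDim.x - 1 - tid];   // reverse the elements within the block
}

// Dynamic shared memory: size supplied as the third launch parameter.
__global__ void reverse_block_dynamic(const float *in, float *out)
{
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[tid] = in[base + tid];
    __syncthreads();
    out[base + tid] = buf[blockDim.x - 1 - tid];
}

// launch (assumed sizes): reverse_block_dynamic<<<blocks, 128, 128 * sizeof(float)>>>(d_in, d_out);
```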
UPDATE: Cooperative Groups in CUDA 9 features
- define groups of threads explicitly at sub-block and multiblock granularities
- enable new patterns such as producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire grid
- provide an abstraction for scalable code across different GPU architectures
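The accompanying kernel sketch from these notes, reformatted; the `...` parameter list is left as in the original, and the include/namespace lines are added here as the usual CUDA 9 boilerplate:

```cuda
#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_kernel(...)
{
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        // ...
        my_tile.sync();
    }
}
```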
3 & 4 Fundamental GPU Algorithms
Step complexity vs. work complexity (total amount of operations)
Reduce
- Input:
  - set of elements
  - reduction operator
    - binary
    - associative: (a op b) op c = a op (b op c)
- Serial reduce:

      a b
      \ /
       + c
       \ /
        + d
        \ /
         +

- Parallel (tree) reduce (sketched in code below):

      a b c d
      \ / \ /
       +   +
        \ /
         +

- log n step complexity - what if we only have p processors but n > p inputs? Brent's theorem
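A minimal sketch of the tree reduce (kernel name and float data are assumptions; the block size is assumed to be a power of two). Each block reduces its chunk in shared memory and writes one partial sum, which a second launch or the host then reduces:

```cuda
// Each block loads blockDim.x elements into shared memory and tree-reduces them.
// Pads the last block with the identity (0) when n is not a multiple of the block size.
__global__ void block_reduce_sum(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // halve the number of active threads each step: log2(blockDim.x) steps
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];            // one partial sum per block
}

// launch (assumed): block_reduce_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
```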
Scan
- Input:
  - set of elements
  - binary associative operator
  - identity element I s.t. I op a = a
- Two types of scan:
  - Exclusive scan - each output is the reduction of all elements before, but not including, the current element
  - Inclusive scan - each output is the reduction of all elements before and including the current element

| Algorithm | Description | Work | Step | Notes |
| --- | --- | --- | --- | --- |
| Serial scan | - | O(n) | n | |
| Hillis & Steele scan | Starting with step 0, on step i, op yourself to your 2^i left neighbor (if no such neighbor, copy yourself) | O(n*logn) | logn | sketched in code below |
| Blelloch scan | reduce -> downsweep (paper, wiki) | O(n) | 2*logn | |
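A minimal single-block sketch of the Hillis & Steele inclusive scan from the table (kernel name, double-buffered shared memory, and the n <= blockDim.x restriction are assumptions):

```cuda
// Inclusive Hillis & Steele scan of n elements within one block.
__global__ void hillis_steele_scan(const float *in, float *out, int n)
{
    extern __shared__ float temp[];              // double buffer: 2 * n floats
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    if (tid < n)
        temp[pout * n + tid] = in[tid];          // inclusive: start from the input itself
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                         // swap which buffer is read and which is written
        pin  = 1 - pout;
        if (tid < n) {
            if (tid >= offset)
                temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
            else
                temp[pout * n + tid] = temp[pin * n + tid];   // no left neighbor: copy yourself
        }
        __syncthreads();
    }
    if (tid < n)
        out[tid] = temp[pout * n + tid];
}

// launch (assumed): hillis_steele_scan<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n);
```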
Sparse matrix/dense vector multiplication (SpMv) & Segmented scan
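The notes only name this topic; as a reference point, a per-row SpMV kernel for a CSR-format matrix might look like the sketch below (the CSR layout and all names are assumptions):

```cuda
// One thread per row; CSR arrays: row_ptr (num_rows + 1 entries), col_idx and val (one entry per nonzero).
__global__ void spmv_csr(const int *row_ptr, const int *col_idx, const float *val,
                         const float *x, float *y, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];       // gather the dense-vector entries this row needs
        y[row] = sum;
    }
}
```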
Histogram
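Likewise only named here; a minimal global-memory histogram sketch using atomicAdd (the binning function and all names are assumptions):

```cuda
// bins must be zero-initialized (e.g. with cudaMemset) before the launch.
__global__ void histogram(const unsigned int *in, unsigned int *bins, int n, int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int bin = in[i] % num_bins;     // trivial binning function
        atomicAdd(&bins[bin], 1u);               // atomics serialize concurrent updates to the same bin
    }
}
```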
Compact
- Compact is most useful when we compact away a large number of elements and the computation on each surviving element is expensive.
- Steps:
  - compute a predicate array A (0 for false and 1 for true)
  - exclusive-sum-scan A -> scatter addresses (dense)
  - scatter the surviving elements to their addresses (sketched in code below)
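A minimal sketch of the final scatter step (names are assumptions); the predicate array and its exclusive scan are assumed to have been computed already, e.g. with a scan kernel like the one above:

```cuda
// predicate[i] is 1 if element i survives; scan[i] is the exclusive sum of predicate,
// i.e. the number of surviving elements before i = the output address of element i.
__global__ void compact_scatter(const float *in, const int *predicate, const int *scan,
                                float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && predicate[i])
        out[scan[i]] = in[i];
}
```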
Allocate
- Using scan (good strategy)
Sort
- Odd-even transposition (brick) sort: O(n) steps & O(n^2) work (sketched in code below)
- Radix sort: O(kn) work, where k is the # of bits of the representation (quite brute-force)
- Oblivious - behavior is independent of some aspects of the problem
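A minimal single-block sketch of odd-even transposition (brick) sort (kernel name and the single-block restriction are assumptions):

```cuda
// Sorts n elements with one block; launch with at least n/2 threads, e.g. <<<1, (n + 1) / 2>>>.
// In even phases threads compare pairs (0,1), (2,3), ...; in odd phases pairs (1,2), (3,4), ...
__global__ void brick_sort(int *data, int n)
{
    int tid = threadIdx.x;
    for (int phase = 0; phase < n; ++phase) {
        int i = 2 * tid + (phase & 1);
        if (i + 1 < n && data[i] > data[i + 1]) {
            int tmp = data[i];                   // swap out-of-order neighbors
            data[i] = data[i + 1];
            data[i + 1] = tmp;
        }
        __syncthreads();                         // finish this phase before starting the next
    }
}
```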