# Reduction



#### What is a Reduction?

- Derives a single value from an array of values
- Can be performed by sequentially going through every element of the array
- Take as a reference the sequential implementation
- The complexity is O(N)
- float **or** double?



## Usage Examples

- MPI exposes directly a MPI\_Reduce and MPI\_Allreduce
  - They are used quite a lot
- MapReduce relies on reductions
  - To analyze a large dataset
- Used as a primitive by different sorting algorithms



#### **Naive GPU**

- Each thread will be the "owner" of the location to which it is assigned
- The condition inside the loop verifies threads index multiple of 2<sup>n</sup>
  - these threads will be active threads
  - the others will be inactive
  - fewer and fewer threads remain active
  - o at the last iteration, only one remains active
- The \_\_syncthreads ensure that the results have been written back



# **Problems and Optimizations**

- Coalescing and Divergence
- Privatization
- Coarsening

| Optimization                                          | Benefit to compute cores                                               | Benefit to memory                                                                 | Strategies                                                                                                                                                                                                                         |
|-------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Maximizing occupancy                                  | More work to hide<br>pipeline latency                                  | More parallel<br>memory accesses<br>to hide DRAM<br>latency                       | Tuning usage of SM resources such as threads per block, shared memory per block, and registers per thread                                                                                                                          |
| Enabling<br>coalesced<br>global<br>memory<br>accesses | Fewer pipeline<br>stalls waiting for<br>global memory<br>accesses      | Less global memory<br>traffic and better<br>utilization of bursts/<br>cache lines | Transfer between global memory and shared memory in a coalesced manner and performing uncoalesced accesses in shared memory (e.g., corner turning)  Rearranging the mapping of threads to data  Rearranging the layout of the data |
| Minimizing<br>control<br>divergence                   | High SIMD<br>efficiency (fewer<br>idle cores during<br>SIMD execution) | -                                                                                 | Rearranging the mapping<br>of threads to work and/or<br>data<br>Rearranging the layout of<br>the data                                                                                                                              |
| Tiling of reused data                                 | Fewer pipeline<br>stalls waiting for<br>global memory<br>accesses      | Less global memory traffic                                                        | Placing data that is reused within a block in shared memory or registers so that it is transferred between global memory and the SM only once                                                                                      |
| Privatization<br>(covered<br>later)                   | Fewer pipeline<br>stalls waiting for<br>atomic updates                 | Less contention and<br>serialization of<br>atomic updates                         | Applying partial updates to<br>a private copy of the data<br>and then updating the<br>universal copy when done                                                                                                                     |
| Thread coarsening                                     | Less redundant<br>work, divergence,<br>or synchronization              | Less redundant<br>global memory<br>traffic                                        | Assigning multiple units of<br>parallelism to each thread<br>to reduce the price of<br>parallelism when it is<br>incurred unnecessarily                                                                                            |



## Reduce Memory Divergency

- With a lot of threads, the inactive threads become more visible
  - o Previous solution can have different warps with only a punch of threads active
  - We can compact them to increase the scheduling efficiency
- Enable coalesced memory access
- Increase thread convergence





#### **Introduce Privatization**

- Threads write their partial sum result values to global memory
  - These values are reread by the same threads and others in the next iterations
- Shared memory has much shorter latency and higher bandwidth
  - It can fit the data required by the block's thread in this case
- Note
  - The syncthreads have been moved up
  - The memory access is still coalesced



# Reduce Overhead: Thread Coarsening

- The hardware is to serialize these thread blocks.
  - we are better off serializing them ourselves in a more efficient manner
- The block's segment is identified by multiplying with the COARSE\_FACTOR
  - threads are not responsible only for adding two elements
- Trade-off: we are scheduling fewer threads
  - The occupancy could be low
  - Fine-tuning is required





## Reduce Overhead: Comparison

- We use the COARSE FACTOR to fully utilized the hardware
  - The price of idle threads is softened much more
  - More steps fully utilized the hardware





# Code Hands-on



#### Bonus

- It is available in the <u>Thrust Library</u>
  - Documentation <u>here</u>
- You can do the same exercise with a different hierarchical approach
  - Each block works on a different input's segments
  - o Instead, results are written to the same output using an atomicAdd
  - Try it as an exercise



# Thank you for your attention!

Gianmarco Accordi gianmarco.accordi@polimi.it

