## **Memory Coalescing**

Recall that thread blocks are divided into **warps** of 32 threads

Recall that thread blocks are divided into **warps** of 32 threads

Recall that thread blocks are divided into **warps** of 32 threads

Instructions are issued in parallel at the warp level of 32 threads

Instructions are issued in parallel at the warp level of 32 threads



For space on these slides, we will treat just 4 threads as a warp

Warp



Data is transferred to and from global device memory in 32-byte segments\*





(\* If the data is in the L1 cache it will be transferred in 128-byte cache lines – see the notebook for details)





For these slides we will treat 4 data elements as one of these fixed-length lines of contiguous memory





The memory subsystem will attempt to minimize the number of lines required to fulfill the read/write requirements of the warp







If the addresses requested are contiguous



All data in the line will be used



And the transfer will happen in as few lines as possible



When this occurs, the memory access is fully **coalesced** 



As requested memory becomes less contiguous









And more of the data being transferred will go unused





## Row and Column Sum Comparison

Consider a kernel that stores the sum of each row of a matrix (which here is 4 contiguous data elements) in a result vector

Warp



 0
 1
 2
 3

 4
 5
 6
 7

 8
 9
 10
 11

 12
 13
 14
 15



Warp















Sum = 1









Warp







This seems natural, but look at what happens when we consider the parallel execution within the warp

Warp





Data

Each thread in the warp is requesting data in a different line of memory









Note that increments to threadIdx.x are mapping to increments in the data along the y axis



Data





Which means (in our example) 4 lines of data will need to be loaded, and 75% of the data loaded will be unused



Data





Unfortunately, as each thread iterates over its row, the same uncoalesced pattern continues



Data





Unfortunately, as each thread iterates over its row, the same uncoalesced pattern continues



Data





Unfortunately, as each thread iterates over its row, the same uncoalesced pattern continues



Data





In this example we transferred 16 memory lines, and used 25% of the data for each line transferred

Warp





Data

Let's compare a kernel that stores the sum of each column of a matrix in a result vector

Warp







Warp





Data



















Warp





Data





Data







Data















Data





Warp



Data

A useful tip to keep in mind is that increments to threadIdx.x should map to increments in data in the direction of fastest changing index – in this case the x axis

Memory Line Size



In this example we transferred 4 memory lines (compared to 16), and used 100% of the data for each line transferred (compared to 25%)

Warp





Data

