Below is a step-by-step explanation of “Reduction #4: First Add During Load,” showing how combining the first reduction step with the global-to-shared memory transfer both keeps all threads busy and cuts runtime roughly in half compared to the prior version. We’ll walk through the code change, explain each line, do a small dry-run, and highlight the performance gain.

---

## Key Insight

By doing the first add **while** loading data from global memory into shared memory, we:

1. **Halve the number of blocks** needed (because each block now handles two elements per thread).
2. **Keep every thread active** in that first step (no one sits idle).
3. **Reduce the number of blocks launched and the associated index arithmetic**, raising effective memory bandwidth from \~9.7 GB/s to \~17.4 GB/s (for a 4 M-element sum) and cutting time from \~1.72 ms to \~0.97 ms. That is a \~1.78× speedup over Reduction #3 and \~8.3× over the naïve version.

---

## 1. Original Load + Reduction Loop (Version 3)

```cpp
// Version 3: each thread loads one element, then reduction loop begins
unsigned int tid = threadIdx.x;
unsigned int i   = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid]      = g_idata[i];
__syncthreads();

// reduction loop (reversed)
for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
  if (tid < s) {
    sdata[tid] += sdata[tid + s];
  }
  __syncthreads();
}
```

- **Issue:** In the very first iteration (`s = blockDim.x/2`), only half the threads do work; the other half fail the `if (tid < s)` test and simply wait at `__syncthreads()`, wasting resources.
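
To make the idle-thread cost concrete, here is a small host-side C++ sketch (plain CPU code, not a kernel) that lists how many threads pass the `tid < s` test in each iteration of the loop above. The function name `activeThreadsPerStep` is hypothetical, and a power-of-two block size is assumed.

```cpp
#include <cassert>
#include <vector>

// Host-side model (not GPU code): for a power-of-two block size, list how
// many threads satisfy `tid < s` in each iteration of the reduction loop.
std::vector<unsigned> activeThreadsPerStep(unsigned blockDim) {
    assert(blockDim != 0 && (blockDim & (blockDim - 1)) == 0);  // power of two assumed
    std::vector<unsigned> active;
    for (unsigned s = blockDim / 2; s > 0; s >>= 1) {
        active.push_back(s);  // exactly threads 0..s-1 perform an add
    }
    return active;
}
```

For a block of 8 threads this yields 4, 2, 1: even in the busiest iteration, half of the block only loaded one value and then waited.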

---

## 2. New “First Add During Load” (Version 4)

```cpp
// Version 4: perform first reduction step during the global→shared load
unsigned int tid = threadIdx.x;
unsigned int i   = blockIdx.x*(blockDim.x*2) + threadIdx.x;
// each thread loads two elements and immediately sums them
// (no bounds check: assumes the input length is a multiple of 2*blockDim.x)
sdata[tid]      = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();

// now proceed with the same reversed reduction loop;
// the grid needs only half as many blocks as in Version 3
for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
  if (tid < s) {
    sdata[tid] += sdata[tid + s];
  }
  __syncthreads();
}
```
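
The snippet above reads `g_idata[i + blockDim.x]` without a bounds check, so it relies on the input length being a multiple of `2*blockDim.x`. One possible way to handle a ragged tail is to treat out-of-range loads as 0. The sketch below models only the load phase on the host in plain C++; `loadPhaseV4` is a hypothetical helper, and the guard is an addition that is not part of the original kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the Version 4 load phase for one block (not GPU code).
// Out-of-range loads contribute 0; the original snippet has no such guard.
std::vector<int> loadPhaseV4(const std::vector<int>& g_idata,
                             unsigned blockIdx, unsigned blockDim) {
    assert(blockDim > 0);
    const std::size_t n = g_idata.size();
    std::vector<int> sdata(blockDim, 0);
    for (unsigned tid = 0; tid < blockDim; ++tid) {
        // Same index formula as the kernel: each block covers 2*blockDim elements.
        const std::size_t i = static_cast<std::size_t>(blockIdx) * blockDim * 2 + tid;
        int sum = 0;
        if (i < n)            sum += g_idata[i];             // first element
        if (i + blockDim < n) sum += g_idata[i + blockDim];  // second element
        sdata[tid] = sum;
    }
    return sdata;
}
```

With 13 input elements and 4 threads per block, block 1 covers indices 8 to 15, so only thread 0 finds a valid second element. In a real kernel the same two `if` tests would need the element count passed in as an extra parameter.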

---

## 3. Line-By-Line Explanation

| Line                                                        | What It Does                                                                  | Why It Helps                                                                                                         |
| ----------------------------------------------------------- | ----------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;` | Each block now covers `2*blockDim.x` elements instead of `blockDim.x`.        | Halves the number of blocks needed, so each block does twice the work.                                               |
| `sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];`        | Threads load two global elements and sum them into shared memory in one shot. | Keeps **all** threads busy on the first reduction step—no idle threads—immediately cutting the problem size in half. |
| `__syncthreads();`                                          | Ensures the combined values are in shared memory before further reduction.    | Maintains correctness across threads.                                                                                |
| **Then same reversed loop**                                 | Combines pairs within shared memory until one value per block remains.        | All subsequent iterations have fewer active threads, but the heavy first step is already done.                       |
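
The effect of the new index formula on grid size can be sketched with two small host-side helpers. The names `blocksV3` and `blocksV4` are hypothetical, and the rounding-up is an assumption for inputs that are not an exact multiple of the per-block coverage.

```cpp
#include <cassert>
#include <cstddef>

// Blocks needed when each thread loads one element (Version 3 indexing).
std::size_t blocksV3(std::size_t n, unsigned blockDim) {
    assert(blockDim > 0);
    return (n + blockDim - 1) / blockDim;  // round up
}

// Blocks needed when each thread loads two elements (Version 4 indexing).
std::size_t blocksV4(std::size_t n, unsigned blockDim) {
    assert(blockDim > 0);
    const std::size_t perBlock = static_cast<std::size_t>(blockDim) * 2;
    return (n + perBlock - 1) / perBlock;  // round up
}
```

Assuming, for illustration, 2^22 elements and 128 threads per block, Version 3 needs 32768 blocks and Version 4 needs 16384.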

---

## 4. Dry-Run Example (`blockDim.x` = 8, 16-element slice)

Let’s say a block handles 16 elements (two loads per 8 threads). Initial global data for this block:

```
g_idata slice: [a,b,c,d,e,f,g,h,  i,j,k,l,m,n,o,p]
```

### 4.1 First Add During Load

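Each thread `tid=0…7` does the following (this is block 0, so the index `i` equals `tid`; it is unrelated to the data element named `i` in the slice):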

- `sdata[tid] = g_idata[i] + g_idata[i+8]`

Resulting `sdata` after load:

```
tid:       0      1      2      3      4      5      6      7
sdata:   [a+i,  b+j,  c+k,  d+l,  e+m,  f+n,  g+o,  h+p]
```

All 8 threads were active—no one sat idle.

### 4.2 Reversed Reduction Loop

- **s = 4:** threads 0–3 add:
  - `sdata[0]+=sdata[4] → (a+i)+(e+m)`
  - `sdata[1]+=sdata[5] → (b+j)+(f+n)`, etc.
- **s = 2:** threads 0–1 add partial sums.
- **s = 1:** thread 0 adds the final two.

Final `sdata[0] = sum(a…p)`.

---

## 5. Performance Improvement

| Version | Bandwidth (GB/s, 4 M elements) | Time (ms) | Speedup vs V3 |
| ------- | ---------------- | --------- | ------------- |
| V3      | 9.741            | 1.722     | 1×            |
| **V4**  | **17.377**       | **0.965** | **1.78×**     |

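- **Why so fast?** Doing the first add during the global-to-shared transfer performs the first level of the reduction tree with every thread active, before any shared-memory traffic or barrier. The loop inside each block is unchanged, but only half as many blocks run it, which roughly halves the loop's instruction count and idle-thread time across the grid.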
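
The figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes that "4 M elements" means 2^22 four-byte integers (16,777,216 bytes) and that 1 GB = 10^9 bytes; neither is stated in the table, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Speedup of a new timing over an old one (both in milliseconds).
double speedup(double oldMs, double newMs) {
    assert(oldMs > 0.0 && newMs > 0.0);
    return oldMs / newMs;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for reading `bytes` in `ms` milliseconds.
double effectiveGBps(std::size_t bytes, double ms) {
    assert(ms > 0.0);
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}
```

Under those assumptions, `speedup(1.722, 0.965)` is about 1.78, and `effectiveGBps` gives roughly 9.74 GB/s for V3 and 17.39 GB/s for V4, close to the tabulated values; the small differences are plausibly due to rounding in the reported times.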

---

### Takeaways for GPU Experts

- **Fuse work with memory loads** whenever possible to keep all threads busy.
- **Halve the grid size** when you double per-thread work.
- **Move the first reduction level out of the shared-memory loop** and into the load, where every thread can take part instead of half of them idling.

This “first add during load” trick is a powerful example of how rethinking data movement on GPUs can yield outsized performance gains.
