A four-step walk from a one-thread-per-row baseline to a single-pass online softmax kernel. Each step moves an Nsight Compute counter, and the online variant doubles as the inner loop that Layer 4's Flash Attention kernel reuses. cuDNN ACCURATE and FAST are registered as two baselines.
| | naive | shared | warp_shuffle | online | cudnn_acc | cudnn_fast |
|---|---|---|---|---|---|---|
| ms @ 4096×4096 | 5.878 | 1.650 | 1.214 | 1.421 | 1.092 | 1.092 |
| GB/s (2·N·4 B) | 22.8 | 81.3 | 110.5 | 94.4 | 122.9 | 122.9 |
| % cuDNN | 19 | 66 | 90 | 77 | 100 | 100 |
| step | what changed | counter that moved | step gain |
|---|---|---|---|
| 00_naive.cu | 1 thread/row, 3 passes over GMEM | DRAM 21.56 % baseline | — |
| 01_shared.cu | 1 block/row, SMEM tree-fold for max and sum | DRAM 22 % → 36 % | 3.56× |
| 02_warp_shuffle.cu | __shfl_xor_sync finalise + one cross-warp SMEM round-trip | DRAM 36 % → 64 % | 1.36× |
| 03_online.cu | fused (max, sum) online pass + branchless warp merge | (negative at this size) | 0.85× |
online runs slower than warp_shuffle here because softmax at this
shape is compute-bound on __expf and the online merge does 3
__expf/element vs warp_shuffle's 1. The pattern wins back its cost
in Layer 4's fused attention, where the inputs are computed inside
the kernel and never hit GMEM — the memory saved is real there,
imaginary here. This step is framed as a primitive introduction, not a perf win.
Hardware and toolchain details (queried via cudaGetDeviceProperties / cudaDeviceGetAttribute):
- GPU: NVIDIA GeForce RTX 3050 Laptop GPU (GA107), sm_86, 16 SMs
- Per-SM: 65,536 registers, 1,536 threads, 100 KB shared memory, 128 KB unified L1/TEX
- Off-chip: 3.68 GB VRAM, 128-bit bus, 192 GB/s peak DRAM
- Toolkit / driver: CUDA 13.0.88, driver 580.82.09, compiled -O3 --gpu-architecture=sm_86
- cuDNN: 9.x (vendored at ../cuda-kernel-portfolio/cudnn, bundled with this repo tree)
- Shape: rows × cols = 4096 × 4096 (16.8 M floats, 67 MB per array). Effective bandwidth counts input read + output write = 2·N·4 bytes = 134 MB. The 3-read + 1-write variants move ≈ 268 MB; at DRAM peak (192 GB/s) that is 268 MB / 0.192 GB/ms = 1.39 ms bandwidth-minimal. cudnn_acc achieves 1.09 ms by using the L2 hit path plus a tile-scheduled kernel that overlaps multiple rows per SM (the arithmetic is sketched below).
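The bandwidth accounting above, as a back-of-envelope sketch. The figures are the ones quoted in this section; the standalone program and its names are illustrative only, not part of the benchmark binary.

```cpp
// Back-of-envelope for the GB/s row of the results table and the 1.39 ms floor.
#include <cstdio>

int main() {
    const double n_floats = 4096.0 * 4096.0;              // 16.8 M floats
    const double array_mb = n_floats * 4.0 / 1e6;          // ~67 MB per FP32 array

    // Effective-bandwidth convention of the table: one input read plus one
    // output write = 2 * N * 4 bytes, regardless of how many passes a kernel makes.
    const double moved_mb = 2.0 * array_mb;                // ~134 MB
    const double ms       = 1.214;                         // warp_shuffle duration
    std::printf("effective GB/s ~= %.1f\n", moved_mb / ms); // MB/ms == GB/s, ~110

    // DRAM floor for the 3-read + 1-write variants at the 192 GB/s peak:
    // MB / (GB/s) comes out directly in ms.
    const double multipass_mb = 4.0 * array_mb;             // ~268 MB
    std::printf("bandwidth-minimal ms ~= %.2f\n", multipass_mb / 192.0); // ~1.39 ms
    return 0;
}
```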
Row 0 → Row 1 — parallelise the reductions. naive runs one thread
per row, 3 sequential passes over cols elements each. L1/TEX sits
at 88.60 % because each thread's read pattern is a 16 KB stripe (cols
elements × 4 B) that spills across cache lines; DRAM only sees 21.56 %
of peak because most traffic hits and stays in L1. Compute (SM) at
4.83 % shows the SMs are mostly idle: each block of 256 threads is just
256 independent serial loops with no cooperation between them.
shared maps one block per row and reduces in parallel:
DRAM climbs 22 → 36 %, Compute (SM) jumps 5 → 58 % (the SMEM tree-fold
does real work), and Duration falls 5.88 → 1.65 ms. The L1/TEX traffic
drops by the same factor as the thread-level parallelism.
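For concreteness, a minimal sketch of the block-per-row, SMEM tree-fold pattern this step describes; the kernel name, launch shape, and details are assumptions for illustration, not 01_shared.cu's actual code.

```cuda
// One block per row; three strided passes over the row, with the max and sum
// reductions done as a shared-memory tree-fold (blockDim.x must be a power of 2,
// which is why the launcher rounds the block size up).
__global__ void softmax_shared_sketch(const float* __restrict__ in,
                                      float* __restrict__ out, int cols) {
    extern __shared__ float smem[];
    const float* row_in  = in  + (size_t)blockIdx.x * cols;
    float*       row_out = out + (size_t)blockIdx.x * cols;

    // Pass 1: per-thread partial max over a strided slice, then tree-fold.
    float local_max = -INFINITY;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        local_max = fmaxf(local_max, row_in[c]);
    smem[threadIdx.x] = local_max;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {        // log2(blockDim) rounds
        if (threadIdx.x < s)
            smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    const float row_max = smem[0];
    __syncthreads();                                      // smem is reused below

    // Pass 2: partial sum of exp(x - max), same tree-fold.
    float local_sum = 0.f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        local_sum += __expf(row_in[c] - row_max);
    smem[threadIdx.x] = local_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    const float inv_sum = 1.f / smem[0];

    // Pass 3: normalise and write.
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        row_out[c] = __expf(row_in[c] - row_max) * inv_sum;
}
// Launch sketch: softmax_shared_sketch<<<rows, 512, 512 * sizeof(float)>>>(d_in, d_out, cols);
```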
Row 1 → Row 2 — drop the SMEM round-trips. warp_shuffle replaces
the per-stride SMEM tree-fold with a warp-shuffle __shfl_xor_sync that
runs in registers. A block-size tree-fold did 9 SMEM rounds (log2(512))
plus 9 __syncthreads; warp-shuffle does 5 register shuffles with no
barrier, plus one SMEM write-read for the cross-warp merge. DRAM rises
36 → 64 %, L1/TEX falls 58 → 47 %, Compute (SM) falls 58 → 45 % (less
SMEM traffic means less pipe use). Duration 1.65 → 1.21 ms.
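A sketch of the warp-shuffle reduction pattern this step swaps in: five __shfl_xor_sync rounds in registers per warp, then a single SMEM write/read to merge the per-warp results. Function names and the 32-slot scratch array are illustrative, not the repo's code.

```cuda
// Butterfly max-reduction within one warp: log2(32) = 5 shuffle rounds, no barriers.
__device__ float warp_reduce_max(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, offset));
    return v;                                  // every lane holds the warp-wide max
}

// Block-wide max: per-warp shuffle reduction + one cross-warp SMEM round-trip.
__device__ float block_reduce_max(float v) {
    __shared__ float warp_results[32];         // one slot per warp (<= 1024 threads)
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;

    v = warp_reduce_max(v);
    if (lane == 0) warp_results[warp] = v;     // the single SMEM write
    __syncthreads();

    const int num_warps = (blockDim.x + 31) >> 5;
    if (warp == 0) {                           // first warp folds the per-warp maxima
        v = (lane < num_warps) ? warp_results[lane] : -INFINITY;
        v = warp_reduce_max(v);
        if (lane == 0) warp_results[0] = v;
    }
    __syncthreads();
    return warp_results[0];                    // the single SMEM read-back
}
```

The sum reduction is identical with fmaxf replaced by addition (and 0 as the identity).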
Row 2 → Row 3 — fused online pass. online computes (max, sum_exp) pairs incrementally, rescaling local_sum by exp(old_max - new_max) each time the running maximum grows. One GMEM read pass
computes both statistics; the second pass computes exp(x - max) * inv_sum and writes the output. warp_shuffle does 3 GMEM read passes
(max, exp+sum, normalise). At 2048 × 2048 in Layer 4 attention the
saved reads dominate; at 4096 × 4096 here with cold GMEM, the reads it
saves would mostly have hit L2 anyway (L2 Throughput 22.92 % vs
warp_shuffle's 40.85 %), so avoiding them buys little, while the extra
__expf calls per element cost exactly what they would in any
compute-bound kernel. Duration rises 1.21 → 1.42 ms.
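A sketch of the fused (max, sum) bookkeeping this step introduces. Struct and function names are assumptions, not 03_online.cu itself, but the rescale-on-new-max rule and the -INFINITY skip match the behaviour described here and in the audit notes below.

```cuda
struct MaxSum { float m; float s; };           // identity element: { -INFINITY, 0.f }

// Branchless merge of two (max, sum) partials. fminf(NaN, 0.f) returns 0.f in CUDA,
// so an all-masked partial (m still -INF, s = 0) merges cleanly instead of producing NaN.
__device__ MaxSum merge_max_sum(MaxSum a, MaxSum b) {
    const float m = fmaxf(a.m, b.m);
    const float s = a.s * __expf(fminf(a.m - m, 0.f))
                  + b.s * __expf(fminf(b.m - m, 0.f));
    return {m, s};
}

// Single strided read pass: each lane builds a running (max, sum_exp) pair,
// rescaling the old sum whenever its running maximum grows.
__device__ MaxSum online_row_pass(const float* row, int cols, int tid, int stride) {
    MaxSum acc = {-INFINITY, 0.f};
    for (int c = tid; c < cols; c += stride) {
        const float v = row[c];
        if (v == -INFINITY) continue;          // masked element contributes nothing
        const float m = fmaxf(acc.m, v);
        // One __expf for the rescale, one for the new element: two of the three
        // __expf/element this variant pays (the third is in the normalise pass).
        acc.s = acc.s * __expf(acc.m - m) + __expf(v - m);
        acc.m = m;
    }
    return acc;                                // lane partials are folded with merge_max_sum()
}
```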
Rows 2/3 → cuDNN. cudnn_acc runs softmax_fw_kernel_resident at
DRAM 95.10 %, 0.735 ms. Its tile scheduling issues one block per SM
hosting many rows (Waves per SM = 64), amortising launch and filling the
pipeline deeper than our one-row-per-block design. The kernel fits in 45 KB
of SMEM but the tile-per-block strategy keeps 4 blocks/SM resident
versus our 1.
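For context, a hedged sketch of how the cuDNN baselines can be invoked. The wrapper name and descriptor layout are assumptions about the bench harness; cudnnSoftmaxForward, the ACCURATE/FAST algorithms, and CUDNN_SOFTMAX_MODE_INSTANCE are the library's actual API.

```cpp
#include <cudnn.h>

// Row-wise softmax over a rows x cols matrix via cuDNN (sketch).
void cudnn_softmax_rows(cudnnHandle_t handle, const float* d_in, float* d_out,
                        int rows, int cols, cudnnSoftmaxAlgorithm_t algo) {
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    // Lay the matrix out as N = rows instances of C = cols channels so that
    // MODE_INSTANCE reduces over exactly one row per softmax.
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               rows, cols, 1, 1);
    const float alpha = 1.f, beta = 0.f;
    // algo = CUDNN_SOFTMAX_ACCURATE (subtracts the row max) for cudnn_acc,
    // CUDNN_SOFTMAX_FAST (skips the max subtraction) for cudnn_fast.
    cudnnSoftmaxForward(handle, algo, CUDNN_SOFTMAX_MODE_INSTANCE,
                        &alpha, desc, d_in, &beta, desc, d_out);
    cudnnDestroyTensorDescriptor(desc);
}
```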
- Cross-checked against a Kahan-summed, FP64-accumulated CPU reference (audit §L3.1.1/§L3.2.1); a sketch of such a reference follows this list. The previous cpu_softmax was a naive FP32 forward sum and would have had error ~cols * eps ≈ 5e-4, above the tightened rtol = 1e-4. verify_close uses atol = 1e-6, rtol = 1e-4. For cols = 4096, typical output values are 1/cols ≈ 2.4e-4; the previous tol = 1e-4 absolute would have let a systematic normalizer bug of that size slip through.
- Output poisoned with NaN before each launch (§0.1).
- 01_shared.cu's block size is rounded to the next power of 2 in the launcher so the stride-halving tree-fold is correct for any cols (audit §L3.1.1: the reference kernel only worked for power-of-2 block sizes but was launched with ceil(cols/32) * 32, which was buggy for cols ∈ {96, 160, ...}).
- 03_online.cu uses the branchless (max, sum) merge and skips elements with val == -INFINITY (audit §L3.1.2: the reference if (val > local_max) branch computed expf(NaN) on rows of all -INF, the masked-attention corner case; Layer 4 would have inherited the bug).
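A minimal sketch of a Kahan-summed, FP64-accumulated CPU reference of the kind the first audit bullet describes; the function name and layout are assumptions, and the repo's cpu_softmax may differ in detail.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Reference softmax: FP64 accumulation plus Kahan compensation keeps the
// summation error near eps instead of the ~cols * eps of a naive forward sum.
void cpu_softmax_ref(const std::vector<float>& in, std::vector<float>& out,
                     int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        const float* x = &in[(size_t)r * cols];
        float*       y = &out[(size_t)r * cols];

        double row_max = -std::numeric_limits<double>::infinity();
        for (int c = 0; c < cols; ++c) row_max = std::max(row_max, (double)x[c]);

        double sum = 0.0, comp = 0.0;                 // Kahan running sum + compensation
        for (int c = 0; c < cols; ++c) {
            const double term = std::exp((double)x[c] - row_max) - comp;
            const double next = sum + term;
            comp = (next - sum) - term;
            sum  = next;
        }
        for (int c = 0; c < cols; ++c)
            y[c] = (float)(std::exp((double)x[c] - row_max) / sum);
    }
}
```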
Build:

```bash
rm -rf build && mkdir build && cd build
cmake .. && cmake --build . --parallel
cd ..
```

Run the Layer-3 sweep:

```bash
export LD_LIBRARY_PATH=PATH_TO_CUDNN/cudnn/lib:$LD_LIBRARY_PATH
./build/bin/softmax_bench --rows 4096 --cols 4096 --iters 20 --runs 5 --warmup 3
```

Capture profiles:
```bash
./scripts/profile_layer4.sh
```

- FP32 only; power-of-2 cols not required. 01_shared.cu rounds the block size to the next pow2 in the launcher; 02_warp_shuffle.cu and 03_online.cu require cols to be a multiple of 32.
- Single shape in the benchmark: rows = cols = 4096. Transformer attention softmax has rows = B · H · seq_len and cols = seq_len; at seq_len = 2048 with 16 heads the effective rows = O(1e5). Our variants scale linearly in rows, but the Pareto optimum may shift toward online if rows dominate and the extra memory traffic of the 3-pass variants hurts.
- No masking. 03_online.cu defends against -INF inputs so masked attention works; this is intent for Layer 4, not tested directly in this layer's bench (the inputs are uniform).