Reverse-engineering the Blackwell (sm_120) GPU microarchitecture — bare-metal.
Sister project to BLACKWALL: two halves of one Blackwell sm_120 teardown. BLACKWALL maps the compute roofline (FP8/FP4 GEMM); ICEPICK dissects the microarchitecture beneath it (latency, caches, SASS). Same die, two blades.
NVIDIA documents what its GPUs do, not how. ICEPICK reconstructs the how by microbenchmarking the silicon and reading the SASS (machine assembly) the compiler actually emits — measuring instruction latencies, the memory hierarchy, the tensor cores and the scheduler directly, in clock cycles. The name is Cyberpunk: an icepick breaks the ICE to reach what's protected. Here the protected thing is the undocumented behaviour of consumer Blackwell (RTX 50-series), which is barely covered in public work.
Status: Phase 0 — instruction-latency probe (FMA dependency chain) + SASS cross-check.
Each probe isolates one hardware feature, measures it in real cycles, and is checked against documentation — or flags what isn't documented. We don't reimplement or beat NVIDIA's libraries; we dissect the hardware they run on.
- Dependency chains measure latency (pipeline depth): dependent ops can't overlap, so time = chain_length × latency.
- Independent / ILP streams measure throughput (ops per cycle).
- Pointer-chasing with varying strides reveals cache sizes, line size, and tier latencies.
- cuobjdump -sass / nvdisasm expose the real sm_120 instructions and scheduling.

The cycle counter (clock64) gives results in SM clocks, the unit that does not depend on boost frequency.
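
The dependency-chain relation above (time = chain_length × latency) can be turned into a small host-side helper. The sketch below is an assumption about how raw clock64 readings might be reduced to cycles per instruction, not the code in instruction_latency.cu; every function and parameter name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per dependent instruction from one timed chain.
// start/stop are raw cycle-counter readings taken before and after the chain
// (for example with clock64()); overhead is the cost of the timing code
// itself, assumed to be measured separately with an empty chain.
double chain_latency(std::uint64_t start, std::uint64_t stop,
                     std::uint64_t overhead, std::uint64_t chain_length) {
    assert(chain_length > 0);
    assert(stop >= start + overhead);
    return static_cast<double>(stop - start - overhead) /
           static_cast<double>(chain_length);
}

// Alternative that needs no separate overhead measurement: time two chains of
// different lengths and take the slope, so any fixed cost cancels out.
double chain_latency_slope(std::uint64_t cycles_short, std::uint64_t len_short,
                           std::uint64_t cycles_long, std::uint64_t len_long) {
    assert(len_long > len_short);
    assert(cycles_long >= cycles_short);
    return static_cast<double>(cycles_long - cycles_short) /
           static_cast<double>(len_long - len_short);
}
```

The slope form is often the more robust of the two, because it does not depend on how accurately the empty-chain overhead was measured. Either result should still be checked against the SASS to confirm that the compiler kept the chain dependent.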
Lineage: Wong et al. (2010), Mei & Chu (2017), Jia et al. / Citadel, Dissecting the NVIDIA Volta/Turing GPU Architecture via Microbenchmarking (2018–2019).
ICEPICK/
├── src/
│   └── instruction_latency.cu    F0: FMA latency by dependency chain (FP32 / FP64)
├── docs/ADR-0001-icepick.md      architecture decision + honest scope
└── (phases below add memory, tensor-core, scheduler probes)
From the x64 Native Tools prompt (or after vcvars64.bat):
nvcc -O3 -std=c++17 -arch=sm_120 src/instruction_latency.cu -o icepick_lat.exe
icepick_lat.exe
REM see the real machine code of the chain:
cuobjdump -sass icepick_lat.exe | findstr /C:"FFMA" /C:"DFMA"

- F0: instruction latency (FMA dependency chain), FP32 / FP64, + SASS cross-check
- F1 — memory hierarchy via random pointer-chasing. Measured on sm_120: L1 ~64 KB @ ~43 cyc · L2 32 MB @ ~353 cyc · DRAM ~920 cyc. (L2 size measured independently, matches the reported 32 MB → cross-checked.)
- F2b — sustained memory bandwidth: ~381 GB/s (85% of the ~448 GB/s GDDR7 peak), via read/copy/triad streaming. Crossed with BLACKWALL's compute ceiling = the full roofline.
- [—] F2 — tensor-core throughput lives in the sister project BLACKWALL (FP8/FP4 GEMM); ICEPICK keeps the latency + memory half, so the two don't overlap.
- F3 (optional deep-dive) — warp scheduling, register-file banks, occupancy
- F4 (optional deep-dive): sm_120 SASS catalogue of Blackwell-new encodings
Core complete — F0 + F1 + F2b cover the microarchitecture + memory half of the roofline.
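
F1 relies on random pointer-chasing, which needs a chain that touches every slot of the working set once and in an order the prefetcher cannot guess. This README does not show how ICEPICK builds that chain; the sketch below uses Sattolo's algorithm as one common way to do it. The names make_chase_ring and ring_length are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build next[] so that following i -> next[i] visits every slot exactly once
// before returning to the start. Sattolo's algorithm yields a permutation
// that is a single cycle, so the chase cannot fall into a short loop.
// n is assumed to fit in 32 bits.
std::vector<std::uint32_t> make_chase_ring(std::size_t n, std::uint64_t seed) {
    assert(n >= 2);
    std::vector<std::uint32_t> next(n);
    std::iota(next.begin(), next.end(), 0u);
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Pick j strictly below i; this is what separates Sattolo's
        // algorithm from a plain Fisher-Yates shuffle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Host-side reference of the chase loop: each load depends on the previous
// one, so the steps cannot overlap. Returns the number of hops back to slot 0.
std::size_t ring_length(const std::vector<std::uint32_t>& next) {
    std::size_t steps = 0;
    std::uint32_t p = 0;
    do {
        p = next[p];
        ++steps;
    } while (p != 0);
    return steps;
}
```

The working set is n * sizeof(std::uint32_t) bytes, so sweeping n moves the chase across the cache tiers. A device kernel would run the same p = next[p] loop between two cycle-counter reads and divide by the hop count.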
Crossing BLACKWALL's compute ceiling with ICEPICK's ~381 GB/s gives the ridge point per precision (FLOP/byte above which you're compute-bound): FP32 ~45 · FP16 ~231 · FP8 ~485 · FP4 ~897. The lower the precision, the further right the ridge — at FP4 you need ~900 FLOP/byte to escape the memory wall, which is why the 32 MB L2 exists: the only way to feed the FP4 tensor cores is cache reuse, not DRAM. Consumer Blackwell is a cache-resident FP4 inference engine — measured end to end across both repos.
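
The ridge-point arithmetic is a single division: peak compute over sustained bandwidth. The sketch below shows that relation and its inverse under the standard roofline model. The function names are hypothetical, and no BLACKWALL measurements are built into the code.

```cpp
#include <cassert>

// Ridge point of a roofline: the arithmetic intensity (FLOP/byte) at which
// the compute ceiling and the memory ceiling meet.
double ridge_point(double peak_flops_per_s, double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return peak_flops_per_s / bandwidth_bytes_per_s;
}

// Inverse: the compute ceiling implied by a quoted ridge point and bandwidth.
double implied_peak_flops(double ridge_flop_per_byte,
                          double bandwidth_bytes_per_s) {
    return ridge_flop_per_byte * bandwidth_bytes_per_s;
}

// Attainable throughput at a given intensity: the lower of the memory roof
// (intensity * bandwidth) and the compute roof.
double attainable_flops(double intensity_flop_per_byte,
                        double peak_flops_per_s,
                        double bandwidth_bytes_per_s) {
    const double memory_roof = intensity_flop_per_byte * bandwidth_bytes_per_s;
    return memory_roof < peak_flops_per_s ? memory_roof : peak_flops_per_s;
}
```

Calling implied_peak_flops(897.0, 381e9) with the FP4 figures quoted above gives roughly 342 TFLOP/s. That value is back-derived from the rounded ridge point and is not a measurement from either repo.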
Every number here is measured on the machine, reported in cycles, and cross-checked against the SASS or known values. Where a result is uncertain or architecture-specific, it is marked. No inferred numbers are presented as measured.
ICEPICK has a sister project: together they characterize consumer Blackwell (sm_120) end to end.
- BLACKWALL — the compute roofline: FP8/FP4 GEMM throughput across the precision spectrum (FP32 → FP4), measured on the metal.
- ICEPICK (this repo), the microarchitecture beneath it: instruction latency, the memory hierarchy, and the real sm_120 SASS.
Same die, two blades.