ICEPICK

Reverse-engineering the Blackwell (sm_120) GPU microarchitecture — bare-metal.

Sister project to BLACKWALL — two halves of one Blackwell sm_120 teardown: BLACKWALL maps the compute roofline (FP8/FP4 GEMM); ICEPICK dissects the microarchitecture beneath it (latency, caches, SASS). Same die, two blades.

NVIDIA documents what its GPUs do, not how. ICEPICK reconstructs the how by microbenchmarking the silicon and reading the SASS (machine assembly) the compiler actually emits, measuring instruction latencies, the memory hierarchy, the tensor cores and the scheduler directly, in clock cycles. The name is a cyberpunk reference: an icepick breaks the ICE (intrusion countermeasures) to reach what's protected. Here the protected thing is the undocumented behaviour of consumer Blackwell (RTX 50-series), which public work barely covers.

Status: Phase 0 — instruction-latency probe (FMA dependency chain) + SASS cross-check.

Method (honest by construction)

Each probe isolates one hardware feature, measures it in real cycles, and checks the result against documentation, flagging whatever isn't documented. We don't reimplement or beat NVIDIA's libraries; we dissect the hardware they run on.

Dependency chains measure latency (pipeline depth): dependent ops can't overlap, so time ≈ chain_length × latency, plus a fixed timing overhead.
Independent / ILP streams measure throughput (ops per cycle).
Pointer-chasing with varying strides reveals cache sizes, line size, and tier latencies.
cuobjdump -sass / nvdisasm expose the real sm_120 instructions and scheduling.
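
A pointer-chase probe needs a buffer in which every load depends on the previous one and the walk covers the whole working set before it repeats. The host-side sketch below builds such a buffer with Sattolo's algorithm, which produces a random permutation made of a single cycle. This is an illustration, not necessarily how ICEPICK builds its buffers: the names make_chase_cycle, chase, cycle_length and the seed parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build next[] so that repeatedly following i = next[i] visits every slot
// exactly once before returning to the start (one cycle of length n).
// `seed` is a hypothetical parameter, used only to make runs repeatable.
std::vector<std::uint32_t> make_chase_cycle(std::size_t n, std::uint64_t seed) {
    assert(n >= 1);
    std::vector<std::uint32_t> next(n);
    std::iota(next.begin(), next.end(), 0u);  // start from the identity
    std::mt19937_64 rng(seed);
    // Sattolo's algorithm: swap slot i with a strictly earlier slot j < i.
    // Excluding j == i is what guarantees a single cycle.
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Walk the chain from slot 0 for `steps` dependent hops; returns the final slot.
std::uint32_t chase(const std::vector<std::uint32_t>& next, std::size_t steps) {
    std::uint32_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

// Number of hops until the walk from slot 0 first returns to slot 0.
std::size_t cycle_length(const std::vector<std::uint32_t>& next) {
    std::size_t hops = 0;
    std::uint32_t i = 0;
    do {
        i = next[i];
        ++hops;
    } while (i != 0);
    return hops;
}
```

On the device, the same array would be copied to global memory and walked by a single thread between two cycle-counter reads; sweeping n moves the working set across the L1, L2 and DRAM tiers.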

The cycle counter (clock64()) reports time in SM clock cycles, so results are expressed in a unit that does not depend on boost frequency.
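
Turning raw cycle counts into a per-instruction latency can be sketched as follows. The sketch assumes two chains of different length are timed and their difference is taken, so that fixed overhead (timer reads, loop setup) cancels; whether the probe itself does this is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Per-op latency in cycles from two timed dependency chains.
// Subtracting the short run from the long run removes the fixed overhead
// that a single measurement would fold into the result.
double latency_cycles(std::uint64_t cycles_short, std::uint64_t ops_short,
                      std::uint64_t cycles_long, std::uint64_t ops_long) {
    assert(ops_long > ops_short);
    assert(cycles_long >= cycles_short);
    return static_cast<double>(cycles_long - cycles_short) /
           static_cast<double>(ops_long - ops_short);
}

// Convert cycles to nanoseconds at a given SM clock (MHz).
// Only needed when wall-clock time matters; the cycle count is the
// frequency-independent figure.
double cycles_to_ns(double cycles, double sm_clock_mhz) {
    assert(sm_clock_mhz > 0.0);
    return cycles * 1000.0 / sm_clock_mhz;
}
```

With made-up inputs for illustration, a 256-op chain taking 1074 cycles and a 1024-op chain taking 4146 cycles give (4146 − 1074) / (1024 − 256) = 4 cycles per op, with the 50-cycle overhead cancelled out.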

Lineage: Wong et al. (2010), Mei & Chu (2017), Jia et al. / Citadel, Dissecting the NVIDIA Volta/Turing GPU Architecture via Microbenchmarking (2018–2019).

Layout

ICEPICK/
├── src/
│   └── instruction_latency.cu   F0 — FMA latency by dependency chain (FP32 / FP64)
├── docs/ADR-0001-icepick.md     architecture decision + honest scope
└── (phases below add memory, tensor-core, scheduler probes)

Build & run

From the x64 Native Tools prompt (or after vcvars64.bat):

nvcc -O3 -std=c++17 -arch=sm_120 src/instruction_latency.cu -o icepick_lat.exe
icepick_lat.exe

REM see the real machine code of the chain:
cuobjdump -sass icepick_lat.exe | findstr /C:"FFMA" /C:"DFMA"

Roadmap

F0: instruction latency (FMA dependency chain), FP32 / FP64, plus SASS cross-check
F1: memory hierarchy via random pointer-chasing. Measured on sm_120: L1 ~64 KB @ ~43 cyc · L2 32 MB @ ~353 cyc · DRAM ~920 cyc. The L2 size was measured independently and matches the reported 32 MB.
F2b: sustained memory bandwidth of ~381 GB/s (85% of the ~448 GB/s GDDR7 peak), via read/copy/triad streaming. Combined with BLACKWALL's compute ceiling, this gives the full roofline.
F2 (moved): tensor-core throughput lives in the sister project BLACKWALL (FP8/FP4 GEMM); ICEPICK keeps the latency + memory half, so the two don't overlap.
F3 (optional deep-dive): warp scheduling, register-file banks, occupancy
F4 (optional deep-dive): sm_120 SASS catalogue of Blackwell-new encodings

Core complete — F0 + F1 + F2b cover the microarchitecture + memory half of the roofline.
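
For the F2b bandwidth figure, the arithmetic behind a GB/s number can be sketched like this. It assumes the usual STREAM counting convention (read touches one array, copy two, triad three) and 1 GB = 1e9 bytes; the probe's actual accounting is not shown here, so treat the names and the convention as assumptions.

```cpp
#include <cassert>
#include <cstddef>

enum class StreamKernel { Read, Copy, Triad };

// Arrays touched per element under the assumed STREAM convention:
// read = 1 load, copy = 1 load + 1 store, triad = 2 loads + 1 store.
constexpr std::size_t arrays_touched(StreamKernel k) {
    return k == StreamKernel::Read ? 1 : (k == StreamKernel::Copy ? 2 : 3);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one timed kernel run.
double bandwidth_gbs(StreamKernel k, std::size_t n_elems,
                     std::size_t elem_bytes, double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(arrays_touched(k)) *
                         static_cast<double>(n_elems) *
                         static_cast<double>(elem_bytes);
    return bytes / seconds / 1e9;
}

// Fraction of the theoretical peak that was actually sustained.
double fraction_of_peak(double measured_gbs, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

Plugging in the figures above, fraction_of_peak(381.0, 448.0) is about 0.85, which is the 85% quoted for F2b.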

The unified roofline (BLACKWALL × ICEPICK)

Crossing BLACKWALL's compute ceiling with ICEPICK's ~381 GB/s gives the ridge point per precision, i.e. peak FLOP/s divided by bandwidth, the FLOP/byte above which a kernel is compute-bound: FP32 ~45 · FP16 ~231 · FP8 ~485 · FP4 ~897. The lower the precision, the further right the ridge: at FP4 you need ~900 FLOP/byte to escape the memory wall, which is why the 32 MB L2 matters. The only way to feed the FP4 tensor cores is cache reuse, not DRAM. Consumer Blackwell is a cache-resident FP4 inference engine, measured end to end across both repos.
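
The roofline arithmetic can be sketched in a few lines. The function names are hypothetical, and implied_peak_flops only inverts the ridge formula: any ceiling it returns is back-computed from the ridge points quoted here, not a measured BLACKWALL value.

```cpp
#include <algorithm>
#include <cassert>

// Ridge point in FLOP/byte: the arithmetic intensity at which the memory
// roof (intensity * bandwidth) meets the compute roof (peak FLOP/s).
double ridge_point(double peak_flops, double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return peak_flops / bandwidth_bytes_per_s;
}

// Roofline model: attainable FLOP/s at a given arithmetic intensity.
double attainable_flops(double intensity, double peak_flops,
                        double bandwidth_bytes_per_s) {
    return std::min(peak_flops, intensity * bandwidth_bytes_per_s);
}

// Inverse of ridge_point: the compute ceiling implied by a quoted ridge.
// The result is inferred, not measured.
double implied_peak_flops(double ridge, double bandwidth_bytes_per_s) {
    return ridge * bandwidth_bytes_per_s;
}
```

For example, a kernel at 100 FLOP/byte with 381e9 bytes/s of bandwidth is capped at 3.81e13 FLOP/s by memory whenever the ridge lies above 100, which is the case for the FP16, FP8 and FP4 ridges listed above.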

Honesty

Every number here is measured on the machine and cross-checked against the SASS or known values; latencies are reported in cycles. Where a result is uncertain or architecture-specific, it is marked. No inferred numbers are presented as measured.

Related — the Blackwell teardown

ICEPICK has a sister project: together they characterize consumer Blackwell (sm_120) end to end.

BLACKWALL — the compute roofline: FP8/FP4 GEMM throughput across the precision spectrum (FP32 → FP4), measured on the metal.
ICEPICK (this repo) — the microarchitecture beneath it: instruction latency, the memory hierarchy, and the real sm_120 SASS.

Same die, two blades.
