Loom is a GPU-accelerated RTL logic simulator. Like a Jacquard loom weaving patterns from punched cards, Loom maps gate-level netlists onto a virtual manycore Boolean processor and executes them on GPUs, delivering 5-40X speedup over CPU-based RTL simulators.
Loom builds on the excellent GEM research by Zizheng Guo, Yanqing Zhang, Runsheng Wang, Yibo Lin, and Haoxing Ren at NVIDIA Research. ChipFlow extends their work with:
- Metal backend for Apple Silicon Macs (in addition to the original CUDA backend)
- Liberty timing support — load real cell delays from Liberty files (e.g. SKY130) for timing-annotated simulation
- SDF back-annotation — post-layout timing from Standard Delay Format files
- Setup/hold violation detection — both CPU and GPU-side checking
- Significant performance optimizations to the partition mapping pipeline
- CI/CD with automated testing across both backends
The goal is GPU-accelerated gate-level simulation with real cell timing — a first for open source. Current status:
| Component | Status |
|---|---|
| Liberty file parsing | Done — loads SKY130 HD cell delays |
| Gate delay computation | Done — per-AIG-pin delays from Liberty |
| SDF back-annotation | Done — post-layout delays from SDF files |
| CPU timing simulation | Done — arrival time propagation with setup/hold checking |
| GPU timing simulation | Done — setup/hold violation detection on GPU (Metal + CUDA) |
| SKY130 timing test suite | Done — post-P&R test circuits with SDF |
| Unified loom sim CLI | Done — timing constraints wired to both Metal and CUDA kernels |
Next steps:
- Timing-aware bit packing for improved GPU utilization
- Multi-clock domain support
Requires the Rust toolchain.
git clone https://github.com/ChipFlow/Loom.git
cd Loom
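git submodule update --init --recursive

To build with the Metal backend (macOS):

cargo build -r --features metal --bin loom

To build with the CUDA backend (Linux), which requires the CUDA toolkit to be installed:

cargo build -r --features cuda --bin loom

Loom operates in two phases: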
- Map your synthesized gate-level netlist to a .gemparts file (one-time cost):
cargo run -r --bin loom -- map design.gv design.gemparts
- Simulate with a VCD input waveform:
# Metal (macOS) - use NUM_BLOCKS=1
cargo run -r --features metal --bin loom -- sim design.gv design.gemparts input.vcd output.vcd 1
# CUDA (Linux) - set NUM_BLOCKS to 2x your GPU's SM count
cargo run -r --features cuda --bin loom -- sim design.gv design.gemparts input.vcd output.vcd NUM_BLOCKS
# With SDF timing back-annotation:
cargo run -r --features metal --bin loom -- sim design.gv design.gemparts input.vcd output.vcd 1 \
--sdf design.sdf --sdf-corner typ

See docs/usage.md for full documentation including synthesis preparation, VCD scope handling, and troubleshooting.
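The two phases can be wrapped in a small script so that mapping only reruns when the netlist changes. The following is a minimal sketch, not part of Loom: the helper names `needs_map`, `num_blocks`, and `run_loom` are hypothetical, and the SM count must be supplied by hand because this sketch does not query the GPU. It only uses the `map` and `sim` invocations shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers around the documented `loom map` / `loom sim` commands.

# Succeeds when the partition file is missing or older than the netlist,
# i.e. when the one-time mapping step has to be (re)run.
needs_map() {
  local netlist=$1 parts=$2
  [[ ! -f "$parts" || "$netlist" -nt "$parts" ]]
}

# NUM_BLOCKS rule of thumb from this README:
# 1 on Metal, twice the GPU's SM count on CUDA.
num_blocks() {
  local backend=$1 sm_count=${2:-0}
  if [[ "$backend" == metal ]]; then
    echo 1
  else
    echo $(( 2 * sm_count ))
  fi
}

# Usage: run_loom <metal|cuda> <sm_count> <netlist> <parts> <in.vcd> <out.vcd>
run_loom() {
  local backend=$1 sm_count=$2 netlist=$3 parts=$4 in_vcd=$5 out_vcd=$6
  if needs_map "$netlist" "$parts"; then
    cargo run -r --bin loom -- map "$netlist" "$parts"
  fi
  cargo run -r --features "$backend" --bin loom -- \
    sim "$netlist" "$parts" "$in_vcd" "$out_vcd" \
    "$(num_blocks "$backend" "$sm_count")"
}
```

For example, on a GPU with 108 SMs (such as an NVIDIA A100), `run_loom cuda 108 design.gv design.gemparts input.vcd output.vcd` would pass `216` as NUM_BLOCKS. On macOS, `run_loom metal 0 ...` passes `1`.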
Browse the full documentation online or build it locally with mdbook:
mdbook serve # opens at http://localhost:3000

Limitations:
- Only supports non-interactive testbenches (static VCD input waveforms)
- Synchronous logic only (no latches or async sequential logic)
- Clock gates must use the CKLNQD module from aigpdk.v
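Because only static VCD input is supported, all stimulus has to be written to a file before simulation starts. As a rough illustration of what such a file can look like, the sketch below generates a free-running clock and an active-high reset. The function name `gen_vcd`, the scope `testbench`, the signal names `clk` and `rst`, and the 1 ns timescale are all placeholders; the scope and signals must match your own design (see docs/usage.md for VCD scope handling).

```bash
#!/usr/bin/env bash
# Emit a minimal static VCD stimulus: a clock plus a reset held high for the
# first few cycles. Scope and signal names are hypothetical placeholders.
# Usage: gen_vcd <cycles> <reset_cycles>
gen_vcd() {
  local cycles=$1 reset_cycles=$2 half_period=5
  # Header: declare two 1-bit wires with identifier codes ! (clk) and " (rst),
  # then set their initial values at time 0.
  cat <<'EOF'
$timescale 1ns $end
$scope module testbench $end
$var wire 1 ! clk $end
$var wire 1 " rst $end
$upscope $end
$enddefinitions $end
#0
0!
1"
EOF
  local i t
  for ((i = 0; i < cycles; i++)); do
    t=$(( (2 * i + 1) * half_period ))
    echo "#$t"
    echo '1!'                      # rising clock edge
    t=$(( (2 * i + 2) * half_period ))
    echo "#$t"
    echo '0!'                      # falling clock edge
    if (( i + 1 == reset_cycles )); then
      echo '0"'                    # release reset on this falling edge
    fi
  done
}
```

Running `gen_vcd 100 4 > input.vcd` would write 100 clock cycles with reset released after the fourth cycle. Real testbenches usually dump this file from an existing simulation rather than generating it by hand.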
Pre-synthesized benchmark designs are in benchmarks/dataset/ (git submodule). See benchmarks/README.md for instructions.
Available designs: NVDLA, Rocket, Gemmini.
Loom builds on the GEM research. Please cite the original paper if you find this work useful.
@inproceedings{gem,
author = {Guo, Zizheng and Zhang, Yanqing and Wang, Runsheng and Lin, Yibo and Ren, Haoxing},
booktitle = {Proceedings of the 62nd Annual Design Automation Conference 2025},
organization = {IEEE},
title = {{GEM}: {GPU}-Accelerated Emulator-Inspired {RTL} Simulation},
year = {2025}
}

Apache-2.0. See LICENSE for details.