- Status: v0.1.5 community testing
- Target: replacement for Cuckaroo29 (C29) in Tari (XTM)
- Goal: GPU-native, ASIC-resistant proof-of-work; low power; cheap verifier

GPUx is a candidate proof-of-work algorithm designed to keep GPU mining durable against ASIC takeover. It combines random per-epoch programs, a 2 GiB random-access DAG, and a per-thread scratchpad so that any would-be ASIC ends up looking like a GPU, at which point it has no cost advantage.
Three artifacts in this repo:

- Algorithm spec (ALGORITHM_SPEC.md): formal definition.
- Reference C implementation (spec/): the authoritative semantics.
- CUDA implementation + bench harness (cuda/, bench/): what community testers run on their GPUs.

If you are a community tester, jump to COMMUNITY_TESTING.md.
If you are reviewing the algorithm, start with ALGORITHM_SPEC.md
and then docs/DESIGN_RATIONALE.md.
| Metric | Value |
|---|---|
| Hashrate | ~1.25 MH/s |
| DAG generation | 2 GiB in ~30 ms (~65 GB/s) |
| Per-share verify | ~0.5 ms (warm DAG) |
| GPU vs reference | bit-identical (5/5 KAT nonces) |
These are baseline numbers from a reference port. Optimized kernels (warp-cooperative DAG access, shared-memory scratchpad, instruction reordering) are expected to multiply throughput 2–5× without changing consensus.
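
The "bit-identical (5/5 KAT nonces)" row, and the requirement that optimized kernels must not change consensus, both come down to a byte-for-byte comparison against frozen vectors. The following is a minimal sketch of such a check, not the repo's actual harness: the `kat_vector` layout, the `hash_fn` callback signature, and `toy_hash` are hypothetical names introduced for illustration, and the real layout of spec/test_vectors.h may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical KAT record: one nonce and its expected 32-byte digest. */
typedef struct {
    uint64_t nonce;
    uint8_t  expected[32];
} kat_vector;

/* Hypothetical callback: any implementation under test (reference, CUDA wrapper, ...). */
typedef void (*hash_fn)(uint64_t nonce, uint8_t out[32]);

/* Returns how many vectors produce a digest that differs from the frozen one.
 * Zero means the implementation is bit-identical on every vector. */
size_t kat_count_mismatches(hash_fn fn, const kat_vector *v, size_t n) {
    assert(fn != NULL && (v != NULL || n == 0));
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t got[32];
        fn(v[i].nonce, got);
        if (memcmp(got, v[i].expected, sizeof got) != 0) {
            mismatches++;
        }
    }
    return mismatches;
}

/* Toy stand-in for the real hash, only so the sketch can be exercised. */
void toy_hash(uint64_t nonce, uint8_t out[32]) {
    for (int i = 0; i < 32; i++) {
        out[i] = (uint8_t)((nonce >> (8 * (i % 8))) ^ (uint64_t)i);
    }
}
```

Any candidate kernel can be passed in as `fn`; a nonzero return value means its output diverges from the frozen vectors.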
ASICs win when the algorithm is small, homogeneous, and predictable. GPUx attacks each premise:
| ASIC-friendly property | GPUx mechanism |
|---|---|
| Predictable kernel | Random program regenerated every 1024 blocks |
| Small kernel | 256 ops × 64 iters = 16 384 ops/nonce, 12 distinct opcodes, 32 64-bit lanes |
| Cheap memory | 2 GiB DAG with random dependent access (forces GDDR/HBM) |
| No cache | 16 KiB per-thread scratchpad with R-M-W (forces L1-equivalent) |
| One datapath | Mix of 64-bit int ALU, MULHI, AES round, IEEE-754 FP32 FMA |
| Throughput parallel | Latency-bound dependent chains limit pipelining |
Long-form analysis with comparisons to Ethash, ProgPoW, RandomX, Cuckaroo,
and X16R is in docs/DESIGN_RATIONALE.md.
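
The figures in the table can be written down as constants with a few derived quantities. This is an illustrative sketch only: the macro and function names are hypothetical (they are not taken from spec/gpux.h), epochs are assumed to be counted from height 0, and the memory estimate assumes every resident thread's scratchpad lives in device memory alongside the DAG.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the table above; names are hypothetical. */
#define GPUX_EPOCH_BLOCKS     1024u          /* program regenerated every 1024 blocks */
#define GPUX_OPS_PER_PROGRAM  256u
#define GPUX_ITERATIONS       64u
#define GPUX_DAG_BYTES        (2ull << 30)   /* 2 GiB */
#define GPUX_SCRATCHPAD_BYTES (16u << 10)    /* 16 KiB per thread */

/* 256 ops x 64 iterations = 16 384 ops per nonce, checked at compile time. */
static_assert(GPUX_OPS_PER_PROGRAM * GPUX_ITERATIONS == 16384u,
              "ops per nonce should match the table");

/* Assumption: epoch 0 starts at height 0 and a new epoch begins at each multiple of 1024. */
uint64_t gpux_epoch_of_height(uint64_t height) {
    return height / GPUX_EPOCH_BLOCKS;
}

uint32_t gpux_ops_per_nonce(void) {
    return GPUX_OPS_PER_PROGRAM * GPUX_ITERATIONS;
}

/* Rough device-memory footprint: the DAG plus one scratchpad per resident thread. */
uint64_t gpux_device_bytes(uint64_t resident_threads) {
    return GPUX_DAG_BYTES + resident_threads * GPUX_SCRATCHPAD_BYTES;
}
```

For example, under these assumptions 65 536 resident threads add 1 GiB of scratchpad on top of the 2 GiB DAG.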
```
gpux/
├── ALGORITHM_SPEC.md         formal algorithm spec
├── COMMUNITY_TESTING.md      how to run tests and submit results
├── README.md                 this file
├── Makefile                  builds reference + tests (Linux/WSL/macOS)
├── spec/                     reference C implementation
│   ├── gpux.h / gpux.c       algorithm reference (the source of truth)
│   ├── blake2b.c+h           embedded BLAKE2b reference
│   ├── chacha20.c+h          embedded ChaCha20 reference
│   ├── aes_round.c+h         embedded AES single-round reference
│   └── test_vectors.h        frozen KAT (regenerate with `make gen-kat`)
├── tests/
│   ├── smoke.c               primitive correctness (BLAKE2b, ChaCha20, AES, KAT generators)
│   ├── kat.c                 full hash KAT (allocates 2 GiB)
│   └── gen_kat.c             regenerate test_vectors.h
├── cuda/                     CUDA implementation
│   ├── gpux_kernel.cu        the mining kernel
│   ├── gpux_device.cuh       device-side BLAKE2b/ChaCha20/AES
│   ├── gpux_miner.cu         host driver: verify, bench, info
│   ├── Makefile              Linux/WSL build
│   └── build.bat             Windows build (vcvars + nvcc)
├── bench/                    community testing
│   ├── run_bench.ps1         Windows harness
│   ├── run_bench.sh          Linux harness
│   └── results/              per-GPU JSON results (created on first run)
└── docs/
    └── DESIGN_RATIONALE.md   why each design choice; ASIC-resistance argument
```

Reference build and tests (Linux/WSL/macOS):

```
make smoke    # primitive tests, no DAG
make kat      # full KAT (allocates 2 GiB)
```

CUDA build (Linux/WSL):

```
cd cuda && make
./gpux_miner verify
./gpux_miner bench 30
```

CUDA build (Windows). Requires Visual Studio 2022 BuildTools + CUDA 13.x.

```
cd cuda
.\build.bat
.\gpux_miner.exe verify
.\gpux_miner.exe bench 30
```

Or use the testing wrapper:

```
.\bench\run_bench.ps1 -Seconds 60
```

Tari's existing block header is hashed with BLAKE2b-256 to produce a 32-byte digest. To use GPUx as a PoW algorithm:

```
header_digest = BLAKE2b-256(serialized_block_header_excluding_nonce)
block_hash    = GPUx(header_digest, nonce)
```

Difficulty target and Tari's multi-algo selection layer integrate at the
consensus boundary. See ALGORITHM_SPEC.md §11.
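
The two-step construction above can be sketched as a nonce search that takes the PoW function as a callback. This is a hedged illustration, not the repo's API: `pow_fn`, `meets_target`, `search_nonce`, and `toy_pow` are hypothetical names, and the target rule used here (hash read as a big-endian 256-bit integer must be less than or equal to the target) is an assumption. The actual difficulty rule is whatever Tari consensus and ALGORITHM_SPEC.md §11 define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature for block_hash = GPUx(header_digest, nonce). */
typedef void (*pow_fn)(const uint8_t header_digest[32], uint64_t nonce,
                       uint8_t out[32]);

/* Assumed rule: compare as big-endian 256-bit integers. memcmp over the raw
 * bytes gives exactly that ordering. */
bool meets_target(const uint8_t hash[32], const uint8_t target[32]) {
    return memcmp(hash, target, 32) <= 0;
}

/* Tries nonces start, start+1, ..., start+count-1 and stores the first one
 * whose hash meets the target. Returns false if none does. */
bool search_nonce(pow_fn pow, const uint8_t header_digest[32],
                  const uint8_t target[32], uint64_t start, uint64_t count,
                  uint64_t *found) {
    assert(pow != NULL && found != NULL);
    for (uint64_t i = 0; i < count; i++) {
        uint8_t hash[32];
        uint64_t nonce = start + i;
        pow(header_digest, nonce, hash);
        if (meets_target(hash, target)) {
            *found = nonce;
            return true;
        }
    }
    return false;
}

/* Toy stand-in for GPUx so the sketch can run without a 2 GiB DAG:
 * XORs the digest with the little-endian bytes of the nonce, repeated. */
void toy_pow(const uint8_t header_digest[32], uint64_t nonce, uint8_t out[32]) {
    for (int i = 0; i < 32; i++) {
        out[i] = (uint8_t)(header_digest[i] ^ (uint8_t)(nonce >> (8 * (i % 8))));
    }
}
```

Passing the hash as a callback keeps the search loop independent of which implementation (reference C or a CUDA wrapper) computes it.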
- Spec frozen for testing
- Reference C impl, deterministic
- KAT (1 epoch_seed, 5 nonces) with bit-exact reference output
- CUDA impl matches reference
- Baseline RTX 5090 hashrate (1.25 MH/s)
- Cross-vendor FP32 determinism audit (NVIDIA Ada/Hopper/Blackwell vs AMD RDNA3/RDNA4 vs Intel)
- Light-verifier Merkle DAG witness
- Tari multi-algo selection integration
- Optimized CUDA kernel (warp-coop DAG, shmem scratchpad)
- OpenCL implementation for AMD/Intel
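
One generic reason a cross-vendor FP32 determinism audit can matter: a fused multiply-add rounds once, while a separate multiply and add round twice, so the same inputs can give different bits. The sketch below illustrates that effect in plain C. It is an illustration of IEEE-754 behavior under stated assumptions, not a claim about any particular vendor's compiler or about how GPUx specifies its FMA; the function names are hypothetical, and `fused_like` emulates single rounding via double arithmetic, which is exact for the demo inputs but is not a general-purpose FMA.

```c
/* Multiply, round to FP32, then add: two roundings.
 * `volatile` forces the product to be stored as a float and keeps the
 * compiler from contracting the expression into a fused multiply-add. */
float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

/* Stand-in for a fused multiply-add: compute in double, round to FP32 once.
 * Exact for the demo inputs (a = b = 1 + 2^-12, c = -(1 + 2^-11)); for other
 * inputs double rounding can make it differ from a true FMA. */
float fused_like(float a, float b, float c) {
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding it to FP32 first drops the 2^-24 term (a tie, rounded to even), so `mul_then_add` returns 0, while `fused_like` returns 2^-24.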
MIT; see LICENSE. Bundled reference primitives (BLAKE2b,
ChaCha20, AES round, Argon2id) are public-domain or CC0/Apache-2.0 and
remain so under MIT. The intent is full open-source auditability: fork
it, break it, propose changes via PR, and submit your own bench results
as JSON files in bench/results/.