tatara

English | 日本語

tatara

A fast Rust trainer for shogi NNUE evaluation networks.

tatara trains shogi NNUE (Efficiently Updatable Neural Network) evaluation networks on the GPU. It is written in Rust end to end, from host to device: GPU kernels are compiled to PTX at build time by cuda-oxide (NVIDIA Labs' Rust → PTX rustc backend) — no C / C++ / CUDA C++ anywhere in the pipeline.

Hand-fusing the GPU kernels makes it very fast — it out-throughputs its upstream CUDA C++ trainer bullet-shogi. Measured on an RTX 3080 Ti: for the LayerStack architecture, even the bit-identical default path is +37% over bullet-shogi, and stacking the opt-in FP16 modes reaches up to ~2.1×. For the Simple architecture (HalfKP 512x2-8-64), it is around +20% on the bit-identical default and around +55% with the opt-in FP16/TF32 modes (--all-optim).

tatara (踏鞴), the traditional Japanese furnace that smelts iron sand (raw material) into tamahagane steel — forging a net out of raw data.

NVIDIA only — because cuda-oxide only generates PTX, ROCm / AMD is out of scope. To train comparable shogi NNUE nets on an AMD GPU, see the upstream bullet-shogi, which has both CUDA and HIP backends.

What you can train

A trained NNUE is defined by two independent choices: the architecture (a subcommand), which fixes the network structure, and the input feature set (--feature-set), which fixes how a board position is turned into an input vector.

Architecture

Architecture	Subcommand	Structure
LayerStack	`layerstack`	Specializes the output layer per bucket by game progress (9 buckets; the same idea as Stockfish's "LayerStacks"). FT output `--ft-out` (default 1536) → 16 → 32
Simple	`simple`	A plain NNUE with no bucket split (FT → 2 hidden layers → single output). Layer dimensions are set with `--arch <l1>x2-<l2>-<l3>` (`l1` = FT output, `l2`/`l3` = hidden layers; default `256x2-32-32`); activation crelu / screlu / pairwise

Input feature set

--feature-set selects one of five (default halfka-hm-merged). They differ in how the kings are included as features:

`--feature-set`	King handling
`halfkp`	Kings themselves are not included as piece features
`halfka-split`	Kings are included; own-king and enemy-king features have separate slots
`halfka-merged`	Kings are included; own-king and enemy-king features share one slot
`halfka-hm-split`	`halfka-split` plus a left-right mirror so the king always sits on one side of the board, compressing king squares 81 → 45
`halfka-hm-merged` (default)	`halfka-merged` plus the same left-right mirror king-square compression

The default halfka-hm-merged applies to shogi the same design as Stockfish's HalfKAv2_hm (left-right king-square mirroring + own-king/enemy-king features sharing one slot).

A separate binary, progress-kpabs-train, is a KP-abs progress trainer that produces progress.bin, the bucket coefficients for LayerStack. The approach of learning game progress and assigning it to output buckets is based on an idea from a post by nodchip.

Setup

Requirements

OS — Linux is first-class; Windows is supported via WSL2; macOS cannot build the GPU crates
NVIDIA GPU (Ampere and later / sm_80+ is officially supported; Turing / sm_75 also runs simple kernels with the CUDA_OXIDE_TARGET=sm_75 environment variable)
CUDA Toolkit 12.x (verified with 12.9)
LLVM 21+ (llc-21 is the floor; llc-22 is recommended because it is needed for fully correct atomics syncscope)
Rust nightly (rust-toolchain.toml tracks the cuda-oxide upstream channel; do not change the channel yourself, since it depends on the rustc internal ABI)

To set up cargo-oxide, which builds the GPU kernels, run bash scripts/setup-cuda-oxide.sh. For detailed installation steps, per-OS guidance, and the supported-GPU matrix, see docs/setup.md.

Build and train

For building the kernels and running the smoke test, see docs/setup.md; for how to run training, see docs/training-quickstart.md.

Documentation

Setup guide — per-OS guidance, CUDA / LLVM / cargo-oxide setup, supported-GPU matrix, CUDA toolkit root resolution
Training quickstart — per-architecture training examples + key CLI options + resume / checkpoint workflow
ADR (Architecture Decision Records) — design decisions and their rationale
Fused kernel catalog — which kernel does what
Arch string — how the architecture-description string embedded in the quantised .bin header is assembled and checked at load time

Using the trained net

The quantised .bin that tatara produces is designed to be loaded by the rshogi engine. The .bin header and the SCReLU / Pairwise activations are specific to this project, so other shogi engines such as YaneuraOu cannot necessarily load it as-is; depending on the architecture, some nets need additional inference code before they can be loaded. Pre-trained reference nets are attached to the GitHub Releases. To train your own net, see the setup guide.

Glossary

Abbreviation	Meaning
NNUE	Efficiently Updatable Neural Network — a lightweight evaluation function used by shogi / chess engines
FT	Feature Transformer — the NNUE's sparse-input → dense layer
PSV	PackedSfenValue — a training-data format from bullet-shogi (one position + score + WDL)
KP / KP-abs	King-Piece relative feature and its absolute-value variant (for progress / entering-king detection)
bucket	Per-output-bucket weight separation (branching by game phase / progress)
CReLU / SCReLU / Pairwise	NNUE activation functions. CReLU = Clipped ReLU, SCReLU = Squared Clipped ReLU, Pairwise = elementwise product of the first and second halves, halving the input dimension. Selected by `--activation` on the `simple` architecture
RAdam / Ranger	Rectified Adam / Ranger optimizer (Ranger = RAdam + lookahead)
WRM	Win-rate model loss (from bullet `--win-rate-model`)
SPRT	Sequential Probability Ratio Test — a method that plays two nets against each other and sequentially tests the strength difference. Used to confirm the quality of a trained net
superbatch	A bullet term: the unit of "multiple batches treated as one, advancing the lr/wdl scheduler"
PTX	Parallel Thread Execution — a virtual ISA for NVIDIA GPUs. CUDA C++ / Rust → PTX (`.ptx` text) → the CUDA driver's JIT compiles it to SASS (real machine code) for execution. It is portable across generations (PTX built for sm_80 runs forward-compatibly on sm_86/89/90). See the supported-GPU matrix in `docs/setup.md`
SASS	Per-generation real machine code for NVIDIA GPUs. The terminal form the CUDA driver JIT produces from PTX. This repository does not handle it directly
sm_XX	The compute capability of an NVIDIA GPU (e.g. sm_75 = Turing, sm_86 = Ampere RTX 30xx). Used to specify the target architecture when generating PTX (`CUDA_OXIDE_TARGET=sm_86`, etc.)

Related repositories

rshogi — the shogi engine that loads and plays with the NNUE trained here
bullet — upstream (NNUE training framework)
bullet-shogi — a shogi-oriented fork of bullet
cuda-oxide — the Rust → PTX rustc backend

License

MIT (see LICENSE). For the scope of code taken from bullet-shogi / bullet / cuda-oxide and license compatibility, see ATTRIBUTION.md.

Name		Name	Last commit message	Last commit date
Latest commit History 139 Commits
.github/workflows		.github/workflows
bins		bins
crates		crates
docs		docs
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
AGENTS.md		AGENTS.md
ATTRIBUTION.md		ATTRIBUTION.md
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.ja.md		README.ja.md
README.md		README.md
rust-toolchain.toml		rust-toolchain.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tatara

What you can train

Architecture

Input feature set

Setup

Requirements

Build and train

Documentation

Using the trained net

Glossary

Related repositories

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

tatara

What you can train

Architecture

Input feature set

Setup

Requirements

Build and train

Documentation

Using the trained net

Glossary

Related repositories

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages