Learning Chess Structure Without Rules: Discrete Diffusion on FEN Token Sequences

Ashish Behal · Department of Computer Science and Engineering · University at Buffalo

Overview

Can a generative model learn what a valid chess position looks like purely from examples — without ever being told the rules?

This project applies D3PM (Discrete Denoising Diffusion Probabilistic Models) to chess board positions encoded as token sequences. The model is trained on 500,000 puzzle positions from the Lichess database and generates positions through a learned denoising process, with no explicit knowledge of chess rules.

The model generates valid positions 69.5% of the time, compared to 16.3% for the random baseline, a ratio of roughly 4.3. It also closely matches the training distribution on pawn structure metrics (white pawn count and passed pawn count), suggesting it has learned chess structure implicitly from data.

Key Results

Source	Valid Positions	Valid %
D3PM (ours)	695 / 1000	69.5%
Random baseline	1000 / 6148 attempts	16.3%
Training data	1000 / 1000	100.0%

KL Divergence from Training Distribution (lower is better):

Metric	D3PM	Random
White pawn count	0.092	17.640
Passed pawn count	0.010	0.088
Material balance	0.572	0.577
Total material	3.217	5.582

Architecture

Model: DDiT-Llama Transformer — 30.8M parameters, 512-dimensional embeddings, 6 layers
Diffusion framework: D3PM with uniform noise corruption over 1,000 timesteps
Input representation: 72-token integer sequences encoding full FEN board state
Training data: 500,000 chess puzzle positions from the Lichess database
Optimizer: AdamW with linear warmup, learning rate 2×10⁻⁴, batch size 256
Training hardware: NVIDIA B200 GPU (~25 minutes for 10 epochs)
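
The uniform noise corruption listed above can be sketched in a few lines. Under a uniform D3PM kernel, a token at timestep t is either still its original value or has been replaced by a uniform draw from the vocabulary. This is a minimal illustration, not the code in d3pm_runner.py: the linear beta schedule, its endpoints, and the single shared vocabulary size of 13 are assumptions made for the example.

```python
import random

def alpha_bar(t, betas):
    """Probability that a token is still untouched after t corruption steps."""
    keep = 1.0
    for beta in betas[:t]:
        keep *= 1.0 - beta
    return keep

def corrupt(tokens, t, betas, vocab_size, rng=random):
    """Sample x_t given x_0 under a uniform D3PM kernel.

    Each token survives with probability alpha_bar(t); otherwise it is
    resampled uniformly from the vocabulary (and may land on its old value).
    """
    keep = alpha_bar(t, betas)
    return [
        tok if rng.random() < keep else rng.randrange(vocab_size)
        for tok in tokens
    ]

# Hypothetical linear schedule over the 1,000 timesteps named above.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
```

With this schedule alpha_bar(T, betas) is close to zero, so a fully corrupted sequence is nearly uniform noise, which is the state the reverse process starts from at generation time.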

FEN Tokenization

Each chess position is encoded as a 72-token integer sequence:

Tokens 0–63: Board squares (one per square), each taking one of 13 values (empty or one of 12 piece types)
Tokens 64–71: Metadata — side to move, castling rights (4 binary tokens), en passant target, halfmove clock, fullmove counter

This tokenization follows the approach of Ruoss et al. (2024), who showed that clean structured token representations outperform raw PGN notation for neural chess models.
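
The layout described above can be illustrated with a small encoder and decoder. This is a sketch under stated assumptions, not the contents of tokenizer.py: the piece-to-id mapping, the en passant encoding (0 for none, 1 to 8 for the file), and the clock cap of 99 are hypothetical choices that happen to produce 72 tokens.

```python
PIECES = ".PNBRQKpnbrqk"   # index = token id; 0 is an empty square (assumed mapping)
CASTLING = "KQkq"

def fen_to_tokens(fen, max_clock=99):
    """Encode a FEN string as 64 square tokens followed by 8 metadata tokens."""
    board, side, castling, ep, halfmove, fullmove = fen.split()
    tokens = []
    for rank in board.split("/"):              # FEN lists rank 8 first
        for ch in rank:
            if ch.isdigit():
                tokens.extend([0] * int(ch))   # run of empty squares
            else:
                tokens.append(PIECES.index(ch))
    if len(tokens) != 64:
        raise ValueError("board field does not describe 64 squares")
    tokens.append(0 if side == "w" else 1)
    tokens.extend(1 if flag in castling else 0 for flag in CASTLING)
    tokens.append(0 if ep == "-" else 1 + "abcdefgh".index(ep[0]))
    tokens.append(min(int(halfmove), max_clock))   # capped, so lossy above max_clock
    tokens.append(min(int(fullmove), max_clock))
    return tokens

def tokens_to_fen(tokens):
    """Invert fen_to_tokens (exact whenever the clocks were not capped)."""
    ranks = []
    for r in range(8):
        out, empty = "", 0
        for tok in tokens[8 * r: 8 * r + 8]:
            if tok == 0:
                empty += 1
                continue
            if empty:
                out += str(empty)
                empty = 0
            out += PIECES[tok]
        ranks.append(out + (str(empty) if empty else ""))
    side = "w" if tokens[64] == 0 else "b"
    castling = "".join(f for f, on in zip(CASTLING, tokens[65:69]) if on) or "-"
    if tokens[69] == 0:
        ep = "-"
    else:
        # The en passant rank is implied by the side to move.
        ep = "abcdefgh"[tokens[69] - 1] + ("6" if side == "w" else "3")
    return f"{'/'.join(ranks)} {side} {castling} {ep} {tokens[70]} {tokens[71]}"
```

Storing only the en passant file keeps that field to a single small-vocabulary token, since the rank can be recovered from the side to move.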

Evaluation Framework

Positions are evaluated at two levels:

Level 1 (Syntactic Validity): Generated token sequences are converted back to FEN strings and validated using python-chess. This checks hard legality constraints: exactly one king per side, no pawns on the back rank, valid piece counts, and the non-moving side not being in check.
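
To make the Level 1 constraints concrete, here is a pure-Python sketch that tests a subset of them directly on the FEN board field. It is a simplified stand-in written for illustration, not what evaluate.py does: the project relies on python-chess (which exposes Board.is_valid() and Board.status() for this purpose), and this sketch deliberately omits the check-detection rule. The function name and error strings are hypothetical.

```python
def basic_validity_errors(fen):
    """Return a list of violated constraints (an empty list means none found)."""
    ranks = fen.split()[0].split("/")
    if len(ranks) != 8:
        return ["board must have 8 ranks"]
    errors = []
    counts = {}
    for rank in ranks:
        width = 0
        for ch in rank:
            if ch.isdigit():
                width += int(ch)               # run of empty squares
            else:
                width += 1
                counts[ch] = counts.get(ch, 0) + 1
        if width != 8:
            errors.append(f"rank {rank!r} does not span 8 files")
    for king, name in (("K", "white"), ("k", "black")):
        if counts.get(king, 0) != 1:
            errors.append(f"{name} must have exactly one king")
    # ranks[0] is rank 8 and ranks[7] is rank 1 in FEN order.
    if any(ch in "Pp" for ch in ranks[0] + ranks[7]):
        errors.append("pawn on a back rank")
    for is_side, pawn, name in ((str.isupper, "P", "white"),
                                (str.islower, "p", "black")):
        if counts.get(pawn, 0) > 8:
            errors.append(f"{name} has more than 8 pawns")
        if sum(n for ch, n in counts.items() if is_side(ch)) > 16:
            errors.append(f"{name} has more than 16 pieces")
    return errors
```

A position that passes these checks can still be illegal, for example when the side that just moved is left in check, so this is only a cheap first filter.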

Level 2 (Distributional Realism): KL divergence is computed between generated positions and the training distribution across four structural metrics: white pawn count, material balance, passed pawn count, and total material. A random baseline (positions generated by randomly placing pieces subject to hard legality constraints) provides a null model representing a system that learned nothing beyond the rules.
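
The Level 2 comparison can be sketched as two steps: reduce each position to integer-valued metrics, then compare the resulting histograms. The code below is an assumption-laden illustration rather than the logic in evaluate.py. It uses the conventional piece values 1/3/3/5/9, additive smoothing with a hypothetical eps, and treats the first argument as the reference distribution; the direction of the divergence and the smoothing actually used are not stated in this README. Passed pawn counting is left out for brevity.

```python
import math
from collections import Counter

VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}   # conventional piece values (assumed)

def board_metrics(fen):
    """Compute three of the four structural metrics from a FEN string."""
    placement = fen.split()[0]
    white = sum(VALUES.get(ch.lower(), 0) for ch in placement if ch.isupper())
    black = sum(VALUES.get(ch, 0) for ch in placement if ch.islower())
    return {
        "white_pawns": placement.count("P"),
        "material_balance": white - black,
        "total_material": white + black,
    }

def kl_divergence(samples_p, samples_q, eps=1e-6):
    """KL(P || Q) between two empirical distributions over integer values."""
    support = sorted(set(samples_p) | set(samples_q))
    count_p, count_q = Counter(samples_p), Counter(samples_q)
    # Additive smoothing keeps the log finite where one side has no mass.
    z_p = len(samples_p) + eps * len(support)
    z_q = len(samples_q) + eps * len(support)
    total = 0.0
    for v in support:
        p = (count_p[v] + eps) / z_p
        q = (count_q[v] + eps) / z_q
        total += p * math.log(p / q)
    return total
```

For example, collecting board_metrics(fen)["white_pawns"] over the training sample and over the generated sample gives the two lists to pass to kl_divergence. With a small eps, values that appear in one sample but never in the other dominate the result, which is one way a baseline can end up with a very large divergence on a single metric.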

Repository Structure

ChessDiffusion/
├── prepare_data.py     # Download and extract FEN strings from Lichess puzzle CSV
├── tokenizer.py        # FEN ↔ token sequence conversion
├── train.py            # D3PM training loop
├── evaluate.py         # Level 1 validity + Level 2 distributional evaluation
├── decompress.py       # Decompress the Lichess .zst file
├── requirements.txt    # Python dependencies
└── README.md

Note: Model checkpoints (checkpoints/) and tokenized data (tokens.npy) are not included in this repository due to file size. See the setup guide below to reproduce from scratch.

Setup & Reproduction Guide

1. Clone the repository

git clone https://github.com/Ashishkabaab/ChessDiffusion.git
cd ChessDiffusion

2. Clone the d3pm dependency

git clone https://github.com/cloneofsimo/d3pm.git

This provides d3pm_runner.py and dit.py, which are required to run the model. They are not included in this repository because d3pm is a separate project.

3. Install dependencies

pip install -r requirements.txt

Requires Python 3.8+ and a CUDA-capable GPU for training. The model was trained on an NVIDIA B200 with CUDA 12.8 and PyTorch 2.12.

4. Download the Lichess puzzle database

wget https://database.lichess.org/lichess_db_puzzle.csv.zst

This file is approximately 1.5GB compressed.

5. Decompress the dataset

python decompress.py

6. Prepare training data

python prepare_data.py

This extracts FEN strings from the puzzle CSV and tokenizes them into 72-token sequences, saving the result to tokens.npy.
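
As a rough picture of the extraction step, the sketch below reads the FEN column from the decompressed CSV using only the standard library. It is not the code in prepare_data.py. It assumes the file has a header row with a column named FEN, that the decompressed file is called lichess_db_puzzle.csv, and that the first 500,000 rows are taken; the function name iter_fens and the limit parameter are hypothetical.

```python
import csv

def iter_fens(csv_file, limit=None):
    """Yield the FEN column from an open puzzle CSV file object."""
    reader = csv.DictReader(csv_file)   # relies on a header row naming the columns
    for i, row in enumerate(reader):
        if limit is not None and i >= limit:
            break
        yield row["FEN"]

def load_fens(path="lichess_db_puzzle.csv", limit=500_000):
    """Read up to `limit` FEN strings from the file at `path` (assumed name)."""
    with open(path, newline="") as f:
        return list(iter_fens(f, limit=limit))
```

Taking a generator over the file keeps memory use flat, which matters because the decompressed CSV holds far more rows than the 500,000 used for training.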

7. Train and save the model

python train.py

Training runs for 10 epochs over 500,000 FEN sequences. Checkpoints are saved to checkpoints/. On an NVIDIA B200, training takes approximately 25 minutes.
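
The figures above fix the length of the run. A short sketch of the arithmetic, together with one common form of linear warmup, follows. The batch size of 256 and learning rate of 2×10⁻⁴ come from the Architecture section; the warmup length of 1,000 steps, the constant rate after warmup, and keeping the final partial batch are assumptions, not details taken from train.py.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch, assuming the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

def warmup_lr(step, base_lr=2e-4, warmup_steps=1000):
    """Linear ramp up to base_lr over warmup_steps, then constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

per_epoch = steps_per_epoch(500_000, 256)   # 1954 steps
total_steps = 10 * per_epoch                # 19540 steps over 10 epochs
```

Under these assumptions the warmup covers roughly the first half of the first epoch, after which the rate stays at its base value.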

8. Evaluate

python evaluate.py

This generates 1,000 positions from the trained model and reports:

Level 1 validity rate
KL divergence across all four structural metrics
Game phase distribution (opening / middlegame / endgame)

Limitations

The model generates too many opening-like positions relative to the puzzle training distribution (67.2% opening vs 22.6% in training data). This likely reflects difficulty learning long-range dependencies: total material on the board depends on all 64 square tokens jointly, which is harder to learn than local pawn patterns.

Future Work

Train longer with a larger dataset
Design a structured transition matrix that encodes chess-specific domain knowledge into the D3PM corruption process
Explore conditional generation (e.g. generate positions matching a specific game phase)

References

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. NeurIPS 34.
cloneofsimo (2021). d3pm: Discrete Denoising Diffusion Probabilistic Models.
El-Kishky, A., et al. (2025). Generating Creative Chess Puzzles. arXiv preprint.
Heusel, M., et al. (2017). GANs Trained by a Two Timescale Update Rule Converge to a Local Nash Equilibrium. NeurIPS 30.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 33.
Lichess (2024). Lichess Open Database.
Monroe, D. and Chalmers, P. A. (2024). Mastering Chess with a Transformer Model. arXiv preprint.
Ruoss, A., et al. (2024). Grandmaster-Level Chess Without Search. arXiv preprint.
Wu, Y., et al. (2025). Implicit Search via Discrete Diffusion: A Study on Chess. arXiv preprint.

License

MIT License. See LICENSE for details.
