Ashish Behal · Department of Computer Science and Engineering · University at Buffalo
Can a generative model learn what a valid chess position looks like purely from examples — without ever being told the rules?
This project applies D3PM (Discrete Denoising Diffusion Probabilistic Models) to chess board positions encoded as token sequences. The model is trained on 500,000 puzzle positions from the Lichess database and learns to generate valid chess positions through a learned denoising process, with no explicit knowledge of chess rules.
The model achieves 69.5% valid position generation compared to a 16.3% random baseline, more than four times the baseline rate. It also closely matches the training distribution on pawn structure metrics (KL divergence of 0.092 for white pawn count and 0.010 for passed pawn count), suggesting it has learned genuine chess structure implicitly from data.
| Source | Valid Positions | Valid % |
|---|---|---|
| D3PM (ours) | 695 / 1000 | 69.5% |
| Random baseline | 1000 / 6148 attempts | 16.3% |
| Training data | 1000 / 1000 | 100.0% |
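As a quick sanity check, the percentages and the "more than four times" comparison can be recomputed from the raw counts in the table. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Counts taken from the validity table above.
d3pm_valid, d3pm_total = 695, 1000
random_valid, random_attempts = 1000, 6148

d3pm_rate = d3pm_valid / d3pm_total            # fraction of D3PM samples that are valid
random_rate = random_valid / random_attempts   # fraction of random attempts that are valid
ratio = d3pm_rate / random_rate                # how many times higher the D3PM rate is

print(f"D3PM: {d3pm_rate:.1%}, random: {random_rate:.1%}, ratio: {ratio:.2f}x")
```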
KL Divergence from Training Distribution (lower is better):
| Metric | D3PM | Random |
|---|---|---|
| White pawn count | 0.092 | 17.640 |
| Passed pawn count | 0.010 | 0.088 |
| Material balance | 0.572 | 0.577 |
| Total material | 3.217 | 5.582 |
- Model: DDiT-Llama Transformer — 30.8M parameters, 512-dimensional embeddings, 6 layers
- Diffusion framework: D3PM with uniform noise corruption over 1,000 timesteps
- Input representation: 72-token integer sequences encoding full FEN board state
- Training data: 500,000 chess puzzle positions from the Lichess database
- Optimizer: AdamW with linear warmup, learning rate 2×10⁻⁴, batch size 256
- Training hardware: NVIDIA B200 GPU (~25 minutes for 10 epochs)
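To make the "uniform noise corruption" item concrete, here is a minimal sketch of the D3PM forward process with a uniform transition matrix. It relies on the fact that, for uniform transitions, a token at step t is either still its original value (with probability equal to the cumulative keep rate) or has been resampled uniformly over the vocabulary. The linear beta schedule, its endpoints, and the vocabulary size of 13 are assumptions for illustration, not values taken from this project; the metadata tokens may well need a larger vocabulary.

```python
import math
import random

NUM_CLASSES = 13      # assumed vocabulary size (13 square states); hypothetical
NUM_TIMESTEPS = 1000  # number of diffusion timesteps, from the list above

def alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Cumulative probability that a token has never been resampled after t steps.

    Assumes a linear beta schedule; the endpoints are hypothetical.
    """
    log_keep = 0.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (NUM_TIMESTEPS - 1)
        log_keep += math.log1p(-beta)  # accumulate log(1 - beta_s)
    return math.exp(log_keep)

def corrupt(tokens, t, rng=random):
    """Sample x_t from q(x_t | x_0) under uniform transitions.

    Each token is kept with probability alpha_bar(t); otherwise it is
    replaced by a value drawn uniformly from the whole vocabulary
    (which may happen to equal the original value).
    """
    keep = alpha_bar(t)
    return [tok if rng.random() < keep else rng.randrange(NUM_CLASSES) for tok in tokens]
```

At t = 0 the sequence is returned unchanged, and by the final timestep almost every token has been resampled, so the input is close to uniform noise.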
Each chess position is encoded as a 72-token integer sequence:
- Tokens 0–63: Board squares (one per square), each taking one of 13 values (empty or one of 12 piece types)
- Tokens 64–71: Metadata — side to move, castling rights (4 binary tokens), en passant target, halfmove clock, fullmove counter
This tokenization follows the approach of Ruoss et al. (2024), who showed that clean structured token representations outperform raw PGN notation for neural chess models.
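The layout above can be sketched in a few lines of Python. This is an illustrative encoder, not the repository's tokenizer.py: the piece ordering, the square ordering (FEN order, rank 8 first), and the way en passant and the two counters are mapped to single integers are all assumptions.

```python
PIECES = ".PNBRQKpnbrqk"  # index 0 = empty square; this ordering is an assumption

def fen_to_tokens(fen):
    """Encode a full FEN string as 64 board tokens plus 8 metadata tokens."""
    placement, side, castling, ep, halfmove, fullmove = fen.split()

    # Tokens 0-63: one token per square, in FEN order (a8 ... h1).
    board = []
    for rank in placement.split("/"):
        for ch in rank:
            if ch.isdigit():
                board.extend([0] * int(ch))   # a digit is a run of empty squares
            else:
                board.append(PIECES.index(ch))
    if len(board) != 64:
        raise ValueError(f"expected 64 squares, got {len(board)}")

    # En passant: 0 for none, otherwise 1 + square index (hypothetical scheme).
    if ep == "-":
        ep_token = 0
    else:
        ep_token = 1 + (ord(ep[0]) - ord("a")) + 8 * (int(ep[1]) - 1)

    # Tokens 64-71: side to move, four castling flags, en passant, two counters.
    meta = [
        0 if side == "w" else 1,
        int("K" in castling), int("Q" in castling),
        int("k" in castling), int("q" in castling),
        ep_token,
        int(halfmove),
        int(fullmove),
    ]
    return board + meta
```

For the standard starting position this yields a 72-element list whose last eight entries describe White to move with all four castling rights available.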
Positions are evaluated at two levels:
Level 1 — Syntactic Validity
Generated token sequences are converted back to FEN strings and validated using python-chess. This checks hard legality constraints: exactly one king per side, no pawns on the back rank, valid piece counts, and the non-moving side not being in check.
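To illustrate what these hard constraints look like, the sketch below checks a subset of them directly on the piece-placement field of a FEN string. It is a simplified stand-in, not the project's evaluation code: the function name is hypothetical, and it deliberately omits the "non-moving side not in check" test, which needs attack generation and is the kind of check python-chess handles.

```python
def structural_problems(fen):
    """Return a list of violated structural constraints (empty list = none found)."""
    placement = fen.split()[0]
    ranks = placement.split("/")
    if len(ranks) != 8:
        return ["board must have 8 ranks"]

    problems = []
    pieces = [ch for ch in placement if ch.isalpha()]

    # Exactly one king per side.
    for king, name in (("K", "white"), ("k", "black")):
        if pieces.count(king) != 1:
            problems.append(f"{name} must have exactly one king")

    # No pawns on the first or eighth rank.
    if any(ch in "Pp" for ch in ranks[0] + ranks[7]):
        problems.append("pawn on a back rank")

    # Piece-count limits: at most 8 pawns and 16 pieces per side.
    for is_side, pawn, name in ((str.isupper, "P", "white"), (str.islower, "p", "black")):
        if pieces.count(pawn) > 8:
            problems.append(f"{name} has more than 8 pawns")
        if sum(1 for ch in pieces if is_side(ch)) > 16:
            problems.append(f"{name} has more than 16 pieces")
    return problems
```

A position that passes this subset can still be invalid (for example, if the side that just moved left its own king in check), so it is a necessary filter rather than a full validity test.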
Level 2 — Distributional Realism
KL divergence is computed between generated positions and the training distribution across four structural metrics: white pawn count, material balance, passed pawn count, and total material. A random baseline (positions generated by randomly placing pieces subject to hard legality constraints) provides a null model representing a system that learned nothing beyond the rules.
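A minimal sketch of the KL computation for one metric is shown below. It builds empirical histograms from two lists of metric values (for example, white pawn counts) and compares them. The direction of the divergence (training as the reference distribution) and the additive smoothing constant are assumptions; the source does not specify either.

```python
import math
from collections import Counter

def kl_divergence(reference, generated, eps=1e-10):
    """KL(P_reference || Q_generated) over discrete metric values.

    A small eps is added to every bin so that values seen in only one
    sample do not produce log(0) or division by zero (assumed smoothing).
    """
    support = sorted(set(reference) | set(generated))
    p_counts, q_counts = Counter(reference), Counter(generated)

    # Smoothed, unnormalized bin masses over the shared support.
    p = [p_counts[v] / len(reference) + eps for v in support]
    q = [q_counts[v] / len(generated) + eps for v in support]
    p_sum, q_sum = sum(p), sum(q)

    # Renormalize and sum p * log(p / q) over all bins.
    return sum((pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum)) for pi, qi in zip(p, q))
```

With this kind of smoothing, two samples that barely overlap give a large but finite value, since the result is then dominated by the logarithm of the smoothing constant.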
ChessDiffusion/
├── prepare_data.py # Download and extract FEN strings from Lichess puzzle CSV
├── tokenizer.py # FEN ↔ token sequence conversion
├── train.py # D3PM training loop
├── evaluate.py # Level 1 validity + Level 2 distributional evaluation
├── decompress.py # Decompress the Lichess .zst file
├── requirements.txt # Python dependencies
└── README.md
Note: Model checkpoints (checkpoints/) and tokenized data (tokens.npy) are not included in this repository due to file size. See the setup guide below to reproduce from scratch.
git clone https://github.com/Ashishkabaab/ChessDiffusion.git
cd ChessDiffusion
git clone https://github.com/cloneofsimo/d3pm.git
This provides d3pm_runner.py and dit.py, which are required to run the model. They are excluded from this repo because d3pm is a separate project.
pip install -r requirements.txt
Requires Python 3.8+ and a CUDA-capable GPU for training. The model was trained on an NVIDIA B200 with CUDA 12.8 and PyTorch 2.12.
wget https://database.lichess.org/lichess_db_puzzle.csv.zst
This file is approximately 1.5 GB compressed.
python decompress.py
python prepare_data.py
This extracts FEN strings from the puzzle CSV and tokenizes them into 72-token sequences, saving the result to tokens.npy.
python train.py
Training runs for 10 epochs over 500,000 FEN sequences. Checkpoints are saved to checkpoints/. On an NVIDIA B200, training takes approximately 25 minutes.
python evaluate.py
Generates 1,000 positions from the trained model and reports:
- Level 1 validity rate
- KL divergence across all four structural metrics
- Game phase distribution (opening / middlegame / endgame)
The model generates too many opening-like positions relative to the puzzle training distribution (67.2% opening vs 22.6% in training data). This likely reflects difficulty learning long-range dependencies — total material on the board is determined by the interaction of all 64 square tokens simultaneously, which is harder to learn than local pawn patterns.
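The dependence on all 64 squares is easy to see when total material is written out: it is a sum over every occupied square, so no local window of tokens determines it. The sketch below computes it from a FEN string and maps it to a game phase. The piece values are the conventional 1/3/3/5/9 scale and the phase thresholds are hypothetical; neither is confirmed as what evaluate.py uses.

```python
# Conventional piece values (assumed); kings are not counted.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def total_material(fen):
    """Sum piece values over every occupied square for both sides."""
    placement = fen.split()[0]
    return sum(PIECE_VALUES[ch.lower()] for ch in placement if ch.isalpha())

def game_phase(fen, opening_min=62, endgame_max=30):
    """Classify a position by total material.

    The thresholds are hypothetical, chosen only to illustrate the idea.
    """
    material = total_material(fen)
    if material >= opening_min:
        return "opening"
    if material <= endgame_max:
        return "endgame"
    return "middlegame"
```

Under this scheme, a generated board with too many surviving pieces is pushed toward the "opening" bucket even if every local pattern on it looks plausible.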
- Train longer with a larger dataset
- Design a structured transition matrix that encodes chess-specific domain knowledge into the D3PM corruption process
- Explore conditional generation (e.g. generate positions matching a specific game phase)
- Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. NeurIPS 34.
- cloneofsimo (2021). d3pm: Discrete Denoising Diffusion Probabilistic Models.
- El-Kishky, A., et al. (2025). Generating Creative Chess Puzzles. arXiv preprint.
- Heusel, M., et al. (2017). GANs Trained by a Two Timescale Update Rule Converge to a Local Nash Equilibrium. NeurIPS 30.
- Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 33.
- Lichess (2024). Lichess Open Database.
- Monroe, D. and Chalmers, P. A. (2024). Mastering Chess with a Transformer Model. arXiv preprint.
- Ruoss, A., et al. (2024). Grandmaster-Level Chess Without Search. arXiv preprint.
- Wu, Y., et al. (2025). Implicit Search via Discrete Diffusion: A Study on Chess. arXiv preprint.
MIT License. See LICENSE for details.