## 📚 Data Overview

The training data was sourced from the [Lichess PGN database](https://database.lichess.org/), which provides large-scale collections of chess games. We used the games from February 2025.

To ensure high-quality examples, we applied the following filters:

- Both players must be rated **above 2400**.
- Games must have a time control of **at least 5 minutes per side**, to exclude low-effort blitz games.
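
The two filters can be sketched as a single predicate over a game's PGN headers. This is a minimal illustration, not the code in `pgn_filtering.py` or `pgn_time_filtering.py`: the function name `keep_game` and the constants are hypothetical, and it assumes the headers are available as a plain dictionary with the `WhiteElo`, `BlackElo`, and `TimeControl` tags used in Lichess exports.

```python
# Illustrative only: the real logic lives in pgn_filtering.py / pgn_time_filtering.py.
MIN_RATING = 2400        # both players must be rated above this value
MIN_BASE_SECONDS = 300   # 5 minutes per side


def keep_game(headers: dict) -> bool:
    """Return True if a game's PGN headers pass both filters."""
    try:
        white = int(headers["WhiteElo"])
        black = int(headers["BlackElo"])
        # Lichess writes TimeControl as "<base seconds>+<increment>", e.g. "300+0".
        base = int(headers["TimeControl"].split("+")[0])
    except (KeyError, ValueError):
        # Missing or non-numeric values such as "?" or "-" are rejected.
        return False
    return white > MIN_RATING and black > MIN_RATING and base >= MIN_BASE_SECONDS


print(keep_game({"WhiteElo": "2510", "BlackElo": "2455", "TimeControl": "600+0"}))  # True
print(keep_game({"WhiteElo": "2510", "BlackElo": "2455", "TimeControl": "180+2"}))  # False
```

Rejecting games with missing or unparsable headers is a design choice of this sketch; the original scripts may handle such games differently.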

This filtering process yielded:

- **22,596 games**
- **2,370,238 board positions**
- **23 shards** of positions (shards 0–22), each covering at most **1,000 games** and weighing **~500–700 MB** after processing.
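
As a quick sanity check on these counts, the arithmetic can be reproduced in a few lines. The sketch assumes shards are filled sequentially with at most 1,000 games each; the source does not state how the last shard is filled, so `GAMES_PER_SHARD` and the derived values are illustrative.

```python
import math

TOTAL_GAMES = 22_596
TOTAL_POSITIONS = 2_370_238
GAMES_PER_SHARD = 1_000  # assumed maximum shard size

# Number of shards needed if every shard except the last is full.
num_shards = math.ceil(TOTAL_GAMES / GAMES_PER_SHARD)
# Games left over for the final shard under that assumption.
last_shard_games = TOTAL_GAMES - (num_shards - 1) * GAMES_PER_SHARD
# Average number of board positions extracted per game.
avg_positions_per_game = TOTAL_POSITIONS / TOTAL_GAMES

print(num_shards)                        # 23
print(last_shard_games)                  # 596
print(round(avg_positions_per_game, 1))  # 104.9
```

Under this assumption the shard count matches the 23 shards reported above, with roughly 105 positions per game on average.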

### 🧹 Data Processing Pipeline

1. **Filtering**:
   - The scripts `pgn_filtering.py` and `pgn_time_filtering.py` apply the rating and time-control filters to the raw PGN files. The PGN shards in `data/shards300_small` are already filtered.

2. **Parsing**:
   - `position_parsing.py` converts filtered PGN games into a structured batch format:
     - Board input tensors
     - Labels for move targets, evaluation result, in-check flag, threats, etc.
     - All tensors are saved as `int8` or `uint8` to reduce storage size.

3. **Final Packaging**:
   - `convert_to_stacked_shards.py` collects and compresses parsed batches into final `.pt` tensor files.
   - This significantly accelerates data loading during training.

Each shard in its final form contains all required inputs and targets for model training, allowing efficient streaming from disk.
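
To illustrate why 8-bit storage is compact, here is a minimal sketch that encodes the piece placement of a position as 64 unsigned bytes, one per square. The piece codes and the function `encode_placement` are hypothetical; the actual tensor layout produced by `position_parsing.py` is not described in this section and is probably richer than this.

```python
# Hypothetical piece codes; the layout used by position_parsing.py is not documented here.
PIECE_CODES = {p: i + 1 for i, p in enumerate("PNBRQKpnbrqk")}  # 0 = empty square


def encode_placement(fen: str) -> bytes:
    """Encode the piece-placement field of a FEN string as 64 uint8 values."""
    placement = fen.split()[0]
    squares = bytearray()
    for rank in placement.split("/"):  # FEN lists ranks from 8 down to 1
        for ch in rank:
            if ch.isdigit():
                squares.extend(b"\x00" * int(ch))  # run of empty squares
            else:
                squares.append(PIECE_CODES[ch])
    if len(squares) != 64:
        raise ValueError(f"expected 64 squares, got {len(squares)}")
    return bytes(squares)


START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
encoded = encode_placement(START_FEN)
print(len(encoded))  # 64 bytes per position
```

Each value fits in one byte, a quarter of the space a 32-bit integer or float would need, which matters when a dataset holds millions of positions.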

## Reproducing the training data
### 🔧 Two simple scripts to reproduce the training data from the filtered PGN shards

In [None]:
# parse_shards.py

import subprocess

def parse_shards(start=0, end=23):
    """Run position_parsing.py on shards start..end-1 (end is exclusive)."""
    for i in range(start, end):
        input_file = f"data/shards300_small/shard_{i}.pgn"
        output_file = f"data/shards300_small/positions{i}.pt"
        command = [
            "python3", "chessengine/preprocessing/position_parsing.py",
            "--input", input_file,
            "--output", output_file
        ]
        print(f"🔹 Parsing shard {i}...")
        # Each shard is parsed in its own subprocess; a failure is reported
        # but does not stop the remaining shards.
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"❌ Failed to parse shard {i}")

# Example usage:
# parse_shards(20, 23)
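
Because a failed shard is only reported and does not stop the loop, it can help to confirm that every expected output file exists before moving on. The helper below is a hypothetical addition, not part of the repository; it assumes the `positions{i}.pt` naming used by the parsing script above.

```python
from pathlib import Path


def missing_outputs(shard_dir: str, start: int = 0, end: int = 23) -> list[int]:
    """Return the shard indices in [start, end) whose positions{i}.pt file is absent."""
    root = Path(shard_dir)
    return [i for i in range(start, end) if not (root / f"positions{i}.pt").is_file()]


# Example usage:
# missing = missing_outputs("data/shards300_small")
# if missing:
#     print(f"Shards still to parse: {missing}")
```

Any indices returned can be passed back to `parse_shards` one range at a time.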

In [None]:
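# Stack the parsed per-shard .pt files into the final training shards.
from data.convert_to_stacked_shards import convert_shard_format

input_path = "data/shards300_small"   # directory holding the parsed positions{i}.pt files
output_path = "data/stacked_shards"   # destination for the stacked shards
convert_shard_format(input_path, output_path)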