# Data cleaning
## Data structure
The raw data holds match metadata together with the moves of each game in PGN notation.
Example:

```
                Event            White       Black Result     UTCDate   UTCTime \
0           Classical          eisaaaa    HAMID449    1-0  2016.06.30  22:00:01 \
1               Blitz           go4jas  Sergei1973    0-1  2016.06.30  22:00:01 \
2    Blitz tournament  Evangelistaizac      kafune    1-0  2016.06.30  22:00:02 \
3      Correspondence           Jvayne    Wsjvayne    1-0  2016.06.30  22:00:02 \
4    Blitz tournament           kyoday   BrettDale    0-1  2016.06.30  22:00:02 \

    WhiteElo  BlackElo  WhiteRatingDiff  BlackRatingDiff  ECO
0       1901      1896             11.0            -11.0  D10
1       1641      1627            -11.0             12.0  C20
2       1647      1688             13.0            -13.0  B01
3       1706      1317             27.0            -25.0  A00
4       1945      1900            -14.0             13.0  B9 

                                         Opening  TimeControl   Termination                                                     AN
0                                    Slav Defense       300+5  Time forfeit      1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e...
1                       King's Pawn Opening: 2.b3       300+0        Normal      1. e4 e5 2. b3 Nf6 3. Bb2 Nc6 4. Nf3 d6 5. d3 ...
2   Scandinavian Defense: Mieses-Kotroc Variation       180+0  Time forfeit      1. e4 d5 2. exd5 Qxd5 3. Nf3 Bg4 4. Be2 Nf6 5....
3                            Van't Kruijs Opening           -        Normal      1. e3 Nf6 2. Bc4 d6 3. e4 e6 4. Nf3 Nxe4 5. Nd...
4     Sicilian Defense: Najdorf, Lipnitsky Attack       180+0  Time forfeit      1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. N...
```
The first columns hold game metadata, and the `AN` column holds the PGN movetext of the game, so each row must be replayed move by move to turn this table into board positions.
The most important fields to keep are `Result`, the Elo ratings (`WhiteElo` and `BlackElo`), `TimeControl` and `AN`.
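
The `TimeControl` values in the sample follow a `base+increment` pattern (`300+5`, `180+0`), with `-` for the correspondence game. A minimal sketch of how such a field could be split, assuming both numbers are seconds and that `-` means no clock; `parse_time_control` is a hypothetical helper that does not appear in the notebook:

```python
from typing import Optional, Tuple


def parse_time_control(value: str) -> Optional[Tuple[int, int]]:
    """Return (base_seconds, increment_seconds), or None when there is no clock."""
    value = value.strip()
    if value == "-":
        return None  # assumption: "-" marks a game without a clock
    base, _, increment = value.partition("+")
    return int(base), int(increment or 0)


print(parse_time_control("300+5"))  # (300, 5)
print(parse_time_control("-"))      # None
```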


In [5]:
import pandas as pd
import chess
import numpy as np

piece_to_idx = {
    'P': 0, 'N': 1, 'B': 2, 'R': 3, 'Q': 4, 'K': 5,
    'p': 6, 'n': 7, 'b': 8, 'r': 9, 'q': 10, 'k': 11
}

def board_to_tensor(board: chess.Board, sideToPlay: bool) -> np.ndarray:
    # 12 piece bitboards plus one spare slot (index 12) reserved for the side to move.
    # Each bitboard is a uint64 with one bit per square (8*8 = 64 squares).
    tensor = np.zeros(13, dtype=np.uint64)
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece:
            idx = piece_to_idx[piece.symbol()]
            tensor[idx] |= np.uint64(1) << np.uint64(square) # Set the bit for this square in the piece's bitboard
    # tensor[12] = 1 if sideToPlay else 0
    return tensor

def get_possible_moves(board: chess.Board) -> np.ndarray:
    # Bitmask over the 64*63 = 4032 from/to pairs in MOVE_DICTIONARY.
    # 4032 bits fit in 63 uint64 words; 64 words are allocated to keep the sizes uniform.
    array = np.zeros(shape=(64,), dtype=np.uint64)
    for move in board.legal_moves:
        move_idx = MOVE_DICTIONARY[move.uci()[:4]] # [:4] drops the promotion suffix, e.g. "e7e8q" -> "e7e8"
        num_idx = move_idx // 64
        bit_idx = move_idx % 64
        array[num_idx] |= np.uint64(1) << np.uint64(bit_idx)
    return array


letters = ["a", "b", "c", "d", "e", "f", "g", "h"]
numbers = list(range(1, 9)) # [1..8]
MOVE_DICTIONARY = {}
cumulative = 0 # Number of from == to pairs skipped so far, so the indices stay contiguous
for i in range(8):
    for j in range(8):
        for k in range(8):
            for w in range(8):
                if (i == k and j == w):
                    cumulative += 1
                    continue
                from_square = f"{letters[i]}{numbers[j]}"
                to_square = f"{letters[k]}{numbers[w]}"
                MOVE_DICTIONARY[f"{from_square}{to_square}"] = (i * 8**3) + (j * 8**2) + (k * 8) + w - cumulative
REVERSE_MOVE_DICTIONARY = {
    value: key for key, value in MOVE_DICTIONARY.items()
}
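
The loop above numbers every from/to pair with distinct squares in file-major order (a1, a2, ..., a8, b1, ...), which differs from the rank-major square numbering used by `python-chess`. The sketch below rebuilds that numbering with the standard library only and checks it against a closed form. It assumes the intent is a dense 0..4031 index; `build_move_dictionary` and `move_index` are hypothetical helpers, not part of the notebook.

```python
FILES = "abcdefgh"


def build_move_dictionary() -> dict:
    """Enumerate every from/to pair (from != to) in file-major order."""
    mapping = {}
    for f in range(64):
        for t in range(64):
            if f == t:
                continue  # a piece cannot move to its own square
            name = f"{FILES[f // 8]}{f % 8 + 1}{FILES[t // 8]}{t % 8 + 1}"
            mapping[name] = len(mapping)  # next free index
    return mapping


def move_index(uci: str) -> int:
    """Closed form for the same index."""
    f = FILES.index(uci[0]) * 8 + int(uci[1]) - 1  # from-square: a1=0, a2=1, ..., h8=63
    t = FILES.index(uci[2]) * 8 + int(uci[3]) - 1  # to-square, same numbering
    return f * 63 + t - (1 if t > f else 0)


moves = build_move_dictionary()
assert len(moves) == 64 * 63  # 4032 distinct from/to pairs
assert all(move_index(name) == idx for name, idx in moves.items())
words_needed = -(-len(moves) // 64)  # ceiling division
print(words_needed)  # 63
```

Since 4032 bits need only 63 `uint64` words, the last word of the 64-word legal-move mask is always zero.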

In [6]:
import re

def clean_pgn(pgn_text):
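    # Remove {...} comments, which also removes any [%...] tags inside them
    cleaned = re.sub(r'\{[^}]*\}', '', pgn_text)
    # Remove any [%...] tags left outside a comment
    cleaned = re.sub(r'\[%[^]]*\]', '', cleaned)
    # cleaned = re.sub(r'[0-9]+\.\.\.', '', cleaned)
    # Remove annotation glyphs such as ?, ??, ?! and !?
    cleaned = re.sub(r'[?!]+', '', cleaned)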
    return cleaned
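
To illustrate the cleaning step together with the token filter used later in the notebook, here is a small standard-library sketch. The sample movetext is made up, and `san_moves` is a hypothetical helper that combines both steps; it is not the notebook's function.

```python
import re


def san_moves(movetext: str) -> list:
    """Strip comments, annotation glyphs, move numbers and the result token."""
    text = re.sub(r"\{[^}]*\}", "", movetext)  # {...} comments, including [%eval ...] tags
    text = re.sub(r"[?!]+", "", text)          # annotation glyphs such as ?, ?? and ?!
    # Move numbers ("1.", "1...") and results ("1-0", "1/2-1/2") all start with a digit
    return [tok for tok in text.split() if not tok[0].isdigit()]


sample = "1. e4 { [%eval 0.24] } 1... c5?! { [%eval 0.32] } 2. Nf3?? 1-0"
print(san_moves(sample))  # ['e4', 'c5', 'Nf3']
```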



In [7]:
import csv
from time import time

from tqdm import tqdm
HEADERS=[
    "Event", "White", "Black", "Result", "UTCDate", "UTCTime", "WhiteElo", 
    "BlackElo", "WhiteRatingDiff", "BlackRatingDiff", "ECO", "Opening", 
    "TimeControl", "Termination", "AN"
]

start = 1 # To ignore headers
total = 0
READ_SIZE = 10_000
RUN = True
num_games_saved = 0
num_batches = 1
try:
    with open("./dataset/processed/results_ELO_1500_50k.csv", "w", newline="") as f:
        writer = csv.writer(f)
        while RUN:
            ti = time()
            games_df = pd.read_csv('./dataset/raw/chess_games.csv', skiprows=start, nrows=READ_SIZE, names=HEADERS)

            print(f"Reading {num_batches}-th batch of games...")
            # Each output row: piece bitboards (uint64[13]), index of the move played, legal-move bitmask (uint64[64])
            for index, game in tqdm(games_df.iterrows()):
                board = chess.Board()
                # Get PGN
                Result = game["Result"]
                WhiteElo = int(game["WhiteElo"])
                BlackElo = int(game["BlackElo"])
                TimeControl = game["TimeControl"]
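                # Keep only games where both players are rated between 1400 and 1600 (exclusive)
                if not(1400 < WhiteElo < 1600) or not(1400 < BlackElo < 1600):
                    continue

                moves_string = clean_pgn(game["AN"])

                # Drop move numbers ("1.", "1...") and result tokens ("1-0", "1/2-1/2")
                tokens = moves_string.replace("\n", " ").split()
                moves = [token for token in tokens if not token[0].isdigit() and '.' not in token]
                numMoves = 0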

                sideToPlay = True # True for white, False for Black
                for san in moves:
                    try:
                        # Encode the position and its legal moves before the move is played
                        bitmap = board_to_tensor(board, sideToPlay)
                        possible_moves = get_possible_moves(board)

                        move: chess.Move = board.push_san(san)
                        numMoves += 1
                        
                        # Skip the first 7 plies (opening) and the move that ends the game
                        if numMoves < 8 or board.is_game_over():
                            sideToPlay = not sideToPlay
                            continue

                        writer.writerow([
                            bitmap.tolist(), # bitmaps
                            MOVE_DICTIONARY[move.uci()[:4]], # Played move
                            # 0 for invalid 1 for valid; To keep constant size
                            possible_moves.tolist(),
                        ])
                        sideToPlay = not sideToPlay

                    except Exception as e:
                        print(f"Skipping bad move: {moves} - {san} – #{e}#")
                        break
                num_games_saved+=1
            
            start += READ_SIZE
            tf = time()
            total += len(games_df)
            num_batches += 1
            print(f"Read {len(games_df)} in {tf-ti} | TOTAL: {total} | {num_games_saved}")

            if 50_000 <= num_games_saved:
                RUN = False

except Exception:
    import traceback
    traceback.print_exc() # Print the exception together with its traceback

Reading 1-th batch of games...


10000it [00:08, 1203.86it/s]


Read 10000 in 8.644449234008789 | TOTAL: 10000 | 939
Reading 2-th batch of games...


10000it [00:08, 1228.51it/s]


Read 10000 in 8.239701986312866 | TOTAL: 20000 | 1848
Reading 3-th batch of games...


10000it [00:07, 1327.53it/s]


Read 10000 in 7.681756496429443 | TOTAL: 30000 | 2741
Reading 4-th batch of games...


10000it [00:07, 1397.59it/s]


Read 10000 in 7.269893407821655 | TOTAL: 40000 | 3624
Reading 5-th batch of games...


10000it [00:07, 1333.34it/s]


Read 10000 in 7.638554096221924 | TOTAL: 50000 | 4516
Reading 6-th batch of games...


10000it [00:07, 1329.47it/s]


Read 10000 in 7.66652250289917 | TOTAL: 60000 | 5388
Reading 7-th batch of games...


10000it [00:08, 1224.78it/s]


Read 10000 in 8.325124502182007 | TOTAL: 70000 | 6319
Reading 8-th batch of games...


10000it [00:07, 1371.79it/s]


Read 10000 in 7.455922842025757 | TOTAL: 80000 | 7161
Reading 9-th batch of games...


10000it [00:08, 1243.42it/s]


Read 10000 in 8.27109694480896 | TOTAL: 90000 | 8107
Reading 10-th batch of games...


10000it [00:08, 1228.16it/s]


Read 10000 in 8.34715986251831 | TOTAL: 100000 | 9075
Reading 11-th batch of games...


10000it [00:08, 1247.43it/s]


Read 10000 in 8.22925591468811 | TOTAL: 110000 | 9993
Reading 12-th batch of games...


10000it [00:07, 1262.37it/s]


Read 10000 in 8.173838138580322 | TOTAL: 120000 | 10921
Reading 13-th batch of games...


10000it [00:07, 1411.18it/s]


Read 10000 in 7.345448732376099 | TOTAL: 130000 | 11745
Reading 14-th batch of games...


10000it [00:07, 1368.74it/s]


Read 10000 in 7.618567705154419 | TOTAL: 140000 | 12613
Reading 15-th batch of games...


10000it [00:07, 1335.10it/s]


Read 10000 in 7.769639492034912 | TOTAL: 150000 | 13429
Reading 16-th batch of games...


10000it [00:07, 1388.59it/s]


Read 10000 in 7.53776216506958 | TOTAL: 160000 | 14259
Reading 17-th batch of games...


10000it [00:07, 1293.86it/s]


Read 10000 in 8.045767307281494 | TOTAL: 170000 | 15178
Reading 18-th batch of games...


10000it [00:07, 1265.46it/s]


Read 10000 in 8.282756805419922 | TOTAL: 180000 | 16106
Reading 19-th batch of games...


10000it [00:07, 1287.55it/s]


Read 10000 in 8.092992782592773 | TOTAL: 190000 | 17012
Reading 20-th batch of games...


10000it [00:08, 1158.84it/s]


Read 10000 in 9.041783571243286 | TOTAL: 200000 | 17972
Reading 21-th batch of games...


10000it [00:08, 1248.06it/s]


Read 10000 in 8.390672445297241 | TOTAL: 210000 | 18886
Reading 22-th batch of games...


10000it [00:08, 1240.04it/s]


Read 10000 in 8.44861650466919 | TOTAL: 220000 | 19843
Reading 23-th batch of games...


10000it [00:07, 1377.44it/s]


Read 10000 in 7.646055698394775 | TOTAL: 230000 | 20724
Reading 24-th batch of games...


10000it [00:07, 1391.23it/s]


Read 10000 in 7.625539064407349 | TOTAL: 240000 | 21577
Reading 25-th batch of games...


10000it [00:07, 1377.37it/s]


Read 10000 in 7.727782964706421 | TOTAL: 250000 | 22433
Reading 26-th batch of games...


10000it [00:07, 1388.48it/s]


Read 10000 in 7.6129982471466064 | TOTAL: 260000 | 23305
Reading 27-th batch of games...


10000it [00:06, 1456.59it/s]


Read 10000 in 7.29938268661499 | TOTAL: 270000 | 24142
Reading 28-th batch of games...


10000it [00:07, 1392.02it/s]


Read 10000 in 7.630439758300781 | TOTAL: 280000 | 25027
Reading 29-th batch of games...


10000it [00:06, 1428.65it/s]


Read 10000 in 7.531796932220459 | TOTAL: 290000 | 25868
Reading 30-th batch of games...


10000it [00:07, 1325.55it/s]


Read 10000 in 8.040568828582764 | TOTAL: 300000 | 26741
Reading 31-th batch of games...


10000it [00:07, 1367.49it/s]


Read 10000 in 7.799785852432251 | TOTAL: 310000 | 27565
Reading 32-th batch of games...


10000it [00:07, 1381.41it/s]


Read 10000 in 7.795697450637817 | TOTAL: 320000 | 28420
Reading 33-th batch of games...


10000it [00:07, 1368.13it/s]


Read 10000 in 7.852860689163208 | TOTAL: 330000 | 29302
Reading 34-th batch of games...


10000it [00:08, 1235.61it/s]


Read 10000 in 8.688997268676758 | TOTAL: 340000 | 30240
Reading 35-th batch of games...


10000it [00:07, 1405.99it/s]


Read 10000 in 7.6686952114105225 | TOTAL: 350000 | 31043
Reading 36-th batch of games...


10000it [00:07, 1383.99it/s]


Read 10000 in 7.807011365890503 | TOTAL: 360000 | 31846
Reading 37-th batch of games...


10000it [00:07, 1358.59it/s]


Read 10000 in 7.934517860412598 | TOTAL: 370000 | 32687
Reading 38-th batch of games...


10000it [00:07, 1376.46it/s]


Read 10000 in 7.887926816940308 | TOTAL: 380000 | 33522
Reading 39-th batch of games...


10000it [00:07, 1401.11it/s]


Read 10000 in 7.757328748703003 | TOTAL: 390000 | 34350
Reading 40-th batch of games...


10000it [00:07, 1368.24it/s]


Read 10000 in 7.943890571594238 | TOTAL: 400000 | 35176
Reading 41-th batch of games...


10000it [00:07, 1370.34it/s]


Read 10000 in 8.019845962524414 | TOTAL: 410000 | 36026
Reading 42-th batch of games...


10000it [00:08, 1249.93it/s]


Read 10000 in 8.662577390670776 | TOTAL: 420000 | 36971
Reading 43-th batch of games...


10000it [00:07, 1257.99it/s]


Read 10000 in 8.705304145812988 | TOTAL: 430000 | 37883
Reading 44-th batch of games...


10000it [00:08, 1248.92it/s]


Read 10000 in 8.740654945373535 | TOTAL: 440000 | 38737
Reading 45-th batch of games...


10000it [00:08, 1194.84it/s]


Read 10000 in 9.098841428756714 | TOTAL: 450000 | 39622
Reading 46-th batch of games...


10000it [00:07, 1273.15it/s]


Read 10000 in 8.595641374588013 | TOTAL: 460000 | 40451
Reading 47-th batch of games...


10000it [00:08, 1214.67it/s]


Read 10000 in 9.04657769203186 | TOTAL: 470000 | 41365
Reading 48-th batch of games...


10000it [00:08, 1229.98it/s]


Read 10000 in 8.876656532287598 | TOTAL: 480000 | 42273
Reading 49-th batch of games...


10000it [00:08, 1143.85it/s]


Read 10000 in 9.513035774230957 | TOTAL: 490000 | 43262
Reading 50-th batch of games...


10000it [00:08, 1244.50it/s]


Read 10000 in 8.823207139968872 | TOTAL: 500000 | 44151
Reading 51-th batch of games...


10000it [00:08, 1155.07it/s]


Read 10000 in 9.436676263809204 | TOTAL: 510000 | 45050
Reading 52-th batch of games...


10000it [00:07, 1314.51it/s]


Read 10000 in 8.499630451202393 | TOTAL: 520000 | 45839
Reading 53-th batch of games...


10000it [00:07, 1288.52it/s]


Read 10000 in 8.823467254638672 | TOTAL: 530000 | 46700
Reading 54-th batch of games...


10000it [00:07, 1325.55it/s]


Read 10000 in 8.420501232147217 | TOTAL: 540000 | 47572
Reading 55-th batch of games...


10000it [00:07, 1281.59it/s]


Read 10000 in 8.66593050956726 | TOTAL: 550000 | 48436
Reading 56-th batch of games...


10000it [00:08, 1227.15it/s]


Read 10000 in 9.02623963356018 | TOTAL: 560000 | 49339
Reading 57-th batch of games...


10000it [00:07, 1372.57it/s]

Read 10000 in 8.199848651885986 | TOTAL: 570000 | 50156
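
Each row written by the loop above stores its two arrays as the string form of a Python list, so they have to be parsed again when the file is read back. The following is a minimal round-trip sketch using only the standard library. The row values are made up for illustration, and `bitboard_to_squares` is a hypothetical helper.

```python
import csv
import io
import json


def bitboard_to_squares(bits: int) -> list:
    """Return the square indices (0..63) whose bit is set."""
    return [sq for sq in range(64) if (bits >> sq) & 1]


# Made-up row in the same layout: 13 bitboards, move index, 64-word legal-move mask
bitboards = [0] * 13
bitboards[0] = 0xFF00   # eight pieces on squares 8..15
legal_mask = [0] * 64
legal_mask[0] = 0b101   # move indices 0 and 2 are marked legal

buffer = io.StringIO()
csv.writer(buffer).writerow([bitboards, 2113, legal_mask])

buffer.seek(0)
row = next(csv.reader(buffer))
decoded_boards = json.loads(row[0])  # stored as the string "[65280, 0, ...]"
played_move = int(row[1])
decoded_mask = json.loads(row[2])

# Expand the mask words back into move indices: word w, bit b -> index w * 64 + b
legal_indices = [w * 64 + b for w, word in enumerate(decoded_mask) for b in bitboard_to_squares(word)]
print(bitboard_to_squares(decoded_boards[0]))  # [8, 9, 10, 11, 12, 13, 14, 15]
print(played_move, legal_indices)              # 2113 [0, 2]
```

`json.loads` works here because the string form of a list of integers is also valid JSON; `ast.literal_eval` would be an alternative.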



