# Data cleaning
## Data structure
The raw data contains per-game match information plus the moves played in each game, stored as PGN movetext.
Example:

```
                Event            White       Black Result     UTCDate   UTCTime \
0           Classical          eisaaaa    HAMID449    1-0  2016.06.30  22:00:01 \
1               Blitz           go4jas  Sergei1973    0-1  2016.06.30  22:00:01 \
2    Blitz tournament  Evangelistaizac      kafune    1-0  2016.06.30  22:00:02 \
3      Correspondence           Jvayne    Wsjvayne    1-0  2016.06.30  22:00:02 \
4    Blitz tournament           kyoday   BrettDale    0-1  2016.06.30  22:00:02 \

    WhiteElo  BlackElo  WhiteRatingDiff  BlackRatingDiff  ECO
0       1901      1896             11.0            -11.0  D10
1       1641      1627            -11.0             12.0  C20
2       1647      1688             13.0            -13.0  B01
3       1706      1317             27.0            -25.0  A00
4       1945      1900            -14.0             13.0  B9 

                                         Opening  TimeControl   Termination                                                     AN
0                                    Slav Defense       300+5  Time forfeit      1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e...
1                       King's Pawn Opening: 2.b3       300+0        Normal      1. e4 e5 2. b3 Nf6 3. Bb2 Nc6 4. Nf3 d6 5. d3 ...
2   Scandinavian Defense: Mieses-Kotroc Variation       180+0  Time forfeit      1. e4 d5 2. exd5 Qxd5 3. Nf3 Bg4 4. Be2 Nf6 5....
3                            Van't Kruijs Opening           -        Normal      1. e3 Nf6 2. Bc4 d6 3. e4 e6 4. Nf3 Nxe4 5. Nd...
4     Sicilian Defense: Najdorf, Lipnitsky Attack       180+0  Time forfeit      1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. N...
```
Each row holds a few columns of game information plus an `AN` column that contains the moves of the game in algebraic notation (PGN movetext), so each game has to be replayed move by move to turn this table into gameboards.
The most important columns to keep are `Result`, `WhiteElo`, `BlackElo`, `TimeControl` and `AN`.
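
One compact way to store a gameboard is as a set of bitboards: one 64-bit integer per piece type and colour, with one bit per square. The sketch below is a pure-Python illustration that builds the 12 piece bitboards from the piece-placement field of a FEN string. The helper name `fen_to_bitboards`, the `PIECES` ordering and the a1 = 0 ... h8 = 63 bit numbering are assumptions chosen for this example; it is not the conversion code used in this notebook.

```python
PIECES = "PNBRQKpnbrqk"  # assumed plane order: white P N B R Q K, then black p n b r q k

def fen_to_bitboards(placement: str) -> list[int]:
    """Turn the piece-placement field of a FEN into 12 integers used as 64-bit masks.

    Bit numbering assumes a1 = 0, b1 = 1, ..., h8 = 63.
    """
    boards = [0] * 12
    # FEN lists ranks from 8 down to 1 and, inside a rank, files from a to h.
    for rank_from_top, row in enumerate(placement.split("/")):
        rank = 7 - rank_from_top
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)  # a digit encodes a run of empty squares
            else:
                boards[PIECES.index(ch)] |= 1 << (rank * 8 + file)
                file += 1
    return boards

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
boards = fen_to_bitboards(start)
print(hex(boards[0]))   # white pawns on rank 2 -> 0xff00
print(hex(boards[11]))  # black king on e8 -> 0x1000000000000000
```

Every square is a single bit, so a full position fits in 12 machine words, and checks such as "is there a white pawn on e2" reduce to one bitwise AND.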


In [4]:
import pandas as pd
import chess
import numpy as np

piece_to_idx = {
    'P': 0, 'N': 1, 'B': 2, 'R': 3, 'Q': 4, 'K': 5,
    'p': 6, 'n': 7, 'b': 8, 'r': 9, 'q': 10, 'k': 11
}

def board_to_tensor(board: chess.Board, sideToPlay: bool):
    # 12 piece bitboards (one uint64 each, one bit per square) plus one side-to-play flag
    tensor = np.zeros(13, dtype=np.uint64)
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece:
            idx = piece_to_idx[piece.symbol()]
            # Shift in uint64 so the result does not depend on NumPy's integer promotion rules
            tensor[idx] |= np.uint64(1) << np.uint64(square)
    tensor[12] = 1 if sideToPlay else 0
    return tensor

def get_possible_moves(board: chess.Board) -> np.ndarray:
    # Bitmask of legal moves: the 64 * 63 = 4032 from/to pairs fit in 63 uint64 words;
    # a 64th word is allocated only to keep the shape homogeneous
    array = np.zeros(shape=(64,), dtype=np.uint64)
    for move in board.legal_moves:
        move_idx = MOVE_DICTIONARY[move.uci()[:4]] # [:4] drops the promotion suffix
        num_idx = move_idx // 64
        bit_idx = move_idx % 64
        array[num_idx] |= np.uint64(1) << np.uint64(bit_idx)
    return array


letters = ["a", "b", "c", "d", "e", "f", "g", "h"]
numbers = list(range(1, 9)) # [1..8]
MOVE_DICTIONARY = {}
cumulative = 0 # number of same-square (from == to) pairs skipped so far
for i in range(8):
    for j in range(8):
        for k in range(8):
            for w in range(8):
                if (i == k and j == w):
                    cumulative += 1
                    continue
                from_square = f"{letters[i]}{numbers[j]}"
                to_square = f"{letters[k]}{numbers[w]}"
                # Subtracting the skipped pairs keeps the indices contiguous: 0..4031
                MOVE_DICTIONARY[f"{from_square}{to_square}"] = (i * 8**3) + (j * 8**2) + (k * 8) + w - cumulative
REVERSE_MOVE_DICTIONARY = {
    value: key for key, value in MOVE_DICTIONARY.items()
}
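
The nested loops above give every from/to square pair a contiguous index, with squares ordered file by file (a1, a2, ..., a8, b1, ...), which differs from python-chess's own square numbering. As a minimal sketch, the same mapping can be written in closed form and inverted without a lookup table. The helpers `square_index`, `square_name`, `move_index`, `index_to_move` and `unpack_moves` are hypothetical names introduced here for illustration, and they work on plain Python integers rather than NumPy arrays.

```python
FILES = "abcdefgh"

def square_index(square: str) -> int:
    """File-major index: a1 -> 0, a2 -> 1, ..., a8 -> 7, b1 -> 8, ..., h8 -> 63."""
    return FILES.index(square[0]) * 8 + (int(square[1]) - 1)

def square_name(index: int) -> str:
    """Inverse of square_index."""
    return f"{FILES[index // 8]}{index % 8 + 1}"

def move_index(uci: str) -> int:
    """Closed form of the dictionary lookup for a 4-character UCI move."""
    src, dst = square_index(uci[:2]), square_index(uci[2:4])
    if src == dst:
        raise ValueError("a move cannot start and end on the same square")
    # Each source square owns 63 consecutive indices; targets that come after
    # the source shift down by one to close the gap left by the skipped self-move.
    return src * 63 + dst - (1 if dst > src else 0)

def index_to_move(idx: int) -> str:
    """Inverse of move_index."""
    src, offset = divmod(idx, 63)
    dst = offset + (1 if offset >= src else 0)
    return square_name(src) + square_name(dst)

def unpack_moves(words: list[int]) -> list[str]:
    """List the moves whose bits are set in a sequence of 64-bit words."""
    return [
        index_to_move(w * 64 + b)
        for w, word in enumerate(words)
        for b in range(64)
        if (word >> b) & 1 and w * 64 + b < 64 * 63  # ignore padding bits
    ]

idx = move_index("e2e4")
print(idx, idx // 64, idx % 64)  # 2113 33 1
print(index_to_move(idx))        # e2e4
```

A row's legal-move mask can be turned back into readable moves with `unpack_moves`, which is handy for spot-checking the generated dataset.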

In [5]:
import re

def clean_pgn(pgn_text):
    # Remove {...} comments and [%...] evals
    cleaned = re.sub(r'\{[^}]*\}', '', pgn_text)
    cleaned = re.sub(r'\[%[^]]*\]', '', cleaned)
    # cleaned = re.sub(r'[0-9]+\.\.\.', '', cleaned)
    # Strip move annotations (?, !, ?!, !?, ??, !!) so only bare SAN reaches push_san
    cleaned = re.sub(r'[?!]+', '', cleaned)
    return cleaned
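
To make the cleaning step concrete, here is a small self-contained sketch that reduces annotated movetext to a list of bare SAN moves. The helper `strip_annotations` and the sample string are hypothetical; the sample only imitates the style of annotated movetext with `{ [%eval ...] }` comments, and the real `AN` column may contain patterns this sketch does not handle.

```python
import re

def strip_annotations(movetext: str) -> list[str]:
    """Return the bare SAN moves from PGN-style movetext (illustrative helper)."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # {comments}, including [%eval ...] tags
    text = re.sub(r"\$\d+", " ", text)          # numeric annotation glyphs such as $1
    text = re.sub(r"\d+\.+", " ", text)         # move numbers: "1." and "1..."
    text = re.sub(r"[?!]+", "", text)           # suffixes such as !, ?, !?, ?!, ??
    results = {"1-0", "0-1", "1/2-1/2", "*"}    # game-termination markers
    return [tok for tok in text.split() if tok not in results]

sample = "1. e4 { [%eval 0.17] } 1... c5?! { [%eval 0.35] } 2. Nf3!? d6 3. Bb5+ 1-0"
print(strip_annotations(sample))  # ['e4', 'c5', 'Nf3', 'd6', 'Bb5+']
```

Check and mate markers (`+`, `#`) are left in place because they are part of SAN.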



In [6]:
import csv
from time import time

from tqdm import tqdm
HEADERS=[
    "Event", "White", "Black", "Result", "UTCDate", "UTCTime", "WhiteElo", 
    "BlackElo", "WhiteRatingDiff", "BlackRatingDiff", "ECO", "Opening", 
    "TimeControl", "Termination", "AN"
]

start = 1 # To ignore headers
total = 0
READ_SIZE = 10_000
RUN = True
num_games_saved = 0
num_batches = 1
try:
    # newline="" is what the csv module recommends for files passed to csv.writer
    with open("./dataset/processed/results_ELO_1500.csv", "w", newline="") as f:
        writer = csv.writer(f)
        while RUN:
            ti = time()
            games_df = pd.read_csv('./dataset/raw/chess_games.csv', skiprows=start, nrows=READ_SIZE, names=HEADERS)

            print(f"Reading {num_batches}-th batch of games...")
            # Result, WhiteElo, BlackElo, TimeControl, BoardPositions: uint_64[], move_played
            for index, game in tqdm(games_df.iterrows()):
                board = chess.Board()
                # Get PGN
                Result = game["Result"]
                WhiteElo = int(game["WhiteElo"])
                BlackElo = int(game["BlackElo"])
                TimeControl = game["TimeControl"]
                moves_string = clean_pgn(game["AN"])

                bitmaps = []

                tokens = moves_string.replace("\n", " ").split()
                moves = [token for token in tokens if not token[0].isdigit() and '.' not in token]
                sideToPlay = True # True for white, False for Black
                numMoves = 0
                if not(1400 < WhiteElo < 1600) or not(1400 < BlackElo < 1600):
                    continue
                for move in moves:
                    try:
                        move: chess.Move = board.push_san(move)
                        numMoves += 1
                        # Read the side to play from the board so the flag stays in sync
                        # even for the early moves that are skipped below
                        sideToPlay = board.turn
                        if numMoves < 8 or board.is_game_over():
                            continue

                        # The board is encoded after the move has been pushed
                        bitmap = board_to_tensor(board, sideToPlay)
                        possible_moves = get_possible_moves(board)
                        writer.writerow([
                            bitmap.tolist(), # bitmaps
                            MOVE_DICTIONARY[move.uci()[:4]], # Played move
                            # Legal-move bitmask: 0 for invalid, 1 for valid; keeps the size constant
                            possible_moves.tolist(),
                        ])
                    except Exception as e:
                        print(f"Skipping bad move: {moves} - {move} – #{e}#")
                        break
                num_games_saved+=1
            
            start += READ_SIZE
            tf = time()
            total += len(games_df)
            num_batches += 1
            print(f"Read {len(games_df)} in {tf-ti} | TOTAL: {total} | {num_games_saved}")

            if 50_000 <= num_games_saved:
                RUN = False

except Exception:
    import traceback
    # e.with_traceback() requires a traceback argument and would itself raise here
    traceback.print_exc()

Reading 1-th batch of games...
10000it [00:07, 1333.50it/s]
Read 10000 in 7.870232820510864 | TOTAL: 10000 | 939
Reading 2-th batch of games...
10000it [00:07, 1380.68it/s]
Read 10000 in 7.34575629234314 | TOTAL: 20000 | 1848
...
Reading 56-th batch of games...
10000it [00:07, 1403.59it/s]
Read 10000 in 8.02131199836731 | TOTAL: 560000 | 49339
Reading 57-th batch of games...
10000it [00:06, 1554.94it/s]
Read 10000 in 7.337378740310669 | TOTAL: 570000 | 50156
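
The log shows that only about one game in eleven falls inside the 1400 to 1600 rating band, and each batch re-opens the raw file and skips all previously read rows. As a hedged alternative, the file can be streamed once and filtered on the fly. The sketch below uses the standard-library `csv.DictReader` on a tiny in-memory stand-in for the raw CSV; the generator `iter_games_in_band` and the sample rows are made up for illustration. With pandas, passing `chunksize=READ_SIZE` to `pd.read_csv` gives a comparable single-pass iterator.

```python
import csv
import io

def iter_games_in_band(handle, low=1400, high=1600):
    """Yield rows whose two ratings both fall strictly inside (low, high)."""
    for row in csv.DictReader(handle):
        try:
            white, black = int(row["WhiteElo"]), int(row["BlackElo"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric rating: skip the game
        if low < white < high and low < black < high:
            yield row

# Tiny in-memory stand-in for the raw CSV (made-up rows).
raw = io.StringIO(
    "White,Black,Result,WhiteElo,BlackElo,AN\n"
    "alice,bob,1-0,1510,1495,1. e4 e5 1-0\n"
    "carol,dave,0-1,1901,1896,1. d4 d5 0-1\n"
    "erin,frank,1-0,1450,,1. c4 1-0\n"
)
kept = list(iter_games_in_band(raw))
print([game["White"] for game in kept])  # ['alice']
```

Because the generator yields one row at a time, memory use stays flat regardless of file size, and rows with a missing rating are skipped instead of aborting the whole run.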



