# **DATA PRE-PROCESSING**

## **IMPORTANT TERMS**

**PGN :** PGN (Portable Game Notation) is a universal, plain-text standard for recording chess games. It stores the moves in algebraic notation together with metadata (players, date, event, result) in a human-readable format that chess software can import, analyze, and share. It acts like a digital chess score sheet.

**FEN :** FEN (Forsyth-Edwards Notation) is a standard, single-line text format for describing one specific board position. It has six space-separated fields: piece placement, side to move, castling rights, en passant target square, halfmove clock, and fullmove number. This lets software resume a game from any position. Piece placement uses letters for pieces (uppercase for White, lowercase for Black), digits for runs of empty squares, and slashes to separate ranks, making FEN a compact snapshot of a game state.
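To make the six FEN fields concrete, the sketch below splits a FEN string on whitespace using only the standard library. The helper name `split_fen` and the dictionary keys are illustrative choices made here, not part of python-chess.

```python
def split_fen(fen):
    """Split a FEN string into its six space-separated fields."""
    placement, turn, castling, en_passant, halfmove, fullmove = fen.split()
    return {
        "placement": placement.split("/"),  # rank 8 first, rank 1 last
        "turn": turn,                       # 'w' or 'b'
        "castling": castling,               # subset of 'KQkq', or '-'
        "en_passant": en_passant,           # target square, or '-'
        "halfmove_clock": int(halfmove),    # plies since the last capture or pawn move
        "fullmove_number": int(fullmove),   # starts at 1, incremented after Black moves
    }

fields = split_fen("r1bqkb1r/pppp1Qpp/2n2n2/4p3/2B1P3/8/PPPP1PPP/RNB1K1NR b KQkq - 0 4")
print(fields["turn"], fields["castling"], fields["fullmove_number"])
```

The same FEN string is rendered graphically in a later cell of this notebook.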

**Elo :** Refers to the Elo rating system, developed by Arpad Elo, a method for calculating players' relative skill levels.
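As a worked illustration of what a rating gap means, the standard Elo model gives player A an expected score of 1 / (1 + 10^((R_B - R_A) / 400)) against player B. The function name `expected_score` is a hypothetical helper written for this note; the dataset itself only provides the ratings.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Equal ratings: both players are expected to score half a point
print(expected_score(1800, 1800))

# A 400-point advantage: expected score of 10/11 for the stronger player
print(round(expected_score(2200, 1800), 2))
```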

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Installing Python's chess library

!pip install python-chess

In [None]:
import chess
import pandas as pd
import numpy as np

In [None]:
# initializing the board

board = chess.Board()
print(board)

In [None]:
# graphical board

import chess.svg
from IPython.display import SVG, display

board = chess.Board()
display(SVG(chess.svg.board(board=board)))

In [None]:
# displaying a FEN

b=chess.Board("r1bqkb1r/pppp1Qpp/2n2n2/4p3/2B1P3/8/PPPP1PPP/RNB1K1NR b KQkq - 0 4")
display(SVG(chess.svg.board(board=b)))

In [None]:
# Loading the '.csv' file as a pandas DataFrame

df = pd.read_csv('/kaggle/input/chess-games/chess_games.csv')
df.head(5)

In [None]:
df[['WhiteElo','BlackElo']].describe()

In [None]:
df.dropna(inplace=True)

Since the dataset is very large, I filtered it, keeping only those games where both players are rated 1800 Elo or higher.

In [None]:
print(df[(df["WhiteElo"] >= 1800) & (df["BlackElo"] >= 1800)].shape[0])

In [None]:
df = df[(df["WhiteElo"] >= 1800) & (df["BlackElo"] >= 1800)]

In [None]:
df.drop(columns=['Event','Result','WhiteElo','BlackElo','WhiteRatingDiff','BlackRatingDiff','Opening','TimeControl','Termination'],inplace=True)

In [None]:
print(df.shape)  # a bare 'df.shape' here would not be displayed, only the last expression is
df.head(5)

In [None]:
# Saving the filtered dataset to a new '.csv' file for future use

df.to_csv('modified_chess_data.csv')

In [None]:
df = pd.read_csv('/kaggle/input/modified-dataset/modified_chess_data.csv')
df.head(5)

In [None]:
print(df.shape)

In [None]:
# The dataset contains some impure PGNs with embedded '%eval' engine evaluations. Such games are removed.
# The move text lives in the 'AN' column, so the filter targets that column by name.

df = df[~df['AN'].str.contains(r'%eval', na=False)]
print(df.shape)

Extracting the game result from the PGN and adding a new 'Result' column to df:
*  1: White wins
* -1: Black wins
*  0: Draw

In [None]:
def result(pgn):
    ''' Map the result token at the end of the move text to a number:
        1 for a White win, -1 for a Black win, 0 for a draw, None if no result is found.
    '''
    pgn = pgn.strip()  # ignore trailing whitespace or newlines
    if pgn.endswith("1-0"):
        return 1
    elif pgn.endswith("0-1"):
        return -1
    elif pgn.endswith("1/2-1/2"):
        return 0
    return None

df['Result'] = df['AN'].apply(result)

In [None]:
df.dropna(subset=['Result'], inplace=True)

In [None]:
df = df.rename(columns={'AN': 'PGN'})
df.head(5)

In [None]:
type(df['PGN'].iloc[0])  # positional access: label 0 may have been removed by the filters above

In [None]:
# Creating Train, Validation and Test sets (80% / 10% / 10%), stratified by 'Result'

from sklearn.model_selection import train_test_split

train_df, temp_df = train_test_split(
    df,
    test_size=0.20,
    stratify=df['Result'],
    random_state=42,
    shuffle=True
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    stratify=temp_df['Result'],
    random_state=42,
    shuffle=True
)

In [None]:
# Stratified datasets

print(train_df['Result'].value_counts(normalize=True))
print(val_df['Result'].value_counts(normalize=True))
print(test_df['Result'].value_counts(normalize=True))

In [None]:
# Saving the DataFrames as 'pickle' files so they can be downloaded and reused later

train_df.to_pickle("train_df.pkl")
val_df.to_pickle("val_df.pkl")
test_df.to_pickle("test_df.pkl")

## **Making data ready for Neural-Network input**

In [None]:
import chess.pgn
import io

In [None]:
# Numeric representation of each piece: plane index (num_piece) and material value (piece_weight)

''' In FEN and python-chess, uppercase symbols ('P') are White pieces and lowercase
    symbols ('p') are Black pieces.
    The signs in 'piece_weight' below were written against that convention (Black positive,
    White negative). The later code compensates by multiplying the material value with
    an extra '-1'.
'''

num_piece = {'p':0,'n':1,'b':2,'r':3,'q':4,'k':5,
             'P':6,'N':7,'B':8,'R':9,'Q':10,'K':11}

piece_weight = {'p':1,'n':3,'b':3,'r':5,'q':9,'k':0,
             'P':-1,'N':-3,'B':-3,'R':-5,'Q':-9,'K':0}

In [None]:
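# extracting the position after each move as a FEN from the PGN

def pgn_to_fen(PGN):
  fen = []
  pgn = io.StringIO(PGN)
  game = chess.pgn.read_game(pgn)
  board = game.board()  # starting position of this game
  for move in game.mainline_moves():
    board.push(move)  # play the move, then record the resulting position
    fen.append(board.fen())
  return fen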

In [None]:
# converting a FEN to a 12x8x8 matrix: 8x8 for board and x12 for each type of chess piece

def fen_to_matrix(FEN):
  matrix = np.zeros((12,8,8))
  b = chess.Board(FEN)
  for square,piece in b.piece_map().items():
    r = 7 - square//8
    c = square%8
    matrix[num_piece[str(piece)],r,c] = 1
  return matrix

The chess board is 8x8. There are two players, each with 6 types of pieces (king, queen, rook, bishop, knight and pawn), so the board is encoded as 2 x 6 = 12 planes of size 8x8, one plane per player and piece type.
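The plane layout can be illustrated without any libraries. The sketch below expands only the piece-placement field of a FEN into a 12 x 8 x 8 nested list, reusing the same symbol-to-plane mapping as `num_piece`. `PLANE_INDEX` and `placement_to_planes` are hypothetical names written for this illustration, not the functions used in the notebook.

```python
PLANE_INDEX = {'p': 0, 'n': 1, 'b': 2, 'r': 3, 'q': 4, 'k': 5,
               'P': 6, 'N': 7, 'B': 8, 'R': 9, 'Q': 10, 'K': 11}

def placement_to_planes(placement):
    """Return 12 planes of 8x8 zeros/ones; row 0 is rank 8, column 0 is file a."""
    planes = [[[0] * 8 for _ in range(8)] for _ in range(12)]
    for row, rank in enumerate(placement.split("/")):
        col = 0
        for ch in rank:
            if ch.isdigit():
                col += int(ch)  # a digit means that many empty squares
            else:
                planes[PLANE_INDEX[ch]][row][col] = 1
                col += 1
    return planes

planes = placement_to_planes("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR")

# The total number of ones equals the number of pieces on the board
print(sum(cell for plane in planes for row in plane for cell in row))
```

Each square is set in at most one plane, so the encoding is a one-hot vector over the 12 piece types per square, with empty squares left as all zeros.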

In [None]:
# Calculating material points

def material_points(FEN):
  white_point = 0
  black_point = 0

  b = chess.Board(FEN)
  for square,piece in b.piece_map().items():
    if str(piece).isupper():
      white_point = white_point + piece_weight[str(piece)]
    elif str(piece).islower():
      black_point = black_point + piece_weight[str(piece)]
  return white_point,black_point

In [None]:
#  additional features to the board matrix

def add_board(matrix,turn,FEN):

  # rebuild the position from its FEN instead of relying on a global board
  board = chess.Board(FEN)

  # side to move: 1 for White, -1 for Black
  side_plane = np.ones((1,8,8)) * turn

  # castling rights: one constant plane per (player, side),
  # all ones if the right is still available, all zeros otherwise
  castle = []
  for color in (chess.WHITE, chess.BLACK):
    castle.append(np.full((8,8), float(board.has_kingside_castling_rights(color))))
    castle.append(np.full((8,8), float(board.has_queenside_castling_rights(color))))
  castle = np.stack(castle)

  # material points
  # piece_weight uses Black-positive signs, so the sum below is (Black - White);
  # the extra -1 flips it to (White - Black) and 'turn' makes it relative to the side to move
  white_point,black_point = material_points(FEN)
  material_advantage = black_point + white_point
  material = np.full((1, 8, 8),material_advantage,dtype=np.float32) * (-1*turn)

  add_matrix = np.concatenate([matrix, side_plane, castle, material], axis=0)

  return add_matrix

Each additional feature is an extra constant 8x8 plane. One plane encodes the side to move (1 for White, -1 for Black). One plane encodes the material advantage (White material points minus Black material points, multiplied in the code by the side-to-move value so that it is seen from the mover's perspective). Four planes encode castling rights: 2 players x 2 sides (kingside and queenside).
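Because each extra plane is a single scalar broadcast over the board, the six values can be read straight from the FEN. The sketch below does this with the standard library only. `PIECE_VALUE`, `extra_features`, and `broadcast` are hypothetical names for illustration, and the material term assumes the same side-to-move scaling as the notebook's code.

```python
PIECE_VALUE = {'p': 1, 'n': 3, 'b': 3, 'r': 5, 'q': 9, 'k': 0}

def extra_features(fen):
    """Return [side, K, Q, k, q, material] scalars for one FEN position."""
    placement, turn, castling = fen.split()[:3]
    side = 1 if turn == 'w' else -1
    # one flag per castling right, in the FEN order K, Q, k, q
    rights = [1 if flag in castling else 0 for flag in "KQkq"]
    white = sum(PIECE_VALUE[ch.lower()] for ch in placement if ch.isupper())
    black = sum(PIECE_VALUE[ch] for ch in placement if ch.islower())
    # White minus Black, seen from the side to move (assumed to match the notebook)
    return [side] + rights + [(white - black) * side]

def broadcast(value):
    """Expand one scalar into a constant 8x8 plane."""
    return [[value] * 8 for _ in range(8)]

scalars = extra_features("r1bqkb1r/pppp1Qpp/2n2n2/4p3/2B1P3/8/PPPP1PPP/RNB1K1NR b KQkq - 0 4")
extra_planes = [broadcast(v) for v in scalars]
print(scalars, len(extra_planes) + 12)
```

Adding the 6 extra planes to the 12 piece planes gives the 18 channels of the final input.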

**Final neural-network input: an 18 x 8 x 8 matrix**

In [None]:
# Testing all functions

ft = pgn_to_fen(df['PGN'].iloc[0])  # positional access: label 0 may have been filtered out
print(ft[0])

xt = chess.Board(ft[0])
print(xt)

yt = fen_to_matrix(ft[0])
print(yt)

wt,bt = material_points(ft[0])
print(wt,'\t',bt)

print(add_board(yt,1,ft[0]))

In [None]:
add_board(yt,1,ft[0]).shape