<a href="https://colab.research.google.com/github/GOTWIC/AI-Chess-Engine/blob/main/Chess_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Libraries

In [28]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
import xgboost as xgb
import gc
from google.colab import drive
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
import chess
import time
import random
from IPython.display import clear_output 

! rm -rf sample_data

# Import Original Dataset and Reformat to Custom Dataframe (Not Recommended)

The steps in this section achieve the following:
1. Import the Original Kaggle dataset
2. Load the Dataset into a dataframe
3. Parse and Reformat the dataframe for all 13 million positions into a new dataframe
4. Upload new dataframe to Google Drive as a .CSV file.

Because reformatting 13 million chess positions is incredibly resource intensive, it's better to download the reformatted .CSV file that is generated at the end of this process. Using the reformatted .CSV file will significantly reduce the setup time for subsequent sessions.



### *If you want to download the reformatted .CSV file:*
*It is faster to save the reformatted file to Google Drive and download it from there (about 10 minutes in total) than to save it to Colab's local storage and download it directly (20 minutes or more).*

## Import Dataset 

1.   Download Kaggle API .json file
2.   Upload to Google Drive (root folder)
3.   Mount Google Drive



In [None]:
drive.mount('/content/drive')  # step 3: makes kaggle.json reachable under /content/drive
! pip install kaggle
! mkdir -p ~/.kaggle
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download ronakbadhe/chess-evaluations
! unzip chess-evaluations.zip

## Load and Read Dataset

In [None]:
path = 'chessData.csv' 
rawData = pd.read_csv(path)

## Parse FEN + Evaluation into Custom Data Frame

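The original data frame has only two columns: the FEN string and the evaluation. To make it usable as model input, the reformatting step parses the piece-placement field of each FEN and allocates one column per square (64 columns), plus one column for the evaluation (65 columns in total). Pieces are encoded as integers from 1 (pawn) to 6 (king), positive for White and negative for Black, with 0 for an empty square. The remaining FEN fields (side to move, castling rights, en passant square, move counters) are discarded.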

***This step will take around 18 minutes to complete.***
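
As a small worked example of the layout described above, the sketch below parses the standard starting position with the same piece codes used in this notebook and pairs each value with its square label. The helper name `fen_to_squares` is an illustrative assumption, not part of the original code; it reads only the piece-placement field of the FEN, so the columns run from a8 to h1.

```python
# Piece codes match parsePiece in this notebook: White positive, Black negative.
PIECE_CODES = {'P': 1, 'N': 2, 'B': 3, 'R': 4, 'Q': 5, 'K': 6,
               'p': -1, 'n': -2, 'b': -3, 'r': -4, 'q': -5, 'k': -6}

def fen_to_squares(fen):
    """Return 64 integers for the piece-placement field of a FEN (hypothetical helper)."""
    placement = fen.split(' ')[0]
    squares = []
    for char in placement:
        if char.isdigit():
            # A digit encodes that many consecutive empty squares.
            squares.extend([0] * int(char))
        elif char in PIECE_CODES:
            squares.append(PIECE_CODES[char])
        # '/' only separates ranks and is skipped.
    return squares

# Labels in the same order as the dataframe columns: a8..h8, a7..h7, ..., a1..h1.
square_labels = [file + str(rank) for rank in range(8, 0, -1) for file in 'abcdefgh']

start_fen = 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1'
start_squares = fen_to_squares(start_fen)
row = dict(zip(square_labels, start_squares))
print(len(start_squares), row['a8'], row['e1'], row['e4'])
```

Here `row['a8']` is the black rook, `row['e1']` is the white king, and `row['e4']` is an empty square.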

In [None]:
def parsePiece(piece):
  return {
        'P': 1,
        'N': 2,
        'B': 3,
        'R': 4,
        'Q': 5,
        'K': 6,
        'p': -1,
        'n': -2,
        'b': -3,
        'r': -4,
        'q': -5,
        'k': -6,
    }[piece]

rows = len(rawData)
rawParse = np.empty([rows, 65])

for row in range(rows):
  FENIndex = 0
  for char in rawData.FEN[row]:
    if char.isalpha():
      rawParse[row][FENIndex] = parsePiece(char)
      FENIndex += 1
    elif char.isdigit():
      for emptySpace in range(int(char)):
        rawParse[row][FENIndex] = 0
        FENIndex += 1
    elif char == ' ':
      break

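  # Named `evaluation` to avoid shadowing Python's built-in eval()
  evaluation = rawData.Evaluation[row]
  if evaluation[0] == '+':
    rawParse[row][64] = int(evaluation[1:])
  elif evaluation[0] == '-':
    rawParse[row][64] = -1 * int(evaluation[1:])
  elif evaluation[0] == '#':
    # Forced mate: keep only the sign and use a large fixed score
    if evaluation[1] == '+':
      rawParse[row][64] = 32000
    elif evaluation[1] == '-':
      rawParse[row][64] = -32000
  else:
    rawParse[row][64] = 0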

  if row%100000 == 0:
    print("Currently reformatting position #" + str(row))

print("Finished processing " + str(rows) + " positions")
  

squareLabels = []
for i in range(1, 9):
  for j in range(1, 9):
    squareLabels.append(chr(j + 96) + chr((8-i) + 49))
squareLabels.append('Evaluation')

data = pd.DataFrame(rawParse, columns = squareLabels)
downcasted_data = data.apply(pd.to_numeric,downcast='signed')

print("Reformatting Complete")
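
The evaluation handling in the loop above can also be read as a small standalone function. The sketch below restates that logic under stated assumptions: `parse_evaluation` and `MATE_SCORE` are names introduced here for illustration, the sample strings are made-up inputs in the format the loop expects (signed centipawn values, or `#+N` / `#-N` for a forced mate), and anything else falls back to 0.

```python
MATE_SCORE = 32000  # value this notebook assigns to forced-mate evaluations

def parse_evaluation(text):
    """Convert an evaluation string to an integer (illustrative helper)."""
    if text.startswith('#+'):
        return MATE_SCORE       # White mates; the move count is dropped
    if text.startswith('#-'):
        return -MATE_SCORE      # Black mates; the move count is dropped
    if text[:1] in ('+', '-'):
        return int(text)        # int() accepts a leading sign
    return 0

samples = ['+56', '-10', '#+3', '#-2', '0']
parsed = [parse_evaluation(s) for s in samples]
print(parsed)
```

Because 32000 fits in a signed 16-bit integer, the evaluation column can still be downcast to INT16 later on.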





## Delete the Original Dataframe
Reformatting uses almost all of the available RAM. Deleting the intermediate dataframes and running the garbage collector frees some of it.

In [None]:
del rawData
del data
gc.collect()

## Save New Dataframe as a .CSV File and Upload to Google Drive
After this step, you can import the new dataframe directly from Google Drive.

In [None]:
path = '/content/drive/My Drive/ReformattedChessDataset.csv'
with open(path, 'w', encoding = 'utf-8-sig') as f:
  downcasted_data.to_csv(f)

# The Reformatted Dataset

## Import Dataset 

1.   Download Kaggle API .json file
2.   Upload to Google Drive (root folder)
3.   Mount Google Drive

*Runtime: Approximately 30 seconds*

In [2]:
drive.mount('/content/drive')  # step 3: makes kaggle.json reachable under /content/drive
! pip install kaggle
! mkdir -p ~/.kaggle
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download swagnikroychoudhury/bysquare-chess-evaluations
! unzip bysquare-chess-evaluations.zip

Downloading bysquare-chess-evaluations.zip to /content
 91% 148M/163M [00:01<00:00, 132MB/s]
100% 163M/163M [00:01<00:00, 134MB/s]
Archive:  bysquare-chess-evaluations.zip
  inflating: ReformattedChessDataset.csv  


## Loading the Dataset

Although the reformatted dataset was downcast before it was saved, a .CSV file stores only text, so the column types are lost. When pandas reads the file back, integer columns default to 64-bit, and most of the memory savings are gone. We need to chunk, downcast, and concatenate the dataset to re-optimize it.

Most of the table values fit in INT8. Loaded as INT64, they take up 8x more memory. This is fine for small datasets, but this dataset is large enough that Colab crashes with the INT64 data type.
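
To make the 8x figure concrete, here is a back-of-the-envelope estimate. It assumes roughly 13 million rows, 64 square columns, and one evaluation column, as described earlier in this notebook. The exact row count and pandas overhead (such as the index) are ignored, so treat the numbers as approximate.

```python
ROWS = 13_000_000   # approximate number of positions (assumption: rounded)
SQUARE_COLS = 64    # one column per board square
EVAL_COLS = 1       # the Evaluation column

# Default after reading the CSV: every integer column is int64 (8 bytes per value).
bytes_int64 = ROWS * (SQUARE_COLS + EVAL_COLS) * 8

# After downcasting: squares fit in int8 (1 byte), evaluations in int16 (2 bytes).
bytes_downcast = ROWS * (SQUARE_COLS * 1 + EVAL_COLS * 2)

ratio = bytes_int64 / bytes_downcast
print(f"int64:    {bytes_int64 / 1e9:.2f} GB")
print(f"downcast: {bytes_downcast / 1e9:.2f} GB")
print(f"ratio:    {ratio:.1f}x")
```

The overall ratio comes out slightly below 8x because the evaluation column needs two bytes per value rather than one.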

*Runtime: Approximately 1 minute*

In [3]:
path = 'ReformattedChessDataset.csv'
chunksize = 1000000
chunks = []

for count, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
  print("Processing Chunk " + str(count + 1))
  # Downcast each chunk to the smallest signed integer type that fits
  chunks.append(chunk.apply(pd.to_numeric, downcast='signed'))

# Concatenate once, then keep only the int8/int16 columns (squares and Evaluation)
data = pd.concat(chunks).select_dtypes(include=['int8', 'int16'])

print("Chunk Processing Complete")

Processing Chunk 1
Processing Chunk 2
Processing Chunk 3
Processing Chunk 4
Processing Chunk 5
Processing Chunk 6
Processing Chunk 7
Processing Chunk 8
Processing Chunk 9
Processing Chunk 10
Processing Chunk 11
Processing Chunk 12
Processing Chunk 13
Chunk Processing Complete


# Part 1: Regression Machine Learning
This part fits several regression models to a random sample of one million positions. The models are not hyperparameter-tuned, because the results below quickly make it apparent that regression on raw square values works poorly, if at all.
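
For reference, the metric reported by each model in this part is the mean absolute error (MAE). The sketch below computes it in plain Python on a handful of made-up values (not taken from the dataset) to show how a single position encoded as a forced mate (±32000) can dominate the average. The function name `mean_absolute_error_py` is introduced here only for illustration.

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical evaluations in centipawns; the last one is a forced mate.
y_true = [30, -120, 250, 0, 32000]
y_pred = [50, -100, 200, 40, 500]

mae_all = mean_absolute_error_py(y_true, y_pred)
mae_without_mate = mean_absolute_error_py(y_true[:-1], y_pred[:-1])
print(mae_all, mae_without_mate)
```

In this toy example the four ordinary positions are predicted to within 50 centipawns, yet the single mate score pushes the overall MAE into the thousands.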

In [27]:
labels = []
for i in range(1, 9):
  for j in range(1, 9):
    labels.append(chr(j + 96) + chr((8-i) + 49))

sample = data.sample(n = 1000000)

y = sample.Evaluation
X = sample[labels]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 0)

### Linear Regression


In [22]:
# Use a distinct name so the sklearn `linear_model` module is not overwritten
linreg_model = linear_model.LinearRegression()
linreg_model.fit(X_train, y_train)
pred = linreg_model.predict(X_val)
print("Mean Absolute Error: ", mean_absolute_error(y_val, pred))

Mean Absolute Error:  1129.7312063645802


In [24]:
y_val_arr = y_val.to_numpy()
error_statistics = pd.DataFrame(abs(pred - y_val_arr))
error_statistics.describe()

| Statistic | Absolute error |
|---|---|
| count | 250000.0 |
| mean | 1129.731206 |
| std | 3777.202189 |
| min | 0.005231 |
| 25% | 203.619112 |
| 50% | 458.892826 |
| 75% | 879.547302 |
| max | 36331.179984 |


### Ridge Model

In [29]:
ridge_model = linear_model.Ridge()
ridge_model.fit(X_train, y_train)
pred = ridge_model.predict(X_val)
print("Mean Absolute Error: ", mean_absolute_error(y_val, pred))

Mean Absolute Error:  1099.8443733035333


In [30]:
y_val_arr = y_val.to_numpy()
error_statistics = pd.DataFrame(abs(pred - y_val_arr))
error_statistics.describe()

| Statistic | Absolute error |
|---|---|
| count | 250000.0 |
| mean | 1099.844373 |
| std | 3713.113814 |
| min | 0.003003 |
| 25% | 199.742152 |
| 50% | 447.053868 |
| 75% | 855.979286 |
| max | 36671.500247 |


### Random Forest Regression

In [31]:
forest_model = RandomForestRegressor(max_leaf_nodes=2, n_estimators=1, random_state=1)
forest_model.fit(X_train, y_train)
pred = forest_model.predict(X_val)
print("Mean Absolute Error: ", mean_absolute_error(y_val, pred))

Mean Absolute Error:  803.1642296061361


In [32]:
y_val_arr = y_val.to_numpy()
error_statistics = pd.DataFrame(abs(pred - y_val_arr))
error_statistics.describe()

| Statistic | Absolute error |
|---|---|
| count | 250000.0 |
| mean | 803.16423 |
| std | 3902.57423 |
| min | 0.046939 |
| 25% | 62.953061 |
| 50% | 118.046939 |
| 75% | 310.953061 |
| max | 33568.924359 |


### XGBoost

In [33]:
xgb_model = xgb.XGBRegressor(objective='reg:squarederror')
xgb_model.fit(X_train, y_train)
pred = xgb_model.predict(X_val)
print("Mean Absolute Error: ", mean_absolute_error(y_val, pred))

Mean Absolute Error:  842.99


In [18]:
y_val_arr = y_val.to_numpy()
error_statistics = pd.DataFrame(abs(pred - y_val_arr))
error_statistics.describe()

| Statistic | Absolute error |
|---|---|
| count | 250000.0 |
| mean | 1105.478818 |
| std | 3693.359017 |
| min | 0.00268 |
| 25% | 202.78277 |
| 50% | 454.603224 |
| 75% | 866.350847 |
| max | 37188.517149 |

*These statistics come from an earlier run (cell 18) than the MAE printed by cell 33, so their mean (1105.48) does not match the 842.99 reported above.*


# The Chess Board

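This section lets you play against the evaluator: you enter a move in standard algebraic notation, and the model chooses a reply by scoring the position after every legal move.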
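
The game loop later in this section scores every legal reply and plays the one with the largest predicted evaluation magnitude. A minimal sketch of that selection step, with uniform random tie-breaking, is shown below. The function `pick_best` and the scores are hypothetical stand-ins for the model's predictions, so no chess library is needed to follow it.

```python
import random

def pick_best(scored_moves, rng=random):
    """Return a move with the highest score, choosing uniformly among ties."""
    best_score = max(score for _, score in scored_moves)
    best_moves = [move for move, score in scored_moves if score == best_score]
    return rng.choice(best_moves)

# Hypothetical (move, |predicted evaluation|) pairs.
candidates = [('e7e5', 35.0), ('g8f6', 80.0), ('d7d5', 80.0), ('a7a6', 10.0)]
choice = pick_best(candidates, random.Random(0))
print(choice)
```

Collecting all tied moves first gives each of them the same probability, whereas flipping a coin on every tie favors moves that appear later in the list.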

In [10]:
board = chess.Board()

def parsePiece(piece):
  return {
        'P': 1,
        'N': 2,
        'B': 3,
        'R': 4,
        'Q': 5,
        'K': 6,
        'p': -1,
        'n': -2,
        'b': -3,
        'r': -4,
        'q': -5,
        'k': -6,
    }[piece]

def parseFEN(FEN):
  FENIndex = 0
  parsed = np.empty([1, 64])
  for char in FEN:
    if char.isalpha():
      parsed[0][FENIndex] = parsePiece(char)
      FENIndex += 1
    elif char.isdigit():
      for emptySpace in range(int(char)):
        parsed[0][FENIndex] = 0
        FENIndex += 1
    elif char == ' ':
      break
  return pd.DataFrame(parsed, columns=labels)

In [None]:
board.reset()

In [17]:

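# Assumption: `model` is not defined elsewhere in this notebook; use any fitted
# regressor from Part 1 (xgb_model is chosen here only as an example).
model = xgb_model

print(board)

for i in range(1, 501):

  human_move = input("Your Move: ")

  try:
    board.push_san(human_move)
  except ValueError:
    print("Invalid or illegal move, try again.")
    continue

  # The human's move may already have ended the game
  if board.is_game_over():
    break

  maxEval = -1
  best_moves = []

  for move in board.legal_moves:
    board.push(move)
    move_df = parseFEN(board.fen())
    # Rank replies by the magnitude of the prediction; the sign is ignored
    pred = abs(model.predict(move_df))
    board.pop()

    if pred[0] > maxEval:
      maxEval = pred[0]
      best_moves = [move]
    elif pred[0] == maxEval:
      best_moves.append(move)

  # Break ties uniformly at random
  move_to_play = random.choice(best_moves)
  board.push(move_to_play)

  clear_output()
  print(board)
  print(board.fullmove_number)

  if board.is_game_over():
    break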

board

board.outcome()


r n b q k b n Q
p . . . . . . p
. . . . . . . .
. p p p p . . .
. . . . . . . .
. . . . . . . .
P P P P . P P P
R N B Q K B N R
6


KeyboardInterrupt: ignored