# Chess Model
This notebook builds a predictive model for the winning side of a chess game, using features derived from the game's opening moves.

## Import Statements and Data
In this section we'll import the packages we need, load our data, and check its quality.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

We read in our dataset and assign it to a variable called games.

In [2]:
games = pd.read_csv("games.csv")

Let's check the data quality. We want to make sure we aren't working with any null data.

In [3]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20058 entries, 0 to 20057
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              20058 non-null  object 
 1   rated           20058 non-null  bool   
 2   created_at      20058 non-null  float64
 3   last_move_at    20058 non-null  float64
 4   turns           20058 non-null  int64  
 5   victory_status  20058 non-null  object 
 6   winner          20058 non-null  object 
 7   increment_code  20058 non-null  object 
 8   white_id        20058 non-null  object 
 9   white_rating    20058 non-null  int64  
 10  black_id        20058 non-null  object 
 11  black_rating    20058 non-null  int64  
 12  moves           20058 non-null  object 
 13  opening_eco     20058 non-null  object 
 14  opening_name    20058 non-null  object 
 15  opening_ply     20058 non-null  int64  
dtypes: bool(1), float64(2), int64(4), object(9)
memory usage: 2.3+ MB


In [4]:
games['rating_diff'] = games['white_rating'] - games['black_rating']

In [5]:
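# Keep only the target and the raw move list; rating_diff is dropped so the
# model sees move information only.
games = games[['winner','moves','rating_diff']]
games = games.drop(columns=['rating_diff'])
# Remove draws so the target is binary; comment out the next line to keep them.
# .copy() avoids pandas chained-assignment warnings when columns are added later.
games = games[games['winner'] != 'draw'].copy()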

Now that we know our data contains no nulls, let's check the class balance to establish a baseline accuracy.

In [6]:
games['winner'].value_counts(normalize=True)

white    0.523393
black    0.476607
Name: winner, dtype: float64

Always predicting white would be right about 52.3% of the time, so our model needs to beat that figure to be useful.
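
To make the baseline concrete: a classifier that always predicts the most common class scores exactly that class's share of the data. Here is a minimal sketch of that calculation. The helper name `majority_baseline` and the toy labels are invented for illustration and are not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, share of rows it covers)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels, invented for illustration (not drawn from games.csv).
toy_winners = ["white", "black", "white", "white", "black"]
label, share = majority_baseline(toy_winners)
print(label, share)  # white 0.6
```

Any model evaluated by accuracy should be compared against this share rather than against 50%.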

## Data Cleaning and Preprocessing
Unfortunately, the moves column is our only source of information about what happens turn by turn in each game, so we'll need to do a fair amount of work to get it into a usable shape. First, let's truncate this column to a fixed number of turns. The length must be standardized so that the number of inputs to our model is always the same. We'll create a variable, turns, for the number of turns we want to keep.

In [7]:
turns = 15  # number of opening moves to keep per game

Now we'll need a function that returns the moves column split and sliced to our desired size. Because the model needs consistent input, games with fewer turns must be padded. For ease of use, we'll pad the empty turns with the string '0'.

In [8]:
def splitStandardize(array, length = turns):
    split_array = array.split(' ')[0:length]
    while len(split_array) < length:
        split_array.append('0')
    return(split_array)
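
As a quick illustration of the truncate-and-pad behaviour, the sketch below reimplements the same idea in a standalone helper. The name `pad_moves` and the sample move strings are assumptions made for the example, not taken from the dataset.

```python
def pad_moves(move_string, length, pad_token="0"):
    """Split a space-separated move string, keep the first `length` moves,
    and right-pad with `pad_token` when the game is shorter."""
    moves = move_string.split(" ")[:length]
    return moves + [pad_token] * (length - len(moves))

short_game = pad_moves("e4 e5 Nf3", 5)
long_game = pad_moves("d4 d5 c4 c6 cxd5 e6", 4)
print(short_game)  # ['e4', 'e5', 'Nf3', '0', '0']
print(long_game)   # ['d4', 'd5', 'c4', 'c6']
```

Either way the result always has exactly `length` entries, which is what the model's fixed-size input requires.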

Now we can create a new column holding the first N moves of each game, where N is the value of turns.

In [9]:
columnName = 'First'+ str(turns) + 'Moves'
games[columnName] = games['moves'].apply(lambda x: splitStandardize(x))

Now we need to decide how to represent these moves. Chess notation has a few conventions we need to know:
* A move that puts the opponent in check ends with + (checkmate is written # in standard algebraic notation; ++ is an older convention)
* A capture is marked with x
* Each piece other than the pawn has its own letter
  * K : King
  * Q : Queen
  * R : Rook
  * B : Bishop
  * N : Knight
  * P : Pawn (usually omitted, since a move with no piece letter is a pawn move)
* Castling is indicated by O-O (kingside) or O-O-O (queenside)
* A lowercase letter followed by a number gives the destination square of the move. In a capture, the destination square comes after the "x"; for a pawn capture, the letter before the "x" is the file the pawn started on (for example, exd5).
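
The notation rules above can be checked against a few sample tokens. The sketch below pulls the destination square out of a move by scanning from the end of the token. The function name `destination_square` and the sample tokens are illustrative assumptions. Unlike the xCoord and yCoord helpers defined later, which look for the last file letter and the last rank digit independently, this version looks for the last adjacent letter-digit pair.

```python
FILES = "abcdefgh"
RANKS = "12345678"

def destination_square(move):
    """Return (file, rank) of the destination square, or None if the token has none.

    Scanning from the end skips suffixes such as '+' or '#' and any
    disambiguation letters that come before the square.
    """
    for i in range(len(move) - 1, 0, -1):
        if move[i] in RANKS and move[i - 1] in FILES:
            return move[i - 1], move[i]
    return None

for token in ["e4", "Nxf3+", "exd5", "Nbd7", "e8=Q#", "O-O", "0"]:
    print(token, destination_square(token))
```

Castling and the padding token have no destination square in the notation, so they come back as None.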

With this information we can encode each move as a vector. Each turn will contain:
- A feature representing the x coordinate (letters)
- A feature representing the y coordinate (numbers)
- Dummy columns for each piece that was moved (other than pawn)
- A flag for if a piece was taken
- A flag for if a move resulted in a check
- A flag for if the turn was null (i.e. the game was finished already)
- A flag for if a castle occurred

We'll need to create functions for each of these flags. For padded turns, the capture, check, and castle flags return -1 rather than 0, so that a missing move can be told apart from a move that simply lacks that property.
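
Before writing the individual functions, it may help to see the flags for a single move side by side. This is a condensed sketch under the same conventions ('0' as the padding token, -1 for flags that do not apply to padded turns). The name `encode_move` and its dictionary keys are hypothetical, and the coordinate features are left out for brevity.

```python
def encode_move(move):
    """Turn one move token (or the pad token '0') into a flat feature dict."""
    is_null = move == "0"
    features = {
        "null_turn": int(is_null),
        # -1 marks "not applicable" for padded turns.
        "piece_taken": -1 if is_null else int("x" in move),
        "check": -1 if is_null else int("+" in move),
        "castle": -1 if is_null else int("O-O" in move),
    }
    # One indicator per non-pawn piece; a pawn move is the all-zero case.
    for piece in "KQRBN":
        features[piece] = int(piece in move)
    return features

print(encode_move("Nxf3+"))
print(encode_move("0"))
```

Repeating this for every kept turn and concatenating the results gives one fixed-width row per game.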

In [10]:
def flagPieceTaken(array):
    if '0' == array:
        return(-1)
    elif 'x' in array:
        return(1)
    else:
        return(0)

In [11]:
def flagCheck(array):
    if '0' == array:
        return(-1)
    elif '+' in array:
        return(1)
    else:
        return(0)

In [12]:
def flagNull(array):
    if '0' == array:
        return(1)
    else:
        return(0)

In [13]:
def flagPieceType(turnNumber,dataframe):
    # Adds one 0/1 column per non-pawn piece (e.g. 'N1') to dataframe in place.
    # The scratch column 'temp' holds the raw move for this turn and is dropped later.
    dataframe['temp'] = dataframe[columnName].apply(lambda x: x[turnNumber])
    for piece in ['K','Q','R','B','N']:
        newColumnName = piece + str(turnNumber + 1)
        dataframe[newColumnName] = 0
        dataframe.loc[dataframe['temp'].str.contains(piece,na=False),newColumnName] = 1

In [14]:
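def xCoord(array, case = {'a':1,'b':2,'c':3,'d':4,'e':5,'f':6,'g':7,'h':8}):
    # Scan the move from its last character backwards until a file letter (a-h) is found.
    key = ''
    index = -1
    while (key not in case) and (abs(index) <= len(array)):
        key = array[index]
        index -= 1
    # Return the file as 1-8, or -1 when the move has no file (padding or castling).
    if key in case:
        return(case[key])
    else:
        return(-1)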

In [15]:
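def yCoord(array, case = {'1':1,'2':2,'3':3,'4':4,'5':5,'6':6,'7':7,'8':8}):
    # Scan the move from its last character backwards until a rank digit (1-8) is found.
    key = ''
    index = -1
    while (key not in case) and (abs(index) <= len(array)):
        key = array[index]
        index -= 1
    # Return the rank as 1-8, or -1 when the move has no rank (padding or castling).
    if key in case:
        return(case[key])
    else:
        return(-1)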

In [16]:
def castleFlag(array):
    if array == '0':
        return(-1)
    elif 'O-O' in array:
        return(1)
    else:
        return(0)

Now that we have all of our functions built, let's create one big function that combines them all. This will make our code easier to read.

In [17]:
def combined(turnNumber,dataframe):
    newColumnName = str(turnNumber + 1)
    array = dataframe[columnName].apply(lambda x: x[turnNumber])
    array = array.apply(lambda x: x.replace('=',''))
    dataframe[f"{newColumnName}PieceTaken"] = array.apply(lambda x: flagPieceTaken(x))
    dataframe[f"{newColumnName}Check"] = array.apply(lambda x: flagCheck(x))
    dataframe[f"{newColumnName}NullTurn"] = array.apply(lambda x: flagNull(x))
    dataframe[f"{newColumnName}XCoord"] = array.apply(lambda x: xCoord(x))
    dataframe[f"{newColumnName}YCoord"] = array.apply(lambda x: yCoord(x))
    dataframe[f"{newColumnName}Castle"] = array.apply(lambda x: castleFlag(x))
    flagPieceType(turnNumber,dataframe)

Now let's loop through all the turns that we have and create these features.

In [18]:
for turnNumber in range(turns):
    combined(turnNumber,games)
    print(f"Turn {turnNumber + 1} Completed")

Turn 1 Completed
Turn 2 Completed
Turn 3 Completed
Turn 4 Completed
Turn 5 Completed
Turn 6 Completed
Turn 7 Completed
Turn 8 Completed
Turn 9 Completed
Turn 10 Completed
Turn 11 Completed
Turn 12 Completed
Turn 13 Completed
Turn 14 Completed
Turn 15 Completed


Now let's check and make sure this was successful.

In [19]:
games.columns[games.isna().any()].tolist()

[]

The empty list tells us that none of our columns contain NA values. Let's drop our temp, moves, and columnName columns so that we can get ready to model.

In [20]:
games = games.drop(columns = ['temp','moves',columnName])

## Modeling
Now we can model the relationship. First, let's encode our target variable as integers so that it works with sklearn.

In [21]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
games['winner'] = le.fit_transform(games['winner'])

Now we can fit a model to our data. We'll use linear discriminant analysis (LDA) as our model and k-fold cross-validation to evaluate it.

In [22]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [23]:
k = 10
crossvalidation = KFold(n_splits=k, random_state = 123, shuffle=True)
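
For readers unfamiliar with k-fold cross-validation, the sketch below shows the idea in plain Python: shuffle the row indices once, cut them into k folds, and let each fold serve as the test set exactly once. It illustrates the concept and is not sklearn's implementation. The function name `kfold_indices` and the sample size of 23 are assumptions chosen for the example.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % k folds get one extra row so every row is tested once.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=23, k=10))
print([len(test) for _, test in folds])  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

The model is refit on each training split and scored on the matching test split, and the k scores are then averaged.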

In [24]:
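# NOTE: the KFold splitter defined above is not passed in here, so
# cross_val_score falls back to its default 5-fold split.
# Pass cv=crossvalidation to use the shuffled 10-fold splitter instead.
LDA_cv_scores = cross_val_score(LDA(),games.drop(columns=['winner']),games['winner'])

In [25]:
print(f"LDA cross validation scores with k={len(LDA_cv_scores)}: ", LDA_cv_scores)
print("Average score of all folds:",LDA_cv_scores.mean())

LDA cross validation scores with k=5:  [0.56357928 0.58791209 0.57456829 0.55561371 0.555352  ]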
Average score of all folds: 0.5674050740824585


## Deep Learning

In [26]:
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import layers, models

In [27]:
train_set, test_set = train_test_split(games, test_size=.2, random_state = 42)

Now we create our model: a small fully connected network that ends in a single output logit.

In [28]:
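model = models.Sequential()
# One input per feature column (everything except the target).
model.add(tf.keras.Input(shape=(len(games.columns)-1,)))
model.add(layers.Dense(256,activation='relu'))
model.add(layers.Dense(128,activation='relu'))
model.add(layers.Dense(64,activation='relu'))
# NOTE: softmax is an unusual choice for a hidden layer; 'relu' is the conventional option here.
model.add(layers.Dense(32,activation='softmax'))
# Single output logit for the binary target (len(le.classes_) is 2 with draws removed).
model.add(layers.Dense(len(le.classes_)-1))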

In [29]:
model.compile(optimizer='adam',
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
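
Because the final layer has no activation, the network outputs a raw logit, and from_logits=True tells the loss to apply the sigmoid itself. The sketch below shows that calculation for a single example in plain Python. The function names are illustrative, and the formula is the standard numerically stable form of binary cross-entropy, not code taken from Keras.

```python
import math

def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed from a raw logit.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def predict_class(logit, threshold=0.5):
    """Turn a logit into a 0/1 prediction via the sigmoid."""
    return int(sigmoid(logit) >= threshold)

print(bce_from_logit(0.0, 1))                      # ln 2, about 0.693
print(predict_class(1.5), predict_class(-0.3))     # 1 0
```

A logit of 0 corresponds to a probability of 0.5 and costs ln 2 (about 0.693) per example, which is a useful reference point when reading a reported loss.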

In [30]:
model.fit(train_set.drop(columns=['winner']),
         train_set['winner'],batch_size=128,epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x184c353ba60>

In [31]:
model.evaluate(test_set.drop(columns=['winner']),
         test_set['winner'],batch_size=128)



[0.6751484274864197, 0.5156986117362976]

The two values are the test loss and the test accuracy. At roughly 51.6% accuracy, the network does not beat the 52.3% always-predict-white baseline and trails LDA's cross-validated average of about 56.7%.