Below is my attempt to see if I can build a model to predict the winner(and draws) of a chess game(standard variation) on lichess given the players' ratings, the opening variations etc. And given a winner, which method of win. Initially I thought having a higher elo rating should in theory predict the outcome but expirience says otherwise.

In [116]:
import numpy as np

# Processing the data

The dataset has 16 attributes and 20059 data points. I will drop some of the attributes i deem to be irrelevant manually(user_ids , start time of game etc) and the a PCA.

In [117]:
import pandas as pd

#specify data type for the columns in the data
#type = {"id": str, "rated":bool, "t1":float, "t2":float,"turns":int32,"victory_status":str, "winner":str, "inc_code":str,"white_id":str,"white_rating":int32,"black_id":str,"black_rating":int32,"moves":str,"pening_code":str,"opening_name":str,"opening_ply":int32}


#here i drop columns i deem useless e.g(id, user id's etc)
cols_to_use = {1,4,5,6,7,9,11,14,15} #columns to use

data = pd.read_csv("games.csv", usecols=cols_to_use)#read in the csv file with game records
#preview first 5 lines of loaded data
data.head()

Unnamed: 0,rated,turns,victory_status,winner,increment_code,white_rating,black_rating,opening_name,opening_ply
0,False,13,outoftime,white,15+2,1500,1191,Slav Defense: Exchange Variation,5
1,True,16,resign,black,5+10,1322,1261,Nimzowitsch Defense: Kennedy Variation,4
2,True,61,mate,white,5+10,1496,1500,King's Pawn Game: Leonardis Variation,3
3,True,61,mate,white,20+0,1439,1454,Queen's Pawn Game: Zukertort Variation,3
4,True,95,mate,white,30+3,1523,1469,Philidor Defense,5


Remove the target columns i wish to predict:

In [118]:
vstat_data = data.drop('victory_status', axis=1) #data to classify the victory_status
wnr_data = data.drop(columns=['victory_status','winner'], axis=1) #data to classify the victory winner

vstats = data['victory_status']
winners = data['winner'] #targets for winner data

encode data with numerical values:

In [119]:
wnr_data['rated'] = wnr_data['rated'].replace({True:1, False:0}) #change the rated column to 1 and 0
vstat_data['rated'] = vstat_data['rated'].replace({True:1, False:0}) #change the rated column to 1 and 0

opening_name will be one hot encode:

In [120]:
one_hot = pd.get_dummies(wnr_data['opening_name']) #get the one hot encoding of the column
one_hot = one_hot.join(pd.get_dummies(wnr_data['increment_code']))

#drop the original column since it's been encoded
wnr_data = wnr_data.drop('opening_name',axis=1)
wnr_data = wnr_data.drop('increment_code',axis=1)

#join the the encoded column with the rest of the data
wnr_data = wnr_data.join(one_hot)

In [121]:
one_hot = pd.get_dummies(vstat_data['opening_name']) #get the one hot encoding of the column
one_hot = one_hot.join(pd.get_dummies(vstat_data['increment_code']))
one_hot = one_hot.join(pd.get_dummies(vstat_data['winner']))

#drop the original column since it's been encoded
vstat_data = vstat_data.drop('opening_name',axis=1)
vstat_data = vstat_data.drop('increment_code',axis=1)
vstat_data = vstat_data.drop('winner',axis=1)

#join the the encoded column with the rest of the data
vstat_data = vstat_data.join(one_hot)

Now to split the data into training and testing data:

In [122]:
from sklearn.model_selection import train_test_split
#i shuffled them just incase they are sorted in an i missed.
wnr_data.sample(frac=1) #fraction to return randomised = 1

#80 20 split
wnr_train, wnr_test = train_test_split(wnr_data, test_size=0.2)
wnr_train_target, wnr_test_target = train_test_split(winners, test_size=0.2)

In [123]:
#i shuffled them just incase they are sorted in an i missed.
vstat_data.sample(frac=1) #fraction to return randomised = 1

#80 20 split
vstat_train, vstat_test = train_test_split(vstat_data, test_size=0.2)
vstat_train_target, vstat_test_target = train_test_split(vstats, test_size=0.2)

## First I will use a descision tree:

In [124]:
from sklearn import tree

dTree = tree.DecisionTreeClassifier(criterion="entropy", max_depth=50)
dTree.fit(wnr_train, wnr_train_target)
print(dTree.tree_.max_depth)

50


Get the the predictions on the test_data and check the accuract score:

In [125]:
#predict on test data
wnr_predns = dTree.predict(wnr_test)

from sklearn.metrics import accuracy_score
#get the accuracy score using the test targets
print(accuracy_score(wnr_test_target, wnr_predns))

0.48404785643070786


Do the same for predicting victory status:

In [126]:
dTree.max_depth = 10
dTree.fit(vstat_train, vstat_train_target)

#predict on test data
vstat_predns = dTree.predict(vstat_test)

#get the accuracy score using the test targets
print(accuracy_score(vstat_test_target, vstat_predns))

0.5635593220338984


The descision tree classifier produce poor results for both sets of targets. Pruning the wnr tree does not change the score. Prunning vstat data did improve it from 0.47 to 0.55 and max_depth=10 seems to be the sweet spot.

## Next I will give Naive Bayes a go:

In [127]:
from sklearn.naive_bayes import GaussianNB

NBmodel = GaussianNB()

#fit the wnr_data
NBmodel.fit(vstat_train, vstat_train_target)

GaussianNB()

Check the accuracy again

In [128]:
#predict on test data
NB_vstat_predns = NBmodel.predict(vstat_test)

#get the accuracy score using the test targets
print(accuracy_score(vstat_test_target, NB_vstat_predns))

0.11340977068793619


THe accuracy score is laughable!

## Perhaps Logistic Regression might do better?

In [129]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

#fit data
logreg.fit(vstat_train, vstat_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [130]:
vstat_predcns = logreg.predict(vstat_test)
#get accuracy score
print(accuracy_score(vstat_test_target, vstat_predcns))

0.5682951146560319


It has just about the same performance as the pruned decision tree classifier