# <center>**Chess Move Prediction Model**</center>

<img src='chess_board_pic.png'>

## **Table of Contents**

1. [Problem Statement](#problem)
2. [Data Loading and Exploration](#data-loading)
3. [Creating Sentence Structure](#sentence)
4. [Model Selection and Training](#selection)
[<ul>4.1 Bigram Model (Proof of Concept)</ul>](#initial)
[<ul>4.2 N-Gram Model</ul>](#fine)
5. [Model Evaluation](#evaluation)
6. [Conclusion](#conclude)

## **1. Problem Statement** <a class="anchor" id="problem"></a>

The goal of this analysis is to build a model that predicts the next move in a chess game from the moves played so far, using the [Chess Games Data](https://www.kaggle.com/datasets/rishidamarla/chess-games) from Kaggle.

A player could use the model to explore which moves strong players choose in a given position, and so improve their own play.  Because the number of possible chess games grows roughly exponentially with every move, the model is limited to the first few moves of a game.

## **2. Data Loading and Exploration** <a class="anchor" id="data-loading"></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
%%html
<style>
table {
  float: left;
}
</style>

In [3]:
chess_data = pd.read_csv('chess_games.csv')

In [4]:
chess_data.head(2)

| | Game | White | Black | White Elo | Black Elo | White RD | Black RD | WhiteIsComp | BlackIsComp | TimeControl | Date | Time | White Clock | Black Clock | ECO | PlyCount | Result | Result-Winner | Commentaries | Moves |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "fjjvh" vs "FishTest" | fjjvh | FishTest | 818 | 3204 | 70.3 | 51.6 | Yes | Yes | 60+0 | 2016.08.28 | 11:13:00 | 01:00.0 | 01:00.0 | A13 | 64 | 0-1 | Black | White checkmated | 1. e4 e6 2. d4 d5 3. Nd2 c5 4. exd5 exd5 5. Bb... |
| 1 | "fjjvh" vs "birdcostello" | fjjvh | birdcostello | 831 | 3213 | 69.8 | 45.9 | Yes | Yes | 120+0 | 2016.08.11 | 15:16:00 | 02:00.0 | 02:00.0 | C20 | 24 | 0-1 | Black | White checkmated | 1. d4 d5 2. c4 e6 3. Nf3 Nf6 4. g3 c5 5. Bg2 c... |


In [5]:
chess_data.duplicated().sum()

0

There are no duplicated rows in the dataset.

In [6]:
chess_data.shape

(48871, 20)

The dataset contains 48,871 rows and 20 columns.  However, the model will only use the <code>Moves</code> column, which records each game in algebraic chess notation.  For more on the notation, see [this guide to chess notation](https://www.chess.com/terms/chess-notation).

In [7]:
chess_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48871 entries, 0 to 48870
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Game           48871 non-null  object 
 1   White          48871 non-null  object 
 2   Black          48871 non-null  object 
 3   White Elo      48871 non-null  int64  
 4   Black Elo      48871 non-null  int64  
 5   White RD       48871 non-null  float64
 6   Black RD       48871 non-null  float64
 7   WhiteIsComp    48871 non-null  object 
 8   BlackIsComp    48871 non-null  object 
 9   TimeControl    48871 non-null  object 
 10  Date           48871 non-null  object 
 11  Time           48871 non-null  object 
 12  White Clock    48871 non-null  object 
 13  Black Clock    48871 non-null  object 
 14  ECO            48871 non-null  object 
 15  PlyCount       48871 non-null  int64  
 16  Result         48871 non-null  object 
 17  Result-Winner  48871 non-null  object 
 18  Commen

In [8]:
chess_data.isna().sum().sum()

0

The dataset contains no missing values.

In [9]:
chess_data.describe()

| | White Elo | Black Elo | White RD | Black RD | PlyCount |
|---|---|---|---|---|---|
| count | 48871.0 | 48871.0 | 48871.0 | 48871.0 | 48871.0 |
| mean | 2533.101471 | 2534.359579 | 37.259143 | 37.18496 | 123.753535 |
| std | 295.777722 | 293.91872 | 21.276387 | 20.934625 | 59.452701 |
| min | 818.0 | 831.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2334.0 | 2338.0 | 25.7 | 25.7 | 82.0 |
| 50% | 2492.0 | 2496.0 | 32.9 | 33.0 | 117.0 |
| 75% | 2759.0 | 2759.0 | 43.0 | 43.0 | 152.0 |
| max | 3308.0 | 3315.0 | 350.0 | 350.0 | 575.0 |


<code>White Elo</code> and <code>Black Elo</code> represent the strength of the players.  The lowest rating for a chess master is 2200 (the National Master rating).  Since we want this model to predict the moves of strong players, the data will be filtered to include only games where both players have an Elo of at least 2200.

In [14]:
chess_data_above_2200 = chess_data[(chess_data['White Elo'] >= 2200) & (chess_data['Black Elo'] >= 2200)]

In [15]:
chess_data_above_2200.describe()

| | White Elo | Black Elo | White RD | Black RD | PlyCount |
|---|---|---|---|---|---|
| count | 40652.0 | 40652.0 | 40652.0 | 40652.0 | 40652.0 |
| mean | 2594.645208 | 2595.367879 | 36.826658 | 36.754056 | 131.905515 |
| std | 250.885942 | 249.066088 | 19.222952 | 19.03253 | 58.865347 |
| min | 2200.0 | 2200.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2389.0 | 2390.0 | 25.5 | 25.5 | 93.0 |
| 50% | 2542.0 | 2544.0 | 32.5 | 32.5 | 124.0 |
| 75% | 2789.0 | 2789.0 | 43.0 | 43.0 | 159.0 |
| max | 3308.0 | 3315.0 | 310.1 | 312.4 | 575.0 |


In [16]:
chess_data_above_2200.shape

(40652, 20)

The filtered dataset still contains 40,652 games in which both players are rated at least 2200 Elo, which is more than enough to build the model.

## **3. Creating Sentence Structure** <a class="anchor" id="sentence"></a>

The N-gram model requires the data as a list of sentences, where each sentence is itself a list of words.  Since we are analyzing chess games, each sentence is a game and each word is a single move in algebraic notation (e.g. Nf5, O-O, dxe5).  Because chess moves contain special characters, the data will not be tokenized as it would be for a typical NLP model.  Instead, each game's move string will simply be split on the separator characters in the data, so that each list item is one specific move.
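The splitting step can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes moves are separated by whitespace and that move numbers look like `1.` or `12.`, as in the rows shown earlier. The helper `moves_to_sentence`, its `max_plies` parameter, and the sample strings are hypothetical names introduced here.

```python
import re

# Hypothetical helper (not from the original notebook).
# Assumption: tokens such as "1." or "12." are move numbers, not moves.
MOVE_NUMBER = re.compile(r"^\d+\.+$")


def moves_to_sentence(moves_text, max_plies=None):
    """Split one game's move string into a list of individual moves."""
    tokens = moves_text.split()  # split on whitespace only, keeping +, x, -, =
    moves = [tok for tok in tokens if not MOVE_NUMBER.match(tok)]
    # Optionally keep only the first few plies (single moves by one player).
    return moves if max_plies is None else moves[:max_plies]


# Short hand-written samples in the same format as the Moves column.
sample_games = [
    "1. e4 e6 2. d4 d5 3. Nd2 c5",
    "1. d4 Nf6 2. Nf3 e6 3. a3 d5",
]

sentences = [moves_to_sentence(game, max_plies=4) for game in sample_games]
print(sentences)
# [['e4', 'e6', 'd4', 'd5'], ['d4', 'Nf6', 'Nf3', 'e6']]
```

On the real data, the same function could be applied to the <code>Moves</code> column with something like `chess_games.apply(moves_to_sentence)`, assuming the column holds strings in this format.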

In [19]:
chess_games = chess_data_above_2200['Moves']

In [20]:
chess_games.head()

4673    1. d4 Nf6 2. Nf3 e6 3. a3 d5 4. e3 Be7 5. Bd3 ...
4681    1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 d5 5. ex...
4682    1. e4 e6 2. d4 d5 3. Nd2 c5 4. exd5 Qxd5 5. Ng...
4686    1. e4 e6 2. d4 d5 3. Nd2 c5 4. exd5 exd5 5. Ng...
4687    1. e4 c6 2. d4 d5 3. e5 Bf5 4. Nf3 e6 5. Be2 c...
Name: Moves, dtype: object

## **4. Model Selection and Training** <a class="anchor" id="selection"></a>

### 4.1 Bigram Model (Proof of Concept) <a class="anchor" id="initial"></a>
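A bigram model predicts the next move from the immediately preceding move alone. The sketch below shows one way such a proof of concept could look, using plain frequency counts. It is an assumption-laden illustration rather than the notebook's implementation: the toy games are hand-written, and `train_bigram`, `predict_next`, and the `<s>` start marker are hypothetical names.

```python
from collections import Counter, defaultdict

# Toy, hand-written games (not taken from the dataset), already split into moves.
toy_sentences = [
    ["e4", "e6", "d4", "d5"],
    ["e4", "e5", "Nf3", "Nc6"],
    ["e4", "e6", "d4", "c5"],
]

START = "<s>"  # hypothetical start-of-game marker


def train_bigram(sentences):
    """Count how often each move follows each previous move."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        previous = START
        for move in sentence:
            counts[previous][move] += 1
            previous = move
    return counts


def predict_next(counts, previous_move, top_k=3):
    """Return up to top_k (move, relative frequency) pairs, most frequent first."""
    followers = counts.get(previous_move)
    if not followers:
        return []  # previous move never seen in training
    total = sum(followers.values())
    return [(move, n / total) for move, n in followers.most_common(top_k)]


bigram_counts = train_bigram(toy_sentences)
print(predict_next(bigram_counts, START))  # most common first moves
print(predict_next(bigram_counts, "e4"))   # most common replies to e4
```

Because a bigram conditions on only one previous move, it cannot distinguish positions reached by different move orders, which is the motivation for extending the context to longer N-grams.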



### 4.2 N-Gram Model<a class="anchor" id="fine"></a>


## **5. Model Evaluation** <a class="anchor" id="evaluation"></a>

## **6. Conclusion** <a class="anchor" id="conclude"></a>