<br><br>
<img src="https://www.bing.com/th/id/R.538ae9d108825192f1c3aabfdd0d0e1b?rik=uE6T7HeXOudINw&pid=ImgRaw"
     alt="Chess Icon"
     width="120">    

# Chess Opening Recommendation System (Capstone)

**Author:** Thomas Handley  
**Goal:** Predict a player's opening **play-style** from the first moves of a game.

---

## Project Summary
This notebook builds an end-to-end recommendation system that classifies chess openings into broad **play-style categories** using a combination of:
- **Opening move data** (`openings.csv`, Kaggle) with ECO groupings and opening statistics
- **Real-world game dataset** (`games.csv`, Kaggle) containing moves, outcomes, and player ratings
- **Live Chess.com API data** providing a user’s most recent games for inference
- **Feature engineering** on early move sequences (first 8 ply)
- **Random Forest classifier**


The final system predicts whether a player's early-game decisions align more with styles such as:
**Classical, Positional, Tactical, Hypermodern, Dynamic, or Sharp**



In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, balanced_accuracy_score
from sklearn.preprocessing import OrdinalEncoder

import requests
import re

## Exploratory Data Analysis (EDA)

Checks performed:
- `.head()` to validate content and column meaning  
- `.info()` to confirm data types and non-null counts  
- `.isna().sum()` to find missing values

Key observation:
- Some move columns (`move1b` to `move4b`) contain missing values due to shorter games or incomplete sequences.
- Missing moves are expected and handled during feature engineering.
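
As a small illustration of the missing-value check described above, the sketch below runs `.isna().sum()` on a toy frame. The column names mirror the notebook, but the rows are made up and stand in for games that ended before Black's fourth move.

```python
import pandas as pd

# Toy data for illustration only; not taken from openings.csv or games.csv
toy_games = pd.DataFrame({
    "move1w": ["e4", "d4", "e4"],
    "move4b": ["Nf6", None, None],  # two games ended before Black's fourth move
})

# Count missing values per column, as done in the EDA cells
missing = toy_games.isna().sum()
print(missing)
```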

In [None]:
# Loading the openings.csv dataset
df = pd.read_csv("openings.csv", index_col=0)

# Inspecting the dataset
display(df.head())
df.info()
df.isna().sum()

In [None]:
# Loading the games.csv dataset
df2 = pd.read_csv('games.csv', index_col= 'id')

# Inspecting the dataset
display(df2.head())
df2.info()
df2.isna().sum()

In [None]:
# Finding the opening with the highest and lowest win rate
highest_win = df.loc[df['Player Win %'].idxmax(), ['Player Win %', 'Opening', 'Colour', 'Num Games']]
lowest_win = df.loc[df['Player Win %'].idxmin(), ['Player Win %', 'Opening', 'Colour', 'Num Games']]

# Putting the highest and lowest win rate opening book move into a dataframe
win_df = pd.DataFrame({
    "Highest Win Rate": highest_win,
    "Lowest Win Rate": lowest_win
})
win_df

## Feature Engineering

Key transformations applied:

#### ECO Simplification
ECO codes are reduced to their **first letter** (A–E) and mapped into numeric categories.

#### Colour Encoding
Player colour is converted into binary:
- `white → 0`
- `black → 1`
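
The ECO and colour encodings described above can be sketched on a few toy rows. The values below are made up for illustration, and the `.str.lower()` call is an added assumption that guards against mixed capitalisation; any label missing from a mapping becomes `NaN`.

```python
import pandas as pd

# Toy rows; column names mirror the notebook, values are illustrative
toy = pd.DataFrame({
    "ECO": ["B90", "C65", "A04"],
    "Colour": ["white", "Black", "white"],
})

eco_map = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
colour_map = {"white": 0, "black": 1}

# Keep only the ECO volume letter, then map it to an integer
toy["ECO"] = toy["ECO"].astype(str).str[0].map(eco_map)

# Lower-casing first means "White" and "white" encode identically
toy["Colour"] = toy["Colour"].str.lower().map(colour_map)
print(toy)
```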

#### Move Cleaning
Move strings are normalised by removing symbols such as:
- checks (`+`)
- checkmates (`#`)
- captures (`x`)
- promotions (`=Q`, etc.)
- special annotations (`e.p.`)
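
To make the effect of this normalisation concrete, the sketch below strips the symbols listed above with a single regular expression. `normalise_san` is a hypothetical helper written for this example; the notebook's own `clean_move` uses chained `str.replace` calls instead.

```python
import re

# Matches en passant notes, promotion suffixes, and check/mate/capture symbols
_ANNOTATIONS = re.compile(r"e\.p\.|=[QRBN]|[+#x]")

def normalise_san(move):
    """Strip annotation symbols from a SAN move; None is passed through."""
    if move is None:
        return None
    return _ANNOTATIONS.sub("", str(move)).strip()

examples = ["Nxe5+", "Qh7#", "e8=Q", "exd6 e.p.", "O-O"]
cleaned = [normalise_san(m) for m in examples]
print(cleaned)  # ['Ne5', 'Qh7', 'e8', 'ed6', 'O-O']
```

One consequence of this design is that a capture and a quiet move to the same square (for example `Nxe5` and `Ne5`) collapse into the same token, which reduces the number of distinct categories the encoder has to learn.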

#### Move Encoding
Moves are encoded using an `OrdinalEncoder` configured to handle unseen moves safely:
- `handle_unknown="use_encoded_value"`
- `unknown_value=-1`

This ensures the model can still run when the API returns moves not seen in training.
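
A minimal sketch of that behaviour on toy data: the encoder is fitted on three made-up games and then asked to transform moves it has never seen. The name `encoder_demo` and the sample moves are illustrative and not part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training moves; each column is encoded independently
train = pd.DataFrame({
    "move1w": ["e4", "d4", "e4"],
    "move1b": ["e5", "d5", "c5"],
})

encoder_demo = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder_demo.fit(train)

# "b3" and "g6" were not seen during fitting, so they map to -1
new = pd.DataFrame({"move1w": ["e4", "b3"], "move1b": ["g6", "e5"]})
encoded = encoder_demo.transform(new)
print(encoded)
```

Because categories are sorted per column, `e4` encodes to `1` in `move1w` (after `d4`), while `e5` encodes to `2` in `move1b` (after `c5` and `d5`).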

In [None]:
# Converting the ECO feature into a string and taking only the first character
df['ECO'] = df['ECO'].astype(str).str[0]

# Creating a list of features to drop
columns_dropped = ['Num Games', 'Last Played',
       'Avg Player', 'Player Win %', 'Draw %', 'Opponent Win %', 'Moves',
       'moves_list', 'White_win%', 'Black_win%', 'White_odds',
       'White_Wins', 'Black_Wins']

# Dropping the list of features
df = df.drop(columns = columns_dropped)

# Mapping the colour column 0 and 1
df['Colour'] = df['Colour'].map({
    "white" : 0,
    "black" : 1
 })

# Mapping the ECO codes to numerical values
df['ECO'] = df['ECO'].map({
    'A': 1,
    'B': 2,
    'C' : 3,
    'D' : 4,
    'E' : 5
})
df.columns

In [9]:
# Creating a list of columns I will drop
drop_columns = ['rated','created_at', 'last_move_at','turns', 'victory_status', 'winner', 'increment_code', 'opening_ply', 'white_id',
               'black_id']

# Dropping the columns in the list
df2 = df2.drop(columns = drop_columns)

# Converting the ECO feature into a string and taking only the first character
df2['opening_eco'] = df2['opening_eco'].astype(str).str[0]

# Mapping the ECO codes to numerical values
df2['opening_eco'] = df2['opening_eco'].map({
    'A': 1,
    'B': 2,
    'C' : 3,
    'D' : 4,
    'E' : 5
})

# The code below splits each game into two records (one for White and one for Black) so both players are represented separately

# Creating a list to store expanded white and black game rows
new_df2 = []

# Looping through each game and taking the first 8 ply (4 moves per side)
for game in df2.itertuples():
    moves_split = game.moves.split()

    # Capturing the eight moves in a list
    moves8 = []
    for i in range(8):
        if i >= len(moves_split):
            moves8.append(None)
        else:
            moves8.append(moves_split[i])
            
    # Creating a row for the white player
    white = {
        "Colour" : "White",
        "Perf Rating" : game.white_rating,
        "ECO": game.opening_eco,
        "move1w" : moves8[0],
        "move1b" : moves8[1], 
        "move2w" : moves8[2], 
        "move2b" : moves8[3], 
        "move3w" : moves8[4], 
        "move3b" : moves8[5],
        "move4w" : moves8[6],
        "move4b" : moves8[7],
        "Opening" : game.opening_name
    }

    # Creating a row for the black player
    black = {
        "Colour" : "Black",
        "Perf Rating" : game.black_rating,
        "ECO": game.opening_eco,
        "move1w" : moves8[0],
        "move1b" : moves8[1], 
        "move2w" : moves8[2], 
        "move2b" : moves8[3], 
        "move3w" : moves8[4], 
        "move3b" : moves8[5],
        "move4w" : moves8[6],
        "move4b" : moves8[7],
        "Opening" : game.opening_name
    }
    # Appending white and black to the list
    new_df2.append(white)
    new_df2.append(black)

# Converting the list of rows into a dataframe
new_df2 = pd.DataFrame(new_df2)
new_df2.head()


# Mapping White and Black to 0 and 1
new_df2['Colour'] = new_df2['Colour'].map({
    "White" : 0,
    "Black" : 1
})

In [None]:
# Combining the two dataframes together
final_df = pd.concat([new_df2, df], ignore_index=True)

# Shuffles the dataset and resets the index
final_df = final_df.sample(frac = 1, random_state = 42).reset_index(drop=True)

# Inspecting the new dataframe
display(final_df.head())
final_df.shape

In [11]:
# Function that takes an eco code and returns a playstyle
def eco_style(eco):
    if eco == 1:
        return 'Hypermodern'
    elif eco == 2:
        return 'Dynamic'
    elif eco == 3:
        return 'Classical'
    elif eco == 4:
        return 'Positional'
    elif eco == 5:
        return 'Sharp'

# Function that takes an opening name and returns a playstyle
def name_style(opening_name):
    # Missing or non-string names cannot be matched, so fall back to the ECO-based label
    if not isinstance(opening_name, str):
        return None

    opening_name = opening_name.lower()
    if "gambit" in opening_name:
        return 'Tactical'
    elif "attack" in opening_name:
        return 'Tactical'
    elif "declined" in opening_name:
        return 'Positional'
    elif "exchange" in opening_name:
        return 'Positional'
    elif "modern" in opening_name:
        return 'Hypermodern'
    elif "indian" in opening_name:
        return 'Hypermodern'
    elif "sicilian" in opening_name or "dragon" in opening_name or "najdorf" in opening_name:
        return 'Sharp'
    return None

# Assigning a playstyle label using ECO codes and refining it using opening names
final_df['play_style'] = final_df['ECO'].apply(eco_style)
name_change = final_df['Opening'].apply(name_style)
final_df['play_style'] = name_change.fillna(final_df['play_style']) 

In [12]:
# List of move columns to be cleaned
moves = ['move1w', 'move1b', 'move2w', 'move2b', 'move3w',
       'move3b', 'move4w', 'move4b']

# Function to clean move notation by removing special symbols
def clean_move(m):
    # Checks if move is null
    if pd.isna(m):
        return None
    
    m = str(m)
    return (m
            .replace("+", "")
            .replace("#", "")
            .replace("x", "")     
            .replace("e.p.", "")
            .replace("=Q", "")
            .replace("=R", "")
            .replace("=B", "")
            .replace("=N", "")
            .strip()
           )

# Applying the cleaning function to all the move columns
for col in moves:
    final_df[col] = final_df[col].apply(clean_move)

In [13]:
# Encoding move strings into numerical values, while handling unseen moves safely
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
moves_encoded = encoder.fit_transform(final_df[moves])

# Putting the encoded moves into the dataframe
moves_encoded_df = pd.DataFrame(moves_encoded, columns=moves, index=final_df.index)

# Replacing the original move columns with their encoded versions
final_df = final_df.drop(columns= moves)
final_df = pd.concat([final_df, moves_encoded_df], axis = 1)

## Modelling

### Model Choice: Random Forest Classifier
A `RandomForestClassifier` is trained using:
- `n_estimators = 400`
- `max_depth = 20`
- `max_features = "sqrt"`
- `min_samples_leaf = 1`
- `min_samples_split = 2`
- `random_state = 42`

### Features Used
- `Colour`
- `Perf Rating`
- Encoded move sequence (`move1w` … `move4b`)

### Target Variable
`play_style`, a label representing broad opening style categories.

### Train/Test Split
- `test_size = 0.2`
- `random_state = 42`
- `stratify = y`

The model is evaluated on **both** test accuracy and train accuracy to assess generalisation and potential overfitting.
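
To show what `stratify = y` contributes, the sketch below splits a synthetic, imbalanced dataset and compares the minority-class share in each partition. The data and the names `X_demo` and `y_demo` are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an 80/20 imbalanced label vector (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array(["Classical"] * 80 + ["Sharp"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both partitions keep roughly the original class mix
train_share = np.mean(y_tr == "Sharp")
test_share = np.mean(y_te == "Sharp")
print(f"Sharp share in train: {train_share:.2f}, in test: {test_share:.2f}")
```

Without stratification, a rare style could end up under-represented in the test set, which would make per-class metrics noisy.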

In [15]:
# Creating a list of columns going into the model
columns = ['Colour', 'Perf Rating']

# Creating the X and y variables for the model
X = final_df[columns + moves]
y = final_df['play_style']

In [None]:
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size = 0.2, random_state = 42, stratify = y)

# Initialising the random forest model
base_model = RandomForestClassifier(n_estimators = 400, random_state = 42)

# Fitting the model
base_model.fit(X_train, y_train)

# Generating predictions for training and test sets
y_train_pred = base_model.predict(X_train)
y_test_pred = base_model.predict(X_test)

# Calculating accuracy for both sets
test_accuracy = accuracy_score(y_test, y_test_pred)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Printing the train and test accuracy
print(f"Test Accuracy: {test_accuracy}")
print(f"Train Accuracy: {train_accuracy}")

In [None]:
# Setting up a grid search to tune random forest hyperparameters
grid = GridSearchCV(
    estimator = RandomForestClassifier(random_state=42),
    param_grid = {
        'n_estimators': [200, 400],    
        'max_depth': [None, 20, 40],     
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt']    
    },
    cv = 3,                         
    refit = True,
    verbose = 2,
    scoring = 'accuracy',
    n_jobs = -1                      
)

# Fitting the grid search
grid.fit(X_train, y_train)

In [None]:
# Outputting the best parameters found by the grid search
grid.best_params_

In [None]:
# Creating a random forest model with the best parameters
final_model = RandomForestClassifier(
    n_estimators=400,
    max_depth=20,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42
)

# Fitting the model on the train data
final_model.fit(X_train, y_train)

# Generating predictions for training and test sets
y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)

# Calculating accuracy for both sets
test_accuracy = accuracy_score(y_test, y_test_pred)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Printing the train and test accuracy
print(f"Test Accuracy: {test_accuracy}")
print(f"Train Accuracy: {train_accuracy}")

## Results & Evaluation

Model evaluation includes:
- Test set accuracy
- Training set accuracy
- Balanced Accuracy
- Weighted F1
- Classification Report
- Confusion matrix

Interpretation:
- Strong test performance indicates that the first 8 ply carry enough signal to recover the ECO- and name-based play-style label.
- A training accuracy notably higher than the test accuracy suggests some overfitting, which is common for high-capacity ensemble models such as a random forest.
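
The reason for reporting balanced accuracy and weighted F1 alongside plain accuracy can be seen on a toy example with imbalanced labels. The labels below are invented, and the "model" simply predicts the majority class.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels: 8 Classical games, 2 Sharp games (illustrative only)
y_true = ["Classical"] * 8 + ["Sharp"] * 2
# A degenerate classifier that always predicts the majority class
y_pred = ["Classical"] * 10

acc = accuracy_score(y_true, y_pred)           # fraction of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
wf1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy: {acc:.2f}")           # 0.80
print(f"Balanced accuracy: {bal:.2f}")  # 0.50
print(f"Weighted F1: {wf1:.2f}")
```

Plain accuracy looks acceptable here even though the minority class is never predicted; balanced accuracy exposes this because it averages recall across classes.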

In [None]:
# Generating predictions for training and test sets
y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)

# Calculating accuracy for both sets
test_accuracy = accuracy_score(y_test, y_test_pred)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Printing the final train and test accuracy
print("Final Model performance")
print(f"Test Accuracy: {test_accuracy:.2f}")
print(f"Train Accuracy: {train_accuracy:.2f}")

# Printing the Balanced Accuracy and Weighted F1 
print("\nAdditional metrics (multi-class):")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_test_pred):.2f}")
print(f"Weighted F1      : {f1_score(y_test, y_test_pred, average='weighted'):.2f}")

# Printing the classification report
print("\nClassification report (test set)")
print(classification_report(y_test, y_test_pred))

# Creating and outputting the Confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred, xticks_rotation=45)
plt.title("Play-Style Prediction Confusion Matrix (Final Model)")
plt.tight_layout()
plt.show()

## API-Based Inference & User Profiling

This section applies the trained model to real Chess.com games.

Given a username, the system:
- Retrieves recent games via the Chess.com public API
- Parses PGN data and extracts the first 8 ply
- Builds feature rows (Colour, Rating, Encoded Moves)
- Predicts play-style using the trained model
- Applies a rule-based adjustment from opening names
- Aggregates predictions to identify the user's dominant play-style
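
The PGN parsing step can be tried offline before calling the API. The sketch below is a simplified stand-in for the notebook's `extract_moves_from_pgn`: `first_plies` and `SAMPLE_PGN` are hypothetical names, and the sample game is hand-written in the clock-annotated style that Chess.com PGNs typically use.

```python
import re

# Hand-written sample; not fetched from the API
SAMPLE_PGN = """[Event "Live Chess"]
[White "player_a"]
[Black "player_b"]

1. e4 {[%clk 0:09:59]} 1... c5 {[%clk 0:09:58]} 2. Nf3 {[%clk 0:09:57]} 2... d6 {[%clk 0:09:55]} 3. d4 1-0
"""

RESULT_TOKENS = {"1-0", "0-1", "1/2-1/2", "*"}

def first_plies(pgn_text, n=8):
    """Return the first n ply of a PGN game, padded with None if the game is shorter."""
    # Drop header lines such as [Event "..."]
    body = "\n".join(line for line in pgn_text.splitlines() if not line.startswith("["))
    # Drop {...} comments (clock times) and move numbers like "1." or "1..."
    body = re.sub(r"\{[^}]*\}", " ", body)
    body = re.sub(r"\d+\.+", " ", body)
    tokens = [t for t in body.split() if t not in RESULT_TOKENS]
    return (tokens + [None] * n)[:n]

plies = first_plies(SAMPLE_PGN)
print(plies)  # ['e4', 'c5', 'Nf3', 'd6', 'd4', None, None, None]
```

Padding with `None` keeps every game at the same width, so short games still produce a complete feature row.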

In [None]:
# Header to identify the request when calling the API
headers = {"User-Agent": "Mozilla/5.0"}

# Fetches the games from the most recent monthly archive for a given Chess.com username
def fetch_latest_games(username):
    archives = requests.get(
        f"https://api.chess.com/pub/player/{username}/games/archives",
        headers=headers,
        timeout=10
    ).json()

    # A missing key or an empty list would otherwise fail on the [-1] lookup below
    if not archives.get("archives"):
        raise ValueError("No archives available for this user.")

    latest = archives["archives"][-1]
    data = requests.get(latest, headers=headers, timeout=10).json()
    return data["games"]

# Extracts the first 8 ply (4 moves per side) from PGN text and removes annotations
def extract_moves_from_pgn(pgn_text):
    pgn_body = "\n".join(
        line for line in pgn_text.split("\n")
        if not line.startswith("[")
    )
    pgn_body = re.sub(r"\{.*?\}", "", pgn_body)
    pgn_body = re.sub(r"\$\d+", "", pgn_body)
    pgn_body = re.sub(r"\b1-0\b|\b0-1\b|\b1\/2-1\/2\b", "", pgn_body)

    tokens = pgn_body.split()
    clean_moves = []

    for t in tokens:
        if re.match(r"^\d+\.+$", t):
            continue
        if re.match(r"^\d+\.\.\.$", t):
            continue
        if t.startswith("(") or t.endswith(")"):
            continue
        if t in ["*", "#", "+"]:
            continue
        if re.match(r"[^a-zA-Z0-9=+#-]", t):
            continue

        clean_moves.append(t)
    # Ensures exactly 8 ply per game, padding shorter games with None
    moves = clean_moves[:8] + [None] * (8 - len(clean_moves[:8]))
    return moves

# Converts fetched games into a feature DataFrame for prediction
def convert_games_to_features(games, username):
    rows = []
    username = username.lower()

    for g in games:
        pgn = g.get("pgn", "")
        moves = extract_moves_from_pgn(pgn)

        white = g["white"]["username"].lower()
        black = g["black"]["username"].lower()

        if username == white:
            colour = 0
        else:
            colour = 1

        rows.append({
            "Colour": colour,
            "Perf Rating": g["white"]["rating"] if colour == 0 else g["black"]["rating"],
            "move1w": moves[0],
            "move1b": moves[1],
            "move2w": moves[2],
            "move2b": moves[3],
            "move3w": moves[4],
            "move3b": moves[5],
            "move4w": moves[6],
            "move4b": moves[7]
        })

    return pd.DataFrame(rows)

# Takes the inputted username and fetches their latest games
username = input("Enter Chess.com username: ").lower()
games = fetch_latest_games(username)
print(f"Fetched {len(games)} games.")

# Building a DataFrame from the user’s games
df_user = convert_games_to_features(games, username)
print("\nExtracted raw user data:")
print(df_user.head())

# Extracting opening names from PGN metadata when available
opening_names = []

for g in games:
    pgn = g.get("pgn", "")

    eco_url = re.search(r'\[ECOUrl \"(.*?)\"\]', pgn)

    if eco_url:
        url = eco_url.group(1)

        name_part = url.split("/openings/")[-1]

        name_clean = re.split(r"-\d", name_part)[0]

        # Replace hyphens with spaces
        name_clean = name_clean.replace("-", " ")

        opening_names.append(name_clean.strip())

    else:
        opening_names.append("Unknown Opening")

df_user["Opening"] = opening_names

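# Cleaning the moves with the same function used on the training data, then encoding them.
# Without this step, moves such as "Nxe5" or "Qh7+" would not match the cleaned training tokens.
for col in moves:
    df_user[col] = df_user[col].apply(clean_move)
df_user[moves] = encoder.transform(df_user[moves])
df_user = df_user.fillna(0)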

X_user = df_user[columns + moves]

# Generating playstyle predictions
df_user["play_style_model"] = final_model.predict(X_user)
df_user["style_from_name"] = df_user["Opening"].apply(lambda x: name_style(str(x)))
df_user["play_style_final"] = df_user["style_from_name"].fillna(df_user["play_style_model"])

# Printing predicted play styles
print("Predicted play styles:")
print(df_user[["Opening", "play_style_final"]].head())

In [None]:
# Outputting the play style with the highest value count in the series 
style_counts = df_user["play_style_final"].value_counts()
main_style = style_counts.idxmax()
print(f"Your main play-style is: {main_style}")