# <ins>Guess the Elo</ins>
## Concept
In this Jupyter notebook, we build a model that guesses White's rating from two inputs: how White performed against Black, and Black's rating.
The prediction is inherently noisy, since the same player can play a very good game and a very bad game on the same day. We only hope that the prediction is as accurate as possible on average.
The idea was inspired by "Guess the Elo", a popular YouTube show by GothamChess.

<hr/>

## Import libraries and data

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, Binarizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.svm import SVR

In [2]:
dataset = pd.read_csv('data/chess_simplified.csv')
X = dataset.drop(labels=['white_rating', 'moves'], axis=1)
y = dataset['white_rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

<hr/>

## Preprocessing

In [3]:
# The label has to be standardized outside of the pipeline
ss_y = StandardScaler()
y_train = ss_y.fit_transform(y_train.values.reshape(-1,1)).reshape(-1)
#y_test = ss_y.transform(y_test.values.reshape(-1,1)).reshape(-1)

# preprocessing step in the pipeline
numerical_features = ['nb_turns', 'black_rating', 'nb_opening_moves', 'game_time', 'increment']
boolean_features = ['is_rated', 'out_of_time', 'resign']
categorical_features_ordinal = ['result']
categorical_features_onehot = ['opening_name']

preprocessor = ColumnTransformer([('scale', StandardScaler(), numerical_features),
                                  ('ordinal', OrdinalEncoder(categories=[['lose', 'draw', 'win']]), categorical_features_ordinal),
                                  ('onehot', OneHotEncoder(drop='first'), categorical_features_onehot),
                                  ('binarize', Binarizer(), boolean_features)])

<hr/>

## Multilinear Regression

In [19]:
model = LinearRegression()
pipeline_lin = Pipeline([('preprocessor', preprocessor), ('model', model)])
pipeline_lin.fit(X_train, y_train)

score_lin = mean_absolute_error(y_test, ss_y.inverse_transform(pipeline_lin.predict(X_test).reshape(-1,1)))
print(f'The best multilinear regression model has the following MAE : {round(score_lin)}')

The best multilinear regression model has the following MAE : 158
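
The predictions are converted back with `ss_y.inverse_transform` before the MAE is computed, so the score is expressed in rating points. Because standardization is an affine map, an MAE measured in standardized units would be smaller by a factor equal to the standard deviation of the label. A minimal check of this relation on made-up numbers (all variable names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.normal(1500, 300, size=1000)   # fake ratings

# StandardScaler uses the mean and the population standard deviation (ddof=0)
mu, sigma = y_true.mean(), y_true.std()
z_true = (y_true - mu) / sigma

# Fake predictions made in standardized units
z_pred = z_true + rng.normal(0, 0.5, size=1000)

# Same operation as inverse_transform
y_pred = z_pred * sigma + mu

mae_z = np.mean(np.abs(z_true - z_pred))        # MAE in standardized units
mae_rating = np.mean(np.abs(y_true - y_pred))   # MAE in rating points
```

Here `mae_rating` equals `sigma * mae_z` up to floating-point error.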


<hr/>

## Ridge Regression

In [20]:
model = Ridge()
pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
params={
    'model__alpha': np.linspace(1.0, 100.0, 100)
}
grid_ridge = GridSearchCV(pipeline, param_grid=params, cv=5)
grid_ridge.fit(X_train, y_train)

score_ridge = mean_absolute_error(y_test, ss_y.inverse_transform(grid_ridge.predict(X_test).reshape(-1,1)))
print(f'Here is the best choice of parameters : {grid_ridge.best_params_}')
print(f'The best Ridge regression model has the following MAE : {round(score_ridge)}')

Here is the best choice of parameters : {'model__alpha': 32.0}
The best Ridge regression model has the following MAE : 158


<hr/>

## Lasso Regression

In [21]:
model = Lasso()
pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
params={
    'model__alpha': np.linspace(0.0001, 0.001, 10)
}
grid_lasso = GridSearchCV(pipeline, param_grid=params, cv=5)
grid_lasso.fit(X_train, y_train)

score_lasso = mean_absolute_error(y_test, ss_y.inverse_transform(grid_lasso.predict(X_test).reshape(-1,1)))
print(f'Here is the best choice of parameters : {grid_lasso.best_params_}')
print(f'The best Lasso regression model has the following MAE : {round(score_lasso)}')

Here is the best choice of parameters : {'model__alpha': 0.0001}
The best Lasso regression model has the following MAE : 158


<hr/>

## Random Forest Regression

In [22]:
model = RandomForestRegressor(n_estimators=500, random_state=99)
pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
params={
    'model__max_features': ['sqrt'],
    #'model__min_samples_leaf': np.arange(2, 43, 10),
    #'model__max_depth': np.arange(4, 13, 4)
    'model__max_depth': [30], #30 gave 0.55
    #'model__min_samples_leaf': [10]
}
grid_rf = GridSearchCV(pipeline, param_grid=params, cv=5)
grid_rf.fit(X_train, y_train)

score_rf = mean_absolute_error(y_test, ss_y.inverse_transform(grid_rf.predict(X_test).reshape(-1,1)))
print(f'Here is the best choice of parameters : {grid_rf.best_params_}')
print(f'The best Random Forest regression model has the following MAE : {round(score_rf)}')

Here is the best choice of parameters : {'model__max_depth': 30, 'model__max_features': 'sqrt'}
The best Random Forest regression model has the following MAE : 144


<hr/>

## SVM

In [24]:
model = SVR()
pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
params={
    'model__kernel': ['rbf', 'sigmoid', 'linear'],
    #'model__degree': [2, 3, 4],
    'model__C': [ 0.75, 1.0, 1.25]
}
grid_svr = GridSearchCV(pipeline, param_grid=params, cv=5)
grid_svr.fit(X_train, y_train)

score_svr = mean_absolute_error(y_test, ss_y.inverse_transform(grid_svr.predict(X_test).reshape(-1,1)))
print(f'Here is the best choice of parameters : {grid_svr.best_params_}')
print(f'The SVM regression model has the following MAE : {round(score_svr)}')

Here is the best choice of parameters : {'model__C': 1.0, 'model__kernel': 'rbf'}
The SVM regression model has the following MAE : 149


<hr/>

## Ensemble Learning

In [25]:
lin = LinearRegression()
pipeline_lin = Pipeline([('prep', preprocessor), ('lin', lin)])
# max_depth=40 is used here, whereas the grid search above only evaluated max_depth=30
rf = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=40, random_state=99)
pipeline_rf = Pipeline([('prep', preprocessor), ('rf', rf)])
svr = SVR(kernel='rbf', C=1.0)
pipeline_svr = Pipeline([('prep', preprocessor), ('svr', svr)])

vote = VotingRegressor([('linear', pipeline_lin), ('random_forest', pipeline_rf), ('SVR', pipeline_svr)])
vote.fit(X_train, y_train)

score_vote = mean_absolute_error(y_test, ss_y.inverse_transform(vote.predict(X_test).reshape(-1,1)))
print(f'The voting regressor model has the following MAE : {round(score_vote)}')

The voting regressor model has the following MAE : 147


<hr/>

## Feature Selection
Since the Random Forest achieved the lowest MAE (144), we use it to check whether any features can be dropped. The number of estimators is reduced from 500 to 100 so that the selection does not take too long.

In [29]:
rf_selec = RandomForestRegressor(n_estimators=100, max_depth=30, max_features='sqrt', random_state=99)
selec = RFECV(estimator=rf_selec, cv=5, min_features_to_select=1, step=1, scoring='r2')
X_selec = preprocessor.fit_transform(X)
y_selec = StandardScaler().fit_transform(y.values.reshape(-1,1)).reshape(-1)

selec.fit(X_selec, y_selec)

In [34]:
print(selec.ranking_)
print(selec.support_)
truc = X_selec.toarray()

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 3 1 1 4 1 1 1 1 1 1]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True False False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True False  True  True False
  True  True  True  True  True  True]


We see that the only features that are not kept (rank greater than 1) are four of the dummy variables that encode openings.