# XG Model Selection

This notebook evaluates several models fitted to the shots data and tunes their hyperparameters. The chosen model will then be used to predict the likelihood that a given shot turns into a goal. I will use a combination of the accuracy, recall and precision scores to select the most suitable model. First I will read the clean dataset containing the shots from every match. Next I will apply the preprocessing steps identified in the Exploratory Data Analysis notebook. Finally I will run a stratified K-fold cross-validation over each model and each combination of hyperparameters, compare the results, and select the model for our predictions.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from scipy.stats import boxcox, skew
from sklearn.model_selection import GridSearchCV, StratifiedKFold


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('../data/clean/xg/model_data.csv',index_col="id")

In [3]:
df["duration"] = np.sqrt(df["duration"])

In [4]:
df["location_x"] = boxcox(df["location_x"])[0]

In [5]:
df = df.drop(["minute", "second"], axis=1)

In [6]:
df["outcome"] = df["outcome"].apply(lambda x: 1.0 if x=="Goal" else 0.0)

In [7]:
min_max = MinMaxScaler()

scaled_cols = ['possession','duration','location_x','location_y']

df[scaled_cols] = min_max.fit_transform(df[scaled_cols])

In [8]:
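# Instantiated but not used below; pd.get_dummies performs the encoding.
one_hot = OneHotEncoder()

encoded_cols = ['under_pressure','play','type','technique',
                'body_part','first_time','one_on_one','aerial_won',
                'pos','redirect','deflected','open_goal','follows_dribble']

# One dummy column per category, dropping the first level of each feature.
df_encoded_cols = pd.get_dummies(df[encoded_cols], drop_first=True)

# Swap the raw categorical columns for their dummy-encoded versions.
df = df.drop(encoded_cols, axis=1)

df = pd.concat([df, df_encoded_cols], axis=1)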



In [9]:
X = df.drop('outcome', axis=1)

y = df['outcome']

After preprocessing the data and separating the features (X) from the target (y), I will now instantiate a set of models and tune their hyperparameters. The chosen models are Logistic Regression, K Nearest Neighbors, Support Vector Machine, Random Forest, Naive Bayes and a Multi-Layer Perceptron.
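The cells that follow repeat the same pattern for every algorithm: build a grid, run `GridSearchCV` with a `StratifiedKFold`, and keep a fixed set of result columns. As a minimal sketch of that pattern, the helper below wraps it in one function. The name `grid_search_summary`, the synthetic data from `make_classification` and the shortened grids are assumptions for illustration only, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

RESULT_COLS = ["params", "mean_test_accuracy", "std_test_accuracy",
               "mean_test_roc_auc", "mean_test_recall",
               "mean_test_precision", "mean_test_f1"]


def grid_search_summary(name, estimator, param_grid, X, y, scoring, n_splits=5):
    """Hypothetical helper: run one grid search and return a tidy summary."""
    search = GridSearchCV(estimator, param_grid=param_grid, scoring=scoring,
                          cv=StratifiedKFold(n_splits=n_splits), refit="accuracy")
    search.fit(X, y)
    summary = pd.DataFrame(search.cv_results_)[RESULT_COLS].copy()
    summary.insert(0, "Algorithm", name)
    return summary


# Synthetic, imbalanced stand-in for the shots data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     weights=[0.85], random_state=0)

scoring = ["accuracy", "roc_auc", "precision", "recall", "f1"]

# Shortened grids so the sketch runs quickly.
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [1.0, 0.1]}),
    "Naive Bayes": (GaussianNB(), {"var_smoothing": [1e-10, 1e-9]}),
}

demo_results = pd.concat(
    [grid_search_summary(name, est, grid, X_demo, y_demo, scoring)
     for name, (est, grid) in candidates.items()],
    ignore_index=True,
)
print(demo_results[["Algorithm", "params", "mean_test_accuracy"]])
```

Each row of `demo_results` corresponds to one hyperparameter combination, which is the same shape the per-algorithm cells produce before they are concatenated.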

In [10]:
scoring = ['accuracy', 'roc_auc', 'precision','recall','f1']
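Precision is undefined on a fold where a model never predicts a goal, and scikit-learn reports that case with an `UndefinedMetricWarning`. A possible way to make the handling explicit, sketched here under the assumption that scoring such folds as 0 is acceptable, is to build the scorers with `zero_division=0`. The names `scoring_quiet` and `never_goal` and the random demo data are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Same metrics as the string names, but with an explicit value for the
# undefined case (no positive predictions in a fold).
scoring_quiet = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, zero_division=0),
    "recall": make_scorer(recall_score, zero_division=0),
    "f1": make_scorer(f1_score, zero_division=0),
}

# Random, imbalanced demo data (assumption, not the shots dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (rng.random(200) < 0.12).astype(int)

# A classifier that always predicts "no goal" triggers the undefined case.
never_goal = DummyClassifier(strategy="constant", constant=0)
cv_scores = cross_validate(never_goal, X_demo, y_demo,
                           cv=StratifiedKFold(n_splits=5),
                           scoring=scoring_quiet)

print(cv_scores["test_precision"])  # all zeros, by the zero_division choice
print(cv_scores["test_accuracy"].mean())
```

The always-negative classifier still reaches a high accuracy on imbalanced labels, which is one reason accuracy alone is not enough to choose between the models.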

In [11]:
solvers = ['newton-cg', 'lbfgs']
# Note: recent scikit-learn releases expect None rather than the string 'none'.
penalty = ['l2','none']
c_values = [100, 10, 1.0, 0.1, 0.01]

lr_grid = dict(solver=solvers, penalty=penalty, C=c_values)

skf = StratifiedKFold(n_splits=5)

lr = LogisticRegression(max_iter=1000)

lr_cv = GridSearchCV(lr, scoring=scoring, cv=skf, param_grid=lr_grid, refit='accuracy')

lr_cv.fit(X,y)

df_lr_grid = lr_cv.cv_results_





In [12]:
df_lr_grid = pd.DataFrame(df_lr_grid)[['params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

df_lr_grid["Algorithm"] = "Logistic Regression"

In [13]:
n_neighbors = [5, 50, 100]
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']

knn_grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)

knn = KNeighborsClassifier()

skf = StratifiedKFold(n_splits=5)

knn_cv = GridSearchCV(knn, scoring=scoring, cv=skf, param_grid=knn_grid, refit='accuracy')

knn_cv.fit(X,y)

df_knn_grid = knn_cv.cv_results_



In [14]:
df_knn_grid = pd.DataFrame(df_knn_grid)[['params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

df_knn_grid["Algorithm"] = "K Nearest Neighbors"

In [15]:
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]

svc_grid = dict(kernel=kernel, C=C)

svc = SVC()

skf = StratifiedKFold(n_splits=5)

svc_cv = GridSearchCV(svc, scoring=scoring, cv=skf, param_grid=svc_grid, refit='accuracy')

svc_cv.fit(X,y)

df_svc_grid = svc_cv.cv_results_




In [16]:
df_svc_grid = pd.DataFrame(df_svc_grid)[['params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

df_svc_grid["Algorithm"] = "Support Vector Machine"

In [17]:
n_estimators = [10, 100, 1000]
max_features = ['sqrt', 'log2']

rf_grid = dict(n_estimators=n_estimators, max_features=max_features)

rf = RandomForestClassifier()

skf = StratifiedKFold(n_splits=5)

rf_cv = GridSearchCV(rf, scoring=scoring, cv=skf, param_grid=rf_grid, refit='accuracy')

rf_cv.fit(X,y)

df_rf_grid = rf_cv.cv_results_

In [18]:
df_rf_grid = pd.DataFrame(df_rf_grid)[['params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

df_rf_grid["Algorithm"] = "Random Forest"

In [19]:
nb_grid = dict(var_smoothing=[1e-11, 1e-10, 1e-9])

nb = GaussianNB()

skf = StratifiedKFold(n_splits=5)

nb_cv = GridSearchCV(nb, scoring=scoring, cv=skf, param_grid=nb_grid, refit='accuracy')

nb_cv.fit(X,y)

df_nb_grid = nb_cv.cv_results_

In [20]:
df_nb_grid = pd.DataFrame(df_nb_grid)[['params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

df_nb_grid["Algorithm"] = "Naive Bayes"

In [21]:
hidden_layers = [(32,16,8,4, ), (16,8,4,)]
learning_rate = ['constant', 'adaptive']
solver = ['sgd','adam']
activation = ['tanh','relu','logistic']
mlp_grid = dict(hidden_layer_sizes=hidden_layers,learning_rate=learning_rate,solver=solver,activation=activation)

mlp = MLPClassifier(max_iter=1000)

skf = StratifiedKFold(n_splits=5)

mlp_cv = GridSearchCV(mlp, scoring=scoring, cv=skf, param_grid=mlp_grid, refit='accuracy')

mlp_cv.fit(X,y)

df_mlp_grid = mlp_cv.cv_results_




In [22]:
df_mlp_grid = pd.DataFrame(df_mlp_grid)[['params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

df_mlp_grid["Algorithm"] = "Multi-Layer Perceptron"

In [23]:
df_results = pd.concat([df_lr_grid, df_knn_grid, df_nb_grid, df_svc_grid, df_rf_grid, df_mlp_grid],axis=0,ignore_index=True)

df_results = df_results[['Algorithm','params', 'mean_test_accuracy','std_test_accuracy','mean_test_roc_auc','mean_test_recall','mean_test_precision','mean_test_f1']]

In [30]:
df_results.sort_values('mean_test_accuracy',ascending=False)[:10]

index,Algorithm,params,mean_test_accuracy,std_test_accuracy,mean_test_roc_auc,mean_test_recall,mean_test_precision,mean_test_f1
58,Random Forest,"{'max_features': 'sqrt', 'n_estimators': 1000}",0.89181,0.003448,0.845632,0.201134,0.744999,0.316532
61,Random Forest,"{'max_features': 'log2', 'n_estimators': 1000}",0.89181,0.002805,0.837839,0.191589,0.764941,0.306108
57,Random Forest,"{'max_features': 'sqrt', 'n_estimators': 100}",0.890795,0.002595,0.836181,0.201131,0.726519,0.314289
60,Random Forest,"{'max_features': 'log2', 'n_estimators': 100}",0.890575,0.002395,0.830286,0.189114,0.740257,0.300815
13,Logistic Regression,"{'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}",0.890178,0.001866,0.799456,0.18452,0.741113,0.295043
12,Logistic Regression,"{'C': 0.1, 'penalty': 'l2', 'solver': 'newton-...",0.890134,0.001857,0.799424,0.184167,0.740753,0.294558
76,Multi-Layer Perceptron,"{'activation': 'relu', 'hidden_layer_sizes': (...",0.889648,0.001775,0.79832,0.199009,0.704623,0.310158
45,Support Vector Machine,"{'C': 10, 'kernel': 'rbf'}",0.889296,0.002101,0.702896,0.165433,0.758648,0.27125
83,Multi-Layer Perceptron,"{'activation': 'logistic', 'hidden_layer_sizes...",0.889163,0.002638,0.811248,0.209259,0.695409,0.318882
8,Logistic Regression,"{'C': 1.0, 'penalty': 'l2', 'solver': 'newton-...",0.888987,0.002356,0.803851,0.203609,0.687243,0.313602


Despite having the highest accuracy, the Random Forest has a very low recall (about 0.20). The next-ranked algorithm, Logistic Regression, has an even lower recall (about 0.18). Let's see if we can find other models with a higher recall without giving up too much accuracy.
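One way to make "without giving up too much accuracy" concrete is to keep only the configurations within a fixed margin of the best accuracy and rank those by recall. The sketch below does this on a handful of rows copied from the result tables in this notebook. The 0.02 margin and the names `demo` and `shortlist` are assumptions chosen for illustration.

```python
import pandas as pd

# A few rows copied from the notebook's result tables (index = row id).
demo = pd.DataFrame(
    {
        "Algorithm": ["Random Forest", "Logistic Regression", "Naive Bayes",
                      "Multi-Layer Perceptron", "Multi-Layer Perceptron"],
        "mean_test_accuracy": [0.89181, 0.890178, 0.737386, 0.877299, 0.885547],
        "mean_test_recall": [0.201134, 0.18452, 0.50725, 0.288111, 0.237191],
    },
    index=[58, 13, 38, 73, 75],
)

# Hypothetical tolerance: accept at most 0.02 less accuracy than the best row.
max_accuracy_drop = 0.02
best_accuracy = demo["mean_test_accuracy"].max()

shortlist = (
    demo[demo["mean_test_accuracy"] >= best_accuracy - max_accuracy_drop]
    .sort_values("mean_test_recall", ascending=False)
)
print(shortlist)
```

With this margin the Naive Bayes row drops out because of its much lower accuracy, and the remaining rows are ordered by recall. A different margin would give a different shortlist.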

In [31]:
df_results.sort_values('mean_test_recall',ascending=False)[:10]

index,Algorithm,params,mean_test_accuracy,std_test_accuracy,mean_test_roc_auc,mean_test_recall,mean_test_precision,mean_test_f1
38,Naive Bayes,{'var_smoothing': 1e-11},0.737386,0.169772,0.727536,0.50725,0.304302,0.345931
39,Naive Bayes,{'var_smoothing': 1e-10},0.779642,0.118402,0.727596,0.450324,0.326632,0.34777
40,Naive Bayes,{'var_smoothing': 1e-09},0.822781,0.053822,0.72766,0.402964,0.358967,0.361285
73,Multi-Layer Perceptron,"{'activation': 'relu', 'hidden_layer_sizes': (...",0.877299,0.001979,0.796235,0.288111,0.514398,0.366449
65,Multi-Layer Perceptron,"{'activation': 'tanh', 'hidden_layer_sizes': (...",0.867199,0.005709,0.774921,0.281724,0.451346,0.345911
63,Multi-Layer Perceptron,"{'activation': 'tanh', 'hidden_layer_sizes': (...",0.873153,0.003018,0.776528,0.268293,0.488027,0.341246
79,Multi-Layer Perceptron,"{'activation': 'logistic', 'hidden_layer_sizes...",0.883297,0.004443,0.800339,0.261236,0.592116,0.355147
75,Multi-Layer Perceptron,"{'activation': 'relu', 'hidden_layer_sizes': (...",0.885547,0.003559,0.815508,0.237191,0.612947,0.338983
69,Multi-Layer Perceptron,"{'activation': 'tanh', 'hidden_layer_sizes': (...",0.884047,0.002231,0.810135,0.227657,0.596111,0.327985
67,Multi-Layer Perceptron,"{'activation': 'tanh', 'hidden_layer_sizes': (...",0.883121,0.0039,0.803705,0.222696,0.584057,0.322106


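The Multi-Layer Perceptron seems to be a good compromise. Its highest-recall configuration (index 73) has an accuracy of 0.877, not much lower than the Random Forest's 0.892, and its recall of 0.288, despite not being great, is clearly higher than the Random Forest's 0.201.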

In [32]:
df_results.sort_values('mean_test_precision',ascending=False)[:10]

index,Algorithm,params,mean_test_accuracy,std_test_accuracy,mean_test_roc_auc,mean_test_recall,mean_test_precision,mean_test_f1
25,K Nearest Neighbors,"{'metric': 'euclidean', 'n_neighbors': 100, 'w...",0.875799,0.000269,0.69876,0.006008,0.82,0.011923
37,K Nearest Neighbors,"{'metric': 'minkowski', 'n_neighbors': 100, 'w...",0.875799,0.000269,0.69876,0.006008,0.82,0.011923
31,K Nearest Neighbors,"{'metric': 'manhattan', 'n_neighbors': 100, 'w...",0.878975,0.001025,0.710343,0.039588,0.809188,0.075436
35,K Nearest Neighbors,"{'metric': 'minkowski', 'n_neighbors': 50, 'we...",0.878005,0.000783,0.68634,0.031102,0.788086,0.059721
23,K Nearest Neighbors,"{'metric': 'euclidean', 'n_neighbors': 50, 'we...",0.878005,0.000783,0.68634,0.031102,0.788086,0.059721
61,Random Forest,"{'max_features': 'log2', 'n_estimators': 1000}",0.89181,0.002805,0.837839,0.191589,0.764941,0.306108
48,Support Vector Machine,"{'C': 1.0, 'kernel': 'rbf'}",0.881798,0.000911,0.658225,0.07741,0.760371,0.140337
45,Support Vector Machine,"{'C': 10, 'kernel': 'rbf'}",0.889296,0.002101,0.702896,0.165433,0.758648,0.27125
29,K Nearest Neighbors,"{'metric': 'manhattan', 'n_neighbors': 50, 'we...",0.881357,0.001419,0.696403,0.073525,0.753857,0.133904
28,K Nearest Neighbors,"{'metric': 'manhattan', 'n_neighbors': 50, 'we...",0.879107,0.001402,0.700162,0.047012,0.753312,0.088373


In [33]:
df_results.loc[73,:]

Algorithm                                         Multi-Layer Perceptron
params                 {'activation': 'relu', 'hidden_layer_sizes': (...
mean_test_accuracy                                              0.877299
std_test_accuracy                                               0.001979
mean_test_roc_auc                                               0.796235
mean_test_recall                                                0.288111
mean_test_precision                                             0.514398
mean_test_f1                                                    0.366449
Name: 73, dtype: object

In [34]:
df_results.loc[73,"params"]

{'activation': 'relu',
 'hidden_layer_sizes': (32, 16, 8, 4),
 'learning_rate': 'adaptive',
 'solver': 'adam'}
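As a rough sketch of how the selected configuration could be turned into goal probabilities, the block below refits an `MLPClassifier` with these parameters and reads the positive-class column of `predict_proba`. The synthetic data, `random_state=0` and the names `final_mlp` and `xg_demo` are assumptions; in the notebook the model would be fitted on the preprocessed `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic, imbalanced stand-in for the preprocessed shots (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.85], random_state=0)

# Parameters of row 73, as printed above.
selected_params = {
    "activation": "relu",
    "hidden_layer_sizes": (32, 16, 8, 4),
    "learning_rate": "adaptive",  # only takes effect with solver="sgd"
    "solver": "adam",
}

final_mlp = MLPClassifier(max_iter=1000, random_state=0, **selected_params)
final_mlp.fit(X_demo, y_demo)

# Column of predict_proba that corresponds to the positive class (goal = 1).
goal_column = list(final_mlp.classes_).index(1)
xg_demo = final_mlp.predict_proba(X_demo[:5])[:, goal_column]
print(xg_demo)
```

Each value in `xg_demo` is the model's estimated probability that the corresponding shot is a goal, which is the quantity the notebook sets out to predict.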