# Baseline Model

We first build a few basic models to serve as a baseline.

In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from predict_test_data import predict_test_data

import warnings
warnings.filterwarnings('ignore')


In [2]:
train = pd.read_csv('../data/cleaned/train_final.csv')
test = pd.read_csv('../data/cleaned/test_final.csv')

In [3]:
train.columns

Index(['game_date', 'home_team', 'away_team', 'country', 'neutral', 'home_win',
       'attack_diff', 'bup_dribbling_diff', 'bup_passing_diff',
       'bup_speed_diff', 'cc_crossing_diff', 'cc_passing_diff',
       'cc_shooting_diff', 'd_aggresion_diff', 'd_pressure_diff',
       'd_width_diff', 'defence_diff', 'full_age_diff',
       'goalkeeeper_overall_diff', 'growth_diff', 'midfield_diff',
       'overall_diff', 'prestige_diff', 'start_age_diff',
       'value_euros_millions_diff', 'wage_euros_thousands_diff',
       'attack_home_defence_away_diff', 'attack_away_defence_home_diff',
       'rank_diff', 'gdp_diff', 'wins_past_1_games_diff',
       'wins_home_against_away_1_games', 'wins_past_2_games_diff',
       'wins_home_against_away_2_games', 'wins_past_3_games_diff',
       'wins_home_against_away_3_games', 'wins_past_4_games_diff',
       'wins_home_against_away_4_games', 'wins_past_5_games_diff',
       'wins_home_against_away_5_games'],
      dtype='object')

Our most basic model simply predicts the majority class every time. In this case, `home_win` = 1 is the majority class. What is the training accuracy of this "prediction"?

In [4]:
train['home_win'].value_counts()[1] / len(train)

0.43806009488666314

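That is decent for a three-class problem (`home_win` takes the values -1, 0, and 1), where uniform random guessing would score about 0.33. What about the test accuracy?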

In [5]:
accuracy_score(test['home_win'], np.ones(len(test)))

0.421875

Still decent. Any model we build should beat this majority-class test accuracy of about 0.42.
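As a minimal sketch of the majority-class baseline described above, the snippet below learns the most frequent label from the training labels only and then predicts it for every test row. The toy labels and the helper names `majority_class` and `accuracy` are illustrative assumptions, not part of the notebook; the reading of -1, 0, 1 as loss, draw, win is also an assumption.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent label in an iterable of labels."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (hypothetical, not the match data); assumed coding:
# -1 = home loss, 0 = draw, 1 = home win
toy_train = [1, 1, 0, -1, 1, 0, -1, 1]
toy_test = [1, 0, -1, 1]

# Learn the majority class from the training labels only ...
baseline_label = majority_class(toy_train)
# ... then predict that single label for every row
baseline_train_acc = accuracy(toy_train, [baseline_label] * len(toy_train))
baseline_test_acc = accuracy(toy_test, [baseline_label] * len(toy_test))

print(baseline_label, baseline_train_acc, baseline_test_acc)
```

Deriving the label from the training data, rather than hard-coding 1, keeps the baseline honest if the class balance ever changes.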

Our baseline model will be pretty simple. It uses the differences between the two teams in FIFA ranking, attack rating, defence rating, midfield rating, and overall rating. (The `neutral` column, which appears to record whether the home team is actually playing at home, is not among the columns selected below.) We will split the original training set into a training set and a validation set.

In [6]:
train = train[['home_win', 'rank_diff', 'attack_diff', 'defence_diff', 'midfield_diff', 'overall_diff']]
test = test[['home_win', 'rank_diff', 'attack_diff', 'defence_diff', 'midfield_diff', 'overall_diff', 'Group']]

In [7]:
np.random.seed(3)
X_train, X_validation = train_test_split(train, test_size = 0.2)
y_train = X_train['home_win'].ravel()
X_train = X_train.drop(['home_win'], axis = 1)
y_validation = X_validation['home_win'].ravel()
X_validation = X_validation.drop(['home_win'], axis = 1)
y_test = test['home_win'].ravel()
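The split above can be sketched without scikit-learn to show what happens: the row indices are shuffled once, and the same indices are used for both features and labels so that they stay aligned. This is a simplified illustration with a hypothetical helper `holdout_split` and toy arrays; it is not how `train_test_split` is implemented internally, and it will not reproduce the same rows.

```python
import numpy as np

def holdout_split(X, y, test_size=0.2, seed=3):
    """Shuffle row indices once and split features and labels with them."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(np.ceil(test_size * len(y)))  # size of the held-out part
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Toy data (hypothetical): 10 rows, 2 features; the label copies column 0
# so that feature/label alignment is easy to verify after the split
X_toy = np.arange(20).reshape(10, 2)
y_toy = X_toy[:, 0].copy()

X_tr, X_val, y_tr, y_val = holdout_split(X_toy, y_toy)
print(X_tr.shape, X_val.shape)
```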

In [8]:
# stores the score of each model
score = {}

We will first try out logistic regression. 

In [9]:
lr_model = LogisticRegressionCV(solver = 'lbfgs', max_iter = 5000, cv = 5, multi_class='multinomial').fit(X_train, y_train)

In [10]:
score["Logistic Regression"] = {}
score["Logistic Regression"]["model"] = lr_model
score["Logistic Regression"]["Train Score"] = lr_model.score(X_train, y_train)
score["Logistic Regression"]["Validation Score"] = lr_model.score(X_validation, y_validation)

print("Logistic Regression Train Score: {}".format(score["Logistic Regression"]["Train Score"]))
print("Logistic Regression Validation Score: {}".format(score["Logistic Regression"]["Validation Score"]))

Logistic Regression Train Score: 0.5260382333553065
Logistic Regression Validation Score: 0.4921052631578947


We will also try out Linear Discriminant Analysis. LDA assumes that all classes share a common covariance matrix, so we first check whether the feature variances are similar across the three outcomes.

In [11]:
train.groupby('home_win').var()

| home_win | rank_diff | attack_diff | defence_diff | midfield_diff | overall_diff |
|---|---|---|---|---|---|
| -1 | 916.793268 | 40.858572 | 37.554931 | 38.133908 | 31.393396 |
| 0 | 960.540851 | 47.654088 | 43.559341 | 41.798216 | 36.158215 |
| 1 | 936.866455 | 49.295727 | 43.509512 | 42.634416 | 37.105145 |


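Surprisingly, the variances are quite similar across the three outcomes! `rank_diff` shows the largest absolute gaps but only about a 5% relative spread, while the largest relative spread is about 20%, for `attack_diff`.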
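Per-feature variances are only the diagonal of each class's covariance matrix, and the LDA assumption concerns the full matrices. A minimal sketch of a fuller check, using synthetic data and hypothetical helpers `class_covariances` and `pooled_covariance`, could look like the following. The pooled estimate here is the textbook unbiased one; scikit-learn's estimator may normalise differently.

```python
import numpy as np

def class_covariances(X, y):
    """Return {label: covariance matrix of the rows of X with that label}."""
    return {label: np.cov(X[y == label], rowvar=False) for label in np.unique(y)}

def pooled_covariance(X, y):
    """Within-class covariance pooled over all classes (textbook estimate)."""
    labels = np.unique(y)
    scatter = sum(
        (np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False) for c in labels
    )
    return scatter / (len(y) - len(labels))

# Synthetic stand-in for the features (not the match data):
# 3 classes of 50 rows each, 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 2))
y_demo = np.repeat([-1, 0, 1], 50)

covs = class_covariances(X_demo, y_demo)
pooled = pooled_covariance(X_demo, y_demo)

# Largest absolute deviation of each class covariance from the pooled one
for label, cov in covs.items():
    print(label, np.abs(cov - pooled).max())
```

Small deviations from the pooled matrix, relative to its entries, would support the shared-covariance assumption; large ones would favour a model with per-class covariances.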

In [12]:
lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)

In [13]:
score["LDA"] = {}
score["LDA"]["model"] = lda_model
score["LDA"]["Train Score"] = lda_model.score(X_train, y_train)
score["LDA"]["Validation Score"] = lda_model.score(X_validation, y_validation)
print("LDA Train Score: {}".format(score["LDA"]["Train Score"]))
print("LDA Validation Score: {}".format(score["LDA"]["Validation Score"]))

LDA Train Score: 0.5227422544495716
LDA Validation Score: 0.5026315789473684


We will also try out Quadratic Discriminant Analysis, which should perform similarly to LDA in this case due to the almost equal variances.
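For reference, the standard textbook discriminant functions make the relationship explicit. Here $\mu_k$ and $\pi_k$ are the mean and prior of class $k$, $\Sigma$ is the shared covariance matrix used by LDA, and $\Sigma_k$ is the per-class covariance matrix used by QDA; this notation is introduced here and is not taken from the notebook.

```latex
\delta_k^{\mathrm{LDA}}(x) = x^\top \Sigma^{-1} \mu_k
  - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k

\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2} \log \lvert \Sigma_k \rvert
  - \tfrac{1}{2}\, (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
```

Each model predicts the class with the largest $\delta_k(x)$. When every $\Sigma_k$ equals $\Sigma$, the terms of the QDA function that do not depend on $k$ cancel across classes and it reduces to the LDA function, which is why the two are expected to behave similarly when the class covariances are close.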

In [14]:
qda_model = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

In [15]:
score["QDA"] = {}
score["QDA"]['model'] = qda_model
score["QDA"]["Train Score"] = qda_model.score(X_train, y_train)
score["QDA"]["Validation Score"] = qda_model.score(X_validation, y_validation)
print("QDA Train Score: {}".format(score["QDA"]["Train Score"]))
print("QDA Validation Score: {}".format(score["QDA"]["Validation Score"]))

QDA Train Score: 0.5280158206987475
QDA Validation Score: 0.48947368421052634


We will also try out Random Forest.

In [16]:
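rf_params = {
    'bootstrap': [True, False],
    'max_depth': [3, 5, 10, 20, 30, 40, None],
    # 'auto' is equivalent to 'sqrt' for classifiers in older scikit-learn
    # releases and was removed in 1.3; drop it when running on a newer version
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4, 10, 20],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [10, 50, 100, 200, 500],
}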

rf_model = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=rf_params,\
                                   n_iter=50, scoring='accuracy', n_jobs=-1, cv=5, verbose=1).fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 227 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:   18.0s finished
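To put `n_iter=50` in context, the sketch below counts how many combinations the random forest grid contains and what share of them the search visits. The grid is copied from the cell above under the hypothetical name `rf_grid`. The last lines draw one illustrative candidate by picking each parameter independently, which is a simplification and not necessarily how `RandomizedSearchCV` samples.

```python
import math
import random

# Copy of the random forest grid from the cell above
rf_grid = {
    'bootstrap': [True, False],
    'max_depth': [3, 5, 10, 20, 30, 40, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4, 10, 20],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [10, 50, 100, 200, 500],
}

# Total number of distinct parameter combinations in the grid
n_combinations = math.prod(len(values) for values in rf_grid.values())

# Share of the grid that 50 random candidates can cover at most
n_iter = 50
coverage = n_iter / n_combinations

# One illustrative candidate: pick a value for each parameter at random
rng = random.Random(0)
candidate = {name: rng.choice(values) for name, values in rf_grid.items()}

print(n_combinations, round(coverage, 4))
print(candidate)
```

The grid holds 2100 combinations, so the search evaluates at most about 2.4% of them; an exhaustive grid search would need 10500 fits with 5-fold cross-validation.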


In [17]:
score["Random Forest"] = {}
score["Random Forest"]['model'] = rf_model
score["Random Forest"]["Train Score"] = rf_model.score(X_train, y_train)
score["Random Forest"]["Validation Score"] = rf_model.score(X_validation, y_validation)
print("Random Forest Train Score: {}".format(score["Random Forest"]["Train Score"]))
print("Random Forest Validation Score {}".format(score["Random Forest"]["Validation Score"]))

Random Forest Train Score: 0.5451549110085695
Random Forest Validation Score 0.4921052631578947


Let's also not forget XGBoost.

In [18]:
xgb_params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
    }
xgb_model = RandomizedSearchCV(estimator=XGBClassifier(objective='multi:softmax', num_class = 3), param_distributions=xgb_params,\
                                   n_iter=50, scoring='accuracy', n_jobs=-1, cv=5, verbose=1).fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:   10.8s finished


In [19]:
score["XGBoost"] = {}
score["XGBoost"]['model'] = xgb_model
score["XGBoost"]["Train Score"] = xgb_model.score(X_train, y_train)
score["XGBoost"]["Validation Score"] = xgb_model.score(X_validation, y_validation)
print("XGBoost Train Score: {}".format(score["XGBoost"]["Train Score"]))
print("XGBoost Validation Score: {}".format(score["XGBoost"]["Validation Score"]))

XGBoost Train Score: 0.550428
XGBoost Validation Score: 0.497368


In [20]:
df_result = pd.DataFrame(score).T

In [21]:
df_result

| | Train Score | Validation Score | model |
|---|---|---|---|
| Logistic Regression | 0.526038 | 0.492105 | `LogisticRegressionCV(Cs=10, class_weight=None,...` |
| LDA | 0.522742 | 0.502632 | `LinearDiscriminantAnalysis(n_components=None, ...` |
| QDA | 0.528016 | 0.489474 | `QuadraticDiscriminantAnalysis(priors=None, reg...` |
| Random Forest | 0.545155 | 0.492105 | `RandomizedSearchCV(cv=5, error_score='raise',\...` |
| XGBoost | 0.550428 | 0.497368 | `RandomizedSearchCV(cv=5, error_score='raise',\...` |


In [22]:
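# idxmax returns the index label (the model name); on recent pandas versions
# argmax returns an integer position instead
model_name = df_result['Validation Score'].astype(float).idxmax()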
print("We choose the final model to be the one with the highest validation score,\
 which is {} in this case".format(model_name))

We choose the final model to be the one with the highest validation score, which is LDA in this case
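How much weight should the ranking carry? A rough check is to compare the spread of validation scores with the sampling error of an accuracy estimate. The sketch below uses the binomial standard error and assumes a validation set of 380 rows; that size is inferred from the printed scores (for example 191/380 ≈ 0.502632) and is not stated in the notebook. The helper name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n independent rows."""
    return math.sqrt(acc * (1.0 - acc) / n)

# Assumption: validation set size inferred from the printed scores
n_validation = 380

# Validation scores copied from the results table above
validation_scores = {
    "Logistic Regression": 0.492105,
    "LDA": 0.502632,
    "QDA": 0.489474,
    "Random Forest": 0.492105,
    "XGBoost": 0.497368,
}

best_model = max(validation_scores, key=validation_scores.get)
spread = max(validation_scores.values()) - min(validation_scores.values())
std_err = accuracy_standard_error(validation_scores[best_model], n_validation)

print(best_model, round(spread, 4), round(std_err, 4))
```

Under these assumptions the gap between the best and worst model (about 0.013) is smaller than one standard error (about 0.026), so the five models are hard to tell apart on this validation set and the choice of LDA should be read as a convenient tie-break.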


In [23]:
test_pred = predict_test_data(test, X_train.columns, df_result.loc[model_name].model)
test_score = accuracy_score(test['home_win'], test_pred)
print("For the best Model, {}, the test accuracy is {:.3f}".format(model_name, test_score))

For the best Model, LDA, the test accuracy is 0.594


Impressive! This is well above the majority-class test accuracy of 0.422. We now have an idea of what our more advanced models should aim to beat.
