# Baseline Model

We will first build some basic models.

In [816]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from predict_test_data import predict_test_data

In [817]:
train = pd.read_csv('../data/train_merged.csv')
test = pd.read_csv('../data/test_merged.csv')

In [818]:
train.columns

Index(['game_date', 'home_team', 'away_team', 'country', 'neutral', 'home_win',
       'attack_diff', 'bup_dribbling_diff', 'bup_passing_diff',
       'bup_speed_diff', 'cc_crossing_diff', 'cc_passing_diff',
       'cc_shooting_diff', 'd_aggresion_diff', 'd_pressure_diff',
       'd_width_diff', 'defence_diff', 'full_age_diff',
       'goalkeeeper_overall_diff', 'growth_diff', 'midfield_diff',
       'overall_diff', 'prestige_diff', 'start_age_diff',
       'value_euros_millions_diff', 'wage_euros_thousands_diff',
       'attack_home_defence_away_diff', 'attack_away_defence_home_diff',
       'cur_year_avg_weighted_diff', 'cur_year_avg_diff',
       'last_year_avg_weighted_diff', 'last_year_avg_diff',
       'previous_points_diff', 'rank_change_diff', 'rank_diff',
       'three_year_ago_avg_diff', 'three_year_ago_weighted_diff',
       'total_points_diff', 'two_year_ago_avg_diff',
       'two_year_ago_weighted_diff', 'gdp_diff'],
      dtype='object')

Our most basic model predicts the majority class every time. In this case, `home_win` = 1 is the majority class. What is the training accuracy of this "prediction"?

In [819]:
train['home_win'].value_counts()[1] / len(train)

0.4406651549508692

Pretty decent for a three-class problem, where uniform random guessing would be right about a third of the time. What about the test accuracy?

In [820]:
accuracy_score(test['home_win'], np.ones(len(test)))

0.421875

Still decent. Any model we build should beat this test accuracy, which comes from always guessing the majority class.
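The majority-class baseline can also be written so that the majority label is learned from the training labels rather than hard-coded as 1. The following is a minimal sketch on made-up toy labels; the helper name `majority_class_baseline` is hypothetical and not part of the notebook.

```python
from collections import Counter


def majority_class_baseline(train_labels, eval_labels):
    """Always predict the most common training label; return (label, accuracy)."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / len(eval_labels)


# Made-up labels for illustration only (not the notebook's data).
toy_train = [1, 1, 1, 0, 0, 2, 2, 1]
toy_eval = [1, 0, 2, 1]
label, acc = majority_class_baseline(toy_train, toy_eval)
print(label, acc)
```

Learning the label from the training set keeps the baseline honest if the class balance ever changes between datasets.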

Our baseline model will be pretty simple. We will use the differences in attack, defence, midfield, and overall ratings between the two teams. We will make a train and validation set out of the original train set.

In [821]:
train = train[['home_win', 'attack_diff', 'defence_diff', 'midfield_diff', 'overall_diff']]
test = test[['home_win', 'attack_diff', 'defence_diff', 'midfield_diff', 'overall_diff', 'Group']]

In [882]:
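# train_test_split draws from NumPy's global RNG when random_state is not set
np.random.seed(1)
# Split the rows first (the target column is still included), then separate features and target
X_train, X_validation = train_test_split(train, test_size = 0.2)
y_train = X_train['home_win'].to_numpy()  # Series.ravel() is deprecated in recent pandas
X_train = X_train.drop(['home_win'], axis = 1)
y_validation = X_validation['home_win'].to_numpy()
X_validation = X_validation.drop(['home_win'], axis = 1)
y_test = test['home_win'].to_numpy()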

We will first try out logistic regression. 

In [883]:
lr_model = LogisticRegressionCV(solver = 'lbfgs', max_iter = 5000, cv = 5, multi_class='multinomial').fit(X_train, y_train)

In [884]:
print("Logistic Regression Train Score: {}".format(lr_model.score(X_train, y_train)))
print("Logistic Regression Validation Score: {}".format(lr_model.score(X_validation, y_validation)))

Logistic Regression Train Score: 0.5122873345935728
Logistic Regression Validation Score: 0.5320754716981132


We will also try out Linear Discriminant Analysis. LDA assumes that all classes share the same covariance matrix, so as a quick check we first compare the per-feature variances across the three outcomes.

In [885]:
train.groupby('home_win').var()

| home_win | attack_diff | defence_diff | midfield_diff | overall_diff |
|---|---|---|---|---|
| 0 | 42.557163 | 38.549357 | 37.936027 | 31.537804 |
| 1 | 48.936282 | 40.405204 | 40.252079 | 35.024385 |
| 2 | 44.322274 | 43.722156 | 38.042605 | 34.816576 |


Surprisingly, they are actually quite similar! For every feature, the largest class variance is within about 15% of the smallest.
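The similarity can be quantified from the table values as the ratio of the largest to the smallest class variance per feature. This is only a rough sketch: the numbers are copied by hand from the output above, `variance_ratio` is a hypothetical helper, and the check looks only at variances, not at the covariances between features.

```python
# Per-class variances copied from the groupby output above.
class_variances = {
    "attack_diff":   {0: 42.557163, 1: 48.936282, 2: 44.322274},
    "defence_diff":  {0: 38.549357, 1: 40.405204, 2: 43.722156},
    "midfield_diff": {0: 37.936027, 1: 40.252079, 2: 38.042605},
    "overall_diff":  {0: 31.537804, 1: 35.024385, 2: 34.816576},
}


def variance_ratio(per_class):
    """Ratio of the largest to the smallest class variance (1.0 means identical)."""
    values = list(per_class.values())
    return max(values) / min(values)


ratios = {feature: variance_ratio(v) for feature, v in class_variances.items()}
for feature, ratio in ratios.items():
    print(f"{feature}: {ratio:.2f}")
```

A ratio close to 1 for every feature is consistent with the shared-covariance assumption, though it does not prove it.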

In [886]:
lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)

In [887]:
print("LDA Train Score: {}".format(lda_model.score(X_train, y_train)))
print("LDA Validation Score: {}".format(lda_model.score(X_validation, y_validation)))

LDA Train Score: 0.5113421550094518
LDA Validation Score: 0.5547169811320755


We will also try out Quadratic Discriminant Analysis, which should perform similarly to LDA in this case due to the almost equal variances.

In [888]:
qda_model = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

In [889]:
print("QDA Train Score: {}".format(qda_model.score(X_train, y_train)))
print("QDA Validation Score: {}".format(qda_model.score(X_validation, y_validation)))

QDA Train Score: 0.5
QDA Validation Score: 0.5358490566037736


We will also try out Random Forest.

In [890]:
rf_model = RandomForestClassifier(min_samples_leaf = 20, n_estimators=100).fit(X_train, y_train)

In [891]:
print("Random Forest Train Score: {}".format(rf_model.score(X_train, y_train)))
print("Random Forest Validation Score: {}".format(rf_model.score(X_validation, y_validation)))

Random Forest Train Score: 0.553875236294896
Random Forest Validation Score: 0.5207547169811321
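The four pairs of print statements follow the same pattern, so the scores could be gathered in one place for easier comparison. The sketch below assumes only that each model exposes scikit-learn's `.score(X, y)` method; `collect_scores` and `ConstantModel` are hypothetical names, and the toy data is made up.

```python
def collect_scores(models, splits):
    """Return {model_name: {split_name: accuracy}} using each model's .score(X, y)."""
    return {
        name: {split: model.score(X, y) for split, (X, y) in splits.items()}
        for name, model in models.items()
    }


class ConstantModel:
    """Hypothetical stand-in for a fitted estimator: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def score(self, X, y):
        # Accuracy of predicting self.label for every row.
        return sum(1 for target in y if target == self.label) / len(y)


# Toy data for illustration only.
splits = {
    "train": ([[0], [1], [2], [3]], [1, 1, 0, 2]),
    "validation": ([[4], [5]], [1, 0]),
}
scores = collect_scores({"always_1": ConstantModel(1), "always_0": ConstantModel(0)}, splits)
for name, by_split in scores.items():
    print(name, by_split)
```

With the notebook's objects, the call would look like `collect_scores({"LR": lr_model, "LDA": lda_model, "QDA": qda_model, "RF": rf_model}, {"train": (X_train, y_train), "validation": (X_validation, y_validation)})`.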


Now let's see how these models perform on the actual World Cup!

In [892]:
print("Logistic Regression Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, lr_model))))

Logistic Regression Test Score: 0.625


In [893]:
print("LDA Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, lda_model))))

LDA Test Score: 0.609375


In [894]:
print("QDA Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, qda_model))))

QDA Test Score: 0.609375


In [895]:
print("Random Forest Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, rf_model))))

Random Forest Test Score: 0.578125


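Impressive! All four models beat the majority-class test accuracy of 0.421875, with logistic regression scoring highest at 0.625. We now have an idea of what our more advanced model should hope to achieve.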
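When comparing test scores, it may help to keep sampling uncertainty in mind. As a rough illustration, the sketch below computes a 95% Wilson score interval for a test accuracy of 0.625. The test-set size of 64 is an assumption (the test scores above are all multiples of 1/64, but the notebook never states the size), and `wilson_interval` is a hypothetical helper, not part of the notebook.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed over n trials (z=1.96 for ~95%)."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 64 test matches; the notebook does not state the test-set size.
N_TEST = 64
low, high = wilson_interval(0.625, N_TEST)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval is fairly wide, so small gaps between the four models' test scores should be read with caution.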