# Baseline Model

We start with a majority-class benchmark and then fit a few simple classifiers on a small set of features.

In [396]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from predict_test_data import predict_test_data

In [397]:
train = pd.read_csv('../data/train_team.csv')
test = pd.read_csv('../data/test_team.csv')

In [398]:
train.columns

Index(['home_win', 'attack_diff', 'bup_dribbling_diff', 'bup_passing_diff',
       'bup_speed_diff', 'cc_crossing_diff', 'cc_passing_diff',
       'cc_shooting_diff', 'd_aggresion_diff', 'd_pressure_diff',
       'd_width_diff', 'defence_diff', 'full_age_diff',
       'goalkeeeper_overall_diff', 'growth_diff', 'midfield_diff',
       'overall_diff', 'prestige_diff', 'start_age_diff',
       'value_euros_millions_diff', 'wage_euros_thousands_diff',
       'attack_home_defence_away_diff', 'attack_away_defence_home_diff',
       'cur_year_avg_weighted_diff', 'cur_year_avg_diff',
       'last_year_avg_weighted_diff', 'last_year_avg_diff',
       'previous_points_diff', 'rank_change_diff', 'rank_diff',
       'three_year_ago_avg_diff', 'three_year_ago_weighted_diff',
       'total_points_diff', 'two_year_ago_avg_diff',
       'two_year_ago_weighted_diff'],
      dtype='object')

Our most basic model predicts the majority class every time. In this case, `home_win` = 1 is the majority class. What is the training accuracy of this "prediction"?

In [399]:
train['home_win'].value_counts()[1] / len(train)

0.4493341053850608

With three classes, that is well above the roughly 33% expected from uniform random guessing. What about the test accuracy?

In [400]:
accuracy_score(test['home_win'], np.ones(len(test)))

0.421875

The test accuracy of always predicting a home win is about 42.2%. Any model we build should beat this majority-class benchmark.
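
The majority-class benchmark can also be written as a scikit-learn estimator, which makes it easy to score next to the other models. The sketch below uses `DummyClassifier` on a small made-up label vector; the toy data and variable names are assumptions for illustration, not the match data used above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for home_win (0, 1, 2); class 1 is the majority.
y_toy = np.array([1, 1, 1, 1, 0, 0, 2, 2, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# strategy="most_frequent" always predicts the most common training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)
baseline_acc = accuracy_score(y_toy, baseline_pred)
print(baseline_acc)
```

Because the estimator exposes `fit`, `predict`, and `score`, it can be dropped into the same evaluation code as any other classifier.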

Our baseline model will be simple. It uses only the differences in FIFA rankings, attack ratings, defence ratings, midfield ratings, and overall ratings. We will hold out 20% of the original training set as a validation set.

In [401]:
train = train[['rank_diff', 'home_win', 'attack_diff', 'defence_diff', 'midfield_diff', 'overall_diff']]
test = test[['rank_diff', 'home_win', 'attack_diff', 'defence_diff', 'midfield_diff', 'overall_diff', 'Group']]

In [402]:
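np.random.seed(1)  # train_test_split draws from NumPy's global RNG when random_state is unset
X_train, X_validation = train_test_split(train, test_size = 0.2)
# Separate the label from the features; Series.ravel() is deprecated in recent pandas
y_train = X_train['home_win'].to_numpy()
X_train = X_train.drop(['home_win'], axis = 1)
y_validation = X_validation['home_win'].to_numpy()
X_validation = X_validation.drop(['home_win'], axis = 1)
y_test = test['home_win'].to_numpy()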
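
An alternative way to make the split reproducible is to pass `random_state` directly to `train_test_split`, and `stratify` keeps the class proportions the same in both parts. This is a sketch on synthetic data, not the split used above (changing the seeding would change the reported scores); the frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training frame: two features and a 3-class label.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rank_diff": rng.normal(size=100),
    "overall_diff": rng.normal(size=100),
    "home_win": np.repeat([0, 1, 2], [30, 45, 25]),
})

X = toy.drop(columns="home_win")
y = toy["home_win"]

# random_state fixes the shuffle; stratify=y preserves the class shares.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(y_tr.value_counts(normalize=True).sort_index())
print(y_val.value_counts(normalize=True).sort_index())
```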

We will first try out logistic regression, with the regularization strength chosen by 5-fold cross-validation.

In [403]:
lr_model = LogisticRegressionCV(solver = 'lbfgs', max_iter = 5000, cv = 5, multi_class='multinomial').fit(X_train, y_train)

In [404]:
print("Logistic Regression Train Score: {}".format(lr_model.score(X_train, y_train)))
print("Logistic Regression Validation Score: {}".format(lr_model.score(X_validation, y_validation)))

Logistic Regression Train Score: 0.5206372194062274
Logistic Regression Validation Score: 0.5260115606936416


We will also try out Linear Discriminant Analysis. LDA assumes that all classes share the same covariance matrix, so we first check whether the per-feature variances are similar across the three outcomes.

In [405]:
train.groupby('home_win').var()

            rank_diff  attack_diff  defence_diff  midfield_diff  overall_diff
home_win
0         1031.324335    41.178286     36.447164      36.358733     30.316051
1         1030.878211    43.164681     38.173861      37.613713     31.811819
2         1100.677884    42.149203     40.387636      36.910224     32.722352


They are quite similar: for every feature, the largest and smallest class variances differ by less than 11%.
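
Strictly speaking, the LDA assumption concerns the full covariance matrix of each class, so comparing per-feature variances only checks the diagonal. A fuller check compares the whole matrix per class. The sketch below does this on synthetic data; the helper name `max_relative_cov_gap` and the toy frame are assumptions introduced here, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def max_relative_cov_gap(df, label_col):
    """Largest entrywise spread between the per-class covariance matrices,
    relative to the largest absolute entry of the all-rows covariance."""
    features = df.drop(columns=label_col)
    overall = features.cov().to_numpy()
    class_covs = np.stack([
        group.drop(columns=label_col).cov().to_numpy()
        for _, group in df.groupby(label_col)
    ])
    spread = class_covs.max(axis=0) - class_covs.min(axis=0)
    return spread.max() / np.abs(overall).max()

# Synthetic data where every class has the same covariance by construction.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(3000, 2)),
                   columns=["attack_diff", "defence_diff"])
toy["home_win"] = np.tile([0, 1, 2], 1000)

print(toy.groupby("home_win").cov())  # one covariance matrix per class
print(max_relative_cov_gap(toy, "home_win"))
```

A value near zero suggests the shared-covariance assumption is plausible; how large a gap is acceptable is a judgment call.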

In [406]:
lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)

In [407]:
print("LDA Train Score: {}".format(lda_model.score(X_train, y_train)))
print("LDA Validation Score: {}".format(lda_model.score(X_validation, y_validation)))

LDA Train Score: 0.5184648805213613
LDA Validation Score: 0.5173410404624278


We will also try out Quadratic Discriminant Analysis, which fits a separate covariance matrix for each class. Given the nearly equal variances, it should perform about as well as LDA here.

In [408]:
qda_model = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

In [409]:
print("QDA Train Score: {}".format(qda_model.score(X_train, y_train)))
print("QDA Validation Score: {}".format(qda_model.score(X_validation, y_validation)))

QDA Train Score: 0.5141202027516293
QDA Validation Score: 0.5173410404624278


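We will also try a random forest, requiring at least 20 samples per leaf, which limits how closely each tree can fit the training data.

In [410]: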
rf_model = RandomForestClassifier(min_samples_leaf = 20, n_estimators=100).fit(X_train, y_train)

In [411]:
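print("Random Forest Train Score: {}".format(rf_model.score(X_train, y_train)))
print("Random Forest Validation Score: {}".format(rf_model.score(X_validation, y_validation)))

Random Forest Train Score: 0.55756698044895
Random Forest Validation Score: 0.5346820809248555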
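
The train and validation scores of the four models can also be collected in one table, which makes them easier to compare. The sketch below runs on synthetic data from `make_classification` and uses plain `LogisticRegression` instead of `LogisticRegressionCV` to keep it fast; the data, the `models` dictionary, and the `scores` frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with 5 features, standing in for the match data.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Random Forest": RandomForestClassifier(min_samples_leaf=20,
                                            n_estimators=100, random_state=0),
}

# Fit each model once and record its accuracy on both splits.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"model": name,
                 "train": model.score(X_tr, y_tr),
                 "validation": model.score(X_val, y_val)})

scores = pd.DataFrame(rows).set_index("model")
print(scores)
```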


We now score the models on the test set.

In [412]:
print("Logistic Regression Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, lr_model))))

Logistic Regression Test Score: 0.59375


In [413]:
print("LDA Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, lda_model))))

LDA Test Score: 0.59375


In [414]:
print("QDA Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, qda_model))))

QDA Test Score: 0.640625


In [None]:
print("Random Forest Test Score: {}".format(
    accuracy_score(test['home_win'], predict_test_data(test, X_train.columns, rf_model))))
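
The helper `predict_test_data` is imported from a local module whose source is not shown here. To make the call pattern concrete, the sketch below defines a hypothetical stand-in, `predict_test_data_sketch`, which simply selects the training feature columns and calls `model.predict`. This is an assumption about the simplest behavior consistent with the call signature, not the original implementation, which may for instance treat the `Group` column differently. The toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def predict_test_data_sketch(test_df, feature_columns, model):
    """Hypothetical stand-in: predict using only the training feature columns."""
    # Extra columns such as 'home_win' or 'Group' are ignored in this sketch.
    return model.predict(test_df[list(feature_columns)])

# Toy training frame with a label that depends on one feature.
rng = np.random.default_rng(0)
train_toy = pd.DataFrame({"rank_diff": rng.normal(size=200),
                          "overall_diff": rng.normal(size=200)})
train_toy["home_win"] = (train_toy["overall_diff"] > 0).astype(int)

# Toy test frame that carries an extra 'Group' column, as the real test set does.
test_toy = train_toy.sample(50, random_state=1).assign(Group="A")

features = train_toy.columns.drop("home_win")
model = LogisticRegression().fit(train_toy[features], train_toy["home_win"])

pred = predict_test_data_sketch(test_toy, features, model)
print(accuracy_score(test_toy["home_win"], pred))
```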