# Modelling whether a player plays
As mentioned earlier, the goal of this analysis is to find the players we might want to select for our fantasy football team. There is considerable flexibility in how this can be approached. From the previous analysis, a binary classification of whether a player scores five or more points appears to be a good starting point. To make the problem more tractable I will also create a separate model of whether a player will play or not. As shown earlier, about half of all players will not play in a game. If we combine a value predicting whether a player is likely to play or not with a value for their predicted points assuming they do play, we will easily be able to select good players.

In [1]:
import os
import pickle
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action='ignore', category=DataConversionWarning)
%matplotlib inline
pd.options.display.max_columns = None

players_train = pd.read_csv('../data/model_data/final_train_data.csv')
players_test = pd.read_csv('../data/model_data/final_test_data.csv')

## Predicting whether a player will play
The first model is a simple classification of whether a player is likely to play. I will train on the training set a variety of different models with different hyperparameters, and compare using the results of the validation set.

See this <a href='https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data'>link</a> for my approach to how I will approach cross validation wil grouped data. This data is grouped in the sense that we have multiple rows per player; in our training/validation splits we need to make sure each player only appears in one. The test set imported above has already had this taken into account (see other Exploratory_data_analysis notebook).

First I will confirm the balance in the training data:

In [2]:
print(f"Percentage of player rows playing in the row's game {np.mean(players_train.played_at_all):.1%}")

Percentage of player rows playing in the row's game 46.2%


First I will fit a simple base model using all appropriate features with pca and cross validation scheme. I will use a simple logistic regression.

In [3]:
players_train.head()

Unnamed: 0,player_id,first_name,second_name,team_id,team_difficulty,gameweek,kickoff_hour,kickoff_hour_cos,kickoff_hour_sin,kickoff_weekday,kickoff_weekday_cos,kickoff_weekday_sin,fixture_id,is_home,opponent_team,opponent_team_strength,opponent_difficulty,opponent_strength_ha_overall,opponent_strength_ha_attack,opponent_strength_ha_defence,target_total_points,target_minutes,target_goals_scored,target_goals_conceded,selected,value,value_change,custom_form,transfers_balance,transfers_in,transfers_out,team_strength,team_strength_ha_overall,team_strength_ha_attack,team_strength_ha_defence,prev_assists,prev_attempted_passes,prev_big_chances_created,prev_big_chances_missed,prev_bonus,prev_bps,prev_clean_sheets,prev_clearances_blocks_interceptions,prev_completed_passes,prev_creativity,prev_draw,prev_dribbles,prev_errors_leading_to_goal,prev_errors_leading_to_goal_attempt,prev_fouls,prev_goals_conceded,prev_goals_scored,prev_ict_index,prev_influence,prev_key_passes,prev_kickoff_hour,prev_kickoff_hour_cos,prev_kickoff_hour_sin,prev_kickoff_weekday,prev_kickoff_weekday_cos,prev_kickoff_weekday_sin,prev_loss,prev_minutes,prev_offside,prev_open_play_crosses,prev_opponent_score,prev_opponent_team,prev_own_goals,prev_penalties_conceded,prev_penalties_missed,prev_penalties_saved,prev_recoveries,prev_red_cards,prev_saves,prev_tackled,prev_tackles,prev_target_missed,prev_team_score,prev_threat,prev_total_points,prev_win,prev_winning_goals,prev_yellow_cards,roll_goals_scored,roll_mean_points,roll_minutes,roll_team_conceded,roll_team_points,roll_team_scored,roll_total_points,roll_unique_scorers,team_prev_mean_points,team_prev_result_points,team_prev_total_points,team_prev_unique_scorers,use_row,predict_row,model_row,played_at_all,event_day,position_FWD,position_GKP,position_MID,team_short_BHA,team_short_BOU,team_short_BUR,team_short_CAR,team_short_CHE,team_short_CRY,team_short_EVE,team_short_FUL,team_short_HUD,team_short_LEI,team_short_LIV,team_short_MCI,team_short_MUN,team_short_NEW,team_short_SOU,team_short_TOT,team_short_WAT,team_short_WHU,team_short_WOL,opponent_team_short_BHA,opponent_team_short_BOU,opponent_team_short_BUR,opponent_team_short_CAR,opponent_team_short_CHE,opponent_team_short_CRY,opponent_team_short_EVE,opponent_team_short_FUL,opponent_team_short_HUD,opponent_team_short_LEI,opponent_team_short_LIV,opponent_team_short_MCI,opponent_team_short_MUN,opponent_team_short_NEW,opponent_team_short_SOU,opponent_team_short_TOT,opponent_team_short_WAT,opponent_team_short_WHU,opponent_team_short_WOL,target_ge5
0,1,Petr,Cech,1,2,4,12,-1.0,1.224647e-16,6,0.62349,-0.781831,33,False,5,2.0,4,1080.0,1060.0,1090.0,1.0,90.0,0.0,2.0,123566.0,50.0,0.0,22.333333,9582.0,19332.0,9750.0,4.0,1320.0,1270.0,1340.0,0.0,24.0,0.0,0.0,0.0,17.0,0.0,0.0,20.0,0.0,False,0.0,0.0,0.0,0.0,1.0,0.0,2.8,27.8,0.0,14.0,-0.866025,-0.5,5.0,-0.222521,-0.974928,False,90.0,0.0,0.0,1.0,19.0,0.0,0.0,0.0,0.0,10.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,3.0,True,0.0,0.0,0.0,2.380952,90.0,2.0,1.0,1.666667,33.333333,1.333333,3.357143,3.0,47.0,2.0,True,False,True,True,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
1,1,Petr,Cech,1,3,5,14,-0.866025,-0.5,5,-0.222521,-0.974928,46,False,15,3.0,4,1120.0,1110.0,1050.0,2.0,90.0,0.0,1.0,123310.0,50.0,0.0,17.0,-3297.0,8837.0,12134.0,4.0,1320.0,1270.0,1340.0,0.0,38.0,0.0,0.0,0.0,8.0,0.0,0.0,24.0,0.0,True,0.0,0.0,2.0,0.0,2.0,0.0,0.2,2.4,0.0,12.0,-1.0,1.224647e-16,6.0,0.62349,-0.781831,False,90.0,0.0,0.0,3.0,5.0,0.0,0.0,0.0,0.0,8.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0,1.0,False,0.0,0.0,0.0,3.0,90.0,2.0,2.0,2.666667,42.0,2.333333,3.214286,3.0,45.0,3.0,True,False,True,True,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,False
2,1,Petr,Cech,1,2,6,15,-0.707107,-0.7071068,6,0.62349,-0.781831,51,True,8,3.0,4,1090.0,1070.0,1140.0,11.0,90.0,0.0,0.0,124787.0,50.0,0.0,12.666667,-797.0,6593.0,7390.0,4.0,1260.0,1240.0,1310.0,0.0,33.0,0.0,0.0,0.0,13.0,0.0,2.0,23.0,0.0,True,0.0,0.0,0.0,0.0,1.0,0.0,1.4,14.2,0.0,14.0,-0.866025,-0.5,5.0,-0.222521,-0.974928,False,90.0,0.0,0.0,2.0,15.0,0.0,0.0,0.0,0.0,13.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,False,0.0,0.0,0.0,3.285714,90.0,1.333333,3.0,2.666667,46.0,2.333333,3.285714,3.0,46.0,2.0,True,False,True,True,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,True
3,1,Petr,Cech,1,3,7,14,-0.866025,-0.5,5,-0.222521,-0.974928,61,True,18,3.0,4,1100.0,1100.0,1130.0,1.0,45.0,0.0,0.0,138891.0,50.0,0.0,19.0,9392.0,13595.0,4203.0,4.0,1260.0,1240.0,1310.0,0.0,33.0,0.0,0.0,3.0,36.0,1.0,2.0,23.0,0.0,False,0.0,0.0,0.0,0.0,0.0,0.0,4.3,43.4,0.0,15.0,-0.707107,-0.7071068,6.0,0.62349,-0.781831,False,90.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,9.0,0.0,6.0,0.0,1.0,0.0,2.0,0.0,11.0,True,0.0,0.0,0.0,3.690476,90.0,1.0,3.0,2.333333,51.666667,2.333333,4.571429,3.0,64.0,2.0,True,False,True,True,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,False
4,1,Petr,Cech,1,2,8,11,-0.965926,0.258819,6,0.62349,-0.781831,74,False,9,2.0,4,1040.0,1030.0,1080.0,0.0,0.0,0.0,0.0,107912.0,50.0,0.0,18.333333,-28910.0,924.0,29834.0,4.0,1320.0,1270.0,1340.0,0.0,18.0,0.0,0.0,0.0,6.0,0.0,1.0,7.0,0.0,False,0.0,0.0,0.0,0.0,0.0,0.0,1.7,17.2,0.0,14.0,-0.866025,-0.5,5.0,-0.222521,-0.974928,False,45.0,0.0,0.0,0.0,18.0,0.0,0.0,0.0,0.0,5.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,1.0,True,0.0,0.0,0.0,4.071429,75.0,0.333333,3.0,2.0,57.0,1.666667,4.357143,3.0,61.0,1.0,True,False,True,False,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,False


In [4]:
# Rows we can use in the training set (i.e. they have a non-missing response)
full_training_data = players_train[~players_train.target_total_points.isna()].copy()

# Rows we cannot use in the model due to a missing response
pred_training_data = players_train[players_train.target_total_points.isna()].copy()

# Player_ids of those playing
training_player_ids = full_training_data.player_id

# The column to predict
target = 'played_at_all'

# Columns to use as features
all_features = [col for col in full_training_data if col not in
               ['player_id',
               'first_name',
               'second_name',
               'team_id',
               'gameweek',
               'fixture_id',
               'opponent_team',
                'selected',
                'use_row',
                'model_row',
                'played_at_all'
               ]
                and not col.startswith('target')
               ]

# Create a new 'test' dataset as separate from our main (which I will only use at the very end for the final
# end-to-end evaluation)
train_ids, test_ids = train_test_split(training_player_ids.unique(), test_size=0.7)
train_data = full_training_data.loc[full_training_data.index.isin(train_ids), all_features].copy()
train_target = full_training_data.loc[full_training_data.index.isin(train_ids), target]
test_data = full_training_data.loc[full_training_data.index.isin(test_ids), all_features].copy()
test_target = full_training_data.loc[full_training_data.index.isin(test_ids), target]

column_names = list(train_data.columns)

# Simple scaling of the data
ss = StandardScaler()

X_train = ss.fit_transform(train_data.values)
X_test = ss.transform(test_data.values)
y_train = train_target.values
y_test = test_target.values


In [5]:
lr = LogisticRegression()

lr.fit(X_train, y_train)

preds_train_play = lr.predict(X_train)
preds_test_play = lr.predict(X_test)

print('Accuracy of simple classifier on train: {:.2%}'.format(accuracy_score(y_train, preds_train_play)))
print('Accuracy of simple classifier on test: {:.2%}'.format(accuracy_score(y_test, preds_test_play)))

Accuracy of simple classifier on train: 92.96%
Accuracy of simple classifier on test: 66.37%


As the classes are relatively balanced, I am using a simple accuracy score as the evaluation for this part. The simple logistic regression performs well for the train data, but appears to be overfitting.

### Nested cross validation to choose the best out of a number of candidate models
To overcome this overfitting, and produce the best prediction of whether a player will play or not, I will perform nest cross validation to choose the best performing (in terms of accuracy on the test set) model from a set of algorithms.

In [6]:
# Now I'm using a proper cross-validation scheme, I will get the full set of training (and validation data)
X = full_training_data.loc[:, all_features]
y = full_training_data.loc[:, target]
grps = full_training_data.loc[:, 'player_id']

As of writing this, scikit-learn does not allow nested cross validation to be performed with groups using `cross_val_score` and `GridSearchCV`. As such, I will define a simple function to do cross validation with grouped data and `GridSearchCV`.

In [7]:
def cross_val_scorer_grouped(estimator, params, X, y=None, groups=None, scoring='accuracy', scorer=accuracy_score,
                             cv_outer=5, cv_inner='warn', test_split_outer=0.2, gs_verbosity=1):
    
    # This object creates splits which take the groups into account (e.g. a group can not appear in both the training
    # and test sets)
    gss = GroupShuffleSplit(n_splits=cv_outer, test_size=test_split_outer)
    
    # For each cross validation fold, calculate the accuracy of a tuned (inner cross validation classifier) defined by
    # the input estimator
    scores = np.zeros(cv_outer)
    for i, (train, test) in enumerate(gss.split(X, y, groups=groups)):
        # Subset data to be used to train and validate for this fold
        X_train = X[train, :]
        X_test = X[test, :]
        y_train = y[train]
        y_test = y[test]
        g_train = groups[train]
        
        # Fit the model for this fold's training data
        gs = GridSearchCV(estimator=estimator, param_grid=params, scoring=scoring, cv=cv_inner, verbose=gs_verbosity)
        gs.fit(X_train, y_train, groups=g_train)
        preds = gs.predict(X_test)
        
        # Get the score for this fold
        scores[i] = scorer(y_test, preds)
        print('Fold {} complete'.format(i))
        
    return scores

Now I have created a way to do nested cross validation, it is time to propose some candiate models to predict whether a player will play or not. These pipelines will include standardisation, principal component analysis, and feature selection along with the main model to avoid data leakage in the inner loop. Pca is used (as mentioned in the previous notebook) to account for the collinearity of features.

In [8]:
# Logistic regression - tune different regularisations
pipe_lr = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('sbf', SelectKBest(f_classif, k=10)),
                ('clf', LogisticRegression())])
params_lr = {'clf__penalty': ['l1', 'l2'],
         'clf__C': np.logspace(-3, 3, 7)}

# Decision tree
pipe_dt = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('clf', DecisionTreeClassifier())])
params_dt = {'clf__max_depth': np.linspace(1, 20, 20),
                'clf__min_samples_split': np.linspace(0.1, 1, 10),
                'clf__max_features': ['auto', 'log2']
               }

# Random forest
pipe_rf = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('sbf', SelectKBest(f_classif, k=10)),
                ('clf', RandomForestClassifier())])
params_rf = {'clf__n_estimators': [50, 100, 250],
                'clf__max_depth': np.linspace(1, 10, 5),
                'clf__max_features': ['auto', 'log2']
               }

# Adaptive boosting
pipe_ad = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('clf', AdaBoostClassifier())])
params_ad = {'clf__n_estimators':[2, 5, 10, 50, 100, 250, 500, 1000]
               },

# Support vector classifier
pipe_sv = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('clf', SVC())])
params_sv = {'clf__C': np.logspace(-3, 1, 5),
               'clf__gamma': np.logspace(-3, 0, 4),
               'clf__kernel': ['linear', 'rbf']
               }

# K-nearest neighbours
pipe_kn = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('clf', KNeighborsClassifier())])
params_kn = {'clf__n_neighbors': [1, 5, 10, 20, 50]
               }

In [9]:
# The splitting scheme for the inner cross validation
gkf = GroupKFold(n_splits=5)

In [10]:
scores_lr = cross_val_scorer_grouped(pipe_lr, params_lr, X.values, y.values, groups=grps.values,
                                     scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

Fitting 5 folds for each of 14 candidates, totalling 70 fits


[Parallel(n_jobs=1)]: Done  70 out of  70 | elapsed:   29.5s finished


Fold 0 complete
Fitting 5 folds for each of 14 candidates, totalling 70 fits


[Parallel(n_jobs=1)]: Done  70 out of  70 | elapsed:   29.0s finished


Fold 1 complete
Fitting 5 folds for each of 14 candidates, totalling 70 fits


[Parallel(n_jobs=1)]: Done  70 out of  70 | elapsed:   29.2s finished


Fold 2 complete
Fitting 5 folds for each of 14 candidates, totalling 70 fits


[Parallel(n_jobs=1)]: Done  70 out of  70 | elapsed:   29.9s finished


Fold 3 complete
Fitting 5 folds for each of 14 candidates, totalling 70 fits


[Parallel(n_jobs=1)]: Done  70 out of  70 | elapsed:   28.6s finished


Fold 4 complete


In [11]:
# scores_dt = cross_val_scorer_grouped(pipe_dt, params_dt, X.values, y.values, groups=grps.values,
#                                      scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

In [12]:
# scores_rf = cross_val_scorer_grouped(pipe_rf, params_rf, X.values, y.values, groups=grps.values,
#                                      scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

In [13]:
# scores_ad = cross_val_scorer_grouped(pipe_ad, params_ad, X.values, y.values, groups=grps.values,
#                                      scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

In [14]:
# scores_sv = cross_val_scorer_grouped(pipe_sv, params_sv, X.values, y.values, groups=grps.values,
#                                      scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

In [15]:
# scores_kn = cross_val_scorer_grouped(pipe_kn, params_kn, X.values, y.values, groups=grps.values,
#                                      scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

In [16]:
print('Nested CV accuracy for logistic regression: {:.4f} +/- {:.4f}'
      .format(np.mean(scores_lr), np.std(scores_lr)))
# print('Nested CV accuracy for logistic regression: {:.4f} +/- {:.4f}'
#       .format(np.mean(scores_dt), np.std(scores_dt)))
# print('Nested CV accuracy for logistic regression: {:.4f} +/- {:.4f}'
#       .format(np.mean(scores_rf), np.std(scores_rf)))
# print('Nested CV accuracy for logistic regression: {:.4f} +/- {:.4f}'
#       .format(np.mean(scores_ad), np.std(scores_ad)))
# print('Nested CV accuracy for logistic regression: {:.4f} +/- {:.4f}'
#       .format(np.mean(scores_sv), np.std(scores_sv)))
# print('Nested CV accuracy for logistic regression: {:.4f} +/- {:.4f}'
#       .format(np.mean(scores_kn), np.std(scores_kn)))

Nested CV accuracy for logistic regression: 0.8184 +/- 0.0107


The best model, as suggested by the above comparison is .... We can also see the accuracies are quite stable. Now I will refit this model from scratch to create the final one, before evaluating on the main test set.

In [None]:
pipe_final = Pipeline([('ss', StandardScaler()),
                ('pca', PCA(n_components=0.95, svd_solver='full')),
                ('sbf', SelectKBest(f_classif, k=10)),
                ('clf', LogisticRegression())])
params_final = {'clf__penalty': ['l1', 'l2'],
         'clf__C': np.logspace(-3, 3, 7)}

gs_final = GridSearchCV(estimator=pipe_final, param_grid=params_final, scoring='accuracy', cv=gkf, verbose=1)
gs_final.fit(X, y, groups=grps)

In [26]:
full_test_data = players_test[~players_test.target_total_points.isna()].copy()
full_test_data['played_at_all'] = full_test_data['target_minutes'] > 0
X_test = full_test_data.loc[:, all_features]
y_test = full_test_data.loc[:, target]
grps_test = full_test_data.loc[:, 'player_id']

preds_final_train = gs_final.predict(X)
preds_final_test = gs_final.predict(X_test)

print('Accuracy of final classifier on train: {:.2%}'.format(accuracy_score(y, preds_final_train)))
print('Accuracy of final classifier on test: {:.2%}'.format(accuracy_score(y_test, preds_final_test)))

Accuracy of final classifier on train: 81.95%
Accuracy of final classifier on test: 83.96%


At first glance, this might look worse than the very first model I built. However, it is better, as there is no overfitting to the training data. For my purposes, an accuracy of around 80%+ is sufficient.

I will now save this full pipeline (and other steps) to the disk so the model can be used to predict for new data.

In [34]:
model_all = {'feature_columns': all_features,
             'target_column': target,
             'model': gs_final}
    
with open('../models/play/model.pkl', 'wb') as f:
    pickle.dump(model_all, f)

Modelling the points players get will be done in the next notebook.