# Modelling whether a player plays
As mentioned in the previous notebook, the goal of this analysis is to find the players we might want to select for our fantasy football team. There is considerable flexibility in how this can be approached. From the previous analysis, a binary classification of whether a player scores four or more points appears to be a good starting point. To make the problem more tractable I will also create a separate model of whether a player will play or not. As shown earlier, about half of all players will not play in a game. If we combine a value predicting whether a player is likely to play or not with a value for their predicted points assuming they do play, we will easily be able to select good players.

For now, I will focus on a couple of algorithms (logistic regression and xgboost) for classifying whether players play or not to save time (I'm training these on a virtual machine on my laptop so, with cross-validation, it's not going to be fast!). In future, it would be good to try other algorithms, but I would imagine that the data will be the limiting factor in how accurate we can be.

As the data is, for the most part, balanced, my metric to evaluate will be accuracy. However, I will also consider ROC AUC.

In [1]:
import os
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# Use sklearn API for easy inclusion in sklearn pipelines
from xgboost.sklearn import XGBClassifier
from sklearn.exceptions import DataConversionWarning

from helpers import PercentageCalc

warnings.filterwarnings(action='ignore', category=DataConversionWarning)
%matplotlib inline
pd.options.display.max_columns = None

data = pd.read_csv('./data/model_data.csv')

## Predicting whether a player will play
The first model is a simple classification of whether a player is likely to play. I will train on a training set a couple of different models with different hyperparameters, and compare using the results of the validation set.

See this [link](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data) for my approach to how I will approach cross validation wil grouped data. This data is grouped in the sense that we have multiple rows per player; in our training/validation splits we need to make sure each player only appears in one. The test set imported above has already had this taken into account (see other Exploratory_data_analysis notebook).

First I will confirm the balance in the training data:

In [2]:
print(f"Percentage of player rows playing in the row's game: "
      f"{np.mean(data.target_played):.1%}")

Percentage of player rows playing in the row's game: 48.1%


From previous notebook, there were no missings except for players' first gameweeks (as they obviously don't have previous game data for some of their features). We can simply remove these.

In [3]:
data_all = data.dropna()

For the validation, it is necessary to split players into different train, validation, and test groups so that a player only appears in one (for all their gameweeks). For now, create a final holdout set:

In [4]:
player_ids = np.unique(data_all.player_id)

player_ids_use, player_ids_test = train_test_split(player_ids, test_size=0.2)

data_use = data_all.loc[data_all.player_id.isin(player_ids_use)]
data_test = data_all.loc[data_all.player_id.isin(player_ids_test)]

## Base model
A first model we can do is simply say that if a player played in the last game, they will do in this one.

In [5]:
data_use.head()

Unnamed: 0,player_id,player_name,fixture_id,fixture_id_long,gameweek_id,total_points,fixture_home,value,transfers_balance,selected,transfers_in,transfers_out,team_strength,team_strength_overall_home,team_strength_overall_away,team_strength_attack_home,team_strength_attack_away,team_strength_defence_home,team_strength_defence_away,opponent_team_id,team_goals_conceded,opponent_team_strength,opponent_team_strength_overall_home,opponent_team_strength_overall_away,opponent_team_strength_attack_home,opponent_team_strength_attack_away,opponent_team_strength_defence_home,opponent_team_strength_defence_away,team_fixture_difficulty,opponent_team_fixture_difficulty,total_minutes,previous_points,previous_home_team_score,previous_away_team_score,previous_minutes,previous_goals_scored,previous_assists,previous_clean_sheets,previous_goals_conceded,previous_own_goals,previous_penalties_saved,previous_penalties_missed,previous_yellow_cards,previous_red_cards,previous_saves,previous_bonus,previous_bps,previous_influence,previous_creativity,previous_threat,previous_ict_index,previous_team_goals_scored,previous_win,previous_loss,previous_draw,kickoff_feature_sin,kickoff_feature_cos,kickoff_feature_game_day_of_week,kickoff_feature_time_diff,kickoff_feature_gameday,fifa_age,fifa_height_cm,fifa_weight_kg,fifa_overall,fifa_potential,fifa_value_eur,fifa_wage_eur,fifa_international_reputation,fifa_weak_foot,fifa_skill_moves,fifa_release_clause_eur,fifa_pace,fifa_shooting,fifa_passing,fifa_dribbling,fifa_defending,fifa_physic,fifa_gk_diving,fifa_gk_handling,fifa_gk_kicking,fifa_gk_reflexes,fifa_gk_speed,fifa_gk_positioning,fifa_attacking_crossing,fifa_attacking_finishing,fifa_attacking_heading_accuracy,fifa_attacking_short_passing,fifa_attacking_volleys,fifa_skill_dribbling,fifa_skill_curve,fifa_skill_fk_accuracy,fifa_skill_long_passing,fifa_skill_ball_control,fifa_movement_acceleration,fifa_movement_sprint_speed,fifa_movement_agility,fifa_movement_reactions,fifa_movement_balance,fifa_power_shot_power,fifa_power_jumping,fifa_power_stamina,fifa_power_strength,fifa_power_long_shots,fifa_mentality_aggression,fifa_mentality_interceptions,fifa_mentality_positioning,fifa_mentality_vision,fifa_mentality_penalties,fifa_mentality_composure,fifa_defending_marking,fifa_defending_standing_tackle,fifa_defending_sliding_tackle,fifa_goalkeeping_diving,fifa_goalkeeping_handling,fifa_goalkeeping_kicking,fifa_goalkeeping_positioning,fifa_goalkeeping_reflexes,fifa_work_rate_attack,fifa_work_rate_defense,fifa_ptag_,fifa_ptag_distanceshooter,fifa_ptag_completedefender,fifa_ptag_completemidfielder,fifa_ptag_acrobat,fifa_ptag_speedster,fifa_ptag_poacher,fifa_ptag_fkspecialist,fifa_ptag_clinicalfinisher,fifa_ptag_tactician,fifa_ptag_aerialthreat,fifa_ptag_completeforward,fifa_ptag_tackling,fifa_ptag_playmaker,fifa_ptag_crosser,fifa_ptag_engine,fifa_ptag_strength,fifa_ptag_dribbler,fifa_ptrait_,fifa_ptrait_powerfree-kick,fifa_ptrait_longthrow-in,fifa_ptrait_inflexible,fifa_ptrait_selfish,fifa_ptrait_acrobaticclearance,fifa_ptrait_finesseshot,fifa_ptrait_beatoffsidetrap,fifa_ptrait_crowdfavourite,fifa_ptrait_skilleddribbling,fifa_ptrait_flair,fifa_ptrait_giantthrow-in,fifa_ptrait_diver,fifa_ptrait_flairpasses,fifa_ptrait_injuryfree,fifa_ptrait_outsidefootshot,fifa_ptrait_leadership,fifa_ptrait_injuryprone,fifa_ptrait_argueswithofficials,fifa_ptrait_secondwind,fifa_ptrait_avoidsusingweakerfoot,fifa_ptrait_swervepass,fifa_ptrait_earlycrosser,fifa_pbodytype_stocky,fifa_pbodytype_lean,fifa_pbodytype_normal,fifa_preferred_foot_right,fifa_is_uk_roi_player,fifa_pos_rf,fifa_pos_ldm,fifa_pos_ram,fifa_pos_lm,fifa_pos_lcm,fifa_pos_cm,fifa_pos_rcm,fifa_pos_rm,fifa_pos_lwb,fifa_pos_cdm,fifa_pos_rdm,fifa_pos_rwb,fifa_pos_lb,fifa_pos_lcb,fifa_pos_cb,fifa_pos_rcb,fifa_pos_rb,fifa_pos_cam,fifa_pos_lam,fifa_pos_rw,fifa_pos_ls,fifa_pos_st,fifa_pos_rs,fifa_pos_lf,fifa_pos_cf,fifa_pos_lw,position_name_FWD,position_name_GKP,position_name_MID,team_name_AVL,team_name_BHA,team_name_BOU,team_name_BUR,team_name_CHE,team_name_CRY,team_name_EVE,team_name_LEI,team_name_LIV,team_name_MCI,team_name_MUN,team_name_NEW,team_name_NOR,team_name_SHU,team_name_SOU,team_name_TOT,team_name_WAT,team_name_WHU,team_name_WOL,opponent_team_name_AVL,opponent_team_name_BHA,opponent_team_name_BOU,opponent_team_name_BUR,opponent_team_name_CHE,opponent_team_name_CRY,opponent_team_name_EVE,opponent_team_name_LEI,opponent_team_name_LIV,opponent_team_name_MCI,opponent_team_name_MUN,opponent_team_name_NEW,opponent_team_name_NOR,opponent_team_name_SHU,opponent_team_name_SOU,opponent_team_name_TOT,opponent_team_name_WAT,opponent_team_name_WHU,opponent_team_name_WOL,target_played,target_points
1,1,Shkodran Mustafi,11,1059712,2,0.0,True,55.0,-5280.0,36709.0,2868.0,8148.0,4,1230,1270,1150,1190,1280,1330,5,1.0,3,1050,1110,1060,1130,1050,1050,2.0,4.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,-0.130526,-0.991445,5,5.0,0.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,False
2,1,Shkodran Mustafi,24,1059725,3,0.0,False,54.0,-6882.0,30975.0,534.0,7416.0,4,1230,1270,1150,1190,1280,1330,10,3.0,5,1340,1360,1300,1330,1340,1370,5.0,4.0,0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,-0.991445,-0.130526,5,7.0,0.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,False,False
3,1,Shkodran Mustafi,31,1059732,4,0.0,True,54.0,-3872.0,28096.0,346.0,4218.0,4,1230,1270,1150,1190,1280,1330,17,2.0,4,1240,1280,1210,1290,1280,1290,4.0,4.0,0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-0.92388,-0.382683,6,7.0,1.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,False,False
4,1,Shkodran Mustafi,49,1059750,5,0.0,False,53.0,-2073.0,26902.0,581.0,2654.0,4,1230,1270,1150,1190,1280,1330,18,2.0,3,1070,1090,1080,1120,1040,1120,2.0,4.0,0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,-0.92388,-0.382683,6,14.0,1.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,False,False
5,1,Shkodran Mustafi,51,1059752,6,0.0,True,53.0,-1041.0,26330.0,373.0,1414.0,4,1230,1270,1150,1190,1280,1330,2,2.0,2,1040,1080,1030,1060,1030,1050,2.0,4.0,0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,-0.92388,-0.382683,6,7.0,1.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,False


In [6]:
class BaseModel():
    def fit(X, y):
        pass
    
    def predict(X, y=None):
        return (X['previous_minutes'] > 0).astype(int)

preds_train = BaseModel.predict(data_use)

preds_test = BaseModel.predict(data_test)

accuracy_base = accuracy_score(data_test['target_played'].astype(int), preds_test)
print(f'Accuracy of base model (test data): {accuracy_base}')

Accuracy of base model (test data): 0.8262599469496021


## Nested cross validation to choose the best out of a number of candidate models
I will perform nest cross validation to choose the best performing (in terms of accuracy on the test set) model from a couple of algorithms.

In [7]:
cross_val_replace_cols = ['selected', 'transfers_in', 'transfers_out']
not_features = ['target_played', 'target_points', 'total_minutes',
               'opponent_team_id', 'home_team_id', 'away_team_id',
               'total_points', 'position_id', 'team_id',
                'player_id', 'player_name', 'fixture_id', 'fixture_id_long'
               ]
features = [c for c in data_use.columns if c not in not_features and
           c not in cross_val_replace_cols]

X_all = data_use.loc[:, features + cross_val_replace_cols]
y_all = data_use.loc[:, 'target_played']
grps = data_use.loc[:, 'player_id']

As of writing this, scikit-learn does not allow nested cross validation to be performed with groups using cross_val_score and GridSearchCV. As such, I will define a simple function to do cross validation with grouped data and GridSearchCV.

In [8]:
X_all.head()

Unnamed: 0,gameweek_id,fixture_home,value,transfers_balance,team_strength,team_strength_overall_home,team_strength_overall_away,team_strength_attack_home,team_strength_attack_away,team_strength_defence_home,team_strength_defence_away,team_goals_conceded,opponent_team_strength,opponent_team_strength_overall_home,opponent_team_strength_overall_away,opponent_team_strength_attack_home,opponent_team_strength_attack_away,opponent_team_strength_defence_home,opponent_team_strength_defence_away,team_fixture_difficulty,opponent_team_fixture_difficulty,previous_points,previous_home_team_score,previous_away_team_score,previous_minutes,previous_goals_scored,previous_assists,previous_clean_sheets,previous_goals_conceded,previous_own_goals,previous_penalties_saved,previous_penalties_missed,previous_yellow_cards,previous_red_cards,previous_saves,previous_bonus,previous_bps,previous_influence,previous_creativity,previous_threat,previous_ict_index,previous_team_goals_scored,previous_win,previous_loss,previous_draw,kickoff_feature_sin,kickoff_feature_cos,kickoff_feature_game_day_of_week,kickoff_feature_time_diff,kickoff_feature_gameday,fifa_age,fifa_height_cm,fifa_weight_kg,fifa_overall,fifa_potential,fifa_value_eur,fifa_wage_eur,fifa_international_reputation,fifa_weak_foot,fifa_skill_moves,fifa_release_clause_eur,fifa_pace,fifa_shooting,fifa_passing,fifa_dribbling,fifa_defending,fifa_physic,fifa_gk_diving,fifa_gk_handling,fifa_gk_kicking,fifa_gk_reflexes,fifa_gk_speed,fifa_gk_positioning,fifa_attacking_crossing,fifa_attacking_finishing,fifa_attacking_heading_accuracy,fifa_attacking_short_passing,fifa_attacking_volleys,fifa_skill_dribbling,fifa_skill_curve,fifa_skill_fk_accuracy,fifa_skill_long_passing,fifa_skill_ball_control,fifa_movement_acceleration,fifa_movement_sprint_speed,fifa_movement_agility,fifa_movement_reactions,fifa_movement_balance,fifa_power_shot_power,fifa_power_jumping,fifa_power_stamina,fifa_power_strength,fifa_power_long_shots,fifa_mentality_aggression,fifa_mentality_interceptions,fifa_mentality_positioning,fifa_mentality_vision,fifa_mentality_penalties,fifa_mentality_composure,fifa_defending_marking,fifa_defending_standing_tackle,fifa_defending_sliding_tackle,fifa_goalkeeping_diving,fifa_goalkeeping_handling,fifa_goalkeeping_kicking,fifa_goalkeeping_positioning,fifa_goalkeeping_reflexes,fifa_work_rate_attack,fifa_work_rate_defense,fifa_ptag_,fifa_ptag_distanceshooter,fifa_ptag_completedefender,fifa_ptag_completemidfielder,fifa_ptag_acrobat,fifa_ptag_speedster,fifa_ptag_poacher,fifa_ptag_fkspecialist,fifa_ptag_clinicalfinisher,fifa_ptag_tactician,fifa_ptag_aerialthreat,fifa_ptag_completeforward,fifa_ptag_tackling,fifa_ptag_playmaker,fifa_ptag_crosser,fifa_ptag_engine,fifa_ptag_strength,fifa_ptag_dribbler,fifa_ptrait_,fifa_ptrait_powerfree-kick,fifa_ptrait_longthrow-in,fifa_ptrait_inflexible,fifa_ptrait_selfish,fifa_ptrait_acrobaticclearance,fifa_ptrait_finesseshot,fifa_ptrait_beatoffsidetrap,fifa_ptrait_crowdfavourite,fifa_ptrait_skilleddribbling,fifa_ptrait_flair,fifa_ptrait_giantthrow-in,fifa_ptrait_diver,fifa_ptrait_flairpasses,fifa_ptrait_injuryfree,fifa_ptrait_outsidefootshot,fifa_ptrait_leadership,fifa_ptrait_injuryprone,fifa_ptrait_argueswithofficials,fifa_ptrait_secondwind,fifa_ptrait_avoidsusingweakerfoot,fifa_ptrait_swervepass,fifa_ptrait_earlycrosser,fifa_pbodytype_stocky,fifa_pbodytype_lean,fifa_pbodytype_normal,fifa_preferred_foot_right,fifa_is_uk_roi_player,fifa_pos_rf,fifa_pos_ldm,fifa_pos_ram,fifa_pos_lm,fifa_pos_lcm,fifa_pos_cm,fifa_pos_rcm,fifa_pos_rm,fifa_pos_lwb,fifa_pos_cdm,fifa_pos_rdm,fifa_pos_rwb,fifa_pos_lb,fifa_pos_lcb,fifa_pos_cb,fifa_pos_rcb,fifa_pos_rb,fifa_pos_cam,fifa_pos_lam,fifa_pos_rw,fifa_pos_ls,fifa_pos_st,fifa_pos_rs,fifa_pos_lf,fifa_pos_cf,fifa_pos_lw,position_name_FWD,position_name_GKP,position_name_MID,team_name_AVL,team_name_BHA,team_name_BOU,team_name_BUR,team_name_CHE,team_name_CRY,team_name_EVE,team_name_LEI,team_name_LIV,team_name_MCI,team_name_MUN,team_name_NEW,team_name_NOR,team_name_SHU,team_name_SOU,team_name_TOT,team_name_WAT,team_name_WHU,team_name_WOL,opponent_team_name_AVL,opponent_team_name_BHA,opponent_team_name_BOU,opponent_team_name_BUR,opponent_team_name_CHE,opponent_team_name_CRY,opponent_team_name_EVE,opponent_team_name_LEI,opponent_team_name_LIV,opponent_team_name_MCI,opponent_team_name_MUN,opponent_team_name_NEW,opponent_team_name_NOR,opponent_team_name_SHU,opponent_team_name_SOU,opponent_team_name_TOT,opponent_team_name_WAT,opponent_team_name_WHU,opponent_team_name_WOL,selected,transfers_in,transfers_out
1,2,True,55.0,-5280.0,4,1230,1270,1150,1190,1280,1330,1.0,3,1050,1110,1060,1130,1050,1050,2.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,-0.130526,-0.991445,5,5.0,0.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,36709.0,2868.0,8148.0
2,3,False,54.0,-6882.0,4,1230,1270,1150,1190,1280,1330,3.0,5,1340,1360,1300,1330,1340,1370,5.0,4.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,-0.991445,-0.130526,5,7.0,0.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,30975.0,534.0,7416.0
3,4,True,54.0,-3872.0,4,1230,1270,1150,1190,1280,1330,2.0,4,1240,1280,1210,1290,1280,1290,4.0,4.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-0.92388,-0.382683,6,7.0,1.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,28096.0,346.0,4218.0
4,5,False,53.0,-2073.0,4,1230,1270,1150,1190,1280,1330,2.0,3,1070,1090,1080,1120,1040,1120,2.0,4.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,-0.92388,-0.382683,6,14.0,1.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,26902.0,581.0,2654.0
5,6,True,53.0,-1041.0,4,1230,1270,1150,1190,1280,1330,2.0,2,1040,1080,1030,1060,1030,1050,2.0,4.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,-0.92388,-0.382683,6,7.0,1.0,27.0,184.0,82.0,79.0,81.0,13000000.0,76000.0,3.0,3.0,2.0,25700000.0,60.0,57.0,63.0,61.0,77.0,77.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,58.0,81.0,75.0,51.0,53.0,55.0,47.0,69.0,70.0,58.0,62.0,63.0,76.0,67.0,66.0,81.0,68.0,78.0,52.0,84.0,75.0,49.0,55.0,54.0,73.0,74.0,79.0,80.0,11.0,9.0,15.0,10.0,6.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,64.0,76.0,65.0,64.0,69.0,69.0,69.0,64.0,72.0,76.0,76.0,72.0,74.0,80.0,80.0,80.0,74.0,65.0,65.0,63.0,66.0,66.0,66.0,64.0,64.0,63.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,26330.0,373.0,1414.0


In [28]:
def cross_val_scorer_grouped(estimator, params, X, y=None, groups=None, scoring='accuracy', scorer=accuracy_score,
                             cv_outer=5, cv_inner='warn', test_split_outer=0.2, gs_verbosity=1):
    
    gss = GroupShuffleSplit(n_splits=cv_outer, test_size=test_split_outer)
    gss.split(X_all, y_all, groups=grps)

    # For each cross validation fold, calculate the accuracy of a tuned (inner cross validation classifier) defined by
    # the input estimator
    scores = np.zeros(cv_outer)

    for i, (train, test) in enumerate(gss.split(X, y, groups=groups)):
        print(i)
        X_train = X.iloc[train, :]
        X_test = X.iloc[test, :]
        y_train = y.iloc[train]
        y_test = y.iloc[test]
        g_train = groups.iloc[train]

        # Fit the model for this fold's training data
        gs = GridSearchCV(estimator=estimator, param_grid=params, scoring=scoring, cv=cv_inner, verbose=gs_verbosity, n_jobs=-1)
        gs.fit(X_train, y_train, groups=g_train)
        preds = gs.predict(X_test)

        # Get the score for this fold
        scores[i] = scorer(y_test, preds)
        print('Fold {} complete'.format(i))
        
    return scores, gs

Now I have created a way to do nested cross validation, it is time to propose some candiate models to predict whether a player will play or not. These pipelines will include standardisation and principal component analysis along with the main model to avoid data leakage in the inner loop. PCA is used (as mentioned in the previous notebook) to account for the collinearity of features.

In [25]:
shared = [('rpcol', PercentageCalc(by_group='gameweek_id',
                             variables=['selected', 'transfers_in', 'transfers_out'],
                             constant=15,
                            drop_by=False)),
          ('ss', StandardScaler()),
          ('pca', PCA(n_components=0.95, svd_solver='full'))]

pipe_lr = Pipeline(
    shared +
    [('clf', LogisticRegression(solver='liblinear'))]
)
params_lr = {
#      'clf__penalty': ['l1', 'l2'],
#      'clf__C': np.logspace(-3, 3, 7)
}

pipe_xg = Pipeline(
    shared +
    [('clf', LogisticRegression(solver='liblinear'))]
)
params_xg = {

}


gkf = GroupKFold(n_splits=5)

In [19]:
scores_lr, model_lr = cross_val_scorer_grouped(pipe_lr, params_lr, X_all, y_all, groups=grps,
                                     scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

0
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   10.2s finished


Fold 0 complete
1
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   10.4s finished


Fold 1 complete
2
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   10.7s finished


Fold 2 complete
3
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   10.6s finished


Fold 3 complete
4
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   10.3s finished


Fold 4 complete


In [29]:
scores_xg, model_xg = cross_val_scorer_grouped(pipe_xg, params_xg, X_all, y_all, groups=grps,
                                     scoring='accuracy', cv_inner=gkf, gs_verbosity=1)

0
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    8.4s finished


Fold 0 complete
1
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    7.0s finished


Fold 1 complete
2
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.8s finished


Fold 2 complete
3
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.8s finished


Fold 3 complete
4
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    7.3s finished


Fold 4 complete


In [33]:
model_xg.cv_results_

{'mean_fit_time': array([2.19633083]),
 'std_fit_time': array([0.18060331]),
 'mean_score_time': array([0.21344757]),
 'std_score_time': array([0.00662481]),
 'params': [{}],
 'split0_test_score': array([0.85117227]),
 'split1_test_score': array([0.81202241]),
 'split2_test_score': array([0.83698421]),
 'split3_test_score': array([0.83953133]),
 'split4_test_score': array([0.81141692]),
 'mean_test_score': array([0.83022543]),
 'std_test_score': array([0.01585031]),
 'rank_test_score': array([1], dtype=int32)}

In [27]:
print(f'Nested CV accuracy for logistic regression: {np.mean(scores_lr):.2f} +/- {np.std(scores_lr):2f}')
print(f'Nested CV accuracy for xgboost: {np.mean(scores_xg):.2f} +/- {np.std(scores_xg):2f}')

Nested CV accuracy for logistic regression: 0.84 +/- 0.009191
Nested CV accuracy for xgboost: 0.84 +/- 0.007600
