# Packages

In [2]:
import sys, os
import pandas as pd
import numpy as np

# Visualisation
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

# Statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Scikit-learn
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# Explicitly require this experimental feature
from sklearn.experimental import enable_halving_search_cv # noqa
# Now you can import normally from model_selection
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import HalvingRandomSearchCV

# XGBoost
import xgboost as xgb

# SFS
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Scikit-learn models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

# SHAP values
import shap

# Warnings
import warnings
warnings.filterwarnings("ignore")
    
# Timings
%load_ext autotime
# %unload_ext autotime

# Progress bar
from tqdm import tqdm

time: 300 µs (started: 2021-09-05 20:42:28 +01:00)


# Modelling and Simulation

In the next section, we will look at a range of different machine learning algorithms and begin to evaluate their respective performance in predicting fantasy football points. We will achieve this by simulation. That is, we will simulate and compare how different models would have performed on *last season’s data*. Our choice of evaluation metric will be the **Mean Absolute Error** (MAE) between the predictions made and the observed fantasy football points.
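As a quick illustration of the metric, here is a minimal sketch of the MAE calculation on made-up numbers. The helper name `mean_abs_error` and the toy values are assumptions for illustration only; the simulations themselves use `mean_absolute_error` from scikit-learn, imported above.

```python
import numpy as np

def mean_abs_error(observed, predicted):
    """Average absolute gap between observed and predicted points (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(observed - predicted)))

# Toy example: four players' observed vs. predicted points in one gameweek
observed_points = [2, 6, 1, 0]
predicted_points = [3.0, 4.0, 1.0, 2.0]

mae = mean_abs_error(observed_points, predicted_points)
print(mae)  # absolute errors are 1, 2, 0, 2, so the mean is 1.25
```

An MAE of 1.25 reads as "on average, a prediction is off by 1.25 fantasy points", which makes the metric easy to interpret.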

# Contents

i. [Idea behind simulation](#IdeaBehindSimulation)

<br>

[Linear Models](#LinearModels)

i. [(Multiple) Linear Regression](#LR)

ii. [Ridge Regression and LASSO Regression](#RidgeAndLASSO)

<br>

[Non-linear Models](#NonLinearModels)

i. [Random Forest](#RF)

ii. [XGBoost](#XGB)

iii. [XGBoost V2](#XGBV2)

   * XGBoost V2: Error spike investigation
    

<a id='IdeaBehindSimulation'></a>
## Idea behind simulation

For a given model, the idea will be to simulate the model as if it were deployed at the beginning of last season. In other words, for every gameweek in the season, the model will make predictions (as if we were "in that gameweek") for that gameweek and ensuing gameweeks. But, what will this look like in practice? Let's give an example:

**Example: Simulating Gameweek *N***

Assume we're in Gameweek *N* last season. That is, Gameweeks *1, 2, ... , N-1* have happened and Gameweek *N* is about to be played. Then, 
* The model will be **trained** on Gameweeks *1, 2, ... , N-1* (i.e. Gameweeks *1, 2, ... , N-1* will form the training data).
* And the model will **predict** Gameweeks *N, N+1, N+2, N+3, N+4* (i.e. Gameweeks *N, N+1, N+2, N+3, N+4* will form the test data).

We repeat this for every *N* in *3, 4, ... , 38*.
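The train/test split described in the example can be sketched as below. This is only an illustration under stated assumptions: the function name `gameweek_splits` is hypothetical, and it assumes the gameweek number is available as an ordinary column called `event` (in the cleaned dataset it is an index level, so a `reset_index()` would be needed first).

```python
import pandas as pd

def gameweek_splits(df, first_gw=3, last_gw=38, horizon=5, gw_col="event"):
    """Yield (N, train, test): train on gameweeks before N, test on N to N+horizon-1."""
    for n in range(first_gw, last_gw + 1):
        train = df[df[gw_col] < n]
        test = df[(df[gw_col] >= n) & (df[gw_col] < n + horizon)]
        yield n, train, test

# Toy data: one row per gameweek of a 38-gameweek season
toy = pd.DataFrame({"event": list(range(1, 39)), "total_points": [2] * 38})
splits = {n: (train, test) for n, train, test in gameweek_splits(toy)}

print(splits[3][0]["event"].tolist())   # training gameweeks when N = 3
print(splits[3][1]["event"].tolist())   # test gameweeks when N = 3
```

Note that towards the end of the season the test window simply contains fewer than 5 gameweeks, because there is nothing to predict beyond Gameweek 38.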

<br>

*A couple of things to note...* 

1) **Why 5?:** Note that the model always makes predictions for the next 5 gameweeks. But why 5? My thinking here is that, in practice, a model which predicts just the next gameweek is not that useful. Conversely, a model which makes predictions for *lots and lots* of gameweeks ahead of time might start to make inaccurate estimates. Therefore, for now, I think 5 serves as a 'happy medium'.

2) **What about Gameweeks (GWs) 1 and 2?:** You may have noticed that we will not make predictions for GWs 1 and 2. I've chosen to exclude making predictions for these GWs for a few reasons:
* By **GW1** (and by design) all predictor variables will be null (Recall: the predictor variables reflect data observed *before* the start of the given GW).
* By **GW2** (and by the same design) we will still not have observed a "full" Gameweek. Yes, we will have seen points scored in GW1. However, the predictor variables for GW1 will still be null.
* For now, I think **it is acceptable to not make predictions for these Gameweeks**. I think this is ok because our models will still "work" (that is - make predictions) for 36 out of the 38 Gameweeks in the season. However, of course, this is something which should be addressed in a future iteration of the project...
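To see why the GW1 predictors are null by design, here is a small sketch of one hypothetical way a "before the gameweek" feature could be built. The column names and the shifted expanding mean are assumptions for illustration; the feature engineering actually used for the cleaned dataset is not shown in this section.

```python
import pandas as pd

# Toy data: two players, three gameweeks each (all values made up)
toy = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B"],
    "event":  [1, 2, 3, 1, 2, 3],
    "goals":  [1, 0, 2, 0, 3, 1],
})

# Average goals per gameweek using only gameweeks *before* the current one:
# shift(1) hides the current gameweek, expanding().mean() averages what is left
toy["goals_pgw"] = toy.groupby("player")["goals"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(toy)
```

In this sketch the GW1 rows have no earlier gameweeks to average over, so their feature value is missing, which mirrors the situation described above.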

<br>

## Read in cleaned data from local directory

Let's read in the cleaned dataset we've put together so far.

In [3]:
# Redefine Index
df = pd.read_csv("/Users/samharrison/Documents/data_sci/fpl_points_predictor/data/cleaned_data_2020_21_season.csv")
df = df.set_index(['player_name','position','team_title','event','opponent_team_title'])

print(df.shape)
df.head(4)

(14326, 47)


player_name,position,team_title,event,opponent_team_title,finished,chance_of_playing_this_round,chance_of_playing_next_round,home_flag,goalkeeper_flag,defender_flag,midfielder_flag,forward_flag,goals_WMA,shots_WMA,...,xGBuildup_pgw,team_xG_pgw,team_goals_pgw,team_xGA_pgw,team_goals_against_pgw,opponent_xG_pgw,opponent_goals_pgw,opponent_xGA_pgw,opponent_goals_against_pgw,total_points
David Luiz Moreira Marinho,defender,Arsenal,3,Liverpool,True,50.0,100.0,0,0,1,0,0,0.0,0.0,...,0.0,1.745945,2.5,1.095049,0.5,2.703115,3.0,0.586998,1.5,1
David Luiz Moreira Marinho,defender,Arsenal,4,Sheffield United,True,50.0,100.0,1,0,1,0,0,0.0,0.0,...,0.015401,1.556687,2.0,1.642812,1.333333,1.110602,0.0,1.294917,1.333333,2
David Luiz Moreira Marinho,defender,Arsenal,5,Manchester City,True,50.0,100.0,0,0,1,0,0,0.0,0.0,...,0.126772,1.334298,2.0,1.268358,1.25,1.234542,1.5,1.540377,1.75,2
David Luiz Moreira Marinho,defender,Arsenal,6,Leicester,True,50.0,100.0,1,0,1,0,0,0.0,0.8,...,0.110798,1.235103,1.6,1.300732,1.2,1.5876,2.4,1.245883,1.6,1


time: 234 ms (started: 2021-09-05 20:42:28 +01:00)



*[Note: (You may have already noticed) The features **chance_of_playing_this_round** and **chance_of_playing_next_round** do not reflect the gameweek in each row. In fact, they represent probabilities **at the time the data was gathered**. This is down to how the FPL API works. Therefore, we will have to drop them from any simulations run below, since their values are incorrect.]*

In [5]:
# For this section of the notebook, we're only interested in past observations. Therefore, we drop any gameweeks in 
# the future which haven't happened yet
df = df.dropna(subset=['total_points'])

# We also drop the features mentioned above 
df = df.drop(columns=['finished','chance_of_playing_this_round','chance_of_playing_next_round'])

print(df.shape)
df.head(4)

(14326, 44)


player_name,position,team_title,event,opponent_team_title,home_flag,goalkeeper_flag,defender_flag,midfielder_flag,forward_flag,goals_WMA,shots_WMA,xG_WMA,time_WMA,xA_WMA,...,xGBuildup_pgw,team_xG_pgw,team_goals_pgw,team_xGA_pgw,team_goals_against_pgw,opponent_xG_pgw,opponent_goals_pgw,opponent_xGA_pgw,opponent_goals_against_pgw,total_points
David Luiz Moreira Marinho,defender,Arsenal,3,Liverpool,0,0,1,0,0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.745945,2.5,1.095049,0.5,2.703115,3.0,0.586998,1.5,1
David Luiz Moreira Marinho,defender,Arsenal,4,Sheffield United,1,0,1,0,0,0.0,0.0,0.0,60.333333,0.0,...,0.015401,1.556687,2.0,1.642812,1.333333,1.110602,0.0,1.294917,1.333333,2
David Luiz Moreira Marinho,defender,Arsenal,5,Manchester City,0,0,1,0,0,0.0,0.0,0.0,75.166667,0.0,...,0.126772,1.334298,2.0,1.268358,1.25,1.234542,1.5,1.540377,1.75,2
David Luiz Moreira Marinho,defender,Arsenal,6,Leicester,1,0,1,0,0,0.0,0.8,0.032435,81.1,0.0,...,0.110798,1.235103,1.6,1.300732,1.2,1.5876,2.4,1.245883,1.6,1


time: 123 ms (started: 2021-09-05 20:42:32 +01:00)


<a id='LinearModels'></a>
## Linear Models

To begin with, I'd like to see how far we can get just by using linear models. I'm more familiar with linear models having studied them as part of my degree (in particular, Linear Regression), so I think it would be cool to put some of this theory into action.

For the linear models tried below, I believe stronger predictions will be made by first **partitioning the dataset by playing position** into smaller “subdatasets”. Namely, I've made the decision to create and train individual submodels for each playing position, in the hope of achieving better estimates. 

To illustrate my thinking, it should be noted that in fantasy football, players from different positions score points differently (e.g. goalkeepers and defenders score +4 points for a clean sheet, while midfielders score just +1 point, and forwards score 0 points). Therefore, I believe linear models, at least, will produce better estimates when each model/submodel is trained on just the one position specifically.

<a id='LR'></a>
## (Multiple) Linear Regression
We'll use Linear Regression (LR) as our first linear model. LR should provide two useful insights:

***Insight 1)*** *By examining the LR coefficients, we can understand which variables are significant when predicting fantasy football points.*

***Insight 2)***  *LR will provide ‘baseline’ predictions which future models can be benchmarked against.*

<br>

With the above two insights in mind, let's define some methodology:

**Methodology**

*Part 1: Understanding the coefficients* 
1. On *all* of last season's data, **check for multicollinearity** by examining bivariate correlations.
2. Remove the variables causing multicollinearity, i.e. bivariate correlations > 80%
3. **Check again for multicollinearity** by calculating the variance inflation factors
4. Remove the variables causing multicollinearity, i.e. VIF > 10
5. Finally, **perform LR for each position** / Visualize the significance of each coefficient

*Part 2: Making Predictions*

Then, using the features kept from Part 1:

1. Simulate model on last season's data
2. Analyse predictions

### Linear Regression, Part 1: Understanding the coefficients

The insight we'll be focusing on in this part will be ***Insight 1)***.

For ease, we'll run the Linear Regression model on *all* of last season's data. This way we'll quickly be able to understand which variables are important - when viewing the season as a whole. The alternative would be to model gameweek-by-gameweek and try to understand the variables that way, i.e. which variables are significant *and* for which gameweeks. However, I think for now it's perhaps best to keep things simple.

But first, let's define our subdatasets. And we'll wrap this step as a function for whenever we need to redefine them.


In [None]:
def define_player_subdatasets(df):

    # Note, for the goalkeeper subdataset we also drop irrelevant attacking player metrics like goals_WMA etc.
    df_goalkeeper = df[df['goalkeeper_flag']==1].drop(columns=['goalkeeper_flag','defender_flag','midfielder_flag','forward_flag', 
                                                           'goals_WMA','shots_WMA','xG_WMA','xA_WMA','assists_WMA','key_passes_WMA',
                                                           'npg_WMA','npxG_WMA','goals_pgw','shots_pgw','xG_pgw','xA_pgw', 
                                                           'assists_pgw', 'key_passes_pgw', 'npg_pgw', 'npxG_pgw'])
    df_defender = df[df['defender_flag']==1].drop(columns=['goalkeeper_flag','defender_flag','midfielder_flag','forward_flag'])
    df_midfielder = df[df['midfielder_flag']==1].drop(columns=['goalkeeper_flag','defender_flag','midfielder_flag','forward_flag'])
    df_forward = df[df['forward_flag']==1].drop(columns=['goalkeeper_flag','defender_flag','midfielder_flag','forward_flag'])

    print('Goalkeeper shape:'+str(df_goalkeeper.shape)+'\nDefender shape:'+str(df_defender.shape)+
          '\nMidfielder shape:'+str(df_midfielder.shape)+'\nForward shape:'+str(df_forward.shape))
    
    return df_goalkeeper, df_defender, df_midfielder, df_forward

df_goalkeeper, df_defender, df_midfielder, df_forward = define_player_subdatasets(df)

### 1<sup>st</sup> Check for Multicollinearity: Bivariate Correlation Test

We will first test for multicollinearity by calculating the bivariate correlations for all predictor variables. We will represent each correlation matrix as a heatmap.

In [None]:
# Calculate the correlation matrix for the predictor variables only, rounded to 2 decimal places
# (we round at creation, since reassigning a loop variable would leave the original matrices unrounded)
corrMatrix_goalkeeper = df_goalkeeper.loc[:,~df_goalkeeper.columns.isin(['total_points'])].corr().round(2)
corrMatrix_defender = df_defender.loc[:,~df_defender.columns.isin(['total_points'])].corr().round(2)
corrMatrix_midfielder = df_midfielder.loc[:,~df_midfielder.columns.isin(['total_points'])].corr().round(2)
corrMatrix_forward = df_forward.loc[:,~df_forward.columns.isin(['total_points'])].corr().round(2)

# Visualize every position correlation matrix
for corrMatrix in [(corrMatrix_goalkeeper, 'goalkeeper'), (corrMatrix_defender, 'defender'), 
                   (corrMatrix_midfielder, 'midfielder'), (corrMatrix_forward, 'forward')]:
    sns.set(rc={'figure.figsize':(36,16)})
    sns.heatmap(corrMatrix[0], annot=True, cmap = 'rocket').set_title(corrMatrix[1].capitalize()+" Correlation Matrix", 
                                                                                       fontsize = 24)
    # Fontsize
    plt.yticks(fontsize = 18)
    plt.xticks(fontsize = 18)
    plt.show()

### Remove the variables causing Multicollinearity (*i.e. bivariate correlations > 80%*)
*[Note: In this subsection, the "significance" of a predictor variable is its R-squared value when fitted against the target variable (**total_points**) via (Simple) Linear Regression.]*


When a pair of predictor variables A and B are too highly correlated, there is a risk of multicollinearity within the model - which we do not want. A straightforward way of dealing with this risk is to drop either variable A or variable B. However, one may ask themselves: which variable should we drop? 

There is also another risk associated with dropping such variables. Namely, when you drop any predictor variable you inevitably risk losing some of the explainability within your data...

<br>

Therefore, I believe a good way of dealing with the challenge posed by multicollinearity is to prioritise **dropping the less significant variables first** while **minimising the total number of variables dropped** altogether. To achieve this, I've written the function described below: 

#### Drop features function: bivariate correlation version

<!-- 1. Calculate the bivariate correlation matrix.
2. If all bivariate correlations are below the "*risk of multicollinearity*" threshold (80%), stop here.
2. Find pairs of variables causing multicollinearity.
3. For each pair, identify the lesser significant variable and add it to the "*droppable_list*"  (based off R-squared values). 
4. Drop the least significant variable from the "*droppable_list*" list
5. Go back to step 1 -->


<img src="img/drop_features_v1.png" alt="v1" style="width: 500px;"/>

*Created using https://www.101computing.net/flowchart/*
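The loop in the flowchart can be sketched on toy data as follows. This is a simplified illustration, not the function used in the notebook: the name `drop_correlated` and the toy columns `a`, `b`, `c` are hypothetical, and it relies on the fact that, for simple linear regression, the R-squared value equals the squared Pearson correlation between the predictor and the target.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, y, threshold=0.80):
    """Repeatedly drop the least significant variable among highly correlated pairs."""
    X = X.copy()
    r_squared = X.corrwith(y) ** 2   # simple-LR R-squared of each predictor vs. the target
    dropped = []
    while True:
        corr = X.corr()
        # Keep the upper triangle only: each pair appears once and the diagonal is ignored
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        pairs = upper.stack()
        pairs = pairs[pairs > threshold]
        if pairs.empty:
            break
        # For each offending pair, the weaker variable becomes a candidate for dropping
        candidates = {a if r_squared[a] < r_squared[b] else b for a, b in pairs.index}
        weakest = min(candidates, key=lambda v: r_squared[v])
        X = X.drop(columns=weakest)
        dropped.append(weakest)
    return X, dropped

# Toy data: 'b' is almost a copy of 'a', while 'c' is unrelated to both
n = 50
a = np.arange(n, dtype=float)
c = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
X_toy = pd.DataFrame({"a": a, "b": a + 2 * c, "c": c})
y_toy = pd.Series(3 * a)

X_kept, dropped = drop_correlated(X_toy, y_toy)
print(dropped, list(X_kept.columns))
```

Here `a` and `b` are highly correlated, and `b` explains slightly less of the target than `a`, so `b` is the variable that gets dropped.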

Before we run the algorithm however, let's calculate the R-squared value for each predictor variable.

In [None]:
def r_squared_by_position(position):
    
    # Initialise DataFrame
    df_by_position_dict = {'goalkeeper':df_goalkeeper, 'defender':df_defender, 
                           'midfielder':df_midfielder, 'forward':df_forward}
    df_position = df_by_position_dict[position]
    
    # Predictor variables are every column except the target
    # (this avoids hard-coding the number of predictors per position)
    predictors = [col for col in df_position.columns if col != 'total_points']

    # For every predictor variable
    r_squared_col = []
    for var_name in predictors:

        # Define variable & Standardise
        df_var = pd.DataFrame(StandardScaler().fit_transform(df_position[[var_name]])).rename(columns={0:var_name}, inplace=False)
        df_var['total_points'] = df_position[['total_points']].reset_index()['total_points']

        # Define independent and dependent variables
        X = df_var[var_name].values.reshape(-1,1)
        y = df_var['total_points']

        # Fit model  
        model = LinearRegression().fit(X, y)

        # Append R-squared value
        r_squared_col.append([var_name, model.score(X, y)])

    # Define DataFrame
    df_r_squared = pd.DataFrame(r_squared_col).rename(columns={0:'var',1:'r_squared'}, inplace=False)
    df_r_squared = df_r_squared.sort_values(by='r_squared', ascending = False).reset_index(drop=True)
    
    return df_r_squared

df_r_squared_goalkeeper = r_squared_by_position('goalkeeper')
df_r_squared_defender = r_squared_by_position('defender')
df_r_squared_midfielder = r_squared_by_position('midfielder')
df_r_squared_forward = r_squared_by_position('forward')
df_r_squared_forward.head(4)

Now we can run the algorithm for each position dataset.

In [None]:
def drop_features_v1(corrMatrix, df_r_squared_position, position):
    
    # Initialise DataFrame
    df_by_position_dict = {'goalkeeper':df_goalkeeper, 'defender':df_defender, 
                           'midfielder':df_midfielder, 'forward':df_forward}
    df_position = df_by_position_dict[position]

    print("Variables dropped from "+position.capitalize()+" dataset:")

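    # Keep looping while any off-diagonal correlation is above the "risk of multicollinearity" threshold (80%).
    # The diagonal (self-correlation of 1) always exceeds the threshold, so we look for more hits than diagonal entries.
    while (corrMatrix > 0.80).sum().sum() > len(corrMatrix):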

        # Create DataFrame containing the pairs of variables causing the multicollinearity
        bivar_corr = corrMatrix.stack()
        df_high_corr = pd.DataFrame(bivar_corr[bivar_corr>0.80]).reset_index()
        df_high_corr = df_high_corr.rename(columns={'level_0':'varA', 'level_1':'varB', 0:'corr'}, inplace=False)

        # Remove self-to-self correlations
        df_high_corr = df_high_corr[~(df_high_corr['varA']==df_high_corr['varB'])]

        # Join on R-squared values
        df_high_corr = pd.merge(df_high_corr, df_r_squared_position, left_on=['varA'], right_on=['var'], how='left')
        df_high_corr = df_high_corr.rename(columns={'var':'varAb'}, inplace=False)
        df_high_corr = pd.merge(df_high_corr, df_r_squared_position, left_on=['varB'], right_on=['var'], how='left')
        df_high_corr = df_high_corr.rename(columns={'var':'varBb', 'r_squared_x':'r_squared_varA',
                                                    'r_squared_y':'r_squared_varB'}, inplace=False)

        # For each pair, identify the lesser significant variable and add it to the "droppable_list"  (based off R-squared values)
        df_high_corr['min_varA_varB'] = df_high_corr[['r_squared_varA', 'r_squared_varB']].min(axis=1)
        df_high_corr['droppable_list'] = np.where(df_high_corr['r_squared_varB'] == df_high_corr['min_varA_varB'], 
                                                  df_high_corr['varB'], df_high_corr['varA'])
        df_high_corr = df_high_corr.sort_values(by='min_varA_varB')

        # Drop the least significant variable from the "droppable_list" list
        variable_to_be_dropped = df_high_corr['droppable_list'].iloc[0]                
        df_position = df_position.drop(columns = variable_to_be_dropped)
        print(variable_to_be_dropped)

        # Calculate the bivariate correlation matrix
        corrMatrix = df_position.loc[:,~df_position.columns.isin(['total_points'])].corr()
        corrMatrix = round(corrMatrix, 2)
    print("\n")
    
    return df_position

# Run for each subdataset
df_goalkeeper = drop_features_v1(corrMatrix_goalkeeper, df_r_squared_goalkeeper, 'goalkeeper')
df_defender = drop_features_v1(corrMatrix_defender, df_r_squared_defender, 'defender')
df_midfielder = drop_features_v1(corrMatrix_midfielder, df_r_squared_midfielder, 'midfielder')
df_forward = drop_features_v1(corrMatrix_forward, df_r_squared_forward, 'forward')

Looking again at the shape of our subdatasets - we can see they contain fewer predictor variables as expected.

In [None]:
print('Goalkeeper shape:'+str(df_goalkeeper.shape)+'\nDefender shape:'+str(df_defender.shape)+
      '\nMidfielder shape:'+str(df_midfielder.shape)+'\nForward shape:'+str(df_forward.shape))

### 2<sup>nd</sup> Check for Multicollinearity: Variance Inflation Factors

Let's test again for multicollinearity by calculating the Variance Inflation Factor (VIF) for each predictor variable. We will visualize the magnitude of the factors in a bar chart.
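For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors. Below is a minimal sketch of this textbook definition using NumPy only; the helper name `vif_from_definition` and the toy columns are hypothetical. Note that this sketch includes an intercept in each auxiliary regression, whereas, as far as I'm aware, `variance_inflation_factor` from statsmodels uses the design matrix exactly as it is passed in, so the two may give different values.

```python
import numpy as np
import pandas as pd

def vif_from_definition(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1 - residuals.var() / target.var()
        vifs[col] = 1 / (1 - r_squared)
    return pd.Series(vifs, name="vif")

# Toy data: 'b' is 'a' plus a small wiggle, 'c' is unrelated to both
n = 50
a = np.arange(n, dtype=float)
wiggle = np.tile([1.0, 1.0, -1.0, -1.0], 13)[:n]
c = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
X_toy = pd.DataFrame({"a": a, "b": a + 2 * wiggle, "c": c})

vifs = vif_from_definition(X_toy)
print(vifs.round(2))
```

In this toy example `a` and `b` land well above the threshold of 10, while `c` stays close to 1, the value for a predictor that the others cannot explain at all.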

In [None]:
# Initialise Dictionaries to store DataFrames
df_by_position_dict = {'goalkeeper':df_goalkeeper, 'defender':df_defender, 
                       'midfielder':df_midfielder, 'forward':df_forward}
df_vif_dict = {}

# For every predictor variable, calculate the VIF, and convert to DataFrame
for position in ['goalkeeper', 'defender', 'midfielder', 'forward']:
    
    # Initialise 
    df_vif_dict[position] = pd.DataFrame()
    df_position = df_by_position_dict[position]

    # Calculate VIFs using module  
    df_vif_dict[position]['vif_factor'] = [variance_inflation_factor(
                                                df_position.loc[:,~df_position.columns.isin(['total_points'])].values, i) 
                                                for i in range(df_position.loc[:,~df_position.columns.isin(['total_points'])].shape[1])]
    
    # Create column & sort
    df_vif_dict[position]['feature'] = df_position.loc[:,~df_position.columns.isin(['total_points'])].columns
    df_vif_dict[position] = df_vif_dict[position].sort_values(by='vif_factor')

# Visualize the VIFs for every position
for df_vif in [(df_vif_dict['goalkeeper'], 'goalkeeper'), (df_vif_dict['defender'], 'defender'), 
               (df_vif_dict['midfielder'], 'midfielder'), (df_vif_dict['forward'], 'forward')]:
    sns.set(rc={'figure.figsize':(21,7)})
    sns.barplot(x="feature", y="vif_factor", data=df_vif[0]).set_title(df_vif[1].capitalize()+
                " Variance Inflation Factors", fontsize=20)
    
    # Formatting
    axes = plt.gca()
    plt.hlines(y=10, xmin=axes.get_xlim()[0], xmax=axes.get_xlim()[1], colors='red')
    plt.text(axes.get_xlim()[0], 11, "Risk of multicollinearity")
    plt.xticks(rotation=70)
    plt.show()

### Remove the variables causing multicollinearity (*i.e. VIF > 10*)
*[Note: In this subsection, the "significance" of a predictor variable is its R-squared value when fitted against the target variable (**total_points**) via (Simple) Linear Regression.]*


<!-- Simlarly to be before, when a variable has a Variance Inflation Factor of above 10, there is also a risk of multicollinearity. 
 -->
VIFs greater than 10 may also indicate high multicollinearity within our dataset(s). Therefore, I've written another function, similar to the one before, to manage this challenge:

#### Drop features function: VIF version

<!-- 1. Calculate the VIFs for all predictor variables. 
2. If all VIFs are below the "*risk of multicollinearity*" threshold (10), stop here.
3. Find all the variables causing multicollinearity.
4. Drop the least significant variable from the list of variables.
5. Go back to step 1. -->

<img src="img/drop_features_v2.png" alt="v2" style="width: 500px;"/>

*Created using https://www.101computing.net/flowchart/*

Running the algorithm, we drop the following variables:

In [None]:
def drop_features_v2(df_r_squared_position, position):

    # Initialise 
    df_position = df_by_position_dict[position]
    
    print("Variables dropped from "+position.capitalize()+" dataset:")

    while True:
        
        # Calculate the VIFs for all predictor variables using module
        X = df_position.loc[:,~df_position.columns.isin(['total_points'])]
        df_vif = pd.DataFrame({'vif_factor': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                               'feature': X.columns})

        # Find all the variables causing multicollinearity
        df_high_vif = df_vif[df_vif['vif_factor']>10].reset_index(drop=True)

        # Are all VIFs below the "risk of multicollinearity" threshold (10)? If so, stop here
        # (checking before dropping avoids an error when no variable exceeds the threshold)
        if len(df_high_vif) == 0:
            break

        # Join on R-squared values
        df_high_vif = pd.merge(df_high_vif, df_r_squared_position, left_on=['feature'], right_on=['var'], how='left')
        df_high_vif = df_high_vif.sort_values(by='r_squared')

        # Drop the least significant variable from the list of variables
        variable_to_be_dropped = df_high_vif['feature'].iloc[0]
        print(variable_to_be_dropped)
        df_position = df_position.drop(columns = variable_to_be_dropped)

    # Store the final VIFs for this position
    df_vif_dict[position] = df_vif.sort_values(by='vif_factor').reset_index(drop=True)
    
    print("\n")
    
    return df_position
    
# Run for each subdataset
df_goalkeeper = drop_features_v2(df_r_squared_goalkeeper, 'goalkeeper')
df_midfielder = drop_features_v2(df_r_squared_midfielder, 'midfielder')
df_defender = drop_features_v2(df_r_squared_defender, 'defender')
df_forward = drop_features_v2(df_r_squared_forward, 'forward')

Again, let's look at the shape of our subdatasets. Now they contain even fewer predictor variables as expected.

In [None]:
print('Goalkeeper shape:'+str(df_goalkeeper.shape)+'\nDefender shape:'+str(df_defender.shape)+
      '\nMidfielder shape:'+str(df_midfielder.shape)+'\nForward shape:'+str(df_forward.shape))

### Perform LR for each position / Visualize the significance of each coefficient

Finally, we can perform the (Multiple) Linear Regression(s), since we have minimised the multicollinearity within our dataset(s).

In [None]:
def viz_significant_features(position):
    
    # Initialise
    df_by_position_dict = {'goalkeeper':df_goalkeeper, 'defender':df_defender, 
                           'midfielder':df_midfielder, 'forward':df_forward}
    df_position = df_by_position_dict[position]

    # Standardise independent variables (into a new DataFrame, so the subdataset itself is left unchanged)
    predictors = df_position.columns[df_position.columns != 'total_points']
    X = pd.DataFrame(StandardScaler().fit_transform(df_position[predictors]),
                     columns=predictors, index=df_position.index)

    # Define dependent variable
    y = df_position['total_points']

    # Fit model  
    model = LinearRegression().fit(X, y)

    # Make DataFrame w/ coefficients
    df_plot = pd.DataFrame({'values': model.coef_, 'variable': X.columns}).sort_values(by='values')
    df_plot = df_plot.set_index('variable')

    # Formatting 
    plt.figure(figsize=(21,7))
    plt.title(position.capitalize()+" dataset: Coefficient Significance"+"\nR-squared: "+str(round(model.score(X,y),3)), fontsize=16)
    plt.xlabel("Variable", fontsize=14)
    plt.ylabel("Coefficient", fontsize=14)

    # Plot
    df_plot['values'].plot(kind='bar',color=(df_plot['values'] > 0).map({True: 'g', False: 'r'}))
    plt.xticks(rotation=70)
    plt.xticks(fontsize=12)

    return 

viz_significant_features('goalkeeper')
viz_significant_features('defender')
viz_significant_features('midfielder')
viz_significant_features('forward')

To conclude Part 1, recall that the goal of this section was to gain the following insight: "***Insight 1)*** *By examining the LR coefficients, we can understand which variables are significant when predicting fantasy football points.*" With that in mind, let's examine the coefficients of each submodel:

1. All submodels had an R-squared value of roughly 17%, apart from the defender submodel, which had a considerably lower R-squared value of roughly 8%.
2. Notably, **time_pgw** was a significant variable in most submodels.
    * This shouldn't come as a surprise - players need to play to gain FPL points! The variable was, however, dropped from the defender submodel, which could perhaps explain that submodel's lower R-squared value.

3. Interestingly, **xGChain_pgw** was very important for both goalkeepers and defenders. 
    * This variable captures players who are heavily involved in the buildup play to goals. It's therefore an interesting observation that goalkeepers and defenders who are more involved in the buildup play are likely to gain more FPL points.

4. Unexpectedly, **assists_pgw** had a negative coefficient in the midfielder submodel.
    * This definitely came as a surprise, as the variable can almost certainly be considered to have the "wrong sign". After doing some research, however (see [Oh No! I Got the Wrong Sign! What should I do](http://www.stat.columbia.edu/~gelman/stuff_for_blog/oh_no_I_got_the_wrong_sign.pdf) for more), I've decided to continue with the variables proposed by both of the above algorithms. There are many possible reasons why this variable has the opposite sign to the one expected, so I think this is best left for future investigation. 
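
One commonly cited mechanism behind a "wrong sign" is correlation between predictors. The sketch below uses purely synthetic data (the variables `x1`, `x2` and all numbers are made up for illustration) to show a predictor that is positively correlated with the target on its own, yet receives a negative coefficient once a correlated predictor is included. It is not a diagnosis of the midfielder submodel, only an example of how such a sign can arise.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Two strongly correlated predictors (synthetic, for illustration only)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)

# The target depends positively on x1 and (slightly) negatively on x2
y = 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# On its own, x2 is positively correlated with the target...
marginal_corr = np.corrcoef(x2, y)[0, 1]

# ...but its coefficient in the joint least-squares fit is negative
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
x2_coef = coefs[2]

print(f"corr(x2, y) = {marginal_corr:.2f}, fitted x2 coefficient = {x2_coef:.2f}")
```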

### Linear Regression, Part 2: Making Predictions 

Now we'll focus on ***Insight 2)***.

Using the important features we have just found for each position, let's make some predictions on last season's data. In particular, let's simulate how LR would have performed had the model(s) been deployed at the beginning of last season.
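
To make the simulation setup concrete, here is a minimal sketch of the expanding training window and five-gameweek test window used by the simulation functions in this notebook. The helper name `gameweek_windows` and its `horizon` parameter are illustrative and not part of the notebook's own code.

```python
def gameweek_windows(gameweek, horizon=5):
    """Return (train_gameweeks, test_gameweeks) for a simulated gameweek.

    Hypothetical helper for illustration: everything before `gameweek`
    is training data, and the next `horizon` gameweeks form the test set.
    """
    train_gameweeks = list(range(0, gameweek))
    test_gameweeks = list(range(gameweek, gameweek + horizon))
    return train_gameweeks, test_gameweeks


# Standing at the start of gameweek 5: train on 0-4, predict 5-9
train_gws, test_gws = gameweek_windows(5)
print(train_gws)
print(test_gws)
```

The training window grows by one gameweek at each step, so later simulated gameweeks are trained on more data than earlier ones.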

<br>

Before running any simulations, however, we must redefine our subdatasets/submodels using only the important features kept from Part 1:

In [None]:
# Define the important features we have just found
goalkeeper_vars = df_goalkeeper.columns
defender_vars = df_defender.columns
midfielder_vars = df_midfielder.columns
forward_vars = df_forward.columns

# Keep the important features we have just found
df_goalkeeper = df[df['goalkeeper_flag']==1][goalkeeper_vars]
df_defender = df[df['defender_flag']==1][defender_vars]
df_midfielder = df[df['midfielder_flag']==1][midfielder_vars]
df_forward = df[df['forward_flag']==1][forward_vars]

print('Goalkeeper shape:'+str(df_goalkeeper.shape)+'\nDefender shape:'+str(df_defender.shape)+
      '\nMidfielder shape:'+str(df_midfielder.shape)+'\nForward shape:'+str(df_forward.shape))

### Simulate model on last season's data: Linear Regression
Let's define some functions ```get_preds_linear_model()``` and ```simulate_linear_model()``` to simulate how our LR model(s) would have performed.

In [None]:
def get_preds_linear_model(gameweek, position, model_str):

    """Returns the predictions made for a given simulated gameweek and 
    (position) subdataset. 

    :param: int64 gameweek: The first gameweek in the test range. E.g. if 
            we're at the start of gameweek 5, we make predictions for gameweeks 
            5,6,7,8 and 9. 
            str position: The subdataset/submodel we're running. 
            str model_str: The model used to generate predictions.
            
    :rtype: float64 y_train_pred: Training data predictions obtained via 
            10-Fold Cross Validation.
            int64 y_train_actual: Training data target variable actual values.
            float64 y_test_pred: Test data predictions obtained via the 
            model trained on the training data.
            int64 y_test_actual: Test data target variable actual values.
    
    """    

    # Initialise gameweek ranges
    prev_gw = gameweek-1
    all_gameweeks = list(range(0,prev_gw+6))
    train_gameweeks = list(range(0,prev_gw+1))
    test_gameweeks = list(range(prev_gw+1,prev_gw+6))

    # Initialise position subdataset
    df_by_position_dict = {'goalkeeper':df_goalkeeper, 'defender':df_defender, 
                           'midfielder':df_midfielder, 'forward':df_forward}
    df_position = df_by_position_dict[position]

    # Get all gameweeks in both sets of ranges
    df_all_gameweeks = df_position[(df_position.index.get_level_values('event').isin(all_gameweeks))]

    # Rename target variable
    df_all_gameweeks = df_all_gameweeks.rename(columns={'total_points':'total_points_actual'})

    # Standardise the independent variables
    df_all_gameweeks.loc[:, df_all_gameweeks.columns != 'total_points_actual'] = StandardScaler().fit_transform(
                                            df_all_gameweeks.loc[:, df_all_gameweeks.columns != 'total_points_actual'])

    # Train-Test split the data
    train = df_all_gameweeks[(df_all_gameweeks.index.get_level_values('event').isin(train_gameweeks))]
    test = df_all_gameweeks[(df_all_gameweeks.index.get_level_values('event').isin(test_gameweeks))]

    # Define independent and dependent variables
    X_train = train.loc[:, train.columns != 'total_points_actual']
    y_train_actual = train['total_points_actual']     
    X_test = test.loc[:, test.columns != 'total_points_actual']
    y_test_actual = test['total_points_actual'] 
    
    # Which model are we simulating?...
    if model_str == 'Linear Regression': 
        model = LinearRegression().fit(X_train, y_train_actual)
        
    elif model_str == 'Ridge Regression':
        # Create Ridge regression with multiple alphas - then fit to training data
        regr_cv = RidgeCV(alphas = np.logspace(-10, 5, 100), cv=10)
        ridge_cv = regr_cv.fit(X_train, y_train_actual)
        
        # Find the best alpha using Cross Validation
        best_alpha = ridge_cv.alpha_
        model = RidgeCV(alphas = [best_alpha]).fit(X_train, y_train_actual)
        
    elif model_str == 'LASSO Regression':
        # Create LASSO regression with multiple alphas - then fit to training data
        regr_cv = LassoCV(alphas = np.logspace(-10, 5, 100), cv=10)
        lasso_cv = regr_cv.fit(X_train, y_train_actual)

        # Find the best alpha using Cross Validation
        best_alpha = lasso_cv.alpha_
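        model = LassoCV(alphas = [best_alpha]).fit(X_train, y_train_actual) 

    else:
        # Fail early instead of hitting a NameError on `model` below
        raise ValueError("Unknown model_str: " + str(model_str))

    # Make predictions    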
    y_train_pred = cross_val_predict(model, X_train, y_train_actual, cv=10)
    y_test_pred = model.predict(X_test)
    
    return y_train_pred, y_train_actual, y_test_pred, y_test_actual



def simulate_linear_model(model_str):
    
    """Returns the predictions made for every gameweek and every subdataset/
    submodel, i.e. the full (simulated) Premier League 2020/2021 season.  

    :param: str model_str: The model used to generate predictions.
    :rtype: DataFrame df_predictions_grouped: Stores the mean absolute error 
            of the predictions for every gameweek and every subdataset/submodel.
    
    """   

    predictions_array = []

    # For every gameweek in the season
    for gw in tqdm(list(range(3,39))):

        # Reset the per-gameweek list so earlier gameweeks are not re-appended
        pos_predictions_array = []
        
        # For every position
        for pos in ['goalkeeper', 'defender', 'midfielder', 'forward']:

            # Simulate predictions for the gameweek 
            y_train_pred, y_train_actual, y_test_pred, y_test_actual = get_preds_linear_model(gw, pos, model_str) 
            
            # Convert predictions to DataFrame
            df_train_pos_predictions = pd.DataFrame({'dataset':'training', 'gameweek':gw, 'position':pos, 
                                                     'predicted_points':y_train_pred, 'actual_points':y_train_actual})
            df_test_pos_predictions = pd.DataFrame({'dataset':'test', 'gameweek':gw, 'position':pos, 
                                                    'predicted_points':y_test_pred, 'actual_points':y_test_actual})
            
            # Concatenate & append
            df_pos_predictions = pd.concat([df_train_pos_predictions, df_test_pos_predictions])
            pos_predictions_array.append(df_pos_predictions)

        # Concatenate and append
        df_gw_predictions = pd.concat(pos_predictions_array)
        predictions_array.append(df_gw_predictions)
            
    # Concatenate ALL predictions for the season into single DataFrame
    df_predictions = pd.concat(predictions_array)
        
    # Calculate absolute error
    df_predictions['absolute_error'] = abs(df_predictions['predicted_points'] - df_predictions['actual_points'])
    df_predictions = df_predictions.drop(columns={'predicted_points','actual_points'}).reset_index(drop=True)

    # Groupby at two levels to get: a) Predictions for all positions by gameweek (i.e. overall level)
    df_predictions_overall = df_predictions.drop(columns='position').groupby(['dataset','gameweek']).agg(['mean']).reset_index()
    df_predictions_overall.insert(2, 'position', 'overall')

    # b) Predictions by gameweek and by position
    df_predictions_by_position = df_predictions.groupby(['dataset','gameweek','position']).agg(['mean']).reset_index()

    # Concatenate to get full grouped predictions
    df_predictions_grouped = pd.concat([df_predictions_overall, df_predictions_by_position])
    df_predictions_grouped.columns = df_predictions_grouped.columns.droplevel(1)
    df_predictions_grouped = df_predictions_grouped.rename(columns={'absolute_error':model_str.lower().replace(" ","_")+"_mae"})
    
    return df_predictions_grouped
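
Note that `get_preds_linear_model()` standardises the training and test gameweeks together, so the scaler sees the test rows. A possible variant, sketched here on synthetic arrays rather than the notebook's data, fits `StandardScaler` on the training rows only and reuses those statistics for the test rows. The array names below are placeholders, and switching to this variant could change the MAE figures reported later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and test feature matrices
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(6))
```

The scaled test features will generally not have exactly zero mean, which is expected: they are measured against the training distribution, as they would be in a real deployment.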

Now, let's run the simulation using Linear Regression.

In [None]:
# Initialise dictionary which will store predictions
predictions_dict = {}

# Run simulation
df_predictions_grouped = simulate_linear_model(model_str = 'Linear Regression')

# Add to dictionary of predictions
predictions_dict['Linear Regression'] = df_predictions_grouped
df_predictions_grouped.head(4)
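
To illustrate what the grouped output contains, here is a toy version of the aggregation on made-up numbers: the absolute error is computed per row and then averaged by position and across all positions. The DataFrame `df_toy` is hypothetical and only mirrors the column names used in `simulate_linear_model()`.

```python
import pandas as pd

# Toy predictions for a single simulated gameweek (made-up numbers)
df_toy = pd.DataFrame({
    'dataset': ['test'] * 4,
    'gameweek': [3] * 4,
    'position': ['goalkeeper', 'goalkeeper', 'forward', 'forward'],
    'predicted_points': [1.0, 2.0, 3.0, 4.0],
    'actual_points': [2, 2, 1, 6],
})

# Absolute error per player-row
df_toy['absolute_error'] = (df_toy['predicted_points'] - df_toy['actual_points']).abs()

# MAE by position, and across all positions ("overall")
mae_by_position = df_toy.groupby(['dataset', 'gameweek', 'position'])['absolute_error'].mean()
mae_overall = df_toy.groupby(['dataset', 'gameweek'])['absolute_error'].mean()

print(mae_by_position)
print(mae_overall)
```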

### Analyse predictions: Linear Regression

To help analyse the predictions/simulations - I've created a straightforward [tool](https://public.tableau.com/app/profile/samuel.harrison2532/viz/model_simulation_analysis/Dashboard) within Tableau. I had some experience using the software while on placement - so I thought it'd be cool to continue using/learning the software as part of this project. 

Anyhow, let's summarise how Linear Regression performed:
 
**Season-long Average Mean Absolute Errors**:

| Training/Test | Position<br>subdataset | Linear Regression<br>Season-long<br>Avg. MAE (2 dp) |
| --- | --- | --- |
| Test | Overall | **1.88** |
| Test | Goalkeepers | 2.18 |
| Test | Defenders | 2.04 |
| Test | Midfielders | 1.63 |
| Test | Forwards | 2.14 |

* Using Linear Regression, we have found 'baseline' predictions as planned (recall ***Insight 2)***). Namely, we ended up with an Overall (i.e. all positions) Season-long average MAE of **1.88 points.**
* On average, Midfielders were predicted the best while Goalkeepers were predicted the worst. 

**Other insights**:

* However, there is cause for concern for Forwards and Midfielders. Between GWs 25-28, **MAE increased sharply** in both positions: by about 21% for Forwards and by about 8% for Midfielders over the course of the three gameweeks. We expect models to improve with more data, so this was unexpected.
    


<!-- ; error spiked between GWs 23-28, namely, MAE increased from 1.61 to 2.07 over the course of the five gameweeks. 
 -->
<!-- Analysis Outline:
* High Level/Season-long analysis insights by overall/BestToWorst position
* Overfitting? -->

<a id='RidgeAndLASSO'></a>
## Ridge Regression and LASSO Regression
Next, let's try some types of regularized regression: Ridge Regression (Ridge) and LASSO Regression (LASSO). 

The benefit of using these two types of regression is that they handle multicollinearity better than Linear Regression. Because of this, the setup for this next section will be much simpler. Hence, let's define the methodology:

**Methodology** 

1. Parameter tuning: **How will we tune the parameter alpha $\alpha$?**
2. Simulate model on last season's data
3. Analyse predictions

<br>

Before continuing, we must redefine our subdatasets.

In [None]:
df_goalkeeper, df_defender, df_midfielder, df_forward = define_player_subdatasets(df)

### Parameter tuning

**1. What is alpha $\alpha$?**

As we have just said, there isn't much setup for Ridge and LASSO when compared to LR. Ridge and LASSO are, however, **parameterised by the tuning parameter alpha, $\alpha$**. This value controls a shrinkage penalty which, to put it briefly, penalizes large coefficients and so keeps them from growing too large. Statistically, the parameter **adds bias** to the predictions in exchange for **reduced variance**. 
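
For reference, the two penalised objectives can be written as follows. These are the forms minimised by scikit-learn's `Ridge` and `Lasso` estimators as I understand them from its documentation; $w$ (the coefficient vector) and $n$ (the number of training rows) are notation introduced here for illustration.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{LASSO:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Setting $\alpha = 0$ recovers ordinary least squares in both cases, while larger values of $\alpha$ shrink the coefficients towards zero. The $\ell_1$ penalty used by LASSO can set some coefficients exactly to zero, whereas the $\ell_2$ penalty used by Ridge only shrinks them.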

*[Note: I could talk more about both regularized regression models and the mathematical/statistical differences between the two; however, for now I'd like to keep this part brief. It's Summer 2021 at the time of completing this project and I'll soon be returning to University. The syllabus next semester covers these kinds of statistical modelling topics (regression, bias, variance and estimators) in greater detail than I've learned so far. Hence, I'd like to wait until then before studying this material in greater depth!]*


<br>

**2. How do we optimize $\alpha$?**


We'll optimize $\alpha$ **for each position and for each gameweek** by 10-Fold Cross Validation. Namely, we'll try n=100 different values spaced evenly on a log scale between $10^{-10}$ and $10^{5}$.
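
As a rough sketch of this tuning step, the snippet below runs the same alpha search with `RidgeCV` and `LassoCV` on synthetic data. The data, the coefficient values and the `max_iter` setting are assumptions made for illustration; in the notebook the search runs inside `get_preds_linear_model()` on each position's training gameweeks.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)

# Synthetic regression data standing in for one position/gameweek training set
X = rng.normal(size=(200, 5))
true_coefs = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coefs + rng.normal(scale=0.5, size=200)

# 100 candidate alphas spaced evenly on a log scale between 1e-10 and 1e5
alphas = np.logspace(-10, 5, 100)

# Each estimator picks the alpha with the best 10-fold cross-validated score
ridge_cv = RidgeCV(alphas=alphas, cv=10).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=10, max_iter=10000).fit(X, y)

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best LASSO alpha:", lasso_cv.alpha_)
```

The chosen value is exposed as the `alpha_` attribute of each fitted estimator, which is what the simulation function reads before refitting.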

### Simulate model on last season's data: Ridge Regression and LASSO Regression

Now, let's run the simulation using Ridge Regression.

In [None]:
# Run simulation
df_predictions_grouped  = simulate_linear_model(model_str = 'Ridge Regression')

# Add to dictionary of predictions
predictions_dict['Ridge Regression'] = df_predictions_grouped
df_predictions_grouped.head(4)

Now, let's run the simulation using LASSO Regression.

In [None]:
# Run simulation
df_predictions_grouped = simulate_linear_model(model_str = 'LASSO Regression')

# Add to dictionary of predictions
predictions_dict['LASSO Regression'] = df_predictions_grouped
df_predictions_grouped.head(4)

### Analyse predictions: Ridge Regression and LASSO Regression

Again making use of the Tableau [dashboard](https://public.tableau.com/app/profile/samuel.harrison2532/viz/model_simulation_analysis/Dashboard), let's summarise the performance of both Ridge Regression and LASSO Regression:
 
**Season-long Average Mean Absolute Errors**:

| Training/Test | Position<br>subdataset | Linear Regression<br>Season-long<br>Avg. MAE (2 dp) |Ridge Regression<br>Season-long<br>Avg. MAE (2 dp) |LASSO Regression<br>Season-long<br>Avg. MAE (2 dp) |
| --- | --- | --- | --- | --- |
| Test | Overall     | **1.88** | **1.88**|**1.87** |
| Test | Goalkeepers | 2.18     | 2.28    | 2.18    |
| Test | Defenders   | 2.04     | 1.99    | 1.99    |
| Test | Midfielders | 1.63     | 1.64    | 1.63    |
| Test | Forwards    | 2.14     | 2.18    | 2.14    |
  
* Of the two regularized regression models, **LASSO Regression performed the best**, bettering Linear Regression's Overall Season-long Avg. MAE (1.87 vs 1.88); Ridge Regression matched Linear Regression to 2 dp.
* Similar to before, both regularized models predicted Midfielders the best and Goalkeepers the worst.

<!-- 
* Using Linear Regression, we have found 'baseline' predictions as planned - recall ***Insight 2)***. Namely, we ended up with an overall (i.e. for all positions) season-long average MAE of **1.88 points.**
* On average, Midfielders were predicted the best while Goalkeepers were predicted the worse.  -->

**Other insights**:

* Unfortunately, for both Ridge and LASSO, MAE increased sharply between GWs 25-28 for Forwards and Midfielders, as we saw before with the Linear Regression model. Clearly something is going on here. I've done some investigating and come up with a potential explanation as to what might be happening:

**Potential explanation for MAE spike between GWs 25-28 (for both Forwards and Midfielders):**

* I think the root cause of this problem may well be **squad rotation**. Namely, I think there was a lot of unexpected rotation amongst both Forwards and Midfielders, which may have impacted the models' predictions. Here are a couple of reasons why I think teams may have rotated their Forwards/Midfielders around this period:
    1. **Double Gameweeks:** GWs 24 and 25 were "double gameweeks", i.e. some teams had two games within one gameweek, and fixture congestion commonly leads to squad rotation.
    2. **Champions League/Europa League:** The knockout stages of the Champions League and the Europa League began around GW25, which again likely led to lots of unexpected squad rotation. In fact, 7 Premier League teams played in these knockout games (Man City, Man United, Liverpool, Chelsea, Arsenal, Spurs and Leicester), which you might expect to have exacerbated the impact.
    3. **Why Forwards and Midfielders?:** Notice that predictive accuracy for Goalkeepers and Defenders *did not worsen* around the same period, as it did for Forwards and Midfielders. I would argue this supports my hypothesis, since Forwards/Midfielders tend to be rotated more than Goalkeepers/Defenders. 

*[Note: A more rigorous investigation should be carried out in future. For now, however, I plan on monitoring the models' performances next season to see if the same thing happens again.]*

<a id='NonLinearModels'></a>
## Non-linear models

Now, let's move on to some non-linear models. 

Since we're now using non-linear models, **we will no longer partition the dataset by playing position**. I'm hopeful the non-linear models tried below will capture the non-linear relationships between the predictor variables; in particular, the relationship between the position variables (e.g. defender_flag) and the rest of the predictor variables.
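
To illustrate the kind of relationship I have in mind, here is a small sketch on synthetic data in which the effect of a predictor depends on a binary flag (loosely analogous to a position flag). All variables and numbers are invented for illustration. A single linear model shares one slope across both flag values, while a tree ensemble can split on the flag and model each group separately.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: the effect of x on the target flips sign with the flag
x = rng.uniform(0, 1, size=n)
flag = rng.integers(0, 2, size=n)
y = np.where(flag == 1, 4 * x, -4 * x) + rng.normal(scale=0.1, size=n)
X = np.column_stack([x, flag])

# Simple holdout split
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# A single linear model has one slope for x, shared by both flag values
lr = LinearRegression().fit(X_train, y_train)
# A tree ensemble can split on the flag first, then model x within each group
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

lr_mae = mean_absolute_error(y_test, lr.predict(X_test))
rf_mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"Linear Regression MAE: {lr_mae:.2f}, Random Forest MAE: {rf_mae:.2f}")
```

This is an extreme case chosen to make the effect visible; the interactions in the real dataset are likely to be much subtler.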

<!-- To begin with, I'd like to see how far we can get just by using linear models. I'm more familiar with linear models having studied them as part of my degree (in particular, Linear Regression), so I think it would be cool to put some of this theory into action.

For the linear models tried below, I believe stronger predictions will be made by first **partitioning the dataset by playing position** into smaller “subdatasets”. Namely, I've made the decision to create and train individual submodels for each playing position, in the hope of achieving better estimates. 

To illustrate my thinking, it should be noted that in fantasy football, players from different positions score points differently (e.g. goalkeepers and defenders score +4 points for a clean sheet, while midfielders score just +1 point, and forwards score 0 points). Therefore, I believe linear models, at least, will produce better estimates when each model/submodel is trained on just the one position specifically. -->

<a id='RF'></a>
## Random Forest 
For the first non-linear model, we'll try a Random Forest (RF). Let's define the methodology:

**Methodology**

1. On *all of last season's data*, use **RandomizedSearchCV** to narrow down the search space for the best possible parameters.
2. Using the best parameters found from step 1, concentrate our search for the best parameters using **GridSearchCV**.

Then, using the best settings (just found):

3. Simulate model on last season's data
4. Analyse predictions

*[Note: This methodology was inspired by the article [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) written by Will Koehrsen. Random Forest has lots of settings hence I wasn't sure where exactly to begin. So thanks to Will for writing such a helpful, well-written article on the subject!]*

### Randomized search
First, let's create a parameter grid to sample from.

In [None]:
# Number of trees in Random Forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
# (1.0 = all features; this replaces the 'auto' option removed in scikit-learn 1.3)
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# Define independent and dependent variables
X = df.loc[:, ~df.columns.isin(['total_points'])]
y = df['total_points']

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()

# Random search of parameters, using 3 fold cross validation (4320 settings in total)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 30, cv = 3, verbose=2, 
                               random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X, y)

From the n=30 random samples, what were the best parameters?

In [None]:
rf_random.best_params_

### Grid search
Using the best parameters from the above search, let's concentrate our search for the best RF parameters using GridSearchCV.

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [8, 10],
    'max_features': [5, 7],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [5, 7],
    'n_estimators': [50, 100, 200]
}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model
rf_grid = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 10, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
rf_grid.fit(X, y)

From the n=48 combinations in the grid, what were the best parameters?

In [None]:
rf_grid.best_params_
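
As a quick sanity check on the search-space sizes quoted above (4320 for the random grid and 48 for the concentrated grid), the sketch below multiplies the number of candidate values per hyperparameter. The counts are read off the two grids by hand, and the dictionary names are illustrative.

```python
from math import prod

# Number of candidate values per hyperparameter, read off the grids above
random_grid_sizes = {
    'n_estimators': 10,
    'max_features': 2,
    'max_depth': 12,        # 11 depths plus None
    'min_samples_split': 3,
    'min_samples_leaf': 3,
    'bootstrap': 2,
}
param_grid_sizes = {
    'bootstrap': 1,
    'max_depth': 2,
    'max_features': 2,
    'min_samples_leaf': 2,
    'min_samples_split': 2,
    'n_estimators': 3,
}

n_random_settings = prod(random_grid_sizes.values())
n_grid_settings = prod(param_grid_sizes.values())

print(n_random_settings, n_grid_settings)
print(f"Randomized search coverage: {30 / n_random_settings:.1%}")
```

With `n_iter = 30`, the randomized search samples under 1% of the possible settings, which is why the follow-up grid search around the best sample is useful.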

Let's save down our estimate for the best possible RF settings.

In [None]:
# Save best estimator for future reference
rf_best_estimator = rf_grid.best_estimator_

### Simulate model on last season's data: Random Forest
As mentioned earlier, we'll no longer be partitioning the dataset by playing position since we're now testing non-linear models. With that said, let's define two new functions ```get_preds_non_linear_model()``` and ```simulate_non_linear_model()``` to help run the simulations for the following non-linear models.

In [6]:
def get_preds_non_linear_model(gameweek, model_str):

    """Returns the predictions made for a given simulated gameweek.
    
    :param: int64 gameweek: The first gameweek in the test range. E.g. if 
            we're at the start of gameweek 5, we make predictions for gameweeks 
            5,6,7,8 and 9. 
            str model_str: The model used to generate predictions.
            
    :rtype: DataFrame train: Training data with predictions obtained via 
            10-Fold Cross Validation.
            DataFrame test: Test data with predictions obtained via the 
            model trained on the training data.
    """    

    # Initialise gameweek ranges
    prev_gw = gameweek-1
    all_gameweeks = list(range(0,prev_gw+6))
    train_gameweeks = list(range(0,prev_gw+1))
    test_gameweeks = list(range(prev_gw+1,prev_gw+6))

    # Get all gameweeks in both sets of ranges
    df_all_gameweeks = df[(df.index.get_level_values('event').isin(all_gameweeks))]

    # Rename target variable
    df_all_gameweeks = df_all_gameweeks.rename(columns={'total_points':'total_points_actual'})

    # Standardise the independent variables
    df_all_gameweeks.loc[:, df_all_gameweeks.columns != 'total_points_actual'] = StandardScaler().fit_transform(
                                            df_all_gameweeks.loc[:, df_all_gameweeks.columns != 'total_points_actual'])

    # Train-Test split the data
    train = df_all_gameweeks[(df_all_gameweeks.index.get_level_values('event').isin(train_gameweeks))]
    test = df_all_gameweeks[(df_all_gameweeks.index.get_level_values('event').isin(test_gameweeks))]

    # Define independent and dependent variables
    X_train = train.loc[:, train.columns != 'total_points_actual']
    y_train_actual = train['total_points_actual']     
    X_test = test.loc[:, test.columns != 'total_points_actual']
    y_test_actual = test['total_points_actual'] 

    # Which model are we simulating?...
    if model_str == 'Random Forest':
        # Create Random Forest with best settings 
        rf = rf_best_estimator

        # Fit Random Forest to training data
        model = rf.fit(X_train, y_train_actual)    
        
    elif model_str == 'XGBoost':
        # Create XGBoost with best settings 
        xbgr = xgbr_best_estimator

        # Fit XGBoost to training data
        model = xbgr.fit(X_train, y_train_actual) 
        
    elif model_str == 'XGBoost V2':
        # Initialise XGBoost
        xbgr = xgb.XGBRegressor()
        
        # Setup search heuristic using parameter grid from earlier
        sh = HalvingGridSearchCV(xbgr, param_grid, cv = 5, factor = 5, 
                    min_resources ='exhaust', n_jobs = -1, verbose = 2, random_state = 42).fit(X_train, y_train_actual) 
        
        # Fit XGBoost best estimator to training data
        model = sh.best_estimator_.fit(X_train, y_train_actual) 

    # Make predictions    
    y_train_pred = cross_val_predict(model, X_train, y_train_actual, cv=10)
    y_test_pred = model.predict(X_test)

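    # Create prediction column for train/test DataFrames
    # (copy first so we don't assign into a slice of df_all_gameweeks)
    train = train.copy()
    test = test.copy()
    train['total_points_predicted'] = y_train_pred
    test['total_points_predicted'] = y_test_pred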

    return train, test


def simulate_non_linear_model(model_str):
    
    """Returns the predictions made for every gameweek, i.e. the full 
    (simulated) Premier League 2020/2021 season.  

    :param: str model_str: The model used to generate predictions.
    :rtype: DataFrame df_predictions_grouped: Stores the mean absolute error 
            of the predictions for every gameweek, overall and by position.
    
    """   
    
    gw_predictions_array = []

    # For every gameweek in the season
    for gw in tqdm(list(range(3,39))):
        
        # Simulate predictions for the gameweek 
        train, test = get_preds_non_linear_model(gw, model_str) 

        # Manipulate training predictions
        df_train_predictions = train.reset_index()
        df_train_predictions['dataset'] = 'training'
        df_train_predictions['gameweek'] = gw
        df_train_predictions = df_train_predictions.rename(columns = {'total_points_predicted':'predicted_points',
                                                                      'total_points_actual':'actual_points'})
        df_train_predictions = df_train_predictions[['dataset','gameweek','position','predicted_points', 'actual_points']]
        
        # Manipulate test predictions
        df_test_predictions = test.reset_index()
        df_test_predictions['dataset'] = 'test'
        df_test_predictions['gameweek'] = gw
        df_test_predictions = df_test_predictions.rename(columns = {'total_points_predicted':'predicted_points',
                                                                    'total_points_actual':'actual_points'})
        df_test_predictions = df_test_predictions[['dataset','gameweek','position','predicted_points', 'actual_points']]

        # Concatenate & append
        df_gw_predictions = pd.concat([df_train_predictions, df_test_predictions])
        gw_predictions_array.append(df_gw_predictions)

    # Concatenate ALL predictions for the season into single DataFrame
    df_predictions = pd.concat(gw_predictions_array)
        
    # Calculate absolute error
    df_predictions['absolute_error'] = abs(df_predictions['predicted_points'] - df_predictions['actual_points'])
    df_predictions = df_predictions.drop(columns={'predicted_points','actual_points'}).reset_index(drop=True)

    # Groupby at two levels to get: a) Predictions for all positions by gameweek (i.e. overall level)
    df_predictions_overall = df_predictions.drop(columns='position').groupby(['dataset','gameweek']).agg(['mean']).reset_index()
    df_predictions_overall.insert(2, 'position', 'overall')

    # b) Predictions by gameweek and by position
    df_predictions_by_position = df_predictions.groupby(['dataset','gameweek','position']).agg(['mean']).reset_index()

    # Concatenate to get full grouped predictions
    df_predictions_grouped = pd.concat([df_predictions_overall, df_predictions_by_position])
    df_predictions_grouped.columns = df_predictions_grouped.columns.droplevel(1)
    df_predictions_grouped = df_predictions_grouped.rename(columns={'absolute_error':model_str.lower().replace(" ","_")+"_mae"})
    
    return df_predictions_grouped

time: 1.6 ms (started: 2021-08-28 15:22:01 +01:00)


Hence let's run the simulation using Random Forest.

In [None]:
# Run simulation
df_predictions_grouped = simulate_non_linear_model(model_str = 'Random Forest')

# Add to dictionary of predictions
predictions_dict['Random Forest'] = df_predictions_grouped
df_predictions_grouped.head(4)

### Analyse predictions: Random Forest

**Season-long Average Mean Absolute Errors**:

| Training/Test | Position<br>subdataset | Linear Regression<br>Season-long<br>Avg. MAE (2 dp) |Ridge Regression<br>Season-long<br>Avg. MAE (2 dp) |LASSO Regression<br>Season-long<br>Avg. MAE (2 dp) |Random Forest<br>Season-long<br>Avg. MAE (2 dp) |
| --- | --- | --- | --- | --- |--- |
| Test | Overall     | **1.88** | **1.88**|**1.87** |**1.83** |
| Test | Goalkeepers | 2.18     | 2.28    | 2.18    | 2.11    |
| Test | Defenders   | 2.04     | 1.99    | 1.99    | 1.98    |
| Test | Midfielders | 1.63     | 1.64    | 1.63    | 1.62    |
| Test | Forwards    | 2.14     | 2.18    | 2.14    | 1.95    |
  
* **RF bettered all previous linear models'** Season-long Avg. MAE across all positions.
* Forwards improved considerably using RF. Goalkeepers, however, still yield the worst estimates of all positions. 

<!-- 
* Using Linear Regression, we have found 'baseline' predictions as planned - recall ***Insight 2)***. Namely, we ended up with an overall (i.e. for all positions) season-long average MAE of **1.88 points.**
* On average, Midfielders were predicted the best while Goalkeepers were predicted the worse.  -->

**Other insights**:

* Interestingly, **RF made better predictions during the earlier gameweeks** across all positions excluding Goalkeepers. 
    * Across all positions combined (i.e. Overall), Random Forest made considerably better predictions than all previous models in GWs 3-10. 



<a id='XGB'></a>
## XGBoost
Since RF was successful, I'd like to try another non-linear model. After doing some reading, I've decided to try XGBoost (XGB) next. 

I'm hopeful we'll obtain the best predictions so far using XGB, as my reading suggests the algorithm is very powerful and one of the most popular within the community for large datasets, which is what we have. Methodology-wise, I'd like to replicate what we did for RF. I'm aware there is a variety of cool hyperparameter optimization techniques out there (many of which look powerful); however, for now I'd like to start simple. 

**Methodology**

1. On *all* of last season's data, use **RandomizedSearchCV** to narrow down the search space for the best possible parameters.
2. Using the best parameters found from RandomizedSearchCV, concentrate our search for the best parameters using **GridSearchCV**.

Then, using the best settings (just found):

3. Simulate model on last season's data
4. Analyse predictions

### Randomized search
First, let's create a parameter grid to sample from.

In [None]:
# Learning rate
eta = np.linspace(0.01, 0.4, num = 11)

# Minimum loss reduction required to make a further partition on a leaf node
gamma = np.logspace(-10, 5, 100)

# Maximum number of levels in tree
max_depth = [3, 4, 5, 6, 8, 10, 12, 15]

# Minimum weight needed in a child 
min_child_weight = [1, 3, 5, 7, 10, 12, 15]

# Subsample ratio of columns when constructing each tree
colsample_bytree = [0.3, 0.4, 0.5 , 0.7] 

# Create the random grid
random_grid = {
    'eta': eta,
    'gamma': gamma, 
    'max_depth': max_depth,
    'min_child_weight': min_child_weight, 
    'colsample_bytree': colsample_bytree}

# Define independent and dependent variables
X = df.loc[:, ~df.columns.isin(['total_points'])]
y = df['total_points']

# Use the random grid to search for best hyperparameters
# First create the base model to tune
xgbr = xgb.XGBRegressor()

# Random search of parameters, using 3-fold cross-validation (samples 500 of the 246,400 possible settings)
xgbr_random = RandomizedSearchCV(estimator = xgbr, param_distributions = random_grid, n_iter = 500, cv = 3, verbose=2, 
                                 random_state=42, n_jobs = -1)

# Fit the random search model
xgbr_random.fit(X, y)

From the n=500 random samples, what were the best parameters?

In [None]:
xgbr_random.best_params_
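To put the n=500 samples into context, it helps to count how many settings the random grid actually contains. The snippet below is a small sketch of that arithmetic: the list lengths are read off the `random_grid` definition above, while the variable names `grid_sizes`, `n_settings` and `coverage` are just illustrative.

```python
import math

# Number of values per hyperparameter, as defined in random_grid above
grid_sizes = {
    'eta': 11,               # np.linspace(0.01, 0.4, num=11)
    'gamma': 100,            # np.logspace(-10, 5, 100)
    'max_depth': 8,
    'min_child_weight': 7,
    'colsample_bytree': 4,
}

# Total number of distinct settings in the grid
n_settings = math.prod(grid_sizes.values())

# Random search settings used above
n_iter, cv = 500, 3
coverage = n_iter / n_settings

print(f"{n_settings:,} possible settings")
print(f"{n_iter} sampled ({coverage:.2%} of the grid), {n_iter * cv} model fits")
```

In other words, the random search only touches a small fraction of the grid, which is why it is followed by a more concentrated grid search.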

### Grid search
Using the best parameters from the above search, let's concentrate our search for the best XGB parameters using GridSearchCV.

In [10]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'eta': np.linspace(0.01, 0.1, num = 5),
    'gamma': np.logspace(-10, -8, 5), 
    'max_depth': [3, 4, 6],
    'min_child_weight': [10, 12, 15], 
    'colsample_bytree': [0.45, 0.5, 0.55]}

# Create a base model
xgbr = xgb.XGBRegressor()

# Instantiate the grid search model
xgbr_grid = GridSearchCV(estimator = xgbr, param_grid = param_grid, cv = 10, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
xgbr_grid.fit(X, y)

time: 830 µs (started: 2021-09-05 20:44:54 +01:00)


From the n=675 combinations in the grid, what were the best parameters?

In [None]:
xgbr_grid.best_params_

Let's save our estimate of the best possible XGB settings.

In [None]:
# Save best estimator for future reference
xgbr_best_estimator = xgbr_grid.best_estimator_

### Simulate model on last season's data: XGBoost 

In [None]:
# Run simulation
df_predictions_grouped = simulate_non_linear_model(model_str = 'XGBoost')

# Add to dictionary of predictions
predictions_dict['XGBoost'] = df_predictions_grouped
df_predictions_grouped.head(4)

### Analyse predictions: XGBoost

**Season-long Average Mean Absolute Errors**:

| Training/Test | Position<br>subdataset | Linear Regression<br>Season-long<br>Avg. MAE (2 dp) |Ridge Regression<br>Season-long<br>Avg. MAE (2 dp) |LASSO Regression<br>Season-long<br>Avg. MAE (2 dp) |Random Forest<br>Season-long<br>Avg. MAE (2 dp) |XGBoost<br>Season-long<br>Avg. MAE (2 dp) |
| --- | --- | --- | --- | --- |--- |--- |
| Test | Overall     | **1.88** | **1.88**|**1.87** |**1.83** |**1.82** |
| Test | Goalkeepers | 2.18     | 2.28    | 2.18    | 2.11    | 2.12    |
| Test | Defenders   | 2.04     | 1.99    | 1.99    | 1.98    | 1.97    |
| Test | Midfielders | 1.63     | 1.64    | 1.63    | 1.62    | 1.62    |
| Test | Forward     | 2.14     | 2.18    | 2.14    | 1.95    | 1.94    |
  
* **XGB is the best-performing model so far** but only performed *slightly* better than RF.  


**Other insights**:

* Although the season-long margin was small, looking at the Overall results (i.e. all positions) gameweek by gameweek shows that XGB's MAE was lower than RF's in 31 of the 36 Gameweeks. In other words, **XGB was consistently better than RF**.
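A gameweek-by-gameweek comparison like this boils down to a simple count. Here is a minimal sketch of it, assuming the per-gameweek Overall MAEs of two models are available as dictionaries keyed by gameweek. The numbers are made-up placeholders (not the real simulation results) and `count_wins` is a hypothetical helper.

```python
# Hypothetical per-gameweek Overall MAEs (illustrative values only, NOT real results)
rf_mae = {3: 1.95, 4: 2.01, 5: 1.80, 6: 1.85}
xgb_mae = {3: 1.93, 4: 2.02, 5: 1.78, 6: 1.84}

def count_wins(mae_a, mae_b):
    """Count the gameweeks in which model A has a strictly lower MAE than model B."""
    # Only compare gameweeks for which both models made predictions
    shared = sorted(set(mae_a) & set(mae_b))
    wins = sum(mae_a[gw] < mae_b[gw] for gw in shared)
    return wins, len(shared)

wins, total = count_wins(xgb_mae, rf_mae)
print(f"XGB beat RF in {wins} of {total} gameweeks")
```

A count like this complements the season-long average, since a small average gap can hide either a consistent edge or a few large swings.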



<a id='XGBV2'></a>
## XGBoost V2

XGB has performed the best out of all the models tested so far. Therefore, I'd like to try running the XGB algorithm again, but this time with a slightly different methodology. In particular, I'd like to see how well we can do if we **retune the model's hyperparameters EVERY Gameweek.** We'll denote this new model by XGBv2.

**Methodology**

1. Simulate model on last season's data
    * For every Gameweek *3, 4, ... , 38*:
        * Using the parameter grid found earlier, search for the best parameters using **HalvingGridSearchCV**.
2. Analyse predictions

*A couple of things to note...* 

1) **What are we actually doing here?:** In essence, all we're doing here is retuning the XGBoost model every Gameweek. Before, we found some "good" XGB settings which we used for the whole season. This time, I'd like to see whether there's any benefit in retuning the model every Gameweek, something which could easily be replicated in practice.

2) **What is HalvingGridSearchCV and why are we using it?:** * drumroll * HalvingGridSearchCV is just like GridSearchCV, only faster! It achieves this through an iterative procedure known as "successive halving". Basically, all the different settings (candidates) are first tested using a small amount of data. Only the best candidates go through to the next round, where they are tested using more data, and this repeats until a best candidate is found. The benefit of using HalvingGridSearchCV is that it is **much faster** than GridSearchCV, which is crucial since it makes this methodology practical.
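To make the "halving" idea more concrete, here is a rough pure-Python sketch of how the candidate/resource schedule can be worked out. It is not scikit-learn's code: `halving_schedule` is a hypothetical helper that mirrors my understanding of the search with `min_resources='exhaust'` and `aggressive_elimination=False` for a regressor. The values `factor=5`, 5 folds and 675 candidates are taken from the verbose output printed during the simulation.

```python
import math

def halving_schedule(n_candidates, max_resources, factor=5, n_splits=5):
    """Return [(n_candidates, n_resources), ...] for each halving iteration.

    Sketch only: assumes a regressor, min_resources='exhaust' and
    aggressive_elimination=False.
    """
    # Iterations needed to shrink the candidate pool down to at most `factor`
    n_required = 1
    while factor ** n_required <= n_candidates:
        n_required += 1

    # Smallest sensible budget (2 samples per fold), stretched so that the
    # final round uses as much of the available data as possible
    min_resources = max(2 * n_splits, max_resources // factor ** (n_required - 1))

    # Iterations that fit before the budget would exceed the available data
    n_possible = 1
    while factor ** n_possible <= max_resources // min_resources:
        n_possible += 1

    schedule = []
    for i in range(min(n_required, n_possible)):
        n_resources = min(min_resources * factor ** i, max_resources)
        schedule.append((n_candidates, n_resources))
        # Only the top 1/factor of candidates (rounded up) survive each round
        n_candidates = math.ceil(n_candidates / factor)
    return schedule

# Final Gameweek of the simulation: 13,907 training rows available
for n_cand, n_res in halving_schedule(675, max_resources=13907):
    print(f"{n_cand:>3} candidates on {n_res:>5} samples")
```

Early in the season there is too little data for all the rounds to run, so the search stops after fewer iterations with several candidates still in play. Later in the season the full schedule runs and only 2 candidates are compared on the largest sample.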


### Simulate model on last season's data: XGBoost V2

In [7]:
predictions_dict = {}

time: 392 µs (started: 2021-08-28 15:22:08 +01:00)


In [8]:
# Run simulation
df_predictions_grouped = simulate_non_linear_model(model_str = 'XGBoost V2')

# Add to dictionary of predictions
predictions_dict['XGBoost V2'] = df_predictions_grouped
df_predictions_grouped.head(4)

  0%|          | 0/36 [00:00<?, ?it/s]

n_iterations: 2
n_required_iterations: 5
n_possible_iterations: 2
min_resources_: 10
max_resources_: 157
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits


  3%|▎         | 1/36 [03:52<2:15:47, 232.78s/it]

n_iterations: 3
n_required_iterations: 5
n_possible_iterations: 3
min_resources_: 10
max_resources_: 443
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits


  6%|▌         | 2/36 [08:28<2:19:16, 245.77s/it]

n_iterations: 3
n_required_iterations: 5
n_possible_iterations: 3
min_resources_: 10
max_resources_: 761
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits


  8%|▊         | 3/36 [12:35<2:15:22, 246.12s/it]

n_iterations: 3
n_required_iterations: 5
n_possible_iterations: 3
min_resources_: 10
max_resources_: 1109
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits


 11%|█         | 4/36 [16:27<2:08:52, 241.65s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 1473
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 14%|█▍        | 5/36 [20:50<2:08:16, 248.28s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 1845
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 17%|█▋        | 6/36 [24:49<2:02:40, 245.34s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 2227
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 19%|█▉        | 7/36 [28:52<1:58:18, 244.79s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 2604
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 22%|██▏       | 8/36 [33:18<1:57:10, 251.09s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 2977
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 25%|██▌       | 9/36 [37:38<1:54:10, 253.72s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 3315
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 28%|██▊       | 10/36 [42:09<1:52:12, 258.94s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 3691
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 31%|███       | 11/36 [46:35<1:48:48, 261.14s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 4077
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 33%|███▎      | 12/36 [51:02<1:45:10, 262.94s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 4458
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 36%|███▌      | 13/36 [55:17<1:39:52, 260.52s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 4847
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 39%|███▉      | 14/36 [59:46<1:36:25, 262.96s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 5166
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 42%|████▏     | 15/36 [1:04:30<1:34:13, 269.19s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 5532
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 44%|████▍     | 16/36 [1:09:31<1:32:57, 278.89s/it]

n_iterations: 4
n_required_iterations: 5
n_possible_iterations: 4
min_resources_: 10
max_resources_: 5776
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 47%|████▋     | 17/36 [1:14:22<1:29:28, 282.53s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 10
max_resources_: 6385
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 6250
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 50%|█████     | 18/36 [1:19:56<1:29:20, 297.78s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 10
max_resources_: 6801
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 10
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 50
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 250
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1250
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 6250
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 53%|█████▎    | 19/36 [1:24:52<1:24:17, 297.47s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 11
max_resources_: 7213
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 11
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 55
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 275
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1375
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 6875
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 56%|█████▌    | 20/36 [1:29:46<1:19:03, 296.45s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 12
max_resources_: 7635
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 12
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 60
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 300
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1500
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 7500
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 58%|█████▊    | 21/36 [1:34:49<1:14:32, 298.19s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 12
max_resources_: 8056
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 12
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 60
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 300
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1500
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 7500
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 61%|██████    | 22/36 [1:39:53<1:09:59, 299.94s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 13
max_resources_: 8554
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 13
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 65
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 325
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1625
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 8125
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 64%|██████▍   | 23/36 [1:45:53<1:08:56, 318.19s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 14
max_resources_: 9013
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 14
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 70
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 350
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1750
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 8750
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 67%|██████▋   | 24/36 [1:52:05<1:06:49, 334.09s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 15
max_resources_: 9708
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 15
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 75
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 375
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 1875
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 9375
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 69%|██████▉   | 25/36 [1:57:36<1:01:06, 333.35s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 16
max_resources_: 10166
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 16
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 80
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 400
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2000
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 10000
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 72%|███████▏  | 26/36 [2:03:07<55:24, 332.47s/it]  

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 16
max_resources_: 10589
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 16
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 80
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 400
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2000
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 10000
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 75%|███████▌  | 27/36 [2:08:22<49:07, 327.46s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 17
max_resources_: 10748
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 17
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 85
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 425
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2125
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 10625
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 78%|███████▊  | 28/36 [2:13:43<43:24, 325.53s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 17
max_resources_: 11154
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 17
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 85
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 425
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2125
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 10625
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 81%|████████  | 29/36 [2:19:12<38:05, 326.46s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 18
max_resources_: 11552
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 18
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 90
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 450
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2250
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 11250
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 83%|████████▎ | 30/36 [2:24:34<32:30, 325.04s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 19
max_resources_: 11956
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 19
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 95
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 475
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2375
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 11875
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 86%|████████▌ | 31/36 [2:30:08<27:18, 327.63s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 19
max_resources_: 12269
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 19
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 95
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 475
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2375
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 11875
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 89%|████████▉ | 32/36 [2:35:27<21:41, 325.31s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 20
max_resources_: 12620
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 20
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 100
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 500
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2500
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 12500
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 92%|█████████▏| 33/36 [2:40:52<16:15, 325.05s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 21
max_resources_: 13192
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 21
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 105
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 525
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2625
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 13125
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 94%|█████████▍| 34/36 [2:47:14<11:24, 342.26s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 21
max_resources_: 13502
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 21
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 105
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 525
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2625
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 13125
Fitting 5 folds for each of 2 candidates, totalling 10 fits


 97%|█████████▋| 35/36 [2:54:04<06:02, 362.58s/it]

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 22
max_resources_: 13907
aggressive_elimination: False
factor: 5
----------
iter: 0
n_candidates: 675
n_resources: 22
Fitting 5 folds for each of 675 candidates, totalling 3375 fits
----------
iter: 1
n_candidates: 135
n_resources: 110
Fitting 5 folds for each of 135 candidates, totalling 675 fits
----------
iter: 2
n_candidates: 27
n_resources: 550
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 3
n_candidates: 6
n_resources: 2750
Fitting 5 folds for each of 6 candidates, totalling 30 fits
----------
iter: 4
n_candidates: 2
n_resources: 13750
Fitting 5 folds for each of 2 candidates, totalling 10 fits


100%|██████████| 36/36 [3:00:01<00:00, 300.05s/it]


Unnamed: 0,dataset,gameweek,position,xgboost_v2_mae
0,test,3,overall,1.972546
1,test,4,overall,2.000645
2,test,5,overall,1.778185
3,test,6,overall,1.837988


time: 3h 2s (started: 2021-08-28 15:22:20 +01:00)


### Analyse predictions: XGBoost V2

**Season-long Average Mean Absolute Errors**:

| Training/Test | Position<br>subdataset | Linear Regression<br>Season-long<br>Avg. MAE (2 dp) |Ridge Regression<br>Season-long<br>Avg. MAE (2 dp) |LASSO Regression<br>Season-long<br>Avg. MAE (2 dp) |Random Forest<br>Season-long<br>Avg. MAE (2 dp) |XGBoost<br>Season-long<br>Avg. MAE (2 dp) |XGBoost V2<br>Season-long<br>Avg. MAE (2 dp) |
| --- | --- | --- | --- | --- |--- |--- |--- |
| Test | Overall     | **1.88** | **1.88**|**1.87** |**1.83** |**1.82** |**1.74** |
| Test | Goalkeepers | 2.18     | 2.28    | 2.18    | 2.11    | 2.12    | 2.09    |
| Test | Defenders   | 2.04     | 1.99    | 1.99    | 1.98    | 1.97    | 1.89    |
| Test | Midfielders | 1.63     | 1.64    | 1.63    | 1.62    | 1.62    | 1.55    |
| Test | Forward     | 2.14     | 2.18    | 2.14    | 1.95    | 1.94    | 1.80    |
  
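* Changing the methodology for XGBoost ***considerably* improved results**: the Overall MAE fell from 1.82 to 1.74 and every position improved, Forwards the most (1.94 to 1.80) and Goalkeepers the least (2.12 to 2.09).
    * In particular, XGBv2 bettered our baseline LR model's Overall MAE by ~7.5% (1.88 to 1.74).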
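For reference, the improvement over the baseline is just the relative reduction in MAE. The sketch below recomputes it from the rounded Overall figures in the table above. `pct_improvement` is a hypothetical helper, and because the inputs are rounded to 2 dp the result can differ slightly from one computed on the unrounded MAEs.

```python
# Overall season-long average MAEs, copied from the table above (2 dp)
overall_mae = {
    'Linear Regression': 1.88,
    'Ridge Regression': 1.88,
    'LASSO Regression': 1.87,
    'Random Forest': 1.83,
    'XGBoost': 1.82,
    'XGBoost V2': 1.74,
}
baseline = overall_mae['Linear Regression']

def pct_improvement(mae, baseline_mae):
    """Relative reduction in MAE versus the baseline, in percent."""
    return 100 * (baseline_mae - mae) / baseline_mae

improvement = {name: pct_improvement(mae, baseline) for name, mae in overall_mae.items()}

for name, pct in improvement.items():
    print(f"{name:<18} {pct:5.2f}%")
```

With the rounded inputs, XGBv2 comes out at just under 7.5%, in line with the figure quoted above.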

**Other insights**:

* When viewing across all positions (i.e. Overall), XGBv2 **experienced some sharp error spikes** throughout the season (GWs: 11, 15, 19, 20, 26, 27 and 29).

## Disclaimer: Error spike investigation

Before concluding this notebook, I should note that I spent quite some time investigating the unusual error spikes observed in the predictions made by XGBv2. In chronological order, the material I looked at was:

* The actual **parameters used** every GW and the **feature importances** (computed with the default "gain" importance type).
* Then, after doing some more reading, I learned that consistency and accuracy are desirable properties for measures of feature importance (see [Interpretable Machine Learning with XGBoost](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27)). Hence, I tried using **SHAP values** instead as a method for feature attribution.

Unfortunately, putting this theory into practice proved rather difficult. In fact, the project came to something of a standstill while I was mulling over different methodologies that could work.

Therefore, I have decided to **proceed with the project** and not investigate further, because I will soon be returning to university, as mentioned earlier in the notebook. I'm aware that this is not 'best practice', since problems like this are akin to those frequently seen in industry; however, for my own personal development, I want to learn how ML algorithms are actually deployed in the real world, in addition to researching them, which is what we have done up until now.

# Next steps

1. ~XGBoost V2 section~
    * ~Research XGBoost interpretability/explainability methodology~
        * ~SHAP values~
    * ~Decide/document interpretability methodology~
    * ~Code interpretability section~

2. Linear Regression section
    * Complete analysis section
    * Replace old methodology

3. Redo documentation where necessary
4. Push changes
5. Move on to 4_deployment

## Export simulations (predictions) to Tableau
Here, we're just exporting the predictions made above into [Tableau](https://public.tableau.com/app/profile/samuel.harrison2532/viz/model_simulation_analysis/Dashboard) for further analysis.

In [None]:
from functools import reduce

# Merge the per-model MAE tables on their shared key columns
merge_keys = ['dataset', 'gameweek', 'position']
model_names = ['Linear Regression', 'Ridge Regression', 'LASSO Regression', 'Random Forest']
df_model_simulation_output = reduce(
    lambda left, right: pd.merge(left, right, on = merge_keys, how = 'left'),
    [predictions_dict[name] for name in model_names],
)
# Export for Tableau (the index is written as an unnamed first column)
df_model_simulation_output.to_csv("data/df_model_simulation_output_2020_21_season.csv")
df_model_simulation_output

### temp preds

In [15]:
# Reload the exported table, dropping the saved index and the stale XGBoost V2 column
df_previous_output = pd.read_csv("data/df_model_simulation_output_2020_21_season.csv").drop(columns=['Unnamed: 0', 'xgboost_v2_mae'])

# Attach the latest XGBoost V2 MAEs and re-export
df_model_simulation_output = pd.merge(df_previous_output, predictions_dict['XGBoost V2'],
                                      on = ['dataset', 'gameweek', 'position'],
                                      how = 'left')
df_model_simulation_output.to_csv("data/df_model_simulation_output_2020_21_season.csv")
df_model_simulation_output

Unnamed: 0,dataset,gameweek,position,linear_regression_mae,ridge_regression_mae,lasso_regression_mae,random_forest_mae,xgboost_mae,xgboost_v2_mae
0,test,3,overall,2.860811,2.765668,2.590824,2.202398,2.325225,1.972546
1,test,4,overall,2.198158,2.161335,2.135199,2.032455,2.063182,2.000645
2,test,5,overall,1.987693,2.038638,2.032880,1.864460,1.860265,1.778185
3,test,6,overall,1.922248,1.994142,1.967020,1.826913,1.831799,1.837988
4,test,7,overall,1.916588,1.970831,1.895659,1.831690,1.854625,1.755095
...,...,...,...,...,...,...,...,...,...
355,training,37,midfielder,1.613621,1.610566,1.610815,1.641023,1.609423,1.521125
356,training,38,defender,2.023379,1.996601,1.994431,2.000822,1.983949,1.996171
357,training,38,forward,1.946664,1.932540,1.941696,1.984778,1.907572,1.948799
358,training,38,goalkeeper,2.153495,2.161428,2.162056,2.079689,2.072203,2.109571


time: 51.9 ms (started: 2021-08-29 13:18:03 +01:00)
