# 1. Player value records
In this notebook we explore the available player value records. They are stored in a dataframe in pickle format.

In [1]:
import pandas as pd
values = pd.read_pickle('../data/value_records_for_ratings_based_predictions.pkl')

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model 

In [3]:
values.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9610 entries, 5665 to 160599
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   player_id    9610 non-null   int32         
 1   player_name  9610 non-null   object        
 2   player_role  9607 non-null   category      
 3   birth        9610 non-null   datetime64[ns]
 4   height       9610 non-null   float64       
 5   foot         9610 non-null   object        
 6   value        9610 non-null   float64       
 7   league       9610 non-null   object        
 8   value_at     9610 non-null   datetime64[ns]
 9   nat1         9610 non-null   object        
 10  nat2         9610 non-null   object        
dtypes: category(1), datetime64[ns](2), float64(2), int32(1), object(5)
memory usage: 797.8+ KB


Here is a little explanation of the data contained in the `values` data frame.
- `player_id` is the TransferMarkt id of the player
- `player_name`, `player_role`, `height`, `foot`, and `birth` are self-explanatory

In [4]:
values['birth'].describe()

count                    9610
unique                   2616
top       1994-05-27 00:00:00
freq                       22
first     1977-01-02 00:00:00
last      2001-08-16 00:00:00
Name: birth, dtype: object

- `nat1` and `nat2` are string-valued columns that store the player's first and second nationality. For players with a single nationality, `nat2` is `-`.

In [5]:
values.loc[:,['player_name','nat1','nat2']].head()

Unnamed: 0,player_name,nat1,nat2
5665,Marko Dmitrovic,Serbia,-
5667,Paulo Oliveira,Portugal,-
5671,José Ángel,Spain,-
5674,Gonzalo Escalante,Argentina,Italy
5676,Bebé,Portugal,CapeVerde


- `value` contains the market value of the player recorded at time `value_at`.
- `league` is the league in which the player was competing at time `value_at`. 

In [6]:
values.loc[:,['player_id','player_name','birth','value','value_at','league']].head()

Unnamed: 0,player_id,player_name,birth,value,value_at,league
5665,94308,Marko Dmitrovic,1992-01-24,3.6,2018-06-30,SPA1
5667,139336,Paulo Oliveira,1992-01-08,2.7,2018-06-30,SPA1
5671,87469,José Ángel,1989-09-05,2.25,2018-06-30,SPA1
5674,266795,Gonzalo Escalante,1993-03-27,2.25,2018-06-30,SPA1
5676,153427,Bebé,1990-07-12,0.9,2018-06-30,SPA1


We can, for example, extract all records for the players born in a certain time interval (e.g., born in 2000).

In [7]:
values.loc[(values['birth']> '2000-1-1') & (values['birth'] < '2000-12-31'),:].head()

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
18850,316760,Vincent Thill,Attacking Midfield,2000-02-04,1.7,left,0.54,FR1,2017-06-30,Luxembourg,-
19488,370846,Timothy Weah,Centre-Forward,2000-02-22,1.85,right,0.9,FR1,2018-06-30,UnitedStates,France
21177,464990,Arton Zekaj,Defensive Midfield,2000-04-16,1.87,right,0.045,FR1,2018-06-30,Kosovo,Serbia
23745,463665,Oumar Solet,Centre-Back,2000-02-07,1.92,right,0.9,FR1,2018-06-30,France,CentralAfricanRepublic
23758,418659,Amine Gouiri,Centre-Forward,2000-02-16,1.8,right,4.05,FR1,2018-06-30,France,Algeria
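
Note that the strict comparisons in the cell above leave out players born exactly on 1 January or 31 December. A minimal sketch of a year-based filter follows, assuming `birth` is a datetime column as reported by `values.info()`. The small frame `demo` is made up for illustration and is not part of the data set.

```python
import pandas as pd

# Hypothetical toy data standing in for `values`
demo = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "birth": pd.to_datetime(["2000-01-01", "2000-06-15", "2000-12-31", "1999-12-31"]),
})

# Strict date bounds miss the first and last day of the year
strict = demo.loc[(demo["birth"] > "2000-1-1") & (demo["birth"] < "2000-12-31")]

# Comparing the year component keeps every birthday in 2000
by_year = demo.loc[demo["birth"].dt.year == 2000]

print(list(strict["player_name"]))   # ['B']
print(list(by_year["player_name"]))  # ['A', 'B', 'C']
```

Whether this changes anything on the real data depends on whether any player was born on one of the two boundary dates.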


We can further narrow down the search and select the records for the players born in a certain year (e.g., 1990) when they were 25 years old, i.e., in 2015.

In [8]:
values.loc[(values['birth']> '1990-1-1') & (values['birth'] < '1990-12-31') & (values['value_at'] == '2015-06-30') ,:]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
9768,133794,Edgar Méndez,Right Winger,1990-01-02,1.87,right,2.25,SPA1,2015-06-30,Spain,-
9779,73092,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,2.25,SPA1,2015-06-30,Slovenia,-
10313,203043,Pedro Bigas,Centre-Back,1990-05-15,1.81,left,0.90,SPA1,2015-06-30,Spain,-
10321,223068,Tana,Attacking Midfield,1990-09-20,1.69,left,0.18,SPA1,2015-06-30,Spain,-
13754,59344,Asier Illarramendi,Defensive Midfield,1990-03-08,1.79,right,13.50,SPA1,2015-06-30,Spain,-
...,...,...,...,...,...,...,...,...,...,...,...
158151,45184,Grzegorz Krychowiak,Defensive Midfield,1990-01-29,1.87,right,10.80,SPA1,2015-06-30,Poland,-
158621,183647,Álvaro González,Centre-Back,1990-01-08,1.82,right,2.70,SPA1,2015-06-30,Spain,-
159548,58426,Douglas,Right-Back,1990-08-06,1.72,right,1.80,SPA1,2015-06-30,Brazil,-
160482,58884,Nacho,Centre-Back,1990-01-18,1.80,right,5.40,SPA1,2015-06-30,Spain,-


# 2. Filtering out missing players
For a number of players in the value records there are no ratings statistics. This might be because they have not made an appearance in official matches. We filter them out.

In [9]:
missing = pd.read_csv('../data/missing.csv')
len(missing)

41

In [10]:
missing.head()

Unnamed: 0,ID,Name
0,250038,Mamadou Sissako
1,361082,Jasper Schendelaar
2,215094,Nick Olij
3,387241,Raman ten Hove
4,153751,Nick Hengelman


In [11]:
len(values)

9610

In [12]:
values = values.loc[~values['player_id'].isin(missing['ID']), :]
len(values) 

9556

This number is smaller than 9610 - 41 = 9569 because some players appear in several records, i.e., they have a value recorded for more than one year. In total, 9610 - 9556 = 54 records were removed.
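
The difference between the number of removed records and the number of removed players can be checked directly. The sketch below uses made-up data (`records` and `missing_ids` are hypothetical stand-ins for `values` and `missing['ID']`) to show how one player with two value records accounts for two removed rows.

```python
import pandas as pd

# Hypothetical toy data: player 1 has two value records
records = pd.DataFrame({"player_id": [1, 1, 2, 3], "value": [0.5, 0.9, 2.0, 1.0]})
missing_ids = pd.Series([1, 3], name="ID")

is_missing = records["player_id"].isin(missing_ids)
kept = records.loc[~is_missing]

rows_removed = len(records) - len(kept)                            # records dropped
players_removed = records.loc[is_missing, "player_id"].nunique()   # distinct players dropped
print(rows_removed, players_removed)  # 3 2
```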

# 3. Ranking
We do not predict market values directly, that is, the dollar/euro value of the player, because this value may be influenced by external factors such as the general economic situation. Thus, we cannot compare, e.g., a dollar value from 2010 with a dollar value from 2020, even if discounted. Instead, we predict the *ranking* of the player in a value table listing all players born in the same year and having the same age.
In fact, we cannot simply rank all players at the age of, say, 25: the value of a 25-year-old player in 2010 cannot be compared directly with the value of a 25-year-old player in 2015. Rather, we should divide players based on their birth year and age. As an example, we could rank all the 25-year-old players born in 1990.
This is done as follows:
- We divide the players according to their birth year. Thus we will have a set of players born in 1990, a set of players born in 1991, and so on.
- For each birth year, we divide the players based on age. Thus, for the players born in, say, 1990, we separate the records with `value_at` in year 2010 (age 20), in year 2011 (age 21), and so on.
- We rank the players in each birth-age group in non-increasing order of market value.
- Since the number of players changes from one year to another, we calculate the ranking as the fractional position in the table, i.e., $i/L$, where $i=1,\ldots,L$ is the position of the player in the table and $L$ is the length of the table, i.e., the number of players. The most valuable player thus has rank $1/L$ and the least valuable has rank 1.
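
The steps above can be sketched on a toy table. This is only an illustration under assumptions: the frame `toy` and its `birth_year` and `age` columns are made up, and the sketch uses a single `groupby` instead of one data frame per group. Tied values share the average of their positions.

```python
import pandas as pd

# Hypothetical toy records: two (birth year, age) groups
toy = pd.DataFrame({
    "birth_year": [1990, 1990, 1990, 1990, 1991, 1991],
    "age":        [25,   25,   25,   25,   25,   25],
    "value":      [13.5, 2.25, 2.25, 0.18, 63.0, 0.36],
})

# Fractional position i/L within each (birth year, age) table,
# most valuable player first; ties share the average position
toy["rank"] = (
    toy.groupby(["birth_year", "age"])["value"]
       .rank(pct=True, ascending=False, method="average")
)
print(toy["rank"].tolist())  # [0.25, 0.625, 0.625, 1.0, 0.5, 1.0]
```

In the 1990 group, the two players valued 2.25 occupy positions 2 and 3 out of 4, so both receive (2 + 3) / 2 / 4 = 0.625.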

## 3.1. Dividing players based on their birth year and age
Let us first identify the range of birth years.


In [13]:
values['birth'].describe()

count                    9556
unique                   2595
top       1994-05-27 00:00:00
freq                       22
first     1977-01-02 00:00:00
last      2001-08-16 00:00:00
Name: birth, dtype: object

The birth years range from 1977 to 2001. Note that the `range` call below excludes its upper bound, so the list stops at 2000.

In [14]:
years = list(range(1977, 2001))
years

[1977,
 1978,
 1979,
 1980,
 1981,
 1982,
 1983,
 1984,
 1985,
 1986,
 1987,
 1988,
 1989,
 1990,
 1991,
 1992,
 1993,
 1994,
 1995,
 1996,
 1997,
 1998,
 1999,
 2000]

We are likely to be interested in predicting market values for a limited number of ages. Let us say, from 21 to 27 (the `range` call below excludes its upper bound, 28), i.e., not too far after the peak value.

In [15]:
ages = list(range(21, 28))
ages

[21, 22, 23, 24, 25, 26, 27]

In [16]:
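# Map each (birth year, age) pair to the matching value records
players_groups = {}
for y in years:
    for a in ages:
        df = values.loc[(values['birth']> str(y)+'-1-1')
                        & (values['birth'] < str(y)+'-12-31')
                        & ( (values['value_at'].dt.year - values['birth'].dt.year) == a),:]
        # Store an independent copy so that columns can be added later
        # without triggering pandas' SettingWithCopyWarning
        players_groups[(y,a)] = df.copy()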

In [17]:
players_groups[(1990,25)]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
9768,133794,Edgar Méndez,Right Winger,1990-01-02,1.87,right,2.25,SPA1,2015-06-30,Spain,-
9779,73092,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,2.25,SPA1,2015-06-30,Slovenia,-
10313,203043,Pedro Bigas,Centre-Back,1990-05-15,1.81,left,0.90,SPA1,2015-06-30,Spain,-
10321,223068,Tana,Attacking Midfield,1990-09-20,1.69,left,0.18,SPA1,2015-06-30,Spain,-
13754,59344,Asier Illarramendi,Defensive Midfield,1990-03-08,1.79,right,13.50,SPA1,2015-06-30,Spain,-
...,...,...,...,...,...,...,...,...,...,...,...
158151,45184,Grzegorz Krychowiak,Defensive Midfield,1990-01-29,1.87,right,10.80,SPA1,2015-06-30,Poland,-
158621,183647,Álvaro González,Centre-Back,1990-01-08,1.82,right,2.70,SPA1,2015-06-30,Spain,-
159548,58426,Douglas,Right-Back,1990-08-06,1.72,right,1.80,SPA1,2015-06-30,Brazil,-
160482,58884,Nacho,Centre-Back,1990-01-18,1.80,right,5.40,SPA1,2015-06-30,Spain,-


In [18]:
players_groups[(1991,25)]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
5720,127108,Florian Lejeune,Centre-Back,1991-05-20,1.90,right,0.72,SPA1,2016-06-30,France,-
5724,238868,Rubén Peña,Right-Back,1991-07-18,1.70,right,0.36,SPA1,2016-06-30,Spain,-
9736,129893,Molla Wagué,Centre-Back,1991-02-21,1.91,right,1.80,SPA1,2016-06-30,Mali,France
10254,96540,Alfredo Ortuño,Centre-Forward,1991-01-21,1.82,both,0.90,SPA1,2016-06-30,Spain,-
10742,119905,Hugo Mallo,Right-Back,1991-06-22,1.74,right,5.40,SPA1,2016-06-30,Spain,-
...,...,...,...,...,...,...,...,...,...,...,...
159764,131505,Rodrigo,Centre-Forward,1991-03-06,1.82,left,10.80,SPA1,2016-06-30,Spain,Brazil
159774,68645,Martín Montoya,Right-Back,1991-04-14,1.75,right,2.70,SPA1,2016-06-30,Spain,-
160234,145707,Danilo,Right-Back,1991-07-15,1.84,right,16.20,SPA1,2016-06-30,Brazil,-
160238,88103,James Rodríguez,Attacking Midfield,1991-07-12,1.81,left,63.00,SPA1,2016-06-30,Colombia,Spain


In [19]:
players_groups[(1987,25)]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
9797,63689,Fran Rico,Central Midfield,1987-08-03,1.78,right,0.9,SPA1,2012-06-30,Spain,-
13775,37501,Imanol Agirretxe,Centre-Forward,1987-02-24,1.87,right,2.7,SPA1,2012-06-30,Spain,-
18604,56802,Vincent Muratori,Left-Back,1987-08-03,1.79,left,0.9,FR1,2012-06-30,France,-
43766,56836,Aaron Meijers,Left-Back,1987-10-28,1.78,left,0.54,NE1,2012-06-30,Netherlands,-
47110,55972,Guram Kashia,Centre-Back,1987-07-04,1.85,right,3.6,NE1,2012-06-30,Georgia,-
49423,31880,Jordens Peters,Centre-Back,1987-05-03,1.83,right,0.225,NE1,2012-06-30,Netherlands,-
71366,33639,Kevin Mirallas,Right Winger,1987-10-05,1.82,right,5.85,UK1,2012-06-30,Belgium,-
71976,54170,Ramires,Central Midfield,1987-03-24,1.79,right,27.0,UK1,2012-06-30,Brazil,-
77216,40204,Joe Hart,Goalkeeper,1987-04-19,1.96,right,21.15,UK1,2012-06-30,England,-
77231,18935,Samir Nasri,Attacking Midfield,1987-06-26,1.77,right,22.5,UK1,2012-06-30,France,Algeria


## 3.2. We assign a rank to each record

In [20]:
for y in years:
    for a in ages:
        players_groups[(y,a)]['rank'] = players_groups[(y,a)]['value'].rank(pct=True,ascending=False,method='average',na_option='bottom')



In [21]:
players_groups[(1990,26)]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2,rank
5729,153427,Bebé,Left Winger,1990-07-12,1.90,right,1.80,SPA1,2016-06-30,Portugal,CapeVerde,0.678808
8221,42920,Fran Mérida,Central Midfield,1990-03-04,1.74,left,0.54,SPA1,2016-06-30,Spain,-,0.933775
9721,73092,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,2.70,SPA1,2016-06-30,Slovenia,-,0.579470
9737,93759,Matthieu Saunier,Centre-Back,1990-02-07,1.81,right,0.90,SPA1,2016-06-30,France,-,0.847682
10232,223068,Tana,Attacking Midfield,1990-09-20,1.69,left,1.35,SPA1,2016-06-30,Spain,-,0.771523
...,...,...,...,...,...,...,...,...,...,...,...,...
158377,183647,Álvaro González,Centre-Back,1990-01-08,1.82,right,3.60,SPA1,2016-06-30,Spain,-,0.490066
159298,58426,Douglas,Right-Back,1990-08-06,1.72,right,0.90,SPA1,2016-06-30,Brazil,-,0.847682
159748,227805,Jaume Doménech,Goalkeeper,1990-11-05,1.85,right,3.60,SPA1,2016-06-30,Spain,-,0.490066
160231,58884,Nacho,Centre-Back,1990-01-18,1.80,right,4.50,SPA1,2016-06-30,Spain,-,0.427152


# 4. Ratings

In [22]:
rating_columns = ['player_id_lm','player_id','player_name','birth_year',
                  'date_rating_18','rating_18','peak_rating_18','minutes_played_18',
                  'date_rating_19','rating_19','peak_rating_19','minutes_played_19',
                  'date_rating_20','rating_20','peak_rating_20','minutes_played_20']
ratings = pd.read_csv('../data/ratings.csv', names=rating_columns)

In [23]:
ratings.head()

Unnamed: 0,player_id_lm,player_id,player_name,birth_year,date_rating_18,rating_18,peak_rating_18,minutes_played_18,date_rating_19,rating_19,peak_rating_19,minutes_played_19,date_rating_20,rating_20,peak_rating_20,minutes_played_20
0,39151,94308,Marko Dmitrovic,1992,30.06.2010,,,0,30.06.2011,-0.056574,0.0528,90,30.06.2012,-0.05577,0.029722,180
1,11673,139336,Paulo Oliveira,1992,30.06.2010,,,0,30.06.2011,,,0,30.06.2012,,,0
2,6213,87469,Jose Angel,1989,30.06.2007,,,0,30.06.2008,,,0,30.06.2009,-0.023174,0.05129,992
3,52516,266795,Gonzalo Escalante,1993,30.06.2011,,,0,30.06.2012,,,0,30.06.2013,-0.07295,0.013075,277
4,3184,153427,Bebe,1990,30.06.2008,,,0,30.06.2009,,,0,30.06.2010,,,0


Let us select only the ratings at the age of 20.

In [24]:
ratings_20 = ratings.loc[:,['player_id','player_name','birth_year','date_rating_20','rating_20','peak_rating_20','minutes_played_20']]
ratings_20.head()

Unnamed: 0,player_id,player_name,birth_year,date_rating_20,rating_20,peak_rating_20,minutes_played_20
0,94308,Marko Dmitrovic,1992,30.06.2012,-0.05577,0.029722,180
1,139336,Paulo Oliveira,1992,30.06.2012,,,0
2,87469,Jose Angel,1989,30.06.2009,-0.023174,0.05129,992
3,266795,Gonzalo Escalante,1993,30.06.2013,-0.07295,0.013075,277
4,153427,Bebe,1990,30.06.2010,,,0


From these let us filter out missing values.

In [25]:
ratings_20 = ratings_20.dropna(subset=['rating_20', 'peak_rating_20'])
ratings_20

Unnamed: 0,player_id,player_name,birth_year,date_rating_20,rating_20,peak_rating_20,minutes_played_20
0,94308,Marko Dmitrovic,1992,30.06.2012,-0.055770,0.029722,180
2,87469,Jose Angel,1989,30.06.2009,-0.023174,0.051290,992
3,266795,Gonzalo Escalante,1993,30.06.2013,-0.072950,0.013075,277
5,142027,Pablo Hervias,1993,30.06.2013,-0.096118,-0.011392,1732
8,138935,Sergio Alvarez,1992,30.06.2012,-0.077713,0.007705,2003
...,...,...,...,...,...,...,...
1750,44501,Marcelo,1988,30.06.2008,0.105363,0.183862,2844
1751,138927,Daniel Carvajal Ramos,1992,30.06.2012,0.041555,0.126091,5650
1753,39381,Gareth Frank Bale,1989,30.06.2009,0.024184,0.090404,3629
1754,18922,Karim Benzema,1987,30.06.2007,0.077195,0.184693,199


In [26]:
ratings_20['birth_year'].describe()

count     918.000000
mean     1990.921569
std         1.747518
min      1987.000000
25%      1990.000000
50%      1991.000000
75%      1992.000000
max      1993.000000
Name: birth_year, dtype: float64

# 5. Regression

## 5.1. Data collection

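As we have seen, the birth years in the ratings range from 1987 to 1993. Note that the `range` call below excludes its upper bound, so players born in 1993 are left out of what follows; `range(1987, 1994)` would include them.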

In [27]:
years = [i for i in range(1987,1993,1)]
print(years)

[1987, 1988, 1989, 1990, 1991, 1992]


Let us retrieve the appropriate rankings.

In [28]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
rankings = pd.concat([players_groups[(y,25)] for y in years])
print(rankings)

        player_id        player_name         player_role      birth  height  \
9797        63689          Fran Rico    Central Midfield 1987-08-03    1.78   
13775       37501   Imanol Agirretxe      Centre-Forward 1987-02-24    1.87   
18604       56802   Vincent Muratori           Left-Back 1987-08-03    1.79   
43766       56836      Aaron Meijers           Left-Back 1987-10-28    1.78   
47110       55972       Guram Kashia         Centre-Back 1987-07-04    1.85   
...           ...                ...                 ...        ...     ...   
159350      85370      Sergi Roberto          Right-Back 1992-02-07    1.78   
159357      80444  Philippe Coutinho         Left Winger 1992-06-12    1.72   
159789     131102     Jeison Murillo         Centre-Back 1992-05-27    1.82   
160282     138927    Daniel Carvajal          Right-Back 1992-01-11    1.73   
160287      85288               Isco  Attacking Midfield 1992-04-21    1.76   

         foot  value league   value_at         nat1

We have 918 ratings for the age of 20 and 616 rankings. We should now find the players for which we have both a rating and a ranking. First, let us select only the columns that are necessary for now or that we want to use in the regression.

In [29]:
rankings = rankings.loc[:,['player_id','player_name','player_role','birth','height','foot','rank']]

In [30]:
data = ratings_20.merge(rankings,on='player_id', how='inner')

In [31]:
data

Unnamed: 0,player_id,player_name_x,birth_year,date_rating_20,rating_20,peak_rating_20,minutes_played_20,player_name_y,player_role,birth,height,foot,rank
0,94308,Marko Dmitrovic,1992,30.06.2012,-0.055770,0.029722,180,Marko Dmitrovic,Goalkeeper,1992-01-24,1.94,left,0.827830
1,127108,Florian Lejeune,1991,30.06.2011,-0.067635,0.018432,3689,Florian Lejeune,Centre-Back,1991-05-20,1.90,right,0.902516
2,129679,Cheick Doukoure,1992,30.06.2012,-0.016803,0.087269,153,Cheick Doukouré,Defensive Midfield,1992-09-11,1.80,right,0.780660
3,73092,Rene Krhin,1990,30.06.2010,0.056437,0.130754,74,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,0.649123
4,133794,Edgar Mendez,1991,30.06.2011,-0.099999,-0.015180,1163,Edgar Méndez,Right Winger,1990-01-02,1.87,right,0.649123
...,...,...,...,...,...,...,...,...,...,...,...,...,...
373,44501,Marcelo,1988,30.06.2008,0.105363,0.183862,2844,Marcelo,Left-Back,1988-05-12,1.74,left,0.179487
374,138927,Daniel Carvajal Ramos,1992,30.06.2012,0.041555,0.126091,5650,Daniel Carvajal,Right-Back,1992-01-11,1.73,right,0.070755
375,39381,Gareth Frank Bale,1989,30.06.2009,0.024184,0.090404,3629,Gareth Bale,Right Winger,1989-07-16,1.85,left,0.015152
376,18922,Karim Benzema,1987,30.06.2007,0.077195,0.184693,199,Karim Benzema,Centre-Forward,1987-12-19,1.85,both,0.115385


It looks like we have ratings and rankings for 378 players. This is the data we can use to do regression. 
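
Before relying on the merged table, it can be useful to confirm that no player is duplicated on either side and to see how many rows matched. The following is a sketch on made-up tables (`toy_ratings` and `toy_rankings` are hypothetical stand-ins for `ratings_20` and `rankings`), using the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical toy tables standing in for `ratings_20` and `rankings`
toy_ratings = pd.DataFrame({"player_id": [1, 2, 3], "rating_20": [0.05, -0.02, 0.10]})
toy_rankings = pd.DataFrame({"player_id": [2, 3, 4], "rank": [0.40, 0.10, 0.90]})

# validate raises an error if a player_id is duplicated on either side;
# an outer merge with indicator=True records where each row came from
audit = toy_ratings.merge(
    toy_rankings, on="player_id", how="outer",
    validate="one_to_one", indicator=True,
)
matched = audit.loc[audit["_merge"] == "both"].drop(columns="_merge")

print(audit["_merge"].value_counts())
print(len(matched))  # 2
```

The rows marked `both` are exactly those an inner merge would return; the `left_only` and `right_only` counts show how many ratings and rankings had no counterpart.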

## 5.2. Random Forests version 1

In this version we use the following regressors: rating, peak_rating, and minutes played at the age of 20.
We start by importing the necessary packages.

In [32]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

### 5.2.1 Data selection

In [33]:
Y = data['rank']
Y

0      0.827830
1      0.902516
2      0.780660
3      0.649123
4      0.649123
         ...   
373    0.179487
374    0.070755
375    0.015152
376    0.115385
377    0.243590
Name: rank, Length: 378, dtype: float64

In [34]:
X1 = data.loc[:,['rating_20','peak_rating_20','minutes_played_20']]
X1

Unnamed: 0,rating_20,peak_rating_20,minutes_played_20
0,-0.055770,0.029722,180
1,-0.067635,0.018432,3689
2,-0.016803,0.087269,153
3,0.056437,0.130754,74
4,-0.099999,-0.015180,1163
...,...,...,...
373,0.105363,0.183862,2844
374,0.041555,0.126091,5650
375,0.024184,0.090404,3629
376,0.077195,0.184693,199


In [35]:
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1,Y)
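
Since `train_test_split` shuffles the data randomly, each run of the cell above produces a different partition, and the scores reported below change accordingly. A minimal sketch of a reproducible split on made-up arrays follows; the seed value 42 and the names `X_demo` and `y_demo` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed yields the same partition on every call
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

print(np.array_equal(a_test, b_test))  # True
```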

### 5.2.2 Grid Search for optimal parameters setup

We specify the parameters of the RF regressor that we want to optimize using Grid Search cross-validation.

In [36]:
RandomForestRegressor().get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [37]:
parameters = [
    {"max_depth": [2, 4, 6],'min_samples_split': [2,4,6], 'min_samples_leaf':[1,2,3]}
]

We perform the grid search.

In [38]:
gs = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_root_mean_squared_error')
gs.fit(X1_train, Y1_train)

GridSearchCV(estimator=RandomForestRegressor(),
             param_grid=[{'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 2, 3],
                          'min_samples_split': [2, 4, 6]}],
             scoring='neg_root_mean_squared_error')

In [39]:
gs.best_params_

{'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 6}

In [60]:
gs.cv_results_["mean_test_score"]

array([-0.24103447, -0.24143007, -0.24126869, -0.24095372, -0.24129021,
       -0.24025318, -0.24128116, -0.24177759, -0.24108022, -0.24575578,
       -0.24504101, -0.24521541, -0.2461247 , -0.24664863, -0.24492954,
       -0.24660689, -0.24441189, -0.2453877 , -0.24991453, -0.24967247,
       -0.24951562, -0.25079484, -0.24857798, -0.2503685 , -0.24878825,
       -0.25068497, -0.24949658])
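
The scores above are negated RMSE values, one per parameter combination, so the best combination is the one closest to zero. A sketch of how the scores could be paired with their parameters follows. It runs on synthetic data with a smaller grid, so all names and numbers in it are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, only to make the sketch runnable
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = rng.uniform(size=60)

grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    grid, scoring="neg_root_mean_squared_error", cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination; flip the sign to obtain the RMSE
table = pd.DataFrame(search.cv_results_["params"])
table["rmse"] = -search.cv_results_["mean_test_score"]
table = table.sort_values("rmse").reset_index(drop=True)
print(table)
```

The first row of `table` then corresponds to `search.best_params_`, barring exact ties.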

## 5.3 Random Forests version 2

In this version we add the role as a predictor and repeat the entire experiment.

### 5.3.1 We select the data

In [45]:
X2 = data.loc[:,['rating_20','peak_rating_20','minutes_played_20','player_role']]
X2

Unnamed: 0,rating_20,peak_rating_20,minutes_played_20,player_role
0,-0.055770,0.029722,180,Goalkeeper
1,-0.067635,0.018432,3689,Centre-Back
2,-0.016803,0.087269,153,Defensive Midfield
3,0.056437,0.130754,74,Defensive Midfield
4,-0.099999,-0.015180,1163,Right Winger
...,...,...,...,...
373,0.105363,0.183862,2844,Left-Back
374,0.041555,0.126091,5650,Right-Back
375,0.024184,0.090404,3629,Right Winger
376,0.077195,0.184693,199,Centre-Forward


We use one-hot encoding for the role.

In [46]:
one_hot = pd.get_dummies(X2['player_role'],drop_first=True)
one_hot

Unnamed: 0,Central Midfield,Centre-Back,Centre-Forward,Defender,Defensive Midfield,Forward,Goalkeeper,Left Midfield,Left Winger,Left-Back,Midfielder,Right Midfield,Right Winger,Right-Back,Second Striker,Sweeper
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
374,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
375,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
376,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
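
With `drop_first=True`, one role gets no column of its own and is represented by a row of zeros. In the output above, `Attacking Midfield` has no column, so it presumably acts as this baseline. The sketch below shows the behaviour on a made-up series; `dtype=int` is passed only so that the result prints as 0/1.

```python
import pandas as pd

# Hypothetical roles; with drop_first=True the first category in sorted
# order ("Attacking Midfield") becomes the all-zero baseline
roles = pd.Series(
    ["Goalkeeper", "Attacking Midfield", "Centre-Back", "Goalkeeper"],
    name="player_role",
)
encoded = pd.get_dummies(roles, drop_first=True, dtype=int)
print(encoded)
```

The second row, an attacking midfielder, is encoded as zeros in both remaining columns.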


In [47]:
X2 = X2.drop('player_role',axis = 1)
X2 = X2.join(one_hot)
X2.head()

Unnamed: 0,rating_20,peak_rating_20,minutes_played_20,Central Midfield,Centre-Back,Centre-Forward,Defender,Defensive Midfield,Forward,Goalkeeper,Left Midfield,Left Winger,Left-Back,Midfielder,Right Midfield,Right Winger,Right-Back,Second Striker,Sweeper
0,-0.05577,0.029722,180,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,-0.067635,0.018432,3689,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,-0.016803,0.087269,153,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,0.056437,0.130754,74,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,-0.099999,-0.01518,1163,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [49]:
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2,Y)

### 5.3.2. We use grid search to find the best parameters

In [50]:
parameters = [
    {"max_depth": [2, 4, 6],'min_samples_split': [2,4,6], 'min_samples_leaf':[1,2,3]}
]
gs = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_root_mean_squared_error')
gs.fit(X2_train, Y2_train)
gs.best_params_

{'max_depth': 2, 'min_samples_leaf': 3, 'min_samples_split': 6}

In [51]:
gs.cv_results_["mean_test_score"]

array([-0.24453207, -0.24379515, -0.24348085, -0.24368185, -0.24401977,
       -0.24393228, -0.24357465, -0.24395198, -0.24325987, -0.24588713,
       -0.24712428, -0.24654045, -0.24647207, -0.24457985, -0.24800728,
       -0.24556454, -0.24584019, -0.24724134, -0.25015727, -0.24931258,
       -0.25024396, -0.24968549, -0.24839327, -0.24807069, -0.24740121,
       -0.24899973, -0.24749332])

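The scores are negated RMSE values, so the best one is the one closest to zero. The best cross-validated RMSE (about 0.243) is comparable to that of the RF version without the role as a predictor (about 0.240).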

## 5.4. Random Forests version 3

In this version we standardize the values (except the one-hot encoding of the role).

### 5.4.1. We gather the data

In [53]:
from sklearn.preprocessing import StandardScaler
# Work on a copy so that X2 keeps its unscaled values
X3 = X2.copy()
X3[['rating_20', 'peak_rating_20','minutes_played_20']] = StandardScaler().fit_transform(X3[['rating_20', 'peak_rating_20','minutes_played_20']])
X3.head()

Unnamed: 0,rating_20,peak_rating_20,minutes_played_20,Central Midfield,Centre-Back,Centre-Forward,Defender,Defensive Midfield,Forward,Goalkeeper,Left Midfield,Left Winger,Left-Back,Midfielder,Right Midfield,Right Winger,Right-Back,Second Striker,Sweeper
0,-0.785149,-0.8445,-0.886444,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,-1.013924,-1.06949,1.065738,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,-0.033824,0.302281,-0.901465,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,1.378319,1.16884,-0.945416,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,-1.637936,-1.739289,-0.339566,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
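
The cell above fits the scaler on the full data set before the train/test split, so the test rows contribute to the scaling statistics. One possible alternative, sketched here on synthetic data, wraps the scaler and the forest in a scikit-learn `Pipeline` so that the scaling is learned from the training rows only. The frame `X_demo`, its values, and the choice of columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the design matrix (values are made up)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "rating_20": rng.normal(0, 0.05, 80),
    "minutes_played_20": rng.integers(0, 5000, 80),
    "Goalkeeper": rng.integers(0, 2, 80),
})
y_demo = rng.uniform(size=80)

numeric = ["rating_20", "minutes_played_20"]
model = Pipeline([
    # Scale only the numeric columns; pass the dummy columns through
    ("scale", ColumnTransformer([("num", StandardScaler(), numeric)],
                                remainder="passthrough")),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model.fit(X_tr, y_tr)        # scaler statistics come from X_tr only
pred = model.predict(X_te)   # X_te is scaled with the training statistics
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the parameter names need the step prefix, e.g. `rf__max_depth`.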


### 5.4.2. Grid Search for best parameters

In [54]:
X3_train, X3_test, Y3_train, Y3_test = train_test_split(X3,Y)

In [55]:
parameters = [
    {"max_depth": [2, 4, 6],'min_samples_split': [2,4,6], 'min_samples_leaf':[1,2,3]}
]
gs = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_root_mean_squared_error')
gs.fit(X3_train, Y3_train)
gs.best_params_

{'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 6}

In [56]:
gs.cv_results_["mean_test_score"]

array([-0.24279985, -0.24403598, -0.24302354, -0.24312902, -0.24257621,
       -0.24332559, -0.24309642, -0.24276567, -0.24223485, -0.24114951,
       -0.24034322, -0.24024455, -0.24135801, -0.24128191, -0.24087202,
       -0.2424983 , -0.24201104, -0.24278718, -0.24289585, -0.24299246,
       -0.24112028, -0.24164554, -0.2409278 , -0.24133884, -0.24244625,
       -0.2418267 , -0.24221127])

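Also in this case we do not obtain a significant improvement in prediction accuracy: the best cross-validated RMSE is about 0.240, as in version 1. This is expected, since decision trees split on feature thresholds and are therefore insensitive to a monotonic rescaling of the features. The remaining differences likely reflect the random train/test splits and the randomness of the forest itself.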

### 5.4.3 Creation of retained RF predictor with optimal parameters

Let us now create a Random Forest predictor with the parameters obtained in the grid search.

In [57]:
X3_train, X3_test, Y3_train, Y3_test = train_test_split(X3,Y)
rf3 = RandomForestRegressor(max_depth=4,min_samples_leaf=1,min_samples_split=6)
rf3.fit(X3_train,Y3_train)

RandomForestRegressor(max_depth=4, min_samples_split=6)

In [59]:
Y3_pred = rf3.predict(X3_test)

In [61]:
rf3.score(X3_test,Y3_test)

0.13587954070506836

In [62]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(Y3_test, Y3_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y3_test, Y3_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y3_test, Y3_pred)))
print('Mean Absolute Percentage Error:', metrics.mean_absolute_percentage_error(Y3_test,Y3_pred))

Mean Absolute Error: 0.19762961944243937
Mean Squared Error: 0.05909357866770758
Root Mean Squared Error: 0.24309170834832597
Mean Absolute Percentage Error: 2.1711336970006037


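As we can see, we have an MAE of about 0.20, which means that on average the predicted rank is about 20 percentage points away from the player's true position in the ranking table. The $R^2$ score of about 0.14 indicates that the model explains only a small share of the variance in the rankings. The very large mean absolute percentage error is likely driven by true ranks close to zero, which inflate relative errors.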
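
To put these errors in context, they can be compared with a naive predictor that always answers 0.5, the middle of the table. The sketch below computes its errors for idealized ranks $i/L$ with no ties. The table length `L = 1000` is an arbitrary assumption and does not come from the data.

```python
import numpy as np

# Idealized fractional ranks i/L for a table of L players (L is arbitrary)
L = 1000
ranks = np.arange(1, L + 1) / L

# Baseline: always predict the middle of the table
baseline = np.full(L, 0.5)
mae = np.mean(np.abs(ranks - baseline))
rmse = np.sqrt(np.mean((ranks - baseline) ** 2))
print(round(mae, 4), round(rmse, 4))  # 0.25 0.2887
```

Against this idealized baseline (MAE 0.25, RMSE about 0.289), the reported MAE of about 0.20 and RMSE of about 0.243 represent a modest improvement.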