# 1. Introduction
In this notebook we use a random forest to predict future player value ranks based on ratings. We use the best Random Forest model derived in the preliminary notebook available [here](./player_values_info.ipynb). The hypothesis we wish to verify is that predictions become more accurate as the predictors are observed closer to the target age. For example, if we wish to predict the value rank at the age of 25, predictors (e.g., rating, role) observed at the age of 23 should lead to better prediction accuracy than predictors observed at the age of 22.

We start by loading the necessary packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

# 2. Values

Let us load the value data.

In [2]:
values = pd.read_pickle('../data/value_records_for_ratings_based_predictions.pkl')
values.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9610 entries, 5665 to 160599
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   player_id    9610 non-null   int32         
 1   player_name  9610 non-null   object        
 2   player_role  9607 non-null   category      
 3   birth        9610 non-null   datetime64[ns]
 4   height       9610 non-null   float64       
 5   foot         9610 non-null   object        
 6   value        9610 non-null   float64       
 7   league       9610 non-null   object        
 8   value_at     9610 non-null   datetime64[ns]
 9   nat1         9610 non-null   object        
 10  nat2         9610 non-null   object        
dtypes: category(1), datetime64[ns](2), float64(2), int32(1), object(5)
memory usage: 797.8+ KB


Let us simplify the player roles by grouping them into macro roles.

In [3]:
macro_role = {'Goalkeeper':'GK', 
              'Centre-Back':'DF',
              'Defensive Midfield': 'MF',
              'Right Winger':'MF',
              'Centre-Forward':'FW',
              'Right-Back':'DF',
              'Attacking Midfield':'MF',
              'Central Midfield':'MF',
              'Left-Back':'DF',
              'Left Winger':'MF',
              'Right Midfield':'MF',
              'Left Midfield':'MF',
              'Second Striker':'FW'
             }
values["macro_role"] = values["player_role"].apply(lambda x: macro_role.get(x))
values.head()

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2,macro_role
5665,94308,Marko Dmitrovic,Goalkeeper,1992-01-24,1.94,left,3.6,SPA1,2018-06-30,Serbia,-,GK
5667,139336,Paulo Oliveira,Centre-Back,1992-01-08,1.87,right,2.7,SPA1,2018-06-30,Portugal,-,DF
5671,87469,José Ángel,Left-Back,1989-09-05,1.82,left,2.25,SPA1,2018-06-30,Spain,-,DF
5674,266795,Gonzalo Escalante,Central Midfield,1993-03-27,1.82,right,2.25,SPA1,2018-06-30,Argentina,Italy,MF
5676,153427,Bebé,Left Winger,1990-07-12,1.9,right,0.9,SPA1,2018-06-30,Portugal,CapeVerde,MF


For a number of players in the value records there are no rating statistics. This might be because they have not made an appearance in official matches. We filter them out.

In [4]:
missing = pd.read_csv('../data/missing.csv')
len(missing)

41

In [5]:
len(values)

9610

In [6]:
values = values.loc[~values['player_id'].isin(missing['ID']), :]
len(values) 

9556

This number is smaller than 9610 - 41 = 9569 because some players appear in more than one record, i.e., they have values recorded for different years, and all of their records are removed.
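
The effect can be reproduced on a toy example: filtering with `isin` drops every record of a listed player, so the number of removed rows can exceed the number of listed IDs. The frames `toy_values` and `toy_missing` below are made up for illustration and only mimic the column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: player 1 has values recorded in two different years
toy_values = pd.DataFrame({
    'player_id': [1, 1, 2, 3],
    'value_at': pd.to_datetime(['2015-06-30', '2016-06-30', '2016-06-30', '2016-06-30']),
})
toy_missing = pd.DataFrame({'ID': [1]})

# Boolean mask of the records belonging to players without ratings
is_missing = toy_values['player_id'].isin(toy_missing['ID'])
filtered = toy_values.loc[~is_missing, :]

# One listed player, but two records removed
removed = len(toy_values) - len(filtered)
print(len(toy_missing), removed)
```

Here a single missing player accounts for two removed rows, which is the same mechanism that makes the real count drop by more than 41.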

## 2.1. Ranking
Market values are not predicted directly, that is, we do not predict the dollar/euro value of a player. The monetary value may be influenced by external factors such as the general economic situation, so we cannot, e.g., compare a dollar value from 2010 with a dollar value from 2020, even if discounted. Instead, we predict the *ranking* of the player in a value table listing all players born in the same year and observed at the same age.

For the same reason, we cannot simply rank all players at the age of, say, 25: the value of a 25-year-old player in 2010 cannot be compared directly with the value of a 25-year-old player in 2015. Rather, we group players by birth year and age. As an example, we rank all the 25-year-old players born in 1990.

This is done as follows:
- We divide the players according to their birth year. Thus we will have a set of players born in 1990, a set of players born in 1991, and so on.
- For each birth year, we divide the players based on age. Thus, for the players born in, say, 1990, we separate the records with `value_at` in year 2010 (age 20), in year 2011 (age 21), and so on.
- We rank the players in each birth-age group in non-increasing order of the market value, so that the most valuable player comes first.
- Since the number of players changes from one year to another, we calculate the ranking as the percentage position in the table, i.e., $i/L$ where $i=1,\ldots,L$ is the position of the player in the table, and $L$ is the length of the table, or number of players.
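
The four steps above can be sketched on toy data with a single `groupby`. This is only an illustration under assumed data, not the code used in this notebook; the frame `toy` and its values are made up, and the age is taken as the difference between the valuation year and the birth year.

```python
import pandas as pd

# Hypothetical toy records: three players born in 1990, two born in 1991
toy = pd.DataFrame({
    'player_id': [1, 2, 3, 4, 5],
    'birth': pd.to_datetime(['1990-02-01', '1990-07-15', '1990-11-30',
                             '1991-03-03', '1991-09-09']),
    'value_at': pd.to_datetime(['2015-06-30'] * 3 + ['2016-06-30'] * 2),
    'value': [10.0, 2.5, 2.5, 4.0, 1.0],
})

# Steps 1-2: birth year and age (year difference) define the group
toy['birth_year'] = toy['birth'].dt.year
toy['age'] = toy['value_at'].dt.year - toy['birth_year']

# Steps 3-4: percentage rank i/L within each group, most valuable player first
toy['rank'] = (toy.groupby(['birth_year', 'age'])['value']
                  .rank(pct=True, ascending=False, method='average'))
print(toy[['player_id', 'birth_year', 'age', 'value', 'rank']])
```

With `method='average'`, tied values share the average of their positions, so the two players worth 2.5 both receive $2.5/3 \approx 0.83$.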


Let us first identify the range of birth years.

In [7]:
values['birth'].describe()

count                    9556
unique                   2595
top       1994-05-27 00:00:00
freq                       22
first     1977-01-02 00:00:00
last      2001-08-16 00:00:00
Name: birth, dtype: object

The birth years range from $1977$ to $2001$.

In [8]:
years = list(range(1977, 2002))  # 1977 to 2001 inclusive (range excludes its upper bound)

We are likely to be interested in predicting market values for a limited number of ages, say from 20 to 29, i.e., not too far after the peak value. The following list contains the ages at which we may want to predict the value.

In [9]:
ages = [a for a in range(20,30,1)]
ages

[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

We now divide the value records according to the birth year of the player and the age at which the value was recorded.

In [10]:
players_groups = {}
for y in years:
    for a in ages:
        # Same birth year (1 January and 31 December included) and age a at the time of valuation
        df = values.loc[(values['birth'].dt.year == y)
                        & ((values['value_at'].dt.year - values['birth'].dt.year) == a), :]
        players_groups[(y,a)] = df

For example:

In [11]:
players_groups[(1990,25)]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2,macro_role
9768,133794,Edgar Méndez,Right Winger,1990-01-02,1.87,right,2.25,SPA1,2015-06-30,Spain,-,MF
9779,73092,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,2.25,SPA1,2015-06-30,Slovenia,-,MF
10313,203043,Pedro Bigas,Centre-Back,1990-05-15,1.81,left,0.90,SPA1,2015-06-30,Spain,-,DF
10321,223068,Tana,Attacking Midfield,1990-09-20,1.69,left,0.18,SPA1,2015-06-30,Spain,-,MF
13754,59344,Asier Illarramendi,Defensive Midfield,1990-03-08,1.79,right,13.50,SPA1,2015-06-30,Spain,-,MF
...,...,...,...,...,...,...,...,...,...,...,...,...
158151,45184,Grzegorz Krychowiak,Defensive Midfield,1990-01-29,1.87,right,10.80,SPA1,2015-06-30,Poland,-,MF
158621,183647,Álvaro González,Centre-Back,1990-01-08,1.82,right,2.70,SPA1,2015-06-30,Spain,-,DF
159548,58426,Douglas,Right-Back,1990-08-06,1.72,right,1.80,SPA1,2015-06-30,Brazil,-,DF
160482,58884,Nacho,Centre-Back,1990-01-18,1.80,right,5.40,SPA1,2015-06-30,Spain,-,DF


Finally, we assign a value rank to the records.

In [12]:
for y in years:
    for a in ages:
        group = players_groups[(y,a)]
        # assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
        players_groups[(y,a)] = group.assign(
            rank=group['value'].rank(pct=True, ascending=False, method='average', na_option='bottom')
        )


Let us see an example.

In [13]:
players_groups[(1990,26)]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2,macro_role,rank
5729,153427,Bebé,Left Winger,1990-07-12,1.90,right,1.80,SPA1,2016-06-30,Portugal,CapeVerde,MF,0.678808
8221,42920,Fran Mérida,Central Midfield,1990-03-04,1.74,left,0.54,SPA1,2016-06-30,Spain,-,MF,0.933775
9721,73092,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,2.70,SPA1,2016-06-30,Slovenia,-,MF,0.579470
9737,93759,Matthieu Saunier,Centre-Back,1990-02-07,1.81,right,0.90,SPA1,2016-06-30,France,-,DF,0.847682
10232,223068,Tana,Attacking Midfield,1990-09-20,1.69,left,1.35,SPA1,2016-06-30,Spain,-,MF,0.771523
...,...,...,...,...,...,...,...,...,...,...,...,...,...
158377,183647,Álvaro González,Centre-Back,1990-01-08,1.82,right,3.60,SPA1,2016-06-30,Spain,-,DF,0.490066
159298,58426,Douglas,Right-Back,1990-08-06,1.72,right,0.90,SPA1,2016-06-30,Brazil,-,DF,0.847682
159748,227805,Jaume Doménech,Goalkeeper,1990-11-05,1.85,right,3.60,SPA1,2016-06-30,Spain,-,GK,0.490066
160231,58884,Nacho,Centre-Back,1990-01-18,1.80,right,4.50,SPA1,2016-06-30,Spain,-,DF,0.427152


# 3. Ratings
Let us now load the ratings values for different ages.

In [14]:
# The rating files have no header row, so the column names are supplied explicitly
rating_columns = ['player_id_lm','player_id','player_name','rating','peak_rating','minutes_played']
ratings_17 = pd.read_csv('../data/ratings_17.txt', names=rating_columns)
ratings_18 = pd.read_csv('../data/ratings_18.txt', names=rating_columns)
ratings_19 = pd.read_csv('../data/ratings_19.txt', names=rating_columns)
ratings_20 = pd.read_csv('../data/ratings_20.txt', names=rating_columns)
ratings_21 = pd.read_csv('../data/ratings_21.txt', names=rating_columns)
ratings_22 = pd.read_csv('../data/ratings_22.txt', names=rating_columns)
ratings_23 = pd.read_csv('../data/ratings_23.txt', names=rating_columns)
ratings_24 = pd.read_csv('../data/ratings_24.txt', names=rating_columns)
ratings_25 = pd.read_csv('../data/ratings_25.txt', names=rating_columns)

In [15]:
ratings_17.head()

Unnamed: 0,player_id_lm,player_id,player_name,rating,peak_rating,minutes_played
0,39151,94308,Marko Dmitrovic,,,0
1,11673,139336,Paulo Oliveira,,,0
2,6213,87469,Jose Angel,,,0
3,52516,266795,Gonzalo Escalante,,,0
4,3184,153427,Bebe,,,0


In [16]:
ratings_25.head()

Unnamed: 0,player_id_lm,player_id,player_name,rating,peak_rating,minutes_played
0,39151,94308,Marko Dmitrovic,-9.5e-05,0.010286,6556
1,11673,139336,Paulo Oliveira,0.068622,0.079448,11162
2,6213,87469,Jose Angel,0.100548,0.10762,8934
3,52516,266795,Gonzalo Escalante,0.010727,0.020054,7998
4,3184,153427,Bebe,-0.022201,-0.019779,5617


From these records let us filter out missing values.

In [17]:
ratings_17 = ratings_17.dropna(subset=['rating', 'peak_rating'])
ratings_18 = ratings_18.dropna(subset=['rating', 'peak_rating'])
ratings_19 = ratings_19.dropna(subset=['rating', 'peak_rating'])
ratings_20 = ratings_20.dropna(subset=['rating', 'peak_rating'])
ratings_21 = ratings_21.dropna(subset=['rating', 'peak_rating'])
ratings_22 = ratings_22.dropna(subset=['rating', 'peak_rating'])
ratings_23 = ratings_23.dropna(subset=['rating', 'peak_rating'])
ratings_24 = ratings_24.dropna(subset=['rating', 'peak_rating'])
ratings_25 = ratings_25.dropna(subset=['rating', 'peak_rating'])

Finally, we also assign a percentage rank to ratings, peak ratings, and minutes played.

In [18]:
# Percentage ranks (highest value first) for every rating table
rank_kwargs = dict(pct=True, ascending=False, method='average', na_option='bottom')
for table in (ratings_17, ratings_18, ratings_19, ratings_20, ratings_21,
              ratings_22, ratings_23, ratings_24, ratings_25):
    for column in ('rating', 'peak_rating', 'minutes_played'):
        table['rank_' + column] = table[column].rank(**rank_kwargs)

In [19]:
ratings_25.head()

Unnamed: 0,player_id_lm,player_id,player_name,rating,peak_rating,minutes_played,rank_rating,rank_peak_rating,rank_minutes_played
0,39151,94308,Marko Dmitrovic,-9.5e-05,0.010286,6556,0.638777,0.652337,0.600115
1,11673,139336,Paulo Oliveira,0.068622,0.079448,11162,0.254761,0.254183,0.266301
2,6213,87469,Jose Angel,0.100548,0.10762,8934,0.141662,0.153491,0.426572
3,52516,266795,Gonzalo Escalante,0.010727,0.020054,7998,0.569244,0.586555,0.489325
4,3184,153427,Bebe,-0.022201,-0.019779,5617,0.768609,0.826313,0.666763


Let us store all ratings in a dictionary keyed by age.

In [20]:
ratings = {}
ratings[17] = ratings_17
ratings[18] = ratings_18
ratings[19] = ratings_19
ratings[20] = ratings_20
ratings[21] = ratings_21
ratings[22] = ratings_22
ratings[23] = ratings_23
ratings[24] = ratings_24
ratings[25] = ratings_25
ratings[25].head()

Unnamed: 0,player_id_lm,player_id,player_name,rating,peak_rating,minutes_played,rank_rating,rank_peak_rating,rank_minutes_played
0,39151,94308,Marko Dmitrovic,-9.5e-05,0.010286,6556,0.638777,0.652337,0.600115
1,11673,139336,Paulo Oliveira,0.068622,0.079448,11162,0.254761,0.254183,0.266301
2,6213,87469,Jose Angel,0.100548,0.10762,8934,0.141662,0.153491,0.426572
3,52516,266795,Gonzalo Escalante,0.010727,0.020054,7998,0.569244,0.586555,0.489325
4,3184,153427,Bebe,-0.022201,-0.019779,5617,0.768609,0.826313,0.666763


# 4. Random Forest Regression

We will now set up a regression task where we use the ratings, values, and minutes played at a given age (e.g., 20) to predict the value (rank) at some target age (e.g., 25). 
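
As a rough sketch of this setup, the snippet below pairs predictors observed at one age with the value rank at a later age using an inner merge on `player_id`. The two toy frames and their numbers are hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rating ranks observed at the prediction age (e.g., 20)
toy_ratings = pd.DataFrame({
    'player_id': [1, 2, 3],
    'rank_rating': [0.2, 0.5, 0.9],
    'rank_minutes_played': [0.3, 0.4, 0.8],
})
# Hypothetical value ranks at the target age (e.g., 25); player 3 has no record
toy_target = pd.DataFrame({
    'player_id': [1, 2, 4],
    'rank': [0.1, 0.6, 0.7],
})

# The inner merge keeps only players present in both tables
toy_data = toy_ratings.merge(toy_target, on='player_id', how='inner')
X_toy = toy_data[['rank_rating', 'rank_minutes_played']]  # predictors
Y_toy = toy_data['rank']                                   # target
print(X_toy.shape, Y_toy.tolist())
```

Players that lack either a rating at the prediction age or a value at the target age drop out of the sample, so the number of usable records can differ from one prediction age to another.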

## 4.1. Data collection

We will now collect the data necessary to fit a regression model. We create a function that can be called for different target and prediction ages.

In [42]:
def gather_data(target_age:int,prediction_age:int):
    """Returns predictors X observed at prediction_age and the target Y (value rank at target_age)."""

    # Gathers prediction ratings
    prediction_ratings = ratings[prediction_age]

    # Gathers value rankings at the target age over all birth years
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    rankings = pd.concat([players_groups[(y,target_age)] for y in years])
    # Selects only the characteristics of the players we are interested in
    rankings = rankings.loc[:,['player_id','player_name','player_role','macro_role','birth','height','foot','rank']]

    # Merges ratings and rankings
    data = prediction_ratings.merge(rankings,on='player_id', how='inner')
    X = data.loc[:,['rank','player_id','rank_rating','rank_peak_rating','rank_minutes_played']]

    # Finds the value rank at the prediction age (available only for the ages in `ages`)
    groups = [players_groups[(y,prediction_age)] for y in years if (y,prediction_age) in players_groups]
    predictors = ['rank_rating','rank_peak_rating','rank_minutes_played']
    if groups:
        prediction_rankings = pd.concat(groups).rename(columns={"rank": "prediction_value_rank"})
        # Adds the value rank at the prediction age as a predictor
        X = X.merge(prediction_rankings.loc[:,['player_id','prediction_value_rank']],on='player_id', how='inner')
        predictors.append('prediction_value_rank')

    Y = X['rank']
    X = X.loc[:,predictors]
    return X,Y

In [43]:
X,Y = gather_data(25,17)
X.head()

Unnamed: 0,rank_rating,rank_peak_rating,rank_minutes_played
0,0.203704,0.351852,0.5
1,0.157407,0.212963,0.555556
2,0.574074,0.777778,0.462963
3,0.083333,0.111111,0.981481
4,0.111111,0.101852,0.018519


We are now ready to train a Random Forest regressor.

## 4.2. Example model training

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y)
parameters = [
    {"max_depth": [2, 4, 6],'min_samples_split': [2,4,6], 'min_samples_leaf':[1,2,3]}
]
#gs = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_root_mean_squared_error')
gs = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_mean_absolute_error')
gs.fit(X_train, Y_train)
gs.cv_results_["mean_test_score"]

array([-0.17138725, -0.1714    , -0.17189995, -0.17040478, -0.17076111,
       -0.17092786, -0.1731289 , -0.17187622, -0.16860052, -0.16485915,
       -0.16584329, -0.16693777, -0.16548071, -0.16642416, -0.16517401,
       -0.16594152, -0.16564796, -0.16683734, -0.16601445, -0.16578772,
       -0.16583189, -0.16518874, -0.16634504, -0.16601674, -0.16655383,
       -0.16413403, -0.1644497 ])

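The cell below averages the scores over all parameter combinations. Its output predates the grid search above (cell 24 ran before cell 25), presumably with the commented-out `neg_root_mean_squared_error` scoring, which would explain why the printed RMSE does not match the mean of the MAE scores listed above.

In [24]: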
print("Avg RMSE",-gs.cv_results_["mean_test_score"].mean())

Avg RMSE 0.20241728966253816
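
Averaging `mean_test_score` over the whole grid mixes good and bad parameter combinations. An alternative summary, sketched below on synthetic data, is to report the cross-validated score of the best combination and then evaluate the refitted model on the held-out test set. The data, the reduced grid, `cv=3`, and `n_estimators=20` are assumptions chosen to keep the example fast; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the rank predictors and the target (values in [0, 1])
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
Y_demo = np.clip(0.7 * X_demo[:, 0] + 0.3 * rng.uniform(size=200), 0, 1)

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, random_state=0)

# Small grid to keep the sketch quick
demo_parameters = {'max_depth': [2, 4], 'min_samples_leaf': [1, 3]}
gs_demo = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                       demo_parameters, scoring='neg_mean_absolute_error', cv=3)
gs_demo.fit(X_tr, Y_tr)

# Cross-validated MAE of the best combination (scikit-learn negates error scores)
best_cv_mae = -gs_demo.best_score_
# MAE of the refitted best model on data not used during the search
test_mae = mean_absolute_error(Y_te, gs_demo.best_estimator_.predict(X_te))
print(gs_demo.best_params_, best_cv_mae, test_mae)
```

By construction, the best cross-validated MAE is never larger than the grid-wide average, so the two summaries are not directly comparable.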


## 4.3. Regression for different ages
Let us now collect the average MAE for each prediction age from 18 to 24, keeping the target age fixed at 25.

In [47]:
results = {'age' : [], 'mae':[]}
target_age = 25
for prediction_age in range(18,25):
    print("Prediction age ",prediction_age)
    # Gathers the data
    X,Y = gather_data(target_age,prediction_age)
    # Finds an RF model
    #X_train, X_test, Y_train, Y_test = train_test_split(X,Y)
    parameters = [{"max_depth": [2, 4, 6],'min_samples_split': [2,4,6], 'min_samples_leaf':[1,2,3]}]
    gs = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_mean_absolute_error')
    #gs.fit(X_train, Y_train)
    gs.fit(X, Y)
    avg_mae = gs.cv_results_["mean_test_score"].mean()
    results['age'].append(prediction_age)
    results['mae'].append(-avg_mae)
df = pd.DataFrame(results)
df

Prediction age  18
Prediction age  19
Prediction age  20
Prediction age  21
Prediction age  22
Prediction age  23
Prediction age  24


Unnamed: 0,age,mae
0,18,0.227898
1,19,0.209378
2,20,0.1707
3,21,0.162392
4,22,0.124828
5,23,0.126843
6,24,0.098924
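
The table can be checked against the hypothesis from the introduction with a few lines of code. The sketch below copies the MAE values printed above and counts how often the error drops when the prediction age increases by one year. It is a simple descriptive check, not a statistical test, and the name `summary` is introduced here only for illustration.

```python
import pandas as pd

# MAE values copied from the table above (target age 25)
summary = pd.DataFrame({
    'age': [18, 19, 20, 21, 22, 23, 24],
    'mae': [0.227898, 0.209378, 0.1707, 0.162392, 0.124828, 0.126843, 0.098924],
})

# Year-over-year change in MAE; a negative value means the prediction improved
summary['delta'] = summary['mae'].diff()
improved = int((summary['delta'].dropna() < 0).sum())
steps = len(summary) - 1
total_reduction = summary['mae'].iloc[0] - summary['mae'].iloc[-1]

print(f"MAE decreased in {improved} of {steps} steps")
print(f"Total reduction from age 18 to 24: {total_reduction:.6f}")
```

The only step that does not improve is the one from age 22 to 23, where the MAE rises by about 0.002; overall the error falls from roughly 0.23 to 0.10, which is consistent with the hypothesis.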
