# 1. Player value records
In this notebook we explore the available player value records. They are stored in a dataframe in pickle format.

In [1]:
import pandas as pd
values = pd.read_pickle('../data/value_records_for_ratings_based_predictions.pkl')

In [2]:
values.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9610 entries, 5665 to 160599
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   player_id    9610 non-null   int32         
 1   player_name  9610 non-null   object        
 2   player_role  9607 non-null   category      
 3   birth        9610 non-null   datetime64[ns]
 4   height       9610 non-null   float64       
 5   foot         9610 non-null   object        
 6   value        9610 non-null   float64       
 7   league       9610 non-null   object        
 8   value_at     9610 non-null   datetime64[ns]
 9   nat1         9610 non-null   object        
 10  nat2         9610 non-null   object        
dtypes: category(1), datetime64[ns](2), float64(2), int32(1), object(5)
memory usage: 797.8+ KB
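
The summary above reports 9607 non-null `player_role` entries against 9610 rows, so three records lack a role. A minimal sketch of how such rows could be located, using a small toy frame (the rows below are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for `values`; ids and names are hypothetical.
toy = pd.DataFrame({
    'player_id': [1, 2, 3],
    'player_name': ['Player A', 'Player B', 'Player C'],
    'player_role': pd.Categorical(['Goalkeeper', None, 'Centre-Back']),
})

# Boolean mask of the rows whose role is missing.
no_role = toy['player_role'].isna()

print(no_role.sum())                             # number of records without a role
print(toy.loc[no_role, 'player_name'].tolist())  # who they are
```

On the real frame the same mask would be `values['player_role'].isna()`.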


Here is a little explanation of the data contained in the `values` data frame.
- `player_id` is the TransferMarkt id of the player
- `player_name`, `player_role`, `height`, `foot`, and `birth` are self-explanatory

In [3]:
values['birth'].describe()

count                    9610
unique                   2616
top       1994-05-27 00:00:00
freq                       22
first     1977-01-02 00:00:00
last      2001-08-16 00:00:00
Name: birth, dtype: object

- `nat1` and `nat2` are string-valued columns that store the first and second nationality of the player. In the output below, players with a single nationality have `-` in `nat2`.

In [4]:
values.loc[:,['player_name','nat1','nat2']].head()

Unnamed: 0,player_name,nat1,nat2
5665,Marko Dmitrovic,Serbia,-
5667,Paulo Oliveira,Portugal,-
5671,José Ángel,Spain,-
5674,Gonzalo Escalante,Argentina,Italy
5676,Bebé,Portugal,CapeVerde
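
In the rows shown, players with a single nationality carry `-` in `nat2` rather than a missing value, which is consistent with `info()` reporting no nulls for that column. A sketch, on the five rows displayed above, of how the placeholder could be turned into a proper missing value. Treating `-` as "no second nationality" is an assumption based on this output.

```python
import pandas as pd

# The five rows shown above, re-typed by hand.
sample = pd.DataFrame({
    'player_name': ['Marko Dmitrovic', 'Paulo Oliveira', 'José Ángel',
                    'Gonzalo Escalante', 'Bebé'],
    'nat1': ['Serbia', 'Portugal', 'Spain', 'Argentina', 'Portugal'],
    'nat2': ['-', '-', '-', 'Italy', 'CapeVerde'],
})

# Replace the '-' placeholder with NaN (mask() blanks the entries where the condition holds).
sample['nat2'] = sample['nat2'].mask(sample['nat2'] == '-')

# Players holding a second nationality.
dual = sample.loc[sample['nat2'].notna(), 'player_name'].tolist()
print(dual)
```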


- `value` contains the market value of the player recorded at time `value_at`.
- `league` is the league in which the player was competing at time `value_at`. 

In [5]:
values.loc[:,['player_id','player_name','birth','value','value_at','league']].head()

Unnamed: 0,player_id,player_name,birth,value,value_at,league
5665,94308,Marko Dmitrovic,1992-01-24,3.6,2018-06-30,SPA1
5667,139336,Paulo Oliveira,1992-01-08,2.7,2018-06-30,SPA1
5671,87469,José Ángel,1989-09-05,2.25,2018-06-30,SPA1
5674,266795,Gonzalo Escalante,1993-03-27,2.25,2018-06-30,SPA1
5676,153427,Bebé,1990-07-12,0.9,2018-06-30,SPA1


We can, for example, extract all records for the players born in a certain time interval (e.g., born in 2000).

In [6]:
values.loc[(values['birth']> '2000-1-1') & (values['birth'] < '2000-12-31'),:].head()

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
18850,316760,Vincent Thill,Attacking Midfield,2000-02-04,1.7,left,0.54,FR1,2017-06-30,Luxembourg,-
19488,370846,Timothy Weah,Centre-Forward,2000-02-22,1.85,right,0.9,FR1,2018-06-30,UnitedStates,France
21177,464990,Arton Zekaj,Defensive Midfield,2000-04-16,1.87,right,0.045,FR1,2018-06-30,Kosovo,Serbia
23745,463665,Oumar Solet,Centre-Back,2000-02-07,1.92,right,0.9,FR1,2018-06-30,France,CentralAfricanRepublic
23758,418659,Amine Gouiri,Centre-Forward,2000-02-16,1.8,right,4.05,FR1,2018-06-30,France,Algeria
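
The string comparison above uses strict inequalities, so a player born exactly on 1 January or 31 December 2000 would not be selected. A sketch of an alternative that compares the year component directly through the `.dt.year` accessor. The toy birth dates are invented to exercise the boundaries.

```python
import pandas as pd

# Hypothetical birth dates, including both ends of the year 2000.
toy = pd.DataFrame({
    'player_name': ['A', 'B', 'C', 'D'],
    'birth': pd.to_datetime(['2000-01-01', '2000-02-04',
                             '2000-12-31', '2001-01-01']),
})

# Strict string bounds, as in the cell above.
strict = toy.loc[(toy['birth'] > '2000-1-1') & (toy['birth'] < '2000-12-31'), :]

# Year-based mask: keeps every date whose year is 2000.
by_year = toy.loc[toy['birth'].dt.year == 2000, :]

print(strict['player_name'].tolist())
print(by_year['player_name'].tolist())
```

Whether the boundary dates matter for this dataset depends on whether any player was born on them; the year-based mask simply removes the question.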


We can further narrow down the search and select the records of the players born in a given year (e.g., 1990) at the time they were 25 years old, i.e., in 2015.

In [7]:
values.loc[(values['birth']> '1990-1-1') & (values['birth'] < '1990-12-31') & (values['value_at'] == '2015-06-30') ,:]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
9768,133794,Edgar Méndez,Right Winger,1990-01-02,1.87,right,2.25,SPA1,2015-06-30,Spain,-
9779,73092,Rene Krhin,Defensive Midfield,1990-05-21,1.89,right,2.25,SPA1,2015-06-30,Slovenia,-
10313,203043,Pedro Bigas,Centre-Back,1990-05-15,1.81,left,0.90,SPA1,2015-06-30,Spain,-
10321,223068,Tana,Attacking Midfield,1990-09-20,1.69,left,0.18,SPA1,2015-06-30,Spain,-
13754,59344,Asier Illarramendi,Defensive Midfield,1990-03-08,1.79,right,13.50,SPA1,2015-06-30,Spain,-
...,...,...,...,...,...,...,...,...,...,...,...
158151,45184,Grzegorz Krychowiak,Defensive Midfield,1990-01-29,1.87,right,10.80,SPA1,2015-06-30,Poland,-
158621,183647,Álvaro González,Centre-Back,1990-01-08,1.82,right,2.70,SPA1,2015-06-30,Spain,-
159548,58426,Douglas,Right-Back,1990-08-06,1.72,right,1.80,SPA1,2015-06-30,Brazil,-
160482,58884,Nacho,Centre-Back,1990-01-18,1.80,right,5.40,SPA1,2015-06-30,Spain,-


# 2. Filtering out missing players
For a number of players in the value records there are no rating statistics. This may be because they have not made an appearance in official matches. We filter them out.

In [13]:
missing = pd.read_csv('../data/missing.csv')
len(missing)

41

In [16]:
missing.head()

Unnamed: 0,ID,Name
0,250038,Mamadou Sissako
1,361082,Jasper Schendelaar
2,215094,Nick Olij
3,387241,Raman ten Hove
4,153751,Nick Hengelman


In [14]:
len(values)

9610

In [17]:
values = values.loc[~values['player_id'].isin(missing['ID']), :]
len(values) 

9556

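In total, 9610 - 9556 = 54 records are removed. This is more than the 41 players listed in `missing.csv`, so the result is smaller than 9610 - 41 = 9569: some players appear in several records, i.e., have values for different years.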
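
The difference between removed records and removed players can be checked by counting both. A sketch on toy data (ids and values invented for illustration), where one of the excluded players appears in two records:

```python
import pandas as pd

# Toy records: player 10 has two value records, players 20 and 30 one each.
toy_values = pd.DataFrame({
    'player_id': [10, 10, 20, 30],
    'value': [1.0, 1.2, 0.5, 2.0],
})
toy_missing = pd.DataFrame({'ID': [10, 20]})

# Rows that belong to players without rating statistics.
is_missing = toy_values['player_id'].isin(toy_missing['ID'])

removed_rows = int(is_missing.sum())                                  # records dropped
removed_players = toy_values.loc[is_missing, 'player_id'].nunique()   # distinct players dropped

kept = toy_values.loc[~is_missing, :]
print(removed_rows, removed_players, len(kept))
```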

# 3. Ranking
We do not predict market values directly, i.e., the dollar or euro value of a player. That value may be influenced by external factors such as the general economic situation, so we cannot, e.g., compare a dollar value from 2010 with one from 2020, even after discounting. Instead, we predict the *ranking* of the player in a value table listing all players born in the same year and having the same age.
Rankings are only compared within such a table: the ranking at age 25 of a player born in 1980 (25 years old in 2005) is not directly comparable with that of a player born in 1995 (25 years old in 2020).
This is done as follows:
- We divide the players according to their birth year. Thus we have a set of players born in 1990, a set of players born in 1991, and so on.
- For each birth year, we divide the records based on age. Thus, for the players born in 1990, we separate the records for year 2010 (age 20), year 2011 (age 21), and so on.
- We rank the players in each birth-year and age group in non-increasing order of market value.
- Since the number of players changes from one year to another, we calculate the ranking as the relative position in the table, i.e., $i/L$, where $i=1,\ldots,L$ is the position of the player in the table and $L$ is the length of the table, i.e., the number of players.
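
The steps above can be sketched with a `groupby` followed by a descending rank. This is one possible reading of the procedure on toy data, not necessarily the code used later in the notebook. The column names `birth_year` and `rank_pct` and all values are introduced here for illustration, and breaking ties by order of appearance (`method='first'`) is an assumption.

```python
import pandas as pd

# Toy records: two (birth year, age) groups with invented market values.
toy = pd.DataFrame({
    'player_name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'birth_year':  [1990, 1990, 1990, 1990, 1991, 1991],
    'age':         [25, 25, 25, 25, 24, 24],
    'value':       [13.5, 2.25, 0.9, 5.4, 10.0, 1.0],
})

groups = toy.groupby(['birth_year', 'age'])['value']

# Position i in the table sorted by non-increasing value (1 = most valuable).
position = groups.rank(method='first', ascending=False)

# Table length L for each record's group (no missing values in this toy data).
length = groups.transform('count')

# Relative position i / L, which lies in (0, 1].
toy['rank_pct'] = position / length
print(toy)
```

With this convention the most valuable player of a group of $L$ gets $1/L$ and the least valuable gets $1$.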

## 3.1 Dividing players based on their birth year
Let us first identify the range of birth years.


In [21]:
values['birth'].describe()

count                    9556
unique                   2595
top       1994-05-27 00:00:00
freq                       22
first     1977-01-02 00:00:00
last      2001-08-16 00:00:00
Name: birth, dtype: object

The years seem to range from 1977 to 2001.

In [27]:
years = list(range(1977, 2002))  # range() excludes its stop value, so 2002 is needed to include 2001
years

[1977,
 1978,
 1979,
 1980,
 1981,
 1982,
 1983,
 1984,
 1985,
 1986,
 1987,
 1988,
 1989,
 1990,
 1991,
 1992,
 1993,
 1994,
 1995,
 1996,
 1997,
 1998,
 1999,
 2000,
 2001]

In [29]:
players_by_year = {}
for y in years:
    df = values.loc[(values['birth']> str(y)+'-1-1') & (values['birth'] < str(y)+'-12-31'),:]
    players_by_year[y] = df
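
An equivalent split can be obtained in one pass by grouping on the year component of `birth`. A sketch on toy data (names and dates invented for illustration). Unlike the strict date bounds in the loop above, grouping by year also keeps players born on 1 January or 31 December.

```python
import pandas as pd

# Hypothetical players with birth dates in three different years.
toy = pd.DataFrame({
    'player_name': ['A', 'B', 'C', 'D'],
    'birth': pd.to_datetime(['1977-01-02', '1977-09-13',
                             '1990-12-31', '2001-08-16']),
})

# One sub-frame per birth year, keyed by the year as an int.
players_by_year = {
    int(year): group
    for year, group in toy.groupby(toy['birth'].dt.year)
}

print(sorted(players_by_year))
print(players_by_year[1977]['player_name'].tolist())
```

Only years that actually occur in the data become keys, so no list of years has to be written out by hand.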

In [30]:
players_by_year[1977]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2
21638,33927,Benjamin Nivet,Attacking Midfield,1977-01-02,1.78,right,0.225,FR1,2017-06-30,France,-
21720,33927,Benjamin Nivet,Attacking Midfield,1977-01-02,1.78,right,0.225,FR1,2015-06-30,France,-
21734,33927,Benjamin Nivet,Attacking Midfield,1977-01-02,1.78,right,0.63,FR1,2012-06-30,France,-
24961,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.225,FR1,2018-06-30,Brazil,-
24992,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.81,FR1,2011-06-30,Brazil,-
25008,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.225,FR1,2016-06-30,Brazil,-
25042,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.225,FR1,2017-06-30,Brazil,-
25153,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.225,FR1,2015-06-30,Brazil,-
25170,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.9,FR1,2012-06-30,Brazil,-
25205,4811,Hilton,Centre-Back,1977-09-13,1.8,right,0.45,FR1,2013-06-30,Brazil,-


Let us also divide the players based on age. Here the age is the difference between the `value_at` year and the `birth` year. 

In [39]:
# `values` was obtained by slicing above; take an explicit copy so that
# adding a column does not trigger pandas' SettingWithCopyWarning.
values = values.copy()
# Vectorized difference between the `value_at` year and the `birth` year.
values['age'] = values['value_at'].dt.year - values['birth'].dt.year


In [40]:
values.head()

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2,age
5665,94308,Marko Dmitrovic,Goalkeeper,1992-01-24,1.94,left,3.6,SPA1,2018-06-30,Serbia,-,26
5667,139336,Paulo Oliveira,Centre-Back,1992-01-08,1.87,right,2.7,SPA1,2018-06-30,Portugal,-,26
5671,87469,José Ángel,Left-Back,1989-09-05,1.82,left,2.25,SPA1,2018-06-30,Spain,-,29
5674,266795,Gonzalo Escalante,Central Midfield,1993-03-27,1.82,right,2.25,SPA1,2018-06-30,Argentina,Italy,25
5676,153427,Bebé,Left Winger,1990-07-12,1.9,right,0.9,SPA1,2018-06-30,Portugal,CapeVerde,28
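
The `age` column is a difference of calendar years, so it can be one higher than the player's completed age on the valuation date when the birthday falls later in the year. A sketch comparing the two on three rows taken from the output above. The column `age_exact` is introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Three of the rows shown above, re-typed by hand.
sample = pd.DataFrame({
    'player_name': ['Marko Dmitrovic', 'José Ángel', 'Bebé'],
    'birth': pd.to_datetime(['1992-01-24', '1989-09-05', '1990-07-12']),
    'value_at': pd.to_datetime(['2018-06-30'] * 3),
})

# Calendar-year difference, as used in this notebook.
sample['age'] = sample['value_at'].dt.year - sample['birth'].dt.year

# True where the birthday had not yet occurred in the valuation year.
before_birthday = (
    (sample['value_at'].dt.month < sample['birth'].dt.month)
    | ((sample['value_at'].dt.month == sample['birth'].dt.month)
       & (sample['value_at'].dt.day < sample['birth'].dt.day))
)

# Completed years on the valuation date.
sample['age_exact'] = sample['age'] - before_birthday.astype(int)
print(sample[['player_name', 'age', 'age_exact']])
```

Because every record in a birth-year group at a given `value_at` year receives the same `age`, the calendar-year definition is sufficient for the grouping used in the ranking.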


In [42]:
values.loc[values['age']== 25,:]

Unnamed: 0,player_id,player_name,player_role,birth,height,foot,value,league,value_at,nat1,nat2,age
5674,266795,Gonzalo Escalante,Central Midfield,1993-03-27,1.82,right,2.25,SPA1,2018-06-30,Argentina,Italy,25
5678,142027,Pablo Hervías,Right Winger,1993-03-08,1.74,right,0.72,SPA1,2018-06-30,Spain,-,25
5684,311287,José Antonio Martínez,Centre-Back,1993-02-12,1.91,left,0.45,SPA1,2018-06-30,Spain,-,25
5720,127108,Florian Lejeune,Centre-Back,1991-05-20,1.90,right,0.72,SPA1,2016-06-30,France,-,25
5724,238868,Rubén Peña,Right-Back,1991-07-18,1.70,right,0.36,SPA1,2016-06-30,Spain,-,25
...,...,...,...,...,...,...,...,...,...,...,...,...
160487,31909,Toni Kroos,Central Midfield,1990-01-04,1.83,both,45.00,SPA1,2015-06-30,Germany,-,25
160526,18922,Karim Benzema,Centre-Forward,1987-12-19,1.85,both,31.50,SPA1,2012-06-30,France,Algeria,25
160532,44501,Marcelo,Left-Back,1988-05-12,1.74,left,22.50,SPA1,2013-06-30,Brazil,Spain,25
160549,33648,Fábio Coentrão,Left-Back,1988-03-11,1.79,left,16.20,SPA1,2013-06-30,Portugal,-,25
