# Data Loading & Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


## Data Loading

In [2]:
#load the valuations csv
valuations = pd.read_csv("data1/player_valuations.csv")

In [3]:
#load the players csv
players = pd.read_csv("data1/players.csv")

This dataset has a lot of null values in a few places. We will have to investigate further.
We'll also have to look for duplicates.

In [4]:
players.describe()

Unnamed: 0,player_id,last_season,current_club_id,height_in_cm,market_value_in_eur,highest_market_value_in_eur
count,30298.0,30298.0,30298.0,28188.0,19358.0,28955.0
mean,311229.8,2018.768335,4365.629547,182.233113,2180419.0,3523106.0
std,250218.4,3.654054,10056.593385,6.83413,7096501.0,9217968.0
min,10.0,2012.0,3.0,18.0,10000.0,10000.0
25%,95268.25,2016.0,403.0,178.0,175000.0,250000.0
50%,257824.0,2019.0,1071.0,182.0,350000.0,750000.0
75%,465540.8,2022.0,3008.0,187.0,1000000.0,2600000.0
max,1186012.0,2023.0,83678.0,207.0,180000000.0,200000000.0


In [5]:
appearances=pd.read_csv("data1/appearances.csv")

In [6]:
gameevents = pd.read_csv("data1/game_events.csv")

I've loaded most of the CSV files I'll be using for my dataset.

## Data Cleaning

#### Valuations & Players Dataset 

In [7]:
#look at valuations dataset
valuations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440663 entries, 0 to 440662
Data columns (total 9 columns):
 #   Column                               Non-Null Count   Dtype 
---  ------                               --------------   ----- 
 0   player_id                            440663 non-null  int64 
 1   last_season                          440663 non-null  int64 
 2   datetime                             440663 non-null  object
 3   date                                 440663 non-null  object
 4   dateweek                             440663 non-null  object
 5   market_value_in_eur                  440663 non-null  int64 
 6   n                                    440663 non-null  int64 
 7   current_club_id                      440663 non-null  int64 
 8   player_club_domestic_competition_id  440663 non-null  object
dtypes: int64(5), object(4)
memory usage: 30.3+ MB


The valuations dataframe doesn't seem to have any null values, which is good. We will still have to check for duplicates.
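
A duplicate check could look something like the sketch below. It runs on a small made-up frame (`toy_valuations` and its values are assumptions for illustration, not rows from the real CSV) and separates exact duplicate rows from rows that merely repeat the same `player_id` and `date`.

```python
import pandas as pd

# Toy stand-in for the valuations table; the values are invented for illustration.
toy_valuations = pd.DataFrame(
    {
        "player_id": [10, 10, 10, 26, 26],
        "date": ["2020-01-01", "2020-01-01", "2021-01-01", "2020-01-01", "2020-01-01"],
        "market_value_in_eur": [500000, 500000, 750000, 100000, 150000],
    }
)

# Rows that are exact copies of an earlier row.
n_exact_duplicates = int(toy_valuations.duplicated().sum())

# Rows that repeat an earlier (player_id, date) pair, even if the value differs.
n_key_duplicates = int(toy_valuations.duplicated(subset=["player_id", "date"]).sum())

print(n_exact_duplicates, n_key_duplicates)
```

On the real data the same two calls would be applied to `valuations` directly; which key columns define a duplicate is a judgment call.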

In [8]:
valuations.describe()

Unnamed: 0,player_id,last_season,market_value_in_eur,n,current_club_id
count,440663.0,440663.0,440663.0,440663.0,440663.0
mean,196411.3,2018.762887,2357557.0,1.0,4041.891491
std,179362.2,3.624305,6603356.0,0.0,9508.375247
min,10.0,2012.0,10000.0,1.0,3.0
25%,55322.0,2016.0,200000.0,1.0,368.0
50%,140748.0,2019.0,500000.0,1.0,1010.0
75%,289645.0,2022.0,1600000.0,1.0,2944.0
max,1166093.0,2023.0,200000000.0,1.0,83678.0



In [10]:
valuations.isna().sum()

player_id                              0
last_season                            0
datetime                               0
date                                   0
dateweek                               0
market_value_in_eur                    0
n                                      0
current_club_id                        0
player_club_domestic_competition_id    0
dtype: int64

No null values here.

In [11]:
players.isna().sum()

player_id                                   0
first_name                               1964
last_name                                   0
name                                        0
last_season                                 0
current_club_id                             0
player_code                                 0
country_of_birth                         2691
city_of_birth                            2205
country_of_citizenship                    544
date_of_birth                              47
sub_position                              173
position                                    0
foot                                     2397
height_in_cm                             2110
market_value_in_eur                     10940
highest_market_value_in_eur              1343
contract_expiration_date                11478
agent_name                              15362
image_url                                   0
url                                         0
current_club_domestic_competition_

Quite a few null values here; they'll have to be dealt with sooner or later.
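
To judge how much each column's nulls matter, it can help to look at the share of missing values rather than raw counts, and at how many rows survive different `dropna` choices. The sketch below uses a made-up `toy_players` frame (an assumption for illustration only).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the players table; the values are invented for illustration.
toy_players = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4],
        "foot": ["left", None, "right", None],
        "height_in_cm": [180.0, np.nan, 175.0, 190.0],
    }
)

# Fraction of missing values per column, worst column first.
null_share = toy_players.isna().mean().sort_values(ascending=False)
print(null_share)

# Dropping rows with any null is stricter than dropping on one column only.
rows_complete = len(toy_players.dropna())
rows_with_height = len(toy_players.dropna(subset=["height_in_cm"]))
print(rows_complete, rows_with_height)
```

Restricting `dropna` to the columns that are actually needed usually preserves more rows than dropping on every column at once.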

In [12]:
valuations["player_id"].nunique()

28794

`.nunique()` returns the number of distinct `player_id` values in my valuations dataset. I have valuations for 28,794 individual players, and with 440,663 rows in total that is roughly 15 valuations per player on average. I'll have to decide later on whether I want one valuation per player or to keep the full valuation history.
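
If one valuation per player were wanted, one option is to keep only the most recent record for each `player_id`. This is a sketch on invented data (`toy_valuations` is hypothetical), not a decision the notebook has made.

```python
import pandas as pd

# Toy stand-in for the valuations table; the values are invented for illustration.
toy_valuations = pd.DataFrame(
    {
        "player_id": [10, 10, 26, 26, 26],
        "date": pd.to_datetime(
            ["2019-06-01", "2021-06-01", "2018-01-01", "2020-01-01", "2019-01-01"]
        ),
        "market_value_in_eur": [400000, 900000, 100000, 300000, 200000],
    }
)

# Sort chronologically, then keep the last (most recent) row for each player.
latest_valuation = (
    toy_valuations.sort_values("date")
    .drop_duplicates(subset="player_id", keep="last")
    .sort_values("player_id")
    .reset_index(drop=True)
)

print(latest_valuation)
```

Keeping the full history instead preserves the time dimension, which matters if values are later matched against yearly statistics.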

In [14]:
valuations.sample(25)

Unnamed: 0,player_id,last_season,datetime,date,dateweek,market_value_in_eur,n,current_club_id,player_club_domestic_competition_id
255491,34553,2018,2018-12-31 00:00:00,2018-12-31,2018-12-31,600000,1,398,IT1
177014,7449,2019,2016-08-01 00:00:00,2016-08-01,2016-08-01,9000000,1,6890,TR1
330137,192539,2018,2020-12-21 00:00:00,2020-12-21,2020-12-21,400000,1,441,GR1
361166,256696,2018,2021-06-29 00:00:00,2021-06-29,2021-06-28,125000,1,2503,PO1
276123,52191,2020,2019-06-28 00:00:00,2019-06-28,2019-06-24,500000,1,873,GB1
162754,72195,2022,2016-02-22 00:00:00,2016-02-22,2016-02-22,2000000,1,6676,GR1
382985,268122,2022,2022-01-06 00:00:00,2022-01-06,2022-01-03,400000,1,354,BE1
209696,230219,2013,2017-07-14 00:00:00,2017-07-14,2017-07-10,100000,1,366,ES1
384461,98631,2018,2022-01-17 00:00:00,2022-01-17,2022-01-17,225000,1,2293,TR1
425393,576982,2023,2023-01-05 00:00:00,2023-01-05,2023-01-02,150000,1,868,TR1


Sample 25 random rows to get a feel for how the dataset looks.

### Data Merging

In [15]:
#merge players and valuations dataset
fdf = pd.merge(players, valuations, on="player_id")

Here I'm merging the valuations and players dataframes into one wider dataframe, joined on the `player_id` column. `pd.merge` defaults to an inner join, so only players present in both tables are kept, and columns that share a name get an `_x` (players) or `_y` (valuations) suffix. I'll continue to merge in the other dataframes until I have my desired dataset.
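
To see what a merge keeps and drops, `pd.merge` accepts `indicator=True`, and `validate` can assert the expected key relationship. The sketch below uses two made-up frames; the frame names and the suffix names `_current` / `_historical` are assumptions chosen for readability, not what the notebook uses.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are invented for illustration.
toy_players = pd.DataFrame(
    {"player_id": [1, 2, 3], "name": ["A", "B", "C"], "market_value_in_eur": [100, 200, 300]}
)
toy_valuations = pd.DataFrame(
    {"player_id": [1, 1, 2, 4], "market_value_in_eur": [50, 100, 200, 400]}
)

merged = pd.merge(
    toy_players,
    toy_valuations,
    on="player_id",
    how="outer",                           # keep unmatched rows so they can be counted
    suffixes=("_current", "_historical"),  # clearer than the default _x / _y
    validate="one_to_many",                # raises if player_id repeats in toy_players
    indicator=True,                        # adds a _merge column: both / left_only / right_only
)

counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` or `right_only` are the ones an inner join would silently discard.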

In [16]:
#do .info on dataframe to get overview
fdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440584 entries, 0 to 440583
Data columns (total 31 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   player_id                             440584 non-null  int64  
 1   first_name                            410797 non-null  object 
 2   last_name                             440584 non-null  object 
 3   name                                  440584 non-null  object 
 4   last_season_x                         440584 non-null  int64  
 5   current_club_id_x                     440584 non-null  int64  
 6   player_code                           440584 non-null  object 
 7   country_of_birth                      423624 non-null  object 
 8   city_of_birth                         430453 non-null  object 
 9   country_of_citizenship                433617 non-null  object 
 10  date_of_birth                         440061 non-null  object 
 11  

In [17]:
#describe 
fdf.describe()

Unnamed: 0,player_id,last_season_x,current_club_id_x,height_in_cm,market_value_in_eur_x,highest_market_value_in_eur,last_season_y,market_value_in_eur_y,n,current_club_id_y
count,440584.0,440584.0,440584.0,430044.0,304570.0,440584.0,440584.0,440584.0,440584.0,440584.0
mean,196419.5,2018.765509,4047.309421,182.331513,2631117.0,5515200.0,2018.763761,2357931.0,1.0,4037.26897
std,179357.4,3.622222,9526.404484,6.677484,7894860.0,11872750.0,3.623842,6603889.0,0.0,9502.420155
min,10.0,2012.0,3.0,18.0,10000.0,10000.0,2012.0,10000.0,1.0,3.0
25%,55317.0,2016.0,368.0,178.0,200000.0,600000.0,2016.0,200000.0,1.0,368.0
50%,140757.0,2019.0,1010.0,183.0,400000.0,1500000.0,2019.0,500000.0,1.0,1010.0
75%,289648.0,2022.0,2969.0,187.0,1500000.0,5000000.0,2022.0,1600000.0,1.0,2944.0
max,1166093.0,2023.0,83678.0,207.0,180000000.0,200000000.0,2023.0,200000000.0,1.0,83678.0


In [18]:
#have a look at Messi's data in the dataset
fdf[fdf["player_id"] == 28003]

Unnamed: 0,player_id,first_name,last_name,name,last_season_x,current_club_id_x,player_code,country_of_birth,city_of_birth,country_of_citizenship,...,current_club_domestic_competition_id,current_club_name,last_season_y,datetime,date,dateweek,market_value_in_eur_y,n,current_club_id_y,player_club_domestic_competition_id
51400,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2004-12-20 00:00:00,2004-12-20,2004-12-20,3000000,1,583,FR1
51401,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2005-12-28 00:00:00,2005-12-28,2005-12-26,5000000,1,583,FR1
51402,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2006-01-20 00:00:00,2006-01-20,2006-01-16,15000000,1,583,FR1
51403,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2007-07-26 00:00:00,2007-07-26,2007-07-23,40000000,1,583,FR1
51404,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2007-09-12 00:00:00,2007-09-12,2007-09-10,60000000,1,583,FR1
51405,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2008-02-04 00:00:00,2008-02-04,2008-02-04,55000000,1,583,FR1
51406,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2008-07-10 00:00:00,2008-07-10,2008-07-07,55000000,1,583,FR1
51407,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2009-01-26 00:00:00,2009-01-26,2009-01-26,55000000,1,583,FR1
51408,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2009-04-28 00:00:00,2009-04-28,2009-04-27,60000000,1,583,FR1
51409,28003,Lionel,Messi,Lionel Messi,2022,583,lionel-messi,Argentina,Rosario,Argentina,...,FR1,Paris Saint-Germain,2022,2009-07-22 00:00:00,2009-07-22,2009-07-20,70000000,1,583,FR1


#### Appearances Dataset

In [19]:
#get head of appearances
appearances.head()

Unnamed: 0,appearance_id,game_id,player_id,player_club_id,player_current_club_id,date,player_name,competition_id,yellow_cards,red_cards,goals,assists,minutes_played
0,2231978_38004,2231978,38004,853,235,2012-07-03,Aurélien Joachim,CLQ,0,0,2,0,90
1,2233748_79232,2233748,79232,8841,2698,2012-07-05,Ruslan Abyshov,ELQ,0,0,0,0,90
2,2234413_42792,2234413,42792,6251,465,2012-07-05,Sander Puri,ELQ,0,0,0,0,45
3,2234418_73333,2234418,73333,1274,6646,2012-07-05,Vegar Hedenstad,ELQ,0,0,0,0,90
4,2234421_122011,2234421,122011,195,3008,2012-07-05,Markus Henriksen,ELQ,0,0,0,1,90


In [20]:
fdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440584 entries, 0 to 440583
Data columns (total 31 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   player_id                             440584 non-null  int64  
 1   first_name                            410797 non-null  object 
 2   last_name                             440584 non-null  object 
 3   name                                  440584 non-null  object 
 4   last_season_x                         440584 non-null  int64  
 5   current_club_id_x                     440584 non-null  int64  
 6   player_code                           440584 non-null  object 
 7   country_of_birth                      423624 non-null  object 
 8   city_of_birth                         430453 non-null  object 
 9   country_of_citizenship                433617 non-null  object 
 10  date_of_birth                         440061 non-null  object 
 11  

In [21]:
#keep only the columns needed; .copy() guards against SettingWithCopyWarning on later assignments
fdf2 = fdf.loc[:, [
    "player_id", "name", "last_season_x", "current_club_id_x",
    "country_of_citizenship", "date_of_birth", "sub_position", "position",
    "foot", "height_in_cm", "market_value_in_eur_y",
    "highest_market_value_in_eur", "contract_expiration_date",
    "current_club_name", "current_club_domestic_competition_id", "date",
]].copy()

In [22]:
fdf2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440584 entries, 0 to 440583
Data columns (total 16 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   player_id                             440584 non-null  int64  
 1   name                                  440584 non-null  object 
 2   last_season_x                         440584 non-null  int64  
 3   current_club_id_x                     440584 non-null  int64  
 4   country_of_citizenship                433617 non-null  object 
 5   date_of_birth                         440061 non-null  object 
 6   sub_position                          439781 non-null  object 
 7   position                              440584 non-null  object 
 8   foot                                  426723 non-null  object 
 9   height_in_cm                          430044 non-null  float64
 10  market_value_in_eur_y                 440584 non-null  int64  
 11  

In [23]:
fdf2.sample(20)

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,date_of_birth,sub_position,position,foot,height_in_cm,market_value_in_eur_y,highest_market_value_in_eur,contract_expiration_date,current_club_name,current_club_domestic_competition_id,date
392032,226100,Anthony van den Hurk,2018,642,Curacao,1993-01-09,Centre-Forward,Attack,right,180.0,700000,700000.0,2023-06-30 00:00:00,De Graafschap Doetinchem,NL1,2022-06-24
19590,452607,Alexander Bah,2023,294,Denmark,1997-12-09,Right-Back,Defender,right,183.0,6000000,14000000.0,2027-06-30 00:00:00,SL Benfica,PO1,2021-12-26
171933,148957,Carl McHugh,2018,987,Ireland,1993-02-05,Defensive Midfield,Midfield,left,180.0,325000,400000.0,2024-05-31 00:00:00,Motherwell FC,SC1,2021-12-07
197899,139339,João Amorim,2015,68608,Portugal,1992-07-26,Right-Back,Defender,right,178.0,200000,800000.0,2023-06-30 00:00:00,CF Os Belenenses,PO1,2019-02-12
182855,330289,Cheick Keita,2018,1245,Mali,1996-04-16,Left-Back,Defender,left,180.0,1000000,1250000.0,2023-05-31 00:00:00,KAS Eupen,BE1,2018-06-07
310243,240260,Franjo Prce,2019,2477,Croatia,1996-01-07,Centre-Back,Defender,left,188.0,150000,600000.0,2024-06-30 00:00:00,Karpaty Lviv (-2021),UKR1,2016-01-04
133062,21863,Andrea Dossena,2013,289,Italy,1981-09-11,Left Midfield,Midfield,left,180.0,3200000,8000000.0,,Sunderland AFC,GB1,2012-01-11
258260,317818,Carlos Abad,2019,128,Spain,1995-06-28,Goalkeeper,Goalkeeper,right,193.0,200000,500000.0,2023-06-30 00:00:00,AO Xanthi,GR1,2022-12-29
350252,287183,Fedor Chalov,2023,2410,Russia,1998-04-10,Centre-Forward,Attack,right,181.0,2000000,16000000.0,2024-12-31 00:00:00,CSKA Moscow,RU1,2018-06-06
285879,332132,Valeriy Rogozynskyi,2023,61825,Ukraine,1995-09-03,Left Midfield,Midfield,left,180.0,125000,400000.0,2024-06-30 00:00:00,FK Minaj,UKR1,2020-12-05


In [24]:
#look back at Messi's data with the reduced set of columns
fdf2[fdf2["player_id"] == 28003]

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,date_of_birth,sub_position,position,foot,height_in_cm,market_value_in_eur_y,highest_market_value_in_eur,contract_expiration_date,current_club_name,current_club_domestic_competition_id,date
51400,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,3000000,180000000.0,,Paris Saint-Germain,FR1,2004-12-20
51401,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,5000000,180000000.0,,Paris Saint-Germain,FR1,2005-12-28
51402,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,15000000,180000000.0,,Paris Saint-Germain,FR1,2006-01-20
51403,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,40000000,180000000.0,,Paris Saint-Germain,FR1,2007-07-26
51404,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,60000000,180000000.0,,Paris Saint-Germain,FR1,2007-09-12
51405,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,55000000,180000000.0,,Paris Saint-Germain,FR1,2008-02-04
51406,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,55000000,180000000.0,,Paris Saint-Germain,FR1,2008-07-10
51407,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,55000000,180000000.0,,Paris Saint-Germain,FR1,2009-01-26
51408,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,60000000,180000000.0,,Paris Saint-Germain,FR1,2009-04-28
51409,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,70000000,180000000.0,,Paris Saint-Germain,FR1,2009-07-22


In [25]:
fdf2["player_id"].nunique()

28785

In [26]:
fdf2["date_of_birth"] = pd.to_datetime(fdf2["date_of_birth"])
fdf2["date"] = pd.to_datetime(fdf2["date"])
fdf2["contract_expiration_date"] = pd.to_datetime(fdf2["contract_expiration_date"])

In [27]:
fdf2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440584 entries, 0 to 440583
Data columns (total 16 columns):
 #   Column                                Non-Null Count   Dtype         
---  ------                                --------------   -----         
 0   player_id                             440584 non-null  int64         
 1   name                                  440584 non-null  object        
 2   last_season_x                         440584 non-null  int64         
 3   current_club_id_x                     440584 non-null  int64         
 4   country_of_citizenship                433617 non-null  object        
 5   date_of_birth                         440061 non-null  datetime64[ns]
 6   sub_position                          439781 non-null  object        
 7   position                              440584 non-null  object        
 8   foot                                  426723 non-null  object        
 9   height_in_cm                          430044 non-null  floa

In [28]:
fdf2[fdf2["player_id"]==28003]

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,date_of_birth,sub_position,position,foot,height_in_cm,market_value_in_eur_y,highest_market_value_in_eur,contract_expiration_date,current_club_name,current_club_domestic_competition_id,date
51400,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,3000000,180000000.0,NaT,Paris Saint-Germain,FR1,2004-12-20
51401,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,5000000,180000000.0,NaT,Paris Saint-Germain,FR1,2005-12-28
51402,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,15000000,180000000.0,NaT,Paris Saint-Germain,FR1,2006-01-20
51403,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,40000000,180000000.0,NaT,Paris Saint-Germain,FR1,2007-07-26
51404,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,60000000,180000000.0,NaT,Paris Saint-Germain,FR1,2007-09-12
51405,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,55000000,180000000.0,NaT,Paris Saint-Germain,FR1,2008-02-04
51406,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,55000000,180000000.0,NaT,Paris Saint-Germain,FR1,2008-07-10
51407,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,55000000,180000000.0,NaT,Paris Saint-Germain,FR1,2009-01-26
51408,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,60000000,180000000.0,NaT,Paris Saint-Germain,FR1,2009-04-28
51409,28003,Lionel Messi,2022,583,Argentina,1987-06-24,Right Winger,Attack,left,170.0,70000000,180000000.0,NaT,Paris Saint-Germain,FR1,2009-07-22


In [29]:
#drop nulls
fdf2 = fdf2.dropna()

In [30]:
fdf2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 262413 entries, 138 to 440583
Data columns (total 16 columns):
 #   Column                                Non-Null Count   Dtype         
---  ------                                --------------   -----         
 0   player_id                             262413 non-null  int64         
 1   name                                  262413 non-null  object        
 2   last_season_x                         262413 non-null  int64         
 3   current_club_id_x                     262413 non-null  int64         
 4   country_of_citizenship                262413 non-null  object        
 5   date_of_birth                         262413 non-null  datetime64[ns]
 6   sub_position                          262413 non-null  object        
 7   position                              262413 non-null  object        
 8   foot                                  262413 non-null  object        
 9   height_in_cm                          262413 non-null  fl

### Basic Column Engineering

In [31]:
# calculate age at the time of date
fdf2["age"] = (fdf2["date"] - fdf2["date_of_birth"]).dt.days / 365

#round to nearest year
fdf2["age"] = fdf2["age"].round().astype(int)

Now that I've derived age, the date of birth column is no longer needed, so I'll drop it.
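
Note that dividing by 365 and rounding gives the nearest whole year, which can be one higher than the age in completed years. The sketch below compares the two on a two-row toy frame that reuses Messi's date of birth and first valuation date from the output above; the completed-years formula is an alternative shown for comparison, not what this notebook uses.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "date_of_birth": pd.to_datetime(["1987-06-24", "1987-06-24"]),
        "date": pd.to_datetime(["2004-12-20", "2005-06-24"]),
    }
)

# The notebook's approach: days / 365, rounded to the nearest year.
rounded_age = ((toy["date"] - toy["date_of_birth"]).dt.days / 365).round().astype(int)

# Alternative: completed years, subtracting 1 if the birthday has not yet occurred that year.
birthday_passed = (toy["date"].dt.month > toy["date_of_birth"].dt.month) | (
    (toy["date"].dt.month == toy["date_of_birth"].dt.month)
    & (toy["date"].dt.day >= toy["date_of_birth"].dt.day)
)
completed_age = (
    toy["date"].dt.year - toy["date_of_birth"].dt.year - (~birthday_passed).astype(int)
)

print(rounded_age.tolist(), completed_age.tolist())
```

Either definition can work as a model feature as long as it is applied consistently.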

In [32]:
#drop DOB
fdf2=fdf2.drop("date_of_birth",axis=1)

In [33]:
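#calculate days left of contract relative to the day the notebook is run
#(values change between runs; negative means the contract has already expired)
current_date = pd.to_datetime("today")
fdf2["contract_days_left"] = (fdf2["contract_expiration_date"] - current_date).dt.days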

In [34]:
#get head
appearances.head()

Unnamed: 0,appearance_id,game_id,player_id,player_club_id,player_current_club_id,date,player_name,competition_id,yellow_cards,red_cards,goals,assists,minutes_played
0,2231978_38004,2231978,38004,853,235,2012-07-03,Aurélien Joachim,CLQ,0,0,2,0,90
1,2233748_79232,2233748,79232,8841,2698,2012-07-05,Ruslan Abyshov,ELQ,0,0,0,0,90
2,2234413_42792,2234413,42792,6251,465,2012-07-05,Sander Puri,ELQ,0,0,0,0,45
3,2234418_73333,2234418,73333,1274,6646,2012-07-05,Vegar Hedenstad,ELQ,0,0,0,0,90
4,2234421_122011,2234421,122011,195,3008,2012-07-05,Markus Henriksen,ELQ,0,0,0,1,90


In [35]:
#extract month and year from the valuation date
fdf2["month"]=fdf2["date"].dt.month
fdf2["year"] = fdf2["date"].dt.year


In [36]:
#get head
fdf2.head()

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,highest_market_value_in_eur,contract_expiration_date,current_club_name,current_club_domestic_competition_id,date,age,contract_days_left,month,year
138,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,100000,2000000.0,2023-06-30,Feyenoord Rotterdam,NL1,2004-10-04,23,-134,10,2004
139,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,250000,2000000.0,2023-06-30,Feyenoord Rotterdam,NL1,2005-04-14,23,-134,4,2005
140,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,350000,2000000.0,2023-06-30,Feyenoord Rotterdam,NL1,2006-04-14,24,-134,4,2006
141,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,350000,2000000.0,2023-06-30,Feyenoord Rotterdam,NL1,2007-05-02,25,-134,5,2007
142,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,700000,2000000.0,2023-06-30,Feyenoord Rotterdam,NL1,2007-07-29,25,-134,7,2007


In [37]:
#convert date to datetime
appearances["date"] = pd.to_datetime(appearances["date"])

#get year
appearances["year"] = appearances["date"].dt.year

# group by player identifiers and year and get sum of statistics;
# "first" keeps the club and competition from the first row of each group
yearly_stats = appearances.groupby(["player_id", "player_name", "year"]).agg(
    {
        "goals": "sum",
        "assists": "sum",
        "yellow_cards": "sum",
        "red_cards": "sum",
        "minutes_played": "sum",
        "player_club_id": "first",
        "competition_id": "first",
    }
).reset_index()



In [38]:
#test
yearly_stats[yearly_stats["player_club_id"]==234]

Unnamed: 0,player_id,player_name,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,competition_id
557,3321,Luke Wilkshire,2015,0,0,1,0,138,234,NL1
901,4010,Dirk Kuyt,2016,16,6,4,0,3411,234,NL1
902,4010,Dirk Kuyt,2017,6,2,1,0,968,234,NL1
918,4042,Brad Jones,2017,0,0,1,0,3690,234,NL1
919,4042,Brad Jones,2018,0,0,1,0,1620,234,NL1
...,...,...,...,...,...,...,...,...,...,...
89081,683620,Antoni Milambo,2021,0,0,0,0,13,234,ECLQ
89082,683620,Antoni Milambo,2023,0,0,0,0,17,234,NL1
89559,710886,Leo Sauer,2023,1,0,0,0,93,234,NL1
89714,721470,Patrik Walemark,2022,3,3,3,0,1104,234,NL1


My `competition_id` column sometimes shows a cup or qualifier ID instead of the domestic league ID, because the aggregation keeps the competition of the first appearance in each group. Maybe we can link the `player_club_id` column with the `club_id` column of the clubs dataset.
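
The effect can be reproduced on a tiny made-up frame: when a player's first appearance of the year is a cup game, `"first"` returns the cup ID even though most appearances are league games. The most-frequent-value variant below is shown only for comparison (an assumption, not the approach taken in this notebook, which joins on the clubs table instead).

```python
import pandas as pd

# Toy appearances for one player in one year; the values are invented for illustration.
toy_appearances = pd.DataFrame(
    {
        "player_id": [1, 1, 1],
        "year": [2016, 2016, 2016],
        "date": pd.to_datetime(["2016-01-10", "2016-02-01", "2016-02-08"]),
        "competition_id": ["CIT", "IT1", "IT1"],
        "goals": [1, 0, 2],
    }
)

grouped = toy_appearances.sort_values("date").groupby(["player_id", "year"])

# "first" picks whatever competition the earliest appearance belongs to.
first_competition = grouped["competition_id"].first()

# The most frequent competition in the group is less sensitive to a single cup game.
most_common_competition = grouped["competition_id"].agg(lambda s: s.mode().iloc[0])

print(first_competition.iloc[0], most_common_competition.iloc[0])
```

Joining on the club's `domestic_competition_id` avoids the issue altogether, at the cost of assuming the club's league has not changed over the years.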

#### Clubs Dataset

In [39]:
#read clubs csv
clubs=pd.read_csv("data1/clubs.csv")

In [40]:
#look at data
clubs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426 entries, 0 to 425
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   club_id                  426 non-null    int64  
 1   club_code                426 non-null    object 
 2   name                     426 non-null    object 
 3   domestic_competition_id  426 non-null    object 
 4   total_market_value       0 non-null      float64
 5   squad_size               426 non-null    int64  
 6   average_age              388 non-null    float64
 7   foreigners_number        426 non-null    int64  
 8   foreigners_percentage    379 non-null    float64
 9   national_team_players    426 non-null    int64  
 10  stadium_name             426 non-null    object 
 11  stadium_seats            426 non-null    int64  
 12  net_transfer_record      426 non-null    object 
 13  coach_name               0 non-null      float64
 14  last_season              4

In [41]:
#merge the yearly stats dataframe with a few club attributes from the clubs dataset
merged_data = yearly_stats.merge(
    clubs[["club_id", "domestic_competition_id", "net_transfer_record", "national_team_players"]],
    left_on="player_club_id",
    right_on="club_id",
    how="left",
)

merged_data.sample(10)

Unnamed: 0,player_id,player_name,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,competition_id,club_id,domestic_competition_id,net_transfer_record,national_team_players
279,1986,Markus Feulner,2013,2,1,6,0,1942,4,L1,4.0,L1,+€2.63m,2.0
47423,162961,Valerio Verre,2016,2,1,1,1,947,2921,CIT,2921.0,IT1,+€4.60m,1.0
88708,661053,Jarrad Branthwaite,2022,1,0,0,1,876,29,GB1,29.0,GB1,+€42.30m,6.0
17706,47011,Óscar Scarione,2014,11,4,0,0,2589,10484,TR1,10484.0,TR1,€-1.00m,2.0
7768,23825,Yoan Gouffran,2019,0,0,0,0,290,1467,TR1,1467.0,TR1,+€1.60m,0.0
61087,242648,Amos Nasha,2015,0,0,0,0,4,379,ELQ,379.0,GB1,+€18.84m,13.0
84386,524290,Fisnik Asllani,2023,0,0,0,0,121,533,L1,533.0,L1,€-7.30m,12.0
69167,305956,Nano Mesa,2018,0,0,0,0,72,3368,ES1,3368.0,ES1,+€7.80m,0.0
48844,170516,Tim Dierßen,2016,0,0,0,0,19,42,L1,42.0,L1,€-300k,1.0
71559,332866,Shaun Want,2018,0,0,5,0,1583,2999,SC1,2999.0,SC1,+-0,0.0


In [42]:
#drop competition and club id
merged_data=merged_data.drop(["competition_id","club_id"],axis=1)

In [43]:
#test on messi data
merged_data[merged_data["player_name"]=="Lionel Messi"]

Unnamed: 0,player_id,player_name,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,domestic_competition_id,net_transfer_record,national_team_players
9285,28003,Lionel Messi,2012,35,9,1,0,2220,131,ES1,+€102.00m,15.0
9286,28003,Lionel Messi,2013,39,12,1,0,3018,131,ES1,+€102.00m,15.0
9287,28003,Lionel Messi,2014,50,20,5,0,4531,131,ES1,+€102.00m,15.0
9288,28003,Lionel Messi,2015,48,27,5,0,4559,131,ES1,+€102.00m,15.0
9289,28003,Lionel Messi,2016,51,27,7,0,4429,131,ES1,+€102.00m,15.0
9290,28003,Lionel Messi,2017,50,19,9,0,4931,131,ES1,+€102.00m,15.0
9291,28003,Lionel Messi,2018,42,24,5,0,3503,131,ES1,+€102.00m,15.0
9292,28003,Lionel Messi,2019,42,14,4,0,3490,131,ES1,+€102.00m,15.0
9293,28003,Lionel Messi,2020,26,23,8,0,3915,131,ES1,+€102.00m,15.0
9294,28003,Lionel Messi,2021,34,14,4,1,3933,131,ES1,+€102.00m,15.0


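Everything looks good here: the player statistics update with the year, which is what I wanted. Note that `net_transfer_record` and `national_team_players` are identical in every row, since they come from the clubs table rather than from the yearly statistics.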

In [44]:
#sanity check on mbappe
fdf2[fdf2["player_id"]==342229]

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,highest_market_value_in_eur,contract_expiration_date,current_club_name,current_club_domestic_competition_id,date,age,contract_days_left,month,year
72966,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,50000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2015-12-02,17,232,12,2015
72967,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,250000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2016-02-04,17,232,2,2016
72968,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,1000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2016-04-09,17,232,4,2016
72969,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,1500000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2016-07-13,18,232,7,2016
72970,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,4000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2016-11-01,18,232,11,2016
72971,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,10000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2017-01-16,18,232,1,2017
72972,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,35000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2017-06-01,18,232,6,2017
72973,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,90000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2017-10-12,19,232,10,2017
72974,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,120000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2018-01-24,19,232,1,2018
72975,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,120000000,200000000.0,2024-06-30,Paris Saint-Germain,FR1,2018-06-04,19,232,6,2018


In [45]:
#look at merged data
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91452 entries, 0 to 91451
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   player_id                91452 non-null  int64  
 1   player_name              91452 non-null  object 
 2   year                     91452 non-null  int64  
 3   goals                    91452 non-null  int64  
 4   assists                  91452 non-null  int64  
 5   yellow_cards             91452 non-null  int64  
 6   red_cards                91452 non-null  int64  
 7   minutes_played           91452 non-null  int64  
 8   player_club_id           91452 non-null  int64  
 9   domestic_competition_id  89007 non-null  object 
 10  net_transfer_record      89007 non-null  object 
 11  national_team_players    89007 non-null  float64
dtypes: float64(1), int64(8), object(3)
memory usage: 9.1+ MB


In [46]:
#set year to integer
merged_data["year"] = merged_data["year"].astype(int)

#merge the dataframes
fdf3 = pd.merge(fdf2, merged_data, how="left", on=["player_id", "year"])

#test on mbappe
fdf3[fdf3["player_id"]==342229]


Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,...,player_name,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,domestic_competition_id,net_transfer_record,national_team_players
41513,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,50000,...,Kylian Mbappé,0.0,1.0,0.0,0.0,56.0,162.0,FR1,€-24.50m,15.0
41514,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,250000,...,Kylian Mbappé,4.0,6.0,3.0,0.0,792.0,162.0,FR1,€-24.50m,15.0
41515,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,1000000,...,Kylian Mbappé,4.0,6.0,3.0,0.0,792.0,162.0,FR1,€-24.50m,15.0
41516,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,1500000,...,Kylian Mbappé,4.0,6.0,3.0,0.0,792.0,162.0,FR1,€-24.50m,15.0
41517,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,4000000,...,Kylian Mbappé,4.0,6.0,3.0,0.0,792.0,162.0,FR1,€-24.50m,15.0
41518,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,10000000,...,Kylian Mbappé,30.0,13.0,3.0,0.0,3311.0,162.0,FR1,€-24.50m,15.0
41519,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,35000000,...,Kylian Mbappé,30.0,13.0,3.0,0.0,3311.0,162.0,FR1,€-24.50m,15.0
41520,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,90000000,...,Kylian Mbappé,30.0,13.0,3.0,0.0,3311.0,162.0,FR1,€-24.50m,15.0
41521,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,120000000,...,Kylian Mbappé,21.0,10.0,4.0,1.0,2550.0,583.0,FR1,€-146.50m,18.0
41522,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,120000000,...,Kylian Mbappé,21.0,10.0,4.0,1.0,2550.0,583.0,FR1,€-146.50m,18.0


We need to clean the data by dropping null values and duplicate values.
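As a minimal sketch of this cleaning step, the toy dataframe below (illustrative values, not taken from the dataset) shows that `dropna()` and `drop_duplicates()` each return a new dataframe, so the result has to be assigned for the change to persist. The names `toy` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: one exact duplicate row and one row with a missing value
toy = pd.DataFrame({
    "player_id": [1, 1, 2, 3],
    "year": [2020, 2020, 2021, 2022],
    "goals": [5.0, 5.0, np.nan, 2.0],
})

# Both methods return a new DataFrame, so the result is assigned to a variable
cleaned = toy.dropna().drop_duplicates()

print(len(toy), len(cleaned))
```

The original `toy` frame is left untouched; only `cleaned` has the rows removed.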

In [47]:
fdf3 = fdf3.dropna()

In [48]:
fdf3.sort_values(by="market_value_in_eur_y", ascending=False).head(10)

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,...,player_name,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,domestic_competition_id,net_transfer_record,national_team_players
41525,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,200000000,...,Kylian Mbappé,21.0,10.0,4.0,1.0,2550.0,583.0,FR1,€-146.50m,18.0
41526,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,200000000,...,Kylian Mbappé,38.0,12.0,4.0,0.0,2818.0,583.0,FR1,€-146.50m,18.0
41527,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,200000000,...,Kylian Mbappé,38.0,12.0,4.0,0.0,2818.0,583.0,FR1,€-146.50m,18.0
41535,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,180000000,...,Kylian Mbappé,41.0,16.0,9.0,0.0,3545.0,583.0,FR1,€-146.50m,18.0
41536,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,180000000,...,Kylian Mbappé,24.0,4.0,4.0,0.0,2230.0,583.0,FR1,€-146.50m,18.0
41528,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,180000000,...,Kylian Mbappé,21.0,12.0,2.0,0.0,2286.0,583.0,FR1,€-146.50m,18.0
41529,342229,Kylian Mbappé,2023,583,France,Centre-Forward,Attack,right,178.0,180000000,...,Kylian Mbappé,21.0,12.0,2.0,0.0,2286.0,583.0,FR1,€-146.50m,18.0
161543,68290,Neymar,2022,583,Brazil,Left Winger,Attack,right,175.0,180000000,...,Neymar,24.0,12.0,5.0,0.0,2215.0,583.0,FR1,€-146.50m,18.0
161544,68290,Neymar,2022,583,Brazil,Left Winger,Attack,right,175.0,180000000,...,Neymar,24.0,12.0,5.0,0.0,2215.0,583.0,FR1,€-146.50m,18.0
161545,68290,Neymar,2022,583,Brazil,Left Winger,Attack,right,175.0,180000000,...,Neymar,24.0,12.0,5.0,0.0,2215.0,583.0,FR1,€-146.50m,18.0


In [49]:
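#drop_duplicates() returns a new dataframe; the result is only displayed here and is not assigned back to fdf3
fdf3.drop_duplicates()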

Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,...,player_name,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,domestic_competition_id,net_transfer_record,national_team_players
10,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0.0,0.0,1.0,0.0,720.0,31.0,GB1,€-111.30m,16.0
11,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0.0,0.0,1.0,0.0,720.0,31.0,GB1,€-111.30m,16.0
12,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0.0,0.0,0.0,0.0,450.0,31.0,GB1,€-111.30m,16.0
13,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0.0,0.0,0.0,0.0,450.0,31.0,GB1,€-111.30m,16.0
14,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0.0,0.0,0.0,0.0,466.0,31.0,GB1,€-111.30m,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262395,371851,Jaka Bijol,2023,410,Slovenia,Centre-Back,Defender,right,190.0,10000000,...,Jaka Bijol,1.0,2.0,6.0,0.0,2511.0,410.0,IT1,+€8.56m,8.0
262405,586756,Festy Ebosele,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,...,Festy Ebosele,0.0,0.0,1.0,0.0,53.0,410.0,IT1,+€8.56m,8.0
262406,586756,Festy Ebosele,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,...,Festy Ebosele,0.0,0.0,1.0,0.0,53.0,410.0,IT1,+€8.56m,8.0
262407,586756,Festy Ebosele,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,...,Festy Ebosele,0.0,1.0,3.0,0.0,870.0,410.0,IT1,+€8.56m,8.0


In [50]:
fdf3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 123594 entries, 10 to 262412
Data columns (total 29 columns):
 #   Column                                Non-Null Count   Dtype         
---  ------                                --------------   -----         
 0   player_id                             123594 non-null  int64         
 1   name                                  123594 non-null  object        
 2   last_season_x                         123594 non-null  int64         
 3   current_club_id_x                     123594 non-null  int64         
 4   country_of_citizenship                123594 non-null  object        
 5   sub_position                          123594 non-null  object        
 6   position                              123594 non-null  object        
 7   foot                                  123594 non-null  object        
 8   height_in_cm                          123594 non-null  float64       
 9   market_value_in_eur_y                 123594 non-null  int

In [51]:
fdf3[fdf3["player_id"]==8198]["market_value_in_eur_y"]

28253     90000000
28254    100000000
28255    100000000
28256    100000000
28257    100000000
28258    100000000
28259    120000000
28260    120000000
28261    120000000
28262    110000000
28263    110000000
28264    110000000
28265    110000000
28266    100000000
28267    100000000
28268    120000000
28269    100000000
28270    100000000
28271     90000000
28272     75000000
28273     60000000
28274     60000000
28275     60000000
28276     50000000
28277     45000000
28278     35000000
28279     30000000
28280     20000000
28281     20000000
Name: market_value_in_eur_y, dtype: int64

In [52]:
#change all the datatypes
fdf3["goals"] = fdf3["goals"].astype(int)
fdf3["assists"] = fdf3["assists"].astype(int)
fdf3["yellow_cards"] = fdf3["yellow_cards"].astype(int)
fdf3["red_cards"] = fdf3["red_cards"].astype(int)
fdf3["minutes_played"]=fdf3["minutes_played"].astype(int)
fdf3["player_club_id"]=fdf3["player_club_id"].astype(int)
fdf3["national_team_players"]=fdf3["national_team_players"].astype(int)






In [53]:
#look at rows with a negative number of contract days left
fdf3[fdf3["contract_days_left"]<0]


Unnamed: 0,player_id,name,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,...,player_name,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,domestic_competition_id,net_transfer_record,national_team_players
10,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,1,0,720,31,GB1,€-111.30m,16
11,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,1,0,720,31,GB1,€-111.30m,16
12,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,0,0,450,31,GB1,€-111.30m,16
13,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,0,0,450,31,GB1,€-111.30m,16
14,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,0,0,466,31,GB1,€-111.30m,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262368,288499,Ewandro Costa,2017,410,Brazil,Centre-Forward,Attack,left,176.0,1000000,...,Ewandro Costa,0,1,0,0,54,410,IT1,+€8.56m,8
262369,288499,Ewandro Costa,2017,410,Brazil,Centre-Forward,Attack,left,176.0,1000000,...,Ewandro Costa,0,1,0,0,54,410,IT1,+€8.56m,8
262370,288499,Ewandro Costa,2017,410,Brazil,Centre-Forward,Attack,left,176.0,1000000,...,Ewandro Costa,0,1,0,0,54,410,IT1,+€8.56m,8
262371,288499,Ewandro Costa,2017,410,Brazil,Centre-Forward,Attack,left,176.0,1000000,...,Ewandro Costa,1,1,0,0,956,1465,PO1,+€8.00m,2


In [54]:
#preview the data without the name column (result is not assigned; the name columns are dropped for good in a later cell)
fdf3.drop(["name"],axis=1)

Unnamed: 0,player_id,last_season_x,current_club_id_x,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value_in_eur_y,highest_market_value_in_eur,...,player_name,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,domestic_competition_id,net_transfer_record,national_team_players
10,4042,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,2000000.0,...,Brad Jones,0,0,1,0,720,31,GB1,€-111.30m,16
11,4042,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,2000000.0,...,Brad Jones,0,0,1,0,720,31,GB1,€-111.30m,16
12,4042,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,2000000.0,...,Brad Jones,0,0,0,0,450,31,GB1,€-111.30m,16
13,4042,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,2000000.0,...,Brad Jones,0,0,0,0,450,31,GB1,€-111.30m,16
14,4042,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,2000000.0,...,Brad Jones,0,0,0,0,466,31,GB1,€-111.30m,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262395,371851,2023,410,Slovenia,Centre-Back,Defender,right,190.0,10000000,10000000.0,...,Jaka Bijol,1,2,6,0,2511,410,IT1,+€8.56m,8
262405,586756,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,3500000.0,...,Festy Ebosele,0,0,1,0,53,410,IT1,+€8.56m,8
262406,586756,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,3500000.0,...,Festy Ebosele,0,0,1,0,53,410,IT1,+€8.56m,8
262407,586756,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,3500000.0,...,Festy Ebosele,0,1,3,0,870,410,IT1,+€8.56m,8


In [55]:
#rename columns
fdf3 = fdf3.rename(columns={
    "current_club_id_x": "most_recent_club_id",
    "market_value_in_eur_y": "market_value",
    "highest_market_value_in_eur": "highest_ever_market_value",
    "domestic_competition_id": "league_id",
})

In [56]:
#preview the data with a reset index (result is not assigned, so fdf3 keeps its original index)
fdf3.reset_index(drop=True)


Unnamed: 0,player_id,name,last_season_x,most_recent_club_id,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value,...,player_name,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,league_id,net_transfer_record,national_team_players
0,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,1,0,720,31,GB1,€-111.30m,16
1,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,1,0,720,31,GB1,€-111.30m,16
2,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,0,0,450,31,GB1,€-111.30m,16
3,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,0,0,450,31,GB1,€-111.30m,16
4,4042,Brad Jones,2017,234,Australia,Goalkeeper,Goalkeeper,left,194.0,1500000,...,Brad Jones,0,0,0,0,466,31,GB1,€-111.30m,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123589,371851,Jaka Bijol,2023,410,Slovenia,Centre-Back,Defender,right,190.0,10000000,...,Jaka Bijol,1,2,6,0,2511,410,IT1,+€8.56m,8
123590,586756,Festy Ebosele,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,...,Festy Ebosele,0,0,1,0,53,410,IT1,+€8.56m,8
123591,586756,Festy Ebosele,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,...,Festy Ebosele,0,0,1,0,53,410,IT1,+€8.56m,8
123592,586756,Festy Ebosele,2023,410,Ireland,Right-Back,Defender,right,180.0,3500000,...,Festy Ebosele,0,1,3,0,870,410,IT1,+€8.56m,8


In [57]:
#drop columns that are no longer needed
fdf3=fdf3.drop(["contract_expiration_date","current_club_domestic_competition_id","current_club_name","player_name","name"],axis=1)

I've dropped these columns because they are not useful for my further analysis: either another column represents the same information better, or the column serves no purpose.

In [58]:
fdf3["net_transfer_record"].head()

10    €-111.30m
11    €-111.30m
12    €-111.30m
13    €-111.30m
14    €-111.30m
Name: net_transfer_record, dtype: object

In [59]:
fdf3["net_transfer_record"].sample(30)

253663     +€9.63m
206317      €-450k
188329    €-49.42m
110470    +€10.30m
68065      +€1.80m
80325     +€34.10m
256294    +€43.46m
47232     +€57.30m
24575     +€10.80m
260488    €-15.87m
137666    +€88.20m
13401          +-0
222792         +-0
144350         +-0
84775       +€950k
208680    +€87.87m
246153      €-500k
182998    +€31.80m
256815    +€42.02m
11989       +€850k
227855     +€2.52m
7320      +€43.46m
169833     +€8.35m
247655     €-2.10m
181591         +-0
28089          +-0
111396     +€3.61m
222126      +€100k
245280     +€3.95m
34289      +€9.43m
Name: net_transfer_record, dtype: object

My `net_transfer_record` column seems to be in the wrong format. It is stored as strings rather than the float datatype it should be, so I'll have to convert it.

I'll first have to remove the currency symbol, which should be "€" for all non-zero values.

In [60]:
#check how many values contain the euro symbol
print(fdf3["net_transfer_record"].str.contains("€").sum())

#check pound sign too
print(fdf3["net_transfer_record"].str.contains("£").sum())


105581
0


In [61]:
#check how many zero values I have
fdf3[fdf3["net_transfer_record"]=="+-0"].shape

(18013, 24)

In [62]:
fdf3.shape

(123594, 24)

105581 + 18013 adds up to 123594, which matches my row count exactly. This is good news: a value cannot be both "+-0" and contain "€", so every value is either zero or a standardized euro amount.
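The same check can be written as explicit boolean masks instead of adding the counts by hand. This is a minimal sketch on a toy series in the same string formats as the sample above (the values and the names `records`, `overlap` and `uncovered` are illustrative, not part of the dataset):

```python
import pandas as pd

# Toy values in the same formats as the sampled column (illustrative only)
records = pd.Series(["+€9.63m", "€-450k", "+-0", "€-49.42m", "+-0"])

has_euro = records.str.contains("€", regex=False)
is_zero = records == "+-0"

# No value may fall in both groups, and every value must fall in one of them
overlap = int((has_euro & is_zero).sum())
uncovered = int((~(has_euro | is_zero)).sum())

print(overlap, uncovered)
```

If both counts are zero, the two groups split the column cleanly, which is what the matching totals suggest for the real data.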

In [63]:
#remove euro symbol from my values
fdf3["net_transfer_record"]=fdf3["net_transfer_record"].str.replace("€","")

In [64]:
fdf3["net_transfer_record"].sample(20)

208016         +-0
195076       +325k
146071         +-0
233240       +586k
156842    -167.04m
199662     +14.00m
194253      +1.10m
233621     -45.00m
59786          +-0
120464         +-0
225804    -126.40m
114067      -2.05m
72230      +10.04m
49741       -1.36m
222745      -5.00m
190465     -54.76m
215707     -11.60m
95704     -146.50m
101312     +40.52m
68066       +1.80m
Name: net_transfer_record, dtype: object

That seems to have worked well.

In [65]:
#display count of where "m" is represented
print(fdf3["net_transfer_record"].str.contains("m").sum())

#display count of where "k" is represented
print(fdf3["net_transfer_record"].str.contains("k").sum())

85991
19590


Again this adds up to 105581, indicating that every non-zero value is expressed with an m (millions) or a k (thousands).

In [66]:
#replace the k suffix with e3 (a 1,000x multiplier in scientific notation)
fdf3["net_transfer_record"] = fdf3["net_transfer_record"].str.replace("k", "e3", regex=False)

#replace the m suffix with e6 (a 1,000,000x multiplier)
fdf3["net_transfer_record"] = fdf3["net_transfer_record"].str.replace("m", "e6", regex=False)

#turn the zero marker "+-0" into "+0" by stripping the "-" directly before the final 0
fdf3["net_transfer_record"] = fdf3["net_transfer_record"].str.replace("[+-](?=0$)", "", regex=True)

#convert the cleaned strings to float
fdf3["net_transfer_record"] = pd.to_numeric(fdf3["net_transfer_record"])

In [67]:
fdf3["net_transfer_record"].head(10)

10   -111300000.0
11   -111300000.0
12   -111300000.0
13   -111300000.0
14   -111300000.0
15   -111300000.0
18      2680000.0
19      2680000.0
20      -500000.0
21      -500000.0
Name: net_transfer_record, dtype: float64
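
The cleaning steps for `net_transfer_record` could also be collected into one helper. The sketch below is a hypothetical function (`parse_transfer_record` is not part of the notebook) that assumes every value is either `"+-0"` or a euro amount with a `k` or `m` suffix, as the counts above suggest. It is tried on a few toy strings in the formats seen in the samples.

```python
import pandas as pd

def parse_transfer_record(raw: pd.Series) -> pd.Series:
    """Convert strings such as '+€9.63m', '€-450k' or '+-0' to floats (hypothetical helper)."""
    cleaned = (
        raw.str.replace("€", "", regex=False)      # drop the currency symbol
           .str.replace("+-0", "0", regex=False)   # zero marker becomes a plain 0
           .str.replace("k", "e3", regex=False)    # thousands
           .str.replace("m", "e6", regex=False)    # millions
    )
    return pd.to_numeric(cleaned)

# Toy values in the formats seen in the samples (illustrative only)
sample = pd.Series(["+€9.63m", "€-450k", "+-0", "€-111.30m"])
parsed = parse_transfer_record(sample)
print(parsed)
```

Keeping the steps in one function makes it easier to rerun the conversion on a fresh copy of the data without repeating each cell.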

In [68]:
#look at my cleaned dataset on cristiano ronaldo
fdf3[fdf3["player_id"]==8198]

Unnamed: 0,player_id,last_season_x,most_recent_club_id,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value,highest_ever_market_value,...,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,league_id,net_transfer_record,national_team_players
28253,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,90000000,120000000.0,...,2012,23,4,6,0,2268,418,ES1,-122500000.0,19
28254,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,100000000,120000000.0,...,2012,23,4,6,0,2268,418,ES1,-122500000.0,19
28255,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,100000000,120000000.0,...,2013,59,17,10,1,4218,418,ES1,-122500000.0,19
28256,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,100000000,120000000.0,...,2013,59,17,10,1,4218,418,ES1,-122500000.0,19
28257,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,100000000,120000000.0,...,2014,56,21,7,1,4309,418,ES1,-122500000.0,19
28258,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,100000000,120000000.0,...,2014,56,21,7,1,4309,418,ES1,-122500000.0,19
28259,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,120000000,120000000.0,...,2014,56,21,7,1,4309,418,ES1,-122500000.0,19
28260,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,120000000,120000000.0,...,2015,54,18,5,1,4578,418,ES1,-122500000.0,19
28261,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,120000000,120000000.0,...,2015,54,18,5,1,4578,418,ES1,-122500000.0,19
28262,8198,2022,985,Portugal,Centre-Forward,Attack,right,187.0,110000000,120000000.0,...,2015,54,18,5,1,4578,418,ES1,-122500000.0,19


### Data Dictionary

1. `player_id` (int64): A unique identifier for each football player.

2. `last_season_x` (int64): The last season the player participated in.

3. `country_of_citizenship` (object): The player's country of citizenship.

4. `position` (object): The player's primary playing position (e.g., forward, midfielder, defender).

5. `sub_position` (object): The specific position the player plays within a broader position category.

6. `foot` (object): The preferred foot for playing (e.g., left, right).

7. `height_in_cm` (float64): The player's height in centimeters.

8. `market_value` (int64): The market value of the player in euros, which is the target variable for this analysis.

9. `highest_ever_market_value` (float64): The player's highest market value in euros.

10. `most_recent_club_id` (int64): The ID of the player's most recent club.

11. `date` (datetime64): A date associated with the player's record.

12. `age` (int32): The player's age, calculated from their date of birth.

13. `contract_days_left` (int64): The number of days left on the player's current contract.

14. `year` (int64): The year associated with the player's record.

15. `goals` (int32): The number of goals scored by the player that year.

16. `assists` (int32): The number of assists made by the player that year.

17. `yellow_cards` (int32): The number of yellow cards received by the player that year.

18. `red_cards` (int32): The number of red cards received by the player that year.

19. `minutes_played` (int32): The total number of minutes the player has played that year.

20. `net_transfer_record` (float64): The net transfer record of that specific club.

21. `national_team_players` (int32): The number of national team players for that specific club.

22. `month` (int64): The month associated with the player's record.

23. `player_club_id` (int64): The ID of the player's club.

24. `league_id` (object): The ID of the league.


### Data Export

In [69]:
#export the cleaned dataframe (to_pickle is a pandas method, so no separate pickle import is needed)
fdf3.to_pickle("data1/fdf3.pkl")
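
One way to check an export like this is a round trip: write the dataframe, read it back with `pd.read_pickle`, and compare. The sketch below does this with a toy dataframe in a temporary directory, so it does not touch `data1/fdf3.pkl`. The frame and file name are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataframe (illustrative values)
toy = pd.DataFrame({
    "player_id": [1, 2],
    "net_transfer_record": [-450000.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)               # write
    restored = pd.read_pickle(path)   # read back

# equals() compares values, dtypes and index
same = restored.equals(toy)
print(same)
```

Unlike a CSV export, a pickle keeps the dtypes, so the converted integer and float columns do not need to be converted again after loading.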
