# Preprocessing

In this notebook, I'll be preprocessing all my data to get it ready to be put into a model.

In [1]:
#load in
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np


## Data Loading

Just checking to see if my data has been correctly imported.

In [2]:
#load in dataset
df=pd.read_csv("data1/my_dataset.csv")
#look at sample
df.sample(5)

Unnamed: 0,player_id,last_season_x,most_recent_club_id,country_of_citizenship,sub_position,position,foot,height_in_cm,market_value,highest_ever_market_value,date,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,league_id,net_transfer_record,national_team_players
8794,107377,2018,28643,Croatia,Attacking Midfield,Midfield,right,189.0,3000000,4000000.0,2018-06-03,27,50,6,2018,1,3,4,0,1215,157,BE1,2520000.0,8
20847,624436,2022,1038,Spain,Defensive Midfield,Midfield,right,170.0,500000,500000.0,2023-06-15,21,597,6,2023,0,0,0,0,26,1038,IT1,1750000.0,4
24046,4391,2023,383,Netherlands,Goalkeeper,Goalkeeper,right,188.0,1250000,3000000.0,2012-01-16,28,232,1,2012,0,0,0,0,1947,383,NL1,-3530000.0,10
77775,181380,2023,114,DR Congo,Left-Back,Defender,left,179.0,10000000,10000000.0,2019-09-12,26,597,9,2019,0,1,6,0,1340,379,GB1,18840000.0,13
70353,204049,2023,3209,Albania,Attacking Midfield,Midfield,both,178.0,4000000,5000000.0,2022-03-31,25,597,3,2022,4,5,2,0,1521,2293,TR1,1800000.0,4


Check whether the datatypes have been read in correctly.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123594 entries, 0 to 123593
Data columns (total 24 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   player_id                  123594 non-null  int64  
 1   last_season_x              123594 non-null  int64  
 2   most_recent_club_id        123594 non-null  int64  
 3   country_of_citizenship     123594 non-null  object 
 4   sub_position               123594 non-null  object 
 5   position                   123594 non-null  object 
 6   foot                       123594 non-null  object 
 7   height_in_cm               123594 non-null  float64
 8   market_value               123594 non-null  int64  
 9   highest_ever_market_value  123594 non-null  float64
 10  date                       123594 non-null  object 
 11  age                        123594 non-null  int64  
 12  contract_days_left         123594 non-null  int64  
 13  month                      12

- The `player_id` column is just an identifier, so it carries no useful signal for predicting market value.
- The `most_recent_club_id` column is redundant because I already have the `player_club_id` column.
- We can also drop the `highest_ever_market_value` column, as it is far too strongly correlated with the target variable (see the EDA notebook).

We can drop all three.

In [4]:
df=df.drop(["player_id","most_recent_club_id","highest_ever_market_value"],axis=1)
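Before dropping a feature for being too close to the target, it can help to put a number on the relationship. The sketch below is only illustrative: it uses the five rows shown in the sample output earlier rather than the full dataset, and the names `sample` and `corr` are my own.

```python
import pandas as pd

# The five (market_value, highest_ever_market_value) pairs from the sample output.
sample = pd.DataFrame({
    "market_value": [3_000_000, 500_000, 1_250_000, 10_000_000, 4_000_000],
    "highest_ever_market_value": [4_000_000, 500_000, 3_000_000, 10_000_000, 5_000_000],
})

# Pearson correlation between the candidate feature and the target.
corr = sample["highest_ever_market_value"].corr(sample["market_value"])
print(f"correlation on the 5-row sample: {corr:.3f}")
```

Five rows prove nothing on their own; the full picture is in the EDA notebook. The same `.corr()` call on the whole dataframe would give the real figure.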

In [5]:
df.describe()

Unnamed: 0,last_season_x,height_in_cm,market_value,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players
count,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0,123594.0
mean,2021.424948,182.265814,5106035.0,25.03183,443.834749,6.428346,2018.691587,1.973793,1.501165,2.737706,0.068798,1308.635727,3175.928945,213839.7,6.551669
std,2.29838,6.695002,10619030.0,3.985102,471.763162,3.587796,2.976855,3.695028,2.450115,2.855257,0.265803,1010.055198,7846.777161,38197460.0,5.306323
min,2012.0,159.0,10000.0,15.0,-314.0,1.0,2012.0,0.0,0.0,0.0,0.0,1.0,3.0,-195100000.0,0.0
25%,2021.0,178.0,500000.0,22.0,232.0,3.0,2017.0,0.0,0.0,0.0,0.0,439.0,294.0,-1930000.0,2.0
50%,2023.0,183.0,1500000.0,25.0,232.0,6.0,2019.0,1.0,1.0,2.0,0.0,1143.0,873.0,150000.0,6.0
75%,2023.0,187.0,4500000.0,28.0,597.0,10.0,2021.0,2.0,2.0,4.0,0.0,1990.0,2457.0,7600000.0,10.0
max,2023.0,207.0,200000000.0,44.0,3154.0,12.0,2023.0,59.0,33.0,23.0,3.0,5070.0,83678.0,102000000.0,22.0


Everything looks good, so let's get going with the preprocessing.

First, to avoid data leakage, we will do our train/test split before any preprocessing.

## Splitting

In [6]:
#Split my X and y
y=df["market_value"]
X=df.drop(["market_value"],axis=1)

In [7]:
#import splitter
from sklearn.model_selection import train_test_split

#split between train and val_test
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.33, random_state=100)

In [8]:
#split between test and val
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=100)

In [9]:
#check shape
X_train.shape

(82807, 20)

In [10]:
#check shape
X_val.shape

(20393, 20)

In [11]:
#check shape
X_test.shape

(20394, 20)

My splitting seems to have been done correctly: the row counts of the three sets (82,807 + 20,393 + 20,394) add up to the 123,594 rows of the original dataframe.
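The split sizes can also be reproduced by hand. This is a small arithmetic sketch, assuming `train_test_split` rounds the held-out fraction up and gives the remainder to the other side. That is my understanding of its behaviour and it matches the shapes printed above. The variable names are mine.

```python
import math

n_total = 123594  # rows reported by df.info()

# First split: test_size=0.33, with the held-out size rounded up.
n_val_test = math.ceil(0.33 * n_total)
n_train = n_total - n_val_test

# Second split: test_size=0.5 of the held-out part, again rounded up.
n_test = math.ceil(0.5 * n_val_test)
n_val = n_val_test - n_test

print(n_train, n_val, n_test)
assert n_train + n_val + n_test == n_total
```

This works out to roughly a 67/16.5/16.5 split between train, validation and test.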

## Feature Encoding

Since I've already got the `month` and `year` columns, I don't need the `date` column, so I'll drop it.

In [12]:
X_train=X_train.drop(["date"],axis=1)
X_test=X_test.drop(["date"],axis=1)
X_val=X_val.drop(["date"],axis=1)
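For reference, this is how `month` and `year` can be pulled out of a string `date` column. It is only a sketch: I am assuming the two columns were derived this way in an earlier notebook, and `toy` is a made-up frame holding two dates from the sample rows shown earlier.

```python
import pandas as pd

# Two dates taken from the sample rows shown earlier.
toy = pd.DataFrame({"date": ["2018-06-03", "2012-01-16"]})

# Parse the strings, then pull out the calendar parts.
parsed = pd.to_datetime(toy["date"])
toy["month"] = parsed.dt.month
toy["year"] = parsed.dt.year

# With month and year in place, the raw string column can go.
toy = toy.drop(columns=["date"])
print(toy)
```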

I need to encode the following categorical columns before they can be processed:

- `country_of_citizenship`
- `sub_position`
- `position`
- `foot`
- `league_id`
  

We'll first look at the cardinality of all our categorical columns to decide what type of encoding we should do.

In [13]:
df.nunique()

last_season_x               12
country_of_citizenship     155
sub_position                13
position                     4
foot                         3
height_in_cm                48
market_value               223
date                      2141
age                         30
contract_days_left          89
month                       12
year                        12
goals                       51
assists                     31
yellow_cards                23
red_cards                    4
minutes_played            4079
player_club_id             426
league_id                   14
net_transfer_record        272
national_team_players       23
dtype: int64

The `foot`, `position`, `sub_position` and `league_id` columns all have low enough cardinality (3, 4, 13 and 14 unique values respectively) for one-hot encoding to be suitable.

In [14]:
#for train
encodecols = ["foot", "position", "sub_position", "league_id"]
X_train = pd.get_dummies(X_train, columns=encodecols, prefix=encodecols)

#for val
X_val=pd.get_dummies(X_val,columns=encodecols,prefix=encodecols)

#for test
X_test=pd.get_dummies(X_test,columns=encodecols,prefix=encodecols)
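Calling `pd.get_dummies` on each split separately only yields matching columns if every category appears in every split. That appears to hold here, but a guard is cheap. The sketch below uses made-up toy frames (`toy_train`, `toy_val`) and is one possible approach, not the only one.

```python
import pandas as pd

# Toy splits: the validation split never sees foot == "both".
toy_train = pd.DataFrame({"foot": ["left", "right", "both"]})
toy_val = pd.DataFrame({"foot": ["right", "left"]})

train_enc = pd.get_dummies(toy_train, columns=["foot"], prefix=["foot"])
val_enc = pd.get_dummies(toy_val, columns=["foot"], prefix=["foot"])

# Force the validation frame onto the training columns; missing dummies become 0.
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(val_enc.columns))
```

Reindexing against the training columns also drops any dummy that exists only in the validation or test split, which a model fitted on the training data could not use anyway.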

In [15]:
X_train.sample(2)

Unnamed: 0,last_season_x,country_of_citizenship,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1
63954,2023,Morocco,181.0,31,232,12,2021,1,1,2,0,889,1095,0.0,5,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
14215,2023,Netherlands,196.0,32,232,6,2021,15,2,1,0,1968,2282,21850000.0,9,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [16]:
X_test.sample(2)

Unnamed: 0,last_season_x,country_of_citizenship,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1
13438,2022,Germany,185.0,21,597,6,2023,1,0,0,0,241,44,6700000.0,7,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
35990,2023,France,187.0,23,1327,6,2023,1,4,4,0,1710,23826,88200000.0,17,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [17]:
X_val.sample(2)

Unnamed: 0,last_season_x,country_of_citizenship,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1
22114,2023,Spain,176.0,29,232,10,2020,4,4,8,0,3150,13,57300000.0,14,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
73159,2023,England,175.0,20,1327,12,2021,0,0,0,0,24,405,-60450000.0,14,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


This seems to have worked well. The only column I still need to deal with is `country_of_citizenship`, which has 155 unique values, far too many to one-hot encode directly. A likely solution is to keep the top N countries by number of rows and group every other nation into a single `Other` category. This seems viable because nationalities in this dataset aren't uniformly distributed: as seen in the EDA in the previous notebook, the top 10 nationalities account for a disproportionate share of the dataset.

In [18]:
#get the value counts for each nation
country_counts = df["country_of_citizenship"].value_counts()

#get the top 25 and their index
top_countries=country_counts.head(25).index



In [19]:
top_countries

Index(['Spain', 'France', 'Netherlands', 'Brazil', 'Turkey', 'Germany',
       'Portugal', 'Italy', 'Russia', 'England', 'Ukraine', 'Belgium',
       'Denmark', 'Greece', 'Argentina', 'Scotland', 'Serbia', 'Croatia',
       'Senegal', 'Nigeria', 'Sweden', 'Morocco', 'Uruguay', 'Cote d'Ivoire',
       'Poland'],
      dtype='object')

I need to now set everything not in that index to "Other".

In [20]:
#create lambda function setting "Other" if not in index for train,test and val
X_train["country_of_citizenship"]=X_train["country_of_citizenship"].apply(lambda x: x if x in top_countries else 'Other')
X_val["country_of_citizenship"]=X_val["country_of_citizenship"].apply(lambda x: x if x in top_countries else 'Other')
X_test["country_of_citizenship"]=X_test["country_of_citizenship"].apply(lambda x: x if x in top_countries else 'Other')
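The cell above ranks countries on the full `df`, which includes validation and test rows. A stricter variant, in keeping with the leakage concern raised before the split, ranks them on the training split only. The sketch below shows that idea on made-up toy frames. The helper `bucket` is hypothetical, and `Series.where` with `isin` is a vectorised alternative to the lambda.

```python
import pandas as pd

# Toy stand-ins for the real splits.
toy_train = pd.DataFrame(
    {"country_of_citizenship": ["Spain", "Spain", "Spain", "France", "France", "Peru"]}
)
toy_val = pd.DataFrame({"country_of_citizenship": ["France", "Chile", "Spain"]})

# Rank countries using the training rows only (top 2 here, 25 in the notebook).
top = toy_train["country_of_citizenship"].value_counts().head(2).index


def bucket(frame: pd.DataFrame) -> pd.Series:
    """Keep countries in `top`; replace everything else with 'Other'."""
    col = frame["country_of_citizenship"]
    return col.where(col.isin(top), "Other")


toy_train["country_of_citizenship"] = bucket(toy_train)
toy_val["country_of_citizenship"] = bucket(toy_val)
print(toy_val["country_of_citizenship"].tolist())
```

With a list as dominated by the big nations as this one, the top 25 would probably come out much the same either way, so this is more about habit than about changing the result.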

In [21]:
X_train.head()

Unnamed: 0,last_season_x,country_of_citizenship,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1
76575,2022,Italy,184.0,24,597,6,2018,0,0,0,0,23,5,-45000000.0,15,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
115759,2021,Spain,180.0,20,-134,1,2015,1,0,0,0,161,150,43500000.0,7,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
38683,2022,Other,187.0,26,232,6,2018,2,0,6,0,810,618,12850000.0,3,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
44111,2023,Spain,183.0,18,597,2,2020,0,0,1,0,281,1108,-6300000.0,3,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
90042,2022,Other,181.0,22,1327,10,2017,0,0,4,0,1211,618,12850000.0,3,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


I can now perform one hot encoding without excessively increasing dimensionality.

In [22]:


#train
X_train = pd.get_dummies(X_train, columns=["country_of_citizenship"], prefix="Country")

#val
X_val = pd.get_dummies(X_val, columns=["country_of_citizenship"], prefix="Country")

#test
X_test = pd.get_dummies(X_test, columns=["country_of_citizenship"], prefix="Country")


In [23]:
X_train.sample(5)

Unnamed: 0,last_season_x,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1,Country_Argentina,Country_Belgium,Country_Brazil,Country_Cote d'Ivoire,Country_Croatia,Country_Denmark,Country_England,Country_France,Country_Germany,Country_Greece,Country_Italy,Country_Morocco,Country_Netherlands,Country_Nigeria,Country_Other,Country_Poland,Country_Portugal,Country_Russia,Country_Scotland,Country_Senegal,Country_Serbia,Country_Spain,Country_Sweden,Country_Turkey,Country_Ukraine,Country_Uruguay
2293,2023,181.0,26,597,5,2019,1,2,8,1,2635,265,-7000000.0,10,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
94005,2023,184.0,27,962,3,2022,0,1,0,0,1080,468,-650000.0,2,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
114119,2019,187.0,32,-134,12,2020,0,0,2,0,753,589,9430000.0,4,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
36830,2018,197.0,28,202,10,2015,0,0,1,1,1992,720,34100000.0,7,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
50964,2023,185.0,24,962,7,2020,1,0,2,0,1338,157,2520000.0,8,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


I've now completed my feature encoding and can move on to the next step of my preprocessing.

## Scaling Data

I'll use `StandardScaler`. There will be quite a few outliers, since bigger teams in better leagues have disproportionately large funds to spend on players, and min-max scaling is very sensitive to such extreme values.
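To make that reasoning concrete, here is a small NumPy sketch comparing the two formulas on a made-up array with one extreme value. The numbers are invented for illustration. The standardisation line uses the mean and the population standard deviation, which is what `StandardScaler` computes.

```python
import numpy as np

# Toy values (think transfer balances in millions) with one very large entry.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: the range is set entirely by the two extremes.
min_max = (values - values.min()) / (values.max() - values.min())

# Standardisation: subtract the mean, divide by the population standard deviation.
z_scores = (values - values.mean()) / values.std()

print(np.round(min_max, 3))
print(np.round(z_scores, 3))
```

With min-max scaling the outlier pins the top of the range at 1 and squeezes the ordinary values close to 0. Standardisation is still influenced by the outlier through the mean and standard deviation, but the output is not tied to a fixed interval defined by it.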

In [24]:
from sklearn.preprocessing import StandardScaler
#instantiate
scaler=StandardScaler()

#fit and transform
#train
X_train_s=scaler.fit_transform(X_train)

#val
X_val_s=scaler.transform(X_val)


#test
X_test_s=scaler.transform(X_test)




The scaled sets are now NumPy arrays, and I want them back in dataframe format.

In [25]:
#train
X_train_s = pd.DataFrame(X_train_s, columns=X_train.columns)

#val
X_val_s = pd.DataFrame(X_val_s, columns=X_val.columns)


#test
X_test_s = pd.DataFrame(X_test_s, columns=X_test.columns)
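One thing to be aware of: rebuilding the frames this way gives them a fresh 0 to n-1 index, while `y_train`, `y_val` and `y_test` keep their original row labels. The rows are still in the same order, so positional use is fine, but anything that aligns on the index would not match up. A possible variant is to pass `index=` as well, sketched here on made-up toy data with a hand-computed stand-in for the scaler output.

```python
import numpy as np
import pandas as pd

# Toy frame with a shuffled index, like a split produced by train_test_split.
toy_X = pd.DataFrame({"age": [24, 31, 19]}, index=[76575, 115759, 38683])
toy_y = pd.Series([500_000, 3_000_000, 1_250_000], index=toy_X.index)

# Stand-in for the array a scaler returns: values only, no index.
scaled = ((toy_X - toy_X.mean()) / toy_X.std(ddof=0)).to_numpy()

# Passing index= keeps the original row labels on the rebuilt frame.
toy_X_s = pd.DataFrame(scaled, columns=toy_X.columns, index=toy_X.index)

print(toy_X_s.index.equals(toy_y.index))
```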



In [26]:
X_train_s.sample(10)

Unnamed: 0,last_season_x,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1,Country_Argentina,Country_Belgium,Country_Brazil,Country_Cote d'Ivoire,Country_Croatia,Country_Denmark,Country_England,Country_France,Country_Germany,Country_Greece,Country_Italy,Country_Morocco,Country_Netherlands,Country_Nigeria,Country_Other,Country_Poland,Country_Portugal,Country_Russia,Country_Scotland,Country_Senegal,Country_Serbia,Country_Spain,Country_Sweden,Country_Turkey,Country_Ukraine,Country_Uruguay
41815,-1.923271,0.709356,0.240506,-1.223367,-1.51199,-0.230274,-0.532255,-0.205736,0.445709,-0.258501,-0.187111,-0.32362,-0.002528,-0.673275,-0.203311,1.731702,-1.56557,-0.635055,1.407452,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,3.568263,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,2.970924,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,3.394943,-0.100928,-0.23248,-0.19552,-0.099514
54035,0.686577,0.559756,-0.511092,1.872191,0.718179,1.113205,0.547121,0.201472,-0.960436,-0.258501,1.10377,-0.305559,-2.842402,1.026196,-0.203311,-0.577466,0.638745,-0.635055,-0.710504,-0.297908,1.54358,-0.278415,2.686375,-0.471025,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,3.256646,-0.239946,-0.322019,-0.305838,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,4.990949,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
26869,0.251602,-1.085852,2.996366,-0.447889,1.275721,1.113205,-0.532255,0.201472,1.500317,-0.258501,0.227001,-0.019541,-0.002528,-1.050935,-0.203311,-0.577466,0.638745,-0.635055,-0.710504,-0.297908,1.54358,-0.278415,2.686375,-0.471025,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,4.342556,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
81154,-1.053322,1.906162,1.49317,-1.223367,1.554492,0.105595,-0.532255,0.201472,-0.257364,-0.258501,1.142408,-0.036705,-0.002528,-0.862105,-0.203311,-0.577466,0.638745,-0.635055,1.407452,-0.297908,-0.647845,-0.278415,-0.372249,2.123028,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,3.656669,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,9.147116,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
46079,-1.053322,0.709356,0.491039,-1.223367,-0.118134,-1.909624,-0.262411,-0.612944,-0.6089,-0.258501,-0.971745,-0.38856,-0.381701,1.403856,-0.203311,-0.577466,0.638745,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,-0.299443,-0.252851,-0.186729,3.3398,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,4.301436,-0.19552,-0.099514
13221,0.686577,1.45736,-0.761625,1.098831,-0.118134,-0.566144,-0.532255,-0.205736,-0.960436,-0.258501,-0.670572,-0.334891,-0.193422,1.026196,-0.203311,-0.577466,0.638745,-0.635055,1.407452,-0.297908,-0.647845,-0.278415,-0.372249,2.123028,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,3.269702,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,9.778766,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
27718,0.686577,1.008558,-0.511092,1.098831,1.554492,0.777335,-0.532255,-0.612944,0.797245,-0.258501,-0.088041,-0.392915,-0.066595,-0.673275,4.918564,-0.577466,-1.56557,-0.635055,1.407452,-0.297908,-0.647845,-0.278415,-0.372249,2.123028,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,3.269702,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,8.257044,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
48679,-0.183373,-0.188248,1.743703,-1.223367,-0.118134,-0.230274,0.007433,-0.205736,1.148781,-0.258501,0.434058,-0.349877,-0.875933,1.403856,4.918564,-0.577466,-1.56557,-0.635055,1.407452,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,3.568263,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,3.10541,-0.305838,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,6.644625,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
19503,0.686577,-0.038647,-1.26269,0.325471,-0.118134,-0.566144,-0.532255,-0.612944,-0.960436,-0.258501,-1.217438,-0.32362,-0.002528,-0.673275,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,2.970924,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,3.394943,-0.100928,-0.23248,-0.19552,-0.099514
31533,-0.618347,-1.085852,-0.010027,-1.223367,-1.233219,-0.902014,1.626497,1.015889,-0.257364,-0.258501,1.115659,-0.354104,-0.094837,0.648535,-0.203311,-0.577466,0.638745,-0.635055,-0.710504,-0.297908,1.54358,3.591767,-0.372249,-0.471025,-0.409022,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,3.390486,-0.299443,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,4.311077,-0.102262,-0.519941,-0.094699,-0.223601,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514


In [27]:
#single out Cristiano Ronaldo using scaled data
X_train_s.loc[
    (X_train_s['position_Attack'] > 1.0) & (X_train_s['position_Attack'] < 2.0) &
    (X_train_s['Country_Portugal'] > 4.0) & (X_train_s['Country_Portugal'] < 5.0) &
    (X_train_s['foot_right'] > -1.6) & (X_train_s['foot_right'] < -1.5)&
    (X_train_s["height_in_cm"]>0.7) & (X_train_s["height_in_cm"]<0.71)
]



Unnamed: 0,last_season_x,height_in_cm,age,contract_days_left,month,year,goals,assists,yellow_cards,red_cards,minutes_played,player_club_id,net_transfer_record,national_team_players,foot_both,foot_left,foot_right,position_Attack,position_Defender,position_Goalkeeper,position_Midfield,sub_position_Attacking Midfield,sub_position_Central Midfield,sub_position_Centre-Back,sub_position_Centre-Forward,sub_position_Defensive Midfield,sub_position_Goalkeeper,sub_position_Left Midfield,sub_position_Left Winger,sub_position_Left-Back,sub_position_Right Midfield,sub_position_Right Winger,sub_position_Right-Back,sub_position_Second Striker,league_id_BE1,league_id_DK1,league_id_ES1,league_id_FR1,league_id_GB1,league_id_GR1,league_id_IT1,league_id_L1,league_id_NL1,league_id_PO1,league_id_RU1,league_id_SC1,league_id_TR1,league_id_UKR1,Country_Argentina,Country_Belgium,Country_Brazil,Country_Cote d'Ivoire,Country_Croatia,Country_Denmark,Country_England,Country_France,Country_Germany,Country_Greece,Country_Italy,Country_Morocco,Country_Netherlands,Country_Nigeria,Country_Other,Country_Poland,Country_Portugal,Country_Russia,Country_Scotland,Country_Senegal,Country_Serbia,Country_Spain,Country_Sweden,Country_Turkey,Country_Ukraine,Country_Uruguay
1122,0.686577,0.709356,-0.761625,1.098831,0.160637,-1.573754,-0.262411,-0.612944,0.445709,-0.258501,-0.669582,-0.092679,0.204056,-0.673275,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
3336,0.686577,0.709356,-0.010027,1.098831,-0.118134,-0.566144,1.356653,0.60868,0.094172,-0.258501,0.156662,-0.265468,-0.177732,-0.106785,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
10245,0.686577,0.709356,-0.761625,1.098831,-1.233219,-1.237884,-0.532255,-0.612944,-0.6089,-0.258501,-1.047038,-0.092679,0.204056,-0.673275,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
14562,0.686577,0.709356,-0.010027,1.098831,1.275721,-0.566144,1.356653,0.60868,0.094172,-0.258501,0.156662,-0.265468,-0.177732,-0.106785,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
26416,0.686577,0.709356,1.49317,1.098831,-0.118134,1.449075,2.436029,2.237513,0.797245,-0.258501,0.490527,-0.360124,1.814888,0.270875,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
27580,0.686577,0.709356,0.491039,1.098831,-0.118134,0.105595,3.515405,0.60868,0.797245,-0.258501,0.956156,-0.265468,-0.177732,-0.106785,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
40603,0.686577,0.709356,0.741572,1.098831,0.439408,0.441465,5.134469,3.459137,1.500317,-0.258501,1.822028,-0.265468,-0.177732,-0.106785,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
55824,0.686577,0.709356,0.992104,1.098831,-0.396905,0.777335,2.705873,2.644721,2.20339,-0.258501,1.802214,-0.265468,-0.177732,-0.106785,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
58852,0.686577,0.709356,1.242637,1.098831,-0.396905,1.113205,3.245561,2.644721,1.851853,-0.258501,1.121603,-0.360124,1.814888,0.270875,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
69160,0.686577,0.709356,1.242637,1.098831,-0.954448,1.449075,2.436029,2.237513,0.797245,-0.258501,0.490527,-0.360124,1.814888,0.270875,-0.203311,1.731702,-1.56557,1.574668,-0.710504,-0.297908,-0.647845,-0.278415,-0.372249,-0.471025,2.444859,-0.308428,-0.297908,-0.08579,-0.270477,-0.280248,-0.088741,-0.272835,-0.297018,-0.08162,-0.273473,-0.214427,-0.336596,-0.302735,-0.307064,-0.239946,-0.322019,-0.305838,-0.294943,3.339533,-0.252851,-0.186729,-0.299419,-0.210077,-0.150498,-0.195354,-0.230279,-0.09586,-0.109324,-0.188041,-0.200363,-0.250382,-0.222138,-0.182426,-0.215073,-0.101536,-0.231961,-0.102262,-0.519941,-0.094699,4.472249,-0.205683,-0.141303,-0.10675,-0.121109,-0.294556,-0.100928,-0.23248,-0.19552,-0.099514
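The range filters in the cell above work because a one-hot column only ever takes two values after standardisation: one for the original 0s and one for the original 1s. For example, `foot_both` shows up as either -0.203311 or 4.918564 in the output above. The sketch below shows why, using a made-up column in which 20% of the rows are 1.

```python
import numpy as np

# A one-hot column where 20% of the rows are 1 (made-up proportion).
column = np.array([1, 0, 0, 0, 0], dtype=float)

# Same formula StandardScaler applies: (x - mean) / population std.
scaled = (column - column.mean()) / column.std()

# Only two distinct values survive: one for the 0s and one for the 1s.
low, high = np.unique(scaled)
print(low, high)
```

The product of the two values is always -1, which is a handy sanity check: -0.203311 × 4.918564 is -1 to within rounding. If reading the scaled numbers gets awkward, `scaler.inverse_transform` should recover the original units.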


The data looks correctly scaled. We can now move on to our baseline linear model.

## Data Export

In [29]:
import pickle
df.to_pickle("data1/df.pkl")
X_train_s.to_pickle('data1/X_train_s.pkl')
X_val_s.to_pickle('data1/X_val_s.pkl')
X_test_s.to_pickle('data1/X_test_s.pkl')
y_train.to_pickle('data1/y_train.pkl')
y_val.to_pickle('data1/y_val.pkl')
y_test.to_pickle('data1/y_test.pkl')

In [29]:
import joblib

# Save the fitted scaler to a file
scaler_filename = 'myscaler.joblib'
joblib.dump(scaler, scaler_filename)

print(f"Scaler saved to {scaler_filename}")


Scaler saved to myscaler.joblib
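In the next notebook, the pickled frames can be read back with `pd.read_pickle` and the scaler with `joblib.load("myscaler.joblib")`. As a quick sanity check that a pickle round-trip preserves a frame, here is a sketch on made-up data. The file name `toy.pkl` and the temporary directory are hypothetical, so nothing in `data1/` is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for X_train_s.
toy = pd.DataFrame({"age": [0.24, -0.51], "goals": [-0.53, 0.55]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.pkl"  # hypothetical file name
    toy.to_pickle(path)
    # read_pickle restores the frame with its dtypes, columns and index.
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```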
