# Imports

In [1]:
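# util is the project's helper module; it is expected to provide
# encode_object_columns and is_consumption, which are used below.
from util import *
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt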

In the cleaning phase we cleaned the dataset and made it usable for an XGBoost model. Here we analyze it further, focusing on the categorical features that have too many categories. Let's load the dataset and take a brief look at it.

In [2]:
data = pd.read_csv("./Dataset/cleaned_data.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16685 entries, 0 to 16684
Data columns (total 90 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   sofifa_id                    16685 non-null  int64  
 1   name                         16685 non-null  object 
 2   player_positions             16685 non-null  object 
 3   overall                      16685 non-null  int64  
 4   potential                    16685 non-null  int64  
 5   value_eur                    16685 non-null  float64
 6   wage_eur                     16685 non-null  float64
 7   club_team_id                 16685 non-null  float64
 8   league_name                  16685 non-null  object 
 9   league_level                 16685 non-null  float64
 10  club_position                16685 non-null  object 
 11  club_jersey_number           16685 non-null  float64
 12  club_contract_valid_until    16685 non-null  float64
 13  nationality_id  

# String columns

First, let's look at the number of unique values in each string column.

In [3]:
object_columns = data.select_dtypes(include=['object']).columns

for col in object_columns:
    print(f"Value counts for column '{col}':")
    print(data[col].nunique())
    print()

Value counts for column 'name':
15805

Value counts for column 'player_positions':
668

Value counts for column 'league_name':
55

Value counts for column 'club_position':
29

Value counts for column 'preferred_foot':
2

Value counts for column 'work_rate':
9

Value counts for column 'body_type':
10

Value counts for column 'real_face':
2
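The same counts can be obtained in one call, and dividing by the number of rows makes it easier to spot columns that behave almost like identifiers. The snippet below is a minimal sketch on a toy frame (the toy data is made up for illustration); it selects every non-numeric column rather than only `object` ones so that it behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame standing in for `data`; the rows are illustrative only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "preferred_foot": ["Left", "Right", "Right", "Right"],
    "overall": [70, 65, 80, 72],
})

# Number of distinct values per non-numeric column, highest first.
cardinality = toy.select_dtypes(exclude="number").nunique().sort_values(ascending=False)

# Share of rows carrying a distinct value; values near 1 mean the
# column is close to a unique identifier.
cardinality_ratio = cardinality / len(toy)

print(pd.DataFrame({"n_unique": cardinality, "ratio": cardinality_ratio}))
```

On the real dataset, `name` (15805 unique values in 16685 rows) would land near the top of such a table, followed by `player_positions`.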



# Checking player_positions effect

As we can see, the player_positions column has a large number of unique values (668). We will therefore train a model with and without it, compare the metrics, and see whether it has a significant effect. For the string variables we will try both label encoding and one-hot encoding to get more robust results.

### Checking with label encoder
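The helper `encode_object_columns` comes from `util` and its source is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in (the name `encode_object_columns_sketch` and its behavior are assumptions). It uses pandas categorical codes, which for plain string columns should give the same alphabetically sorted integer codes as scikit-learn's `LabelEncoder`.

```python
import pandas as pd

def encode_object_columns_sketch(df, columns):
    """Hypothetical stand-in for util.encode_object_columns.

    Returns a copy of df in which each listed column is replaced
    by integer codes (one code per distinct label).
    """
    encoded = df.copy()
    for col in columns:
        # Categories are sorted, so codes follow the sorted order of the labels.
        encoded[col] = encoded[col].astype("category").cat.codes
    return encoded

# Toy example; the rows are made up for illustration.
toy = pd.DataFrame({
    "work_rate": ["High/Medium", "Low/Low", "High/Medium", "Medium/Medium"],
    "overall": [80, 61, 75, 70],
})
toy_encoded = encode_object_columns_sketch(toy, ["work_rate"])
print(toy_encoded)
```

Note that integer codes impose an arbitrary order on the categories. Tree-based models such as XGBoost can usually cope with this, which is one reason label encoding is a reasonable option here.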

In [4]:
# Create a copy of the DataFrame without the 'player_positions' column

data_no_player_positions = data.drop(columns=['player_positions'])

data_encoded = encode_object_columns(data.copy(), data.select_dtypes('object'))
data_no_player_positions_encoded = encode_object_columns(data_no_player_positions.copy(), data_no_player_positions.select_dtypes('object'))

# Split the datasets into train and test sets
X = data_encoded.drop(columns=['value_eur', 'sofifa_id', 'name'])
y = data_encoded['value_eur']

X_no_player_positions = data_no_player_positions_encoded.drop(columns=['value_eur', 'sofifa_id', 'name'])
y_no_player_positions = data_no_player_positions_encoded['value_eur']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_no_pp, X_test_no_pp, y_train_no_pp, y_test_no_pp = train_test_split(X_no_player_positions, y_no_player_positions, test_size=0.2, random_state=42)

# Train XGBoost models and evaluate performance
model_with_pp = xgb.XGBRegressor()
model_without_pp = xgb.XGBRegressor()

model_with_pp.fit(X_train, y_train)
model_without_pp.fit(X_train_no_pp, y_train_no_pp)

y_pred_with_pp = model_with_pp.predict(X_test)
y_pred_without_pp = model_without_pp.predict(X_test_no_pp)

mae_with_pp = mean_absolute_error(y_test, y_pred_with_pp)
mae_without_pp = mean_absolute_error(y_test_no_pp, y_pred_without_pp)

r2_with_pp = r2_score(y_test, y_pred_with_pp)
r2_without_pp = r2_score(y_test_no_pp, y_pred_without_pp)

print(f"Model with 'player_positions': MAE = {mae_with_pp:.4f}, R² = {r2_with_pp:.4f}")
print(f"Model without 'player_positions': MAE = {mae_without_pp:.4f}, R² = {r2_without_pp:.4f}")

Model with 'player_positions': MAE = 168174.0907, R² = 0.9807
Model without 'player_positions': MAE = 164076.6727, R² = 0.9821


In [5]:
(mae_with_pp - mae_without_pp) / data['value_eur'].mean()

np.float64(0.0013645758216826991)

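As we can see, the difference is very small: the MAE with player_positions is higher by only about 0.14% of the mean value_eur, and R² is nearly identical (0.9807 vs. 0.9821). The column does not improve the model, so it is reasonable to remove it completely.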
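To make the ratio above easier to read, the arithmetic can be spelled out. The sketch below only reuses numbers printed in the two outputs above; the variable names are chosen here for illustration.

```python
# MAE values and ratio copied from the outputs above.
label_gap = 168174.0907 - 164076.6727   # MAE with the column minus MAE without it
label_ratio = 0.0013645758216826991      # the gap divided by the mean of value_eur

# The ratio is the MAE gap expressed as a fraction of the average player value.
label_gap_percent = 100 * label_ratio

# Dividing the gap by the ratio recovers the mean of value_eur that was used.
implied_mean_value = label_gap / label_ratio

print(f"MAE gap: {label_gap:.1f} EUR, i.e. {label_gap_percent:.2f}% of the mean value")
print(f"Implied mean value_eur: {implied_mean_value:,.0f} EUR")
```

A positive gap means the model that kept `player_positions` had the higher error on this split.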

### Checking with dummy variables

In [6]:
data_no_player_positions = data.drop(columns=['player_positions'])

# One-Hot Encoding the Object Columns
data_encoded = pd.get_dummies(data.drop(columns=['name']), drop_first=True)
data_no_player_positions_encoded = pd.get_dummies(data_no_player_positions.drop(columns=['name']), drop_first=True)

X = data_encoded.drop(columns=['value_eur', 'sofifa_id'])
y = data_encoded['value_eur']

X_no_player_positions = data_no_player_positions_encoded.drop(columns=['value_eur', 'sofifa_id'])
y_no_player_positions = data_no_player_positions_encoded['value_eur']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_no_pp, X_test_no_pp, y_train_no_pp, y_test_no_pp = train_test_split(X_no_player_positions, y_no_player_positions, test_size=0.2, random_state=42)

model_with_pp = xgb.XGBRegressor()
model_without_pp = xgb.XGBRegressor()

model_with_pp.fit(X_train, y_train)
model_without_pp.fit(X_train_no_pp, y_train_no_pp)

y_pred_with_pp = model_with_pp.predict(X_test)
y_pred_without_pp = model_without_pp.predict(X_test_no_pp)

mae_with_pp = mean_absolute_error(y_test, y_pred_with_pp)
mae_without_pp = mean_absolute_error(y_test_no_pp, y_pred_without_pp)

r2_with_pp = r2_score(y_test, y_pred_with_pp)
r2_without_pp = r2_score(y_test_no_pp, y_pred_without_pp)

print(f"Model with 'player_positions': MAE = {mae_with_pp:.4f}, R² = {r2_with_pp:.4f}")
print(f"Model without 'player_positions': MAE = {mae_without_pp:.4f}, R² = {r2_without_pp:.4f}")

Model with 'player_positions': MAE = 170564.1238, R² = 0.9817
Model without 'player_positions': MAE = 162074.1981, R² = 0.9815


In [7]:
(mae_with_pp - mae_without_pp) / data['value_eur'].mean()

np.float64(0.00282742628335976)

As in the previous section, the difference is small (about 0.28% of the mean value_eur, with R² of 0.9817 vs. 0.9815), so on this split removing the column does not hurt the model and we can safely drop it.
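Both comparisons rely on a single train/test split, and a gap this small could partly be an artifact of that split. One way to check its stability is K-fold cross-validation. The following is a sketch under stated assumptions: it runs on synthetic data, the helper name `cv_mae` is made up, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in for `XGBRegressor` so the snippet needs no extra dependency.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

def cv_mae(X, y, n_splits=5, seed=42):
    """Per-fold MAE of a gradient-boosted regressor under shuffled K-fold CV."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=seed)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_absolute_error")
    return -scores  # scikit-learn reports the negated MAE

# Synthetic data: `signal` drives the target, `noise_col` is irrelevant.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_col": rng.integers(0, 50, size=300),
})
toy["target"] = 3.0 * toy["signal"] + rng.normal(scale=0.1, size=300)

mae_with = cv_mae(toy[["signal", "noise_col"]], toy["target"])
mae_without = cv_mae(toy[["signal"]], toy["target"])

print(f"with:    {mae_with.mean():.4f} +/- {mae_with.std():.4f}")
print(f"without: {mae_without.mean():.4f} +/- {mae_without.std():.4f}")
```

If the gap between the two means is smaller than the fold-to-fold spread, the column is unlikely to matter for the model.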

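There is another observation worth making: the two encodings perform almost the same. With player_positions, label encoding gives the lower MAE (168,174 vs. 170,564). Without it, one-hot encoding has a slightly lower MAE (162,074 vs. 164,077), while label encoding has a slightly higher R² (0.9821 vs. 0.9815). Since the differences are this small and label encoding keeps the number of features low, we will use label encoding for the training and testing phase.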
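The practical difference between the two encodings is mostly the number of resulting features. The toy example below is illustrative only (the rows are made up, and pandas categorical codes stand in for a label encoder); it shows how the shapes differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "body_type": ["Lean", "Normal", "Stocky", "Normal"],
    "preferred_foot": ["Left", "Right", "Right", "Left"],
    "overall": [70, 65, 80, 72],
})

# Label-style encoding: each string column stays a single integer column.
label_encoded = toy.copy()
for col in ["body_type", "preferred_foot"]:
    label_encoded[col] = label_encoded[col].astype("category").cat.codes

# One-hot encoding: one indicator column per category, minus the first of each.
one_hot = pd.get_dummies(toy, drop_first=True)

print(label_encoded.shape, one_hot.shape)
print(list(one_hot.columns))
```

With a column like `league_name` (55 categories), one-hot encoding adds 54 indicator columns where label encoding keeps one.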

In [8]:
final_df = data.drop(columns=['player_positions'])

# Consumption VS. Non-Consumption

Now we have to check an important property of our dataset. We want to apply PCA to the data and use the result to predict the label. The problem is that PCA applies only to numerical columns and does not support categorical ones. Categorical columns are handled by a different method, MCA (multiple correspondence analysis). To decide which method fits a given column, we design a test for it. Of course, all genuine float and integer columns are consumption columns (this does not include columns that were merely encoded to integers).

The method we use to check whether a variable is consumption or non-consumption is as follows: take each of its unique values in turn, leave it out, and fit two PCAs on the two subsets of the main dataset that this creates. We do this for every unique value and measure the distance between the resulting eigenvectors. If the distance exceeds a threshold for any value, the variable is considered non-consumption and we fit an MCA on it. Otherwise we fit a PCA.

Note that this method uses only the numerical columns; the string columns are not included in the fitted PCAs.
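The actual `is_consumption` helper lives in `util` and is not shown here. The sketch below is one possible reading of the procedure described above, not the real implementation. Its assumptions are: the two subsets are the rows that have a given value and the rows that do not, only the first principal axis is compared, and the distance is one minus the absolute cosine between the two axes. The names `first_component`, `is_consumption_sketch`, and `min_rows` are hypothetical.

```python
import numpy as np
import pandas as pd

def first_component(X):
    """Unit-length first principal axis of X (rows are samples), via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def is_consumption_sketch(df, column, target, threshold=0.3, min_rows=5):
    """Return True if no category of `column` shifts the first principal axis
    of the numerical features by more than `threshold`."""
    # Only numerical features enter the PCA; the target is left out.
    numeric = df.select_dtypes(include="number").drop(columns=[target], errors="ignore")
    X = numeric.to_numpy(dtype=float)
    for value in df[column].unique():
        mask = (df[column] == value).to_numpy()
        # Skip categories that leave too few rows on either side.
        if mask.sum() < min_rows or (~mask).sum() < min_rows:
            continue
        v_in = first_component(X[mask])
        v_out = first_component(X[~mask])
        # Principal axes have no sign, so compare by absolute cosine.
        distance = 1.0 - abs(float(v_in @ v_out))
        if distance > threshold:
            return False
    return True

# Synthetic check: in `aligned` both groups share one axis,
# in `flipped` group "b" has its axis rotated by 90 degrees.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = rng.normal(scale=0.05, size=200)
group = np.where(np.arange(200) < 100, "a", "b")
aligned = pd.DataFrame({"x": x, "y": x + noise, "group": group, "target": x})
flipped = aligned.copy()
flipped.loc[flipped["group"] == "b", "y"] *= -1

aligned_result = is_consumption_sketch(aligned, "group", "target")
flipped_result = is_consumption_sketch(flipped, "group", "target")
print(aligned_result, flipped_result)
```

The features are not standardized in this sketch, so columns with large scales dominate the first axis; a real implementation may well scale them first.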

Now let's check which columns are consumption and which are not.

In [9]:
object_columns = final_df.select_dtypes(include=['object']).columns
for col in object_columns:
    if col == 'name':
        continue
    print(f"{col} is{'' if is_consumption(final_df, col, 'value_eur', threshold=0.3) else ' not'} a consumption column")

league_name is a consumption column
club_position is a consumption column
preferred_foot is a consumption column
work_rate is a consumption column
body_type is a consumption column
real_face is a consumption column


Since all of them are consumption columns, we now convert them to integers. Note that object_columns also contains name, so it is encoded as well.

In [10]:
final_df = encode_object_columns(final_df, object_columns)

# Save

In [11]:
final_df.to_csv('./Dataset/analyzed_data.csv', index=False)