# FIFA PLAYER PRICE PREDICTION

Authors :

- Daniel FU : fu.2121690@studenti.uniroma1.it / Matricola : 2121690
- Marvin BERGER : berger.2117502@studenti.uniroma1.it / Matricola : 2117502
- Lukas WEIGMANN : weigmann.2117702@studenti.uniroma1.it / Matricola : 2117702
- Srinjan GHOSH : ghosh.2053796@studenti.uniroma1.it / Matricola : 2053796
- Agnese LORELLI : lorelli.1966415@studenti.uniroma1.it / Matricola : 1966415

## Abstract

## Introduction

This is the final project of the course **"Foundations of Data Science"**. Our goal is to predict the market value of a player in the game FIFA. We chose this topic because of our interest in soccer and because the player data in the FIFA games is closely related to real-world market data. This lets us develop an approach for predicting players' market values while still having plenty of additional data to experiment with. We found a dataset on **Kaggle**: [FIFA](https://www.kaggle.com/datasets/stefanoleone992/fifa-23-complete-player-dataset?select=male_players.csv). It contains players' characteristics from FIFA 15 to FIFA 23, with 110 features and over 1 million instances.
We want to optimize several models for predicting the player price, compare them by their Root Mean Squared Error (RMSE) and R^2 values, and draw conclusions about the importance of the features for such a prediction.

## Team assignments

We decided to split the workload :

- Daniel FU : XGBoost + Feature Importance
- Marvin BERGER : Feature Engineering + Linear Regression
- Lukas WEIGMANN : Feature Engineering + Linear Regression + Plot
- Srinjan GHOSH : Feature Engineering + Linear and Polynomial Regression
- Agnese LORELLI : Decision Tree + Random Forest

### Import Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

## 1. Dataset Loading and Pre-processing

In [2]:
DATASET_PATH = "./male_players.csv"
dataset_df = pd.read_csv(DATASET_PATH)
dataset_df.head()

FileNotFoundError: [Errno 2] No such file or directory: './male_players.csv'

Let's have a look at the columns we have to work with.

In [None]:
dataset_df.columns.to_list()

The dataset has too many features, so we reduce them through **Feature Engineering**:
we take the mean of related attributes, so that each attribute group is summarized by a single value.

In [None]:
dataset_df['attacking_mean'] = np.mean(dataset_df[['attacking_crossing',
                                       'attacking_finishing',
                                       'attacking_heading_accuracy',
                                       'attacking_short_passing',
                                       'attacking_volleys']], axis=1)

dataset_df['skill_mean'] = np.mean(dataset_df[['skill_dribbling',
                                   'skill_curve',
                                   'skill_fk_accuracy',
                                   'skill_long_passing',
                                   'skill_ball_control']], axis=1)

dataset_df['movement_mean'] = np.mean(dataset_df[['movement_acceleration',
                                      'movement_sprint_speed',
                                      'movement_agility',
                                      'movement_reactions',
                                      'movement_balance']], axis=1)

dataset_df['power_mean'] = np.mean(dataset_df[['power_shot_power',
                                   'power_jumping',
                                   'power_stamina',
                                   'power_strength',
                                   'power_long_shots']], axis=1)

dataset_df['mentality_mean'] = np.mean(dataset_df[['mentality_aggression',
                                       'mentality_interceptions',
                                       'mentality_positioning',
                                       'mentality_vision',
                                       'mentality_penalties',
                                       'mentality_composure']], axis=1)

dataset_df['defending_mean'] = np.mean(dataset_df[['defending_marking_awareness',
                                       'defending_standing_tackle',
                                       'defending_sliding_tackle']], axis=1)

dataset_df['goalkeeping_mean'] = np.mean(dataset_df[['goalkeeping_diving',
                                         'goalkeeping_handling',
                                         'goalkeeping_kicking',
                                         'goalkeeping_positioning',
                                         'goalkeeping_reflexes',
                                         'goalkeeping_speed']], axis=1)

columns_to_remove = ['attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy',
                      'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve',
                      'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control',
                      'movement_acceleration', 'movement_sprint_speed', 'movement_agility',
                      'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping',
                      'power_stamina', 'power_strength', 'power_long_shots', 'mentality_aggression',
                      'mentality_interceptions', 'mentality_positioning', 'mentality_vision',
                      'mentality_penalties', 'mentality_composure', 'defending_marking_awareness',
                      'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving',
                      'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning',
                      'goalkeeping_reflexes', 'goalkeeping_speed']

# Remove the original columns
dataset_df.drop(columns=columns_to_remove, inplace=True)

# Display the modified DataFrame
dataset_df.head()
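The group means above could also be computed by column prefix instead of spelling out every column list. The following is only a minimal sketch on a made-up toy frame: `add_group_means` is a hypothetical helper, not part of the notebook, and it averages every column that shares a prefix, which may not match the hand-picked lists above exactly.

```python
import pandas as pd

def add_group_means(df, prefixes):
    """Return a copy of df where each prefix group is replaced by its row-wise mean."""
    out = df.copy()
    for prefix in prefixes:
        # Collect all columns that belong to this attribute group
        cols = [c for c in out.columns if c.startswith(prefix + "_")]
        out[prefix + "_mean"] = out[cols].mean(axis=1)
        out = out.drop(columns=cols)
    return out

# Toy data (made-up values), only to illustrate the mechanics
toy = pd.DataFrame({
    "attacking_crossing": [60, 80],
    "attacking_finishing": [70, 90],
    "skill_curve": [50, 40],
    "skill_dribbling": [70, 60],
    "value_eur": [1_000_000, 5_000_000],
})
reduced = add_group_means(toy, ["attacking", "skill"])
print(reduced)
```

Returning a copy instead of modifying the frame in place makes it easier to rerun the cell without side effects.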


We use label encoding to encode the categorical variables that might influence the market value of a player.

In [None]:
dataset_df['preferred_foot'] = LabelEncoder().fit_transform(dataset_df['preferred_foot'])
dataset_df['work_rate'] = LabelEncoder().fit_transform(dataset_df['work_rate'])
dataset_df['body_type'] = LabelEncoder().fit_transform(dataset_df['body_type'])

Let's have a look at the values of the changed attributes.

In [None]:
dataset_df[['preferred_foot', 'work_rate', 'body_type']]
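As a small illustration of what label encoding does, the sketch below encodes a made-up series of preferred feet. `LabelEncoder` assigns integers following the sorted order of the labels, so the resulting numbers carry no meaning beyond identity, which is worth keeping in mind when a linear model treats them as numeric values. The toy values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values
feet = pd.Series(["Right", "Left", "Right", "Left"])

encoder = LabelEncoder()
codes = encoder.fit_transform(feet)

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)        # {'Left': 0, 'Right': 1}
print(codes.tolist()) # [1, 0, 1, 0]
```

Keeping the fitted encoder around (instead of creating a new one per column and discarding it) allows the codes to be mapped back with `inverse_transform` later.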

Let's now visualize how the attributes might be correlated with our dependent variable **value_eur**.

In [None]:
selected_column = 'value_eur'

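# numeric_only=True skips string columns; recent pandas versions raise an error on them otherwise
correlations = dataset_df.corrwith(dataset_df[selected_column], numeric_only=True)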


plt.figure(figsize=(20, 10))
sns.barplot(x=correlations.index, y=correlations.values, palette='viridis')
plt.xticks(rotation=45, ha="right")

# Set plot labels and title
plt.xlabel('Columns')
plt.ylabel('Correlation Coefficient')
plt.title(f'Correlation of Column {selected_column} with Other Columns')

Notice that some features have disappeared: the correlation is only computed for numeric columns, so features whose values are strings (e.g. player_url, player_face_url ...) are left out. We now have **39 features**.

Let's remove the unnecessary columns, i.e. those whose absolute correlation with the target is below a given threshold.

In [None]:
def keepOnlyDataOverAThreshold(data, selected_column, threshold):
    # Use the DataFrame passed as argument, not the global dataset_df
    correlations = data.corrwith(data[selected_column], numeric_only=True)
    # Keep only the columns whose absolute correlation exceeds the threshold
    columns_to_keep = correlations[correlations.abs() > threshold].index.to_list()
    columns_to_delete = list(set(data.columns.to_list()) - set(columns_to_keep))
    return data.drop(columns=columns_to_delete)

In [None]:
dataset_df = keepOnlyDataOverAThreshold(dataset_df, 'value_eur', 0.1)
dataset_df.head()

In [None]:
dataset_df.shape

Let's remove all the rows that contain N/A values, which leaves about **250k instances**.

In [None]:
dataset_df = dataset_df.dropna()

In [None]:
dataset_df.shape

We drop the feature `release_clause_eur` because it is almost perfectly correlated with the target (correlation close to 1), and we want to avoid predicting a player's price from what is essentially the market value itself. Whether to keep it was debated within the team, but our results showed that it is better to remove it.

In [None]:
dataset_df = dataset_df.drop(columns="release_clause_eur")

Let's scale the dataset

In [None]:
from sklearn.preprocessing import StandardScaler

columns = dataset_df.columns.to_list()

scaler = StandardScaler()
dataset_df[columns] = scaler.fit_transform(dataset_df[columns])

dataset_df.head()
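Because the target `value_eur` is standardized together with the features, the error values reported later are expressed in standard deviations of `value_eur`, not in euros. The sketch below shows how such a standardized RMSE could be converted back to euros, assuming a separate scaler is kept for the target. `y_scaler` and all numbers are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up player values in euros (one column, as StandardScaler expects 2D input)
y_eur = np.array([[1e6], [2e6], [3e6], [6e6]])

# Hypothetical separate scaler for the target only
y_scaler = StandardScaler().fit(y_eur)
y_std = y_scaler.transform(y_eur)

# Pretend every prediction is off by 0.1 standard deviations
pred_std = y_std + 0.1
rmse_std = np.sqrt(np.mean((pred_std - y_std) ** 2))

# RMSE scales linearly with the target, so multiplying by the std gives euros
rmse_eur = rmse_std * y_scaler.scale_[0]

# Cross-check by transforming the predictions back first
pred_eur = y_scaler.inverse_transform(pred_std)
rmse_eur_direct = np.sqrt(np.mean((pred_eur - y_eur) ** 2))

print(f"RMSE (standardized): {rmse_std:.3f}")
print(f"RMSE (euros): {rmse_eur:,.0f}")
```

A related design choice, not applied here, would be to fit the scaler on the training split only, so that no statistics of the test set influence the preprocessing.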

## 2. Training Various Models

We decided to use four approaches for our task.
First, we use regression (linear, SGD and polynomial). We expect it might not perform that well, since many of our features are only weakly correlated with the label; on the other hand, we have many features and a lot of training data.
Second and third, we use regression trees: a single regression tree in the first variant and a Random Forest of regression trees in the second. We expect the Random Forest in particular to perform well.
Lastly, we use a gradient boosting model, which we also expect to perform very well.


First let's perform the train test split

In [None]:
y = dataset_df["value_eur"]
X = dataset_df.drop("value_eur", axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

### 2.1. Linear Regression

In [None]:
def linearReg(X_train, X_test, y_train, y_test):
  model = LinearRegression()
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  mse = mean_squared_error(y_test, y_pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(y_test, y_pred)

  print(f'Mean Squared Error: {mse}')
  print(f'Root Mean Squared Error: {rmse}')
  print(f'R-squared: {r2}')

  return rmse,r2

In [None]:
rmse_lin,r2_lin = linearReg(X_train, X_test, y_train, y_test)

### 2.2. Stochastic Gradient Descent Regressor

In [None]:
def sgdRegressor(X_train, X_test, y_train, y_test, iterations):
  model = SGDRegressor(max_iter=iterations)
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  mse = mean_squared_error(y_test, y_pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(y_test, y_pred)

  print(f'Mean Squared Error: {mse}')
  print(f'Root Mean Squared Error: {rmse}')
  print(f'R-squared: {r2}')

In [None]:
sgdRegressor(X_train, X_test, y_train, y_test, 10000)

### 2.3. Polynomial Regression

In [None]:
small_X = X
poly_features = PolynomialFeatures(degree=3, include_bias=False)\
                                            .fit_transform(small_X)

X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.2, random_state=42)
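The number of columns produced by a polynomial expansion grows quickly with the number of input features, which matters for memory and training time on a dataset of this size. The sketch below compares a closed-form count with the output of `PolynomialFeatures` on a tiny made-up matrix. `n_poly_features` is a hypothetical helper, and the figure of 30 input features is only an assumed example, not the notebook's actual feature count.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_features + degree, degree) - 1

# Tiny made-up matrix: 2 samples, 4 features
X_toy = np.arange(8, dtype=float).reshape(2, 4)
expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_toy)

print(expanded.shape[1])       # 34 columns for 4 input features
print(n_poly_features(4, 3))   # 34, matches the expansion above
print(n_poly_features(30, 3))  # 5455 columns for an assumed 30 input features
```

If the expansion turns out to be too large, a lower degree or a subset of the features would be possible ways to reduce it.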

Linear Regression with polynomial features

In [None]:
linearReg(X_train, X_test, y_train, y_test)

SGD regression with polynomial features

In [None]:
sgdRegressor(X_train, X_test, y_train, y_test, 10000)

### 2.4. Regression Tree and Random Forest

Let's first reinitialize the train and test datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# create a regressor object 
regressor = DecisionTreeRegressor(random_state = 0)  
  
# fit the regressor with X and Y data 
regressor.fit(X_train, y_train) 

#test the regressor
y_reg = regressor.predict(X_test)

# evaluate model
# np.sqrt works across scikit-learn versions; the `squared` argument was removed in recent releases
rmse_DT = np.sqrt(mean_squared_error(y_test, y_reg))
r2_DT = r2_score(y_test, y_reg)
print(f'RMSE= {rmse_DT}, R2= {r2_DT}')

In [None]:
label_encoder = LabelEncoder()
x_categorical = dataset_df.select_dtypes(include=['object']).apply(label_encoder.fit_transform)
x_numerical = dataset_df.select_dtypes(exclude=['object']).values
x = pd.concat([pd.DataFrame(x_numerical), x_categorical], axis=1).values

# Fitting Random Forest Regression to the dataset
regressor_rf = RandomForestRegressor(n_estimators=10, random_state=0, oob_score=True)

# Fit the regressor with x and y data
regressor_rf.fit(X_train, y_train)

#test the regressor
y_reg_rf = regressor_rf.predict(X_test)

# evaluate model
# np.sqrt works across scikit-learn versions; the `squared` argument was removed in recent releases
rmse_RF = np.sqrt(mean_squared_error(y_test, y_reg_rf))
r2_RF = r2_score(y_test, y_reg_rf)
print(f'RMSE= {rmse_RF}, R2= {r2_RF}')
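The forest above is created with `oob_score=True`, but the out-of-bag estimate is never read. As a minimal sketch on synthetic data (the data and the variable names `X_toy`, `y_toy` and `forest` are made up for illustration), the attribute `oob_score_` gives an R^2 estimate computed only from samples that each tree did not see during training:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# More trees than in the notebook so that every sample is out-of-bag at least once
forest = RandomForestRegressor(n_estimators=50, random_state=0, oob_score=True)
forest.fit(X_toy, y_toy)

# R^2 estimated on out-of-bag samples, without touching a test set
print(f"OOB R^2: {forest.oob_score_:.3f}")
```

Comparing this value with the test-set R^2 can serve as a quick sanity check on the reported score.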

### 2.5. XGBoost

In [None]:
# Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Hyperparameter grid
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 4, 5]
}

# Create a XGBRegressor instance
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',random_state=42)

# Create GridSearchCV object
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Get the best model
best_xgb_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred_best = best_xgb_model.predict(X_test)

# Metrics (np.sqrt works across scikit-learn versions)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_best))
r2_xgb = r2_score(y_test, y_pred_best)

print(f"RMSE: {rmse_xgb}")
print(f"R² Score: {r2_xgb}")
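With `scoring='neg_mean_squared_error'`, the grid search reports the negated MSE, so `best_score_` is negative and has to be converted before it can be compared with an RMSE. The sketch below illustrates this on synthetic data. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without xgboost, and the data, the reduced grid and the name `toy_grid` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.1, size=120)

toy_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),  # stand-in for XGBRegressor
    {"learning_rate": [0.01, 0.1], "n_estimators": [50, 100]},
    cv=3,
    scoring="neg_mean_squared_error",
)
toy_grid.fit(X_toy, y_toy)

# best_score_ is the mean negated MSE over the folds; negate it and take the root
cv_rmse = np.sqrt(-toy_grid.best_score_)
print("Best parameters:", toy_grid.best_params_)
print(f"Root of the mean cross-validated MSE: {cv_rmse:.3f}")
```

This cross-validated value is computed on the training folds only, so it complements rather than replaces the test-set RMSE.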

#### 2.5.1. Feature Importance

In [None]:
#Use Matplotlib and Importance

import matplotlib.pyplot as plt

# Get Importances from features
feature_importances = best_xgb_model.feature_importances_

# X axis
feature_names = X.columns 

# Create a DataFrame
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sort by value (descending)
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print(feature_importance_df)

# Plot graph
plt.figure(figsize=(8, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importances')
plt.show()

## 3. Results and Metrics

### Feature importance
We consider the feature importances of the gradient boosting model. By far the most important feature is `overall`, a rating specific to the FIFA game. The next most important features are potential, the remaining time on the player's contract with the club, the weak foot and the shooting ability.
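One possible way to cross-check such a ranking is permutation importance, which measures how much the score drops when a single feature is shuffled. This is a different technique from the built-in importances used above, and the sketch below applies it to a random forest on synthetic data in which only the first feature matters. All names and data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends only on feature 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

toy_model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_toy, y_toy)

# Shuffle each feature several times and record the average drop in R^2
# (computed on the training data here for brevity; held-out data is preferable)
result = permutation_importance(toy_model, X_toy, y_toy, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
print(result.importances_mean)
```

If both methods agree on the top features, the ranking is less likely to be an artifact of how one particular importance measure is computed.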

### Model errors
In the following plots we compare the Root Mean Squared Error (RMSE) and the R^2 values of all models. Linear regression performed worst, with an RMSE of almost 0.5 and an R^2 of around 0.8. Gradient boosting reached an RMSE of around 0.1 and an R^2 of almost 1. The regression tree and the Random Forest performed best, with an extremely low RMSE and an R^2 even higher than that of the gradient boosting model.
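For reference, the two metrics can be computed by hand as in the sketch below. The helper names `rmse_by_hand` and `r2_by_hand` and the toy values are made up for illustration; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2_by_hand(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation around the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values: only the last prediction is wrong, by 2
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.0, 2.0, 3.0, 6.0])

print(rmse_by_hand(y_true_toy, y_pred_toy))  # 1.0
print(r2_by_hand(y_true_toy, y_pred_toy))    # approximately 0.2
```

An R^2 close to 1 means the model explains nearly all of the variation of the target around its mean, while the RMSE is expressed in the units of the (here standardized) target.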


In [None]:
# Function to plot performance measures
# Params: values as list, title, categories as list
def plotMeasures(values, title, categories):
    values = np.array(values)
    bar_width = 0.2

    for i in range(len(categories)):
        plt.bar(i * bar_width, values[i], width=bar_width, label=categories[i])


    plt.xticks([])
    plt.xlabel('Model')
    plt.title(title)


    plt.legend(loc='lower center')
    plt.show()

In [None]:
plotMeasures([rmse_lin, rmse_DT, rmse_RF,rmse_xgb], 'RMSE', ['Linear Regression', 'Regression Tree', 'Random Forest','XGBoost'])
plotMeasures([r2_lin, r2_DT, r2_RF,r2_xgb], 'R^2', ['Linear Regression', 'Regression Tree', 'Random Forest','XGBoost'])

## Conclusion
We obtained very good results for predicting the market value of FIFA players; the Random Forest in particular performed extraordinarily well. However, we also observed that the most important feature for the gradient boosting model was `overall`, with an importance of over 0.6. This is a game-specific rating that the developers themselves derive from the player's statistics (including, among other things, the market price). The very good results might therefore be attributed to the fact that we are working with video game data. Applying the same approach to real-life football data alone would probably not lead to such low error rates. Nevertheless, the general approach that we developed within this project can be used for similar predictions of continuous values.