In this project, we will build a machine learning model to predict the `overall` rating of a football player from their individual attributes. We will use the `FIFA 23 complete player` dataset to train our model.

## **1. Data exploration**

In [None]:
import pandas as pd
#from google.colab import drive
#drive.mount('/content/drive')
#df_players = pd.read_csv('drive/MyDrive/male_players (legacy).csv')
df_players = pd.read_csv('male_players (legacy).csv')

To get an idea of the dimensions of our data, we will read the `.shape` attribute of our DataFrame.

In [4]:
df_players.shape

(161583, 110)

Let's apply the pandas `.info()` method to the data to see the variables, the column types, and the number of non-null values. Because the DataFrame has more than 100 columns, pandas prints a condensed summary by default.

In [5]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161583 entries, 0 to 161582
Columns: 110 entries, player_id to player_face_url
dtypes: float64(18), int64(45), object(47)
memory usage: 135.6+ MB


Let's display the data description for all variables.

In [6]:
# We will use the "include='all'" argument to include all variables in the output
print(df_players.describe(include='all'))

            player_id                          player_url   fifa_version  \
count   161583.000000                              161583  161583.000000   
unique            NaN                              161583            NaN   
top               NaN  /player/158023/lionel-messi/150002            NaN   
freq              NaN                                   1            NaN   
mean    214484.722353                                 NaN      19.125514   
std      34928.608856                                 NaN       2.559318   
min          2.000000                                 NaN      15.000000   
25%     199159.000000                                 NaN      17.000000   
50%     220621.000000                                 NaN      19.000000   
75%     236958.000000                                 NaN      21.000000   
max     271817.000000                                 NaN      23.000000   

        fifa_update fifa_update_date    short_name   long_name  \
count      161583.0  

This time we will call the pandas `.describe()` method without the `include` parameter, which restricts the output to the numeric variables.
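As a small illustration of the difference between the two calls, the sketch below runs both on a hypothetical three-row frame. The name `toy` and its values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame, not the FIFA data: one text column, two numeric columns
toy = pd.DataFrame({
    "short_name": ["A", "B", "C"],
    "overall": [70, 80, 90],
    "pace": [60.0, 75.0, 90.0],
})

# Default call: only numeric columns are summarized
numeric_summary = toy.describe()

# include='all': text columns appear too, with NaN where a statistic does not apply
full_summary = toy.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

The text column `short_name` is absent from the first summary and present in the second, which is the same pattern seen in the two outputs of this notebook.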

In [7]:

print(df_players.describe())

           player_id   fifa_version  fifa_update        overall  \
count  161583.000000  161583.000000     161583.0  161583.000000   
mean   214484.722353      19.125514          2.0      65.699071   
std     34928.608856       2.559318          0.0       7.040855   
min         2.000000      15.000000          2.0      40.000000   
25%    199159.000000      17.000000          2.0      61.000000   
50%    220621.000000      19.000000          2.0      66.000000   
75%    236958.000000      21.000000          2.0      70.000000   
max    271817.000000      23.000000          2.0      94.000000   

           potential     value_eur       wage_eur            age  \
count  161583.000000  1.595300e+05  159822.000000  161583.000000   
mean       70.744008  2.326770e+06   10855.409768      25.123181   
std         6.259121  6.005746e+06   21941.656285       4.670207   
min        40.000000  1.000000e+03     500.000000      16.000000   
25%        66.000000  3.250000e+05    2000.000000      2

##  **2. Data Preparation**
###  **2.1. Identification and Removal of Missing Values**

The goal of our project is to predict a player's `overall` rating from their individual attributes: pace, shooting, passing, dribbling, defending, and physicality. In this step we will focus only on the columns containing these variables.

Here are the names of the columns for these variables in the dataset:
`'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic'`

Here is the name of the target variable:
`overall`

##### **2.1.1. Identification of Missing Values**

Let's apply the `info()` method to see if there are any missing values in these columns.

In [8]:
df_players[['overall','pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161583 entries, 0 to 161582
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   overall    161583 non-null  int64  
 1   pace       143614 non-null  float64
 2   shooting   143614 non-null  float64
 3   passing    143614 non-null  float64
 4   dribbling  143614 non-null  float64
 5   defending  143614 non-null  float64
 6   physic     143614 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 8.6 MB


The result shows that the six attribute columns contain missing values: each has 143614 non-null entries out of 161583 rows.

To count the missing values in each column, we will chain the `isna()` and `sum()` methods on these columns of our dataset.


In [9]:
df_players[['overall','pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']].isna().sum()

overall          0
pace         17969
shooting     17969
passing      17969
dribbling    17969
defending    17969
physic       17969
dtype: int64

We can see that each of the six attribute columns has 17969 missing values, while `overall` has none.
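The identical count in all six columns suggests that the same rows are incomplete in every attribute, and 17969 out of 161583 rows is roughly 11% of the data. The sketch below shows one way such a share and such a row-level check could be computed. It uses a hypothetical toy frame with made-up values, so the names `toy`, `missing_share`, and `same_rows` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_players (values are made up)
toy = pd.DataFrame({
    "overall": [80, 75, 70, 65],
    "pace": [90.0, np.nan, 60.0, 55.0],
    "shooting": [85.0, np.nan, 50.0, 45.0],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = toy.isna().mean()

# True if every row is either complete or missing in all attribute columns
attr_na = toy[["pace", "shooting"]].isna()
same_rows = bool((attr_na.all(axis=1) == attr_na.any(axis=1)).all())

print(missing_share)
print(same_rows)
```

Applied to the real columns, the same two expressions would confirm whether the 17969 incomplete rows are indeed the same rows in every attribute column.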

#####  **2.1.2. Removal of Missing Values**

To remove every row that has a missing value in one of these columns, we will apply the `dropna(subset=columns)` method to the dataset. The `subset` parameter specifies the columns that are checked for missing values.

In [10]:
cols_val_nan = ['overall','pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

df_players_cols_drop_nan = df_players.dropna(subset=cols_val_nan)

Let's check that the missing values have been removed.

In [11]:
df_players_cols_drop_nan[['overall','pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']].isna().sum()

overall      0
pace         0
shooting     0
passing      0
dribbling    0
defending    0
physic       0
dtype: int64

Great! The missing values have been successfully removed.
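To make the effect of `subset` concrete, here is a minimal sketch on a hypothetical toy frame (the values and the `club` column are made up for the example). Missing values in columns outside the subset do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (made-up values); 'club' is outside the subset
toy = pd.DataFrame({
    "overall": [80, 75, 70],
    "pace": [90.0, np.nan, 60.0],
    "club": ["X", "Y", None],
})

# Only rows with a missing value in 'overall' or 'pace' are dropped
cleaned = toy.dropna(subset=["overall", "pace"])

print(cleaned.index.tolist())
print(int(cleaned["club"].isna().sum()))
```

Note that `dropna` returns a new DataFrame and keeps the original index labels, so the index of the result can have gaps.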

###  **2.2. Dimensionality Reduction**

To build our model, we do not need all the variables in the dataset. We have already specified the variables we will work with: `'overall', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic'`, with `overall` being the target variable.

Therefore, we will create a new reduced DataFrame containing only the variables we need, based on the cleaned `df_players_cols_drop_nan` DataFrame.

In [12]:
# Let's specify the columns we want to keep
cols_to_keep = ['overall','pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

# Let's create the new DataFrame containing only the specified columns
df_players_reduced = df_players_cols_drop_nan[cols_to_keep]

# Let's display the columns of the newly created DataFrame
print(df_players_reduced.columns)

Index(['overall', 'pace', 'shooting', 'passing', 'dribbling', 'defending',
       'physic'],
      dtype='object')


So there you go, our reduced DataFrame has been created under the name `df_players_reduced`.

###  **2.3. Variable Transformation**

It is a good practice to scale numerical variables of different magnitudes (scaling / normalization).

The goal of the scaling we are going to apply is to bring all feature values into the range 0 to 1.

`min-max scaling` is one of several normalization methods that achieve this.

To perform `min-max scaling` on our variables, we will use the `MinMaxScaler` transformer from the `sklearn.preprocessing` module.
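For a single column, min-max scaling maps each value `x` to `(x - min) / (max - min)`. The sketch below checks this formula against `MinMaxScaler` on a few made-up values; the array `x` is an assumption for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up attribute values for one column (not taken from the dataset)
x = np.array([[30.0], [50.0], [70.0], [90.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Same transformation with scikit-learn's default feature_range of (0, 1)
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```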

In [13]:
from sklearn.preprocessing import MinMaxScaler

# Let's create an independent copy of our reduced DataFrame
# (.copy() avoids pandas' SettingWithCopyWarning when assigning below)
scaled_data = df_players_reduced.copy()

# Select the columns to scale
cols_to_scale = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

# Let's initialize the MinMaxScaler transformer
scaler = MinMaxScaler()

# Let's fit and transform the data
scaled_data[cols_to_scale] = scaler.fit_transform(scaled_data[cols_to_scale])

# Let's display the scaled data
scaled_data.describe()


         overall      pace  shooting   passing  dribbling  defending    physic
count   143614.0  143614.0  143614.0  143614.0   143614.0   143614.0  143614.0
mean   65.875256  0.618719  0.477885  0.506002   0.541647   0.482251  0.581759
std      6.93658  0.146001  0.174024  0.143624   0.139207   0.216099  0.149801
min         40.0       0.0       0.0       0.0        0.0        0.0       0.0
25%         61.0  0.539474      0.35  0.410959   0.459459   0.285714  0.476923
50%         66.0  0.631579       0.5  0.520548   0.554054   0.545455       0.6
75%         70.0  0.723684    0.6125   0.60274   0.635135   0.649351  0.692308
max         94.0       1.0       1.0       1.0        1.0        1.0       1.0


Our six feature columns now have values between 0 and 1. The target `overall` was not scaled and keeps its original range.

###  **2.4. Data Splitting**

To evaluate the model on data it has not seen during training, we split the data into a training set and a test set.

We will use the `train_test_split()` function from the `scikit-learn` Machine Learning library. We will reserve 20% of the data for testing.

In [14]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(scaled_data, test_size=0.2, random_state=0)
print(data_train.shape)
print(data_test.shape)

(114891, 7)
(28723, 7)
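The two row counts follow from the 20% test share: 0.2 × 143614 = 28722.8, and the sizes above are consistent with scikit-learn rounding the test share up to a whole number of rows. The sketch below reproduces the arithmetic and checks it against `train_test_split` on a placeholder array; `placeholder`, `part_train`, and `part_test` are names introduced only for this example.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 143614  # rows left after removing missing values

# The test share is rounded up to a whole number of rows
expected_test = math.ceil(0.2 * n_rows)
expected_train = n_rows - expected_test

# Placeholder array with the same number of rows as scaled_data
placeholder = np.arange(n_rows)
part_train, part_test = train_test_split(placeholder, test_size=0.2, random_state=0)

print(expected_train, expected_test)
print(len(part_train), len(part_test))
```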


Let's separate the features from the target variable `overall`.

Our goal is to predict the variable `overall`.

In [15]:
# Let's build X_train by dropping the 'overall' column from data_train
X_train = data_train.drop(columns=['overall'])

# Let's create y_train containing only the values of the 'overall' column from data_train
y_train = data_train['overall']

# Let's Display the dimensions of X_train and y_train
print(X_train.shape)
print(y_train.shape)

(114891, 6)
(114891,)


In [16]:
# Let's build X_test by dropping the 'overall' column from data_test
X_test = data_test.drop(columns=['overall'])

# Let's create y_test containing only the values of the 'overall' column from data_test
y_test = data_test['overall']

# Let's Display the dimensions of X_test and y_test
print(X_test.shape)
print(y_test.shape)

(28723, 6)
(28723,)
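The scaler in section 2.3 was fit on all rows before the split, so the minimum and maximum it learned also reflect rows that end up in the test set. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data; the array `features` and its values are assumptions for the example, and this is not the procedure used in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the six attribute columns (values are made up)
rng = np.random.default_rng(0)
features = rng.uniform(20, 95, size=(1000, 6))

# Split first, then learn min and max from the training rows only
feat_train, feat_test = train_test_split(features, test_size=0.2, random_state=0)
scaler = MinMaxScaler().fit(feat_train)

feat_train_scaled = scaler.transform(feat_train)
feat_test_scaled = scaler.transform(feat_test)

# Training values lie in [0, 1]; test values may fall slightly outside
print(feat_train_scaled.min(), feat_train_scaled.max())
print(feat_test_scaled.min(), feat_test_scaled.max())
```

With this ordering the test rows play no part in choosing the scaling parameters, at the cost of test values occasionally falling just outside the 0 to 1 range.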


##  **3. Linear Regression Model Building**

We want to build a linear regression model to predict the variable `overall`. To do so, we will go through the following steps:

1. Instantiate a LinearRegression object
2. Train the model on the training data
3. Calculate predictions on the test data

####  **3.1. Building the Model**

To perform the first step, we will use the `scikit-learn` Machine Learning library.

In [17]:
from sklearn.linear_model import LinearRegression

# Let's instantiate a LinearRegression object
lin_reg_model = LinearRegression()

The `lin_reg_model` model is now ready to be trained on the training data.

To train our model on the training set, we will use the `.fit()` method.

In [18]:
lin_reg_model.fit(X_train, y_train)

The model has now been fitted to the training data and can be used to make predictions.

Let's use the built model to calculate predictions on the test data.

In [19]:
y_pred = lin_reg_model.predict(X_test)

####  **3.2. Model Evaluation**
To evaluate our model, we will use three evaluation metrics:

*   The coefficient of determination `R^2`.
*   The Mean Absolute Error `MAE`.
*   The Root Mean Squared Error `RMSE`.

These metrics help evaluate the quality of the model's predictions by comparing the predicted values `y_pred` to the actual values `y_test`.
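For reference, the three metrics can be written out by hand: MAE is the mean of the absolute errors, RMSE is the square root of the mean squared error, and `R^2` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes them with NumPy on made-up ratings; `y_true` and `y_hat` are hypothetical arrays, not values from the dataset.

```python
import numpy as np

# Made-up ratings for illustration only
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_hat = np.array([62.0, 68.0, 81.0, 87.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                   # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # coefficient of determination

print(mae, rmse, r2)
```

MAE and RMSE are expressed in the same unit as the target (rating points), while `R^2` is unitless.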

In [20]:
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
import numpy as np

# Let's calculate and display the model evaluation metrics
print("R^2 : ", r2_score(y_test, y_pred))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

R^2 :  0.7077035261014379
MAE : 2.9713555904924753
RMSE: 3.73210599041024


Based on these results, our linear regression model shows moderate performance on the `FIFA 23 complete player` dataset. The coefficient of determination, `R^2`, is about 0.71, meaning that the model explains approximately 71% of the variance of `overall` on the test set. The closer this value is to 1, the better the model's performance.

The other error measures, `MAE` and `RMSE`, indicate the magnitude of the model's prediction errors in rating points. Here the MAE is about 2.97 and the RMSE about 3.73, so predictions are typically off by around three points on a rating that ranges from 40 to 94 in this dataset.

Since these errors leave room for improvement, we will explore another algorithm to see if we can build a more effective model.
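One way to judge whether an error is high is to compare it with a trivial baseline that always predicts the mean rating. Evaluated on the same data its mean was computed from, such a baseline has an `R^2` of 0 and an RMSE equal to the standard deviation of the target. The sketch below shows this on made-up ratings; `ratings` and the baseline variables are assumptions introduced for the example.

```python
import numpy as np

# Made-up ratings for illustration only
ratings = np.array([55.0, 60.0, 65.0, 70.0, 75.0])

# Baseline: predict the mean rating for every player
baseline_pred = np.full_like(ratings, ratings.mean())

squared_errors = (ratings - baseline_pred) ** 2
baseline_rmse = np.sqrt(np.mean(squared_errors))
baseline_r2 = 1 - np.sum(squared_errors) / np.sum((ratings - ratings.mean()) ** 2)

# The baseline RMSE equals the (population) standard deviation of the target
print(baseline_rmse, ratings.std(), baseline_r2)
```

For comparison, the standard deviation of `overall` reported in section 2.3 is about 6.94, so an RMSE of 3.73 is roughly half of what such a baseline would be expected to give.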

##  **4. Decision Tree Regression Model Building**

To build a Decision Tree Regression model, we will go through the following steps:

1. Initialize a Decision Tree Regression model
2. Train the model on the training data
3. Calculate predictions on the test data

####  **4.1. Building the Model**

In [21]:
from sklearn.tree import DecisionTreeRegressor

# Let's initialize a Decision Tree Regression model with a random state of 0
DTregressor = DecisionTreeRegressor(random_state=0)

# Let's train the model on the training data
DTregressor.fit(X_train, y_train)

# Let's make predictions on the test data
y_pred = DTregressor.predict(X_test)

####  **4.2. Model Evaluation**

In [22]:
# Let's calculate and display the model evaluation metrics
print("R^2 : ", r2_score(y_test, y_pred))
print("MAE :", mean_absolute_error(y_test,y_pred))
print("RMSE:",np.sqrt(mean_squared_error(y_test, y_pred)))

R^2 :  0.9248954657466217
MAE : 1.3770259606122852
RMSE: 1.891800167274791


These results indicate that our Decision Tree Regression model performs well. The coefficient of determination, `R^2`, is about 0.92, meaning that the model explains approximately 92% of the variance of `overall` on the test set. The closer this value is to 1, the better the model's performance.

The other error measures, `MAE` and `RMSE`, indicate the magnitude of the model's prediction errors in rating points. Here the MAE is about 1.38 and the RMSE about 1.89, roughly half of the linear regression errors, so the predictions are noticeably more accurate.
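An unrestricted decision tree can fit its training data almost perfectly, so it is worth comparing training and test scores before relying on a single test figure. The sketch below does this on synthetic data and also fits a depth-limited tree. The data, the `max_depth=5` setting, and all variable names are assumptions for illustration, not results from the FIFA dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up): noisy linear target on six features
rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(2000, 6))
target = features @ np.array([10, 8, 6, 4, 2, 1]) + rng.normal(0, 1, size=2000)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Unrestricted tree: grows until the leaves are pure
full_tree = DecisionTreeRegressor(random_state=0).fit(f_train, t_train)

# Depth-limited tree: hypothetical setting, chosen only for the example
small_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(f_train, t_train)

# .score() returns R^2; print training and test scores side by side
for name, model in [("full", full_tree), ("max_depth=5", small_tree)]:
    print(name, model.score(f_train, t_train), model.score(f_test, t_test))
```

A large gap between the training and test scores of the unrestricted tree is a sign of overfitting. The same comparison could be run with `DTregressor`, `X_train`, and `X_test` from this notebook.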

## **Conclusion**


To predict the overall rating of players from the `FIFA 23 complete player` dataset, we compared two algorithms: a linear regression model and a decision tree regression model.

Our first model (linear regression) reached an `R^2` of about 0.71 on the test set, while our second model (decision tree regression) reached about 0.92 with roughly half the prediction error. On this comparison, the decision tree model appears to be the more suitable solution to our problem.