# **UEFA Euro 2024 Player Performance Analysis**

## **Project Overview**

In this project, we analyze the top-performing players in the UEFA Euro tournaments from 2012 to 2024. We identify the top 5 players in each of the previous tournaments (2012, 2016, and 2021) across several performance metrics and compare them with the top 5 players of the 2024 tournament in Germany. This analysis provides insight into player performance trends over the years and highlights key performance indicators for success.

## **Data Source**

Our primary data source is fbref.com, a comprehensive database of football statistics. It provides detailed player statistics, including goals, assists, minutes played, and advanced metrics such as expected goals (xG) and expected assisted goals (xAG). The data is regularly updated and covers major football leagues and tournaments, which makes it a suitable source for our analysis.

#### **Data Dictionary**

* **Player:** Name of the player
* **Pos:** Position of the player (e.g., FW, MF, DF)
* **Squad:** Team/squad of the player
* **MP:** Matches Played
* **Starts:** Matches Started
* **Min:** Minutes Played
* **90s:** Minutes Played divided by 90
* **Gls:** Goals
* **Ast:** Assists
* **G+A:** Goals plus Assists
* **G-PK:** Goals excluding Penalty Kicks
* **PK:** Penalty Kicks scored
* **PKatt:** Penalty Kick attempts
* **CrdY:** Yellow Cards
* **CrdR:** Red Cards
* **Gls.1:** Goals per 90 minutes
* **Ast.1:** Assists per 90 minutes
* **G+A.1:** Goals plus Assists per 90 minutes
* **G-PK.1:** Goals excluding Penalty Kicks per 90 minutes
* **G+A-PK:** Goals plus Assists excluding Penalty Kicks per 90 minutes
* **xG:** Expected Goals
* **npxG:** Non-Penalty Expected Goals
* **xAG:** Expected Assisted Goals
* **npxG+xAG:** Non-Penalty Expected Goals plus Expected Assisted Goals
* **PrgC:** Progressive Carries
* **PrgP:** Progressive Passes
* **PrgR:** Progressive Passes Received
* **xG.1:** Expected Goals per 90 minutes
* **xAG.1:** Expected Assisted Goals per 90 minutes
* **xG+xAG:** Expected Goals plus Expected Assisted Goals per 90 minutes
* **npxG.1:** Non-Penalty Expected Goals per 90 minutes
* **npxG+xAG.1:** Non-Penalty Expected Goals plus Expected Assisted Goals per 90 minutes

### **Analysis Goals**

1. **Identify Top Performers:** Determine the top 5 players in the UEFA Euro tournaments of 2012, 2016, 2021, and 2024 based on various performance metrics.
2. **Compare Performance:** Compare the performance of top players across different tournaments to identify trends and patterns.
3. **Player Efficiency:** Create efficiency metrics and a composite rating score to evaluate player performance comprehensively.

### **Machine Learning Model**

#### Model Selection

We implemented several regression models to predict player performance scores. The justification for using regression models is that they allow us to predict a continuous target variable based on various player performance metrics. These models help us understand the factors contributing to high player performance and quantify their impact.

1. **Linear Regression:** Provided a basic understanding but failed to capture the complexity of the relationships.
2. **Ridge Regression:** Added regularization but did not significantly improve performance.
3. **Gradient Boosting Regressor:** Showed improvement by capturing non-linear relationships but still lacked optimal performance.
4. **Random Forest Regressor:** Achieved exceptional results after hyperparameter tuning, demonstrating its ability to capture complex interactions and relationships.

#### Target Variable

The target variable for our machine learning models is a composite Performance Rating Score. In this notebook it is computed as a weighted sum of goals, assists, and matches started, with example weights of 0.5, 0.3, and 0.2 respectively, reflecting the assumed relative importance of each component.
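
As a minimal sketch of how such a score can be computed, the snippet below applies the example weights used later in this notebook (0.5 for goals, 0.3 for assists, 0.2 for starts). The function name `rating_score` is illustrative and not part of the original code; the inputs are Jordi Alba's 2012 figures as shown in the data preview further down.

```python
def rating_score(goals, assists, starts, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of goals, assists, and starts (weights are the notebook's example values)."""
    w_goals, w_assists, w_starts = weights
    return goals * w_goals + assists * w_assists + starts * w_starts


# Jordi Alba, Euro 2012: 1 goal, 1 assist, 6 starts (taken from the 2012 preview output)
alba_rating = rating_score(goals=1, assists=1, starts=6)
print(round(alba_rating, 2))
```

Because starts carry a weight of 0.2 per match, a player who starts every game accumulates rating even without scoring, which is worth keeping in mind when reading the rankings.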

#### Success Metrics

Our success metrics for evaluating the models are:

- **Mean Absolute Error (MAE):** Measures the average magnitude of errors between the predicted and actual performance scores. A lower MAE indicates a better-performing model.
- **R-squared (R²):** Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R² value indicates a better fit for the model.
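
To make the two metrics concrete, here is a small sketch that computes them by hand in plain Python. The helper names and the four rating values are hypothetical and chosen only for illustration; in the notebook itself the equivalent values come from `mean_absolute_error` and `r2_score` in scikit-learn.

```python
def mae_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted ratings, for illustration only
y_true_demo = [2.0, 3.1, 0.6, 1.0]
y_pred_demo = [1.8, 3.0, 0.9, 1.0]

mae_demo = mae_manual(y_true_demo, y_pred_demo)
r2_demo = r2_manual(y_true_demo, y_pred_demo)
print(f"MAE: {mae_demo:.3f}, R2: {r2_demo:.3f}")
```
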


## **Steps to Implement the Analysis**

1. **Data Collection and Preprocessing:**
   - Load player datasets from fbref.com for the years 2012, 2016, 2021, and 2024.
   - Clean and preprocess the data, including dropping unnecessary columns and standardizing column names.

2. **Feature Engineering:**
   - Create new features such as efficiency metrics (e.g., goals per 90 minutes) and a composite performance rating score.

3. **Model Training and Evaluation:**
   - Split the data into training and testing sets.
   - Train each regression model on the training set and evaluate its performance on the testing set.

4. **Analysis and Visualization:**
   - Identify the top-performing players in each tournament.
   - Visualize player performance trends and key performance indicators.

By following these steps, we aim to gain valuable insights into player performance in the UEFA Euro tournaments and develop a reliable model for predicting future player performance.

In [2]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [17]:
# Import necessary libraries and packages
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, Ridge
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load in the datasets
player_stats_2012= pd.read_csv('/content/drive/My Drive/UEFA Euro Player Stats Data/2012_fbref_uefa_euro_player_stats.csv')
player_stats_2016= pd.read_csv('/content/drive/My Drive/UEFA Euro Player Stats Data/2016_fbref_uefa_euro_player_stats.csv')
player_stats_2021= pd.read_csv('/content/drive/My Drive/UEFA Euro Player Stats Data/2021_fbref_uefa_euro_player_stats.csv')
player_stats_2024= pd.read_csv('/content/drive/My Drive/UEFA Euro Player Stats Data/2024_fbref_uefa_euro_player_stats.csv')

## **Feature Engineering**

In the following step, we aim to create new features that will help us better understand and evaluate player performance across different UEFA Euro tournaments. The following efficiency metrics will be introduced:

1. **Goals per 90 Minutes:** This metric standardizes the number of goals scored by each player based on a 90-minute match duration. It helps to compare players who might have played different amounts of time.
   
2. **Assists per 90 Minutes:** Similar to goals per 90 minutes, this metric calculates the number of assists provided by each player standardized to a 90-minute match. This allows for a fair comparison of players' ability to create scoring opportunities for their teammates.

3. **Goals + Assists per 90 Minutes:** This combined metric provides an overall measure of a player's direct involvement in scoring, taking into account both goals scored and assists provided, standardized to a 90-minute match duration.

4. **Composite Performance Rating Score:** To get a holistic view of player performance, we create a composite score that combines several key performance indicators. The score is a weighted sum of goals, assists, and starts (example weights: 0.5, 0.3, and 0.2), reflecting the relative importance assigned to each metric.

Additionally, we recognize that disciplinary actions such as yellow and red cards can hinder a player's performance and availability. Therefore, we will also consider these metrics in our analysis:

5. **Yellow Cards per 90 Minutes:** This metric will standardize the number of yellow cards received by each player based on a 90-minute match duration. It provides insight into a player's discipline and potential impact on the game due to cautions.

6. **Red Cards per 90 Minutes:** This metric will standardize the number of red cards received by each player based on a 90-minute match duration. It highlights players who may significantly affect their team's performance by being sent off during matches.

By incorporating these efficiency metrics and disciplinary metrics, we aim to create a comprehensive evaluation of player performance that accounts for both positive contributions and potential drawbacks due to disciplinary actions.
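
The per-90 metrics all follow the same pattern: a raw count divided by minutes played and multiplied by 90. The sketch below wraps that pattern in a helper and guards against rows with zero minutes, which would otherwise divide by zero. The helper `add_per_90` and the zero-minute row are assumptions added for illustration; the first row uses Jordi Alba's 2012 figures from the data preview.

```python
import numpy as np
import pandas as pd


def add_per_90(df, count_col, new_col, minutes_col="Min"):
    """Add a per-90 rate column; rows with zero minutes get NaN instead of inf."""
    minutes = df[minutes_col].replace(0, np.nan)
    df[new_col] = df[count_col] / minutes * 90
    return df


sample = pd.DataFrame({
    "Player": ["Jordi Alba", "Unused Substitute"],  # second row is hypothetical
    "Gls": [1, 0],
    "Min": [570, 0],
})
sample = add_per_90(sample, "Gls", "Goals_per_90")
print(sample)
```

Whether zero-minute rows actually occur depends on the exported dataset, so the guard is a precaution and not a correction of a known problem.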


In [4]:
# Create efficiency metrics and the composite rating for each tournament dataset
for df in (player_stats_2012, player_stats_2016, player_stats_2021, player_stats_2024):
    # Per-90 efficiency metrics: raw count divided by minutes played, scaled to 90 minutes
    df['Goals_per_90'] = df['Gls'] / df['Min'] * 90
    df['Assists_per_90'] = df['Ast'] / df['Min'] * 90
    df['G+A_per_90'] = df['G+A'] / df['Min'] * 90
    df['Yellow_per_90'] = df['CrdY'] / df['Min'] * 90
    df['Red_per_90'] = df['CrdR'] / df['Min'] * 90

    # Composite performance rating score (example weights: Goals 0.5, Assists 0.3, Starts 0.2)
    df['Rating'] = df['Gls'] * 0.5 + df['Ast'] * 0.3 + df['Starts'] * 0.2

# Display the updated dataframes to check the new features
print(player_stats_2012.head())
print('------------------------------------------------------------------------------------------------------------------------------')
print(player_stats_2016.head())
print('------------------------------------------------------------------------------------------------------------------------------')
print(player_stats_2021.head())
print('------------------------------------------------------------------------------------------------------------------------------')
print(player_stats_2024.head())

             Player    Pos        Squad  MP  Starts  Min  90s  Gls  Ast  G+A  \
0     Ignazio Abate  DF,MF        Italy   3       3  269  3.0    0    0    0   
1   Ibrahim Afellay  FW,MF  Netherlands   3       2  139  1.5    0    0    0   
2      Daniel Agger     DF      Denmark   3       3  270  3.0    0    0    0   
3        Jordi Alba  DF,MF        Spain   6       6  570  6.3    1    1    2   
4  Oleksandr Aliyev  FW,MF      Ukraine   1       0   23  0.3    0    0    0   

   ...  Ast.1  G+A.1  G-PK.1  G+A-PK  Goals_per_90  Assists_per_90  \
0  ...   0.00   0.00    0.00    0.00      0.000000        0.000000   
1  ...   0.00   0.00    0.00    0.00      0.000000        0.000000   
2  ...   0.00   0.00    0.00    0.00      0.000000        0.000000   
3  ...   0.16   0.32    0.16    0.32      0.157895        0.157895   
4  ...   0.00   0.00    0.00    0.00      0.000000        0.000000   

   G+A_per_90  Yellow_per_90  Red_per_90  Rating  
0    0.000000       0.000000         0.0     0.

In [5]:
# Add a Year column to each dataset
player_stats_2012['Year'] = 2012
player_stats_2016['Year'] = 2016
player_stats_2021['Year'] = 2021
player_stats_2024['Year'] = 2024

# Combine the datasets
combined_df = pd.concat([player_stats_2012, player_stats_2016, player_stats_2021, player_stats_2024], ignore_index=True)

# Display the combined dataframe to check
print(combined_df.head())
print('--------------------------------------------------------------------------------------------------------------------------')
print(combined_df.info())

             Player    Pos        Squad  MP  Starts  Min  90s  Gls  Ast  G+A  \
0     Ignazio Abate  DF,MF        Italy   3       3  269  3.0    0    0    0   
1   Ibrahim Afellay  FW,MF  Netherlands   3       2  139  1.5    0    0    0   
2      Daniel Agger     DF      Denmark   3       3  270  3.0    0    0    0   
3        Jordi Alba  DF,MF        Spain   6       6  570  6.3    1    1    2   
4  Oleksandr Aliyev  FW,MF      Ukraine   1       0   23  0.3    0    0    0   

   ...  xAG  npxG+xAG  PrgC  PrgP  PrgR  xG.1  xAG.1  xG+xAG  npxG.1  \
0  ...  NaN       NaN   NaN   NaN   NaN   NaN    NaN     NaN     NaN   
1  ...  NaN       NaN   NaN   NaN   NaN   NaN    NaN     NaN     NaN   
2  ...  NaN       NaN   NaN   NaN   NaN   NaN    NaN     NaN     NaN   
3  ...  NaN       NaN   NaN   NaN   NaN   NaN    NaN     NaN     NaN   
4  ...  NaN       NaN   NaN   NaN   NaN   NaN    NaN     NaN     NaN   

   npxG+xAG.1  
0         NaN  
1         NaN  
2         NaN  
3         NaN  
4     

In [6]:
# Fill missing values only for numeric columns with the mean
numeric_columns = combined_df.select_dtypes(include=[np.number]).columns
combined_df[numeric_columns] = combined_df[numeric_columns].fillna(combined_df[numeric_columns].mean())

# Verify that there are no more missing values in numeric columns
print(combined_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700 entries, 0 to 1699
Data columns (total 39 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          1700 non-null   object 
 1   Pos             1700 non-null   object 
 2   Squad           1700 non-null   object 
 3   MP              1700 non-null   int64  
 4   Starts          1700 non-null   int64  
 5   Min             1700 non-null   int64  
 6   90s             1700 non-null   float64
 7   Gls             1700 non-null   int64  
 8   Ast             1700 non-null   int64  
 9   G+A             1700 non-null   int64  
 10  G-PK            1700 non-null   int64  
 11  PK              1700 non-null   int64  
 12  PKatt           1700 non-null   int64  
 13  CrdY            1700 non-null   int64  
 14  CrdR            1700 non-null   int64  
 15  Gls.1           1700 non-null   float64
 16  Ast.1           1700 non-null   float64
 17  G+A.1           1700 non-null   f

In [7]:
# Define a function to get top 5 players for a given year
def get_top_5_players_for_year(year):
    df_year = combined_df[combined_df['Year'] == year]
    sorted_df_year = df_year.sort_values(by='Rating', ascending=False)
    return sorted_df_year.head(5)

# Get top 5 players for each tournament year
top_5_players_2012 = get_top_5_players_for_year(2012)
top_5_players_2016 = get_top_5_players_for_year(2016)
top_5_players_2021 = get_top_5_players_for_year(2021)
top_5_players_2024 = get_top_5_players_for_year(2024)

# Display the top 5 players for each tournament year
print("Top 5 Players in 2012:")
print(top_5_players_2012[['Player', 'Squad', 'Rating', 'Goals_per_90', 'Assists_per_90', 'G+A_per_90']])

print("\nTop 5 Players in 2016:")
print(top_5_players_2016[['Player', 'Squad', 'Rating', 'Goals_per_90', 'Assists_per_90', 'G+A_per_90']])

print("\nTop 5 Players in 2021:")
print(top_5_players_2021[['Player', 'Squad', 'Rating', 'Goals_per_90', 'Assists_per_90', 'G+A_per_90']])

print("\nTop 5 Players in 2024:")
print(top_5_players_2024[['Player', 'Squad', 'Rating', 'Goals_per_90', 'Assists_per_90', 'G+A_per_90']])


Top 5 Players in 2012:
                Player     Squad  Rating  Goals_per_90  Assists_per_90  \
237        David Silva     Spain     3.1      0.442260        0.663391   
79         Mario Gómez   Germany     2.6      0.967742        0.322581   
221  Cristiano Ronaldo  Portugal     2.5      0.562500        0.000000   
15     Mario Balotelli     Italy     2.5      0.644391        0.000000   
179         Mesut Özil   Germany     2.4      0.206422        0.619266   

     G+A_per_90  
237    1.105651  
79     1.290323  
221    0.562500  
15     0.644391  
179    0.825688  

Top 5 Players in 2016:
                Player     Squad  Rating  Goals_per_90  Assists_per_90  \
419  Antoine Griezmann    France     4.8      0.974729        0.324910   
618  Cristiano Ronaldo  Portugal     3.8      0.432692        0.432692   
409     Olivier Giroud    France     3.3      0.597345        0.398230   
580      Dimitri Payet    France     3.3      0.536779        0.357853   
561               Nani  Portug

## **Top 5 Players Analysis**

In this analysis, we identified the top 5 players in the UEFA Euro tournaments from 2012 to 2024 based on their performance metrics. The key metrics used for this analysis include goals per 90 minutes, assists per 90 minutes, and a composite performance rating score.

### **Methodology**

1. **Data Preparation:**
   - Combined datasets from the UEFA Euro tournaments in 2012, 2016, 2021, and 2024.
   - Added efficiency metrics such as goals per 90 minutes, assists per 90 minutes, and a composite performance rating score.

2. **Sorting and Selection:**
   - Filtered the combined dataset by tournament year and sorted each subset by the composite performance rating score in descending order.
   - Took the first 5 rows of each sorted subset as that tournament's top 5 players.
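
The selection step can also be written as a single grouped operation instead of one call per year. The following is a sketch on a hypothetical miniature dataset that reuses the `Year` and `Rating` column names from `combined_df`; the helper `top_n_per_year` is not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature dataset with the same column names as combined_df
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Year": [2012, 2012, 2012, 2016, 2016, 2016],
    "Rating": [3.1, 2.6, 2.5, 4.8, 3.8, 3.3],
})


def top_n_per_year(df, n=5):
    """Return the n highest-rated rows within each tournament year."""
    return (
        df.sort_values(["Year", "Rating"], ascending=[True, False])
          .groupby("Year")
          .head(n)
          .reset_index(drop=True)
    )


top2 = top_n_per_year(toy, n=2)
print(top2)
```

Note that ties at the cut-off (such as the two players rated 2.5 in 2012) are resolved by row order in either approach, so the fifth place can depend on how the data was sorted.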

### Results

#### Top 5 Players in 2012
| Player             | Squad     | Rating | Goals per 90 | Assists per 90 | G+A per 90 |
|--------------------|-----------|--------|--------------|----------------|------------|
| David Silva        | Spain     | 3.1    | 0.442260     | 0.663391       | 1.105651   |
| Mario Gómez        | Germany   | 2.6    | 0.967742     | 0.322581       | 1.290323   |
| Cristiano Ronaldo  | Portugal  | 2.5    | 0.562500     | 0.000000       | 0.562500   |
| Mario Balotelli    | Italy     | 2.5    | 0.644391     | 0.000000       | 0.644391   |
| Mesut Özil         | Germany   | 2.4    | 0.206422     | 0.619266       | 0.825688   |

#### Top 5 Players in 2016
| Player             | Squad     | Rating | Goals per 90 | Assists per 90 | G+A per 90 |
|--------------------|-----------|--------|--------------|----------------|------------|
| Antoine Griezmann  | France    | 4.8    | 0.974729     | 0.324910       | 1.299639   |
| Cristiano Ronaldo  | Portugal  | 3.8    | 0.432692     | 0.432692       | 0.865385   |
| Olivier Giroud     | France    | 3.3    | 0.597345     | 0.398230       | 0.995575   |
| Dimitri Payet      | France    | 3.3    | 0.536779     | 0.357853       | 0.894632   |
| Nani               | Portugal  | 3.2    | 0.384068     | 0.128023       | 0.512091   |

#### Top 5 Players in 2021
| Player             | Squad     | Rating | Goals per 90 | Assists per 90 | G+A per 90 |
|--------------------|-----------|--------|--------------|----------------|------------|
| Cristiano Ronaldo  | Portugal  | 3.6    | 1.250000     | 0.250000       | 1.500000   |
| Patrik Schick      | Czechia   | 3.5    | 1.125000     | 0.000000       | 1.125000   |
| Harry Kane         | England   | 3.4    | 0.557276     | 0.000000       | 0.557276   |
| Raheem Sterling    | England   | 3.2    | 0.422535     | 0.140845       | 0.563380   |
| Romelu Lukaku      | Belgium   | 3.0    | 0.812641     | 0.000000       | 0.812641   |

#### Top 5 Players in 2024
| Player              | Squad      | Rating | Goals per 90 | Assists per 90 | G+A per 90 |
|---------------------|------------|--------|--------------|----------------|------------|
| Georges Mikautadze  | Georgia    | 2.6    | 0.782609     | 0.260870       | 1.043478   |
| Ivan Schranz        | Slovakia   | 2.3    | 0.815710     | 0.000000       | 0.815710   |
| Jamal Musiala       | Germany    | 2.3    | 0.903010     | 0.000000       | 0.903010   |
| Fabián Ruiz Peña    | Spain      | 2.2    | 0.694981     | 0.694981       | 1.389961   |
| Kai Havertz         | Germany    | 2.1    | 0.602007     | 0.301003       | 0.903010   |


The analysis reveals the top-performing players in each UEFA Euro tournament based on a composite performance rating score and other key metrics. This information can be valuable for scouting, performance analysis, and historical comparisons of player performance in major tournaments.


In [9]:
# Define the feature set (X) and target variable (y)
features= [
    'Goals_per_90', 'Assists_per_90', 'Yellow_per_90', 'Red_per_90',
    'xG', 'xAG', 'PrgC', 'PrgP', 'PrgR', 'Min', 'Starts', 'MP'
]
X= combined_df[features]
y= combined_df['Rating']

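# Split into training and test sets first so the scaler never sees the test data
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42)

# Standardize the features for scale-sensitive models (e.g. Linear/Ridge regression),
# fitting the scaler on the training set only.
# The tree-based models below are trained on the unscaled X_train / X_test.
scaler= StandardScaler()
X_train_scaled= scaler.fit_transform(X_train)
X_test_scaled= scaler.transform(X_test)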

In [13]:
# Define the parameter grid
param_grid= {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Set up the GridSearchCV
grid_search= GridSearchCV(
    estimator= RandomForestRegressor(random_state=42),
    param_grid= param_grid,
    scoring='neg_mean_absolute_error',
    cv= 5,
    n_jobs= -1,
    verbose= 2
)

In [14]:
# Train the Random Forest model
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


In [15]:
# Get the best model found by the grid search (its settings are in grid_search.best_params_)
best_rf_model= grid_search.best_estimator_

# Make predictions on the test set
y_pred= best_rf_model.predict(X_test)

In [16]:
# Evaluate the model
mae= mean_absolute_error(y_test, y_pred)
r2= r2_score(y_test, y_pred)

# Print metrics
print(f'Mean Absolute Error: {mae}')
print(f'R2 Score: {r2}')

Mean Absolute Error: 0.02954705882353009
R2 Score: 0.9695695297126846


## **Implementing Random Forest Regressor**

Given that the linear and Ridge regression models did not perform well, and the Gradient Boosting model improved on them but still left a large share of the variance unexplained, we explored the Random Forest Regressor. After hyperparameter tuning, the Random Forest model achieved the best results of all models tested.

### **Model Exploration and Performance**

#### Linear Regression

First, we implemented a simple linear regression model to predict player ratings based on performance metrics. Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.

- **Mean Absolute Error (MAE):** 0.41475090884825866
- **R-squared (R²):** 0.06939271217140042

The low R² value indicated that the linear model could only explain about 6.9% of the variance in player ratings, suggesting that the relationship between the features and the target variable is not adequately captured by a simple linear model.

#### Ridge Regression

Next, we applied Ridge regression, a linear model that includes a regularization term to prevent overfitting by penalizing large coefficients.

- **Mean Absolute Error (MAE):** 0.41476941030296677
- **R-squared (R²):** 0.06919973606167829

The performance of the Ridge regression model was very similar to that of the simple linear regression model, indicating that regularization did not significantly improve the model’s ability to explain the variance in player ratings.
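
The code for the linear and Ridge models is not included in this notebook. As a rough sketch of how such a model could be fitted, the example below chains `StandardScaler` and `Ridge` in a scikit-learn pipeline so that scaling is learned from the training split only. The data is synthetic and the variable names are assumptions, so the resulting scores say nothing about the player dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a linear target with a little noise (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (0.5 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + 0.2 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr only, then applies it to X_te at predict time
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_pipeline.fit(X_tr, y_tr)

ridge_pred = ridge_pipeline.predict(X_te)
ridge_mae = mean_absolute_error(y_te, ridge_pred)
ridge_r2 = r2_score(y_te, ridge_pred)
print(f"MAE: {ridge_mae:.3f}, R2: {ridge_r2:.3f}")
```
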

#### Gradient Boosting Regressor (XGBoost)

We then tried the Gradient Boosting Regressor, which builds an ensemble of trees sequentially, with each tree trying to correct the errors of the previous ones. This model is known for its ability to capture complex, non-linear relationships.

- **Mean Absolute Error (MAE):** 0.2562355949444806
- **R-squared (R²):** 0.5977508653405955

The Gradient Boosting model showed a clear improvement over the linear models, explaining about 59.8% of the variance in player ratings. However, roughly 40% of the variance remained unexplained, so we continued to explore other models.

### Random Forest Regressor

Finally, we implemented a Random Forest Regressor, an ensemble method that fits multiple decision trees on various sub-samples of the dataset and averages the results to improve the predictive accuracy and control overfitting.
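
The averaging behaviour can be checked directly: for a fitted `RandomForestRegressor`, the forest's prediction is the mean of the predictions of the individual trees in `estimators_`. The sketch below demonstrates this on synthetic data; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear regression problem (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(150, 4))
y_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] * X_demo[:, 2]

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Predict with every tree separately, then average across trees
tree_preds = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_pred))
```
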

#### Initial Results

- **Mean Absolute Error (MAE):** 0.030179411764706288
- **R-squared (R²):** 0.9688936433270304

The initial results from the Random Forest Regressor were already significantly better than those from the previous models. The model explained approximately 96.9% of the variance in player ratings, indicating a very good fit.

#### Hyperparameter Tuning

To further optimize the model, we performed hyperparameter tuning using GridSearchCV to find the best combination of hyperparameters. We explored a range of values for the number of trees (`n_estimators`), maximum depth of the trees (`max_depth`), minimum number of samples required to split an internal node (`min_samples_split`), minimum number of samples required to be at a leaf node (`min_samples_leaf`), and whether bootstrap samples were used (`bootstrap`).

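- **Best Hyperparameters:** Not shown in the cell output above; they can be read from `grid_search.best_params_`. The dictionary `{'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 100, 'subsample': 1.0}` recorded with these results contains `learning_rate` and `subsample`, which are XGBoost parameters and not part of the Random Forest grid, so it most likely belongs to the Gradient Boosting search.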
- **Mean Absolute Error (MAE):** 0.02954705882353009
- **R-squared (R²):** 0.9695695297126846
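
The size of the search follows directly from the grid: the number of candidates is the product of the number of values per hyperparameter, and each candidate is fitted once per cross-validation fold. The short sketch below reproduces the counts reported in the training log using only the standard library; the variable names are illustrative.

```python
from itertools import product

# Same grid as defined in the notebook
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

cv_folds = 5  # matches cv=5 in GridSearchCV
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)  # 540 2700
```

With 2700 model fits, an exhaustive search of this size is expensive; a randomized search over the same ranges is a common lower-cost alternative.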

### Conclusion

The hyperparameter-tuned Random Forest Regressor provided an excellent fit for the data, capturing the complex relationships between the features and the player ratings. This improvement suggests that non-linear models, particularly ensemble methods like Random Forest, are highly suitable for this type of analysis.

#### Model Performance Metrics

- **Mean Absolute Error (MAE):** 0.02954705882353009
  - The MAE measures the average magnitude of errors between the predicted and actual player ratings. A lower MAE indicates better predictive accuracy. In this case, the MAE of approximately 0.0295 means that, on average, the predicted ratings are within 0.03 points of the actual ratings. This level of accuracy is exceptional, suggesting that the model's predictions are very close to the true player ratings.

- **R-squared (R²):** 0.9695695297126846
  - The R² value represents the proportion of variance in the dependent variable (player ratings) that is predictable from the independent variables (performance metrics). An R² value of approximately 0.97 indicates that 97% of the variability in player ratings is explained by the model. This high R² value demonstrates that the model has a strong explanatory power and effectively captures the underlying patterns in the data.

#### Success of the Model

The results from the hyperparameter-tuned Random Forest Regressor are impressive, showcasing its ability to make highly accurate predictions and explain a significant portion of the variance in player ratings. This success can be attributed to several factors:

1. **Complex Relationships:** The Random Forest model is capable of capturing complex, non-linear relationships between features and the target variable, which simpler linear models failed to do.

2. **Ensemble Method:** As an ensemble method, Random Forest combines the predictions of multiple decision trees, which helps in reducing overfitting and improving generalization to unseen data.

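3. **Accuracy:** The very low MAE and high R² on the held-out test set indicate that the model's predictions are very close to the actual ratings. Part of this accuracy follows from how the target is built: the rating is a weighted sum of goals, assists, and starts, and the feature set includes starts, minutes, and the per-90 goal and assist rates, so much of the target can be reconstructed from the inputs.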

4. **Hyperparameter Tuning:** The process of hyperparameter tuning was crucial in optimizing the model's performance. By exploring a range of hyperparameters, we were able to identify the best configuration that maximized the model's predictive power.

#### **Implications for Predicting Player Ratings for UEFA Euro Tournaments**

The successful application of the hyperparameter-tuned Random Forest Regressor for predicting player ratings has significant implications for UEFA Euro tournaments. This model can serve as a powerful tool for coaches, analysts, and scouts to evaluate player performance comprehensively and accurately. By leveraging a wide range of performance metrics, the model provides a detailed assessment of each player's contribution, allowing for more informed decision-making in team selection and strategy formulation.

Furthermore, the model's ability to explain a substantial portion of the variance in player ratings highlights its potential for identifying key performance indicators that contribute to a player's success on the field. This insight can drive targeted training programs and performance improvement plans, ultimately enhancing team performance in UEFA Euro tournaments.
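
One common way to look for such key performance indicators is the `feature_importances_` attribute of a fitted Random Forest. The sketch below uses synthetic data in which the target is deliberately dominated by one column; the column names are borrowed from this notebook only for readability, and the numbers do not describe real players.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
feature_names = ["Goals_per_90", "Assists_per_90", "Starts"]
X_demo = pd.DataFrame(rng.uniform(size=(300, 3)), columns=feature_names)

# Synthetic target dominated by the first column (illustration only)
y_demo = (3.0 * X_demo["Goals_per_90"] + 0.5 * X_demo["Assists_per_90"]
          + rng.normal(scale=0.05, size=300))

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; higher means the feature drove more splits
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the tuned model, the same idea would be `pd.Series(best_rf_model.feature_importances_, index=features)`, assuming `best_rf_model` and `features` are defined as in the cells above.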






In [None]:
# Install the DagsHub python client
!pip install -q dagshub

In [None]:
from dagshub.notebook import save_notebook

save_notebook(repo="Omdena/TunisiaLocalChapter_UEFAEURO2024", path="task3-machine-learning-modeling")
