In [2]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Project Overview

With UEFA EURO 2024 ongoing, the potential to harness advanced analytics in sports is immense. This project aims to leverage data science and machine learning to analyze various aspects of football, providing valuable insights for teams, coaches, analysts, and stakeholders. The integration of comprehensive datasets encompassing match results, player performances, and other relevant data can revolutionize how teams prepare, strategize, and compete.

### Question to be Analyzed

**How does home advantage influence match results in UEFA EURO tournaments?**

Home advantage is a well-known phenomenon in sports, where teams playing on their home ground are believed to have a higher chance of winning. This analysis aims to quantify the impact of home advantage on match outcomes in UEFA EURO tournaments. By understanding this effect, teams can better prepare for matches, whether they are playing at home or away.

### Model Justification

To quantify the impact of home advantage on match outcomes, we will use regression models. Specifically, we will start with Linear Regression due to its simplicity and interpretability. Linear Regression is suitable for this analysis as it will help us understand the relationship between the home advantage and match results.

Additionally, we will also consider using Lasso Regression as a secondary model. Lasso Regression can handle potential multicollinearity and perform feature selection by shrinking some coefficients to zero. This characteristic can be beneficial if we have multiple features and want to identify the most significant ones influencing match outcomes.

By comparing the performance of these two models, we can determine which one provides better insights into the impact of home advantage on match results.


In [3]:
# Import necessary libraries and packages
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Load in the dataset
# Load match results dataset from DagsHub repository
match_results = pd.read_csv('https://dagshub.com/carlosrod723/TunisiaLocalChapter_UEFAEURO2024/raw/main/Datasets/results.csv')

# Display the first few rows of the dataset to ensure it is loaded correctly
print("Match Results:\n", match_results.head())

Match Results:
          date home_team away_team  home_score  away_score tournament     city  \
0  1872-11-30  Scotland   England         0.0         0.0   Friendly  Glasgow   
1  1873-03-08   England  Scotland         4.0         2.0   Friendly   London   
2  1874-03-07  Scotland   England         2.0         1.0   Friendly  Glasgow   
3  1875-03-06   England  Scotland         2.0         2.0   Friendly   London   
4  1876-03-04  Scotland   England         3.0         0.0   Friendly  Glasgow   

    country  neutral  
0  Scotland    False  
1   England    False  
2  Scotland    False  
3   England    False  
4  Scotland    False  


In [5]:
# Explore the data
print(match_results.shape)
print(match_results.columns)
print(match_results.dtypes)

(47379, 9)
Index(['date', 'home_team', 'away_team', 'home_score', 'away_score',
       'tournament', 'city', 'country', 'neutral'],
      dtype='object')
date           object
home_team      object
away_team      object
home_score    float64
away_score    float64
tournament     object
city           object
country        object
neutral          bool
dtype: object


In [6]:
# Check for missing values
match_results.isnull().sum()

date           0
home_team     25
away_team     25
home_score    73
away_score    73
tournament     0
city           0
country        0
neutral        0
dtype: int64

In [7]:
# Drop rows with missing values in critical columns
cleaned_match_results= match_results.dropna(subset= ['home_team', 'away_team', 'home_score', 'away_score'])

# Verify that there are no missing values in new dataset and check the shape of the dataset
print(cleaned_match_results.isnull().sum())
print(cleaned_match_results.shape)

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64
(47306, 9)


In [8]:
# Convert 'date' to datetime format
cleaned_match_results['date'] = pd.to_datetime(cleaned_match_results['date'], errors='coerce')

# Convert 'home_score' and 'away_score' to integers
cleaned_match_results['home_score'] = cleaned_match_results['home_score'].astype(int)
cleaned_match_results['away_score'] = cleaned_match_results['away_score'].astype(int)

# Verify changes
print("Updated Data Types:\n", cleaned_match_results.dtypes)
print("Sample Data:\n", cleaned_match_results.head())

Updated Data Types:
 date          datetime64[ns]
home_team             object
away_team             object
home_score             int64
away_score             int64
tournament            object
city                  object
country               object
neutral                 bool
dtype: object
Sample Data:
         date home_team away_team  home_score  away_score tournament     city  \
0 1872-11-30  Scotland   England           0           0   Friendly  Glasgow   
1 1873-03-08   England  Scotland           4           2   Friendly   London   
2 1874-03-07  Scotland   England           2           1   Friendly  Glasgow   
3 1875-03-06   England  Scotland           2           2   Friendly   London   
4 1876-03-04  Scotland   England           3           0   Friendly  Glasgow   

    country  neutral  
0  Scotland    False  
1   England    False  
2  Scotland    False  
3   England    False  
4  Scotland    False  


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['date'] = pd.to_datetime(cleaned_match_results['date'], errors='coerce')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['home_score'] = cleaned_match_results['home_score'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['a

## **Define Success Metric and Target Variable**

Before proceeding with managing data types and further preprocessing, it's essential to define the success metric and the target variable for our analysis.

## **Success Metric**

For this analysis, we are interested in understanding how home advantage influences match outcomes. A suitable success metric would be the match outcome (win, loss, draw) for the home team. We can create a categorical variable that captures this information.

## **Target Variable**

We will create a target variable, home_result, that categorizes the match outcomes for the home team:

* win: If the home team’s score is greater than the away team’s score.
* loss: If the home team’s score is less than the away team’s score.
* draw: If the home team’s score is equal to the away team’s score.

In [9]:
# Create the target variable 'home_result'
def determine_result(row):
  if row['home_score'] > row['away_score']:
    return 'win'
  elif row['home_score'] < row['away_score']:
    return 'loss'
  else:
    return 'draw'

cleaned_match_results['home_result']= cleaned_match_results.apply(determine_result, axis=1)

# View the new column
cleaned_match_results.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['home_result']= cleaned_match_results.apply(determine_result, axis=1)


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,home_result
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,draw
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,win
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,win
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,draw
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,win
5,1876-03-25,Scotland,Wales,4,0,Friendly,Glasgow,Scotland,False,win
6,1877-03-03,England,Scotland,1,3,Friendly,London,England,False,loss
7,1877-03-05,Wales,Scotland,0,2,Friendly,Wrexham,Wales,False,loss
8,1878-03-02,Scotland,England,7,2,Friendly,Glasgow,Scotland,False,win
9,1878-03-23,Scotland,Wales,9,0,Friendly,Glasgow,Scotland,False,win


In [10]:
# Generate summary statistics for numerical columns
cleaned_match_results.describe()

Unnamed: 0,date,home_score,away_score
count,47306,47306.0,47306.0
mean,1993-01-27 03:06:45.005707520,1.760791,1.183359
min,1872-11-30 00:00:00,0.0,0.0
25%,1979-07-25 06:00:00,1.0,0.0
50%,1999-07-01 00:00:00,1.0,1.0
75%,2011-12-02 00:00:00,2.0,2.0
max,2024-06-19 00:00:00,31.0,21.0
std,,1.775957,1.402171


## **Summary Statistics**

The summary statistics of the dataset provide a comprehensive overview of the numerical variables, specifically focusing on the home_score and away_score columns. The dataset contains a total of 47,306 matches, spanning from November 30, 1872, to June 19, 2024.

On average, home teams score approximately 1.76 goals per match, whereas away teams score about 1.18 goals per match. This difference suggests a potential home advantage in scoring. The standard deviation for home and away scores is 1.78 and 1.40, respectively, indicating variability in the number of goals scored in different matches.

The 25th percentile for home teams is 1 goal, while the 25th percentile for away teams is 0 goals, suggesting that in a significant number of matches, home teams are more likely to score at least once while away teams might not score at all.

The median (50th percentile) score for both home and away teams is 1 goal, reflecting a central tendency around this value. The 75th percentile for home teams is 2 goals, and for away teams, it is also 2 goals, showing that higher-scoring games are not uncommon.

The maximum score recorded is 31 goals for home teams and 21 goals for away teams, indicating some matches with extremely high goal counts, which are likely outliers. Overall, these summary statistics suggest that home teams generally have a scoring advantage over away teams, which aligns with the concept of home advantage in sports.

### **Home Advantage Feature Explanation**

In the dataset, we have created a new feature called `home_advantage` to indicate whether a match is played at home or at a neutral venue. We did not take into account the host country of the tournament specifically. Instead, we focused on the home team playing home games. This approach simplifies the analysis by concentrating on the direct influence of playing at home versus in neutral venues in the UEFA Euro tournaments throughout the years.The `home_advantage` feature is encoded as follows:

- **1**: The match is played at home (the home team is playing in their home city).
- **2**: The match is played at a neutral venue (neither team is playing in their home city).

This feature helps in analyzing the influence of the location of the match on the match outcome. By including the `home_advantage` feature in our analysis, we can better understand if playing at home provides a significant advantage to the home team.


In [16]:
# Create a binary feature indicating if the match is played at home (1) or not (0)
cleaned_match_results['home_advantage'] = np.where(cleaned_match_results['neutral'] == True, 2, 1)

# Display the resulting DataFrame
cleaned_match_results.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['home_advantage'] = np.where(cleaned_match_results['neutral'] == True, 2, 1)


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,home_result,home_advantage
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,draw,1
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,win,1
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,win,1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,draw,1
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,win,1


In [None]:
# Create the goal_difference column
cleaned_match_results['goal_difference'] = cleaned_match_results['home_score'] - cleaned_match_results['away_score']

# Verify the new column
cleaned_match_results.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['goal_difference'] = cleaned_match_results['home_score'] - cleaned_match_results['away_score']


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,home_result,home_advantage,goal_difference
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,draw,1,0
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,win,1,2
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,win,1,1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,draw,1,0
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,win,1,3


### **Machine Learning Models**

Our machine learning model will quantitatively analyze the relationship between home advantage and match outcomes. Specifically, the model aims to:

1. **Quantify the Influence:** Measure the extent to which playing at home affects the likelihood of winning, losing, or drawing a match.

2. **Predict Match Outcomes:** Predict the probability of different match outcomes (win, loss, draw) based on whether a team is playing at home or away.

3. **Identify Key Features:** Identify and quantify the importance of various features (such as home advantage, tournament type, and other contextual factors) that influence match outcomes.

### Goals of the Machine Learning Model

1. **Quantification of Home Advantage:**
    * Use regression models to determine the impact of home advantage on the goal difference (home_score - away_score).
    * Use classification models to predict the likelihood of match outcomes (win, loss, draw) based on whether a team is playing at home or away.

2. **Prediction of Match Outcomes:**
    * Develop a classification model (e.g., Logistic Regression) to predict the match outcome (win, loss, draw) using features such as home/away status, team strength, and other relevant variables.

3. **Feature Importance Analysis:**
    * Use models like Lasso Regression to perform feature selection and identify which factors contribute most significantly to match outcomes.

### Proposed Models

1. **Linear Regression:** To quantify the impact of home advantage on the goal difference.

2. **Logistic Regression:** To predict the likelihood of match outcomes (win, loss, draw) based on home/away status.

3. **Lasso Regression:** To perform feature selection and quantify the importance of various features.

### Expected Outcomes

1. **Regression Analysis:** A regression model will provide coefficients that quantify the impact of home advantage on the goal difference.

2. **Classification Analysis:** A classification model will predict the probabilities of win, loss, and draw for matches, allowing us to see how home advantage influences these probabilities.

3. **Feature Importance:** The Lasso Regression model will highlight the most important features influencing match outcomes, including the significance of playing at home.

By building these models, we aim to provide a comprehensive analysis of how home advantage influences match outcomes in UEFA EURO tournaments and quantify its impact in a statistically rigorous manner.

In [None]:
# Define features and target feature for Linear Regression Model
features= cleaned_match_results[['home_advantage']]
target= cleaned_match_results[['goal_difference']]

# Verify the feature and the target
print('Features:\n', features.head())
print('Target:\n', target)

Features:
    home_advantage
0               1
1               1
2               1
3               1
4               1
Target:
        goal_difference
0                    0
1                    2
2                    1
3                    0
4                    3
...                ...
47301                2
47302                0
47303                0
47304                0
47305                8

[47306 rows x 1 columns]


In [None]:
# Split the data for the Linear Regression model
X_train, X_test, y_train, y_test= train_test_split(features, target, test_size= 0.2, random_state= 42)

# Verify the split
print('Training Set Size:', X_train.shape[0])
print('Testing Set Size:', X_test.shape[0])

Training Set Size: 37844
Testing Set Size: 9462


In [None]:
# Initialize the Linear Regression Model
lin_reg= LinearRegression()

# Train the model
lin_reg.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred= lin_reg.predict(X_test)

In [None]:
# Evaluate the Linear Regression model
mse= mean_squared_error(y_test, y_pred)
r2= r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Display the model coefficients
print("Coefficients:", lin_reg.coef_)
print("Intercept:", lin_reg.intercept_)

Mean Squared Error: 5.742265203183715
R-squared: 0.004699908299546474
Coefficients: [[0.35882926]]
Intercept: [0.31523046]


## **Evaluation of the Linear Regression Model**

The Linear Regression model was trained to quantify the impact of home advantage on the goal difference in UEFA EURO tournaments. The performance of the model was evaluated using the Mean Squared Error (MSE) and the R-squared value.

### Model Performance Metrics

**Mean Squared Error (MSE)**: 5.742265203183715

The MSE measures the average squared difference between the actual and predicted values. A lower MSE indicates better model performance. In this case, the MSE value of approximately 5.74 suggests that there is some error in the model's predictions, but without a benchmark, it's challenging to determine if this is a good or bad value.

**R-squared (R²)**: 0.004699908299546474

The R-squared value indicates the proportion of the variance in the dependent variable (goal difference) that is predictable from the independent variable (home advantage). An R-squared value close to 1 indicates that a large proportion of the variance is explained by the model, while a value close to 0 indicates that the model does not explain much of the variance. In this case, the R-squared value is approximately 0.0047, suggesting that home advantage explains only a very small portion of the variance in the goal difference.

**Model Coefficient:** 0.35882926

The coefficient for the home_advantage feature is approximately 0.36. This means that playing at home is associated with an increase of about 0.36 goals in the goal difference, on average. While this indicates a positive impact of home advantage on the goal difference, the effect size is relatively small.

**Intercept:** 0.31523046

The intercept value is approximately 0.32. This represents the predicted goal difference when the home advantage is zero (i.e., when the match is played at a neutral venue). It indicates that even without home advantage, teams are predicted to have a slight positive goal difference.

The Linear Regression model indicates that home advantage has a positive impact on the goal difference in UEFA EURO tournaments, with a coefficient of approximately 0.36. However, the low R-squared value suggests that home advantage alone does not explain much of the variability in the goal difference. This implies that other factors also play significant roles in determining the outcome of matches.

In [None]:
# Prepare the data for Logistic Regression. Encode the target variable
cleaned_match_results['home_result_encoded']= cleaned_match_results['home_result'].map({'win': 1, 'draw': 0, 'loss': -1})

# Select features and target for Logistic Regresion. We'll start with 'home_advantage' and add more features later if needed
features_classification= cleaned_match_results[['home_advantage']]
target_classification= cleaned_match_results[['home_result_encoded']]

# Verify the features and target
print("Features:\n", features_classification.head())
print("Target:\n", target_classification.head())

Features:
    home_advantage
0               1
1               1
2               1
3               1
4               1
Target:
    home_result_encoded
0                    0
1                    1
2                    1
3                    0
4                    1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_match_results['home_result_encoded']= cleaned_match_results['home_result'].map({'win': 1, 'draw': 0, 'loss': -1})


In [None]:
# Split the data for the Logistic Regression model
X_train_class, X_test_class, y_train_class, y_test_class= train_test_split(features_classification, target_classification, test_size= 0.2, random_state= 42)

# Verify the split
print("Training Set Size:", X_train_class.shape[0])
print("Testing Set Size:", X_test_class.shape[0])

Training Set Size: 37844
Testing Set Size: 9462


In [None]:
# Initialize the Logistic Regression model
log_reg= LogisticRegression()

# Train and fit the model
log_reg.fit(X_train_class, y_train_class)

  y = column_or_1d(y, warn=True)


In [None]:
# Make predictions on the test set
y_pred_class= log_reg.predict(X_test_class)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test_class, y_pred_class)
conf_matrix = confusion_matrix(y_test_class, y_pred_class)
class_report = classification_report(y_test_class, y_pred_class)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Accuracy: 0.49323610230395265
Confusion Matrix:
 [[   0    0 2698]
 [   0    0 2097]
 [   0    0 4667]]
Classification Report:
               precision    recall  f1-score   support

          -1       0.00      0.00      0.00      2698
           0       0.00      0.00      0.00      2097
           1       0.49      1.00      0.66      4667

    accuracy                           0.49      9462
   macro avg       0.16      0.33      0.22      9462
weighted avg       0.24      0.49      0.33      9462



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Logistic Regression Model

The Logistic Regression model was trained to predict match outcomes (win, loss, draw) based on home advantage. The performance of the model was evaluated using accuracy, confusion matrix, and classification report. The model achieved an accuracy of approximately 49.32%, indicating that it correctly predicts the match outcome in about half of the cases.

The confusion matrix reveals that the model failed to correctly predict any loss or draw outcomes, as it predicted all matches as win. This is evident from the counts in the confusion matrix, where there are 0 true positive, true negative, or false positive predictions for loss and draw classes. All predictions fell into the win category.

The classification report provides further insights into the model's performance. The precision for the win class is 0.49, meaning that 49% of the matches predicted as wins are actually wins. However, the precision for loss and draw is 0.00, indicating that the model did not predict any matches as losses or draws. The recall for the win class is 1.00, meaning that the model correctly identified all actual wins. Conversely, the recall for loss and draw is 0.00, indicating that the model failed to identify any losses or draws. The F1-score, which is the harmonic mean of precision and recall, is 0.66 for the win class but 0.00 for loss and draw.

The model's performance indicates a strong bias towards predicting wins and failing to capture losses and draws, highlighting the need for additional features and possibly more complex models to improve prediction accuracy across all match outcomes.

The Logistic Regression model did not perform as expected and failed to predict outcomes accurately using the available features. This indicates that the features used (home advantage and match scores) are either insufficient or not appropriately capturing the nuances needed to predict match outcomes.

In [None]:
# Prepare the data for the Lasso Regression Model. Select relevant features
features_lasso= cleaned_match_results[['home_advantage', 'home_score', 'away_score']]

# Standardize the features
scaler= StandardScaler()
features_lasso_scaled= scaler.fit_transform(features_lasso)

# Define the target variable for Lasso Regression
target_lasso= target_classification

# Verify the new features
print("New Features:\n", features_lasso.head())
print("Target:\n", target_lasso.head())

New Features:
    home_advantage  home_score  away_score
0               1           0           0
1               1           4           2
2               1           2           1
3               1           2           2
4               1           3           0
Target:
    home_result_encoded
0                    0
1                    1
2                    1
3                    0
4                    1


In [None]:
# Split the data
X_train_lasso, X_test_lasso, y_train_lasso, y_test_lasso = train_test_split(features_lasso_scaled, target_lasso, test_size=0.2, random_state=42)

# Verify the split
print("Training Set Size:", X_train_lasso.shape[0])
print("Testing Set Size:", X_test_lasso.shape[0])

Training Set Size: 37844
Testing Set Size: 9462


In [None]:
# Initialize the Lasso Regression model
lasso_reg= Lasso(alpha= 0.1)

# Train and fit the model
lasso_reg.fit(X_train_lasso, y_train_lasso)

In [None]:
# Make predictions on the test set
y_pred_lasso= lasso_reg.predict(X_test_lasso)

In [None]:
# Evaluate the model
mse = mean_squared_error(y_test_lasso, y_pred_lasso)
r2 = r2_score(y_test_lasso, y_pred_lasso)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Display the model coefficients
print("Coefficients:", lasso_reg.coef_)
print("Intercept:", lasso_reg.intercept_)

Mean Squared Error: 0.3029158389491835
R-squared: 0.5879104854577007
Coefficients: [ 0.          0.33682368 -0.3672747 ]
Intercept: [0.20771025]


### **Conclusion**

**Best Model**

Based on the evaluation metrics, the Lasso Regression model emerged as the best-performing model for this analysis. The model achieved an R-squared value of approximately 0.59, indicating that about 59% of the variability in the goal difference can be explained by the model's predictors. This level of explanatory power highlights the importance of home and away scores as significant predictors in understanding goal differences. However, the remaining 41% of unexplained variability suggests the presence of other influential factors not captured by the model.

**Answering the Target Question**

The primary objective of this analysis was to determine whether home advantage significantly influences match outcomes in UEFA EURO tournaments. The Lasso Regression model's R-squared value of 0.59 indicates a moderate level of explanatory power. However, the R-squared value also implies that home advantage, while being a factor, is not a predominant determinant of match outcomes.

The analysis suggests that home advantage alone does not significantly influence match outcomes in UEFA EURO tournaments. This finding is further supported by the model's accuracy, which confirms that other factors such as team performance, strategy, and contextual variables play a more crucial role in determining match outcomes. The moderate explanatory power of the model indicates that factors beyond home and away scores are likely impacting the goal differences observed in matches.

This insight can be invaluable for coaches, players, and team strategists. It suggests that focusing solely on the location of the match (home or neutral) is not sufficient to predict or influence match outcomes significantly. Instead, more emphasis should be placed on improving overall team performance, developing effective strategies, and considering contextual variables such as player conditions, weather, and tactical decisions. By broadening the focus beyond home advantage, teams can better prepare and position themselves for success in future UEFA EURO tournaments.

## Run this to save the notebook to DagsHub 👇

In [18]:
# Install the DagsHub python client
!pip install -q dagshub


[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.7/233.7 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.5/71.5 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m238.4/238.4 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m74.0/74.0 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.2/139.2 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.7/62.7 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.6/69.6 kB[0m [31m6.0 M

In [19]:
from dagshub.notebook import save_notebook

save_notebook(repo="carlosrod723/TunisiaLocalChapter_UEFAEURO2024", path=".")

Output()



Open the following link in your browser to authorize the client:
https://dagshub.com/login/oauth/authorize?state=80532f26-9519-4250-8c4c-94b401258fad&client_id=32b60ba385aa7cecf24046d8195a71c07dd345d9657977863b52e7748e0f0f28&middleman_request_id=36e5e82f92d00b8738c47822d44f38590327801d96867b633ce2ffb7f0053251


