# **UEFA EURO DATA**

---

**NOTEBOOK 2: UEFA EURO DATA ANALYSIS**

---

**AUTHORS**

---

- Elmander
- Edifon Emmanuel Jimmy



**TABLE OF CONTENTS**

---

1.   STEP 1: MACHINE LEARNING.

     STEP 1 - PHASE 1: Train Model.

     STEP 1 - PHASE 2: Manage Oversampling.

     STEP 1 - PHASE 3: Perform GridSearch.

     STEP 1 - PHASE 4: Perform Cross-Validation.

     STEP 1 - PHASE 5: Check Data Description.

     STEP 1 - PHASE 6: Test Model Accuracy.

     STEP 1 - PHASE 7: Forecast Future Results.

2.   STEP 2: MODEL OUTPUT VISUALIZATION.
     
     STEP 2 - PHASE 1: Create Data Time Frame.

     STEP 2 - PHASE 2: Load Teams Within Data Time Frame.

     STEP 2 - PHASE 3: Drop Teams With Inconsistent Data.

     STEP 2 - PHASE 4: Create Widgets to Display Data.

     STEP 2 - PHASE 5: Display Team Data With Widgets.

**PROBLEM STATEMENT**

---

- Feature Engineering of UEFA Euro Data for Machine Learning.
- Machine Learning of UEFA Euro Data for Determining if Home Team Has Advantage.

**SOLUTION**

---

- The following schematic guides the work in this notebook.

---

![Workflow.png](attachment:Workflow.png)

---




**REFERENCES**

---

For more information about the datasets used in this notebook, see the documentation at the following links:

**DOCUMENT 1**: [GITHUB DATA](https://github.com/martj42/international_results)

**DOCUMENT 2**: [KAGGLE DATA](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017?select=results.csv)

**TO DO LIST**

---

1. Train Model. ✅
2. Manage Oversampling. ✅
3. Perform GridSearch. ✅
4. Perform Cross-Validation. ✅
5. Check Data Description. ✅
6. Test Model Accuracy. ✅
7. Forecast Future Results. ✅
8. Create Data Time Frame. ✅
9. Load Teams Within Data Time Frame. ✅
10. Drop Teams With Inconsistent Data. ✅
11. Create Widgets to Display Data. ✅
12. Display Team Data With Widgets. ✅

---

In [1]:
# Import necessary libraries

import requests
import numpy as np
import pandas as pd
import joblib, os, warnings
from bs4 import BeautifulSoup
import plotly.express as px
import ipywidgets as widgets
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from IPython.display import display
from datetime import datetime, timedelta
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score
from ipywidgets import GridBox, Layout, HTML
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score

In [2]:
def GetData(url, data_name:str="clean-data.csv"):
    """
    Retrieves a CSV file from a specified URL based on the provided data_name,
    downloads it locally, and returns its contents as a pandas DataFrame.
    """
    def DownloadFile(url, file_name):
        """
        Downloads a file from a given URL and saves it locally with the specified file_name.
        Returns True if successful, False otherwise.
        """
        response = requests.get(url)
        if response.status_code == 200:
            with open(file_name, "wb") as file:
                file.write(response.content)
            print(f"{file_name} downloaded successfully!")
            return True
        else:
            print(f"Failed to download {file_name}. Status code: {response.status_code}")
            return False

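    # Fetch the repository listing page and look for a link named data_name
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    for entry in soup.find_all("div", class_="name"):
        a_tag = entry.find("a")
        if a_tag is not None and a_tag.text == data_name:
            # The raw file URL is pinned to a specific commit of the repository
            raw_file_url = "https://dagshub.com/Omdena/TunisiaLocalChapter_UEFAEURO2024/raw/b923aa044d0f9b8abf024c402fdee1217eeeb19b/task2-data-analysis/Cleaned_Datsets/{}".format(data_name)
            if DownloadFile(raw_file_url, data_name):
                return pd.read_csv(data_name)

    # Fail loudly instead of silently returning None
    raise FileNotFoundError(f"Could not find or download {data_name} from {url}")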

In [3]:
# Load the preprocessed data
url = "https://dagshub.com/Omdena/TunisiaLocalChapter_UEFAEURO2024/src/main/task2-data-analysis/Cleaned_Datsets"
data = GetData(url)

# Drop unnecessary columns
data = data.drop(['Tournament'], axis=1)

# Label-encode categorical variables
# (the encoder is re-fit for every column, so codes are not shared between columns)
label_encoder = LabelEncoder()
data['HomeTeamLabel'] = label_encoder.fit_transform(data['HomeTeam'])
print(label_encoder.classes_)
data['AwayTeamLabel'] = label_encoder.fit_transform(data['AwayTeam'])
print(label_encoder.classes_)
data['CityVenue'] = label_encoder.fit_transform(data['CityVenue'])
print(label_encoder.classes_)
data['CountryVenue'] = label_encoder.fit_transform(data['CountryVenue'])
print(label_encoder.classes_)
data['VenueNeutrality'] = label_encoder.fit_transform(data['VenueNeutrality'])
print(label_encoder.classes_)

output_path = 'Data/model-data/model-data.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
data.to_csv(output_path, index=False)
data.head(5)

clean-data.csv downloaded successfully!
['Albania' 'Austria' 'Belgium' 'Bulgaria' 'Croatia' 'Czech Republic'
 'Czechoslovakia' 'Denmark' 'England' 'Finland' 'France' 'Germany'
 'Greece' 'Hungary' 'Iceland' 'Italy' 'Latvia' 'Netherlands'
 'Northern Ireland' 'Norway' 'Poland' 'Portugal' 'Republic of Ireland'
 'Romania' 'Russia' 'Scotland' 'Serbia' 'Slovakia' 'Slovenia' 'Spain'
 'Sweden' 'Switzerland' 'Turkey' 'Ukraine' 'Wales' 'Yugoslavia']
['Albania' 'Austria' 'Belgium' 'Bulgaria' 'Croatia' 'Czech Republic'
 'Czechoslovakia' 'Denmark' 'England' 'Finland' 'France' 'Georgia'
 'Germany' 'Greece' 'Hungary' 'Iceland' 'Italy' 'Latvia' 'Netherlands'
 'North Macedonia' 'Northern Ireland' 'Norway' 'Poland' 'Portugal'
 'Republic of Ireland' 'Romania' 'Russia' 'Scotland' 'Serbia' 'Slovakia'
 'Slovenia' 'Spain' 'Sweden' 'Switzerland' 'Turkey' 'Ukraine' 'Wales'
 'Yugoslavia']
['Amsterdam' 'Antwerp' 'Arnhem' 'Aveiro' 'Baku' 'Barcelona' 'Basel'
 'Belgrade' 'Berlin' 'Berne' 'Birmingham' 'Bordeaux' 'Bra

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,CityVenue,CountryVenue,VenueNeutrality,HomeTeamLabel,AwayTeamLabel
0,2024-06-19,Croatia,Albania,2.0,2.0,1,36,6,1,4,0
1,2024-06-19,Germany,Hungary,2.0,0.0,2,77,6,0,11,14
2,2024-06-19,Scotland,Switzerland,1.0,1.0,1,20,6,1,25,33
3,2024-06-18,Portugal,Czech Republic,2.0,1.0,2,43,6,1,21,5
4,2024-06-18,Turkey,Georgia,3.0,1.0,2,23,6,1,32,11
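
In the output above, Germany is coded 11 as a home team while Georgia is coded 11 as an away team, because the encoder is re-fit separately for `HomeTeam` and `AwayTeam`. If one code per team is wanted across both columns, a single mapping can be built from the union of both columns. The snippet below is a minimal sketch of that idea in plain Python; the sample rows and the `team_to_code` name are illustrative assumptions, not part of the notebook.

```python
# Hypothetical sample rows; the notebook reads the real ones from clean-data.csv.
matches = [
    {"HomeTeam": "Germany", "AwayTeam": "Hungary"},
    {"HomeTeam": "Turkey", "AwayTeam": "Georgia"},
    {"HomeTeam": "Croatia", "AwayTeam": "Albania"},
]

# Build one code table from every team that appears in either column.
all_teams = sorted(
    {m["HomeTeam"] for m in matches} | {m["AwayTeam"] for m in matches}
)
team_to_code = {team: code for code, team in enumerate(all_teams)}

# Apply the same table to both columns so a team keeps one code everywhere.
for m in matches:
    m["HomeTeamLabel"] = team_to_code[m["HomeTeam"]]
    m["AwayTeamLabel"] = team_to_code[m["AwayTeam"]]
```

With a shared table, a given integer always refers to the same team, whichever side of the fixture it appears on.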


In [4]:
# Remove warnings
warnings.filterwarnings('ignore')

# Read machine learning data
data = pd.read_csv('Data/model-data/model-data.csv')

# Prepare the features and target variable
X = data[['HomeTeamLabel', 'AwayTeamLabel', 'FTHG', 'FTAG', 'CityVenue', 'CountryVenue', 'VenueNeutrality']]
y = data['FTR']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply over-sampling using SMOTE to handle class imbalance
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Define the parameters for grid search
param_grid = {'C': [0.1, 1, 10]}

# Perform grid search using GridSearchCV
model = LogisticRegression()
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Estimate accuracy with 5-fold cross-validation on the held-out split
# (cross_val_score re-fits the best estimator on each fold of X_test)
accuracy = cross_val_score(grid_search.best_estimator_, X_test, y_test, cv=5).mean() * 100
print("Accuracy: {:.2f}%".format(accuracy))

# Create Models path
output_path = 'Model/model.pkl'
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Save the trained model
joblib.dump(grid_search.best_estimator_, output_path)

Accuracy: 86.81%


['Model/model.pkl']
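
The feature matrix in this cell includes the final score (`FTHG`, `FTAG`), while the target `FTR` appears to be derived from that same score. Judging by the sample rows and by the mapping used later in the notebook (0 = away win, 1 = draw, 2 = home win), the target can be recovered from the two goal columns alone, so the reported accuracy may say more about recovering that rule than about predicting match outcomes. The sketch below checks this on the five sample rows; `result_from_goals` is a hypothetical helper and the encoding is an assumption inferred from the notebook.

```python
def result_from_goals(home_goals: float, away_goals: float) -> int:
    """Hypothetical helper: derive the result code from the final score.

    Assumes the encoding implied by the notebook:
    0 = away win, 1 = draw, 2 = home win.
    """
    if home_goals > away_goals:
        return 2
    if home_goals < away_goals:
        return 0
    return 1


# (FTHG, FTAG, FTR) for the five sample rows printed by the data-loading cell.
sample_rows = [
    (2.0, 2.0, 1),
    (2.0, 0.0, 2),
    (1.0, 1.0, 1),
    (2.0, 1.0, 2),
    (3.0, 1.0, 2),
]

# Compare the rule-based result with the stored FTR value for each row.
matches_rule = [result_from_goals(h, a) == ftr for h, a, ftr in sample_rows]
print(all(matches_rule))  # prints True
```

If the goal is to forecast results before a match is played, one option is to drop `FTHG` and `FTAG` from the features and compare the accuracy.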

In [5]:
# Load the trained model
model = joblib.load('Model/model.pkl')

# Load the data to predict on (the same matches the model was trained on)
future_data1 = pd.read_csv('Data/model-data/model-data.csv')

# Get the required columns for prediction (copy to avoid modifying a view of future_data1)
prediction_features = ['HomeTeamLabel', 'AwayTeamLabel', 'FTHG', 'FTAG', 'CityVenue', 'CountryVenue', 'VenueNeutrality']
future_data = future_data1[prediction_features].copy()

# Make predictions on the future data
predictions = model.predict(future_data)

# Add the 'Winner' column to the future_data DataFrame
future_data['Winner'] = predictions

# Get the date and team names for each row from the loaded data (indices match future_data)
team_names = future_data1[['Date', 'HomeTeam', 'AwayTeam']]

# Replace the predicted class codes with labels: 0 = away team, 2 = home team, 1 = draw
# (cast to object first so team names can be stored in the former integer column)
future_data['Winner'] = future_data['Winner'].astype(object)
future_data.loc[future_data['Winner'] == 0, 'Winner'] = team_names['AwayTeam']
future_data.loc[future_data['Winner'] == 2, 'Winner'] = team_names['HomeTeam']
future_data.loc[future_data['Winner'] == 1, 'Winner'] = 'Draw'

# Concatenate the team names with the future_data DataFrame
output = pd.concat([team_names, future_data], axis=1)

# Rearrange the columns
column_order = ['Date', 'HomeTeamLabel', 'HomeTeam', 'FTHG', 'FTAG', 'AwayTeam', 'AwayTeamLabel', 'Winner']
output = output[column_order]

# Print the predicted winners along with the other columns
output.head(5)

Unnamed: 0,Date,HomeTeamLabel,HomeTeam,FTHG,FTAG,AwayTeam,AwayTeamLabel,Winner
0,2024-06-19,4,Croatia,2.0,2.0,Albania,0,Draw
1,2024-06-19,11,Germany,2.0,0.0,Hungary,14,Germany
2,2024-06-19,25,Scotland,1.0,1.0,Switzerland,33,Draw
3,2024-06-18,21,Portugal,2.0,1.0,Czech Republic,5,Portugal
4,2024-06-18,32,Turkey,3.0,1.0,Georgia,11,Turkey


In [6]:
# Count the occurrences of each team in the 'Winner' column, excluding 'Draw' values
winner_counts = future_data.loc[future_data['Winner'] != 'Draw', 'Winner'].value_counts()

# Create a DataFrame with the team names and their counts
team_counts = pd.DataFrame({'Team': winner_counts.index, 'Wins': winner_counts.values})

# Sort the DataFrame in descending order based on the number of wins
team_counts = team_counts.sort_values('Wins', ascending=False)

# Reset the index of the DataFrame and set it to start counting from 1
team_counts.index = range(1, len(team_counts) + 1)

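# Calculate the total number of appearances for each team
# (fill_value=0 keeps teams that appear only as home or only as away from becoming NaN)
total_appearances = output['HomeTeam'].value_counts().add(output['AwayTeam'].value_counts(), fill_value=0)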

# Calculate the Win Rate (%) for each team
team_counts['Appearances'] = total_appearances[team_counts['Team']].values
team_counts['Win Rate (%)'] = round((team_counts['Wins'] / team_counts['Appearances']) * 100, 2)

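# Weight each win rate by the team's appearances relative to the team with the most appearances
# (algebraically this equals Wins / max_appearances)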
max_appearances = team_counts['Appearances'].max()
team_counts['Weighted Win Rate (%)'] = round(((team_counts['Wins'] / team_counts['Appearances']) * (team_counts['Appearances'] / max_appearances)) * 100, 2)

# Sort the DataFrame in descending order based on the Weighted Win Rate (%)
team_counts = team_counts.sort_values('Weighted Win Rate (%)', ascending=False)

# Reset the index of the DataFrame and set it to start counting from 1
team_counts.index = range(1, len(team_counts) + 1)

team_counts.head(5)

Unnamed: 0,Team,Wins,Appearances,Win Rate (%),Weighted Win Rate (%)
1,Germany,29,55.0,52.73,52.73
2,France,22,44.0,50.0,40.0
3,Spain,22,47.0,46.81,40.0
4,Italy,22,46.0,47.83,40.0
5,Netherlands,21,40.0,52.5,38.18
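
The weighted win rate multiplies the win rate by `Appearances / max_appearances`, so the appearances cancel and the value reduces to `Wins / max_appearances`. The short check below reproduces three rows of the table above with both forms; it assumes Germany's 55 appearances is the maximum, which is consistent with its weighted rate equalling its plain win rate.

```python
# Wins and appearances for three teams, taken from the table above.
teams = {"Germany": (29, 55), "France": (22, 44), "Netherlands": (21, 40)}

# Assumption: Germany's 55 appearances is the maximum in the full table.
max_appearances = max(apps for _, apps in teams.values())

weighted = {}
simplified = {}
for name, (wins, apps) in teams.items():
    # Formula as written in the notebook.
    weighted[name] = round((wins / apps) * (apps / max_appearances) * 100, 2)
    # Equivalent form after the appearances cancel.
    simplified[name] = round(wins / max_appearances * 100, 2)

print(weighted)
```

Because the appearances cancel, this column ranks teams by total wins; it is effectively the win count rescaled by a constant.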


In [7]:
# Set the start Date and end Date for the forecast
years = 30
start_date = datetime.now().replace(day=1, month=1) + timedelta(days=365)
end_date = datetime.strptime(output['Date'].max(), '%Y-%m-%d') + timedelta(days=365 * years)

# Create a date range between the start Date and end Date with a frequency of 1 month
date_range = pd.date_range(start=start_date, end=end_date, freq='MS')

# Create an empty DataFrame to store the forecasted values
forecast = pd.DataFrame(index=date_range)

# Iterate over each team in the team_counts DataFrame
for team in team_counts['Team']:
    # Get the historical data for the current team (copy to avoid modifying a view of output)
    historical_data = output.loc[(output['HomeTeam'] == team) | (output['AwayTeam'] == team)].copy()

    # Convert the 'Date' column to a datetime object
    historical_data['Date'] = pd.to_datetime(historical_data['Date'])

    # Set the 'Date' column as the index of the DataFrame
    historical_data = historical_data.set_index('Date')

    # Resample the historical data to a monthly frequency and count the number of wins for each month
    historical_data = historical_data.resample('MS')['Winner'].apply(lambda x: (x == team).sum())

    # Check if there are at least two values in the historical data
    if len(historical_data) > 1:
        # Build the features (a constant column and the month index) and the monthly win counts
        X = pd.DataFrame({'ones': 1, 'x': range(len(historical_data))})
        y = historical_data.values

        # Fit an AdaBoost regressor to the historical data
        model = AdaBoostRegressor()
        model.fit(X, y)

        # Build the same features for the forecast range
        # (note: the month index restarts at 0 rather than continuing from the historical series)
        X_new = pd.DataFrame({'ones': 1, 'x': range(len(date_range))})

        # Forecast the number of wins for each month in the Date range
        forecast[team] = model.predict(X_new)
    else:
        # Set all forecasted values to zero if there are not enough observations
        forecast[team] = 0
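
The monthly win series used as the regression target comes from resampling a team's matches to month-start frequency and counting the rows where that team is the predicted winner. The sketch below shows that step in isolation with pandas; the three matches are made-up values for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical predicted results for one team; not taken from the dataset.
team = "Germany"
matches = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-06-14", "2024-06-19", "2024-08-05"]),
        "Winner": ["Germany", "Germany", "Draw"],
    }
).set_index("Date")

# Flag the team's wins, then sum the flags per calendar month.
# Months without any match (July here) appear in the result with 0 wins.
is_win = (matches["Winner"] == team).astype(int)
monthly_wins = is_win.resample("MS").sum()

print(monthly_wins)
```

Since tournament matches are concentrated in a few months, most entries of such a series are zero, which is worth keeping in mind when interpreting the forecast.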

In [8]:
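try:
    import google.colab
    # Alias the import so it does not shadow the `output` DataFrame created earlier
    from google.colab import output as colab_output
    colab_output.enable_custom_widget_manager()
except ImportError:
    print("Running code locally")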


# Create a list of years from 2024 to 2050
years = list(range(2024, 2051))

# Set the start year to the first year in the list of years
start_year = years[0]
start_date = pd.to_datetime(f'{start_year}-01-01')

# Create a slider widget for the year input
year_slider = widgets.SelectionSlider(
    options=years,
    value=years[0],
    description='Year:',
    continuous_update=False
)

# Create a line chart for the forecast data
forecast_fig = go.FigureWidget()

# Create a bubble map for the team wins data
team_wins_fig = go.FigureWidget()

# Create an HTML widget for the top 5 countries
top_countries_html = widgets.HTML()

# Create a sample forecast DataFrame for demonstration purposes
# (note: this replaces the `forecast` DataFrame built in the previous cell with placeholder teams)
dates = pd.date_range(start=start_date, periods=365*(years[-1] - start_year + 1))
teams = ['Team A', 'Team B', 'Team C', 'Team D']
forecast_data = {team: pd.Series(range(len(dates)), index=dates) for team in teams}
forecast = pd.DataFrame(forecast_data)

# Create a function to update the forecast chart, bubble map, and top countries table based on the selected year
def update_dashboard(year):
    # Calculate the end date for the forecast based on the selected year
    end_date = start_date + timedelta(days=365 * (year - start_year))

    # Filter the forecast DataFrame based on the end date
    forecast_filtered = forecast.loc[forecast.index <= end_date]

    # Sum the forecasted wins for each team
    total_wins = forecast_filtered.sum()

    # Create a DataFrame with the team names and their forecasted wins
    team_wins = pd.DataFrame({'Team': total_wins.index, 'Wins': total_wins.values})

    # Sort the DataFrame in descending order based on the number of wins
    team_wins = team_wins.sort_values('Wins', ascending=False)

    # Reset the index of the DataFrame and set it to start counting from 1
    team_wins.reset_index(drop=True, inplace=True)
    team_wins.index += 1

    # Update the line chart with the filtered forecast data
    forecast_fig.data = []
    excluded_teams = ['Wales', 'Czech Republic', 'Iceland']
    for col in forecast_filtered.columns:
        if col not in excluded_teams:
            forecast_fig.add_scatter(x=forecast_filtered.index, y=forecast_filtered[col], name=col)
    forecast_fig.update_layout(title='Forecasted Wins for Each Team')

    # Update the bubble map with the updated team_wins DataFrame
    team_wins_fig.data = []
    visible_teams = team_wins.loc[~team_wins['Team'].isin(excluded_teams)]
    team_wins_fig.add_scattergeo(
        locations=visible_teams['Team'],
        locationmode='country names',
        marker=dict(size=visible_teams['Wins'], sizemode='diameter',
                    color=visible_teams['Wins'], colorscale='Viridis', showscale=True),
        text=visible_teams['Team'] + ': ' + visible_teams['Wins'].astype(str) + ' Wins',
        hoverinfo='text'
    )

    # Customize the layout
    team_wins_fig.update_layout(
        title='European Teams and Their Forecasted Wins',
        geo=dict(showframe=False, showcoastlines=False, projection_type='orthographic',
                 showcountries=True, showland=True, landcolor='rgb(243, 243, 243)',
                 showocean=True, oceancolor='rgb(10, 200, 255)'),
        height=600, width=800
    )

    # Update the top countries HTML widget with the top 5 countries, skipping the excluded teams
    top_five = team_wins.loc[~team_wins['Team'].isin(excluded_teams)].head(5)
    header_row = '<tr style="background-color:darkblue;color:white"><th style="text-align:center">Rank</th><th style="text-align:center">Team</th><th style="text-align:center">Wins</th></tr>'
    body_rows = ''.join(
        '<tr><td style="text-align:center">{}</td><td style="text-align:center">{}</td><td style="text-align:center">{}</td></tr>'.format(rank, row['Team'], round(row['Wins']))
        for rank, row in top_five.iterrows()
    )
    top_countries_html.value = '<h3>Top 5 Countries</h3><table style="width:100%;margin:auto;text-align:center">' + header_row + body_rows + '</table>'

# Create an interactive output widget to display the dashboard
dashboard = widgets.interactive_output(update_dashboard, {'year': year_slider})

# Display the year slider and dashboard
display(year_slider, dashboard)

# Display everything in a GridBox container with reduced margin between widgets and vertical layout
display(GridBox([top_countries_html,
                 team_wins_fig,
                 forecast_fig],
                layout=Layout(grid_template_columns='repeat(1, minmax(250px, 1fr))')))

SelectionSlider(continuous_update=False, description='Year:', options=(2024, 2025, 2026, 2027, 2028, 2029, 203…

Output()

GridBox(children=(HTML(value='<h3>Top 5 Countries</h3><table style="width:100%;margin:auto;text-align:center">…
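
The dashboard computes the cut-off date for a selected year as `start_date + timedelta(days=365 * (year - start_year))`. Treating every year as 365 days ignores leap days, so the cut-off drifts earlier as the selected year grows. The sketch below measures that drift for the last slider value, using only the standard library; `exact_end` is an illustrative alternative, not the notebook's method.

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)
target_year = 2050

# Approach used in the notebook: treat every year as 365 days.
approx_end = start_date + timedelta(days=365 * (target_year - start_date.year))

# Calendar-aware alternative: replace the year directly.
exact_end = start_date.replace(year=target_year)

# Number of days by which the 365-day approximation falls short.
drift_days = (exact_end - approx_end).days

print(approx_end.date(), exact_end.date(), drift_days)
```

For the 2024 to 2050 range the drift is one day per leap year in between, which is small for a yearly slider but worth knowing if the cut-off is ever compared against exact calendar dates.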