## Hi There!

This is a personal project. I wanted to do some data analysis in Python and then attempt to predict the Euro 2024 results.

My Goals:
1. Predict at least 50% of the group stage winners
2. Have the predicted winner make it to at least the quarter-finals

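Issues I faced:

Wikipedia marks the host nation in the group table as "Germany (H)". That label does not match the team name used elsewhere in the data, so Germany's group-stage points were not counted and it finished on 0. Additionally, this is Georgia's first Euros and Serbia's first as an independent nation, so neither has historical data, and the model awards no points in matches involving them.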
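
One way to sidestep the host-nation mismatch is to strip the marker before team names are compared. The sketch below is not part of the original notebook: `normalise_team` is a hypothetical helper, and it assumes the marker always appears as a trailing `(H)`.

```python
import re

def normalise_team(name: str) -> str:
    """Strip surrounding whitespace and a trailing '(H)' host marker.

    Hypothetical helper; assumes the host marker is always a trailing '(H)'.
    """
    return re.sub(r'\s*\(H\)\s*$', '', name.strip())

teams = ['Germany (H)', ' Scotland ', 'Hungary', 'Switzerland']
print([normalise_team(t) for t in teams])
```

With pandas, the same function could be applied to a column with something like `df['Team'].map(normalise_team)` before the tables and fixtures are matched.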

### Extracting tables from the upcoming Euros

In [2]:
#Libraries
import pandas as pd
from string import ascii_uppercase as alphabet
import pickle
from bs4 import BeautifulSoup
import requests
from scipy.stats import poisson

#Import the wikipedia page:
all_tables = pd.read_html('https://en.wikipedia.org/wiki/UEFA_Euro_2024')

#Understanding the correct tables:
all_tables[18]
all_tables[25]
all_tables[32]
all_tables[39]
all_tables[46]
all_tables[53]



Unnamed: 0,Pos,Teamvte,Pld,W,D,L,GF,GA,GD,Pts,Qualification
0,1,Turkey,0,0,0,0,0,0,0,0,Advance to knockout stage
1,2,Georgia,0,0,0,0,0,0,0,0,Advance to knockout stage
2,3,Portugal,0,0,0,0,0,0,0,0,Possible knockout stage based on ranking
3,4,Czech Republic,0,0,0,0,0,0,0,0,
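
The hard-coded indices (18, 25, ...) depend on the page layout at the time of scraping and may shift when the article is edited. A more defensive option is to pick tables by their columns. This is only a sketch: `is_group_table` and `find_group_tables` are hypothetical helpers, and they are demonstrated on small hand-made DataFrames rather than the live page.

```python
import pandas as pd

# Columns a group standings table is expected to contain
# (an assumption based on the output shown above).
REQUIRED = {'Pos', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'}

def is_group_table(df: pd.DataFrame) -> bool:
    """Return True if the table has all of the standings columns."""
    return REQUIRED.issubset(set(map(str, df.columns)))

def find_group_tables(tables):
    """Return the positions of the tables that look like group standings."""
    return [i for i, df in enumerate(tables) if is_group_table(df)]

# Hand-made stand-ins for the list returned by pd.read_html
standings = pd.DataFrame(
    [[1, 'Turkey', 0, 0, 0, 0, 0, 0, 0, 0]],
    columns=['Pos', 'Teamvte', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'],
)
other = pd.DataFrame({'Date': ['14 June'], 'Venue': ['Munich']})

print(find_group_tables([other, standings, other]))  # [1]
```

On the real page this filter may also match other standings-style tables (for example a ranking of third-placed teams), so the result would still need a quick manual check.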


In [3]:
#Create dictionary + assign the group letters to the groups
dict_table = {}
for letter, i in zip(alphabet, range(18,60, 7)):
    df = all_tables[i]
    #Rename from "Teamvte" to Team
    df.rename(columns={df.columns[1]: 'Team'}, inplace=True)
    #Remove the qual column from the group
    df.pop("Qualification")
    dict_table[f'Group {letter}'] = df

In [4]:
dict_table['Group F']

Unnamed: 0,Pos,Team,Pld,W,D,L,GF,GA,GD,Pts
0,1,Turkey,0,0,0,0,0,0,0,0
1,2,Georgia,0,0,0,0,0,0,0,0
2,3,Portugal,0,0,0,0,0,0,0,0
3,4,Czech Republic,0,0,0,0,0,0,0,0


In [5]:
#Use pickle to export the dictionary
with open('dict_table', 'wb') as output:
    pickle.dump(dict_table, output)

### Extracting Football Matches

In [6]:
years = [1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020]

def get_matches(year):
    web = f'https://en.wikipedia.org/wiki/UEFA_Euro_{year}'
    response = requests.get(web)
    content = response.text
    soup = BeautifulSoup(content, 'lxml')

    matches = soup.find_all('div', class_= 'footballbox')

    home = []
    score = []
    away = []


    for match in matches:
        home.append(match.find('th', class_='fhome').get_text())
        score.append(match.find('th', class_='fscore').get_text())
        away.append(match.find('th', class_='faway').get_text())

    dict_football = {'home': home, 'score':score, 'away': away}
    df_football = pd.DataFrame(dict_football)
    df_football['year'] = year
    return df_football

In [32]:
euro = [get_matches(year) for year in years]
df_euro = pd.concat(euro, ignore_index=True)
df_euro.to_csv('euro_historical_data.csv', index=False)

In [10]:
#Fixtures for 2024
df_fixture = get_matches(2024)
df_fixture.to_csv('euro_2024_fixtures.csv', index=False)

### Data Cleaning and Transformation

In [7]:
df_historical_data = pd.read_csv('euro_historical_data.csv')
df_fixture = pd.read_csv('euro_2024_fixtures.csv')

In [8]:
#Cleaning
df_fixture['home'] = df_fixture['home'].str.strip()
df_fixture['away'] = df_fixture['away'].str.strip()

df_historical_data['home'] = df_historical_data['home'].str.strip()
df_historical_data['away'] = df_historical_data['away'].str.strip()

#Keep only digits and the en dash; this removes suffixes such as (a.e.t.)
df_historical_data['score'] = df_historical_data['score'].str.replace(r'[^\d–]', '', regex=True)
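
To make the cleaning step concrete, here is the same pattern applied to a few score strings with the standard `re` module. The sample strings are assumptions about what the scraped text looks like, not values copied from the dataset, and `split_score` is a hypothetical helper.

```python
import re

EN_DASH = '\u2013'  # the separator used in the scraped scores

def split_score(raw: str):
    """Keep only digits and the en dash, then split into (home, away) goals."""
    cleaned = re.sub(rf'[^\d{EN_DASH}]', '', raw)
    home, away = cleaned.split(EN_DASH)
    return int(home), int(away)

for raw in ['2–1', '1–1 (a.e.t.)', ' 0–3\n']:
    print(repr(raw), '->', split_score(raw))
```

A string that contains a second score (a penalty shoot-out result, for instance) would leave extra digits behind and would need additional handling.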

In [9]:
#Splitting the score into home and away goals
df_historical_data[['Home Goal', 'Away Goal']] = df_historical_data['score'].str.split('–', expand=True)

In [10]:
df_historical_data.drop('score', axis=1, inplace=True)

In [12]:
#Renaming Columns + Converting goals from object to int
df_historical_data.rename(columns={'home': 'Home Team', 'away': 'Away Team', 'year': 'Year'}, inplace=True)
df_historical_data = df_historical_data.astype({'Home Goal': int, 'Away Goal': int, 'Year': int})

In [13]:
#Creating new column for total goals
df_historical_data['Total Goals'] = df_historical_data['Home Goal'] + df_historical_data['Away Goal']

### Exporting the Cleaned DF

In [14]:
df_historical_data.to_csv('cleaned_euro_historical_data.csv', index=False)

In [15]:
df_fixture.to_csv('cleaned_euro_2024_fixtures.csv', index=False)

### Time to build the model

In [36]:
#Load the group tables saved earlier (the with block closes the file afterwards)
with open('dict_table', 'rb') as f:
    dict_table = pickle.load(f)
df_historical_data = pd.read_csv('cleaned_euro_historical_data_modified.csv')
df_fixture = pd.read_csv('cleaned_euro_2024_fixtures.csv')

Calculating Team Strength

In [37]:
#split df into df_home and df_away
df_home = df_historical_data[['Home Team', 'Home Goal', 'Away Goal']]
df_away = df_historical_data[['Away Team', 'Home Goal', 'Away Goal']]

In [38]:
df_home = df_home.rename(columns={'Home Team': 'Team','Home Goal': 'Goals Scored', 'Away Goal': 'Goals Conceded' })
df_away = df_away.rename(columns={'Away Team': 'Team','Home Goal': 'Goals Conceded', 'Away Goal': 'Goals Scored' })

In [39]:
#Building the team strength
df_team_strength = pd.concat([df_home, df_away ], ignore_index=True).groupby('Team').mean()
df_team_strength


Team,Goals Scored,Goals Conceded
Albania,0.333333,1.0
Austria,0.7,1.2
Belgium,1.409091,1.272727
Bulgaria,0.666667,2.166667
CIS,0.333333,1.333333
Croatia,1.363636,1.272727
Czech Republic,1.241379,1.275862
Czechoslovakia,1.5,1.25
Denmark,1.272727,1.515152
England,1.342105,0.973684


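Prediction

I am going to model each team's goals with a Poisson distribution. The expected goals for a side are its average goals scored multiplied by the opponent's average goals conceded.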

In [40]:
def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        #expected goals = goals scored * opponent's goals conceded
        lamb_home = df_team_strength.at[home, 'Goals Scored'] * df_team_strength.at[away, 'Goals Conceded']
        lamb_away = df_team_strength.at[away, 'Goals Scored'] * df_team_strength.at[home, 'Goals Conceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0,11):
            for y in range(0,11):
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x  > y:
                    prob_home += p
                else:
                    prob_away += p
        points_home = 3 * prob_home + prob_draw
        points_away = 3* prob_away + prob_draw
        return (points_home, points_away)
    else:
        return (0,0)
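
As a worked example of the calculation, the sketch below repeats it for England against Denmark using the strengths from the table above. It uses a hand-rolled `poisson_pmf` built on the standard `math` module, which should agree with `scipy.stats.poisson.pmf`; the helper names are hypothetical. Like the function above, it assumes the two sides' goal counts are independent and ignores scorelines above 10 goals.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)

def outcome_probabilities(lam_home: float, lam_away: float, max_goals: int = 10):
    """Sum the score grid 0..max_goals into (home win, draw, away win)."""
    p_home = p_draw = p_away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)
            if x > y:
                p_home += p
            elif x == y:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Strengths taken from the team strength table (England as "home", Denmark as "away")
lam_home = 1.342105 * 1.515152  # England goals scored * Denmark goals conceded
lam_away = 1.272727 * 0.973684  # Denmark goals scored * England goals conceded

p_home, p_draw, p_away = outcome_probabilities(lam_home, lam_away)
points_home = 3 * p_home + p_draw  # expected points: 3 for a win, 1 for a draw
points_away = 3 * p_away + p_draw

print(f"expected goals: {lam_home:.2f} vs {lam_away:.2f}")
print(f"home win {p_home:.3f}, draw {p_draw:.3f}, away win {p_away:.3f}")
print(f"expected points: {points_home:.2f} vs {points_away:.2f}")
```

The three probabilities add up to slightly less than 1 because the grid stops at 10 goals per side. For expected goals in this range the missing tail is negligible.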

Predictions

In [73]:
df_fixture_group36 = df_fixture[:36].copy()
df_fixture_KO = df_fixture[36:44].copy()
df_fixture_quarter = df_fixture[44:48].copy()
df_fixture_semi = df_fixture[48:50].copy()
df_fixture_final = df_fixture[50:].copy()



In [42]:
for group in dict_table:
    print(f"Columns in {group}: {dict_table[group].columns}")

Columns in Group A: Index(['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'], dtype='object')
Columns in Group B: Index(['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'], dtype='object')
Columns in Group C: Index(['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'], dtype='object')
Columns in Group D: Index(['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'], dtype='object')
Columns in Group E: Index(['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'], dtype='object')
Columns in Group F: Index(['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'], dtype='object')


In [43]:
#running all the games in the group stage
for group in dict_table:
    #Pts starts as an int column; cast to float so fractional expected points can be added
    dict_table[group]['Pts'] = dict_table[group]['Pts'].astype(float)
    teams_in_group = dict_table[group]['Team'].values
    df_fixture_group_6 = df_fixture_group36[df_fixture_group36['home'].isin(teams_in_group)]
    for index, row in df_fixture_group_6.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        dict_table[group].loc[dict_table[group]['Team'] == home, 'Pts'] += points_home
        dict_table[group].loc[dict_table[group]['Team'] == away, 'Pts'] += points_away

    #Rank the group by expected points and keep only the columns needed later
    dict_table[group] = dict_table[group].sort_values('Pts', ascending=False).reset_index(drop=True)
    dict_table[group] = dict_table[group][['Team', 'Pts']]
    dict_table[group] = dict_table[group].round(0)



In [75]:
from tabulate import tabulate

# Iterate over each group in the dict_table
for group, df in dict_table.items():
    print(f"{group}:")  # the keys already read 'Group A', 'Group B', ...
    print(tabulate(df, headers='keys', tablefmt='psql'))  # Use 'psql' format for a nice table
    print()  # Add a newline for better readability




Group A:
+----+-------------+-------+
|    | Team        |   Pts |
|----+-------------+-------|
|  0 | Hungary     |     3 |
|  1 | Switzerland |     3 |
|  2 | Scotland    |     2 |
|  3 | Germany (H) |     0 |
+----+-------------+-------+

Group B:
+----+---------+-------+
|    | Team    |   Pts |
|----+---------+-------|
|  0 | Spain   |     5 |
|  1 | Italy   |     5 |
|  2 | Croatia |     4 |
|  3 | Albania |     2 |
+----+---------+-------+

Group C:
+----+----------+-------+
|    | Team     |   Pts |
|----+----------+-------|
|  0 | England  |     4 |
|  1 | Denmark  |     2 |
|  2 | Slovenia |     2 |
|  3 | Serbia   |     0 |
+----+----------+-------+

Group D:
+----+-------------+-------+
|    | Team        |   Pts |
|----+-------------+-------|
|  0 | Netherlands |     6 |
|  1 | France      |     5 |
|  2 | Poland      |     3 |
|  3 | Austria     |     2 |
+----+-------------+-------+

Group E:
+----+----------+-------+
|    | Team     |   Pts

Unnamed: 0,Team,Pts
0,Hungary,3.0
1,Switzerland,3.0
2,Scotland,2.0
3,Germany (H),0.0


In [74]:
df_fixture_KO

Unnamed: 0,home,score,away,year
36,Runner-up Group A,Match 38,Runner-up Group B,2024
37,Winner Group A,Match 37,Runner-up Group C,2024
38,Winner Group C,Match 40,3rd Group D/E/F,2024
39,Winner Group B,Match 39,3rd Group A/D/E/F,2024
40,Runner-up Group D,Match 42,Runner-up Group E,2024
41,Winner Group F,Match 41,3rd Group A/B/C,2024
42,Winner Group E,Match 43,3rd Group A/B/C/D,2024
43,Winner Group D,Match 44,Runner-up Group F,2024


In [80]:
for group in dict_table:
    group_winner = dict_table[group].loc[0, 'Team']
    runner_up = dict_table[group].loc[1, 'Team']
    #The third-placed qualifiers are a personal choice, not a model output
    df_fixture_KO.replace({f'Winner {group}': group_winner,f'Runner-up {group}':runner_up, '3rd Group D/E/F': 'Turkey','3rd Group A/D/E/F': 'Germany (H)', '3rd Group A/B/C':'Croatia', '3rd Group A/B/C/D': 'Poland' }, inplace=True)

df_fixture_KO['winner'] = '?'
df_fixture_KO

Unnamed: 0,home,score,away,year,winner
36,Switzerland,Match 38,Italy,2024,?
37,Hungary,Match 37,Denmark,2024,?
38,England,Match 40,Turkey,2024,?
39,Spain,Match 39,Germany (H),2024,?
40,France,Match 42,Romania,2024,?
41,Portugal,Match 41,Croatia,2024,?
42,Belgium,Match 43,Poland,2024,?
43,Netherlands,Match 44,Czech Republic,2024,?


In [81]:
#Pick the winner of each fixture: the side with more expected points (the away side if they are level)
def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home,away)
        if points_home > points_away:
            winner = home
        else:
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated
    

In [82]:
get_winner(df_fixture_KO)


Unnamed: 0,home,score,away,year,winner
36,Switzerland,Match 38,Italy,2024,Italy
37,Hungary,Match 37,Denmark,2024,Denmark
38,England,Match 40,Turkey,2024,England
39,Spain,Match 39,Germany (H),2024,Spain
40,France,Match 42,Romania,2024,France
41,Portugal,Match 41,Croatia,2024,Portugal
42,Belgium,Match 43,Poland,2024,Belgium
43,Netherlands,Match 44,Czech Republic,2024,Netherlands


Quarter-finals

In [83]:
df_fixture_quarter

Unnamed: 0,home,score,away,year
44,Winner Match 39,Match 45,Winner Match 37,2024
45,Winner Match 41,Match 46,Winner Match 42,2024
46,Winner Match 40,Match 48,Winner Match 38,2024
47,Winner Match 43,Match 47,Winner Match 44,2024


In [84]:
def update_table(df_fixture_round_1, df_fixture_round_2):
    for index, row in df_fixture_round_1.iterrows():
        winner = df_fixture_round_1.loc[index, 'winner']
        match = df_fixture_round_1.loc[index, 'score']
        df_fixture_round_2.replace({f'Winner {match}':winner}, inplace=True)
    df_fixture_round_2['winner'] = '?'
    return df_fixture_round_2
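
The placeholder substitution can also be done in a single pass by building the whole mapping first. The sketch below is an alternative to `update_table`, not a replacement for it. It uses two round-of-16 rows and one quarter-final row from the outputs above, and the variable names are hypothetical.

```python
import pandas as pd

# Two round-of-16 results, taken from the output above
round_1 = pd.DataFrame({
    'home': ['Spain', 'Hungary'],
    'score': ['Match 39', 'Match 37'],
    'away': ['Germany (H)', 'Denmark'],
    'winner': ['Spain', 'Denmark'],
})

# The quarter-final that depends on those two matches
round_2 = pd.DataFrame({
    'home': ['Winner Match 39'],
    'score': ['Match 45'],
    'away': ['Winner Match 37'],
})

# Map each placeholder to the team that won that match, then substitute once
placeholders = {f'Winner {match}': winner
                for match, winner in zip(round_1['score'], round_1['winner'])}
round_2 = round_2.replace(placeholders)
round_2['winner'] = '?'
print(round_2)
```

Only cells that exactly equal a placeholder are replaced, so the `score` column ("Match 45") is left untouched.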

In [85]:
update_table(df_fixture_KO, df_fixture_quarter)


Unnamed: 0,home,score,away,year,winner
44,Spain,Match 45,Denmark,2024,?
45,Portugal,Match 46,France,2024,?
46,England,Match 48,Italy,2024,?
47,Belgium,Match 47,Netherlands,2024,?


In [86]:
get_winner(df_fixture_quarter)

Unnamed: 0,home,score,away,year,winner
44,Spain,Match 45,Denmark,2024,Spain
45,Portugal,Match 46,France,2024,Portugal
46,England,Match 48,Italy,2024,Italy
47,Belgium,Match 47,Netherlands,2024,Netherlands


Semi-finals

In [87]:
update_table(df_fixture_quarter, df_fixture_semi)

Unnamed: 0,home,score,away,year,winner
48,Spain,Match 49,Portugal,2024,?
49,Netherlands,Match 50,Italy,2024,?


In [89]:
get_winner(df_fixture_semi)

Unnamed: 0,home,score,away,year,winner
48,Spain,Match 49,Portugal,2024,Spain
49,Netherlands,Match 50,Italy,2024,Italy


Final

In [90]:
update_table(df_fixture_semi, df_fixture_final)

Unnamed: 0,home,score,away,year,winner
50,Spain,Match 51,Italy,2024,?


In [91]:
get_winner(df_fixture_final)

Unnamed: 0,home,score,away,year,winner
50,Spain,Match 51,Italy,2024,Italy


### Predicted winner is ITALY 

They won the previous Euros, so surely they can't do it again.

The model is based only on historical Euros data; it does not take friendlies or other competitions into account.


### Personal Opinion
As a football fan, I predict either France or England to win Euro 2024.