# FIFA Women's World Cup 2023 Prediction using Poisson Distribution

I love football and data science, and this year's FIFA Women's World Cup is the perfect opportunity to combine them in a project after my own heart. I'll use a simple statistical model, the Poisson distribution, to predict the winner of this year's World Cup from historical Women's World Cup match data scraped from Wikipedia (sources are listed at the end).

In [1]:
#imports
import pandas as pd
from string import ascii_uppercase as alphabet
from bs4 import BeautifulSoup
import requests
from scipy.stats import poisson

## Data Collection

The FIFA Women's World Cup started in 1991 and is held every four years. First, I'll scrape the results of past matches from each tournament's Wikipedia page using Beautiful Soup.

In [2]:
# data to scrape -> 1991-2019
years = [1991, 1995, 1999, 2003, 2007, 2011, 2015, 2019]

In [3]:
#function to get data on all matches for a given world cup year
def get_matches(year):
    #Scraping from webpage
    web = f'https://en.wikipedia.org/wiki/{year}_FIFA_Women%27s_World_Cup'
    response = requests.get(web)
    content = response.text
    soup = BeautifulSoup(content, 'lxml') #soup object
    matches = soup.find_all('div', class_='footballbox')

    #empty lists to store data in dictionary
    home = []
    score = []
    away = []

    #Get match details
    for match in matches:
        home.append(match.find('th', class_='fhome').get_text())
        score.append(match.find('th', class_='fscore').get_text())
        away.append(match.find('th', class_='faway').get_text())
    
    #Convert to dictionary and then a pandas dataframe
    dict_football = {'home': home, 'score': score, 'away': away}
    df_football = pd.DataFrame(dict_football)
    df_football['year'] = year
    
    return df_football

In [4]:
#FIFA Historical Data for all years
fifa = [get_matches(year) for year in years] 
df_fifa_history = pd.concat(fifa, ignore_index = True)
df_fifa_history.head()

Unnamed: 0,home,score,away,year
0,China,4–0,Norway,1991
1,Denmark,3–0,New Zealand,1991
2,Norway,4–0,New Zealand,1991
3,China,2–2,Denmark,1991
4,China,4–1,New Zealand,1991


In [5]:
#Verifying that all matches have been scraped
df_fifa_history['year'].value_counts()

2015    52
2019    52
1999    32
2003    32
2007    32
2011    32
1991    26
1995    26
Name: year, dtype: int64
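
These counts match what the tournament formats imply, which can be checked with a little arithmetic. The sketch below assumes groups of four playing a single round-robin (6 matches per group) and a knockout stage that includes a third-place play-off. The team counts per edition are my own summary of the formats, not something scraped, and `expected_matches` is a hypothetical helper.

```python
# Expected number of matches per tournament format.
# Assumptions: groups of 4 (single round-robin) and a knockout stage
# that includes a third-place play-off.

def expected_matches(n_teams, n_knockout_teams):
    group_matches = (n_teams // 4) * 6   # 6 matches per group of 4
    # A field of k teams needs k - 1 elimination matches,
    # plus 1 third-place play-off, so k matches in total.
    knockout_matches = n_knockout_teams
    return group_matches + knockout_matches

# (teams in the tournament, teams reaching the knockout stage)
formats = {
    1991: (12, 8), 1995: (12, 8),
    1999: (16, 8), 2003: (16, 8), 2007: (16, 8), 2011: (16, 8),
    2015: (24, 16), 2019: (24, 16),
}

expected = {year: expected_matches(*fmt) for year, fmt in formats.items()}
print(expected)
```

By the same arithmetic, the 32-team format used in 2023 gives 48 group matches plus 16 knockout matches, which is why the fixture list runs up to Match 64.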

Now that I have the historical data, I need this year's fixtures and group divisions.
Unfortunately (or not!), a few matches had already been played by the time I had the idea for this project. I want my prediction to be unaffected by the rapidly updating Wikipedia page, so I reset all the score values to their match numbers. We haven't reached the knockout stage yet, but the group stage data would likely be affected soon, so I reset those values as well.

In [6]:
#Current FIFA fixture
df_fixture = get_matches(2023)
df_fixture

Unnamed: 0,home,score,away,year
0,New Zealand,1–0,Norway,2023
1,Philippines,0–2,Switzerland,2023
2,New Zealand,0–1,Philippines,2023
3,Switzerland,0–0,Norway,2023
4,Switzerland,0–0,New Zealand,2023
...,...,...,...,...
59,England,2–1,Colombia,2023
60,Spain,Match 61,Sweden,2023
61,Australia,Match 62,England,2023
62,Loser Match 61,Match 63,Loser Match 62,2023


In [7]:
#Resetting scores
score_clean = ['Match 1', 'Match 3', 'Match 17', 'Match 18', 'Match 33', 'Match 34', 'Match 2',
               'Match 4', 'Match 19', 'Match 22', 'Match 35', 'Match 36', 'Match 5', 'Match 6',
               'Match 21', 'Match 20', 'Match 37', 'Match 38', 'Match 7', 'Match 8', 'Match 25',
               'Match 26', 'Match 39', 'Match 40', 'Match 9', 'Match 10', 'Match 23', 'Match 24',
               'Match 41', 'Match 42', 'Match 11', 'Match 13', 'Match 28', 'Match 29', 'Match 43',
               'Match 44', 'Match 12', 'Match 14', 'Match 27', 'Match 30', 'Match 45', 'Match 46',
               'Match 15', 'Match 16', 'Match 32', 'Match 31', 'Match 47', 'Match 48', 'Match 49',
               'Match 50', 'Match 51', 'Match 52', 'Match 54', 'Match 53', 'Match 56', 'Match 55',
               'Match 57', 'Match 58', 'Match 59', 'Match 60', 'Match 61', 'Match 62', 'Match 63', 'Match 64']

In [8]:
df_fixture['score'] = score_clean
df_fixture

Unnamed: 0,home,score,away,year
0,New Zealand,Match 1,Norway,2023
1,Philippines,Match 3,Switzerland,2023
2,New Zealand,Match 17,Philippines,2023
3,Switzerland,Match 18,Norway,2023
4,Switzerland,Match 33,New Zealand,2023
...,...,...,...,...
59,England,Match 60,Colombia,2023
60,Spain,Match 61,Sweden,2023
61,Australia,Match 62,England,2023
62,Loser Match 61,Match 63,Loser Match 62,2023


In [9]:
#Resetting knockout-stage fixtures to their group and match placeholders

df_fixture.loc[df_fixture.score == 'Match 49', ['home', 'away']] = ['Winners Group A', 'Runners-up Group C']
df_fixture.loc[df_fixture.score == 'Match 50', ['home', 'away']] = ['Winners Group C', 'Runners-up Group A']
df_fixture.loc[df_fixture.score == 'Match 51', ['home', 'away']] = ['Winners Group E', 'Runners-up Group G']
df_fixture.loc[df_fixture.score == 'Match 52', ['home', 'away']] = ['Winners Group G', 'Runners-up Group E']
df_fixture.loc[df_fixture.score == 'Match 54', ['home', 'away']] = ['Winners Group D', 'Runners-up Group B']
df_fixture.loc[df_fixture.score == 'Match 53', ['home', 'away']] = ['Winners Group B', 'Runners-up Group D']
df_fixture.loc[df_fixture.score == 'Match 56', ['home', 'away']] = ['Winners Group H', 'Runners-up Group F']
df_fixture.loc[df_fixture.score == 'Match 55', ['home', 'away']] = ['Winners Group F', 'Runners-up Group H']

df_fixture.loc[df_fixture.score == 'Match 57', ['home', 'away']] = ['Winners Match 49', 'Winners Match 51']
df_fixture.loc[df_fixture.score == 'Match 58', ['home', 'away']] = ['Winners Match 50', 'Winners Match 52']
df_fixture.loc[df_fixture.score == 'Match 59', ['home', 'away']] = ['Winners Match 53', 'Winners Match 55']
df_fixture.loc[df_fixture.score == 'Match 60', ['home', 'away']] = ['Winners Match 54', 'Winners Match 56']
df_fixture.loc[df_fixture.score == 'Match 61', ['home', 'away']] = ['Winners Match 57', 'Winners Match 58']
df_fixture.loc[df_fixture.score == 'Match 62', ['home', 'away']] = ['Winners Match 59', 'Winners Match 60']
df_fixture.loc[df_fixture.score == 'Match 63', ['home', 'away']] = ['Losers Match 61', 'Losers Match 62']
df_fixture.loc[df_fixture.score == 'Match 64', ['home', 'away']] = ['Winners Match 61', 'Winners Match 62']

#Checking to see if knockout stage data is properly replaced
df_fixture.loc[47:,:]


Unnamed: 0,home,score,away,year
47,Morocco,Match 48,Colombia,2023
48,Winners Group A,Match 49,Runners-up Group C,2023
49,Winners Group C,Match 50,Runners-up Group A,2023
50,Winners Group E,Match 51,Runners-up Group G,2023
51,Winners Group G,Match 52,Runners-up Group E,2023
52,Winners Group D,Match 54,Runners-up Group B,2023
53,Winners Group B,Match 53,Runners-up Group D,2023
54,Winners Group H,Match 56,Runners-up Group F,2023
55,Winners Group F,Match 55,Runners-up Group H,2023
56,Winners Match 49,Match 57,Winners Match 51,2023


In [10]:
#Fixtures done, now getting Current FIFA groups
all_tables = pd.read_html("https://en.wikipedia.org/wiki/2023_FIFA_Women's_World_Cup")

In [11]:
#First group
all_tables[10]

Unnamed: 0,Pos,Teamvte,Pld,W,D,L,GF,GA,GD,Pts,Qualification
0,1,Switzerland,3,1,2,0,2,0,+2,5,Advance to knockout stage
1,2,Norway,3,1,1,1,6,1,+5,4,Advance to knockout stage
2,3,New Zealand (H),3,1,1,1,1,1,0,4,
3,4,Philippines,3,1,0,2,1,8,−7,3,


In [12]:
#Last group
all_tables[59]

Unnamed: 0,Pos,Teamvte,Pld,W,D,L,GF,GA,GD,Pts,Qualification
0,1,Colombia,3,2,0,1,4,2,+2,6,Advance to knockout stage
1,2,Morocco,3,2,0,1,2,6,−4,6,Advance to knockout stage
2,3,Germany,3,1,1,1,8,3,+5,4,
3,4,South Korea,3,0,1,2,1,4,−3,1,


In [13]:
#To help assign letters to groups
alphabet

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [14]:
# Dictionary of Group dataframes
dict_table = {}
for letter, i in zip(alphabet, range(10, 60, 7)):
    df=all_tables[i]
    df.rename(columns={df.columns[1]: 'Team'}, inplace =True)
    df.replace({'New Zealand (H)' : 'New Zealand', 'Australia (H)' : 'Australia'}, inplace=True)
    df['Pld'] = df['W'] = df['D'] = df['L'] = df['GF'] = df['GA'] = df['GD'] = df['Pts'] = [0,0,0,0] #Resetting values
    df.drop('Qualification', axis=1, inplace=True) #Removing unnecessary columns
    dict_table[f'Group {letter}'] = df
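
The `range(10, 60, 7)` step relies on the eight group tables sitting seven positions apart in `all_tables`, from index 10 (first group) to index 59 (last group). That spacing is an assumption about the page layout at scrape time and may shift if the article is edited. A quick standalone check of the letter-to-index pairing (`group_index` is a name introduced here for illustration):

```python
from string import ascii_uppercase as alphabet

# zip stops at the shorter iterable, so only the first 8 letters are used.
group_index = {
    f'Group {letter}': i
    for letter, i in zip(alphabet, range(10, 60, 7))
}
print(group_index)
```

If the indices ever drift, printing `all_tables[i].columns` for a few values of `i` is a simple way to locate the group tables again.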

In [37]:
#Checking for one group
dict_table['Group A']

Unnamed: 0,Team,Pts
0,Switzerland,5.0
1,Norway,4.0
2,New Zealand,0.0
3,Philippines,0.0


## Data Cleaning

In [16]:
#Cleaning df_fixture
df_fixture['home'] = df_fixture['home'].str.strip()
df_fixture['away'] = df_fixture['away'].str.strip()

In [17]:
#Cleaning df_fifa_history

# cleaning score and home/away columns
df_fifa_history['score'] = df_fifa_history['score'].str.replace(r'[^\d–]', '', regex=True)
df_fifa_history['home'] = df_fifa_history['home'].str.strip() # strip stray whitespace around team names
df_fifa_history['away'] = df_fifa_history['away'].str.strip()

# splitting score columns into home and away goals and dropping score column
df_fifa_history[['HomeGoals', 'AwayGoals']] = df_fifa_history['score'].str.split('–', expand=True)
df_fifa_history.drop('score', axis=1, inplace=True)

# renaming columns and changing data types
df_fifa_history.rename(columns={'home': 'HomeTeam', 'away': 'AwayTeam', 
                                   'year':'Year'}, inplace=True)
df_fifa_history = df_fifa_history.astype({'HomeGoals': int, 'AwayGoals':int, 'Year': int})

# creating new column "TotalGoals"
df_fifa_history['TotalGoals'] = df_fifa_history['HomeGoals'] + df_fifa_history['AwayGoals']
df_fifa_history

Unnamed: 0,HomeTeam,AwayTeam,Year,HomeGoals,AwayGoals,TotalGoals
0,China,Norway,1991,4,0,4
1,Denmark,New Zealand,1991,3,0,3
2,Norway,New Zealand,1991,4,0,4
3,China,Denmark,1991,2,2,4
4,China,New Zealand,1991,4,1,5
...,...,...,...,...,...,...
279,Germany,Sweden,2019,1,2,3
280,England,United States,2019,1,2,3
281,Netherlands,Sweden,2019,1,0,1
282,England,Sweden,2019,1,2,3
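
To see what the regular expression and the split do to a single score string, here is a small standalone sketch. The `parse_score` helper and the second sample string with its extra annotation are hypothetical, made up for illustration; only the `4–0` format comes from the scraped data.

```python
import re

def parse_score(raw):
    """Return (home_goals, away_goals) from a string such as '4–0'."""
    # Keep only digits and the en dash that separates the two scores.
    cleaned = re.sub(r'[^\d–]', '', raw)
    home, away = cleaned.split('–')
    return int(home), int(away)

print(parse_score('4–0'))            # (4, 0)
print(parse_score(' 2–1 (a.e.t.)'))  # (2, 1)
```

Note that the separator is an en dash (`–`), not a hyphen, so splitting on `-` would fail on this data.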


In [18]:
#Exporting clean dfs
df_fifa_history.to_csv('fifa_wcwomen_historical_data_cleaned.csv',index=False)
df_fixture.to_csv('fifa_2023_fixture_cleaned.csv',index=False)

In [19]:
#Calculate Team Strength
df_home = df_fifa_history[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df_fifa_history[['AwayTeam', 'HomeGoals', 'AwayGoals']]

df_home = df_home.rename(columns={'HomeTeam':'Team', 'HomeGoals': 'GoalsScored', 'AwayGoals': 'GoalsConceded'})
df_away = df_away.rename(columns={'AwayTeam':'Team', 'HomeGoals': 'GoalsConceded', 'AwayGoals': 'GoalsScored'})

df_team_strength = pd.concat([df_home, df_away], ignore_index=True).groupby(['Team']).mean()
df_team_strength

Team,GoalsScored,GoalsConceded
Argentina,0.555556,4.111111
Australia,1.461538,1.923077
Brazil,1.941176,1.176471
Cameroon,1.5,1.5
Canada,1.259259,1.925926
Chile,0.666667,1.666667
China,1.606061,0.969697
Chinese Taipei,0.5,3.75
Colombia,0.571429,1.285714
Costa Rica,1.0,1.333333


## Poisson Distribution

In [20]:
#Function Predict Points using Poisson Distribution
def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index: #Both teams appear in the historical data
        # goals_scored * goals_conceded
        lamb_home = df_team_strength.at[home,'GoalsScored'] * df_team_strength.at[away,'GoalsConceded']
        lamb_away = df_team_strength.at[away,'GoalsScored'] * df_team_strength.at[home,'GoalsConceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0,11): #number of goals home team
            for y in range(0, 11): #number of goals away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        
        #FIFA awards 3 points to winning team, and 1 point each for a draw
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        return (0, 0) #At least one team has no World Cup history, so no estimate is possible

In [21]:
#Testing points
print(predict_points('England', 'United States'))

(0.4791290285407772, 2.3861628168887195)
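
For anyone who wants to see the arithmetic inside `predict_points` without SciPy, here is a dependency-free sketch of the same calculation. The Poisson probability of scoring k goals at rate λ is λ^k · e^(−λ) / k!, and the two teams' goal counts are treated as independent. The names `poisson_pmf` and `outcome_probs` and the example rates 1.5 and 1.0 are made up for illustration, not taken from the team-strength table.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return lam ** k * exp(-lam) / factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    """Return (P(home win), P(draw), P(away win)), truncated at max_goals."""
    p_home = p_draw = p_away = 0.0
    for x in range(max_goals + 1):        # goals by the home team
        for y in range(max_goals + 1):    # goals by the away team
            p = poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)
            if x > y:
                p_home += p
            elif x == y:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

p_home, p_draw, p_away = outcome_probs(1.5, 1.0)
points_home = 3 * p_home + p_draw   # 3 points for a win, 1 for a draw
points_away = 3 * p_away + p_draw
print(p_home, p_draw, p_away)
print(points_home, points_away)
```

The two expected-points values always add up to a little less than 3, because a draw hands out 2 points in total rather than 3. The test output above shows the same pattern.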


# Predicting World Cup

Now it's time to put the model to the test! On to predictions!

## Group Stage

In [22]:
#Dividing up fixture dataset into group stage, knockout stage, quarter finals, semi finals and finals
df_fixture_group_48 = df_fixture[:48].copy()
df_fixture_knockout = df_fixture[48:56].copy()
df_fixture_quarter = df_fixture[56:60].copy()
df_fixture_semi = df_fixture[60:62].copy()
df_fixture_final = df_fixture[62:].copy()

In [23]:
#Estimating points earned by home and away teams of all groups
for group in dict_table:
    teams_in_group = dict_table[group]['Team'].values
    df_fixture_group_6 = df_fixture_group_48[df_fixture_group_48['home'].isin(teams_in_group)]
    for index, row in df_fixture_group_6.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        dict_table[group].loc[dict_table[group]['Team'] == home, 'Pts'] += points_home
        dict_table[group].loc[dict_table[group]['Team'] == away, 'Pts'] += points_away

    dict_table[group] = dict_table[group].sort_values('Pts', ascending=False).reset_index()
    dict_table[group] = dict_table[group][['Team', 'Pts']]
    dict_table[group] = dict_table[group].round(0)

In [24]:
dict_table['Group A']

Unnamed: 0,Team,Pts
0,Switzerland,5.0
1,Norway,4.0
2,New Zealand,0.0
3,Philippines,0.0


## Knockout

In [25]:
df_fixture_knockout

Unnamed: 0,home,score,away,year
48,Winners Group A,Match 49,Runners-up Group C,2023
49,Winners Group C,Match 50,Runners-up Group A,2023
50,Winners Group E,Match 51,Runners-up Group G,2023
51,Winners Group G,Match 52,Runners-up Group E,2023
52,Winners Group D,Match 54,Runners-up Group B,2023
53,Winners Group B,Match 53,Runners-up Group D,2023
54,Winners Group H,Match 56,Runners-up Group F,2023
55,Winners Group F,Match 55,Runners-up Group H,2023


In [26]:
#Assigning group winner and runners-up for each group
for group in dict_table:
    group_winner = dict_table[group].loc[0, 'Team']
    runners_up = dict_table[group].loc[1, 'Team']
    df_fixture_knockout.replace({f'Winners {group}':group_winner,
                                 f'Runners-up {group}':runners_up}, inplace=True)

df_fixture_knockout['winner'] = '?'
df_fixture_knockout

Unnamed: 0,home,score,away,year,winner
48,Switzerland,Match 49,Spain,2023,?
49,Costa Rica,Match 50,Norway,2023,?
50,United States,Match 51,Sweden,2023,?
51,Italy,Match 52,Netherlands,2023,?
52,China,Match 54,Canada,2023,?
53,Australia,Match 53,England,2023,?
54,Germany,Match 56,Brazil,2023,?
55,France,Match 55,Colombia,2023,?


In [27]:
#Function to get match winner for each match in given dataframe
def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        if points_home > points_away:
            winner = home
        else: #ties, including the (0, 0) case for teams without history, go to the away team
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated

In [28]:
get_winner(df_fixture_knockout)

Unnamed: 0,home,score,away,year,winner
48,Switzerland,Match 49,Spain,2023,Switzerland
49,Costa Rica,Match 50,Norway,2023,Norway
50,United States,Match 51,Sweden,2023,United States
51,Italy,Match 52,Netherlands,2023,Italy
52,China,Match 54,Canada,2023,China
53,Australia,Match 53,England,2023,England
54,Germany,Match 56,Brazil,2023,Germany
55,France,Match 55,Colombia,2023,France


## Quarter Final

In [29]:
#Function to replace names of winners in given dataframe
def update_table(df_fixture_round_1, df_fixture_round_2):
    for index, row in df_fixture_round_1.iterrows():
        winner = df_fixture_round_1.loc[index, 'winner']
        match = df_fixture_round_1.loc[index, 'score']
        df_fixture_round_2.replace({f'Winners {match}':winner}, inplace=True)
    df_fixture_round_2['winner'] = '?'
    return df_fixture_round_2
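
The placeholder substitution in `update_table` can be pictured with plain dictionaries. This is a simplified sketch rather than the DataFrame code above: `fill_bracket` is a hypothetical helper, and the two winners are the ones predicted for Match 49 and Match 51 in the previous round.

```python
def fill_bracket(fixtures, winners):
    """Replace 'Winners Match N' placeholders with predicted team names."""
    filled = []
    for fixture in fixtures:
        filled.append({
            **fixture,
            # Fall back to the placeholder when no winner is known yet.
            'home': winners.get(fixture['home'], fixture['home']),
            'away': winners.get(fixture['away'], fixture['away']),
        })
    return filled

winners = {
    'Winners Match 49': 'Switzerland',
    'Winners Match 51': 'United States',
}
quarter_finals = [
    {'home': 'Winners Match 49', 'score': 'Match 57', 'away': 'Winners Match 51'},
    {'home': 'Winners Match 50', 'score': 'Match 58', 'away': 'Winners Match 52'},
]

filled = fill_bracket(quarter_finals, winners)
print(filled[0])
print(filled[1])
```

Placeholders without a matching winner are left untouched. The same thing happens in `update_table`, which only substitutes `Winners ...` labels, so any `Losers ...` label has to be filled in separately.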

In [30]:
update_table(df_fixture_knockout, df_fixture_quarter)

Unnamed: 0,home,score,away,year,winner
56,Switzerland,Match 57,United States,2023,?
57,Norway,Match 58,Italy,2023,?
58,England,Match 59,France,2023,?
59,China,Match 60,Germany,2023,?


In [31]:
get_winner(df_fixture_quarter)

Unnamed: 0,home,score,away,year,winner
56,Switzerland,Match 57,United States,2023,United States
57,Norway,Match 58,Italy,2023,Norway
58,England,Match 59,France,2023,France
59,China,Match 60,Germany,2023,Germany


## Semifinal

In [32]:
update_table(df_fixture_quarter, df_fixture_semi)

Unnamed: 0,home,score,away,year,winner
60,United States,Match 61,Norway,2023,?
61,France,Match 62,Germany,2023,?


In [33]:
get_winner(df_fixture_semi)

Unnamed: 0,home,score,away,year,winner
60,United States,Match 61,Norway,2023,United States
61,France,Match 62,Germany,2023,Germany


## Final

In [34]:
update_table(df_fixture_semi, df_fixture_final)

Unnamed: 0,home,score,away,year,winner
62,Losers Match 61,Match 63,Losers Match 62,2023,?
63,United States,Match 64,Germany,2023,?


In [35]:
df_fixture_final.loc[df_fixture_final.score == 'Match 63', ['home','away']] = ['Norway','France'] #Losing semi-finalists from df_fixture_semi
df_fixture_final

Unnamed: 0,home,score,away,year,winner
62,Norway,Match 63,France,2023,?
63,United States,Match 64,Germany,2023,?


In [36]:
#Moment of truth
get_winner(df_fixture_final)

Unnamed: 0,home,score,away,year,winner
62,Norway,Match 63,France,2023,Norway
63,United States,Match 64,Germany,2023,United States


And that's it! My model predicts that the United States will win this year's FIFA Women's World Cup, with Germany in 2nd place and Norway in 3rd. But no matter who wins this time, it's shaping up to be a great World Cup!

<b>Sources:</b>
* https://en.wikipedia.org/wiki/2023_FIFA_Women's_World_Cup
* https://en.wikipedia.org/wiki/1991_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/1995_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/1999_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/2003_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/2007_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/2011_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/2015_FIFA_Women%27s_World_Cup
* https://en.wikipedia.org/wiki/2019_FIFA_Women%27s_World_Cup

## Thank you!