In [1]:
import pandas as pd
import pickle
from scipy.stats import poisson
import warnings 

In [2]:
dict_table = pickle.load(open('dict_table','rb'))
df_historical_data = pd.read_csv('clean_fifa_worldcup_matches.csv')
df_fixture = pd.read_csv('clean_fifa_worldcup_fixture.csv')

# 1. Calculating the Team Strength

In order to calculate the strength of each team, I first split the dataframe into a home and an away team. Then I calculated the teams strength by taking the mean of the goals scored in home matches and away matches. Then the Mean Goals scored and conceded will be used as the lambda in the poisson distribution formula. The Poisson distribution will not give an entirely accurate prediction as it makes a few assumptions which can easily be shown to be false. For instance, not every event is independent. After a team scores a goal, both teams will most likely alter their strategy. Especially in the knockout phase of the tournament, losing teams can change strategy to an ''all or nothing'' as it doesn't matter if you lose with 0-1 or 0-10. However, for the sake of making basic predictions, these interdependencies can be ignored.

In [3]:
# Split df into df_home and df_away
df_home = df_historical_data[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df_historical_data[['AwayTeam', 'HomeGoals', 'AwayGoals']]


In [4]:
# Rename columns for clarity

df_home = df_home.rename(columns={'HomeTeam': 'Team', 'HomeGoals': 'GoalsScored', 'AwayGoals': 'GoalsConceded'})
df_away = df_home.rename(columns={'AwayTeam': 'Team', 'HomeGoals': 'GoalsConceded', 'AwayGoals': 'GoalsScored'})


In [5]:
# concat df_home and df_away, group by team and calculate the mean
df_team_strength = pd.concat([df_home, df_away], ignore_index=True).groupby('Team').mean()
df_team_strength

Unnamed: 0_level_0,GoalsScored,GoalsConceded
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
Algeria,0.833333,1.666667
Angola,0.000000,1.000000
Argentina,1.913793,0.862069
Australia,1.000000,1.857143
Austria,2.285714,1.571429
...,...,...
Uruguay,2.125000,1.125000
Wales,2.000000,1.000000
West Germany,2.500000,0.894737
Yugoslavia,2.642857,0.642857


# 2. Adding a Poisson distribution function.

In [6]:
# Calculating the Poisson distribution in order to get the predictive scores:

def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        #goals_scored * goals conceded
        lamb_home = df_team_strength.at[home, 'GoalsScored'] + df_team_strength.at[away,'GoalsConceded']
        lamb_away = df_team_strength.at[away, 'GoalsScored'] + df_team_strength.at[home,'GoalsConceded']
        prob_home, prob_away, prob_draw = 0,0,0
        for x in range(0,11): 
            for y in range(0,11):
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
                    
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        return (0, 0)

# 3. Predicting the World Cup 2022

## 3.1 Group Stage

In [7]:
# Splitting the fixtures into group, knockout, quarter, semi and final

df_fixture_group_48 = df_fixture[:48].copy()
df_fixture_knockout = df_fixture[48:56].copy()
df_fixture_quarter = df_fixture[56:60].copy()
df_fixture_semi = df_fixture[60:62].copy()
df_fixture_final = df_fixture[62:].copy()

In [8]:
# For each group we calculate the matches that are being played within that group. Each group has 4 teams and they play each other twice.
# This results in 6 matches, which are found in the fixture_group_48. Then for each match, the predicted score is compared

warnings.filterwarnings('ignore') 
for group in dict_table:
    teams_in_group = dict_table[group]['Team'].values
    df_fixture_group_6 = df_fixture_group_48[df_fixture_group_48['home'].isin(teams_in_group)]
    for index, row in df_fixture_group_6.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        dict_table[group].loc[dict_table[group]['Team'] == home, 'Pts'] += points_home
        dict_table[group].loc[dict_table[group]['Team'] == away, 'Pts'] += points_away
        
    dict_table[group] = dict_table[group].sort_values('Pts', ascending=False).reset_index()
    dict_table[group] = dict_table[group][['Team', 'Pts']]
    dict_table[group] = dict_table[group].round(0)
    

In [43]:
for key in dict_table.keys():
    print("\n" +"="*40)
    print(key)
    print("-"*40)
    print(dict_table[key])


Group A
----------------------------------------
          Team  Pts
0  Netherlands  4.0
1      Ecuador  3.0
2      Senegal  2.0
3    Qatar (H)  0.0

Group B
----------------------------------------
            Team  Pts
0        England  6.0
1          Wales  5.0
2  United States  3.0
3           Iran  3.0

Group C
----------------------------------------
           Team  Pts
0     Argentina  6.0
1        Poland  5.0
2        Mexico  5.0
3  Saudi Arabia  2.0

Group D
----------------------------------------
        Team  Pts
0     France  6.0
1    Denmark  4.0
2    Tunisia  4.0
3  Australia  3.0

Group E
----------------------------------------
         Team  Pts
0     Germany  6.0
1       Spain  5.0
2  Costa Rica  3.0
3       Japan  3.0

Group F
----------------------------------------
      Team  Pts
0  Belgium  6.0
1  Croatia  5.0
2  Morocco  3.0
3   Canada  2.0

Group G
----------------------------------------
          Team  Pts
0       Brazil  7.0
1  Switzerland  5.0
2     Came

In [10]:
# We then take the number 1 and number 2 of each group and place them in the fixture for the knockout phase.

for group in dict_table:
    group_winner = dict_table[group].loc[0, 'Team']
    runner_up = dict_table[group].loc[1, 'Team']

    df_fixture_knockout.replace({f'Winners {group}': group_winner, 
                                 f'Runners-up {group}': runner_up}, inplace=True)

df_fixture_knockout['winner'] = '?'
df_fixture_knockout['loser'] = '?'
df_fixture_knockout

Unnamed: 0,home,score,away,year,winner,loser
48,Netherlands,Match 49,Wales,2022,?,?
49,Argentina,Match 50,Denmark,2022,?,?
50,France,Match 52,Poland,2022,?,?
51,England,Match 51,Ecuador,2022,?,?
52,Germany,Match 53,Croatia,2022,?,?
53,Brazil,Match 54,Uruguay,2022,?,?
54,Belgium,Match 55,Spain,2022,?,?
55,Portugal,Match 56,Switzerland,2022,?,?


In [11]:
# Now we calculate who has the higher amount of predicted points and store them as winner, making the other team the loser.
def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        if points_home > points_away:
            winner = home
            loser = away
        else:
            winner = away
            loser = home
        df_fixture_updated.loc[index, 'winner'] = winner
        df_fixture_updated.loc[index, 'loser'] = loser
    return df_fixture_updated

In [12]:
get_winner(df_fixture_knockout)

Unnamed: 0,home,score,away,year,winner,loser
48,Netherlands,Match 49,Wales,2022,Wales,Netherlands
49,Argentina,Match 50,Denmark,2022,Argentina,Denmark
50,France,Match 52,Poland,2022,France,Poland
51,England,Match 51,Ecuador,2022,England,Ecuador
52,Germany,Match 53,Croatia,2022,Germany,Croatia
53,Brazil,Match 54,Uruguay,2022,Brazil,Uruguay
54,Belgium,Match 55,Spain,2022,Belgium,Spain
55,Portugal,Match 56,Switzerland,2022,Portugal,Switzerland


In [13]:
#we have to update the table and switch from one fixture to the next as the knockout phase is now played. 

def updated_table(df_fixture_round_1, df_fixture_round_2):
    for index, row in df_fixture_round_1.iterrows():
        winner = df_fixture_round_1.loc[index, 'winner']
        loser = df_fixture_round_1.loc[index, 'loser']
        match = df_fixture_round_1.loc[index, 'score']
        df_fixture_round_2.replace({f'Winners {match}':winner}, inplace=True)
        df_fixture_round_2.replace({f'Losers {match}':loser}, inplace=True)
    df_fixture_round_2['winner'] = '?'
    df_fixture_round_2['loser'] = '?'
    return df_fixture_round_2

In [14]:
updated_table(df_fixture_knockout, df_fixture_quarter)

Unnamed: 0,home,score,away,year,winner,loser
56,Germany,Match 58,Brazil,2022,?,?
57,Wales,Match 57,Argentina,2022,?,?
58,Belgium,Match 60,Portugal,2022,?,?
59,England,Match 59,France,2022,?,?


In [15]:
get_winner(df_fixture_quarter)

Unnamed: 0,home,score,away,year,winner,loser
56,Germany,Match 58,Brazil,2022,Brazil,Germany
57,Wales,Match 57,Argentina,2022,Argentina,Wales
58,Belgium,Match 60,Portugal,2022,Portugal,Belgium
59,England,Match 59,France,2022,France,England


In [16]:
updated_table(df_fixture_quarter, df_fixture_semi)

Unnamed: 0,home,score,away,year,winner,loser
60,Argentina,Match 61,Brazil,2022,?,?
61,France,Match 62,Portugal,2022,?,?


In [17]:
get_winner(df_fixture_semi)

Unnamed: 0,home,score,away,year,winner,loser
60,Argentina,Match 61,Brazil,2022,Brazil,Argentina
61,France,Match 62,Portugal,2022,Portugal,France


In [18]:
updated_table(df_fixture_semi, df_fixture_final)

Unnamed: 0,home,score,away,year,winner,loser
62,Argentina,Match 63,France,2022,?,?
63,Brazil,Match 64,Portugal,2022,?,?


In [19]:
df_final_results = get_winner(df_fixture_final)

In [20]:
df_final_results

Unnamed: 0,home,score,away,year,winner,loser
62,Argentina,Match 63,France,2022,France,Argentina
63,Brazil,Match 64,Portugal,2022,Brazil,Portugal


In [21]:
print(f"The winner is {df_final_results.loc[63,'winner']}, the runner up is {df_final_results.loc[63, 'loser']} and the third place will go to {df_final_results.loc[62, 'winner']}")

The winner is Brazil, the runner up is Portugal and the third place will go to France


And here we have it. Based on the poisson distribution applied on the goal scoring statistics from the past 100 years of world cups, Brazil is most likely to win the 2022 world cup, with Portugal as the runner up and Third place will be reserved for France.

# 4.0 Conclusion:

Given the fact that England has only won the world cup once and this was in 1966, it would make sense to say that the predictive model wasn't that accurate. Having that said, all four of the semi finalists from our predictive model made it to the quarter finals in the 2022 world cup and were it not for two very odd results, 3 of the 4 quarter finalists would have been similar to the prediction. Adding to that, the runner up of the real world cup 2022: france, was beaten in our predictive model by england who ended up becoming our predicted winner. This implies that were it not for england, france probably would have been at the quarter finals as well.

# 4.1 Improvements

First of all, I would made changes to the Calculation of the lamba's for the poisson distribution. *Pycoach* multiplies the goals scored with the goals conceded however this doesn't seem realistic to me. For example, if the average goals scored is 4, and the opponents average goals conceded is 2, it wouldn't be likely for the home team to then score 8 goals that match? Or would it be more likely that the goals scored would be 3? (as in, the average goals scored). Given that the opponents average condeded goals is 2, the away team has the skills to prevent more than 2 goals on average, and thus shouldn't be seen as 200% more likely to score for the home team. Therefore I would argue that it's more representative if the average is taken again, as this would ensure a value in between the two. 

To further illustrate a problem using Pycoach's method, if either the away goals conceded or the home goals scored is 1, the outcome is 100% dependend on the strength of the other team as he multiplies the averages. This makes no sense, as I would reckon that a team that scores 0.5 goals on average, is not going to continue scoring 0.5 goals on average if their opponents concede 1 goal on average. Again, taking the average would make a lot more sense, as it would result in an expected goal of 0.75.

**After that, I implemented a form factor.**
It's safe to say that regardless of how good a team was in 1930, it should have little effect on how good a team is today. So in order to make this prediction more accurate, I added weight to the more recent matches as opposed to matches a long time ago. Knowing that a career of good football players usually lasts about 14 years (from 18-32) and we're dealing with national teams (so all players are the best of their country), I counted those results 2 times stronger. (1/3rd of the value from the old results, and 2/3rd of the value from the new results, added together). I did not discount the old stats entirely as coaching, staff, and experience usually lingers on even after players move around.

## Things I will still improve:

I will scrape the player market value of each team from 2022 from websites in order to implement a sense of individual skill being present. In spite of results, good players can always stand up and make the difference even when everything else seems to fail. The main factor contributing to the players market value is their most recent transfer, or their most recent international results/local dominance. So by implementing the market value per competing country, form will be even more represented. 

----

Of course the analysis can be much more involved by adding elements such as passing accuracy, possession, shots per game, saves per game, etc. By adding in more metrics, a more accurate scoring and conceding prediction can be made.  