# BAMS 503 Simulation Modeling: Final Project<br> Simulation Model for 2022 FIFA World Cup

## Authors
+ Muhammad Faisal
+ Carlos Nako
+ Sotirios Valozos
+ Islam Shaalan
+ Lige Liu

## Table of Content
  - [Introduction](#introduction)
  - [Data](#data)
  - [Methodology](#methodology)
  - [Simulation Model](#simulation-model)
  - [Result](#result)
  - [Conclusion](#conclusion)
  - [Discussions](#discussions)
  - [References](#references)

## Introduction
The 2022 FIFA World Cup is scheduled to be the 22nd running of the FIFA World Cup competition, the quadrennial international men’s association football (commonly known as soccer in North America) championship contested by the national teams of the member associations of FIFA, the international governing body of the sport of soccer [1]. In 2022, 32 teams will compete in Qatar and play in a total of 64 matches for the final championship.

As the most prestigious soccer tournament and one of the most followed sporting events in the world, the game results are highly anticipated. In this project, we built a simulation model in an attempt to predict the result of the championship, including the champion, the runner-up, and the teams entering the elimination rounds. We utilized concepts of Monte Carlo Simulation in BAMS 503 and previous research in modeling soccer match outcomes to predict the outcome of each of the 64 matches.

The tournament consists of two stages: the group stage and the knockout stage. The tournament will begin with a group stage with 8 groups of 4 teams. Each group will play in a single round-robin format with a points-based ranking system to determine the top two qualifiers for the knockout stage. In the knockout stage, each team will play in an elimination format to determine the champion. The losing teams in the semi-finals will play a third-place play-off. Teams not qualifying for the knockout stage and teams lost before the semi-finals will exit the tournament.

As of April 2022, even though the final draw of the tournament has been completed, three qualifying teams have not yet been determined. Two winners of the inter-confederation play-offs and the winner of the Path A of the UEFA play-offs will qualify for the World Cup and the matches will happen in June 2022. In this project, we determined three teams to play in the World Cup arbitrarily for simplicity.

## Data
In order to model the match outcomes, we used the dataset [International football results from 1872 to 2021](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017) [2] accessed from Kaggle. The dataset includes the date, teams, scores, tournament, and location information of 43170 international soccer matches.

The groups and the schedule of the tournament come from [FIFA World Cup Qatar 2022 – Match Schedule](https://digitalhub.fifa.com/m/6a616c6cf19bc57a/original/FWC-2022-Match-Schedule.pdf) [3] published by FIFA.

## Methodology
In the group stage, the ranking of teams in the group is determined by the following rules [2]:
1.	Points obtained in all group matches;
2.	Goal difference in all group matches;
3.	Number of goals scored in all group matches;
4.	Points obtained in the matches played between the teams in question;
5.	Goal difference in the matches played between the teams in question;
6.	Number of goals scored in the matches played between the teams in question;
7.	Fair play points in all group matches (only one deduction can be applied to a player in a single match):
    + Yellow card: −1 point;
    + Indirect red card (second yellow card): −3 points;
    + Direct red card: −4 points;
    + Yellow card and direct red card: −5 points;
8.	Drawing of lots.

Therefore, in order to model the group standings, we need to model scores and disciplinary actions separately as they distribute differently.

However, referees make disciplinary decisions drastically differently in big tournaments like the World Cup with regular matches like friendlies. Referees’ decisions on yellow cards and red cards are highly subjective and influenced by many factors that are impossible to predict. Considering the fair points were the second last criteria to qualify, we adopted a simplistic approach to simulate the disciplinary actions in a match. We will model the yellow cards and red cards with a uniform distribution

In the knockout stage, if a match is tied at the end of normal playing time, extra time is played (two periods of 15 minutes each) and followed, if necessary, by a penalty shoot-out to determine the winners [4].

Therefore, we only need to model the scores in the knockout stage. Due to the difficulty to model extra time and penalty shoot-out results, we will continue to simulate the match outcome until a winner is determined.

### Score Model

To model the scores, we referenced a paper by M.J. Maher in 1982 [5]. In his paper, “Modelling association football scores”, Maher used Poisson distribution to model scores of soccer games. We referenced his model to build our model to simulate the match outcome.

In our model, for a team of interest (Team A), the X denotes the number of goals the team scored in a match-up with an opponent (Team B). The model is denoted as the following:

$X \sim Pois(\lambda),$

$where\: \lambda=(attack\: strength\: of\: Team\: A)\times(defense\: strength\: of\: Team\: B)\times(average\: goals\: scored\: of\: all\: the\: teams)$

The “attack strength” score and the “defense strength” score of the team of interest (Team A) is calculated by the following formula:

$Attack\: strength=\dfrac{average\: goals\: scored\: of\: Team\: A }{average\: goals\: scored\: of\: all\: the\: teams}$

$Defense\: strength=\dfrac{average\: goals\: conceded\: of\: Team\: A }{average\: goals\: conceded\: of\: all\: the\: teams}$

From the [International football results dataset](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017), we extracted the goals scored and goals conceded in the matches played by teams in the 2022 FIFA World Cup since 2000. We then calculated an “attack strength” score and “defense strength” score for each team and export the scores as a CSV file. The CSV file is later used as an input for the simulation model to simulate the outcome for each match.

The score model is included in the "Score Model.ipynb" file.

## Simulation Model

### Model Overview
The following image depicts the overall modeling process 
<i>P.S. Images sometimes break. Image is attached as modeling_view.png under imgs folder</i>.

![alt text](imgs/modeling_view.jpeg "Modeling View")

File 'input_file.csv' includes the name of the countries, the group that that they belong to, and their group position as it was drawn. It also contains the home attack/defense scores, away attack and defence scores. The input file is generated from the 'score_model' notebook.

File 'reference_score.csv' includes a reference value for the home and away score leverage based on the observed historical data. The reference scores are generated from the 'score_model' notebook.

All the files are available within the data folder.

### Code Structure
The following image depicts the structure and flow of the coding of the simulation model.

![alt text](imgs/code_view.jpeg "Code Structure")


### Classes that will be used throughout the modeling process

In [1]:
#define Team class with ranking and strength properties
class Team:

    def __init__(self, name, group,home_attack, home_defense, away_attack, away_defense, group_position):
        self.name = name
        self.group = group
        self.home_attack = home_attack
        self.home_defense = home_defense
        self.away_attack = away_attack
        self.away_defense = away_defense
        self.group_position = group_position

#define a Game class with 2 teams, home and away, home advantage and prediction
class Game:
    def __init__(self, game_number, home_team, away_team, home_advantage,is_group_stage,group):
        self.game_number = game_number
        self.home_team = home_team
        self.away_team = away_team
        self.home_advantage = home_advantage
        self.is_group_stage = is_group_stage
        self.group = group
'''
create a game result object that holds the result and cards of the football game
'''
class GameResult:
    def __init__(self, game, home_goals, away_goals, home_yellow_cards, away_yellow_cards, home_red_cards, away_red_cards,group):
        self.game = game
        self.home_goals = home_goals
        self.away_goals = away_goals
        self.home_yellow_cards = home_yellow_cards
        self.away_yellow_cards = away_yellow_cards
        self.home_red_cards = home_red_cards
        self.away_red_cards = away_red_cards
        self.group = group
'''
Worldcup final results class that holds the final results of the world cup
'''
class WorldcupFinalResults:
    def __init__(self, winner, runner_up, third_place, fourth_place, quarter_finalists, r16s):
        self.winner = winner
        self.runner_up = runner_up
        self.third_place = third_place
        self.fourth_place = fourth_place
        self.quarter_finalists = quarter_finalists
        self.r16s = r16s

## Number of Replications
The number of replications was defined by setting a target 95% confidence interval of +/- 0.01 for the team with the highest chance to win the competition. 
To calculate the final number of replications an initial simulation with 100 replications was done to have an estimation of the standard deviation. The final number of replications was then calculated as 7542.8. The model was then updated with the rounded up number.

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats  # for calculating confidence intervals
import math as math
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None

np.random.seed(1234)

num_replications = 7543

In [3]:
def initiate_input_data():
    wc_teams_list = []
    #reading the teams information from the csv file
    team_data = pd.read_csv("data/input_file.csv")
    
    #Create the list of team objects based on the input file
    for i in range(32):
        wc_teams_list.append(Team(team_data['Country'][i],team_data['Group'][i],team_data['home_attack_strength'][i],team_data['home_defense_strength'][i],team_data['away_attack_strength'][i],team_data['away_attack_strength'][i],team_data['Group_Position'][i]))

    #reading the reference score (output of the score_model) from the csv file
    ref_goals = pd.read_csv("data/reference_score.csv",header=None)

    return wc_teams_list,ref_goals

In [4]:
#function that generates the games list based on the draw and the groups
def generate_group_stage_games_list(wc_teams_list):
    
    temp_groups = ["A","B","C","D","E","F","G","H"]
    games_list = []
    for group in temp_groups:
        temp_list = []
        
        for i in range(32):
            if wc_teams_list[i].group == group:
                temp_list.append(wc_teams_list[i])
        #append the 6 games of the group to the games list
        games_list.append(Game(group+str(1),temp_list[0],temp_list[1],1,1,group))
        games_list.append(Game(group+str(2),temp_list[2],temp_list[3],1,1,group))
        games_list.append(Game(group+str(3),temp_list[0],temp_list[2],1,1,group))
        games_list.append(Game(group+str(4),temp_list[1],temp_list[3],1,1,group))
        games_list.append(Game(group+str(5),temp_list[0],temp_list[3],1,1,group))
        games_list.append(Game(group+str(6),temp_list[1],temp_list[2],1,1,group))

    return games_list

In [5]:
#return the result of a game
def get_game_result(game):
    
    team1 = game.home_team
    team2 = game.away_team
    
    overal_scored_home = ref_goals.iloc[2,0]
    overal_scored_away = ref_goals.iloc[2,0]

    overall_score = (overal_scored_home+overal_scored_away)/2

    team1_attack = (team1.home_attack+team1.away_attack)/2
    team1_defense = (team1.home_defense+team1.away_defense)/2
    team2_attack = (team2.home_attack+team2.away_attack)/2
    team2_defense = (team2.home_defense+team2.away_defense)/2

    lambda_1 = team1_attack * team2_defense * overall_score
    lambda_2 = team2_attack * team1_defense * overall_score

    home_goals = np.random.poisson(lambda_1*100)
    away_goals = np.random.poisson(lambda_2*100)


    home_yellow_cards = int(round (np.random.uniform(0,10),0))
    away_yellow_cards = int(round (np.random.uniform(0,10),0))
    home_red_cards = int(round (np.random.uniform(0,10),0))
    away_red_cards = int(round (np.random.uniform(0,10),0))

    return GameResult(game,home_goals,away_goals,home_yellow_cards,away_yellow_cards,home_red_cards,away_red_cards,game.group)

In [6]:
#function to simulation the games in the group stage and get the results for each game
def simulate_group_stage_games(games_list):
    game_results_list = []

    for game in games_list:
        game_results_list.append(get_game_result(game))
        
    return game_results_list

In [7]:
#Function to generate the bracket games based on the group stage results and the fifa world cup rules
def generate_bracket_games(winners, runner_ups):
    bracket_games_list = []

    #rules of creating the bracket games from the group qualified teams and game number
    game_num_index = 48
    #Winner from Group1 plays against runner_up from Group2
    Group1 = ["A","C","D","B","E","G","F","H"]
    Group2 = ["B","D","C","A","F","H","E","G"]
    
    for j in range(8):
        bracket_games_list.append(Game(game_num_index+j+1,winners[Group1[j]],runner_ups[Group2[j]],1,1,"Bracket"))

    return bracket_games_list


In [8]:

def grp_qualified_teams(game_results_list,wc_teams_list):
    groups = ["A","B","C","D","E","F","G","H"]
    
    #dictionaries to hold winners and runners
    winners = {}
    runner_ups = {}

    #iterate over groups from A to H
    for group in groups:
    #get all the games of this group
        group_games_results = [game for game in game_results_list if group == game.group]
    
        #get the teams for this group
        group_teams = [team for team in wc_teams_list if team.group == group]
        team_names = [team.name for team in group_teams]

        #now that we have the teams and the matches we can start tallying the result
        group_results = tally_group_results(team_names,group_games_results)
        
        #get the top two teams
        group_winner, group_runner_up = get_top_2_teams(group_results, group_games_results)   

        #get the wc_team objects for the winners and runner ups
        #add the winner and runnerup to the dictionary
        #get the first team object from the wc_teams_list with name matching the group_winner
        
        for team in wc_teams_list:
            if team.name == group_winner:
                winners[group] = team
            if team.name == group_runner_up:
                runner_ups[group] = team
        
        
    return winners, runner_ups
    
def get_top_2_teams(group_table, group_game_results):
    #get the group winner
    group_winner = get_team_at_n_position(group_table, group_game_results, 1)
    #get the group runner up
    group_runner_up = get_team_at_n_position(group_table, group_game_results, 2)
    
    #add the group winner and runner up to the group qualified teams list
    return group_winner,group_runner_up

#get the teams that are tied at the nth position (either first or second spot)
def get_team_at_n_position(group_table, group_game_results, group_position):
    #use index instead of group position
    n = group_position - 1
    #sort the teams
    group_table.sort_values(by=['Points','Goals Diff','Goals For'], ascending=False, inplace=True)
    
    #check if there are tied teams at this position
    equal_points = group_table['Points'] == group_table['Points'].iloc[n]
    equal_goals_diff = group_table['Goals Diff'] == group_table['Goals Diff'].iloc[n]
    equal_goals_for = group_table['Goals For'] == group_table['Goals For'].iloc[n]
    num_tied_teams = len(group_table[(equal_points) & (equal_goals_diff) & (equal_goals_for)])
    
    #if yes apply the tie breaking rules and return the team name
    if num_tied_teams > 1:
        sorted_group_results = break_the_tie(group_table[n:n+num_tied_teams], group_game_results)
        return sorted_group_results.iloc[n]['Team Name']
    
    #if not then return the team name at the required position
    return group_table.iloc[n]['Team Name']

#In case there's a tie, go through the games and the previously mentioned tie breaking rules
def break_the_tie(group_results, group_game_results):
    #get the list of all team names
    team_names = list(group_results['Team Name'])
    
    #get games between teams
    teams_games_intergames = [game for game in group_game_results if (team_names[0] in [game.game.home_team.name, game.game.away_team.name]) & (team_names[1] in [game.game.home_team.name, game.game.away_team.name])]
    
    #rules 4 to 6
    sorted_group_results = tally_group_results(team_names,teams_games_intergames)

    #check if there's still a tie
    equal_points = sorted_group_results['Points'] == sorted_group_results['Points'].iloc[0]
    equal_goals_diff = sorted_group_results['Goals Diff'] == sorted_group_results['Goals Diff'].iloc[0]
    equal_goals_for = sorted_group_results['Goals For'] == sorted_group_results['Goals For'].iloc[0]
    num_tied_teams = len(sorted_group_results[(equal_points) & (equal_goals_diff) & (equal_goals_for)])

    #if yes then apply rule 7 and find the team best fair play record
    if num_tied_teams > 1:
        sorted_group_results.sort_values(by=['Fair Play Points'], ascending=True, inplace=True)

    #check if there's still a tie
    equal_fairplay = sorted_group_results['Fair Play Points'] == sorted_group_results['Fair Play Points'].iloc[0]
    num_tied_teams = len(sorted_group_results[equal_fairplay])
    #if there's still a tie then we go to the random draw 
    if num_tied_teams > 1:
        #randomly choose one of the tied teams
        random_choice = np.random.choice(team_names)
        #create a new column where the randomly chosen team has the value 0 and the rest 1
        sorted_group_results['Random Choice'] = [1 if team_name != random_choice else 0 for team_name in sorted_group_results['Team Name']]
        #sort the table by the Random Choice column
        sorted_group_results.sort_values(by=['Random Choice'], ascending=True, inplace=True)
        #drop the Random Choice column
    
    #return the sorted group results after applying the tie breaking rules
    return sorted_group_results

#aggergate all the results from the group matches into the final group table
def tally_group_results(team_names,group_games_results):
    group_results = {}
    #Calculate Points, Goals For, Goals Against, Goals Diff, Fair Play Points
    for team in team_names:
        group_results[team] = {"Points":0,"Goals For":0,"Goals Against":0,"Goals Diff":0,"Fair Play Points":0}
        
    for game_res in group_games_results:
        home_team_name = game_res.game.home_team.name
        away_team_name = game_res.game.away_team.name
        if game_res.home_goals == game_res.away_goals:
            group_results[home_team_name]["Points"] += 1
            group_results[away_team_name]["Points"] += 1
        elif game_res.home_goals > game_res.away_goals:
            group_results[home_team_name]["Points"] += 3
        else:
            group_results[away_team_name]["Points"] += 3
        group_results[home_team_name]["Goals For"] += game_res.home_goals
        group_results[home_team_name]["Goals Against"] += game_res.away_goals
        group_results[away_team_name]["Goals For"] += game_res.away_goals
        group_results[away_team_name]["Goals Against"] += game_res.home_goals
        group_results[home_team_name]["Goals Diff"] += game_res.home_goals - game_res.away_goals
        group_results[away_team_name]["Goals Diff"] += game_res.away_goals - game_res.home_goals
        group_results[home_team_name]["Fair Play Points"] += game_res.home_yellow_cards + (game_res.home_red_cards*2)
        group_results[away_team_name]["Fair Play Points"] += game_res.away_yellow_cards + (game_res.away_red_cards*2)

    results = pd.DataFrame(group_results).transpose().reset_index()
    results.columns = ['Team Name', 'Points', 'Goals For', 'Goals Against', 'Goals Diff', 'Fair Play Points']
    return results

In [9]:
#Helper function to get the winner of a game
def get_winner_from_result(game_result):
    rand = np.random.uniform(0,1)
    if game_result.home_goals > game_result.away_goals:
        winner = game_result.game.home_team
    elif game_result.home_goals < game_result.away_goals:
        winner = game_result.game.away_team
    elif rand <= 0.5:
        winner = game_result.game.home_team
    else:
        winner = game_result.game.away_team
    return winner
#Helper function to get the loser of a game
def get_losing_team(game_result):
    rand = np.random.uniform(0,1)
    if game_result.home_goals < game_result.away_goals:
        loser = game_result.game.home_team
    elif game_result.home_goals > game_result.away_goals:
        loser = game_result.game.away_team
    elif rand <= 0.5:
        loser = game_result.game.home_team
    else:
        loser = game_result.game.away_team
    return loser

def simulate_r16(bracket_games):
    r16_game_results_list = []
    
    for game in bracket_games:
        r16_game_results_list.append(get_game_result(game))

    return r16_game_results_list

def simulate_quarter_final(r16_results):
    r16_winners = {}
    q8_games_list = []
    q8_game_results_list = []
    game_num_list = [58,57,60,59]
    Group1 = [53,49,55,51]
    Group2 = [54,50,56,52]

    #get the winners from previous r16 games
    for game_result in r16_results:
        r16_winners[game_result.game.game_number] = get_winner_from_result(game_result)

    #create the games list for quarter final
    for j in range(4):
        q8_games_list.append(Game(game_num_list[j],r16_winners[Group1[j]],r16_winners[Group2[j]],1,1,"Bracket"))

    #simulate the matches of the quarter final
    for game in q8_games_list:
        q8_game_results_list.append(get_game_result(game))
    
    #return the results
    return q8_game_results_list


def simulate_semi_final(q8_game_results_list):
    q8_winners = {}
    sf_games_list = []
    sf_game_results_list = []
    game_num_list = [61,62]
    Group1 = [58,60]
    Group2 = [57,59]
    
    #get the winners from the previous quarter final games
    for game_result in q8_game_results_list:
        q8_winners[game_result.game.game_number] = get_winner_from_result(game_result)

    #create the games list for semi final
    for j in range(2):
        sf_games_list.append(Game(game_num_list[j],q8_winners[Group1[j]],q8_winners[Group2[j]],1,1,"Bracket"))

    #simulate the semi final games
    for game in sf_games_list:
        sf_game_results_list.append(get_game_result(game))

    return sf_game_results_list

def simulate_3rd_place(sf_game_results):
    third_place_teams = []
    third_place_game = []
    third_place_result = []
    
    for game_result in sf_game_results:
        third_place_teams.append(get_losing_team(game_result))

    third_place_game = Game("third_place",third_place_teams[0],third_place_teams[1],1,1,"KO")

    third_place_result = get_game_result(third_place_game)

    winner = get_winner_from_result(third_place_result)
    loser = get_losing_team(third_place_result)

    return winner, loser

def simulate_final(sf_game_results):
    final_teams = []
    for game_result in sf_game_results:
        final_teams.append(get_winner_from_result(game_result))
    final_game = Game("final",final_teams[0],final_teams[1],1,1,"KO")
    final_result = get_game_result(final_game)

    winner = get_winner_from_result(final_result)
    loser = get_losing_team(final_result)
    
    return winner, loser

#Saving the results for final output analysis
def save_results(winner,runner_up,third_place, fourth_place,quarter_final_res,r16_res,group_qualified_teams):
    quarter_final_teams = [get_winner_from_result(res) for res in r16_res]

    r16_teams = group_qualified_teams

    worldcup_results_list.append(WorldcupFinalResults(winner,runner_up,third_place,fourth_place, quarter_final_teams,r16_teams))



### Running 1 Simulation 

In [10]:
def run_a_worldcup():
    
    #generate the group stage games
    games_list = generate_group_stage_games_list(wc_teams_list)
    #simulate the group stage games
    game_results_list = simulate_group_stage_games(games_list)
    winners, runner_ups = grp_qualified_teams(game_results_list,wc_teams_list)
    bracket_games = generate_bracket_games(winners,runner_ups)
    #simulate the r16
    r16_res = simulate_r16(bracket_games)
    #simulate the quarter final
    qf_res = simulate_quarter_final(r16_res)
    #simulate the semi final
    semi_results = simulate_semi_final(qf_res)
    #simulate the 3rd place
    third_place, fourth_place = simulate_3rd_place(semi_results)
    #simulate the final
    winner, runner_up = simulate_final(semi_results)
    qual_teams = list(winners.values()) + list(runner_ups.values())
    
    #save the results
    save_results(winner,runner_up,third_place, fourth_place, qf_res, r16_res, qual_teams)
    return 

### Running The Whole Simulation Process

In [11]:
worldcup_results_list = []
files = initiate_input_data()
wc_teams_list = files[0]
ref_goals = files[1]

for i in range(num_replications):
    run_a_worldcup()

### Aggregating Results for Output Analysis

In [12]:
fields = ['Country']
final_sim_results = pd.read_csv("data/input_file.csv", usecols=fields)
final_sim_results["R16"] = 0
final_sim_results["Quarter Final"] = 0
final_sim_results["Fourth"] = 0
final_sim_results["Third"] = 0
final_sim_results["Runner-up"] = 0
final_sim_results["Champion"] = 0
final_sim_results["lower_95_CI"] = 1000.000
final_sim_results["upper_95_CI"] = 1000.000

for i in range(32):
    #calculate the percentage of each team occured in the r16s
    count = 0
    for j in range(num_replications):
        for k in range(16):
            if final_sim_results.iloc[i,0]== worldcup_results_list[j].r16s[k].name:
                count = count +1
    final_sim_results.iloc[i,1] = count/num_replications
    
    #calculate the percentage of each team occured in the quarter finals
    count = 0
    for j in range(num_replications):
        for k in range(8):
            if final_sim_results.iloc[i,0]== worldcup_results_list[j].quarter_finalists[k].name:
                count = count +1
    final_sim_results.iloc[i,2] = count/num_replications

    #calculate the percentage of each team won the fourth place
    count = 0
    for j in range(num_replications):
        
        if final_sim_results.iloc[i,0]== worldcup_results_list[j].fourth_place.name:
            count = count +1
    final_sim_results.iloc[i,3] = count/num_replications

    #calculate the percentage of each team won the third place
    count = 0
    for j in range(num_replications):
        
        if final_sim_results.iloc[i,0]== worldcup_results_list[j].third_place.name:
            count = count +1
    final_sim_results.iloc[i,4] = count/num_replications

    #calculate the percentage of each team won the second place
    count = 0
    for j in range(num_replications):
        
        if final_sim_results.iloc[i,0]== worldcup_results_list[j].runner_up.name:
            count = count +1
    final_sim_results.iloc[i,5] = count/num_replications

    #calculate the percentage of each team won the World Cup
    count = 0
    for j in range(num_replications):
        
        if final_sim_results.iloc[i,0]== worldcup_results_list[j].winner.name:
            count = count +1
    final_sim_results.iloc[i,6] = count /num_replications
    stddev = math.sqrt(final_sim_results["Champion"][i]*(1-final_sim_results["Champion"][i]))
    standard_error = stddev/math.sqrt(num_replications)
    t_critical = stats.t.ppf(q = .975, df = num_replications-1)
    final_sim_results["lower_95_CI"][i] = final_sim_results["Champion"][i] - t_critical*standard_error
    final_sim_results["upper_95_CI"][i] = final_sim_results["Champion"][i] + t_critical*standard_error

## Result

In [13]:
final_sim_results.sort_values(by=['Champion','Third','Fourth','Quarter Final','R16'], ascending=False)

Unnamed: 0,Country,R16,Quarter Final,Fourth,Third,Runner-up,Champion,lower_95_CI,upper_95_CI
21,Brazil,1.0,0.997349,0.000663,0.023731,0.050908,0.870874,0.863305,0.878443
17,Belgium,1.0,0.872862,0.038181,0.126342,0.369614,0.060321,0.054947,0.065694
19,Morocco,0.999735,0.684608,0.039639,0.079809,0.174732,0.025189,0.021652,0.028726
11,France,1.0,0.924036,0.1453,0.267533,0.27973,0.023465,0.020049,0.026882
7,Argentina,0.999735,0.894339,0.321225,0.290601,0.028106,0.013655,0.011036,0.016274
14,Spain,1.0,0.43073,0.002254,0.003314,0.006894,0.002651,0.001491,0.003812
4,England,0.919661,0.530956,0.08604,0.064431,0.03089,0.001856,0.000885,0.002828
2,Senegal,0.999867,0.551637,0.103142,0.049185,0.013655,0.000795,0.000159,0.001432
25,Portugal,0.993371,0.714968,0.048389,0.030359,0.029829,0.000663,8.2e-05,0.001244
3,Netherlands,1.0,0.573114,0.119183,0.047992,0.011136,0.000265,-0.000102,0.000633


In [14]:
import plotly.express as px

px.bar(final_sim_results, x="Country", y="Champion", color="Champion", color_discrete_sequence=px.colors.qualitative.Alphabet, hover_data=['Champion','lower_95_CI','upper_95_CI'],labels={'Champion':'Prob. of Winning the World Cup'}, 
title="Chances by Country of Winning the World Cup")


## Conclusion
From the result above, Brazil has the highest chance to win the World Cup Champion by winning 87.1% of the simulation. Belgium, Morocco, France, and Argentina are the top five teams with the chances of winning the World Cup Champion, but all of them only won less than 7% of the simulation each.

Another factor that plays a part is the group matchups, where the luck of the draw favors teams who have a higher attack/defense score differentials within their group. This can be seen with countries such as Spain having a 99% chance of qualifying through the group stages while others like Canada have a 0% chance.

Overall, the model behaves as expected, but limited in historic match data skews its output to favor teams who have scored high and conceded less regardless of the caliber of the match.

## Discussions
From our result, Brazil has an overwhelming chance to win the World Cup Champion. Even though Brazil is one of the top favorites, it is unexpected to see such a dominating advantage.

Our score model predicts a match outcome based on historical data of past match outcomes. In essence, the scoring model is biased by the data itself, where the quality of matchups is not being taken into account. Our efforts to remedy that fact are beyond the scope and time available for this project.

Maher, the original Poisson model author, fit the model with data of teams in the same leagues with a round-robin format in one season, which is significantly different from how we modeled our data. National teams mostly play in tournaments and never play in leagues. In addition, the FIFA World Cup is a quadrennial event with different teams each year. We had to include non-World Cup games to build the model. FIFA World Cup includes a round-robin group stage and a knockout stage. Maher’s model may or may not be suitable to predict the knockout stage. Therefore, the accuracy of the score model needs to be further investigated.

We included data from 2014 and onwards. However, the sport of soccer and the players are continuously changing over the years. We here propose another way to model the match outcomes by including the players. We couldn’t achieve this as the squad for each team has not been selected.

## References
[1] Wikipedia. “2022 FIFA World Cup”. https://en.wikipedia.org/wiki/2022_FIFA_World_Cup

[2] Jürisoo, Mart. “International football results from 1872 to 2021”. *Kaggle* (2022). https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017

[3] FIFA. “FIFA World Cup Qatar 2022 – Match Schedule”. (2022). https://digitalhub.fifa.com/m/6a616c6cf19bc57a/original/FWC-2022-Match-Schedule.pdf

[4] FIFA. “Regulations – FIFA World Cup Qatar 2022”. (2021). https://digitalhub.fifa.com/m/2744a0a5e3ded185/original/FIFA-World-Cup-Qatar-2022-Regulations_EN.pdf

[5] Maher, Michael J. “Modelling association football scores”. *Statistica Neerlandica* 36.3 (1982): 109-118.