# Evaluation of the Model

This file is part of my work on Udacity's Nanodegree program.

As my capstone project, I compare the performance of a machine learning model in predicting the matches of the 2020 UEFA European Football Championship with my personal bets in a football prediction game played on the platform www.kicktipp.de.

In this notebook we use the trained model to make two kinds of predictions:
- First, we predict all matches played at Euro 2020, compare these bets with the real results, and calculate how many points the bets of the ML model would have earned on Kicktipp.
- Second, we use the ML model in a Monte Carlo simulation of Euro 2020 to obtain candidates for the bonus bets (winners of the groups, participants of the semi-finals and winner of Euro 2020).

<span style="color:red">Executing this notebook takes a long time because of the Monte Carlo simulation! Decreasing the number of runs of the Monte Carlo simulation lowers the execution time (but makes the results more random). The parameter can be found in the Input Parameter section.</span>
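
To get a feeling for this trade-off, the following sketch estimates how much a simulated frequency fluctuates for a given number of runs. It assumes that the runs are independent and uses the binomial standard error; the function name `standard_error` and the probability 0.15 are illustrative and not part of the notebook.

```python
import math

def standard_error(p, n):
    """Binomial standard error of a frequency estimated from n independent runs."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical true probability of some event, e.g. a team winning the tournament
p = 0.15

for n in (200, 2000, 20000):
    print(n, round(standard_error(p, n), 4))
```

The error shrinks with the square root of the number of runs, so ten times more runs reduce the noise only by a factor of about three.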

In [1]:
import numpy as np
import pandas as pd

In [2]:
import os

In [3]:
import itertools
from collections import defaultdict, Counter

In [4]:
from random import choices

In [5]:
import pickle

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

## Input Parameter

In [7]:
number_runs_monte_carlo = 2000

In [8]:
file_features_last_match = '20210825_features_last_match.xlsx'
file_matches_em = 'matches_euro_2020.xlsx'
name_model = 'model.pkl'

path_data = '../data/'
path_models = '../models/'

## Prediction and Evaluation of Matches

### Functions for Prediction and Evaluation of Matches

In [9]:
# The FeatureSelectionTransformer chooses only engineered features w.r.t. a single (weighted) average
class FeatureSelectionTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, type_feat = 'weighted_mean_10'):
        self.type_feat = type_feat
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        '''
        INPUT: Dataframe X
        OUTPUT: X restricted to the columns specified in 'type_feat' and the columns are sorted
        '''
        cols_feat = X.columns
        cols_feat_selected = [col for col in cols_feat if self.type_feat in col]
        
        cols_feat_selected.sort()
        
        return X[cols_feat_selected]
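
The selection is a plain substring match on the column names. The following sketch shows which columns survive the filter and that the match still works once the suffixes `_A` and `_B` have been appended. The list `columns` is only an illustration built from a few names of the feature table.

```python
# A few column names as they might look after the '_A' / '_B' suffixes are added
columns = ['team_A', 'goals_weighted_mean_5_A', 'goals_weighted_mean_10_A',
           'goals_normal_mean_10_A', 'team_B', 'goals_weighted_mean_10_B',
           'blocks_normal_mean_5_B']

type_feat = 'weighted_mean_10'

# Same logic as in FeatureSelectionTransformer.transform: substring match, then sort
selected = sorted(col for col in columns if type_feat in col)
print(selected)
```

Because the match is a substring test, a less specific value such as `'mean_10'` would select the weighted and the normal averages at the same time.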

In [10]:
def get_goal_prob_for_pairings(df_feat, df_matches, model):
    '''
    INPUT:  Dataframe df_matches that contains the matches for which we want to compute the probability distribution
            of the goals of each team according to the model.
            Dataframe df_feat that contains the engineered features of the last match before the beginning of Euro 2020.
            A model that predicts the probabilities for the goals of each team based on the features in df_feat.
    OUTPUT: The probability distributions for the goals scored by team_A and team_B, respectively.
    '''
    
    df_feat_A = df_feat.copy()
    cols_A = [col + '_A' for col in df_feat_A.columns]
    df_feat_A.columns = cols_A

    df_feat_B = df_feat.copy()
    cols_B = [col + '_B' for col in df_feat_B.columns]
    df_feat_B.columns = cols_B
      
    df_feat_goals_A = pd.merge(df_matches.copy(), df_feat_A, how = 'inner', on = 'team_A')  
    df_feat_goals_A = pd.merge(df_feat_goals_A, df_feat_B, how = 'inner', on = 'team_B')
    df_feat_goals_A = pd.merge(df_matches, df_feat_goals_A, how = 'inner', on = ['team_A', 'team_B'])
    
    df_feat_goals_B = df_matches.copy()
    df_feat_goals_B.columns = ['team_B', 'team_A']
    df_feat_goals_B = pd.merge(df_feat_goals_B, df_feat_B, how = 'inner', on = 'team_B')
    df_feat_goals_B = pd.merge(df_feat_goals_B, df_feat_A, how = 'inner', on = 'team_A')
    
    df_matches_reversed = df_matches.rename(columns = {'team_A' : 'team_B',
                                                       'team_B' : 'team_A'})
    
    df_feat_goals_B = pd.merge(df_matches_reversed, df_feat_goals_B, how = 'inner', on = ['team_A', 'team_B'])

    prob_distr_goals_A = model.predict_proba(df_feat_goals_A)
    prob_distr_goals_B = model.predict_proba(df_feat_goals_B)
    
    return prob_distr_goals_A, prob_distr_goals_B

In [11]:
def get_result_with_max_expected_points(p_goals_A, p_goals_B):
    '''
    INPUT:  Probability distributions 'p_goals_A' and 'p_goals_B' of the goals scored by team_A and team_B, respectively.
    OUTPUT: The result that maximizes the expected points from Kicktipp.
    
    DESCRIPTION: In Kicktipp we receive 4 points for an exact result, 3 points for the correct goal difference of a win
                 (for a draw the goal difference is always zero, so a draw bet earns either 4 points for the exact
                 result or 2 points for the tendency), and 2 points for the right tendency
                 (win of team_A, win of team_B or draw).
    '''
    
    p_goals_A = np.reshape(p_goals_A, (10, 1))
    p_goals_B = np.reshape(p_goals_B, (1, 10))
    
    # matrix with the probability of all results
    mat_prob = np.matmul(p_goals_A, p_goals_B)
    
    # the maximum expected points of a draw are two times the probability of a draw (sum of all probabilities on the diagonal)
    # and two times the maximum probability of the diagonal
    max_expected_points_draw = 2*sum(np.diagonal(mat_prob, 0)) +2*max(np.diagonal(mat_prob, 0))
    best_draw = (np.argmax(np.diagonal(mat_prob, 0)), np.argmax(np.diagonal(mat_prob, 0)))

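    # the maximum expected points for a win are a bit more complicated than for a draw.
    # Rows index the goals of team_A and columns the goals of team_B, so wins of team_A lie below the main diagonal
    # (offset -i) and wins of team_B above it (offset i). We need two times the probability of the win (tendency), plus
    # the sum of one minor diagonal (goal difference) and the maximum probability on that minor diagonal (exact result).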
    prob_win_A = sum([sum(np.diagonal(mat_prob, -i)) for i in range(1, 10)])
    prob_win_B = sum([sum(np.diagonal(mat_prob, i)) for i in range(1, 10)])

    max_prob_A = 0
    best_win_A = (1, 0)
    
    max_prob_B = 0
    best_win_B = (0, 1)
    
    for i in range(1, 10):
        if max(max_prob_A, sum(np.diagonal(mat_prob, -i)) + max(np.diagonal(mat_prob, -i))) > max_prob_A:
            max_prob_A = max(max_prob_A, sum(np.diagonal(mat_prob, -i)) + max(np.diagonal(mat_prob, -i)))
            best_win_A = (i + np.argmax(np.diagonal(mat_prob, -i)), np.argmax(np.diagonal(mat_prob, -i)))
        
        if max(max_prob_B, sum(np.diagonal(mat_prob, i)) + max(np.diagonal(mat_prob, i))) > max_prob_B:
            max_prob_B = max(max_prob_B, sum(np.diagonal(mat_prob, i)) + max(np.diagonal(mat_prob, i)))
            best_win_B = (np.argmax(np.diagonal(mat_prob, i)), i + np.argmax(np.diagonal(mat_prob, i)))
    
    max_expected_points_win_A = 2*prob_win_A + max_prob_A
    max_expected_points_win_B = 2*prob_win_B + max_prob_B
    
    max_expected_points = max(max_expected_points_draw, max_expected_points_win_A, max_expected_points_win_B)
    
    if max_expected_points_win_A == max_expected_points:
        return best_win_A
    elif max_expected_points_win_B == max_expected_points:
        return best_win_B
    else:
        return best_draw
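
As a cross-check of the diagonal arithmetic above, the best bet can also be found by brute force: compute the expected points of every possible bet and take the maximum. The sketch below assumes, like the function above, that the goals of both teams are independent. The names `kicktipp_points` and `best_bet_brute_force` are hypothetical helpers and not part of the notebook.

```python
def kicktipp_points(goals_a, goals_b, bet_a, bet_b):
    """Points for one bet: 4 exact, 3 goal difference (wins only), 2 tendency, else 0."""
    if (goals_a, goals_b) == (bet_a, bet_b):
        return 4
    diff, diff_bet = goals_a - goals_b, bet_a - bet_b
    same_tendency = (diff > 0) == (diff_bet > 0) and (diff < 0) == (diff_bet < 0)
    if not same_tendency:
        return 0
    return 3 if diff == diff_bet and diff != 0 else 2


def best_bet_brute_force(p_goals_a, p_goals_b):
    """Try every possible bet and return the one with the highest expected points."""
    n = len(p_goals_a)  # both distributions are assumed to have the same length
    best_bet, best_value = None, -1.0
    for bet_a in range(n):
        for bet_b in range(n):
            expected = sum(p_goals_a[i] * p_goals_b[j] * kicktipp_points(i, j, bet_a, bet_b)
                           for i in range(n) for j in range(n))
            if expected > best_value:
                best_bet, best_value = (bet_a, bet_b), expected
    return best_bet, best_value


# Toy distributions over 0..1 goals: all four results are equally likely
print(best_bet_brute_force([0.5, 0.5], [0.5, 0.5]))
```

The brute-force version is much slower than the diagonal sums, but it is easy to verify by hand and therefore useful for testing.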

In [12]:
def evaluate_bet(goals_A, goals_B, goals_A_bet, goals_B_bet):
    '''
    INPUT:  The goals of the actual result and the goals of our bet.
    OUTPUT: The points we would receive in Kicktipp. (We receive 4 points for the exact result, 3 points for the correct goal
            difference of a win (a draw bet on a drawn match earns 4 points if it is exact and 2 points otherwise), and
            2 points for the correct tendency (win of team_A, win of team_B or draw))
    '''
    
    # Draw
    if goals_A == goals_B:
        if (goals_A_bet == goals_A) and (goals_B_bet == goals_B):
            return 4
        elif (goals_A_bet == goals_B_bet):
            return 2
        else:
            return 0
    # Win of team_A
    elif goals_A > goals_B:
        if (goals_A_bet == goals_A) and (goals_B_bet == goals_B):
            return 4
        elif (goals_A_bet -goals_B_bet) == (goals_A - goals_B):
            return 3
        elif goals_A_bet > goals_B_bet:
            return 2
        else:
            return 0
    # Win of team_B
    else:
        if (goals_A_bet == goals_A) and (goals_B_bet == goals_B):
            return 4
        elif (goals_A_bet - goals_B_bet) == (goals_A - goals_B):
            return 3
        elif goals_A_bet < goals_B_bet:
            return 2
        else:
            return 0

### Prediction and Evaluation

In [13]:
# Recent pandas versions reject the 'encoding' argument of read_excel, so it is omitted here
df_feat_teams = pd.read_excel(os.path.join(path_data, file_features_last_match))
df_matches_em = pd.read_excel(os.path.join(path_data, file_matches_em))

df_matches_em = df_matches_em.rename(columns = {'team_home' : 'team_A',
                                                'team_away' : 'team_B',
                                                'goals_home' : 'goals_A',
                                                'goals_away' : 'goals_B'
                                               })

In [14]:
df_feat_teams.head(5)

Unnamed: 0,team,goals_weighted_mean_5,goals_weighted_mean_10,goals_normal_mean_5,goals_normal_mean_10,attempts_total_weighted_mean_5,attempts_total_weighted_mean_10,attempts_total_normal_mean_5,attempts_total_normal_mean_10,attempts_off_target_weighted_mean_5,...,blocks_normal_mean_5,blocks_normal_mean_10,clearances_weighted_mean_5,clearances_weighted_mean_10,clearances_normal_mean_5,clearances_normal_mean_10,passes_accuracy_weighted_mean_5,passes_accuracy_weighted_mean_10,passes_accuracy_normal_mean_5,passes_accuracy_normal_mean_10
0,Bosnia and Herzegovina,0.515524,0.872115,0.6,1.2,8.927936,11.426674,10.2,13.2,4.401353,...,4.6,4.0,0.0,0.0,0.0,0.0,72.406342,79.554149,80.6,83.6
1,Poland,0.913573,1.312851,1.0,1.5,7.622542,10.6908,8.4,12.2,2.467555,...,4.8,4.1,0.0,0.0,0.0,0.0,72.917068,79.576276,80.8,82.8
2,Italy,0.968606,2.0907,1.0,2.6,13.735538,16.302866,15.6,18.2,5.37095,...,2.2,2.5,0.0,0.0,0.0,0.0,77.806916,84.911828,86.8,88.5
3,Netherlands,1.068382,1.612555,1.0,1.9,12.766696,15.254578,14.0,16.8,4.696133,...,1.8,1.6,0.0,0.0,0.0,0.0,76.604228,83.63274,85.2,87.0
4,England,0.471127,1.711075,0.6,2.6,10.240947,11.909175,10.8,12.5,3.606357,...,1.4,1.2,0.0,0.0,0.0,0.0,77.021155,83.828621,86.2,87.3


In [15]:
df_matches_em.head(5)

Unnamed: 0,url,team_A,team_B,goals_A,goals_B
0,https://www.uefa.com/uefaeuro-2020/match/2024447/,Turkey,Italy,0,3
1,https://www.uefa.com/uefaeuro-2020/match/2024448/,Wales,Switzerland,1,1
2,https://www.uefa.com/uefaeuro-2020/match/2024457/,Turkey,Wales,0,2
3,https://www.uefa.com/uefaeuro-2020/match/2024458/,Italy,Switzerland,3,0
4,https://www.uefa.com/uefaeuro-2020/match/2024467/,Switzerland,Turkey,3,1


In [16]:
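# Load the trained model; the context manager closes the file afterwards
with open(os.path.join(path_models, name_model), 'rb') as file_model:
    model = pickle.load(file_model)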

In [17]:
prob_goals_A, prob_goals_B = get_goal_prob_for_pairings(df_feat_teams, df_matches_em[['team_A', 'team_B']], model)
prob_goals = list(zip(prob_goals_A, prob_goals_B))

In [18]:
result_max_expected_points = [get_result_with_max_expected_points(p_goals_A, p_goals_B) for p_goals_A, p_goals_B in prob_goals]

In [19]:
df_matches_em['goals_A_bet'] = [result[0] for result in result_max_expected_points]
df_matches_em['goals_B_bet'] = [result[1] for result in result_max_expected_points]

In [20]:
df_matches_em.head(5)

Unnamed: 0,url,team_A,team_B,goals_A,goals_B,goals_A_bet,goals_B_bet
0,https://www.uefa.com/uefaeuro-2020/match/2024447/,Turkey,Italy,0,3,1,2
1,https://www.uefa.com/uefaeuro-2020/match/2024448/,Wales,Switzerland,1,1,0,1
2,https://www.uefa.com/uefaeuro-2020/match/2024457/,Turkey,Wales,0,2,1,0
3,https://www.uefa.com/uefaeuro-2020/match/2024458/,Italy,Switzerland,3,0,1,0
4,https://www.uefa.com/uefaeuro-2020/match/2024467/,Switzerland,Turkey,3,1,1,0


In [21]:
df_matches_em['Points_Kicktipp'] = df_matches_em.apply(lambda row : evaluate_bet(row['goals_A'], row['goals_B'], row['goals_A_bet'], row['goals_B_bet']), axis = 1)

In [22]:
# The number of points we receive from the bets on single matches
df_matches_em['Points_Kicktipp'].sum()

89

## Prediction and Evaluation of Bonus Bets

### Auxiliary Functions

In [23]:
def prepare_games_in_group(list_of_teams, df_feat_teams):
    '''
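    INPUT:  A list of the teams in a group and the dataframe df_feat_teams with the engineered features of each team.
    OUTPUT: A dataframe with all pairings of the teams in the group together with the features of team_A and team_B.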
    '''
    df_schedule_group = pd.DataFrame(data = list(itertools.combinations(list_of_teams, 2)), columns = ['team_A', 'team_B'])
    
    df_feat_teams_A = df_feat_teams.copy()
    df_feat_teams_A.columns = [col + '_A' for col in df_feat_teams_A.columns]
    
    df_feat_teams_B = df_feat_teams.copy()
    df_feat_teams_B.columns = [col + '_B' for col in df_feat_teams_B.columns]
    
    df_schedule_group_feat = pd.merge(df_schedule_group, df_feat_teams_A, how = 'inner', on = 'team_A')
    df_schedule_group_feat = pd.merge(df_schedule_group_feat, df_feat_teams_B, how = 'inner', on = 'team_B')
    
    return df_schedule_group_feat

In [24]:
def change_A_B_in_cols(name):
    '''
    INPUT:  String 'name'
    OUTPUT: 'name', where the suffix '_A' is replaced by '_B' and vice versa
    '''
    name_new = name
    # swap only the suffix; str.replace would also change an '_A' or '_B' inside the name
    if name.endswith('_A'):
        name_new = name[:-2] + '_B'
    elif name.endswith('_B'):
        name_new = name[:-2] + '_A'
    
    return name_new

In [25]:
def simulate_match(p_goals_A, p_goals_B):
    '''
    INPUT:  A probability distribution for the goals scored by team_A and team_B, respectively.
    OUTPUT: A random result of the game, where the goals are distributed according to the probability distribution given
            in the input.
    '''
    population_goals = [i for i in range(10)]
    goals_sim_A = choices(population_goals, p_goals_A)[0]
    goals_sim_B = choices(population_goals, p_goals_B)[0]
    
    return goals_sim_A, goals_sim_B
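
The sampling relies on `random.choices`, which draws from a population with the given weights. The sketch below uses a made-up goal distribution and a seeded generator to show that the drawn frequencies approach the weights. The seed is an addition for reproducibility; the notebook itself does not fix one.

```python
import random
from collections import Counter

# Hypothetical probability distribution over 0..9 goals
p_goals = [0.30, 0.35, 0.20, 0.10, 0.05, 0, 0, 0, 0, 0]

# A seeded generator gives the same draws on every run
rng = random.Random(42)
draws = rng.choices(range(10), weights=p_goals, k=10_000)

freq = Counter(draws)
for goals in range(5):
    print(goals, freq[goals] / len(draws))
```

Calling `random.seed(...)` once at the top of the notebook would have a similar effect for the module-level `choices` used above.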

In [26]:
def simulate_match_without_draw(p_goals_A, p_goals_B):
    '''
    INPUT:  A probability distribution for the goals scored by team_A and team_B, respectively.
    OUTPUT: A random result of the game that is not a draw, where the goals are distributed according to 
            the probability distribution given in the input.
    '''
    population_goals = [i for i in range(10)]
    goals_sim_A, goals_sim_B = 0, 0
    
    while(goals_sim_A == goals_sim_B):
        goals_sim_A = choices(population_goals, p_goals_A)[0]
        goals_sim_B = choices(population_goals, p_goals_B)[0]
    
    return goals_sim_A, goals_sim_B
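
Redrawing until the result is decisive means that the winner is sampled from the distribution conditioned on "no draw". The sketch below computes this conditional probability directly for two made-up distributions, again assuming independent goals. The helper `outcome_probs` is hypothetical.

```python
def outcome_probs(p_a, p_b):
    """Probabilities (win of A, draw, win of B) for independent goal distributions."""
    win_a = sum(pa * pb for i, pa in enumerate(p_a) for j, pb in enumerate(p_b) if i > j)
    draw = sum(pa * pb for i, pa in enumerate(p_a) for j, pb in enumerate(p_b) if i == j)
    return win_a, draw, 1 - win_a - draw

# Hypothetical distributions over 0..2 goals
p_a = [0.2, 0.5, 0.3]
p_b = [0.5, 0.4, 0.1]

win_a, draw, win_b = outcome_probs(p_a, p_b)

# Probability that team A advances when draws are redrawn
p_a_advances = win_a / (1 - draw)
print(win_a, draw, win_b, p_a_advances)
```

If a model ever assigned probability 1 to a draw, the `while` loop above would not terminate, so this approach relies on every pairing having a decisive result with positive probability.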

In [27]:
def compute_points(goals_1, goals_2):
    '''
    INPUT:  A soccer result.
    OUTPUT: The number of points a team receives for a result in UEFA Euro 2020.
    '''
    if goals_1 > goals_2:
        return 3
    elif goals_1 == goals_2:
        return 1
    else:
        return 0

### Class that represents a single group in UEFA Euro 2020

In [28]:
# The class Group is used for the Monte Carlo simulation of the tournament.

class Group():
    def __init__(self, list_of_teams, df_feat, model):
        '''
        INPUT:  A list of four teams forming the group, a dataframe containing the features, and a model that indicates
                the probability distribution of the goals scored by the teams.
        OUTPUT: -
        '''
        self.list_of_teams = list_of_teams
        self.list_of_pairings = list(itertools.combinations(self.list_of_teams, 2))
        
        self.df_pairings = pd.DataFrame(data = self.list_of_pairings, columns = ['team_A', 'team_B'])
        
        self.prob_distr_goals_A, self.prob_distr_goals_B = get_goal_prob_for_pairings(df_feat, self.df_pairings, model)
        
    def simulate_group(self):
        '''
        INPUT:  -
        OUTPUT: The teams on the first, second and third place of the group. Moreover, the points, goal difference and number
                of scored goals of the team on the third place.
        '''
        
        population_goals = [i for i in range(10)]
        df_results = self.df_pairings.copy()
        
        # The probability distributions of the goals scored in the matches
        prob_distr_goals_matches = list(zip(self.prob_distr_goals_A, self.prob_distr_goals_B))
        
        # Simulate matches
        goals_sim = [simulate_match(prob_distr[0], prob_distr[1]) for prob_distr in prob_distr_goals_matches]
        
        goals_sim_team_A = [goals[0] for goals in goals_sim]
        goals_sim_team_B = [goals[1] for goals in goals_sim]
         
        # Computation of the table at the end of the group phase
        df_results['goals_sim_A'], df_results['goals_sim_B'] = goals_sim_team_A, goals_sim_team_B
        
        df_results['points_sim_A'] = df_results.apply(lambda row : compute_points(row['goals_sim_A'], row['goals_sim_B']), axis = 1)
        df_results['points_sim_B'] = df_results.apply(lambda row : compute_points(row['goals_sim_B'], row['goals_sim_A']), axis = 1)
        
        df_results_mirrored = df_results.copy()
        df_results_mirrored.columns = [change_A_B_in_cols(col) for col in df_results_mirrored.columns]
        
        df_results = pd.concat([df_results, df_results_mirrored], axis = 0, sort = False).reset_index(drop = True)
        df_results['victories_sim_A'] = df_results['points_sim_A'].apply(lambda points : int(points == 3))
        
        df_results = df_results.groupby('team_A')[['points_sim_A', 'goals_sim_A', 'goals_sim_B', 'victories_sim_A']].agg('sum').reset_index()
        df_results['difference_sim_A'] = df_results.apply(lambda row : row['goals_sim_A'] - row['goals_sim_B'], axis = 1)
        
        df_results = df_results.sort_values(by = ['points_sim_A', 'difference_sim_A', 'goals_sim_A'], ascending = [0, 0, 0])
        df_results = df_results.reset_index(drop = True)
    
        cols_third_team = ['team_A', 'points_sim_A', 'difference_sim_A', 'goals_sim_A', 'victories_sim_A']
    
        return list(df_results['team_A'].values[:2]) +  list(df_results[cols_third_team].iloc[2].values)
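
The table computation in `simulate_group` can be followed more easily on a small example without pandas. The sketch below ranks four placeholder teams by points, goal difference and goals scored, which is the same sort order as above. The results and the function `group_table` are invented for illustration.

```python
from collections import defaultdict

def group_table(results):
    """Rank teams by points, then goal difference, then goals scored.

    results: list of tuples (team_a, team_b, goals_a, goals_b).
    """
    stats = defaultdict(lambda: {'points': 0, 'scored': 0, 'conceded': 0})
    for team_a, team_b, goals_a, goals_b in results:
        # Book every match once from the view of each team
        for team, scored, conceded in ((team_a, goals_a, goals_b), (team_b, goals_b, goals_a)):
            stats[team]['points'] += 3 if scored > conceded else 1 if scored == conceded else 0
            stats[team]['scored'] += scored
            stats[team]['conceded'] += conceded

    def sort_key(team):
        s = stats[team]
        return (s['points'], s['scored'] - s['conceded'], s['scored'])

    return sorted(stats, key=sort_key, reverse=True)

# Hypothetical results of the six matches in a group of four
results = [('W', 'X', 2, 0), ('W', 'Y', 1, 1), ('W', 'Z', 3, 1),
           ('X', 'Y', 1, 0), ('X', 'Z', 2, 2), ('Y', 'Z', 0, 1)]

table = group_table(results)
print(table)
```

Like `simulate_group`, this ordering does not consider head-to-head comparisons between teams with equal points.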

### Class that represents the Tournament Euro 2020

In [29]:
class EURO2020():
    def __init__(self, df_schedule, df_feat, model):
        '''
        INPUT:  A dataframe containing the schedule of Euro 2020, a dataframe containing the engineered features and a model
                that predicts the probabilities of goals scored by teams.
        OUTPUT: -
        '''
        self.groups = [Group(df_schedule[col].values, df_feat, model) for col in df_schedule]
    
    def compute_round_of_last_16(self, results_groups):
        '''
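        INPUT:  The results of the six groups as returned by Group.simulate_group (in the order of groups A to F).
        OUTPUT: The eight pairings of the round of the last 16 as a list of [team_A, team_B].
        
        DESCRIPTION: Since 24 teams participated in Euro 2020, the rules for the round of the last 16 are a bit complicated.
                     Apart from the first and second place of each group, the four best third-placed teams also take part
                     in the round of the last 16. The pairings depend on which groups the four best third-placed teams
                     come from.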
        '''
        
        # Computation, which teams on the third place in each group are the best four
        data_thirds = [result_group[2:] for result_group in results_groups]
        
        df_ranking_thirds = pd.DataFrame(data = data_thirds, columns = ['team', 'points', 'goal_difference', 'goals', 'victories'])
        df_ranking_thirds['group'] = ['A', 'B', 'C', 'D', 'E', 'F']
        
        df_ranking_thirds = df_ranking_thirds.sort_values(by = ['points', 'goal_difference', 'goals', 'victories'], ascending = [0, 0, 0, 0]).reset_index(drop = True)
        
        df_best_4_thirds = df_ranking_thirds[['team', 'group']].iloc[:4].sort_values(by = ['group']).reset_index(drop = True)
        
        best_4_thirds = df_best_4_thirds['group'].sum()
        
        # Computation of the round of the last 16
        dict_group_to_team = dict(zip(df_best_4_thirds['group'], df_best_4_thirds['team']))
        
        # This table stems from the rules of Euro 2020
        dict_opponents_thirds = {'ABCD' : ['A', 'D', 'B', 'C'],
                                 'ABCE' : ['A', 'E', 'B', 'C'],
                                 'ABCF' : ['A', 'F', 'B', 'C'],
                                 'ABDE' : ['D', 'E', 'A', 'B'],
                                 'ABDF' : ['D', 'F', 'A', 'B'],
                                 'ABEF' : ['E', 'F', 'B', 'A'],
                                 'ACDE' : ['E', 'D', 'C', 'A'],
                                 'ACDF' : ['F', 'D', 'C', 'A'],
                                 'ACEF' : ['E', 'F', 'C', 'A'],
                                 'ADEF' : ['E', 'F', 'D', 'A'],
                                 'BCDE' : ['E', 'D', 'B', 'C'],
                                 'BCDF' : ['F', 'D', 'C', 'B'],
                                 'BCEF' : ['F', 'E', 'C', 'B'],
                                 'BDEF' : ['F', 'E', 'D', 'B'],
                                 'CDEF' : ['F', 'E', 'D', 'C']
                                }
        
        round_of_last_16 = [[results_groups[1][0], dict_group_to_team[dict_opponents_thirds[best_4_thirds][0]]],
                            [results_groups[0][0], results_groups[2][1]],
                            [results_groups[5][0], dict_group_to_team[dict_opponents_thirds[best_4_thirds][3]]],
                            [results_groups[3][1], results_groups[4][1]],
                            [results_groups[4][0], dict_group_to_team[dict_opponents_thirds[best_4_thirds][2]]],
                            [results_groups[3][0], results_groups[5][1]],
                            [results_groups[2][0], dict_group_to_team[dict_opponents_thirds[best_4_thirds][1]]],
                            [results_groups[0][1], results_groups[1][1]]
                           ]

        return round_of_last_16
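
The lookup of the third-placed teams can be traced on a small example. The sketch below ranks six invented third-placed teams, builds the lookup key from the groups of the best four, and reads the opponents from one row copied from `dict_opponents_thirds`. The labels `'1B'`, `'1C'`, `'1E'` and `'1F'` for the group winners are only used here for readability.

```python
# Hypothetical third-placed teams: (group, team, points, goal_difference, goals, victories)
thirds = [('A', 'Third A', 4, 0, 3, 1),
          ('B', 'Third B', 3, -1, 2, 1),
          ('C', 'Third C', 3, -2, 2, 1),
          ('D', 'Third D', 4, 1, 4, 1),
          ('E', 'Third E', 2, -2, 1, 0),
          ('F', 'Third F', 3, -1, 3, 1)]

# Rank by points, goal difference, goals and victories (all descending), keep the best four
best_four = sorted(thirds, key=lambda t: t[2:], reverse=True)[:4]

# The lookup key is the sorted string of group letters
key = ''.join(sorted(group for group, *_ in best_four))

# One row copied from dict_opponents_thirds: opponents of the winners of groups B, C, E and F
row = {'ABDF': ['D', 'F', 'A', 'B']}[key]

group_to_team = {group: team for group, team, *_ in best_four}
opponents = dict(zip(['1B', '1C', '1E', '1F'], (group_to_team[g] for g in row)))
print(key, opponents)
```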
    
    def simulate_knockout_phase(self, df_knockout_phase, df_feat_teams, model):
        '''
        INPUT:  A dataframe containing the pairings in the knockout phase, a dataframe containing the engineered features
                and a model.
        OUTPUT: The winners of the simulated knockout phase.
        '''
        prob_distr_goals_A, prob_distr_goals_B = get_goal_prob_for_pairings(df_feat_teams, df_knockout_phase, model)
        
        prob_distr_goals_matches = list(zip(prob_distr_goals_A, prob_distr_goals_B))
        
        goals_sim = [simulate_match_without_draw(prob_distr[0], prob_distr[1]) for prob_distr in prob_distr_goals_matches]
        
        goals_sim_team_A = [goals[0] for goals in goals_sim]
        goals_sim_team_B = [goals[1] for goals in goals_sim]
                
        df_knockout_phase['goals_sim_A'], df_knockout_phase['goals_sim_B'] = goals_sim_team_A, goals_sim_team_B
        
        df_knockout_phase['winner'] = df_knockout_phase.apply(lambda row : row['team_A'] if row['goals_sim_A'] > row['goals_sim_B'] else row['team_B'], axis = 1)

        return df_knockout_phase['winner'].values
    
    def simulate_tournament(self, df_feat_teams, model):
        '''
        INPUT:  A dataframe containing the features of the teams and a model.
        OUTPUT: The winners of the groups, the participants of the semi-finals, and the champion of a simulated run
                of the tournament.
        '''
        results_groups = [g.simulate_group() for g in self.groups]

        winners_groups = [result_group[0] for result_group in results_groups]
        
        # Simulation round of last sixteen
        games_ko_last_16 = self.compute_round_of_last_16(results_groups)      
        df_ko_last_16 = pd.DataFrame(data = games_ko_last_16, columns = ['team_A', 'team_B'])
        winners_ko_last_16 = self.simulate_knockout_phase(df_ko_last_16, df_feat_teams, model)
        
        # Simulation round of last eight
        games_ko_last_8 = [winners_ko_last_16[2*i: 2*(i+1)] for i in range(len(winners_ko_last_16)//2)]
        df_ko_last_8 = pd.DataFrame(data = games_ko_last_8, columns = ['team_A', 'team_B'])
        winners_ko_last_8 = self.simulate_knockout_phase(df_ko_last_8, df_feat_teams, model)
        
        # Simulation of semi-finals
        games_ko_last_4 = [winners_ko_last_8[2*i: 2*(i+1)] for i in range(len(winners_ko_last_8)//2)]
        df_ko_last_4 = pd.DataFrame(data = games_ko_last_4, columns = ['team_A', 'team_B'])
        winners_ko_last_4 = self.simulate_knockout_phase(df_ko_last_4, df_feat_teams, model)
        
        # Simulation of the final
        games_ko_last_2 = [winners_ko_last_4[2*i: 2*(i+1)] for i in range(len(winners_ko_last_4)//2)]
        df_ko_last_2 = pd.DataFrame(data = games_ko_last_2, columns = ['team_A', 'team_B'])
        winners_ko_last_2 = self.simulate_knockout_phase(df_ko_last_2, df_feat_teams, model)
        
        return winners_groups, list(winners_ko_last_8), list(winners_ko_last_2)
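
Each knockout round is built by pairing neighbouring winners of the previous round, which is why the order of the pairings in `compute_round_of_last_16` matters. A minimal sketch of the slicing with placeholder names:

```python
# Placeholder winners of the eight matches in the round of the last 16, in bracket order
winners_last_16 = ['W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7', 'W8']

# Winners of matches 1 and 2 meet, then 3 and 4, and so on
quarter_finals = [winners_last_16[2*i: 2*(i+1)] for i in range(len(winners_last_16)//2)]
print(quarter_finals)

# An equivalent formulation with zip (yields tuples instead of lists)
quarter_finals_zip = list(zip(winners_last_16[0::2], winners_last_16[1::2]))
```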
    
    def monte_carlo(self, df_feat_teams, model, n):
        '''
        INPUT:  A dataframe containing the features, the model, and the number 'n' of repetitions of the Monte Carlo Simulation.
        OUTPUT: Dictionaries with the frequencies of the group winners, a dictionary with the frequencies of participants of
                the semi-finals and a dictionary with the frequencies of the champion.
        '''
        dicts_winners_groups = [defaultdict(int) for i in range(6)]
        dict_semi_finals = defaultdict(int)
        dict_champion = defaultdict(int)
        
        for _ in range(n):
            winners_groups, semi_finals, champion = self.simulate_tournament(df_feat_teams, model)
            
            for dict_winners_group, winner_group in zip(dicts_winners_groups, winners_groups):
                dict_winners_group[winner_group] += 1
                
            for team in semi_finals:
                dict_semi_finals[team] += 1
                
            dict_champion[champion[0]] += 1
        
        return dicts_winners_groups, dict_semi_finals, dict_champion

### Evaluation and Prediction

In [30]:
data_schedule = [['Italy', 'Belgium', 'Netherlands', 'England', 'Sweden', 'France'],
                 ['Wales', 'Denmark', 'Austria', 'Croatia', 'Spain', 'Germany'],
                 ['Switzerland', 'Finland', 'Ukraine', 'Czech Republic', 'Slovakia', 'Portugal'],
                 ['Turkey', 'Russia', 'North Macedonia', 'Scotland', 'Poland', 'Hungary']
                ]

df_schedule = pd.DataFrame(data = data_schedule, columns = ['Group ' + L for L in ['A', 'B', 'C', 'D', 'E', 'F']])

df_schedule

Unnamed: 0,Group A,Group B,Group C,Group D,Group E,Group F
0,Italy,Belgium,Netherlands,England,Sweden,France
1,Wales,Denmark,Austria,Croatia,Spain,Germany
2,Switzerland,Finland,Ukraine,Czech Republic,Slovakia,Portugal
3,Turkey,Russia,North Macedonia,Scotland,Poland,Hungary


In [31]:
euro2020 = EURO2020(df_schedule, df_feat_teams, model)

In [32]:
dicts_winners_groups, dict_semi_finals, dict_champion = euro2020.monte_carlo(df_feat_teams, model, number_runs_monte_carlo)

In [33]:
print('Most likely winners of each group:')
for i in range(len(dicts_winners_groups)):
    print(Counter(dicts_winners_groups[i]).most_common()[0])
print()
    
print('Most likely teams in the semi-finals:')
print(Counter(dict_semi_finals).most_common()[:4])
print()

print('Most likely winner of UEFA Euro 2020:')
print(Counter(dict_champion).most_common()[0])

Most likely winners of each group:
('Italy', 982)
('Belgium', 872)
('Netherlands', 1015)
('England', 739)
('Spain', 1232)
('Germany', 720)

Most likely teams in the semi-finals:
[('Spain', 819), ('Italy', 718), ('Netherlands', 696), ('Germany', 656)]

Most likely winner of UEFA Euro 2020:
('Spain', 328)
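
The counts are easier to interpret as relative frequencies. The sketch below divides the counts copied from the output above by the number of runs, assuming the output was produced with `number_runs_monte_carlo = 2000` as set in the Input Parameter section.

```python
number_runs = 2000  # assumed value of number_runs_monte_carlo for the output above

# Counts copied from the output above
counts_group_winners = {'Italy': 982, 'Belgium': 872, 'Netherlands': 1015,
                        'England': 739, 'Spain': 1232, 'Germany': 720}
count_champion_spain = 328

# Share of simulated tournaments in which the team won its group
shares = {team: count / number_runs for team, count in counts_group_winners.items()}

for team, share in shares.items():
    print(f'{team}: {share:.1%}')
print(f'Spain wins the tournament: {count_champion_spain / number_runs:.1%}')
```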
