## World Cup 2022 predictions

The task is to predict the winner of the 2022 World Cup from the results of previous World Cups. The file `worldcups.csv` lists the winners of the previous tournaments, and the file `wcmatches.csv` contains the results of all matches played in them. The data is available from [Kaggle](https://www.kaggle.com/abecklas/fifa-world-cup). Part of the data can be used for training and the rest for testing.

The trained model is then used to predict the winner of the 2022 World Cup from the results of the matches already played in that tournament. These results are in the file `Fifa_world_cup_matches.csv`.

A few ideas for the prediction:

1. (Simple binary classification) Use the results of the matches in the group stage and the first round of the knockout stage to predict whether a given country wins the tournament. Disadvantage: the classes are imbalanced (there are more losers than winners).

2. Predict each match independently: as features, use the results of the given team's last 3 matches and of the opponent's last 3 matches, and predict the result of the match from them. To get the final winner, iterate through all matches in the knockout stage and predict the winner of each match.
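The knockout iteration described in idea 2 can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: `simulate_knockout` and `toy_predict_winner` are hypothetical names, and the team strengths are made-up numbers that only serve to make the example run.

```python
def simulate_knockout(teams, predict_winner):
    """Play rounds until one team remains; teams are paired in bracket order."""
    if len(teams) == 0 or len(teams) & (len(teams) - 1):
        raise ValueError("number of teams must be a power of two")
    current = list(teams)
    while len(current) > 1:
        # Pair neighbours (0, 1), (2, 3), ... and keep the predicted winner.
        current = [
            predict_winner(current[i], current[i + 1])
            for i in range(0, len(current), 2)
        ]
    return current[0]


# Hypothetical strengths, purely illustrative (not derived from the data).
toy_strength = {"A": 3, "B": 1, "C": 2, "D": 5}


def toy_predict_winner(team1, team2):
    # Stand-in for a trained model: the "stronger" team wins, ties go to team1.
    return team1 if toy_strength[team1] >= toy_strength[team2] else team2


champion = simulate_knockout(["A", "B", "C", "D"], toy_predict_winner)
print(champion)
```

In a real run, `predict_winner` would wrap the trained model, and the features of later rounds would have to include the predicted results of earlier rounds.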

### Import the data

In [1]:
import pandas as pd

winners = pd.read_csv('worldcups.csv')
winners.head()

Unnamed: 0,year,host,winner,second,third,fourth,goals_scored,teams,games,attendance
0,1930,Uruguay,Uruguay,Argentina,USA,Yugoslavia,70,13,18,434000
1,1934,Italy,Italy,Czechoslovakia,Germany,Austria,70,16,17,395000
2,1938,France,Italy,Hungary,Brazil,Sweden,84,15,18,483000
3,1950,Brazil,Uruguay,Brazil,Sweden,Spain,88,13,22,1337000
4,1954,Switzerland,West Germany,Hungary,Austria,Uruguay,140,16,26,943000


In [2]:
matches = pd.read_csv('wcmatches.csv')
matches.head()

Unnamed: 0,year,country,city,stage,home_team,away_team,home_score,away_score,outcome,win_conditions,winning_team,losing_team,date,month,dayofweek
0,1930,Uruguay,Montevideo,Group 1,France,Mexico,4,1,H,,France,Mexico,1930-07-13,Jul,Sunday
1,1930,Uruguay,Montevideo,Group 4,Belgium,United States,0,3,A,,United States,Belgium,1930-07-13,Jul,Sunday
2,1930,Uruguay,Montevideo,Group 2,Brazil,Yugoslavia,1,2,A,,Yugoslavia,Brazil,1930-07-14,Jul,Monday
3,1930,Uruguay,Montevideo,Group 3,Peru,Romania,1,3,A,,Romania,Peru,1930-07-14,Jul,Monday
4,1930,Uruguay,Montevideo,Group 1,Argentina,France,1,0,H,,Argentina,France,1930-07-15,Jul,Tuesday


In [3]:
matches2k22 = pd.read_csv('Fifa_world_cup_matches.csv')
matches2k22.head()

Unnamed: 0,team1,team2,possession team1,possession team2,possession in contest,number of goals team1,number of goals team2,date,hour,category,...,penalties scored team1,penalties scored team2,goal preventions team1,goal preventions team2,own goals team1,own goals team2,forced turnovers team1,forced turnovers team2,defensive pressures applied team1,defensive pressures applied team2
0,QATAR,ECUADOR,42%,50%,8%,0,2,20 NOV 2022,17 : 00,Group A,...,0,1,6,5,0,0,52,72,256,279
1,ENGLAND,IRAN,72%,19%,9%,6,2,21 NOV 2022,14 : 00,Group B,...,0,1,8,13,0,0,63,72,139,416
2,SENEGAL,NETHERLANDS,44%,45%,11%,0,2,21 NOV 2022,17 : 00,Group A,...,0,0,9,15,0,0,63,73,263,251
3,UNITED STATES,WALES,51%,39%,10%,1,1,21 NOV 2022,20 : 00,Group B,...,0,1,7,7,0,0,81,72,242,292
4,ARGENTINA,SAUDI ARABIA,64%,24%,12%,1,2,22 NOV 2022,11 : 00,Group C,...,1,0,4,14,0,0,65,80,163,361


In [4]:
# Create the set of unique teams from matches
teams = set(matches['home_team']).union(set(matches['away_team']))

In [5]:
years = set(matches['year'])

In [6]:
def get_goals_scored_lost(df, year, team):
    # All matches the team played in the given year
    results = df[(df['year'] == year) & ((df['home_team'] == team) | (df['away_team'] == team))]
    # At least 4 matches are needed: 3 as features and 1 as the target
    if len(results) <= 3:
        return None

    matches_results = []
    for i in range(len(results)):
        goals_scored = results.iloc[i]['home_score'] if results.iloc[i]['home_team'] == team else results.iloc[i]['away_score']
        goals_lost = results.iloc[i]['away_score'] if results.iloc[i]['home_team'] == team else results.iloc[i]['home_score']
        matches_results.append([goals_scored, goals_lost])

    return matches_results

In [7]:
df = pd.DataFrame(columns=['team', 'year', 'm1_gs', 'm1_gl', 'm2_gs', 'm2_gl', 'm3_gs', 'm3_gl', 'm4_gs', 'm4_gl', 'm5_gs', 'm5_gl', 'm6_gs', 'm6_gl', 'm7_gs', 'm7_gl'])

max_len = 0
for year in years:
    teams = set(matches[matches['year'] == year]['home_team']).union(set(matches[matches['year'] == year]['away_team']))
    for team in teams:
        results = get_goals_scored_lost(matches, year, team)
        if results is not None:
            # Add the results to the dataframe
            df.loc[len(df)] = [team, year] + [results[i][j] for i in range(len(results)) for j in range(len(results[0]))] + [None] * (14 - len(results)*2)

df.head()

Unnamed: 0,team,year,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,m4_gs,m4_gl,m5_gs,m5_gl,m6_gs,m6_gl,m7_gs,m7_gl
0,Uruguay,1930,1,0,4,0,6,1,4,2,,,,,,
1,Argentina,1930,1,0,6,3,3,1,6,1,2.0,4.0,,,,
2,Italy,1934,7,1,1,1,1,0,1,0,2.0,1.0,,,,
3,Germany,1934,5,2,2,1,1,3,3,2,,,,,,
4,Czechoslovakia,1934,2,1,3,2,3,1,1,2,,,,,,


In [8]:
df

Unnamed: 0,team,year,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,m4_gs,m4_gl,m5_gs,m5_gl,m6_gs,m6_gl,m7_gs,m7_gl
0,Uruguay,1930,1,0,4,0,6,1,4,2,,,,,,
1,Argentina,1930,1,0,6,3,3,1,6,1,2,4,,,,
2,Italy,1934,7,1,1,1,1,0,1,0,2,1,,,,
3,Germany,1934,5,2,2,1,1,3,3,2,,,,,,
4,Czechoslovakia,1934,2,1,3,2,3,1,1,2,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219,Croatia,2018,2,0,3,0,2,1,1,1,2,2,2,1,2,4
220,Switzerland,2018,1,1,2,1,2,2,0,1,,,,,,
221,Denmark,2018,1,0,1,1,0,0,1,1,,,,,,
222,Spain,2018,3,3,1,0,2,2,1,1,,,,,,


In [9]:
def get_opponent(row, match_idx):
    # Matches played by the team in the given year; match_idx is 1-based
    team_matches = matches[(matches['year'] == row['year']) & ((matches['home_team'] == row['team']) | (matches['away_team'] == row['team']))]
    match = team_matches.iloc[match_idx - 1]
    return match['home_team'] if match['home_team'] != row['team'] else match['away_team']

In [10]:
# Split each team's results into windows of 3 matches plus the following (target) match
df2 = pd.DataFrame(columns=['team', 'year', 'm1_gs', 'm1_gl', 'm2_gs', 'm2_gl', 'm3_gs', 'm3_gl', 'target_gs', 'target_gl'])
for i in range(len(df)):
    row = df.iloc[i]
    # Get the last not null values
    last_not_null = row.last_valid_index()
    idx = row.index.get_loc(last_not_null)
    while idx-7 >= 2:
        df2.loc[len(df2)] = [row.iloc[0], row.iloc[1]] + row.iloc[idx-7:idx+1].tolist()
        idx -= 2
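
The windowing above takes every run of three consecutive matches plus the match that follows it. The same idea on a plain list of `(goals_scored, goals_lost)` pairs, as a sketch; `make_windows` is a hypothetical helper and not part of the notebook:

```python
def make_windows(results, history=3):
    """Return (features, target) pairs from a team's chronological results.

    `results` is a list of (goals_scored, goals_lost) tuples. Each window
    flattens `history` consecutive matches into features and uses the next
    match as the target.
    """
    windows = []
    for end in range(history, len(results)):
        past = results[end - history:end]
        features = [value for match in past for value in match]
        windows.append((features, results[end]))
    return windows


# Argentina's 1930 results as shown in the table above.
argentina_1930 = [(1, 0), (6, 3), (3, 1), (6, 1), (2, 4)]
for features, target in make_windows(argentina_1930):
    print(features, "->", target)
```

A team with only three matches yields no window, which is consistent with such teams being skipped during preprocessing.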

In [11]:
df2

Unnamed: 0,team,year,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,target_gs,target_gl
0,Uruguay,1930,1,0,4,0,6,1,4,2
1,Argentina,1930,6,3,3,1,6,1,2,4
2,Argentina,1930,1,0,6,3,3,1,6,1
3,Italy,1934,1,1,1,0,1,0,2,1
4,Italy,1934,7,1,1,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...
471,Switzerland,2018,1,1,2,1,2,2,0,1
472,Denmark,2018,1,0,1,1,0,0,1,1
473,Spain,2018,3,3,1,0,2,2,1,1
474,Sweden,2018,1,2,3,0,1,0,0,2


In [12]:
X, y = df2[['m1_gl', 'm1_gs', 'm2_gl', 'm2_gs', 'm3_gl', 'm3_gs']], df2[['target_gl', 'target_gs']]

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

In [15]:
regr.score(X_test, y_test)

0.00034578652035771595
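
For a regressor, `score` returns the coefficient of determination R². A value near 0 means the model does barely better than always predicting the mean of the test targets, and a negative value means it does worse. A hand-rolled sketch for a single target follows; `r_squared` is a hypothetical helper and the numbers are illustrative, not taken from the data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


goals = [0, 1, 2, 3]               # illustrative true values
mean_only = [1.5, 1.5, 1.5, 1.5]   # always predict the mean
print(r_squared(goals, mean_only))  # 0.0
print(r_squared(goals, goals))      # 1.0
```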

In [16]:
predictions = regr.predict(X_test)

In [17]:
# Fraction of rows where the larger of the two predicted targets (argmax)
# is the same as the larger of the two true targets; ties resolve to index 0
def accuracy(predictions, y_test):
    correct = 0
    for i in range(len(predictions)):
        if predictions[i].argmax() == y_test.iloc[i].argmax():
            correct += 1

    return correct / len(predictions)

In [18]:
accuracy(predictions, y_test)

0.6041666666666666
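
Because `accuracy` compares only the argmax of the two targets, it checks which of the two goal counts is larger and cannot represent a draw: when both values are equal, argmax returns index 0. A sketch of a three-way outcome (win, draw, loss) that makes this explicit is shown below. The helpers `outcome` and `outcome_accuracy` and the rounding rule are assumptions for illustration, not part of the notebook.

```python
def outcome(goals_scored, goals_lost):
    """Classify a result from one team's perspective after rounding to whole goals."""
    scored, lost = round(goals_scored), round(goals_lost)
    if scored > lost:
        return "win"
    if scored < lost:
        return "loss"
    return "draw"


def outcome_accuracy(predicted, actual):
    """Share of matches whose predicted outcome equals the actual outcome."""
    hits = sum(outcome(*p) == outcome(*a) for p, a in zip(predicted, actual))
    return hits / len(actual)


# Illustrative values only: (goals_scored, goals_lost) pairs.
predicted = [(1.8, 0.9), (1.2, 1.1), (0.4, 2.2)]
actual = [(2, 1), (1, 1), (1, 0)]
print(outcome_accuracy(predicted, actual))
```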

In [19]:
import numpy as np

# Split the target into 2 columns for each regression problem
y_train1, y_train2 = y_train['target_gl'], y_train['target_gs']
y_test1, y_test2 = y_test['target_gl'], y_test['target_gs']

# Train the models
regr1 = linear_model.LinearRegression()
regr1.fit(X_train, y_train1)

regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.6041666666666666

In [20]:
from sklearn import svm, naive_bayes

# Train the models
regr1 = svm.SVR(kernel='rbf', C=1e3, gamma=0.1)
regr1.fit(X_train, y_train1)

regr2 = svm.SVR(kernel='rbf', C=1e3, gamma=0.1)
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.3958333333333333

In [21]:
regr1 = naive_bayes.GaussianNB()
regr1.fit(X_train, y_train1)

regr2 = naive_bayes.GaussianNB()
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.53125

In [22]:
list_C = np.arange(500, 2000, 100)
list_gamma = np.arange(0.01, 0.1, 0.01)
best_C = 0
best_gamma = 0
best_accuracy = 0

for C in list_C:
    for gamma in list_gamma:
        regr1 = svm.SVR(kernel='rbf', C=C, gamma=gamma)
        regr1.fit(X_train, y_train1)

        regr2 = svm.SVR(kernel='rbf', C=C, gamma=gamma)
        regr2.fit(X_train, y_train2)

        # Predict the values
        predictions1 = regr1.predict(X_test)
        predictions2 = regr2.predict(X_test)

        # Calculate the accuracy
        acc = accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

        if acc > best_accuracy:
            best_C = C
            best_gamma = gamma
            best_accuracy = acc

print(best_C, best_gamma, best_accuracy)

600 0.01 0.5833333333333334
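
The search above picks `C` and `gamma` by their accuracy on the test set, so the best accuracy it reports is an optimistic estimate. One common alternative is to hold out a separate validation part of the training data for the search. A dependency-free sketch of such a three-way split follows; `three_way_split`, the row count, and the 60/20/20 proportions are assumptions, not something the notebook does.

```python
import random


def three_way_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train, validation and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# 100 rows is an arbitrary illustrative size.
train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The hyperparameters would then be chosen by their accuracy on the validation rows, and the test rows used once for the final estimate.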


In [23]:
# Build windows of 3 matches for the team and for its opponent, plus the following (target) match
df3 = pd.DataFrame(columns=['t1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl', 'target_gs', 'target_gl'])
for i in range(len(df)):
    t1_row = df.iloc[i]
    # Get the last not null values
    last_not_null = t1_row.last_valid_index()
    idx = t1_row.index.get_loc(last_not_null)
   
    while idx-7 >= 2:
        match_idx = int((idx-1)/2)
        t2_row = df[(df['team'] == get_opponent(t1_row, match_idx)) & (df['year'] == t1_row['year'])]
        if len(t2_row) > 0:
            t2_row = t2_row.iloc[0]
            df3.loc[len(df3)] = t1_row[idx-7:idx-1].tolist() + t2_row[idx-7:idx-1].tolist() + [t1_row[idx-1], t1_row[idx]]
        idx -= 2

In [24]:
df3.dropna(inplace=True)

In [25]:
X, y = df3[['t1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl']], df3[['target_gl', 'target_gs']]

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [27]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

In [28]:
regr.score(X_test, y_test)

-0.03756092651098997

In [29]:
predictions = regr.predict(X_test)
accuracy(predictions, y_test)

0.6210526315789474

In [30]:
import numpy as np

# Split the target into 2 columns for each regression problem
y_train1, y_train2 = y_train['target_gl'], y_train['target_gs']
y_test1, y_test2 = y_test['target_gl'], y_test['target_gs']

# Train the models
regr1 = linear_model.LinearRegression()
regr1.fit(X_train, y_train1)

regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.6210526315789474

In [31]:
from sklearn import svm, naive_bayes

# Train the models
regr1 = svm.SVR(kernel='rbf', C=600, gamma=0.01)
regr1.fit(X_train, y_train1)

regr2 = svm.SVR(kernel='rbf', C=600, gamma=0.01)
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.5263157894736842

In [32]:
from sklearn import svm, naive_bayes

# Train the models
regr1 = svm.SVR(kernel='poly', degree=4)
regr1.fit(X_train, y_train1)

regr2 = svm.SVR(kernel='poly', degree=5)
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.5473684210526316

In [33]:
degree1s = np.arange(1, 15, 1)
degree2s = np.arange(1, 15, 1)

best_degree1 = 0
best_degree2 = 0
best_accuracy = 0

for degree1 in degree1s:
    for degree2 in degree2s:
        regr1 = svm.SVR(kernel='poly', degree=degree1)
        regr1.fit(X_train, y_train1)

        regr2 = svm.SVR(kernel='poly', degree=degree2)
        regr2.fit(X_train, y_train2)

        # Predict the values
        predictions1 = regr1.predict(X_test)
        predictions2 = regr2.predict(X_test)

        # Calculate the accuracy
        acc = accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

        if acc > best_accuracy:
            best_degree1 = degree1
            best_degree2 = degree2
            best_accuracy = acc

print(best_degree1, best_degree2, best_accuracy)

10 4 0.6947368421052632


In [34]:
regr1 = naive_bayes.GaussianNB()
regr1.fit(X_train, y_train1)

regr2 = naive_bayes.GaussianNB()
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.5263157894736842

In [35]:
# Best model
regr1 = svm.SVR(kernel='poly', degree=best_degree1)
regr1.fit(X_train, y_train1)

regr2 = svm.SVR(kernel='poly', degree=best_degree2)
regr2.fit(X_train, y_train2)

# Predict the values
predictions1 = regr1.predict(X_test)
predictions2 = regr2.predict(X_test)

# Calculate the accuracy
accuracy(np.stack([predictions1, predictions2], axis=1), y_test)

0.6947368421052632

In [36]:
matches2k22.head()

Unnamed: 0,team1,team2,possession team1,possession team2,possession in contest,number of goals team1,number of goals team2,date,hour,category,...,penalties scored team1,penalties scored team2,goal preventions team1,goal preventions team2,own goals team1,own goals team2,forced turnovers team1,forced turnovers team2,defensive pressures applied team1,defensive pressures applied team2
0,QATAR,ECUADOR,42%,50%,8%,0,2,20 NOV 2022,17 : 00,Group A,...,0,1,6,5,0,0,52,72,256,279
1,ENGLAND,IRAN,72%,19%,9%,6,2,21 NOV 2022,14 : 00,Group B,...,0,1,8,13,0,0,63,72,139,416
2,SENEGAL,NETHERLANDS,44%,45%,11%,0,2,21 NOV 2022,17 : 00,Group A,...,0,0,9,15,0,0,63,73,263,251
3,UNITED STATES,WALES,51%,39%,10%,1,1,21 NOV 2022,20 : 00,Group B,...,0,1,7,7,0,0,81,72,242,292
4,ARGENTINA,SAUDI ARABIA,64%,24%,12%,1,2,22 NOV 2022,11 : 00,Group C,...,1,0,4,14,0,0,65,80,163,361


In [37]:
def get_goals_scored_lost(df, team):
    results = matches2k22[((matches2k22['team1'] == team) | (matches2k22['team2'] == team))]
    if len(results) <= 3:
        return None

    matches_results = []
    for i in range(len(results)):
        goals_scored = results.iloc[i]['number of goals team1'] if results.iloc[i]['team1'] == team else results.iloc[i]['number of goals team2']
        goals_lost = results.iloc[i]['number of goals team2'] if results.iloc[i]['team1'] == team else results.iloc[i]['number of goals team1']
        matches_results.append([goals_scored, goals_lost])

    return matches_results

In [38]:
# Perform similar preprocessing for the matches 2k22
df2k22 = pd.DataFrame(columns=['team', 'm1_gs', 'm1_gl', 'm2_gs', 'm2_gl', 'm3_gs', 'm3_gl', 'm4_gs', 'm4_gl', 'm5_gs', 'm5_gl', 'm6_gs', 'm6_gl', 'm7_gs', 'm7_gl'])

max_len = 0
teams = set(matches2k22['team1']).union(set(matches2k22['team2']))
for team in teams:
    results = get_goals_scored_lost(matches2k22, team)
    if results is not None:
        # Add the results to the dataframe
        df2k22.loc[len(df2k22)] = [team] + [results[i][j] for i in range(len(results)) for j in range(len(results[0]))] + [None] * (14 - len(results)*2)

df2k22.head()

Unnamed: 0,team,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,m4_gs,m4_gl,m5_gs,m5_gl,m6_gs,m6_gl,m7_gs,m7_gl
0,POLAND,0,0,2,0,0,2,1,3,,,,,,
1,KOREA REPUBLIC,0,0,2,3,2,1,1,4,,,,,,
2,JAPAN,2,1,0,1,2,1,1,1,,,,,,
3,FRANCE,4,1,2,1,0,1,3,1,,,,,,
4,PORTUGAL,3,2,2,0,1,2,6,1,,,,,,


In [39]:
def get_quaterfinal_opponent(team):
   if team == 'NETHERLANDS':
      return 'ARGENTINA'
   elif team == 'ARGENTINA':
      return 'NETHERLANDS'
   elif team == 'BRAZIL':
      return 'CROATIA'
   elif team == 'CROATIA':
      return 'BRAZIL'
   elif team == 'FRANCE':
      return 'ENGLAND'
   elif team == 'ENGLAND':
      return 'FRANCE'
   elif team == 'PORTUGAL':
      return 'MOROCCO'
   elif team == 'MOROCCO':
      return 'PORTUGAL'
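
The `if`/`elif` chain above encodes a symmetric pairing. One alternative, sketched here with a hypothetical helper `build_opponent_lookup`, derives the lookup in both directions from a list of fixtures:

```python
def build_opponent_lookup(fixtures):
    """Map every team to its opponent, in both directions."""
    lookup = {}
    for team1, team2 in fixtures:
        lookup[team1] = team2
        lookup[team2] = team1
    return lookup


# The quarter-final pairings encoded by the function above.
quarterfinal_fixtures = [
    ("NETHERLANDS", "ARGENTINA"),
    ("BRAZIL", "CROATIA"),
    ("FRANCE", "ENGLAND"),
    ("PORTUGAL", "MOROCCO"),
]
opponent_of = build_opponent_lookup(quarterfinal_fixtures)

print(opponent_of["CROATIA"])
# Like the original function, .get() returns None for teams already eliminated.
print(opponent_of.get("SPAIN"))
```

The same helper could serve the semi-final and final rounds by passing a different fixture list.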

In [40]:
# Take the last 3 matches of each quarter-finalist and of its opponent
df2k22_op = pd.DataFrame(columns=['team', 't1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl'])
for i in range(len(df2k22)):
    t1_row = df2k22.iloc[i]
    # Get the last not null values
    last_not_null = t1_row.last_valid_index()
    idx = t1_row.index.get_loc(last_not_null)
   
    while idx-7 >= 1:
        match_idx = int(idx/2)
        t2_row = df2k22[df2k22['team'] == get_quaterfinal_opponent(t1_row['team'])]
        if len(t2_row) > 0:
            t2_row = t2_row.iloc[0]
            df2k22_op.loc[len(df2k22_op)] = [t1_row['team']] + t1_row[idx-5:idx+1].tolist() + t2_row[idx-5:idx+1].tolist()
        idx -= 2

In [41]:
df2k22_op

Unnamed: 0,team,t1_m1_gs,t1_m1_gl,t1_m2_gs,t1_m2_gl,t1_m3_gs,t1_m3_gl,t2_m1_gs,t2_m1_gl,t2_m2_gs,t2_m2_gl,t2_m3_gs,t2_m3_gl
0,FRANCE,2,1,0,1,3,1,0,0,3,0,3,0
1,PORTUGAL,2,0,1,2,6,1,2,0,2,1,0,0
2,ENGLAND,0,0,3,0,3,0,2,1,0,1,3,1
3,CROATIA,4,1,0,0,1,1,1,0,0,1,4,1
4,ARGENTINA,2,0,2,0,2,1,1,1,2,0,3,1
5,MOROCCO,2,0,2,1,0,0,2,0,1,2,6,1
6,BRAZIL,1,0,0,1,4,1,4,1,0,0,1,1
7,NETHERLANDS,1,1,2,0,3,1,2,0,2,0,2,1


In [42]:
X_quater = df2k22_op[['t1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl']]

In [43]:
predictions1 = regr1.predict(X_quater)
predictions2 = regr2.predict(X_quater)

In [44]:
for i in range(len(df2k22_op['team'])):
    team = df2k22_op['team'].iloc[i]
    prediction = [predictions1[i], predictions2[i]]
    print('Match between {} and {} will end with {}:{}.'.format(team, get_quaterfinal_opponent(team), int(prediction[0]), int(prediction[1])))
    

Match between FRANCE and ENGLAND will end with 0:1.
Match between PORTUGAL and MOROCCO will end with -1:3.
Match between ENGLAND and FRANCE will end with 0:1.
Match between CROATIA and BRAZIL will end with 1:1.
Match between ARGENTINA and NETHERLANDS will end with 0:0.
Match between MOROCCO and PORTUGAL will end with -2:1.
Match between BRAZIL and CROATIA will end with 1:1.
Match between NETHERLANDS and ARGENTINA will end with 0:1.
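
Some of the printed scores are negative because the regressors' outputs are unbounded and `int()` truncates toward zero rather than rounding. A sketch of one possible post-processing step, rounding to the nearest whole goal and clipping at zero, is shown below. `to_goals` is a hypothetical helper and the raw values are illustrative, not model output.

```python
def to_goals(raw_prediction):
    """Turn a raw regression output into a plausible goal count."""
    return max(0, round(raw_prediction))


# Compare plain truncation with rounding plus clipping.
for raw in (-1.4, -0.3, 0.7, 2.6):
    print(raw, int(raw), to_goals(raw))
```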


In [45]:
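def winner(team1, team2, predictions, predictions_reverse):
    # Each argument is [regr1 output, regr2 output] for one team's row.
    # Note: regr1 was fit on target_gl and regr2 on target_gs, so index 0
    # holds predicted goals lost and index 1 predicted goals scored.
    goal_scored_team1 = (predictions[0] + predictions_reverse[1]) / 2
    goal_scored_team2 = (predictions[1] + predictions_reverse[0]) / 2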
    if goal_scored_team1 >= goal_scored_team2:
        return team1
    elif goal_scored_team1 < goal_scored_team2:
        return team2
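
Each match appears twice in the feature table, once from each team's perspective, and `winner` averages the two views. The sketch below shows that averaging with explicitly named fields so the order of the values cannot be mixed up. `combine_views` and the numbers are hypothetical, and ties go to the first team as in the function above.

```python
def combine_views(team1, team2, view1, view2):
    """Average two mirrored predictions and return (winner, goals1, goals2).

    Each view is a dict with the predicted goals 'scored' and 'lost'
    from that team's perspective.
    """
    goals_team1 = (view1["scored"] + view2["lost"]) / 2
    goals_team2 = (view2["scored"] + view1["lost"]) / 2
    # As in the original function, a tie is awarded to team1.
    winner_name = team1 if goals_team1 >= goals_team2 else team2
    return winner_name, goals_team1, goals_team2


# Illustrative predictions, not model output.
result = combine_views(
    "TEAM_A", "TEAM_B",
    {"scored": 1.6, "lost": 0.8},   # match seen from TEAM_A
    {"scored": 1.0, "lost": 1.2},   # match seen from TEAM_B
)
print(result)
```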

In [46]:
quaterfinals = [('NETHERLANDS', 'ARGENTINA'), ('BRAZIL', 'CROATIA'), ('FRANCE', 'ENGLAND'), ('PORTUGAL', 'MOROCCO')]

In [47]:
for match in quaterfinals:
    team1 = match[0]
    team2 = match[1]
    team1_idx = df2k22_op[df2k22_op['team'] == team1].index[0]
    team2_idx = df2k22_op[df2k22_op['team'] == team2].index[0]
    print('Winner of the match between {} and {} is {}.'.format(team1, team2, winner(team1, team2, [predictions1[team1_idx], predictions2[team1_idx]], [predictions1[team2_idx], predictions2[team2_idx]])))

Winner of the match between NETHERLANDS and ARGENTINA is ARGENTINA.
Winner of the match between BRAZIL and CROATIA is CROATIA.
Winner of the match between FRANCE and ENGLAND is ENGLAND.
Winner of the match between PORTUGAL and MOROCCO is MOROCCO.


In [48]:
# Append the predicted results to the dataframe
for team in [team for match in quaterfinals for team in match]:
    team_idx = df2k22_op[df2k22_op['team'] == team].index[0]
    team_idx2 = df2k22[df2k22['team'] == team].index[0]
    opponent_idx = df2k22_op[df2k22_op['team'] == get_quaterfinal_opponent(team)].index[0]
    opponent_idx2 = df2k22[df2k22['team'] == get_quaterfinal_opponent(team)].index[0]
    df2k22.loc[team_idx2, 'm5_gs'] = int((predictions1[team_idx] + predictions2[opponent_idx]) / 2)
    df2k22.loc[team_idx2, 'm5_gl'] = int((predictions2[team_idx] + predictions1[opponent_idx]) / 2)


In [49]:
df2k22

Unnamed: 0,team,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,m4_gs,m4_gl,m5_gs,m5_gl,m6_gs,m6_gl,m7_gs,m7_gl
0,POLAND,0,0,2,0,0,2,1,3,,,,,,
1,KOREA REPUBLIC,0,0,2,3,2,1,1,4,,,,,,
2,JAPAN,2,1,0,1,2,1,1,1,,,,,,
3,FRANCE,4,1,2,1,0,1,3,1,0.0,1.0,,,,
4,PORTUGAL,3,2,2,0,1,2,6,1,0.0,0.0,,,,
5,SPAIN,7,0,1,1,1,2,0,0,,,,,,
6,ENGLAND,6,2,0,0,3,0,3,0,1.0,0.0,,,,
7,CROATIA,0,0,4,1,0,0,1,1,1.0,1.0,,,,
8,AUSTRALIA,1,4,1,0,1,0,1,2,,,,,,
9,SWITZERLAND,1,0,0,1,3,2,1,6,,,,,,


In [50]:
def get_semifinal_opponent(team):
    if team == 'ARGENTINA':
        return 'CROATIA'
    elif team == 'CROATIA':
        return 'ARGENTINA'
    elif team == 'ENGLAND':
        return 'MOROCCO'
    elif team == 'MOROCCO':
        return 'ENGLAND'

In [51]:
# Get the last 3 matches for the semifinals teams
df2k22_sf = pd.DataFrame(columns=['team', 't1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl'])
for i in range(len(df2k22)):
    t1_row = df2k22.iloc[i]
    # Get the last not null values
    last_not_null = t1_row.last_valid_index()
    idx = t1_row.index.get_loc(last_not_null)
   
    match_idx = int(idx/2)
    t2_row = df2k22[df2k22['team'] == get_semifinal_opponent(t1_row['team'])]
    if len(t2_row) > 0:
        t2_row = t2_row.iloc[0]
        df2k22_sf.loc[len(df2k22_sf)] = [t1_row['team']] + t1_row[idx-5:idx+1].tolist() + t2_row[idx-5:idx+1].tolist()


In [52]:
df2k22_sf

Unnamed: 0,team,t1_m1_gs,t1_m1_gl,t1_m2_gs,t1_m2_gl,t1_m3_gs,t1_m3_gl,t2_m1_gs,t2_m1_gl,t2_m2_gs,t2_m2_gl,t2_m3_gs,t2_m3_gl
0,ENGLAND,3,0,3,0,1,0,2,1,0,0,0,0
1,CROATIA,0,0,1,1,1,1,2,0,2,1,0,0
2,ARGENTINA,2,0,2,1,0,0,0,0,1,1,1,1
3,MOROCCO,2,1,0,0,0,0,3,0,3,0,1,0


In [53]:
X_sf = df2k22_sf[['t1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl']]

In [54]:
predictions1 = regr1.predict(X_sf)
predictions2 = regr2.predict(X_sf)

for i in range(len(df2k22_sf['team'])):
    team = df2k22_sf['team'].iloc[i]
    prediction = [predictions1[i], predictions2[i]]
    print('Match between {} and {} will end with {}:{}.'.format(team, get_semifinal_opponent(team), int(prediction[0]), int(prediction[1])))

Match between ENGLAND and MOROCCO will end with 0:1.
Match between CROATIA and ARGENTINA will end with 1:1.
Match between ARGENTINA and CROATIA will end with 1:1.
Match between MOROCCO and ENGLAND will end with 1:0.


In [56]:
semifinals = [('CROATIA', 'ARGENTINA'), ('ENGLAND', 'MOROCCO')]
for match in semifinals:
    team1 = match[0]
    team2 = match[1]
    team1_idx = df2k22_sf[df2k22_sf['team'] == team1].index[0]
    team2_idx = df2k22_sf[df2k22_sf['team'] == team2].index[0]
    print('Winner of the match between {} and {} is {}.'.format(team1, team2, winner(team1, team2, [predictions1[team1_idx], predictions2[team1_idx]], [predictions1[team2_idx], predictions2[team2_idx]])))

Winner of the match between CROATIA and ARGENTINA is CROATIA.
Winner of the match between ENGLAND and MOROCCO is MOROCCO.


In [57]:
# Append the predicted results to the dataframe
for team in [team for match in semifinals for team in match]:
    team_idx = df2k22_sf[df2k22_sf['team'] == team].index[0]
    team_idx2 = df2k22[df2k22['team'] == team].index[0]
    opponent_idx = df2k22_sf[df2k22_sf['team'] == get_semifinal_opponent(team)].index[0]
    opponent_idx2 = df2k22[df2k22['team'] == get_semifinal_opponent(team)].index[0]
    df2k22.loc[team_idx2, 'm6_gs'] = int((predictions1[team_idx] + predictions2[opponent_idx]) / 2)
    df2k22.loc[team_idx2, 'm6_gl'] = int((predictions2[team_idx] + predictions1[opponent_idx]) / 2)

In [58]:
df2k22

Unnamed: 0,team,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,m4_gs,m4_gl,m5_gs,m5_gl,m6_gs,m6_gl,m7_gs,m7_gl
0,POLAND,0,0,2,0,0,2,1,3,,,,,,
1,KOREA REPUBLIC,0,0,2,3,2,1,1,4,,,,,,
2,JAPAN,2,1,0,1,2,1,1,1,,,,,,
3,FRANCE,4,1,2,1,0,1,3,1,0.0,1.0,,,,
4,PORTUGAL,3,2,2,0,1,2,6,1,0.0,0.0,,,,
5,SPAIN,7,0,1,1,1,2,0,0,,,,,,
6,ENGLAND,6,2,0,0,3,0,3,0,1.0,0.0,0.0,1.0,,
7,CROATIA,0,0,4,1,0,0,1,1,1.0,1.0,1.0,1.0,,
8,AUSTRALIA,1,4,1,0,1,0,1,2,,,,,,
9,SWITZERLAND,1,0,0,1,3,2,1,6,,,,,,


In [59]:
def get_finals_opponent(team):
    if team == 'ARGENTINA':
        return 'ENGLAND'
    elif team == 'ENGLAND':
        return 'ARGENTINA'
    elif team == 'CROATIA':
        return 'MOROCCO'
    elif team == 'MOROCCO':
        return 'CROATIA'

In [60]:
# Get the last 3 matches for the finals teams
df2k22_finals = pd.DataFrame(columns=['team', 't1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl'])
for i in range(len(df2k22)):
    t1_row = df2k22.iloc[i]
    # Get the last not null values
    last_not_null = t1_row.last_valid_index()
    idx = t1_row.index.get_loc(last_not_null)
   
    match_idx = int(idx/2)
    t2_row = df2k22[df2k22['team'] == get_finals_opponent(t1_row['team'])]
    if len(t2_row) > 0:
        t2_row = t2_row.iloc[0]
        df2k22_finals.loc[len(df2k22_finals)] = [t1_row['team']] + t1_row[idx-5:idx+1].tolist() + t2_row[idx-5:idx+1].tolist()


In [61]:
X_finals = df2k22_finals[['t1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl']]

In [62]:
predictions1 = regr1.predict(X_finals)
predictions2 = regr2.predict(X_finals)

for i in range(len(df2k22_finals['team'])):
    team = df2k22_finals['team'].iloc[i]
    prediction = [predictions1[i], predictions2[i]]
    print('Match between {} and {} will end with {}:{}.'.format(team, get_finals_opponent(team), int(prediction[0]), int(prediction[1])))

Match between ENGLAND and ARGENTINA will end with 1:1.
Match between CROATIA and MOROCCO will end with 1:0.
Match between ARGENTINA and ENGLAND will end with 1:0.
Match between MOROCCO and CROATIA will end with 1:1.


In [63]:
finals = [('ENGLAND', 'ARGENTINA'), ('CROATIA', 'MOROCCO')]
for match in finals:
    team1 = match[0]
    team2 = match[1]
    team1_idx = df2k22_finals[df2k22_finals['team'] == team1].index[0]
    team2_idx = df2k22_finals[df2k22_finals['team'] == team2].index[0]
    print('Winner of the match between {} and {} is {}.'.format(team1, team2, winner(team1, team2, [predictions1[team1_idx], predictions2[team1_idx]], [predictions1[team2_idx], predictions2[team2_idx]])))

Winner of the match between ENGLAND and ARGENTINA is ARGENTINA.
Winner of the match between CROATIA and MOROCCO is CROATIA.


The predicted semi-final winners, Croatia and Morocco, meet in the final, so the predicted World Cup winner is Croatia.

### Actual results

Results after the semi-finals:

In [65]:
df2k22

Unnamed: 0,team,m1_gs,m1_gl,m2_gs,m2_gl,m3_gs,m3_gl,m4_gs,m4_gl,m5_gs,m5_gl,m6_gs,m6_gl,m7_gs,m7_gl
0,POLAND,0,0,2,0,0,2,1,3,,,,,,
1,KOREA REPUBLIC,0,0,2,3,2,1,1,4,,,,,,
2,JAPAN,2,1,0,1,2,1,1,1,,,,,,
3,FRANCE,4,1,2,1,0,1,3,1,0.0,1.0,,,,
4,PORTUGAL,3,2,2,0,1,2,6,1,0.0,0.0,,,,
5,SPAIN,7,0,1,1,1,2,0,0,,,,,,
6,ENGLAND,6,2,0,0,3,0,3,0,1.0,0.0,0.0,1.0,,
7,CROATIA,0,0,4,1,0,0,1,1,1.0,1.0,1.0,1.0,,
8,AUSTRALIA,1,4,1,0,1,0,1,2,,,,,,
9,SWITZERLAND,1,0,0,1,3,2,1,6,,,,,,


In [70]:
df2k22.loc[df2k22['team'] == 'FRANCE', 'm5_gs'] = 2
df2k22.loc[df2k22['team'] == 'FRANCE', 'm5_gl'] = 1
df2k22.loc[df2k22['team'] == 'FRANCE', 'm6_gs'] = 2
df2k22.loc[df2k22['team'] == 'FRANCE', 'm6_gl'] = 0
df2k22.loc[df2k22['team'] == 'ARGENTINA', 'm5_gs'] = 2
df2k22.loc[df2k22['team'] == 'ARGENTINA', 'm5_gl'] = 2
df2k22.loc[df2k22['team'] == 'ARGENTINA', 'm6_gs'] = 3
df2k22.loc[df2k22['team'] == 'ARGENTINA', 'm6_gl'] = 0
df2k22.loc[df2k22['team'] == 'CROATIA', 'm5_gs'] = 1
df2k22.loc[df2k22['team'] == 'CROATIA', 'm5_gl'] = 1
df2k22.loc[df2k22['team'] == 'CROATIA', 'm6_gs'] = 0
df2k22.loc[df2k22['team'] == 'CROATIA', 'm6_gl'] = 3
df2k22.loc[df2k22['team'] == 'MOROCCO', 'm5_gs'] = 1
df2k22.loc[df2k22['team'] == 'MOROCCO', 'm5_gl'] = 0
df2k22.loc[df2k22['team'] == 'MOROCCO', 'm6_gs'] = 0
df2k22.loc[df2k22['team'] == 'MOROCCO', 'm6_gl'] = 2

In [71]:
def get_finals_opponent(team):
    if team == 'ARGENTINA':
        return 'FRANCE'
    elif team == 'FRANCE':
        return 'ARGENTINA'
    elif team == 'CROATIA':
        return 'MOROCCO'
    elif team == 'MOROCCO':
        return 'CROATIA'

In [72]:
# Get the last 3 matches for the finals teams
df2k22_finals = pd.DataFrame(columns=['team', 't1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl'])
for i in range(len(df2k22)):
    t1_row = df2k22.iloc[i]
    # Get the last not null values
    last_not_null = t1_row.last_valid_index()
    idx = t1_row.index.get_loc(last_not_null)
   
    match_idx = int(idx/2)
    t2_row = df2k22[df2k22['team'] == get_finals_opponent(t1_row['team'])]
    if len(t2_row) > 0:
        t2_row = t2_row.iloc[0]
        df2k22_finals.loc[len(df2k22_finals)] = [t1_row['team']] + t1_row[idx-5:idx+1].tolist() + t2_row[idx-5:idx+1].tolist()


In [73]:
df2k22_finals

Unnamed: 0,team,t1_m1_gs,t1_m1_gl,t1_m2_gs,t1_m2_gl,t1_m3_gs,t1_m3_gl,t2_m1_gs,t2_m1_gl,t2_m2_gs,t2_m2_gl,t2_m3_gs,t2_m3_gl
0,FRANCE,3,1,2,1,2,0,2,1,2,2,3,0
1,CROATIA,1,1,1,1,0,3,0,0,1,0,0,2
2,ARGENTINA,2,1,2,2,3,0,3,1,2,1,2,0
3,MOROCCO,0,0,1,0,0,2,1,1,1,1,0,3


In [74]:
X_finals = df2k22_finals[['t1_m1_gs', 't1_m1_gl', 't1_m2_gs', 't1_m2_gl', 't1_m3_gs', 't1_m3_gl', 't2_m1_gs', 't2_m1_gl', 't2_m2_gs', 't2_m2_gl', 't2_m3_gs', 't2_m3_gl']]

In [75]:
predictions1 = regr1.predict(X_finals)
predictions2 = regr2.predict(X_finals)

for i in range(len(df2k22_finals['team'])):
    team = df2k22_finals['team'].iloc[i]
    prediction = [predictions1[i], predictions2[i]]
    print('Match between {} and {} will end with {}:{}.'.format(team, get_finals_opponent(team), int(prediction[0]), int(prediction[1])))

Match between FRANCE and ARGENTINA will end with -3:0.
Match between CROATIA and MOROCCO will end with 1:0.
Match between ARGENTINA and FRANCE will end with -1:0.
Match between MOROCCO and CROATIA will end with 1:1.


In [76]:
finals = [('FRANCE', 'ARGENTINA'), ('CROATIA', 'MOROCCO')]
for match in finals:
    team1 = match[0]
    team2 = match[1]
    team1_idx = df2k22_finals[df2k22_finals['team'] == team1].index[0]
    team2_idx = df2k22_finals[df2k22_finals['team'] == team2].index[0]
    print('Winner of the match between {} and {} is {}.'.format(team1, team2, winner(team1, team2, [predictions1[team1_idx], predictions2[team1_idx]], [predictions1[team2_idx], predictions2[team2_idx]])))

Winner of the match between FRANCE and ARGENTINA is ARGENTINA.
Winner of the match between CROATIA and MOROCCO is CROATIA.


Hence, based on the actual results up to the semi-finals, the predicted winner of the final between France and Argentina is Argentina.