# World Cup Prediction Model
### By: David Hoffman and Kyle Kolodziej

## 1. Problem Definition

For this project we take the most recent World Cup data (each team playing in the tournament and the statistical ratings for those teams) and use it to predict the success percentage of a team chosen by the user. The user inputs a specific team along with attack and defense ratings for that team; the model then removes that team from the dataset and predicts its success percentage from the supplied ratings. This lets a person see the success percentage of their chosen team given its current data, or in a hypothetical situation where its attack or defense is better than it actually is.

## 2. Data Gathering

In [3]:
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup

# Code on how to get data extracted from all the teams for all years on both pages of FIFA's International rankings
data = pd.DataFrame(columns=['Year', 'Team', 'att', 'mid', 'def', 'ovr'])
version = 0
for fifa in range(5, 24):
    toFind = True
#     print("On fifa ", fifa)
    while toFind:
        version += 1
        for page in range(2):
            # https://www.fifaindex.com/teams/fifa22_527/?page=1&type=1
            webAddress = ""
            toCheck = "" # check matching version in responses title
            if fifa < 10:
                toCheck = '0' + str(fifa)
                webAddress = 'https://www.fifaindex.com/teams/fifa0' + str(fifa) + '_' + str(version) + '/?page=' + str(page+1)+'&type=1'
            elif fifa < 23:
                toCheck = str(fifa)
                
                webAddress = 'https://www.fifaindex.com/teams/fifa' + str(fifa) + '_' + str(version) + '/?page=' + str(page+1)+'&type=1'
            else:
                # Get current stats
                toCheck = str(fifa - 1)
                webAddress = 'https://www.fifaindex.com/teams/?page=' + str(page+1) + '&type=1'
            r = requests.get(webAddress)

            if r.status_code != 200:
                #print("Error getting web address")
                version += 1
                break
                
            soup = BeautifulSoup(r.content, 'html.parser')
            
            if toCheck not in str(soup.title):
#                 print("not in title!")
#                 print("Title: " + str(soup.title))
#                 print("Fifa: " + toCheck + "\n")
                break
            else:
                toFind = False
            s = soup.find('table', class_='table table-striped table-teams')
            content = s.find_all('td')
            
            year = 1999 + fifa # Need to update
            team, attRate, midRate, defRate, ovrRate  = '', None, None, None, None
            #textLines = []
            i = 0
            for line in content:
                
                if line.text != '' and line.text != 'International' and line.text != '\n\n' and line.text != "Men's National":
                    # textLines was keeping track of all lines parsed, useful for seeing which index corresponds with an attribute
                    # textLines.append(line)
                    # Assign value to respective variable
                    if i == 0:
                        team = line.text
                    elif i == 1:
                        attRate = line.text
                    elif i == 2:
                        midRate = line.text
                    elif i == 3:
                        defRate = line.text
                    elif i == 4:
                        ovrRate = line.text
                        df2 = pd.DataFrame({'Year': [year],
                                'Team': [team],
                                'att': [attRate],
                                'mid': [midRate],
                                'def': [defRate],
                                'ovr': [ovrRate]})
                        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
                        data = pd.concat([data, df2], ignore_index=True)

                    # Update i
                    # If i is at 5, reset to 0 as at the start of a new team in the content
                    i += 1
                    if i == 5:
                        i = 0



In [4]:
data.head(1)

Unnamed: 0,Year,Team,att,mid,def,ovr
0,2004,France,94,89,84,88


In [5]:
data.tail(1)

Unnamed: 0,Year,Team,att,mid,def,ovr
820,2022,New Zealand,69,67,68,68


Nice, we now have the stats for each team from 2004 onward.

The FIFA ratings data is missing some of the World Cup teams, so we will need to impute them based on the world rankings
* impute from the rankings at https://www.fifa.com/fifa-world-ranking

In [38]:
def parseFifaRankData(page_source, year):
    # Function to parse a page on a Fifa Ranking page
    # Inserts the Team Name along with their Points and Overall Rank for the respective year into the data frame points
    # Points and Overall Rank will be normalized
    
    soup = BeautifulSoup(page_source, 'lxml')
    s = soup.find_all('tr', class_='fc-ranking-item-full_rankingTableFullRow__1nbp7')
    pointArr = []
    rankArr = []
    nameArr = []
    yearArr = []
    for line in s:
        # Each line contains the container for a team
        teamName = line.find('span', class_='d-none d-lg-block').text
        nameArr.append(teamName)
        teamPoints = line.find('div', class_="d-flex ff-mr-16").text
        pointArr.append(float(teamPoints))
        teamRank = line.find('h6', class_="ff-m-0").text
        rankArr.append(float(teamRank))
        yearArr.append(year)
        
    return pointArr, rankArr, nameArr, yearArr
        

In [417]:
from decimal import Decimal
def normalizePointAndRankArr(pointArr, rankArr):
    # Normalize the points to the range 0.1 to 3.1 (highest points maps to 3.1)
    # Normalize the ranks to the range 0.1 to 10.1, reversed so that rank 1 maps to 10.1
    t_max = 3
    t_min = 0.1
    diff = t_max - t_min
    diff_arr_points = max(pointArr) - min(pointArr)
    normPointArr = []
    for num in pointArr:
        temp = (((num - min(pointArr))*t_max)/diff_arr_points) + t_min
        normPointArr.append(temp)
    
    t_max = 10
    t_min = 0.1
    diff_arr_rank = max(rankArr) - min(rankArr)
    normRankArr = []
    for num in rankArr:
        temp = (((num - min(rankArr))*t_max)/diff_arr_rank) + t_min
        temp = t_max - temp
        temp = temp + t_min + t_min
#         toAdd = t_max - temp + t_min
#         toAdd = Decimal(toAdd)
        normRankArr.append(temp)
    
    return normPointArr, normRankArr

In [418]:
def addToPointsDF(points, pointArrNorm, rankArrNorm, nameArr, yearArr):
    df2 = pd.DataFrame({'Year': yearArr,
                    'Team': nameArr,
                    'Points': pointArrNorm,
                    'Rank': rankArrNorm})
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    points = pd.concat([points, df2], ignore_index=True)
    return points

In [419]:
# Need to use the driver: https://stackoverflow.com/questions/52687372/beautifulsoup-not-returning-complete-html-of-the-page
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import numpy as np

points = pd.DataFrame(columns=['Year', 'Team', 'Points', 'Rank'])

url = "https://www.fifa.com/fifa-world-ranking/men?dateId=id13603"
driver = webdriver.Chrome(executable_path=r"C:/Users/Kyle/Downloads/chromedriver_win32/chromedriver.exe")
driver.get(url)
time.sleep(3) #if you want to wait 3 seconds for the page to load
driver.find_element_by_xpath("//button[@id='onetrust-accept-btn-handler']").click()


for year in range(2004, 2023):
    # Get data of top 100 teams from 2004 to 2022 in February
    date = 'Feb ' + str(year)
    xPath = "//button[@class='ff-dropdown_dropupContentButton__3WmBL' and contains(.,'" + date + "')]"
    driver.find_element_by_xpath("//div[@class='ff-dropdown_dropup__3DoLH null ']").click()
    driver.find_element_by_xpath(xPath).click() # click to this year
    time.sleep(1) # wait for the page to load for 1 sec
    #points = parseFifaRankData(points, driver.page_source, year)
    pointArr1, rankArr1, nameArr1, yearArr1 = parseFifaRankData(driver.page_source, year)
    
    driver.find_element_by_xpath("//div[@aria-label='Go to Page 2']").click() # page 2
    time.sleep(1) # wait for the page to load for 1 sec
    #points = parseFifaRankData(points, driver.page_source, year)
    pointArr2, rankArr2, nameArr2, yearArr2 = parseFifaRankData(driver.page_source, year)
    
    pointArr = pointArr1 + pointArr2
    rankArr = rankArr1 + rankArr2
    nameArr = nameArr1 + nameArr2
    yearArr = yearArr1 + yearArr2
    
    pointArrNorm, rankArrNorm = normalizePointAndRankArr(pointArr, rankArr)
    points = addToPointsDF(points, pointArrNorm, rankArrNorm, nameArr, yearArr)



In [420]:
points.head(5)

Unnamed: 0,Year,Team,Points,Rank
0,2004,Brazil,3.1,10.1
1,2004,France,2.962944,9.99899
2,2004,Spain,2.711675,9.89798
3,2004,Netherlands,2.361421,9.79697
4,2004,Mexico,2.308122,9.69596


In [421]:
points.tail(5)

Unnamed: 0,Year,Team,Points,Rank
1895,2022,Kyrgyz Republic,0.195828,0.50404
1896,2022,Congo,0.160528,0.40303
1897,2022,Vietnam,0.146727,0.30202
1898,2022,Equatorial Guinea,0.105665,0.20101
1899,2022,Palestine,0.1,0.1


Nice, all the points and ranks are now in there with the teams (with points and rank normalized within each year).
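The per-year scaling can be checked on a small example. The sketch below is a simplified re-statement of the arithmetic in `normalizePointAndRankArr`, not the notebook's own function; the helper name `min_max_scale` and its parameters are assumptions made for illustration.

```python
def min_max_scale(values, t_min=0.1, t_span=3.0, reverse=False):
    """Scale values linearly onto [t_min, t_min + t_span].

    With reverse=True the smallest input maps to the top of the range,
    which is what the rank normalization does (rank 1 is the best team).
    """
    lo, hi = min(values), max(values)
    scaled = []
    for v in values:
        frac = (v - lo) / (hi - lo)  # 0.0 for the minimum, 1.0 for the maximum
        if reverse:
            frac = 1.0 - frac
        scaled.append(frac * t_span + t_min)
    return scaled

# Ranks 1..100, as scraped for a single year
ranks = list(range(1, 101))
norm_ranks = min_max_scale(ranks, t_span=10.0, reverse=True)
print(norm_ranks[0], round(norm_ranks[1], 5), norm_ranks[-1])
```

For ranks 1 to 100 this reproduces the values seen in `points.head(5)`: rank 1 maps to 10.1 and rank 2 to roughly 9.99899.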

# Align the Team Name's Syntax Between Points and Data

In [422]:
pointsTeams = points["Team"].unique()
dataTeams = data["Team"].unique()
pointsTeamsNotInData = np.setdiff1d(pointsTeams, dataTeams)
dataTeamsNotInPoints = np.setdiff1d(dataTeams, pointsTeams)

In [423]:
pointsTeamsNotInData

array(['Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Armenia',
       'Azerbaijan', 'Bahrain', 'Belarus', 'Benin',
       'Bosnia and Herzegovina', 'Botswana', 'Burkina Faso', 'Cabo Verde',
       'Cape Verde Islands', 'Central African Republic', 'Congo',
       'Congo DR', 'Cuba', 'Curacao', 'Curaçao', 'Cyprus',
       'Dominican Republic', 'El Salvador', 'Equatorial Guinea',
       'Estonia', 'Ethiopia', 'FYR Macedonia', 'Faroe Islands', 'Gabon',
       'Gambia', 'Georgia', 'Ghana', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'IR Iran',
       'Indonesia', 'Iraq', 'Israel', 'Jamaica', 'Japan', 'Jordan',
       'Kazakhstan', 'Kenya', 'Korea DPR', 'Kuwait', 'Kyrgyz Republic',
       'Latvia', 'Lebanon', 'Liberia', 'Libya', 'Lithuania', 'Luxembourg',
       'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Moldova',
       'Montenegro', 'Morocco', 'Mozambique', 'Namibia', 'Niger',
       'North Macedonia', 'Oman', 'Palestine', 'Panama'

In [424]:
dataTeamsNotInPoints

array(['India'], dtype=object)

In [425]:
pointsTeams.sort()
pointsTeams

array(['Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Belarus', 'Belgium', 'Benin', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Cabo Verde',
       'Cameroon', 'Canada', 'Cape Verde Islands',
       'Central African Republic', 'Chile', 'China PR', 'Colombia',
       'Congo', 'Congo DR', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao',
       'Curaçao', 'Cyprus', 'Czech Republic', "Côte d'Ivoire", 'Denmark',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England',
       'Equatorial Guinea', 'Estonia', 'Ethiopia', 'FYR Macedonia',
       'Faroe Islands', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia',
       'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary',
       'IR Iran', 'Iceland', 'Indonesia', 'Iraq', 'Israel', 'Italy',
       'Jamaica', 'Japa

In [426]:
dataTeams.sort()
dataTeams

array(['Argentina', 'Australia', 'Austria', 'Belgium', 'Bolivia',
       'Brazil', 'Bulgaria', 'Cameroon', 'Canada', 'Chile', 'China PR',
       'Colombia', 'Costa Rica', 'Croatia', 'Czech Republic',
       "Côte d'Ivoire", 'Denmark', 'Ecuador', 'Egypt', 'England',
       'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland',
       'India', 'Italy', 'Korea Republic', 'Mexico', 'Netherlands',
       'New Zealand', 'Nigeria', 'Northern Ireland', 'Norway', 'Paraguay',
       'Peru', 'Poland', 'Portugal', 'Republic of Ireland', 'Romania',
       'Russia', 'Saudi Arabia', 'Scotland', 'Serbia', 'Slovenia',
       'South Africa', 'Spain', 'Sweden', 'Switzerland', 'Tunisia',
       'Turkey', 'USA', 'Ukraine', 'Uruguay', 'Venezuela', 'Wales'],
      dtype=object)

Team names to change in data...
* Austria (National team) --> Austria
* Holland --> Netherlands
* Rep. Of Korea --> Korea Republic
* United States --> USA
* Republic Of Ireland --> Republic of Ireland

Team names to change in points...
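* Korea DPR --> Korea Republic (note: Korea DPR is North Korea and Korea Republic is South Korea, so this rename merges two different teams)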

In [427]:
data.loc[data["Team"] == "Austria (National team)", "Team"] = "Austria"
data.loc[data["Team"] == "Holland", "Team"] = "Netherlands"
data.loc[data["Team"] == "Rep. Of Korea", "Team"] = "Korea Republic"
data.loc[data["Team"] == "United States", "Team"] = "USA"
data.loc[data["Team"] == "Republic Of Ireland", "Team"] = "Republic of Ireland"
points.loc[points["Team"] == "Korea DPR", "Team"] = "Korea Republic"
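The same kind of rename can be written as a single lookup table, which keeps all the spellings in one place as more mismatches turn up. This is a minimal sketch on a toy frame; `name_map` and `toy` are hypothetical names, and only renames already listed above are used.

```python
import pandas as pd

# Old spelling -> spelling used by the other data frames
name_map = {
    "Austria (National team)": "Austria",
    "Holland": "Netherlands",
    "Rep. Of Korea": "Korea Republic",
    "United States": "USA",
    "Republic Of Ireland": "Republic of Ireland",
}

toy = pd.DataFrame({"Team": ["Holland", "France", "United States"]})

# Series.replace with a dict swaps exact matches and leaves other values alone
toy["Team"] = toy["Team"].replace(name_map)
print(toy["Team"].tolist())
```

Teams that are not keys in the dictionary, such as France here, pass through unchanged.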

# Impute Each Team's OVR with One-Hot Encoding (OHE) of the Year

* Merge the data frames points and data on Year and Team
* Drop the NaN's
* Drop the columns Team, att, mid, def
* OHE the years

Now would be able to train/test split for imputing...
* y = ovr
* x = everything else (year OHE, points, rank)


In [428]:
dataToImpute = pd.merge(points, data, how='left', on=['Year','Team'])
dataToImpute = dataToImpute.dropna()
dataToImpute.head(5)

Unnamed: 0,Year,Team,Points,Rank,att,mid,def,ovr
0,2004,Brazil,3.1,10.1,92,88,87,88
1,2004,France,2.962944,9.99899,94,89,84,88
2,2004,Spain,2.711675,9.89798,90,88,86,88
4,2004,Mexico,2.308122,9.69596,70,71,68,70
5,2004,Argentina,2.300508,9.594949,86,85,87,85


In [429]:
# Drop the columns
dataToImpute = dataToImpute.drop(columns=['Team', 'att', 'mid', 'def'])
dataToImpute.head(1)

Unnamed: 0,Year,Points,Rank,ovr
0,2004,3.1,10.1,88


In [430]:
len(dataToImpute)

796

In [431]:
ohe_data = pd.get_dummies(dataToImpute, columns = ['Year'])
ohe_data["Year"] = dataToImpute["Year"]
ohe_data.head(3)

Unnamed: 0,Points,Rank,ovr,Year_2004,Year_2005,Year_2006,Year_2007,Year_2008,Year_2009,Year_2010,...,Year_2014,Year_2015,Year_2016,Year_2017,Year_2018,Year_2019,Year_2020,Year_2021,Year_2022,Year
0,3.1,10.1,88,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2004
1,2.962944,9.99899,88,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2004
2,2.711675,9.89798,88,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2004


In [432]:
ohe_data.tail(3)

Unnamed: 0,Points,Rank,ovr,Year_2004,Year_2005,Year_2006,Year_2007,Year_2008,Year_2009,Year_2010,...,Year_2014,Year_2015,Year_2016,Year_2017,Year_2018,Year_2019,Year_2020,Year_2021,Year_2022,Year
1856,1.082342,4.443434,71,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2022
1859,0.942111,4.140404,70,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2022
1874,0.607998,2.625253,70,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2022


# Predict the Overall

Start with a train test split, stratifying on the years

In [433]:
from sklearn.model_selection import train_test_split
X = ohe_data.drop(columns=['ovr', 'Year'])
y = ohe_data['ovr']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=ohe_data.Year)

In [434]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lm_ovr = LinearRegression()
lm_ovr.fit(X_train,y_train)
lm_predictions_ovr = lm_ovr.predict(X_test)

lin_mse = mean_squared_error(y_test, lm_predictions_ovr)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

3.0799509311347735

Not too bad: the linear regression model has an RMSE of about 3.08, so the predicted overall is typically off by around 3 rating points.

Let's do a regression tree next

In [435]:
from sklearn.tree import DecisionTreeRegressor

tree_reg_ovr = DecisionTreeRegressor()
tree_reg_ovr.fit(X_train, y_train)
tree_predictions_ovr = tree_reg_ovr.predict(X_test)
tree_mse = mean_squared_error(y_test, tree_predictions_ovr)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

4.12954295291864

The decision tree regressor performs worse than the linear regression (RMSE of about 4.13 versus 3.08).

Now let's try a random forest regressor

In [436]:
from sklearn.ensemble import RandomForestRegressor
forest_reg_ovr = RandomForestRegressor()
forest_reg_ovr.fit(X_train, y_train)
forest_predictions_ovr = forest_reg_ovr.predict(X_test)
forest_mse = mean_squared_error(y_test, forest_predictions_ovr)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

3.1550489574582423

The random forest regressor (RMSE of about 3.16) beats the decision tree (about 4.13) but is slightly worse than the linear regression (about 3.08).
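The three RMSE values above come from a single random split, so small gaps between them may be noise. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data (`X_demo`, `y_demo` are made up, not the notebook's `X` and `y`) and is only meant to show the pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: one "points"-like feature and a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.1, 3.1, size=(200, 1))
y_demo = 65 + 7 * X_demo[:, 0] + rng.normal(0, 3, size=200)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for name, model in models.items():
    # scikit-learn returns negated RMSE so that larger is always better
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

print(cv_rmse)
```

In the notebook the same loop could be run with the real `X` and `y` in place of the synthetic arrays.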

# Using Random Forest Regressor Model to Predict the Overalls with OHE

In [437]:
imputedOveralls = points.copy()
imputedOveralls = pd.get_dummies(imputedOveralls, columns = ['Year'])

xImpute = imputedOveralls.drop(columns=['Team'])

overall_predictions = forest_reg_ovr.predict(xImpute)
overall_predictions = np.round(overall_predictions)
overall_predictions

array([88., 88., 86., ..., 68., 68., 68.])

In [438]:
points['ovr'] = overall_predictions.astype(int)
points.head(2)

Unnamed: 0,Year,Team,Points,Rank,ovr
0,2004,Brazil,3.1,10.1,88
1,2004,France,2.962944,9.99899,88


In [439]:
points.tail(100)

Unnamed: 0,Year,Team,Points,Rank,ovr
1800,2022,Belgium,3.100000,10.100000,83
1801,2022,Brazil,3.075644,9.998990,83
1802,2022,France,2.895174,9.897980,82
1803,2022,Argentina,2.802397,9.796970,83
1804,2022,England,2.746857,9.695960,83
1805,2022,Italy,2.675434,9.594949,83
1806,2022,Spain,2.501017,9.493939,83
1807,2022,Portugal,2.285538,9.392929,81
1808,2022,Denmark,2.257889,9.291919,80
1809,2022,Netherlands,2.253967,9.190909,82


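Hmm, it's odd that the predicted overalls do not decrease consistently with rank (for example, Netherlands at rank 10 gets 82 while Denmark at rank 9 gets 80).

Let's try again without one-hot encoding the year.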

# Impute the Overall without the Year

In [440]:
dataToImpute.head(5)

Unnamed: 0,Year,Points,Rank,ovr
0,2004,3.1,10.1,88
1,2004,2.962944,9.99899,88
2,2004,2.711675,9.89798,88
4,2004,2.308122,9.69596,70
5,2004,2.300508,9.594949,85


In [441]:
from sklearn.model_selection import train_test_split
X = dataToImpute.drop(columns=['ovr', 'Year'])
y = dataToImpute['ovr']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=dataToImpute.Year)

In [442]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lm_ovr = LinearRegression()
lm_ovr.fit(X_train,y_train)
lm_predictions_ovr = lm_ovr.predict(X_test)

lin_mse = mean_squared_error(y_test, lm_predictions_ovr)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

3.166819703119896

In [443]:
from sklearn.tree import DecisionTreeRegressor

tree_reg_ovr = DecisionTreeRegressor()
tree_reg_ovr.fit(X_train, y_train)
tree_predictions_ovr = tree_reg_ovr.predict(X_test)
tree_mse = mean_squared_error(y_test, tree_predictions_ovr)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

3.8627026728413245

In [444]:
from sklearn.ensemble import RandomForestRegressor
forest_reg_ovr = RandomForestRegressor()
forest_reg_ovr.fit(X_train, y_train)
forest_predictions_ovr = forest_reg_ovr.predict(X_test)
forest_mse = mean_squared_error(y_test, forest_predictions_ovr)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

3.2972685903197494

In [445]:
toImpute = points.copy()
toImpute = toImpute.drop(columns=['Team', 'Year', 'ovr'])

overall_predictions = lm_ovr.predict(toImpute)
overall_predictions = np.round(overall_predictions)
overall_predictions

array([84., 83., 82., ..., 67., 67., 67.])

In [446]:
points['ovr'] = overall_predictions.astype(int)
points.head(2)

Unnamed: 0,Year,Team,Points,Rank,ovr
0,2004,Brazil,3.1,10.1,84
1,2004,France,2.962944,9.99899,83


In [447]:
points.tail(100)

Unnamed: 0,Year,Team,Points,Rank,ovr
1800,2022,Belgium,3.100000,10.100000,84
1801,2022,Brazil,3.075644,9.998990,84
1802,2022,France,2.895174,9.897980,83
1803,2022,Argentina,2.802397,9.796970,82
1804,2022,England,2.746857,9.695960,82
1805,2022,Italy,2.675434,9.594949,82
1806,2022,Spain,2.501017,9.493939,81
1807,2022,Portugal,2.285538,9.392929,80
1808,2022,Denmark,2.257889,9.291919,80
1809,2022,Netherlands,2.253967,9.190909,80
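To put a number on how "in order" each set of imputed overalls is, one can count adjacent pairs where a lower-ranked team receives a higher overall. The helper `count_order_violations` below is a hypothetical name; the two lists are the 2022 top-10 overalls copied from the random forest and linear regression outputs shown in this notebook.

```python
def count_order_violations(overalls):
    """Count adjacent pairs where the lower-ranked team gets the higher overall."""
    return sum(1 for better, worse in zip(overalls, overalls[1:]) if worse > better)

# 2022 top-10 overalls, listed from rank 1 to rank 10
forest_top10 = [83, 83, 82, 83, 83, 83, 83, 81, 80, 82]
linear_top10 = [84, 84, 83, 82, 82, 82, 81, 80, 80, 80]

print(count_order_violations(forest_top10))  # 2
print(count_order_violations(linear_top10))  # 0
```

The linear model has no violations on these rows because its prediction is a fixed linear function of Points and Rank, while a random forest is not constrained to be monotonic in its inputs.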


# Get Previous International Soccer Game Data

Get previous game data

In [448]:
game = pd.read_csv("results.csv")
game["Year"] = game.date.str[:4].astype(str) # slice the first 4 characters (the year) from the date column

game.head(5)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,Year
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,1873
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,1874
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,1876


In [449]:
game = game.drop(columns=['date', 'city', 'country', 'neutral'])
game = game.where(game['Year'] >= '2004')
game = game.dropna()
game.head(5)

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,Year
26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004
26361,Bermuda,Barbados,0.0,4.0,Friendly,2004
26362,Kuwait,Yemen,4.0,0.0,Gulf Cup,2004
26363,Oman,Bahrain,0.0,1.0,Gulf Cup,2004
26364,United Arab Emirates,Qatar,0.0,0.0,Gulf Cup,2004


In [450]:
len(game)

17061

In [451]:
game["tournament"].value_counts()

Friendly                                 5957
FIFA World Cup qualification             4023
UEFA Euro qualification                  1084
African Cup of Nations qualification      744
African Cup of Nations                    357
UEFA Nations League                       308
CECAFA Cup                                278
AFC Asian Cup qualification               269
African Nations Championship              264
CFU Caribbean Cup qualification           264
FIFA World Cup                            256
Gold Cup                                  238
COSAFA Cup                                233
AFF Championship                          207
UEFA Euro                                 195
Copa América                              190
Island Games                              185
AFC Asian Cup                             179
Gulf Cup                                  136
EAFF Championship                         108
CONCACAF Nations League                   106
SAFF Cup                          

Friendlies make up a little over a third of the data (5,957 of 17,061 games), so we will try building models with and without friendlies included and see which performs best.
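The share of friendlies and the friendly-free subset can both be derived directly from the `tournament` column. The sketch below runs on a toy frame (`toy_games` is made up) and also recomputes the share from the counts printed above.

```python
import pandas as pd

toy_games = pd.DataFrame({
    "tournament": ["Friendly", "FIFA World Cup", "Friendly", "Gulf Cup"],
})

# normalize=True returns fractions instead of raw counts
share = toy_games["tournament"].value_counts(normalize=True)

# Boolean mask keeps every game that is not a friendly
competitive = toy_games[toy_games["tournament"] != "Friendly"]

# Share of friendlies in the real data, using the counts shown above
friendly_share = 5957 / 17061
print(share["Friendly"], len(competitive), round(friendly_share, 3))
```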

In [452]:
game.head(1)

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,Year
26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004


In [453]:
game['score'] = game['home_score'] - game['away_score']
game['outcome'] = None
game.loc[game["score"] == 0, "outcome"] = 0
game.loc[game["score"] > 0, "outcome"] = 1
game.loc[game["score"] < 0, "outcome"] = -1
game.head(5)

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,Year,score,outcome
26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004,-1.0,-1
26361,Bermuda,Barbados,0.0,4.0,Friendly,2004,-4.0,-1
26362,Kuwait,Yemen,4.0,0.0,Gulf Cup,2004,4.0,1
26363,Oman,Bahrain,0.0,1.0,Gulf Cup,2004,-1.0,-1
26364,United Arab Emirates,Qatar,0.0,0.0,Gulf Cup,2004,0.0,0


The goal is to predict the match result from the home team's perspective. The outcome column is encoded as:
* 1 if home team wins
* 0 if draw
* -1 if away team wins
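This encoding is the sign of the goal difference, so it can also be computed in one step. A minimal sketch using the five scores from the `game.head(5)` output above:

```python
import numpy as np

# Home and away scores from the first five rows shown above
home_score = np.array([0, 0, 4, 0, 0])
away_score = np.array([1, 4, 0, 1, 0])

# np.sign gives -1, 0, or 1 depending on the sign of the goal difference
outcome = np.sign(home_score - away_score).astype(int)
print(outcome.tolist())  # [-1, -1, 1, -1, 0]
```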




# Match Syntax of the Teams in Game Data with the Points and Data Teams

Follow the same process as above.

In [454]:
home = game["home_team"].unique()
away = game["away_team"].unique()
allGameTeams = list(set(home).union(set(away)))
allGameTeams.sort()
allGameTeams

['Abkhazia',
 'Afghanistan',
 'Albania',
 'Alderney',
 'Algeria',
 'American Samoa',
 'Andalusia',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Arameans Suryoye',
 'Argentina',
 'Armenia',
 'Artsakh',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barawa',
 'Barbados',
 'Basque Country',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bonaire',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brittany',
 'Brunei',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Canary Islands',
 'Cape Verde',
 'Cascadia',
 'Catalonia',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chagos Islands',
 'Chameria',
 'Chile',
 'China PR',
 'Colombia',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Corsica',
 'Costa Rica',
 'County of Nice',
 'Crimea',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'DR Cong

Look at the teams in game that aren't in data.
Of these teams, look at which ones aren't in points either.

Change whatever names possible to match them up.
Look into scraping another page or two?

In [455]:
pointsTeams = points["Team"].unique()
dataTeams = data["Team"].unique()
allGameTeamsNotInData = np.setdiff1d(allGameTeams, dataTeams)
gameTeamsNotInDataAndNotInPoints = np.setdiff1d(allGameTeamsNotInData, pointsTeams)

In [456]:
len(allGameTeamsNotInData)

241

In [457]:
dataTeams.sort()
dataTeams

array(['Argentina', 'Australia', 'Austria', 'Belgium', 'Bolivia',
       'Brazil', 'Bulgaria', 'Cameroon', 'Canada', 'Chile', 'China PR',
       'Colombia', 'Costa Rica', 'Croatia', 'Czech Republic',
       "Côte d'Ivoire", 'Denmark', 'Ecuador', 'Egypt', 'England',
       'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland',
       'India', 'Italy', 'Korea Republic', 'Mexico', 'Netherlands',
       'New Zealand', 'Nigeria', 'Northern Ireland', 'Norway', 'Paraguay',
       'Peru', 'Poland', 'Portugal', 'Republic of Ireland', 'Romania',
       'Russia', 'Saudi Arabia', 'Scotland', 'Serbia', 'Slovenia',
       'South Africa', 'Spain', 'Sweden', 'Switzerland', 'Tunisia',
       'Turkey', 'USA', 'Ukraine', 'Uruguay', 'Venezuela', 'Wales'],
      dtype=object)

In [458]:
allGameTeamsNotInData

array(['Abkhazia', 'Afghanistan', 'Albania', 'Alderney', 'Algeria',
       'American Samoa', 'Andalusia', 'Andorra', 'Angola', 'Anguilla',
       'Antigua and Barbuda', 'Arameans Suryoye', 'Armenia', 'Artsakh',
       'Aruba', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barawa', 'Barbados', 'Basque Country', 'Belarus', 'Belize',
       'Benin', 'Bermuda', 'Bhutan', 'Bonaire', 'Bosnia and Herzegovina',
       'Botswana', 'British Virgin Islands', 'Brittany', 'Brunei',
       'Brunei Darussalam', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Canary Islands', 'Cape Verde', 'Cascadia', 'Catalonia',
       'Cayman Islands', 'Central African Republic', 'Chad',
       'Chagos Islands', 'Chameria', 'Comoros', 'Congo', 'Cook Islands',
       'Corsica', 'County of Nice', 'Crimea', 'Cuba', 'Curaçao', 'Cyprus',
       'DR Congo', 'Darfur', 'Djibouti', 'Dominica', 'Dominican Republic',
       'El Salvador', 'Ellan Vannin', 'Equatorial Guinea', 'Eritrea',
       'Estonia', 'Eswatini',

In [459]:
pointsTeams.sort()
pointsTeams

array(['Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Belarus', 'Belgium', 'Benin', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Cabo Verde',
       'Cameroon', 'Canada', 'Cape Verde Islands',
       'Central African Republic', 'Chile', 'China PR', 'Colombia',
       'Congo', 'Congo DR', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao',
       'Curaçao', 'Cyprus', 'Czech Republic', "Côte d'Ivoire", 'Denmark',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England',
       'Equatorial Guinea', 'Estonia', 'Ethiopia', 'FYR Macedonia',
       'Faroe Islands', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia',
       'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary',
       'IR Iran', 'Iceland', 'Indonesia', 'Iraq', 'Israel', 'Italy',
       'Jamaica', 'Japa

In [460]:
gameTeamsNotInDataAndNotInPoints

array(['Abkhazia', 'Afghanistan', 'Alderney', 'American Samoa',
       'Andalusia', 'Andorra', 'Anguilla', 'Arameans Suryoye', 'Artsakh',
       'Aruba', 'Bahamas', 'Bangladesh', 'Barawa', 'Barbados',
       'Basque Country', 'Belize', 'Bermuda', 'Bhutan', 'Bonaire',
       'British Virgin Islands', 'Brittany', 'Brunei',
       'Brunei Darussalam', 'Burundi', 'Cambodia', 'Canary Islands',
       'Cape Verde', 'Cascadia', 'Catalonia', 'Cayman Islands', 'Chad',
       'Chagos Islands', 'Chameria', 'Comoros', 'Cook Islands', 'Corsica',
       'County of Nice', 'Crimea', 'DR Congo', 'Darfur', 'Djibouti',
       'Dominica', 'Ellan Vannin', 'Eritrea', 'Eswatini',
       'Falkland Islands', 'Felvidék', 'Fiji', 'French Guiana', 'Frøya',
       'Galicia', 'Gibraltar', 'Gotland', 'Gozo', 'Greenland',
       'Guadeloupe', 'Guam', 'Guernsey', 'Găgăuzia', 'Hitra', 'Hong Kong',
       'Iran', 'Iraqi Kurdistan', 'Isle of Man', 'Isle of Wight',
       'Ivory Coast', 'Jersey', 'Kabylia', 'Kernow', 'Kir

Teams that need to be changed in game data
* Ivory Coast --> Côte d'Ivoire
* South Korea --> Korea Republic
* United States --> USA

Points Teams name changes
* IR Iran --> Iran

In [461]:
game.head(1)

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,Year,score,outcome
26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004,-1.0,-1


In [462]:
game.loc[game["home_team"] == "Ivory Coast", "home_team"] = "Côte d'Ivoire"
game.loc[game["away_team"] == "Ivory Coast", "away_team"] = "Côte d'Ivoire"
game.loc[game["home_team"] == "South Korea", "home_team"] = "Korea Republic"
game.loc[game["away_team"] == "South Korea", "away_team"] = "Korea Republic"
game.loc[game["home_team"] == "United States", "home_team"] = "USA"
game.loc[game["away_team"] == "United States", "away_team"] = "USA"
points.loc[points["Team"] == "IR Iran", "Team"] = "Iran"

# Add in Home Team and Away Team Overall

In [463]:
game.head(1)

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,Year,score,outcome
26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004,-1.0,-1


loop through game
for each team:
    check to see if there is a match in data for the team
    if not:
        check to see if there is a match in points
    if not in points and data:
        remove the row from game
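The lookup described above can also be sketched with merges instead of a row-by-row loop. This is an alternative under stated assumptions, not the notebook's implementation: the frames below are toy stand-ins with made-up values, `attach_ratings` is a hypothetical helper, and `Year` is assumed to have the same dtype in every frame (the notebook's `game["Year"]` is a string and would need casting first).

```python
import pandas as pd

# Toy stand-ins for the notebook's frames; all values here are illustrative
data = pd.DataFrame({"Year": [2004], "Team": ["France"],
                     "att": [94], "mid": [89], "def": [84], "ovr": [88]})
points = pd.DataFrame({"Year": [2004, 2004], "Team": ["France", "Bahrain"],
                       "ovr": [83, 70]})
game = pd.DataFrame({"Year": [2004, 2004],
                     "home_team": ["Bahrain", "France"],
                     "away_team": ["France", "Atlantis"]})  # "Atlantis" is unrated

# Fallback ratings: reuse the imputed overall for att/mid/def as well
fallback = points[["Year", "Team", "ovr"]].copy()
for col in ["att", "mid", "def"]:
    fallback[col] = fallback["ovr"]

# Stack data first so drop_duplicates keeps it whenever both sources match
ratings = (pd.concat([data, fallback], ignore_index=True)
             .drop_duplicates(subset=["Year", "Team"], keep="first"))

def attach_ratings(frame, side):
    """Inner-join ratings for one side ('home' or 'away'); unrated rows drop out."""
    renamed = ratings.rename(columns={
        "Team": f"{side}_team", "att": f"{side}_att", "mid": f"{side}_mid",
        "def": f"{side}_def", "ovr": f"{side}_ovr"})
    return frame.merge(renamed, on=["Year", f"{side}_team"], how="inner")

rated = attach_ratings(attach_ratings(game, "home"), "away")
print(rated)
```

In this toy run the Bahrain vs France game survives, with Bahrain's ratings taken from `points` and France's from `data`, while the game against the unrated team is removed by the inner join.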

In [464]:
data.head(1)

Unnamed: 0,Year,Team,att,mid,def,ovr
0,2004,France,94,89,84,88


In [465]:
game = game.reset_index() # Need to reset after getting rid of games before 2004
game['home_att'] = None
game['home_mid'] = None
game['home_def'] = None
game['home_ovr'] = None
game['away_att'] = None
game['away_mid'] = None
game['away_def'] = None
game['away_ovr'] = None
indexesToRemove = [] # keep track of rows that need to be removed
for idx in range(len(game)):
    year = int(game.at[idx, 'Year'])
    homeTeam = game.at[idx, 'home_team']
    awayTeam = game.at[idx, 'away_team']
    dataHomeMatchIdx = np.where((data['Team'] == homeTeam) & (data['Year'] == year))
    if len(dataHomeMatchIdx[0]) == 0:
        # home team is not in data
        # check if home team is in points
        pointsHomeMatchIdx = np.where((points['Team'] == homeTeam) & (points['Year'] == year))
        if len(pointsHomeMatchIdx[0]) != 0:
            # home team is in points
            # check if away team exists in data first
            dataAwayMatchIdx = np.where((data['Team'] == awayTeam) & (data['Year'] == year))
            if len(dataAwayMatchIdx[0]) == 0:
                # home team in points
                # away team is not in data
                # check if away team in points
                pointsAwayMatchIdx = np.where((points['Team'] == awayTeam) & (points['Year'] == year))
                if len(pointsAwayMatchIdx[0]) == 0:
                    # away team not in either so remove
                    #print('remove 1')
                    # REMOVE HERE
                    game = game.drop(index=idx)
                else:
                    # away team is in points
                    # home team in points
                    attHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']
                    midHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']
                    defHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']
                    ovrHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']

                    attAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                    midAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                    defAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                    ovrAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                    
                    # UPDATE DATA FRAME
                    game.at[idx, 'home_att'] = attHome
                    game.at[idx, 'home_mid'] = midHome
                    game.at[idx, 'home_def'] = defHome
                    game.at[idx, 'home_ovr'] = ovrHome
                    game.at[idx, 'away_att'] = attAway
                    game.at[idx, 'away_mid'] = midAway
                    game.at[idx, 'away_def'] = defAway
                    game.at[idx, 'away_ovr'] = ovrAway
            else:
                # away team is in data
                # home team is in points
                
                attHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']
                midHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']
                defHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']
                ovrHome = points.at[pointsHomeMatchIdx[0][0], 'ovr']

                attAway = data.at[dataAwayMatchIdx[0][0], 'att']
                midAway = data.at[dataAwayMatchIdx[0][0], 'mid']
                defAway = data.at[dataAwayMatchIdx[0][0], 'def']
                ovrAway = data.at[dataAwayMatchIdx[0][0], 'ovr']

                # update the columns in the data frame
                game.at[idx, 'home_att'] = attHome
                game.at[idx, 'home_mid'] = midHome
                game.at[idx, 'home_def'] = defHome
                game.at[idx, 'home_ovr'] = ovrHome
                game.at[idx, 'away_att'] = attAway
                game.at[idx, 'away_mid'] = midAway
                game.at[idx, 'away_def'] = defAway
                game.at[idx, 'away_ovr'] = ovrAway
        else:
            # no home team match
            #print('remove 2')
            #print(homeTeam + ", " + awayTeam)
            game = game.drop(index=idx)
         
    else:
        # home team is in data
        # now see if away team is in there too
        dataAwayMatchIdx = np.where((data['Team'] == awayTeam) & (data['Year'] == year))
        if len(dataAwayMatchIdx[0]) == 0:
            # away team not in data
            # check if away team is in points
            pointsAwayMatchIdx = np.where((points['Team'] == awayTeam) & (points['Year'] == year))
            if len(pointsAwayMatchIdx[0]) == 0:
                # away team not in either so remove
                #print('remove 3')
                game = game.drop(index=idx)
            else:
                # home team in data
                # away team in points
                attHome = data.at[dataHomeMatchIdx[0][0], 'att']
                midHome = data.at[dataHomeMatchIdx[0][0], 'mid']
                defHome = data.at[dataHomeMatchIdx[0][0], 'def']
                ovrHome = data.at[dataHomeMatchIdx[0][0], 'ovr']
            
                attAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                midAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                defAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                ovrAway = points.at[pointsAwayMatchIdx[0][0], 'ovr']
                
                # Update data frame
                game.at[idx, 'home_att'] = attHome
                game.at[idx, 'home_mid'] = midHome
                game.at[idx, 'home_def'] = defHome
                game.at[idx, 'home_ovr'] = ovrHome
                game.at[idx, 'away_att'] = attAway
                game.at[idx, 'away_mid'] = midAway
                game.at[idx, 'away_def'] = defAway
                game.at[idx, 'away_ovr'] = ovrAway
                
                
            
        else:
            # both home and away in data
            attHome = data.at[dataHomeMatchIdx[0][0], 'att']
            midHome = data.at[dataHomeMatchIdx[0][0], 'mid']
            defHome = data.at[dataHomeMatchIdx[0][0], 'def']
            ovrHome = data.at[dataHomeMatchIdx[0][0], 'ovr']
            
            attAway = data.at[dataAwayMatchIdx[0][0], 'att']
            midAway = data.at[dataAwayMatchIdx[0][0], 'mid']
            defAway = data.at[dataAwayMatchIdx[0][0], 'def']
            ovrAway = data.at[dataAwayMatchIdx[0][0], 'ovr']
            
            # update the columns in the data frame
            game.at[idx, 'home_att'] = attHome
            game.at[idx, 'home_mid'] = midHome
            game.at[idx, 'home_def'] = defHome
            game.at[idx, 'home_ovr'] = ovrHome
            game.at[idx, 'away_att'] = attAway
            game.at[idx, 'away_mid'] = midAway
            game.at[idx, 'away_def'] = defAway
            game.at[idx, 'away_ovr'] = ovrAway
    

In [466]:
game.head(5)

Unnamed: 0,index,home_team,away_team,home_score,away_score,tournament,Year,score,outcome,home_att,home_mid,home_def,home_ovr,away_att,away_mid,away_def,away_ovr
0,26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004,-1.0,-1,71,71,71,71,76,76,76,76
3,26363,Oman,Bahrain,0.0,1.0,Gulf Cup,2004,-1.0,-1,71,71,71,71,71,71,71,71
4,26364,United Arab Emirates,Qatar,0.0,0.0,Gulf Cup,2004,0.0,0,70,70,70,70,72,72,72,72
5,26365,Kuwait,Saudi Arabia,1.0,1.0,Gulf Cup,2004,0.0,0,72,72,72,72,76,76,76,76
7,26367,Oman,Saudi Arabia,1.0,2.0,Gulf Cup,2004,-1.0,-1,71,71,71,71,76,76,76,76


In [467]:
game["home_att"] = pd.to_numeric(game["home_att"])
game["home_mid"] = pd.to_numeric(game["home_mid"])
game["home_def"] = pd.to_numeric(game["home_def"])
game["home_ovr"] = pd.to_numeric(game["home_ovr"])
game["away_att"] = pd.to_numeric(game["away_att"])
game["away_mid"] = pd.to_numeric(game["away_mid"])
game["away_def"] = pd.to_numeric(game["away_def"])
game["away_ovr"] = pd.to_numeric(game["away_ovr"])

In [468]:
# Now let's combine the home and away attributes to be home - away
game['att'] = game['home_att'] - game['away_att']
game['mid'] = game['home_mid'] - game['away_mid']
game['def'] = game['home_def'] - game['away_def']
game['ovr'] = game['home_ovr'] - game['away_ovr']
game.head(2)

Unnamed: 0,index,home_team,away_team,home_score,away_score,tournament,Year,score,outcome,home_att,...,home_def,home_ovr,away_att,away_mid,away_def,away_ovr,att,mid,def,ovr
0,26360,Bahrain,Saudi Arabia,0.0,1.0,Gulf Cup,2004,-1.0,-1,71,...,71,71,76,76,76,76,-5,-5,-5,-5
3,26363,Oman,Bahrain,0.0,1.0,Gulf Cup,2004,-1.0,-1,71,...,71,71,71,71,71,71,0,0,0,0


In [469]:
gameDrop = game.copy()
gameDrop = gameDrop.drop(columns=['index','home_team','away_team','home_score', 'away_score','score','home_att',
                                 'home_mid','home_def','home_ovr','away_att','away_mid','away_def','away_ovr' ])
gameDrop.head(2)

Unnamed: 0,tournament,Year,outcome,att,mid,def,ovr
0,Gulf Cup,2004,-1,-5,-5,-5,-5
3,Gulf Cup,2004,-1,0,0,0,0


# Modeling with All Game Data (Includes Friendlies)

In [470]:
gameWithFriendly = gameDrop.copy()
gameWithFriendly = gameWithFriendly.drop(columns=['tournament'])
gameWithFriendly.head(2)

Unnamed: 0,Year,outcome,att,mid,def,ovr
0,2004,-1,-5,-5,-5,-5
3,2004,-1,0,0,0,0


## Train/Test Split

In [471]:
from sklearn.model_selection import train_test_split
X = gameWithFriendly.drop(columns=['outcome', 'Year'])
y = gameWithFriendly['outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=gameWithFriendly.Year)

## Linear Regression Model

In [493]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)
# Convert the continuous predictions to match outcomes:
# -1 (away win), 0 (draw), 1 (home win)
lm_predictions[lm_predictions <= -.0000000000001] = -1
lm_predictions[lm_predictions >= .0000000000001] = 1
lm_predictions[(lm_predictions < .0000000000001) & (lm_predictions > -.0000000000001)] = 0
lm_predictions

array([ 1.,  1.,  1., ...,  1.,  1., -1.])

In [494]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, lm_predictions)

0.5310077519379846
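
The thresholding step maps each continuous regression output to one of the three outcome labels. Because the draw band is only about 1e-13 wide on either side of zero, a real-valued prediction will almost never land inside it, so this scheme effectively never predicts a draw. The sketch below shows the mapping on made-up numbers. `to_outcome` and its `draw_band` parameter are hypothetical names, and the wider band in the second call is only there to show the effect, not something the notebook does.

```python
from typing import List


def to_outcome(prediction: float, draw_band: float = 1e-13) -> int:
    """Map a continuous prediction to -1 (away win), 0 (draw), or 1 (home win)."""
    if prediction >= draw_band:
        return 1
    if prediction <= -draw_band:
        return -1
    return 0


# Made-up regression outputs for illustration
raw: List[float] = [0.42, -0.07, 0.0, 0.03]

print([to_outcome(p) for p in raw])                 # [1, -1, 0, 1]
print([to_outcome(p, draw_band=0.1) for p in raw])  # [1, 0, 0, 0]
```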

## Decision Tree Regressor

In [495]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
tree_predictions = tree_reg.predict(X_test)
tree_predictions[tree_predictions <= -.0000000000001] = -1
tree_predictions[tree_predictions >= .0000000000001] = 1
tree_predictions[(tree_predictions < .0000000000001) & (tree_predictions > -.0000000000001)] = 0
accuracy_score(y_test, tree_predictions)

0.44573643410852715

## Random Forest

In [496]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)
forest_predictions = forest_reg.predict(X_test)
forest_predictions[forest_predictions <= -.0000000000001] = -1
forest_predictions[forest_predictions >= .0000000000001] = 1
forest_predictions[(forest_predictions < .0000000000001) & (forest_predictions > -.0000000000001)] = 0
accuracy_score(y_test, forest_predictions)

0.4935400516795866

# Modeling without Friendlies

In [None]:
# do work here

# Simulating the 2022 World Cup

This simulation uses the linear regression model trained on all games, including friendlies.

In [497]:
groupA = ['Qatar', 'Ecuador', 'Netherlands', 'Senegal']
groupB = ['England', 'Iran', 'USA', 'Wales'] # not sure w Wales
groupC = ['Argentina', 'Poland', 'Mexico', 'Saudi Arabia']
groupD = ['France', 'Denmark', 'Tunisia', 'Peru'] # Peru not sure
groupE = ['Spain', 'Germany', 'Japan', 'Costa Rica'] # Costa rica not sure
groupF = ['Belgium', 'Canada', 'Morocco', 'Croatia']
groupG = ['Brazil', 'Serbia', 'Switzerland', 'Cameroon']
groupH = ['Portugal', 'Ghana', 'Uruguay', 'Korea Republic']

worldCupTeams = groupA + groupB + groupC + groupD + groupE + groupF + groupG + groupH

# Left side
# A1 v B2 : C1 v D2
# E1 v F2 : G1 v H2

# Right side
# D1 v C2 : B1 v A2
# F1 v E2 : G2 v H1

worldCup = pd.DataFrame(columns = ['Team', 'Points', 'Group', 'att', 'mid', 'def', 'ovr'])
idx = 0
teamsNotFound = []
# Fill in the ratings for every team, one group at a time. Ratings come from
# `data` when available, then from `points` (the overall rating is used for
# every attribute), and default to 80 if the team is in neither.
groupsByLetter = [('A', groupA), ('B', groupB), ('C', groupC), ('D', groupD),
                  ('E', groupE), ('F', groupF), ('G', groupG), ('H', groupH)]
for letter, group in groupsByLetter:
    for team in group:
        worldCup.at[idx, 'Team'] = team
        worldCup.at[idx, 'Points'] = 0
        worldCup.at[idx, 'Group'] = letter
        dataIdx = np.where((data['Team'] == team) & (data['Year'] == 2022))
        if len(dataIdx[0]) == 0:
            # team not in data, check points
            pointsIdx = np.where((points['Team'] == team) & (points['Year'] == 2022))
            if len(pointsIdx[0]) == 0:
                # team not in either source, use the default rating
                teamsNotFound.append(team)
                att = mid = defRate = ovr = 80
            else:
                # found in points
                att = mid = defRate = ovr = points.at[pointsIdx[0][0], 'ovr']
        else:
            # found in data
            att = data.at[dataIdx[0][0], 'att']
            mid = data.at[dataIdx[0][0], 'mid']
            defRate = data.at[dataIdx[0][0], 'def']
            ovr = data.at[dataIdx[0][0], 'ovr']

        worldCup.at[idx, 'att'] = att
        worldCup.at[idx, 'mid'] = mid
        worldCup.at[idx, 'def'] = defRate
        worldCup.at[idx, 'ovr'] = ovr
        idx += 1

In [498]:
teamsNotFound

[]

In [499]:
worldCup.head(5)

Unnamed: 0,Team,Points,Group,att,mid,def,ovr
0,Qatar,0,A,74,74,74,74
1,Ecuador,0,A,75,75,75,75
2,Netherlands,0,A,81,82,83,82
3,Senegal,0,A,78,78,78,78
4,England,0,B,86,83,83,83


In [500]:
worldCup["att"] = pd.to_numeric(worldCup["att"])
worldCup["mid"] = pd.to_numeric(worldCup["mid"])
worldCup["def"] = pd.to_numeric(worldCup["def"])
worldCup["ovr"] = pd.to_numeric(worldCup["ovr"])

In [518]:
groupsToRun = [groupA, groupB, groupC, groupD, groupE, groupF, groupG, groupH]
for group in groupsToRun:
    for i in range(len(group)-1):
        team1 = group[i]
        team1WC = worldCup[worldCup['Team'] == team1]
        predictor = pd.DataFrame(columns = ['att', 'mid', 'def', 'ovr'])
        for j in range(i + 1, len(group)):
            team2 = group[j]
            team2WC = worldCup[worldCup['Team'] == team2]

            # rating differences, team1 - team2 (Series.get_values() was removed in pandas 1.0)
            predictor.at[0,'att'] = team1WC['att'].iloc[0] - team2WC['att'].iloc[0]
            predictor.at[0,'mid'] = team1WC['mid'].iloc[0] - team2WC['mid'].iloc[0]
            predictor.at[0,'def'] = team1WC['def'].iloc[0] - team2WC['def'].iloc[0]
            predictor.at[0,'ovr'] = team1WC['ovr'].iloc[0] - team2WC['ovr'].iloc[0]
            predictedVal = lm.predict(predictor)
            predictedVal[predictedVal <= -.0000000000001] = -1
            predictedVal[predictedVal >= .0000000000001] = 1
            predictedVal[(predictedVal < .0000000000001) & (predictedVal > -.0000000000001)] = 0
            if predictedVal == 1:
                # team 1 wins
                worldCup.at[team1WC.index[0], 'Points'] += 3
            elif predictedVal == 0:
                # draw
                worldCup.at[team1WC.index[0], 'Points'] += 1
                worldCup.at[team2WC.index[0], 'Points'] += 1
            else:
                # team 2 wins
                worldCup.at[team2WC.index[0], 'Points'] += 3

In [519]:
# worldCup['Points'] = 0

In [520]:
worldCup

Unnamed: 0,Team,Points,Group,att,mid,def,ovr
0,Qatar,3,A,74,74,74,74
1,Ecuador,0,A,75,75,75,75
2,Netherlands,9,A,81,82,83,82
3,Senegal,6,A,78,78,78,78
4,England,9,B,86,83,83,83
5,Iran,6,B,78,78,78,78
6,USA,3,B,75,75,74,75
7,Wales,0,B,74,74,72,74
8,Argentina,9,C,86,81,81,83
9,Poland,6,C,81,74,75,77


# Running Rest of World Cup after Group Stage

In [505]:
worldCup.head(4)

Unnamed: 0,Team,Points,Group,att,mid,def,ovr
0,Qatar,2,A,74,74,74,74
1,Ecuador,2,A,75,75,75,75
2,Netherlands,9,A,81,82,83,82
3,Senegal,2,A,78,78,78,78


In [521]:
pointsMost, pointsSec, ovrMost, ovrSec, topTeam, secTeam = 0, 0, 0, 0, None, None
groupIdx = 0
groupLetter = 'A'
teamsAdvancing = []
for i in range(len(worldCup)):
    currPoints = worldCup.at[i, 'Points']
    currOvr = worldCup.at[i, 'ovr']
    currTeam = worldCup.at[i, 'Team']
    if currPoints > pointsMost or (currPoints == pointsMost and currOvr > ovrMost):
        # new group leader: the previous leader drops to second place
        pointsSec, ovrSec, secTeam = pointsMost, ovrMost, topTeam
        pointsMost, ovrMost, topTeam = currPoints, currOvr, currTeam
    elif currPoints > pointsSec or (currPoints == pointsSec and currOvr > ovrSec):
        # new runner-up (ties on points are broken by overall rating)
        pointsSec, ovrSec, secTeam = currPoints, currOvr, currTeam
    groupIdx += 1
    if groupIdx == 4:
        newGroup = [topTeam, secTeam]
        teamsAdvancing.append(newGroup)
        groupIdx = 0
        pointsMost, pointsSec, ovrMost, ovrSec, topTeam, secTeam = 0, 0, 0, 0, None, None      
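
The same selection can be expressed as a sort: rank each group by points, break ties by overall rating, and keep the first two teams. The sketch below is one possible formulation rather than the notebook's code. `top_two` is a hypothetical helper, and the sample rows reuse the Group A values from the table above.

```python
from typing import List, Tuple

Row = Tuple[str, int, int]   # (team, points, ovr)


def top_two(group_rows: List[Row]) -> List[str]:
    """Return the two best teams, ranked by points and then by overall rating."""
    ranked = sorted(group_rows, key=lambda row: (row[1], row[2]), reverse=True)
    return [team for team, _, _ in ranked[:2]]


group_a = [("Qatar", 3, 74), ("Ecuador", 0, 75),
           ("Netherlands", 9, 82), ("Senegal", 6, 78)]
print(top_two(group_a))   # ['Netherlands', 'Senegal']
```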

In [522]:
teamsAdvancing

[['Netherlands', 'Senegal'],
 ['England', 'Iran'],
 ['Argentina', 'Poland'],
 ['France', 'Denmark'],
 ['Spain', 'Germany'],
 ['Belgium', 'Croatia'],
 ['Brazil', 'Switzerland'],
 ['Portugal', 'Uruguay']]

In [523]:
a1 = teamsAdvancing[0][0]
a2 = teamsAdvancing[0][1]
b1 = teamsAdvancing[1][0]
b2 = teamsAdvancing[1][1]
c1 = teamsAdvancing[2][0]
c2 = teamsAdvancing[2][1]
d1 = teamsAdvancing[3][0]
d2 = teamsAdvancing[3][1]
e1 = teamsAdvancing[4][0]
e2 = teamsAdvancing[4][1]
f1 = teamsAdvancing[5][0]
f2 = teamsAdvancing[5][1]
g1 = teamsAdvancing[6][0]
g2 = teamsAdvancing[6][1]
h1 = teamsAdvancing[7][0]
h2 = teamsAdvancing[7][1]

In [529]:
def simulateKnockOutGame(teamName1, teamName2):
    predictor = pd.DataFrame(columns = ['att', 'mid', 'def', 'ovr'])
    team1WC = worldCup[worldCup['Team'] == teamName1]
    team2WC = worldCup[worldCup['Team'] == teamName2]
    # rating differences, teamName1 - teamName2 (Series.get_values() was removed in pandas 1.0)
    predictor.at[0,'att'] = team1WC['att'].iloc[0] - team2WC['att'].iloc[0]
    predictor.at[0,'mid'] = team1WC['mid'].iloc[0] - team2WC['mid'].iloc[0]
    predictor.at[0,'def'] = team1WC['def'].iloc[0] - team2WC['def'].iloc[0]
    predictor.at[0,'ovr'] = team1WC['ovr'].iloc[0] - team2WC['ovr'].iloc[0]
    predictedVal = lm.predict(predictor)
    predictedVal[predictedVal <= -.0000000000001] = -1
    predictedVal[predictedVal >= .0000000000001] = 1
    predictedVal[(predictedVal < .0000000000001) & (predictedVal > -.0000000000001)] = 0
    if predictedVal == 1:
        # teamName1 wins
        return teamName1
    elif predictedVal == -1:
        # teamName2 wins
        return teamName2
    else:
        # predicted draw: the team with the higher overall rating wins in ET/PKs
        if team1WC['ovr'].iloc[0] > team2WC['ovr'].iloc[0]:
            # teamName1 wins in ET/PKs
            return teamName1
        else:
            # teamName2 wins in ET/PKs
            return teamName2

In [528]:
# Left side
# A1 v B2 : C1 v D2
# E1 v F2 : G1 v H2

# Right side
# D1 v C2 : B1 v A2
# F1 v E2 : H1 v G2 
leftSideTop = [[a1, b2], [c1, d2]]
leftSideBot = [[e1, f2], [g1, h2]]
leftSide = [leftSideTop, leftSideBot]

rightSideTop = [[d1, c2], [b1, a2]]
rightSideBot = [[f1, e2], [h1, g2]]
rightSide = [rightSideTop, rightSideBot]
knockOut = [leftSide, rightSide]
semis = []
for side in knockOut:
    for quarter in side:
        q1, q2 = None, None
        qIdx = 0
        for match in quarter:
            # named `match` so the loop does not overwrite the `game` DataFrame
            team1 = match[0]
            team2 = match[1]
            winner = simulateKnockOutGame(team1, team2)
            if qIdx == 0:
                q1 = winner
            else:
                toSemis = simulateKnockOutGame(q1, winner)
                semis.append(toSemis)
            qIdx += 1

final1 = simulateKnockOutGame(semis[0], semis[1])
final2 = simulateKnockOutGame(semis[2], semis[3])
champion = simulateKnockOutGame(final1, final2)
print(champion)

Argentina
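
A single-elimination bracket can also be simulated by listing the teams in bracket order and repeatedly pairing adjacent entries until one team remains. The sketch below is a generic formulation under stated assumptions: `run_bracket` and `higher_overall` are hypothetical names, the four-team bracket is illustrative rather than the real draw, and the stand-in rule (higher overall rating wins, ties go to the second team) replaces `simulateKnockOutGame`.

```python
from typing import Callable, List


def run_bracket(teams: List[str], play: Callable[[str, str], str]) -> str:
    """Play rounds until one team is left; adjacent entries meet each round."""
    if len(teams) == 0 or len(teams) & (len(teams) - 1):
        raise ValueError("bracket size must be a power of two")
    while len(teams) > 1:
        # Winners of games (0, 1), (2, 3), ... form the next round in order
        teams = [play(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]
    return teams[0]


# Overall ratings taken from the worldCup table; the pairing is illustrative
ovr = {"Netherlands": 82, "Iran": 78, "England": 83, "Senegal": 78}


def higher_overall(team1: str, team2: str) -> str:
    """Hypothetical stand-in for the model: the higher overall rating advances."""
    return team1 if ovr[team1] > ovr[team2] else team2


print(run_bracket(["Netherlands", "Iran", "England", "Senegal"], higher_overall))
```

Keeping the bracket as one flat list in draw order avoids the nested side/quarter lists, at the cost of making the draw less visible in the code.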


# Save the Model

In [503]:
import pickle
# Open the file where the model will be stored; the with block closes it afterwards
with open('world_cup_model.pkl', 'wb') as file:
    # Dump the fitted linear regression model to that file
    pickle.dump(lm, file)
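
To use a saved model later, the file is read back with `pickle.load`. The sketch below shows the round trip under stated assumptions: a plain dictionary with made-up values stands in for the fitted model, and a temporary directory is used so that the real `world_cup_model.pkl` is not overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model (values are made up)
model_stand_in = {"coef": [0.01, 0.02, 0.0, 0.03], "intercept": 0.1}

# Write to a temporary location rather than the notebook's working directory
path = os.path.join(tempfile.mkdtemp(), "world_cup_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)
```

Two cautions apply when the object is a real scikit-learn estimator: it should be loaded with the same scikit-learn version that saved it, and pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.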

## 4. Feature Engineering and Feature Selection

## 5. Model Training and Model Evaluation

## 6. Hyperparameter Tuning and Evaluation

## 7. Prediction

## 8. References

[1] Deploying Machine Learning Model Using Heroku. https://datamahadev.com/deploying-machine-learning-model-using-heroku/

[2] Using Machine Learning to Simulate World Cup Matches. https://towardsdatascience.com/using-machine-learning-to-simulate-world-cup-matches-959e24d0731

[3] World Cup 2018 Prediction. https://www.kaggle.com/code/agostontorok/soccer-world-cup-2018-winner