# World Cup Prediction Model
### By: David Hoffman and Kyle Kolodziej

## 1. Problem Definition

For this project, we take the most recent World Cup data (each team playing in the tournament and its statistical ratings) and use it to predict the success percentage of a user-specified team. The user inputs a team along with attack and defense ratings for it; the model then removes that team from the dataset and predicts its success percentage given the inputted ratings. This lets a person see their chosen team's chances with its current ratings, or in a hypothetical situation where its attack or defense is better than it actually is.

## 2. Data Gathering

In [2]:
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup

# Scrape the att, mid, def and ovr ratings of every national team, for every FIFA edition, from both pages of fifaindex.com's international team listing
data = pd.DataFrame(columns=['Year', 'Team', 'att', 'mid', 'def', 'ovr'])
version = 0
for fifa in range(5, 24):
    toFind = True
#     print("On fifa ", fifa)
    while toFind:
        version += 1
        for page in range(2):
            # https://www.fifaindex.com/teams/fifa22_527/?page=1&type=1
            webAddress = ""
            toCheck = "" # check matching version in responses title
            if fifa < 10:
                toCheck = '0' + str(fifa)
                webAddress = 'https://www.fifaindex.com/teams/fifa0' + str(fifa) + '_' + str(version) + '/?page=' + str(page+1)+'&type=1'
            elif fifa < 23:
                toCheck = str(fifa)
                webAddress = 'https://www.fifaindex.com/teams/fifa' + str(fifa) + '_' + str(version) + '/?page=' + str(page+1)+'&type=1'
            else:
                # Get current stats
                toCheck = str(fifa - 1)
                webAddress = 'https://www.fifaindex.com/teams/?page=' + str(page+1) + '&type=1'
            r = requests.get(webAddress)

            if r.status_code != 200:
                #print("Error getting web address")
                version += 1
                break
                
            soup = BeautifulSoup(r.content, 'html.parser')
            
            if toCheck not in str(soup.title):
#                 print("not in title!")
#                 print("Title: " + str(soup.title))
#                 print("Fifa: " + toCheck + "\n")
                break
            else:
                toFind = False
            s = soup.find('table', class_='table table-striped table-teams')
            content = s.find_all('td')
            
            year = 1999 + fifa # Need to update
            team, attRate, midRate, defRate, ovrRate  = '', None, None, None, None
            #textLines = []
            i = 0
            for line in content:
                
                if line.text != '' and line.text != 'International' and line.text != '\n\n' and line.text != "Men's National":
                    # textLines was keeping track of all lines parsed, useful for seeing which index corresponds with an attribute
                    # textLines.append(line)
                    #print("Line: \n" + str(line) + "\n")
                    
                    
                    # Assign value to respective variable
                    if i == 0:
                        team = line.text
                    elif i == 1:
                        attRate = line.text
                    elif i == 2:
                        midRate = line.text
                    elif i == 3:
                        defRate = line.text
                    elif i == 4:
                        ovrRate = line.text
                        df2 = pd.DataFrame({'Year': [year],
                                'Team': [team],
                                'att': [attRate],
                                'mid': [midRate],
                                'def': [defRate],
                                'ovr': [ovrRate]})
                        # DataFrame.append was removed in pandas 2.0, so use pd.concat
                        data = pd.concat([data, df2], ignore_index=True)

                    # Update i
                    # If i is at 5, reset to 0 as at the start of a new team in the content
                    i += 1
                    if i == 5:
                        i = 0



In [3]:
data.head(1)

| | Year | Team | att | mid | def | ovr |
|---|---|---|---|---|---|---|
| 0 | 2004 | France | 94 | 89 | 84 | 88 |


Nice, we now have the stats for each team from 2004 onward.

The fifaindex.com data is missing some of the World Cup teams, so their ratings will need to be imputed from the FIFA world rankings:
* Rankings source: https://www.fifa.com/fifa-world-ranking

In [4]:
def parseFifaRankData(points, page_source, year):
    # Function to parse a page on a Fifa Ranking page
    # Inserts the Team Name along with their Points and Overall Rank for the respective year into the data frame points
    # Points and Overall Rank will be normalized
    
    soup = BeautifulSoup(page_source, 'lxml')
    s = soup.find_all('tr', class_='fc-ranking-item-full_rankingTableFullRow__1nbp7')
    pointArr = []
    rankArr = []
    nameArr = []
    yearArr = []
    for line in s:
        # Each line contains the container for a team
        teamName = line.find('span', class_='d-none d-lg-block').text
        nameArr.append(teamName)
        teamPoints = line.find('div', class_="d-flex ff-mr-16").text
        pointArr.append(float(teamPoints))
        teamRank = line.find('h6', class_="ff-m-0").text
        rankArr.append(float(teamRank))
        yearArr.append(year)

    # Normalize points and rank then add to the data frame
    pointNorm = np.linalg.norm(pointArr)
    normPointArr = pointArr/pointNorm
    
    rankNorm = np.linalg.norm(rankArr)
    normRankArr = rankArr/rankNorm
    
    df2 = pd.DataFrame({'Year': yearArr,
                    'Team': nameArr,
                    'Points': normPointArr,
                    'Rank': normRankArr})
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    points = pd.concat([points, df2], ignore_index=True)
    return points
        

In [8]:
# Need to use the driver: https://stackoverflow.com/questions/52687372/beautifulsoup-not-returning-complete-html-of-the-page
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import numpy as np

points = pd.DataFrame(columns=['Year', 'Team', 'Points', 'Rank'])

url = "https://www.fifa.com/fifa-world-ranking/men?dateId=id13603"
# Selenium 4 style: the driver path goes through Service, and elements are located with find_element(By.XPATH, ...)
driver = webdriver.Chrome(service=Service(executable_path=r"C:/Users/Kyle/Downloads/chromedriver_win32/chromedriver.exe"))
driver.get(url)
time.sleep(3) # wait 3 seconds for the page to load
driver.find_element(By.XPATH, "//button[@id='onetrust-accept-btn-handler']").click() # accept the cookie banner


for year in range(2004, 2023):
    # Get data of top 100 teams from 2004 to 2022 in February
    date = 'Feb ' + str(year)
    xPath = "//button[@class='ff-dropdown_dropupContentButton__3WmBL' and contains(.,'" + date + "')]"
    driver.find_element(By.XPATH, "//div[@class='ff-dropdown_dropup__3DoLH null ']").click() # open the date dropdown
    driver.find_element(By.XPATH, xPath).click() # click to this year
    time.sleep(1) # wait for the page to load for 1 sec
    points = parseFifaRankData(points, driver.page_source, year)
    
    driver.find_element(By.XPATH, "//div[@aria-label='Go to Page 2']").click() # page 2
    time.sleep(1) # wait for the page to load for 1 sec
    points = parseFifaRankData(points, driver.page_source, year)




In [9]:
points.head(1)

| | Year | Team | Points | Rank |
|---|---|---|---|---|
| 0 | 2004 | Brazil | 0.178185 | 0.004833 |


In [10]:
points.tail(1)

| | Year | Team | Points | Rank |
|---|---|---|---|---|
| 1899 | 2022 | Palestine | 0.12922 | 0.183982 |


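Nice, the points and rank are in there with the teams. Note that `parseFifaRankData` is called once per page, so points and rank are normalized within each 50-team page rather than across the whole year.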
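Because `np.linalg.norm` returns the L2 (Euclidean) norm, each value is divided by the square root of the sum of squares of whatever array it is computed over. If normalization over a full year is wanted instead of per scraped page, one option is to collect the raw values first and normalize with a group-by. The sketch below assumes a hypothetical `raw` frame with made-up point totals; it is not the notebook's scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rankings for two years (illustrative values, not FIFA data)
raw = pd.DataFrame({
    "Year":   [2004, 2004, 2004, 2005, 2005],
    "Team":   ["A", "B", "C", "A", "B"],
    "Points": [3.0, 4.0, 12.0, 6.0, 8.0],
})

# Divide each value by the L2 norm of all values from the same year
raw["PointsNorm"] = raw.groupby("Year")["Points"].transform(
    lambda s: s / np.linalg.norm(s)
)

print(raw)
```

With this scaling, the squared normalized values within each year sum to 1, so values are comparable across all teams of that year.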

### Align Team Names Between Points and Data

In [11]:
pointsTeams = points["Team"].unique()
dataTeams = data["Team"].unique()
pointsTeamsNotInData = np.setdiff1d(pointsTeams, dataTeams)
dataTeamsNotInPoints = np.setdiff1d(dataTeams, pointsTeams)

In [12]:
pointsTeamsNotInData

array(['Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Armenia',
       'Azerbaijan', 'Bahrain', 'Belarus', 'Benin',
       'Bosnia and Herzegovina', 'Botswana', 'Burkina Faso', 'Cabo Verde',
       'Cape Verde Islands', 'Central African Republic', 'Congo',
       'Congo DR', 'Cuba', 'Curacao', 'Curaçao', 'Cyprus',
       'Dominican Republic', 'El Salvador', 'Equatorial Guinea',
       'Estonia', 'Ethiopia', 'FYR Macedonia', 'Faroe Islands', 'Gabon',
       'Gambia', 'Georgia', 'Ghana', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'IR Iran',
       'Indonesia', 'Iraq', 'Israel', 'Jamaica', 'Japan', 'Jordan',
       'Kazakhstan', 'Kenya', 'Korea DPR', 'Kuwait', 'Kyrgyz Republic',
       'Latvia', 'Lebanon', 'Liberia', 'Libya', 'Lithuania', 'Luxembourg',
       'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Moldova',
       'Montenegro', 'Morocco', 'Mozambique', 'Namibia', 'Niger',
       'North Macedonia', 'Oman', 'Palestine', 'Panama'

In [13]:
dataTeamsNotInPoints

array(['Austria (National team)', 'Holland', 'India', 'Rep. Of Korea',
       'Republic Of Ireland', 'United States'], dtype=object)

In [14]:
pointsTeams.sort()
pointsTeams

array(['Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Belarus', 'Belgium', 'Benin', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Cabo Verde',
       'Cameroon', 'Canada', 'Cape Verde Islands',
       'Central African Republic', 'Chile', 'China PR', 'Colombia',
       'Congo', 'Congo DR', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao',
       'Curaçao', 'Cyprus', 'Czech Republic', "Côte d'Ivoire", 'Denmark',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England',
       'Equatorial Guinea', 'Estonia', 'Ethiopia', 'FYR Macedonia',
       'Faroe Islands', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia',
       'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary',
       'IR Iran', 'Iceland', 'Indonesia', 'Iraq', 'Israel', 'Italy',
       'Jamaica', 'Japa

In [15]:
dataTeams.sort()
dataTeams

array(['Argentina', 'Australia', 'Austria', 'Austria (National team)',
       'Belgium', 'Bolivia', 'Brazil', 'Bulgaria', 'Cameroon', 'Canada',
       'Chile', 'China PR', 'Colombia', 'Costa Rica', 'Croatia',
       'Czech Republic', "Côte d'Ivoire", 'Denmark', 'Ecuador', 'Egypt',
       'England', 'Finland', 'France', 'Germany', 'Greece', 'Holland',
       'Hungary', 'Iceland', 'India', 'Italy', 'Korea Republic', 'Mexico',
       'Netherlands', 'New Zealand', 'Nigeria', 'Northern Ireland',
       'Norway', 'Paraguay', 'Peru', 'Poland', 'Portugal',
       'Rep. Of Korea', 'Republic Of Ireland', 'Republic of Ireland',
       'Romania', 'Russia', 'Saudi Arabia', 'Scotland', 'Serbia',
       'Slovenia', 'South Africa', 'Spain', 'Sweden', 'Switzerland',
       'Tunisia', 'Turkey', 'Ukraine', 'United States', 'Uruguay',
       'Venezuela', 'Wales'], dtype=object)

Team names to change in data...
* Austria (National team) --> Austria
* Holland --> Netherlands
* Rep. Of Korea --> Korea Republic
* United States --> USA
* Republic Of Ireland --> Republic of Ireland

Team names to change in points...
* Korea DPR --> Korea Republic

In [16]:
data.loc[data["Team"] == "Austria (National team)", "Team"] = "Austria"
data.loc[data["Team"] == "Holland", "Team"] = "Netherlands"
data.loc[data["Team"] == "Rep. Of Korea", "Team"] = "Korea Republic"
data.loc[data["Team"] == "United States", "Team"] = "USA"
data.loc[data["Team"] == "Republic Of Ireland", "Team"] = "Republic of Ireland"
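# Caution: Korea DPR (North Korea) is a different team from Korea Republic (South Korea); this relabels its rows as Korea Republic
points.loc[points["Team"] == "Korea DPR", "Team"] = "Korea Republic"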
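The same renames can be kept in a single mapping, which is easier to extend when more mismatches turn up. This is a minimal sketch on a hypothetical three-row frame (`data_demo`), using the spellings from the list above; `Series.replace` leaves names that are not in the mapping untouched.

```python
import pandas as pd

# Hypothetical miniature of the scraped frame; only the Team column matters here
data_demo = pd.DataFrame({"Team": ["Holland", "United States", "Brazil"]})

# Source spelling -> target spelling
name_map = {
    "Austria (National team)": "Austria",
    "Holland": "Netherlands",
    "Rep. Of Korea": "Korea Republic",
    "United States": "USA",
    "Republic Of Ireland": "Republic of Ireland",
}

# Replace every listed spelling in one pass
data_demo["Team"] = data_demo["Team"].replace(name_map)

print(data_demo["Team"].tolist())
```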

### Impute Each Team's OVR

* Merge the data frames `points` and `data` on Year and Team
* Drop the rows with NaNs
* Drop the columns Team, att, mid, def
* One-hot encode (OHE) the years

With that in place, the data can be split into train and test sets for the imputation model:
* y = ovr
* x = everything else (year OHE, points, rank)


In [17]:
dataToImpute = pd.merge(points, data, how='left', on=['Year','Team'])
dataToImpute = dataToImpute.dropna()
dataToImpute.head(5)

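| | Year | Team | Points | Rank | att | mid | def | ovr |
|---|---|---|---|---|---|---|---|---|
| 0 | 2004 | Brazil | 0.178185 | 0.004833 | 92 | 88 | 87 | 88 |
| 1 | 2004 | France | 0.174394 | 0.009667 | 94 | 89 | 84 | 88 |
| 2 | 2004 | Spain | 0.167444 | 0.0145 | 90 | 88 | 86 | 88 |
| 4 | 2004 | Mexico | 0.156281 | 0.024167 | 70 | 71 | 68 | 70 |
| 5 | 2004 | Argentina | 0.15607 | 0.029 | 86 | 85 | 87 | 85 |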


In [18]:
# Drop the columns
dataToImpute = dataToImpute.drop(columns=['Team', 'att', 'mid', 'def'])
dataToImpute.head(1)

| | Year | Points | Rank | ovr |
|---|---|---|---|---|
| 0 | 2004 | 0.178185 | 0.004833 | 88 |


In [19]:
len(dataToImpute)

796

In [20]:
ohe_data = pd.get_dummies(dataToImpute, columns = ['Year'])
ohe_data["Year"] = dataToImpute["Year"]
ohe_data.head(3)

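| | Points | Rank | ovr | Year_2004 | Year_2005 | Year_2006 | Year_2007 | Year_2008 | Year_2009 | Year_2010 | ... | Year_2014 | Year_2015 | Year_2016 | Year_2017 | Year_2018 | Year_2019 | Year_2020 | Year_2021 | Year_2022 | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.178185 | 0.004833 | 88 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2004 |
| 1 | 0.174394 | 0.009667 | 88 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2004 |
| 2 | 0.167444 | 0.0145 | 88 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2004 |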


In [21]:
ohe_data.tail(3)

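| | Points | Rank | ovr | Year_2004 | Year_2005 | Year_2006 | Year_2007 | Year_2008 | Year_2009 | Year_2010 | ... | Year_2014 | Year_2015 | Year_2016 | Year_2017 | Year_2018 | Year_2019 | Year_2020 | Year_2021 | Year_2022 | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1856 | 0.150905 | 0.10487 | 71 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2022 |
| 1859 | 0.147809 | 0.110389 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2022 |
| 1874 | 0.140434 | 0.137987 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2022 |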


### Predict the Overall

Start with a train test split, stratifying on the years

In [24]:
from sklearn.model_selection import train_test_split
X = ohe_data.drop(columns=['ovr', 'Year'])
y = ohe_data['ovr']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=ohe_data.Year)

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)

lin_mse = mean_squared_error(y_test, lm_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

3.4677139109532944

Not too bad: the linear regression model's RMSE on the test set is about 3.5 rating points.
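RMSE is not quite the same as the average miss: squaring the errors first means a few large misses weigh more than many small ones, so RMSE is always at least as large as the mean absolute error (MAE). A small sketch with made-up ratings (the arrays below are illustrative, not model output):

```python
import numpy as np

# Hypothetical true and predicted overall ratings
y_true = np.array([80, 75, 70, 90])
y_pred = np.array([82, 75, 64, 90])

errors = y_pred - y_true                # per-team errors: 2, 0, -6, 0
rmse = np.sqrt(np.mean(errors ** 2))    # square root of the mean squared error
mae = np.mean(np.abs(errors))           # mean of the absolute errors

print("RMSE:", rmse)
print("MAE: ", mae)
```

Here the single 6-point miss pushes the RMSE well above the MAE, which is why the two are worth reporting side by side when a few teams are predicted badly.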

Let's do a regression tree next

In [26]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
tree_predictions = tree_reg.predict(X_test)
tree_mse = mean_squared_error(y_test, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

3.948452580160711

The decision tree regressor performs a little worse than the linear regression, with an RMSE of about 3.9.

Now let's try a random forest regressor

In [27]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)
forest_predictions = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_test, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

3.095887528607723

The random forest regressor performs best of the three, with an RMSE of about 3.1...nice!

### Using the Random Forest Regressor to Predict the Overalls

In [28]:
points.head(5)

| | Year | Team | Points | Rank |
|---|---|---|---|---|
| 0 | 2004 | Brazil | 0.178185 | 0.004833 |
| 1 | 2004 | France | 0.174394 | 0.009667 |
| 2 | 2004 | Spain | 0.167444 | 0.0145 |
| 3 | 2004 | Netherlands | 0.157755 | 0.019334 |
| 4 | 2004 | Mexico | 0.156281 | 0.024167 |


In [29]:
imputedOveralls = points.copy()
imputedOveralls = pd.get_dummies(imputedOveralls, columns = ['Year'])

xImpute = imputedOveralls.drop(columns=['Team'])

overall_predictions = forest_reg.predict(xImpute)
overall_predictions = np.round(overall_predictions)
overall_predictions

array([87., 86., 85., ..., 73., 73., 72.])
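One thing to watch when predicting on `points`: the one-hot columns produced by `pd.get_dummies` must match the training columns in name and order, which holds only if the same years appear in both frames. A minimal sketch of aligning a new frame to the training layout with `reindex` follows; `train_columns` and `new_rows` are hypothetical stand-ins for `X.columns` and `points`.

```python
import pandas as pd

# Hypothetical training layout (stand-in for X.columns)
train_columns = ["Points", "Rank", "Year_2004", "Year_2005", "Year_2006"]

# Hypothetical rows to predict on; note that 2005 does not occur
new_rows = pd.DataFrame({
    "Year":   [2004, 2006],
    "Points": [0.17, 0.15],
    "Rank":   [0.01, 0.02],
})

encoded = pd.get_dummies(new_rows, columns=["Year"])

# Add any missing one-hot columns as zeros and enforce the training column order
aligned = encoded.reindex(columns=train_columns, fill_value=0)

print(aligned)
```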

In [30]:
points['ovr'] = overall_predictions.astype(int)
points.head(2)

| | Year | Team | Points | Rank | ovr |
|---|---|---|---|---|---|
| 0 | 2004 | Brazil | 0.178185 | 0.004833 | 87 |
| 1 | 2004 | France | 0.174394 | 0.009667 | 86 |


### Get Previous International Soccer Game Data

Load the results of previous international games from `results.csv`.

In [31]:
game = pd.read_csv("results.csv")
game["Year"] = game.date.str[:4].astype(str) # slice the first 4 characters (the year) of each entry in the date column

game.head(5)

| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0 | 0 | Friendly | Glasgow | Scotland | False | 1872 |
| 1 | 1873-03-08 | England | Scotland | 4 | 2 | Friendly | London | England | False | 1873 |
| 2 | 1874-03-07 | Scotland | England | 2 | 1 | Friendly | Glasgow | Scotland | False | 1874 |
| 3 | 1875-03-06 | England | Scotland | 2 | 2 | Friendly | London | England | False | 1875 |
| 4 | 1876-03-04 | Scotland | England | 3 | 0 | Friendly | Glasgow | Scotland | False | 1876 |


In [32]:
game = game.drop(columns=['date', 'city', 'country', 'neutral'])
game = game.where(game['Year'] >= '2004')
game = game.dropna()
game.head(5)

| | home_team | away_team | home_score | away_score | tournament | Year |
|---|---|---|---|---|---|---|
| 26360 | Bahrain | Saudi Arabia | 0.0 | 1.0 | Gulf Cup | 2004 |
| 26361 | Bermuda | Barbados | 0.0 | 4.0 | Friendly | 2004 |
| 26362 | Kuwait | Yemen | 4.0 | 0.0 | Gulf Cup | 2004 |
| 26363 | Oman | Bahrain | 0.0 | 1.0 | Gulf Cup | 2004 |
| 26364 | United Arab Emirates | Qatar | 0.0 | 0.0 | Gulf Cup | 2004 |


In [33]:
len(game)

17061

In [34]:
game["tournament"].value_counts()

Friendly                                      5957
FIFA World Cup qualification                  4023
UEFA Euro qualification                       1084
African Cup of Nations qualification           744
African Cup of Nations                         357
UEFA Nations League                            308
CECAFA Cup                                     278
AFC Asian Cup qualification                    269
African Nations Championship                   264
CFU Caribbean Cup qualification                264
FIFA World Cup                                 256
Gold Cup                                       238
COSAFA Cup                                     233
AFF Championship                               207
UEFA Euro                                      195
Copa América                                   190
Island Games                                   185
AFC Asian Cup                                  179
Gulf Cup                                       136
EAFF Championship              

Friendlies make up just over a third of the data (5,957 of 17,061 games), so we will build models with and without friendlies included and see which performs best.
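Splitting off the friendlies is a one-line filter on the `tournament` column. The sketch below runs on a hypothetical four-game frame (`game_demo`) rather than the real results, and only illustrates how the two training sets could be derived.

```python
import pandas as pd

# Hypothetical miniature of the game frame
game_demo = pd.DataFrame({
    "home_team":  ["Bermuda", "Bahrain", "England", "Brazil"],
    "away_team":  ["Barbados", "Saudi Arabia", "Scotland", "Croatia"],
    "tournament": ["Friendly", "Gulf Cup", "Friendly", "FIFA World Cup"],
})

# Share of games that are friendlies
friendly_share = (game_demo["tournament"] == "Friendly").mean()

# Two candidate training sets: all games, and competitive games only
all_games = game_demo
competitive = game_demo[game_demo["tournament"] != "Friendly"]

print(friendly_share, len(all_games), len(competitive))
```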

In [35]:
game.head(1)

| | home_team | away_team | home_score | away_score | tournament | Year |
|---|---|---|---|---|---|---|
| 26360 | Bahrain | Saudi Arabia | 0.0 | 1.0 | Gulf Cup | 2004 |


In [36]:
game['score'] = game['home_score'] - game['away_score']
game['outcome'] = None
game.loc[game["score"] == 0, "outcome"] = 0
game.loc[game["score"] > 0, "outcome"] = 1
game.loc[game["score"] < 0, "outcome"] = -1
game.head(5)

| | home_team | away_team | home_score | away_score | tournament | Year | score | outcome |
|---|---|---|---|---|---|---|---|---|
| 26360 | Bahrain | Saudi Arabia | 0.0 | 1.0 | Gulf Cup | 2004 | -1.0 | -1 |
| 26361 | Bermuda | Barbados | 0.0 | 4.0 | Friendly | 2004 | -4.0 | -1 |
| 26362 | Kuwait | Yemen | 4.0 | 0.0 | Gulf Cup | 2004 | 4.0 | 1 |
| 26363 | Oman | Bahrain | 0.0 | 1.0 | Gulf Cup | 2004 | -1.0 | -1 |
| 26364 | United Arab Emirates | Qatar | 0.0 | 0.0 | Gulf Cup | 2004 | 0.0 | 0 |


The model will predict `outcome`, which encodes the result from the home team's perspective:
* 1 if home team wins
* 0 if draw
* -1 if away team wins
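The three-way encoding is simply the sign of the goal difference, so it can also be computed in one step with `np.sign`. A minimal sketch on a hypothetical three-game frame (`demo`):

```python
import numpy as np
import pandas as pd

# Hypothetical scores: an away win, a home win and a draw
demo = pd.DataFrame({
    "home_score": [0, 4, 0],
    "away_score": [1, 0, 0],
})

# Goal difference from the home team's perspective
demo["score"] = demo["home_score"] - demo["away_score"]

# The sign of the goal difference gives -1 (away win), 0 (draw) or 1 (home win)
demo["outcome"] = np.sign(demo["score"]).astype(int)

print(demo)
```

This also leaves `outcome` as an integer column instead of the object dtype that results from initialising it with `None`.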




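### Match Team Names in the Game Data with the Points and Data Teams

Follow the same process as above: compare the team names in `game` against those in `points` and `data`, then rename the mismatches.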

### Add in Home Team and Away Team Overall

In [37]:
game.head(1)

| | home_team | away_team | home_score | away_score | tournament | Year | score | outcome |
|---|---|---|---|---|---|---|---|---|
| 26360 | Bahrain | Saudi Arabia | 0.0 | 1.0 | Gulf Cup | 2004 | -1.0 | -1 |


Plan: loop through `game` and, for each team in a game (home and away):
* check whether `data` has a match for the team in that year
* if not, check whether `points` has a match (using the imputed overall)
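The lookup described above can be done without an explicit loop by building one ratings table (scraped values first, imputed values as fallback) and merging it onto the games twice. This is only a sketch under assumptions: `games`, `data_demo` and `points_demo` are hypothetical miniatures, and `Year` is assumed to have the same dtype in every frame (in the notebook, `game["Year"]` is a string and would need converting first).

```python
import pandas as pd

# Hypothetical miniatures of the three frames
games = pd.DataFrame({
    "Year": [2004, 2004],
    "home_team": ["Brazil", "Bahrain"],
    "away_team": ["France", "Oman"],
})
data_demo = pd.DataFrame({      # scraped overalls
    "Year": [2004, 2004], "Team": ["Brazil", "France"], "ovr": [88, 88],
})
points_demo = pd.DataFrame({    # imputed overalls
    "Year": [2004, 2004, 2004], "Team": ["Brazil", "France", "Bahrain"], "ovr": [87, 86, 70],
})

# Scraped rows come first, so keep="first" prefers them over imputed rows
ratings = (
    pd.concat([data_demo, points_demo], ignore_index=True)
    .drop_duplicates(subset=["Year", "Team"], keep="first")
)

# Attach the overall for each side; teams with no rating at all end up as NaN
for side in ("home", "away"):
    side_ratings = ratings.rename(columns={"Team": f"{side}_team", "ovr": f"{side}_ovr"})
    games = games.merge(side_ratings, how="left", on=["Year", f"{side}_team"])

print(games)
```

Games where either side is still NaN after the merge have no rating in either source and would need to be dropped or handled separately.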

## 3. Data Preparation and EDA

In [4]:
# Drop games from before 2004, since there are no team metrics for those years
import numpy as np
game = game.where(game['Year'] >= '2004')
game = game.dropna()
game.head(5)

| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 26360 | 2004-01-01 | Bahrain | Saudi Arabia | 0.0 | 1.0 | Gulf Cup | Kuwait City | Kuwait | 1.0 | 2004 |
| 26361 | 2004-01-01 | Bermuda | Barbados | 0.0 | 4.0 | Friendly | Hamilton | Bermuda | 0.0 | 2004 |
| 26362 | 2004-01-01 | Kuwait | Yemen | 4.0 | 0.0 | Gulf Cup | Kuwait City | Kuwait | 0.0 | 2004 |
| 26363 | 2004-01-03 | Oman | Bahrain | 0.0 | 1.0 | Gulf Cup | Kuwait City | Kuwait | 1.0 | 2004 |
| 26364 | 2004-01-03 | United Arab Emirates | Qatar | 0.0 | 0.0 | Gulf Cup | Kuwait City | Kuwait | 1.0 | 2004 |


In [5]:
game.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17061 entries, 26360 to 43420
Data columns (total 10 columns):
date          17061 non-null object
home_team     17061 non-null object
away_team     17061 non-null object
home_score    17061 non-null float64
away_score    17061 non-null float64
tournament    17061 non-null object
city          17061 non-null object
country       17061 non-null object
neutral       17061 non-null float64
Year          17061 non-null object
dtypes: float64(3), object(7)
memory usage: 1.4+ MB


In [6]:
game["Year"].value_counts()

2019    1156
2008    1095
2021    1086
2011    1083
2004    1066
2012    1007
2015    1003
2007     975
2013     968
2017     958
2016     928
2009     912
2018     908
2014     860
2010     854
2006     832
2005     799
2020     299
2022     272
Name: Year, dtype: int64

In [7]:
game["tournament"].value_counts()

Friendly                                 5957
FIFA World Cup qualification             4023
UEFA Euro qualification                  1084
African Cup of Nations qualification      744
African Cup of Nations                    357
UEFA Nations League                       308
CECAFA Cup                                278
AFC Asian Cup qualification               269
African Nations Championship              264
CFU Caribbean Cup qualification           264
FIFA World Cup                            256
Gold Cup                                  238
COSAFA Cup                                233
AFF Championship                          207
UEFA Euro                                 195
Copa América                              190
Island Games                              185
AFC Asian Cup                             179
Gulf Cup                                  136
EAFF Championship                         108
CONCACAF Nations League                   106
SAFF Cup                          

### World Cup Bracket Breakdown

In [109]:
groupA = ['Qatar', 'Ecuador', 'Netherlands', 'Senegal']
groupB = ['England', 'Iran', 'USA', 'Wales'] # not sure w Wales
groupC = ['Argentina', 'Poland', 'Mexico', 'Saudi Arabia']
groupD = ['France', 'Denmark', 'Tunisia', 'Peru'] # Peru not sure
groupE = ['Spain', 'Germany', 'Japan', 'Costa Rica'] # Costa rica not sure
groupF = ['Belgium', 'Canada', 'Morocco', 'Croatia']
groupG = ['Brazil', 'Serbia', 'Switzerland', 'Cameroon']
groupH = ['Portugal', 'Ghana', 'Uruguay', 'Korea Republic']

worldCupTeams = groupA + groupB + groupC + groupD + groupE + groupF + groupG + groupH

# Left side
# A1 v B2 : C1 v D2
# E1 v F2 : G1 v H2

# Right side
# D1 v C2 : B1 v A2
# F1 v E2 : G2 v H1

In [110]:
worldCupTeams

['Qatar',
 'Ecuador',
 'Netherlands',
 'Senegal',
 'England',
 'Iran',
 'USA',
 'Wales',
 'Argentina',
 'Poland',
 'Mexico',
 'Saudi Arabia',
 'France',
 'Denmark',
 'Tunisia',
 'Peru',
 'Spain',
 'Germany',
 'Japan',
 'Costa Rica',
 'Belgium',
 'Canada',
 'Morocco',
 'Croatia',
 'Brazil',
 'Serbia',
 'Switzerland',
 'Cameroon',
 'Portugal',
 'Ghana',
 'Uruguay',
 'Korea Republic']
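The round-of-16 pairings from the bracket comments can also be kept as data, so that matchups are generated from group standings instead of being typed by hand. In the sketch below every slot appears exactly once, with C1 (not C2) facing D2; `ROUND_OF_16`, `resolve` and `standings` are hypothetical names, and the finishing order is simply the first two teams as listed in each group, not a prediction.

```python
# Slot labels: "A1" is the winner of group A, "B2" the runner-up of group B, and so on
ROUND_OF_16 = [
    ("A1", "B2"), ("C1", "D2"),  # left side
    ("E1", "F2"), ("G1", "H2"),  # left side
    ("D1", "C2"), ("B1", "A2"),  # right side
    ("F1", "E2"), ("H1", "G2"),  # right side
]

def resolve(slot, standings):
    """Map a slot label such as 'B2' to a team name using per-group finishing orders."""
    group, place = slot[0], int(slot[1])
    return standings[group][place - 1]

# Hypothetical finishing order: the first two teams as listed in each group
standings = {
    "A": ["Qatar", "Ecuador"],    "B": ["England", "Iran"],
    "C": ["Argentina", "Poland"], "D": ["France", "Denmark"],
    "E": ["Spain", "Germany"],    "F": ["Belgium", "Canada"],
    "G": ["Brazil", "Serbia"],    "H": ["Portugal", "Ghana"],
}

matchups = [(resolve(a, standings), resolve(b, standings)) for a, b in ROUND_OF_16]
for team_a, team_b in matchups:
    print(team_a, "v", team_b)
```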

### Scraping the Current FIFA Ranking (Working Version)

In [98]:
# Need to use the driver: https://stackoverflow.com/questions/52687372/beautifulsoup-not-returning-complete-html-of-the-page
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the driver path goes through Service, and elements are located with find_element(By.XPATH, ...)
driver = webdriver.Chrome(service=Service(executable_path=r"C:/Users/Kyle/Downloads/chromedriver_win32/chromedriver.exe"))
driver.get(url)
time.sleep(3) # wait 3 seconds for the page to load
driver.find_element(By.XPATH, "//button[@id='onetrust-accept-btn-handler']").click() # accept the cookie banner

def parseTeamNames(page_source):
    # Return the team names from the ranking table rows of the given page source
    soup = BeautifulSoup(page_source, 'lxml')
    rows = soup.find_all('tr', class_='fc-ranking-item-full_rankingTableFullRow__1nbp7')
    # Each row contains the container for a team
    return [row.find('span', class_='d-none d-lg-block').text for row in rows]

rankedTeams = parseTeamNames(driver.page_source) # page 1

driver.find_element(By.XPATH, "//div[@aria-label='Go to Page 2']").click()
rankedTeams += parseTeamNames(driver.page_source) # page 2

  


In [99]:
rankedTeams

['Brazil',
 'Belgium',
 'France',
 'Argentina',
 'England',
 'Italy',
 'Spain',
 'Portugal',
 'Mexico',
 'Netherlands',
 'Denmark',
 'Germany',
 'Uruguay',
 'Switzerland',
 'USA',
 'Croatia',
 'Colombia',
 'Wales',
 'Sweden',
 'Senegal',
 'IR Iran',
 'Peru',
 'Japan',
 'Morocco',
 'Serbia',
 'Poland',
 'Ukraine',
 'Chile',
 'Korea Republic',
 'Nigeria',
 'Costa Rica',
 'Egypt',
 'Czech Republic',
 'Austria',
 'Tunisia',
 'Russia',
 'Cameroon',
 'Canada',
 'Scotland',
 'Hungary',
 'Norway',
 'Australia',
 'Turkey',
 'Algeria',
 'Slovakia',
 'Ecuador',
 'Republic of Ireland',
 'Romania',
 'Saudi Arabia',
 'Paraguay',
 'Qatar',
 'Mali',
 "Côte d'Ivoire",
 'Northern Ireland',
 'Greece',
 'Burkina Faso',
 'Finland',
 'Venezuela',
 'Bosnia and Herzegovina',
 'Ghana',
 'Panama',
 'North Macedonia',
 'Iceland',
 'Jamaica',
 'Slovenia',
 'Albania',
 'Congo DR',
 'United Arab Emirates',
 'South Africa',
 'Montenegro',
 'Cabo Verde',
 'Iraq',
 'Bulgaria',
 'El Salvador',
 'Oman',
 'Israel',
 'Chi

### Current Work

In [100]:
game.head(5)

| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 26360 | 2004-01-01 | Bahrain | Saudi Arabia | 0.0 | 1.0 | Gulf Cup | Kuwait City | Kuwait | 1.0 | 2004 |
| 26361 | 2004-01-01 | Bermuda | Barbados | 0.0 | 4.0 | Friendly | Hamilton | Bermuda | 0.0 | 2004 |
| 26362 | 2004-01-01 | Kuwait | Yemen | 4.0 | 0.0 | Gulf Cup | Kuwait City | Kuwait | 0.0 | 2004 |
| 26363 | 2004-01-03 | Oman | Bahrain | 0.0 | 1.0 | Gulf Cup | Kuwait City | Kuwait | 1.0 | 2004 |
| 26364 | 2004-01-03 | United Arab Emirates | Qatar | 0.0 | 0.0 | Gulf Cup | Kuwait City | Kuwait | 1.0 | 2004 |


In [191]:
home = game["home_team"].unique()
away = game["away_team"].unique()
allTeams = list(set(home).union(set(away)))
allTeams.sort()
allTeams

['Abkhazia',
 'Afghanistan',
 'Albania',
 'Alderney',
 'Algeria',
 'American Samoa',
 'Andalusia',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Arameans Suryoye',
 'Argentina',
 'Armenia',
 'Artsakh',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barawa',
 'Barbados',
 'Basque Country',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bonaire',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brittany',
 'Brunei',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Canary Islands',
 'Cape Verde',
 'Cascadia',
 'Catalonia',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chagos Islands',
 'Chameria',
 'Chile',
 'China PR',
 'Colombia',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Corsica',
 'Costa Rica',
 'County of Nice',
 'Crimea',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'DR Cong

In [108]:
len(allTeams)

295

In [113]:
missingTeams = set(worldCupTeams).difference(set(rankedTeams))
missingTeams

{'Iran'}

## 4. Feature Engineering and Feature Selection

## 5. Model Training and Model Evaluation

## 6. Hyperparameter Tuning and Evaluation

## 7. Prediction

## 8. References

[1] Deploying Machine Learning Model Using Heroku. https://datamahadev.com/deploying-machine-learning-model-using-heroku/

[2] Using Machine Learning to Simulate World Cup Matches. https://towardsdatascience.com/using-machine-learning-to-simulate-world-cup-matches-959e24d0731

[3] World Cup 2018 Prediction. https://www.kaggle.com/code/agostontorok/soccer-world-cup-2018-winner