## **World Cup 2018 Prediction Test** 

Aim: Create a model, check whether its predictions for the World Cup 2018 results are accurate, and then use it to predict the World Cup 2022 results in Summer 2022
Predicting the winners of the next FIFA World Cup

#### Filtering out irrelevant historic match data

The highest level of football below senior football is U21 football. For national teams the average squad age is between 24 and 29.6 years. Players under the age of 21 rarely get recruited to the senior team, so, to accommodate those who are, 21 years will be our lower bound. To accommodate older players, our upper bound will be 30 years.

* It is rare for teams to stay together for longer than 5 years in any sport: some players improve and stay, others decline and are dropped, some retire, and some leave through injury. Based on this number, I will only collect data from the 5 years up to the day before the first game of the World Cup.

* We want a clear view of the performance of the current squad, not of the England team of 1966, for example, because that isn't the team playing now. A limit of 5 years helps reduce this bias.
* I am currently doing more research to validate a lower or higher number. I will simply update the function input when I have this information.
* Past performance of previous squads does not guarantee future performance 
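
The window described above can be sketched as a small helper. This is a minimal illustration rather than the notebook's own function: the name `date_window` and its defaults are assumptions, with the end date set to the day before the opening match (13 June 2018).

```python
import pandas as pd


def date_window(end="2018-06-13", years=5):
    """Return (start, end) ISO date strings covering `years` years up to `end`."""
    end_ts = pd.Timestamp(end)
    # DateOffset performs calendar-aware subtraction of whole years.
    start_ts = end_ts - pd.DateOffset(years=years)
    return start_ts.date().isoformat(), end_ts.date().isoformat()


start, end = date_window()
print(start, end)
```

Changing `years` is then the only edit needed if further research suggests a shorter or longer window.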

In [167]:
# Importing the dependencies 
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.ticker as plticker
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### **----------------------------------  1. Data Collection  -----------------------------------**

In [168]:
# Reading data from JSON hyperlink to dataframe
fixtures = pd.read_json("https://fixturedownload.com/feed/json/fifa-world-cup-2018")

# Exporting dataframe to csv
fixtures.to_csv('../Data/fixtures.csv', index=False)

In [169]:
# Loading the data from csv to dataframe
df = pd.read_csv('../Data/results.csv')

In [170]:
# Loading fifa_rankings for 2018 into dataframe
ranking = pd.read_csv('../Data/fifa_rankings.csv') 

### **----------------------------------  2. Data Cleaning  -----------------------------------**

In [171]:
def recent_date():
    """
    NO INPUT
    Returns the last day of data as a 'YYYY-MM-DD' string
    """
    # LAST DAY OF DATA - the day before the first game of World Cup 2018
    recent_date = datetime.date(2018, 6, 13)

    # Converting the date to a string so it can be compared with the 'date' column
    recent_date = str(recent_date)

    # Returning recent date
    return recent_date


def past_date(yr: int = 5):
    """
    yr -- Number of years before the last day of data (13 June 2018)
    """
    # LAST DAY OF DATA - the day before the first game of World Cup 2018
    recent_date = datetime.date(2018, 6, 13)

    # Subtracting yr years from LAST DAY OF DATA
    past_year = recent_date.year - yr
    past_date = datetime.date(past_year, 6, 13)

    # As string as we want the output without the datetime definition before it
    return str(past_date)


In [172]:
pd.options.mode.chained_assignment = None  # default='warn' # Must be set in first line 

# Conditions to keep matches from the 5 years up to the day before the World Cup
data = df[(df.date <= recent_date()) & (df.date >= past_date(5))]

# Dropping 'neutral' column
data.drop(['neutral'],inplace=True, axis=1)
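
The filter above compares ISO date strings, which works because `YYYY-MM-DD` strings sort chronologically. A hedged alternative sketch, using a toy frame in place of `results.csv`, parses the column to real datetimes and filters with `between`:

```python
import pandas as pd

# Toy rows standing in for results.csv (only the date column is shown).
toy = pd.DataFrame({"date": ["2012-01-01", "2013-06-14", "2018-06-13", "2018-07-01"]})
toy["date"] = pd.to_datetime(toy["date"])

# between() is inclusive on both ends by default.
mask = toy["date"].between("2013-06-13", "2018-06-13")
window = toy.loc[mask].reset_index(drop=True)
print(window)
```

Parsing the dates first also guards against rows whose date strings are not zero-padded.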

In [173]:
# Creating list 'winner'
winner = []

# For loop appending the team name to the winner list depending on who has the greater score; if the scores are equal, 'Draw' is appended
for i in range(len(data['home_team'])):
    if data['home_score'].iloc[i] > data['away_score'].iloc[i]:
        winner.append(data['home_team'].iloc[i])
    elif data['home_score'].iloc[i] < data['away_score'].iloc[i]:
        winner.append(data['away_team'].iloc[i])
    else:
        winner.append('Draw')

# Creating 'winning_team' column and assigning the winner list values to it
data['winning_team'] = winner

# Creating goal_difference column
data['goal_difference'] = np.absolute(data['home_score'] - data['away_score'])
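
The row-by-row loop above can also be written without an explicit loop. The sketch below is one possible vectorised form, not the notebook's method; it uses three rows copied from the `data.head()` output and nested `np.where` calls.

```python
import numpy as np
import pandas as pd

# Three matches taken from the first rows of the cleaned data.
toy = pd.DataFrame({
    "home_team": ["Ghana", "Guatemala", "Libya"],
    "away_team": ["Ivory Coast", "Argentina", "Togo"],
    "home_score": [1, 0, 2],
    "away_score": [1, 4, 0],
})

home_wins = toy["home_score"] > toy["away_score"]
away_wins = toy["home_score"] < toy["away_score"]

# Home name where the home side won, away name where the away side won, else 'Draw'.
toy["winning_team"] = np.where(
    home_wins, toy["home_team"], np.where(away_wins, toy["away_team"], "Draw")
)
toy["goal_difference"] = (toy["home_score"] - toy["away_score"]).abs()
print(toy)
```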

In [174]:
# Resetting index
data.reset_index(drop=True, inplace=True)

# Viewing first 5 rows of new data
data.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference
0,2013-06-14,Ghana,Ivory Coast,1.0,1.0,Friendly,Obuasi,Ghana,Draw,0.0
1,2013-06-14,Guatemala,Argentina,0.0,4.0,Friendly,Guatemala,Guatemala,Argentina,4.0
2,2013-06-14,Libya,Togo,2.0,0.0,FIFA World Cup qualification,Tripoli,Libya,Libya,2.0
3,2013-06-14,Moldova,Kyrgyzstan,2.0,1.0,Friendly,Tiraspol,Moldova,Moldova,1.0
4,2013-06-15,Botswana,Central African Republic,3.0,2.0,FIFA World Cup qualification,Lobatse,Botswana,Botswana,1.0


In [175]:
# Dropping columns that will not affect match outcomes
df_teams = data.drop(['date', 'home_score', 'away_score', 'tournament', 'city', 'country', 'goal_difference'], axis=1)

In [176]:
# 'winning_team' labels: '2' if the home team has won, '1' if it was a draw, and '0' if the away team has won.

df_teams.loc[df_teams.winning_team == df_teams.home_team,'winning_team']=2
df_teams.loc[df_teams.winning_team == 'Draw', 'winning_team']=1
df_teams.loc[df_teams.winning_team == df_teams.away_team, 'winning_team']=0
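
As a cross-check on the 2/1/0 labels, the same encoding can be derived straight from the scores. This is a small sketch under the assumption that the scores are still available; it relies on `np.sign` returning 1, 0 or -1 for the score difference.

```python
import numpy as np
import pandas as pd

# Scores for Ghana 1-1 Ivory Coast, Guatemala 0-4 Argentina, Libya 2-0 Togo.
scores = pd.DataFrame({"home_score": [1, 0, 2], "away_score": [1, 4, 0]})

# sign(diff) is 1 (home win), 0 (draw) or -1 (away win); adding 1 gives 2, 1, 0.
labels = (np.sign(scores["home_score"] - scores["away_score"]) + 1).astype(int)
print(labels.tolist())
```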

In [177]:
# Viewing datatypes of columns
df_teams.dtypes

home_team       object
away_team       object
winning_team    object
dtype: object

In [178]:
# Viewing shape of data
df_teams.shape

(4666, 3)

In [179]:
# Initial check for null values using info
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4666 entries, 0 to 4665
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   home_team     4666 non-null   object
 1   away_team     4666 non-null   object
 2   winning_team  4666 non-null   object
dtypes: object(3)
memory usage: 109.5+ KB


In [180]:
# 2nd check for number of null values
df_teams.isnull().sum()

home_team       0
away_team       0
winning_team    0
dtype: int64

In [181]:
"""
Not necessary as there will be too many

# Checking distribution of categorical fields 
print(df_teams.home_team.value_counts())
print(df_teams.away_team.value_counts())
print(df_teams.winning_team.value_counts())

"""

'\nNot necessary as there will be too many\n\n# Checking distribution of categorical fields \nprint(df_teams.home_team.value_counts())\nprint(df_teams.away_team.value_counts())\nprint(df_teams.winning_team.value_counts())\n\n'

In [182]:
# Getting summary of statistics pertaining to the DataFrame columns
df_teams.describe()

Unnamed: 0,home_team,away_team,winning_team
count,4666,4666,4666
unique,269,269,3
top,United States,Uganda,2
freq,51,44,2207


In [183]:
# Converting home_team and away_team from categorical variables to numeric inputs
# Get dummy variables
# Converting categorical features to one-hot encoding | drop_first is not set here, so every team keeps its own column
final = pd.get_dummies(df_teams, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Viewing first 5 rows
final.head()

Unnamed: 0,winning_team,home_team_Abkhazia,home_team_Afghanistan,home_team_Albania,home_team_Alderney,home_team_Algeria,home_team_American Samoa,home_team_Andorra,home_team_Angola,home_team_Anguilla,...,away_team_Wales,away_team_Western Armenia,away_team_Western Isles,away_team_Yemen,away_team_Ynys Môn,away_team_Yorkshire,away_team_Zambia,away_team_Zanzibar,away_team_Zimbabwe,away_team_Åland Islands
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
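
To make the encoding step concrete, here is a toy version with two matches. The frame is illustrative only, and `dtype=int` is passed so the indicators print as 0/1, as in the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    "home_team": ["Ghana", "Libya"],
    "away_team": ["Ivory Coast", "Togo"],
    "winning_team": [1, 2],
})

encoded = pd.get_dummies(
    toy,
    prefix=["home_team", "away_team"],
    columns=["home_team", "away_team"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Each row ends up with exactly two indicator columns set to 1: one for the home team and one for the away team.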


### **----------------------------------  3. ML Model building  -----------------------------------**

In [184]:
# X features 
X = final.drop(['winning_team'], axis=1)

# y Target
y = final["winning_team"]

# Assigning y as type int
y = y.astype('int')

In [185]:
# Splitting data into training and test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [186]:
# Check to see if data split correctly 
print(X_train.shape)
print(X_test.shape)

(3732, 538)
(934, 538)
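
The shapes above can be checked by hand. The sketch below assumes the test share is rounded up to a whole number of rows, which is consistent with the printed shapes; the counts come from the `shape` and `describe()` outputs earlier.

```python
import math

n_samples = 4666   # rows in df_teams
n_teams = 269      # unique values in each of home_team and away_team
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # 933.2 rounded up
n_train = n_samples - n_test
n_features = 2 * n_teams                   # one-hot columns for home and away

print(n_train, n_test, n_features)
```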


#### Logistic Regression

In [187]:
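# Implementing Logistic Regression
model = LogisticRegression()

# NOTE: the model is fitted on the full dataset (X, y), which includes the test rows,
# so the test set accuracy below is not a true held-out estimate.
# Fit on (X_train, y_train) to obtain one.
model.fit(X, y)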

LogisticRegression()

In [188]:
score1 = model.score(X_train, y_train)
score2 = model.score(X_test, y_test)

print("Training set accuracy: ", '%.3f'%(score1))
print("Test set accuracy: ", '%.3f'%(score2))

Training set accuracy:  0.648
Test set accuracy:  0.657
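
For context, these accuracies can be compared with a naive baseline that always predicts the most common label. The figures below are taken from the `describe()` output (label 2, a home win, occurs 2207 times in 4666 matches); the comparison itself is an added sketch.

```python
n_samples = 4666
home_wins = 2207          # frequency of the most common label (2)
test_accuracy = 0.657     # value printed above

baseline = home_wins / n_samples
print(f"Always predicting a home win: {baseline:.3f}")
print(f"Improvement over baseline:    {test_accuracy - baseline:.3f}")
```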


In [189]:
# List for storing the group stage games
pred_set = []

In [190]:
fixtures.head()

Unnamed: 0,MatchNumber,RoundNumber,DateUtc,Location,HomeTeam,AwayTeam,Group,HomeTeamScore,AwayTeamScore
0,1,1,2018-06-14 15:00:00Z,"Luzhniki Stadium, Moscow",Russia,Saudi Arabia,Group A,5,0
1,2,1,2018-06-15 12:00:00Z,Ekaterinburg Stadium,Egypt,Uruguay,Group A,0,1
2,3,1,2018-06-15 15:00:00Z,Saint Petersburg Stadium,Morocco,Iran,Group B,0,1
3,4,1,2018-06-15 18:00:00Z,"Fisht Stadium, Sochi",Portugal,Spain,Group B,3,3
4,5,1,2018-06-16 10:00:00Z,Kazan Arena,France,Australia,Group C,2,1


In [191]:
# Create new columns with ranking position of each team
fixtures.insert(1, 'first_position', fixtures['HomeTeam'].map(ranking.set_index('Team')['Position']))
fixtures.insert(2, 'second_position', fixtures['AwayTeam'].map(ranking.set_index('Team')['Position']))

# We only need the group stage games, so we have to slice the dataset
fixtures = fixtures.iloc[:48, :]
fixtures.tail()

Unnamed: 0,MatchNumber,first_position,second_position,RoundNumber,DateUtc,Location,HomeTeam,AwayTeam,Group,HomeTeamScore,AwayTeamScore
43,44,6,25,3,2018-06-27 18:00:00Z,Nizhny Novgorod Stadium,Switzerland,Costa Rica,Group E,2,2
44,45,60,10,3,2018-06-28 14:00:00Z,Volgograd Stadium,Japan,Poland,Group H,0,1
45,46,28,16,3,2018-06-28 14:00:00Z,Samara Stadium,Senegal,Colombia,Group H,0,1
46,47,55,14,3,2018-06-28 18:00:00Z,Saransk Stadium,Panama,Tunisia,Group G,1,2
47,48,13,3,3,2018-06-28 18:00:00Z,Kaliningrad Stadium,England,Belgium,Group G,0,1


In [192]:
# Loop to add teams to new prediction dataset based on the ranking position of each team
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'home_team': row['HomeTeam'], 'away_team': row['AwayTeam'], 'winning_team': None})
    else:
        pred_set.append({'home_team': row['AwayTeam'], 'away_team': row['HomeTeam'], 'winning_team': None})
        
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set

pred_set.head()

Unnamed: 0,home_team,away_team,winning_team
0,Russia,Saudi Arabia,
1,Uruguay,Egypt,
2,Iran,Morocco,
3,Portugal,Spain,
4,France,Australia,
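
The ranking lookup and the home/away swap can be sketched in vectorised form. This is an illustrative alternative to the `iterrows()` loop, using three fixtures and the positions shown in the `fixtures.tail()` output; the `_toy` names are placeholders.

```python
import numpy as np
import pandas as pd

# Positions copied from the fixtures.tail() output.
ranking_toy = pd.DataFrame({
    "Team": ["Switzerland", "Costa Rica", "Japan", "Poland", "England", "Belgium"],
    "Position": [6, 25, 60, 10, 13, 3],
})
fixtures_toy = pd.DataFrame({
    "HomeTeam": ["Switzerland", "Japan", "England"],
    "AwayTeam": ["Costa Rica", "Poland", "Belgium"],
})

position = ranking_toy.set_index("Team")["Position"]
first = fixtures_toy["HomeTeam"].map(position)
second = fixtures_toy["AwayTeam"].map(position)

# The better-ranked side (smaller position number) is treated as the 'home' team.
better_first = (first < second).to_numpy()
pred_toy = pd.DataFrame({
    "home_team": np.where(better_first, fixtures_toy["HomeTeam"], fixtures_toy["AwayTeam"]),
    "away_team": np.where(better_first, fixtures_toy["AwayTeam"], fixtures_toy["HomeTeam"]),
})
print(pred_toy)
```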


In [193]:
pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Add missing columns compared to the model's training dataset
missing_cols = set(final.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[final.columns]

# Remove winning team column
pred_set = pred_set.drop(['winning_team'], axis=1)

pred_set.head()


Unnamed: 0,home_team_Abkhazia,home_team_Afghanistan,home_team_Albania,home_team_Alderney,home_team_Algeria,home_team_American Samoa,home_team_Andorra,home_team_Angola,home_team_Anguilla,home_team_Antigua and Barbuda,...,away_team_Wales,away_team_Western Armenia,away_team_Western Isles,away_team_Yemen,away_team_Ynys Môn,away_team_Yorkshire,away_team_Zambia,away_team_Zanzibar,away_team_Zimbabwe,away_team_Åland Islands
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
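
Adding several hundred missing columns one at a time can fragment the frame, and pandas may warn about it. A possible alternative, shown here on a toy example with a hypothetical `train_columns` list, is to align the prediction frame in a single `reindex` call:

```python
import pandas as pd

# Hypothetical subset of the training columns (winning_team already removed).
train_columns = [
    "home_team_France",
    "home_team_Russia",
    "away_team_Australia",
    "away_team_Saudi Arabia",
]

toy = pd.DataFrame({"home_team": ["Russia"], "away_team": ["Saudi Arabia"]})
encoded = pd.get_dummies(toy, columns=["home_team", "away_team"], dtype=int)

# Missing columns are created and filled with 0; column order follows train_columns.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Columns that appear in the prediction frame but not in the training frame are dropped by `reindex`, so the result always has exactly the training layout.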


In [194]:
# Group matches
predictions = model.predict(pred_set)
# Computing the class probabilities once instead of three times per match
probabilities = model.predict_proba(pred_set)
# NOTE: label 2 encodes a home_team win and home_team is column 0 of backup_pred_set,
# yet column 1 (away_team) is printed for label 2; check the column order before
# relying on the names printed below.
for i in range(fixtures.shape[0]):
    print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
    if predictions[i] == 2:
        print("Winner: " + backup_pred_set.iloc[i, 1])
    elif predictions[i] == 1:
        print("Draw")
    elif predictions[i] == 0:
        print("Winner: " + backup_pred_set.iloc[i, 0])
    print('Probability of ' + backup_pred_set.iloc[i, 1] + ' winning: ', '%.3f'%(probabilities[i][2]))
    print('Probability of Draw: ', '%.3f'%(probabilities[i][1]))
    print('Probability of ' + backup_pred_set.iloc[i, 0] + ' winning: ', '%.3f'%(probabilities[i][0]))
    print("")

Saudi Arabia and Russia
Winner: Saudi Arabia
Probability of Saudi Arabia winning:  0.583
Probability of Draw:  0.232
Probability of Russia winning:  0.185

Egypt and Uruguay
Winner: Egypt
Probability of Egypt winning:  0.743
Probability of Draw:  0.183
Probability of Uruguay winning:  0.074

Morocco and Iran
Winner: Morocco
Probability of Morocco winning:  0.516
Probability of Draw:  0.285
Probability of Iran winning:  0.199

Spain and Portugal
Winner: Portugal
Probability of Spain winning:  0.341
Probability of Draw:  0.250
Probability of Portugal winning:  0.408

Australia and France
Winner: Australia
Probability of Australia winning:  0.723
Probability of Draw:  0.214
Probability of France winning:  0.063

Iceland and Argentina
Winner: Iceland
Probability of Iceland winning:  0.729
Probability of Draw:  0.167
Probability of Argentina winning:  0.103

Denmark and Peru
Winner: Denmark
Probability of Denmark winning:  0.569
Probability of Draw:  0.261
Probability of Peru winning:  0.17
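
When reading these probabilities, it helps to tie each class index back to the label scheme defined earlier (2 = home win, 1 = draw, 0 = away win). The sketch below assumes the probability columns follow the sorted labels 0, 1, 2, and uses the first row of `pred_set.head()` (home_team Russia, away_team Saudi Arabia) with the three probabilities printed for that match.

```python
# Assumed to mirror model.classes_ for the labels 0 / 1 / 2.
classes = [0, 1, 2]

# Probabilities for the first match, ordered by class index.
proba = [0.185, 0.232, 0.583]

# First row of pred_set.head(): the better-ranked side sits in home_team.
home_team, away_team = "Russia", "Saudi Arabia"

outcome = {0: f"{away_team} win", 1: "Draw", 2: f"{home_team} win"}

# Index of the largest probability, mapped back to a readable outcome.
best = max(range(len(proba)), key=proba.__getitem__)
print(outcome[classes[best]], proba[best])
```

Read this way, the 0.583 belongs to the team in the `home_team` column, so it is worth confirming which column each printed name is taken from.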

In [207]:
# Round of 16 matches as a list of (team, team) tuples
group_16 = [('Denmark', 'Iceland'),
            ('Saudi Arabia', 'Portugal'),
            ('Morocco', 'Russia'),
            ('Nigeria', 'Australia'),
            ('Serbia', 'Sweden'),
            ('Panama', 'Poland'),
            ('Korea Republic', 'Costa Rica'),
            ('Japan', 'Tunisia')]

In [210]:
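def clean_and_predict(matches, ranking, final, model):
    """
    matches -- list of (team, team) tuples
    ranking -- DataFrame with 'Team' and 'Position' columns
    final   -- one-hot encoded training DataFrame, used to align the columns
    model   -- fitted classifier
    """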

    # Initialization of auxiliary list for data cleaning
    positions = []

    # Loop to retrieve each team's position according to FIFA ranking
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0],'Position'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1],'Position'].iloc[0])
    
    # Creating the DataFrame for prediction
    pred_set = []

    # Initializing iterators for while loop
    i = 0
    j = 0

    # 'i' will be the iterator for the 'positions' list, and 'j' for the list of matches (list of tuples)
    while i < len(positions):
        dict1 = {}

        # If the position of the first team is better, it will be the 'home' team, and vice-versa
        if positions[i] < positions[i + 1]:
            dict1.update({'home_team': matches[j][0], 'away_team': matches[j][1]})
        else:
            dict1.update({'home_team': matches[j][1], 'away_team': matches[j][0]})

        # Append updated dictionary to the list, that will later be converted into a DataFrame
        pred_set.append(dict1)
        i += 2
        j += 1

    # Convert list into DataFrame
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables and drop winning_team column
    pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

    # Add missing columns compared to the model's training dataset
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    # Remove winning team column
    pred_set = pred_set.drop(['winning_team'], axis=1)

    # Predict!
    predictions = model.predict(pred_set)
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        if predictions[i] == 2:
            print("Winner: " + backup_pred_set.iloc[i, 1])
        elif predictions[i] == 1:
            print("Draw")
        elif predictions[i] == 0:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        #print('Probability of ' + backup_pred_set.iloc[i, 1] + ' winning: ' , '%.3f'%(model.predict_proba(pred_set)[i][2]))
        #print('Probability of Draw: ', '%.3f'%(model.predict_proba(pred_set)[i][1])) 
        #print('Probability of ' + backup_pred_set.iloc[i, 0] + ' winning: ', '%.3f'%(model.predict_proba(pred_set)[i][0]))
        print("")
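
The two-counter `while` loop inside `clean_and_predict` can be simplified by looking positions up in a dictionary and iterating over the matches directly. This is a sketch of that idea only; `order_by_rank` is a hypothetical helper, and the positions are those shown earlier in the `fixtures.tail()` output.

```python
def order_by_rank(matches, position):
    """Put the better-ranked team (smaller position) in the 'home_team' slot."""
    ordered = []
    for team_a, team_b in matches:
        if position[team_a] < position[team_b]:
            ordered.append({"home_team": team_a, "away_team": team_b})
        else:
            ordered.append({"home_team": team_b, "away_team": team_a})
    return ordered


position = {"Japan": 60, "Poland": 10, "England": 13, "Belgium": 3}
rows = order_by_rank([("Japan", "Poland"), ("Belgium", "England")], position)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` exactly as in the function above.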

In [211]:
clean_and_predict(group_16, ranking, final, model)

Iceland and Denmark
Winner: Iceland

Saudi Arabia and Portugal
Winner: Saudi Arabia

Russia and Morocco
Winner: Russia

Nigeria and Australia
Winner: Nigeria

Serbia and Sweden
Winner: Serbia

Panama and Poland
Winner: Panama

Korea Republic and Costa Rica
Winner: Korea Republic

Japan and Tunisia
Winner: Japan




In [212]:
# List of matches
quarters = [('Iceland', 'Saudi Arabia'),
            ('Serbia', 'Panama'),
            ('Korea Republic', 'England'),
            ('Japan', 'Nigeria')]

In [213]:
clean_and_predict(quarters, ranking, final, model)


Saudi Arabia and Iceland
Winner: Saudi Arabia

Panama and Serbia
Winner: Panama

Korea Republic and England
Winner: Korea Republic

Japan and Nigeria
Winner: Japan



In [214]:
# List of matches
semi = [('Saudi Arabia', 'Panama'),
        ('Korea Republic', 'Japan')]

In [215]:
clean_and_predict(semi, ranking, final, model)

Saudi Arabia and Panama
Winner: Saudi Arabia

Korea Republic and Japan
Winner: Korea Republic




In [216]:
# Finals
finals = [('Korea Republic', 'Saudi Arabia')]

In [217]:
clean_and_predict(finals, ranking, final, model)

Saudi Arabia and Korea Republic
Winner: Saudi Arabia




In [204]:
# Based on the teams that participated in the 2018 World Cup
# Narrowing to teams participating in the World Cup
# NOTE: ' Iran' carries a leading space, so isin() will not match a team stored as 'Iran'
worldcup_teams = ['Australia', ' Iran', 'Japan', 'Korea Republic', 
            'Saudi Arabia', 'Egypt', 'Morocco', 'Nigeria', 
            'Senegal', 'Tunisia', 'Costa Rica', 'Mexico', 
            'Panama', 'Argentina', 'Brazil', 'Colombia', 
            'Peru', 'Uruguay', 'Belgium', 'Croatia', 
            'Denmark', 'England', 'France', 'Germany', 
            'Iceland', 'Poland', 'Portugal', 'Russia', 
            'Serbia', 'Spain', 'Sweden', 'Switzerland']
data_teams_home = data[data['home_team'].isin(worldcup_teams)]
data_teams_away = data[data['away_team'].isin(worldcup_teams)]
data_teams = pd.concat((data_teams_home, data_teams_away))
# NOTE: drop_duplicates() returns a new DataFrame; the result is not assigned here,
# so matches between two World Cup teams are still counted twice below
data_teams.drop_duplicates()
data_teams.count()

date               1822
home_team          1822
away_team          1822
home_score         1822
away_score         1822
tournament         1822
city               1822
country            1822
winning_team       1822
goal_difference    1822
dtype: int64
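
Selecting matches where either side is a World Cup team can also be done with one boolean mask, which keeps each match exactly once. The toy frame and team list below are illustrative, and the concatenation approach is included only for comparison.

```python
import pandas as pd

teams = ["France", "Croatia"]
toy = pd.DataFrame({
    "home_team": ["France", "France", "Wales", "Peru"],
    "away_team": ["Croatia", "Wales", "Croatia", "Chile"],
})

# One mask: True when the home side OR the away side is in the list.
mask = toy["home_team"].isin(teams) | toy["away_team"].isin(teams)
either = toy[mask]

# Concatenating the two filtered frames repeats France v Croatia.
stacked = pd.concat((toy[toy["home_team"].isin(teams)], toy[toy["away_team"].isin(teams)]))

print(len(either), len(stacked))
```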

In [205]:
# Scrape teams that played in WORLD CUP 2018 and just keep the teams column

# Based on participation last year, these are the only teams that can 'win'


url = "https://fbref.com/en/comps/1/FIFA-World-Cup-Stats"
dfs = pd.read_html(url)

# For loop to locate the desired table with more than 15 rows | Storing it as table
# (loop variable named candidate so it does not overwrite the results dataframe df)
for candidate in dfs:
    if len(candidate) > 15:
        table = candidate
        break
WCnations = table 

WCnations = WCnations['Squad']


WCnations.dropna(inplace=True)
WCnations= WCnations.str.split(" ", n = 1, expand = True) 
WCnations.rename({1 :'Nations'},axis=1, inplace=True)
WCnations.drop([0], axis=1, inplace=True)
WCnations.reset_index(drop=True, inplace=True)
WCnations

Unnamed: 0,Nations
0,France
1,Croatia
2,Belgium
3,England
4,Uruguay
5,Brazil
6,Sweden
7,Russia
8,Colombia
9,Denmark


In [206]:
# Project will be completed 14th June when fixtures are released 