# Project Introduction

This notebook aims to predict football match outcomes using machine learning algorithms. It outlines the goals and the dataset to be used.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

# Data Preparation

The following cells are dedicated to importing the dataset, initial data inspection, and cleaning.

In [2]:
# We will now import the dataset which I will call df
df= pd.read_csv(r"C:\Users\santo\OneDrive\Documents\R unit 2\Programming\Matches.csv")


In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
0,0,2023-08-13,16:30,Premier League,Matchweek 1,Sun,Away,D,1.0,1.0,...,Match Report,,13.0,1.0,17.8,0.0,0,0.0,2024,Liverpool
1,1,2023-08-19,15:00,Premier League,Matchweek 2,Sat,Home,W,3.0,1.0,...,Match Report,,25.0,9.0,16.8,1.0,0,1.0,2024,Liverpool
2,2,2023-08-27,16:30,Premier League,Matchweek 3,Sun,Away,W,2.0,1.0,...,Match Report,,9.0,4.0,17.2,1.0,0,0.0,2024,Liverpool
3,3,2023-09-03,14:00,Premier League,Matchweek 4,Sun,Home,W,3.0,0.0,...,Match Report,,17.0,4.0,14.7,0.0,0,0.0,2024,Liverpool
4,4,2023-09-16,12:30,Premier League,Matchweek 5,Sat,Away,W,3.0,1.0,...,Match Report,,16.0,5.0,15.8,0.0,0,0.0,2024,Liverpool


In [4]:
df.columns

Index(['Unnamed: 0', 'date', 'time', 'comp', 'round', 'day', 'venue', 'result',
       'gf', 'ga', 'opponent', 'xg', 'xga', 'poss', 'attendance', 'captain',
       'formation', 'referee', 'match report', 'notes', 'sh', 'sot', 'dist',
       'fk', 'pk', 'pkatt', 'season', 'team'],
      dtype='object')

In [5]:
df.info()
df
df['result'] = df['result'].replace(['L', 'D','W'], ['0', '1', '3'])
df.describe()
df.duplicated().sum()
df= df.apply(pd.to_numeric, errors='ignore')
df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12596 entries, 0 to 12595
Data columns (total 28 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    12596 non-null  int64  
 1   date          12596 non-null  object 
 2   time          6516 non-null   object 
 3   comp          12596 non-null  object 
 4   round         12596 non-null  object 
 5   day           12596 non-null  object 
 6   venue         12596 non-null  object 
 7   result        12596 non-null  object 
 8   gf            12596 non-null  float64
 9   ga            12596 non-null  float64
 10  opponent      12596 non-null  object 
 11  xg            4996 non-null   float64
 12  xga           4996 non-null   float64
 13  poss          6516 non-null   float64
 14  attendance    10670 non-null  float64
 15  captain       5756 non-null   object 
 16  formation     11086 non-null  object 
 17  referee       12596 non-null  object 
 18  match report  12596 non-nu

Unnamed: 0        int64
date             object
time             object
comp             object
round            object
day              object
venue            object
result            int64
gf              float64
ga              float64
opponent         object
xg              float64
xga             float64
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh              float64
sot             float64
dist            float64
fk              float64
pk                int64
pkatt           float64
season            int64
team             object
dtype: object

# Feature Engineering

Adding new features or modifying existing ones to improve the model's predictive power.

In [6]:
df.fillna(0, inplace=True)

In [7]:
df["date"] = pd.to_datetime(df["date"])
df["target"] = df["result"] 
df=df.drop(['comp', 'notes',], axis=1)

In [8]:
# Change the date column from string to datetime format
df["date"] = pd.to_datetime(df["date"])
# Define the target variable as 'result'
df["target"] = df["result"]
# Create variables for analysis
df["venue_code"] = df["venue"].astype("category").cat.codes
df["opp_code"] = df["opponent"].astype("category").cat.codes
# Fill NaN values in the 'time' column with 0
df["time"].fillna('0', inplace=True)
# Convert 'time' values to string, replace unwanted characters, and then convert to int
df["hour"] = df["time"].astype(str).str.replace(":.+", "", regex=True).astype(int)
# Create the 'day_code' variable to determine the day of the week
df["day_code"] = df["date"].dt.dayofweek
# Display the modified DataFrame
df.head()


Unnamed: 0.1,Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour,day_code
0,0,2023-08-13,16:30,Matchweek 1,Sun,Away,1,1.0,1.0,Chelsea,...,0.0,0,0.0,2024,Liverpool,1,0,12,16,6
1,1,2023-08-19,15:00,Matchweek 2,Sat,Home,3,3.0,1.0,Bournemouth,...,1.0,0,1.0,2024,Liverpool,3,1,6,15,5
2,2,2023-08-27,16:30,Matchweek 3,Sun,Away,3,2.0,1.0,Newcastle Utd,...,1.0,0,0.0,2024,Liverpool,3,0,26,16,6
3,3,2023-09-03,14:00,Matchweek 4,Sun,Home,3,3.0,0.0,Aston Villa,...,0.0,0,0.0,2024,Liverpool,3,1,1,14,6
4,4,2023-09-16,12:30,Matchweek 5,Sat,Away,3,3.0,1.0,Wolves,...,0.0,0,0.0,2024,Liverpool,3,0,42,12,5


# Model Training

Training the machine learning model and evaluating its performance.

In [9]:
#we will now be importing random forests
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
train = df[df["date"] < '2023-08-11']
#Test will be the data after 2021-08-13 which is the season we want to predict
test = df[df["date"] >= '2023-08-11']
#These are the predictors we will use in order to predict the results of the games
predictors = ["venue_code", "opp_code", "hour", "day_code","poss","xg","xga"]
rf.fit(train[predictors], train["target"])
preds = rf.predict(test[predictors])
#We will import accuracy score to check on the accuracy of our model
from sklearn.metrics import accuracy_score
accuracy= accuracy_score(test["target"], preds)
accuracy
#We will  create a new data frame to compare the results predicted with the actual ones
actual_vs_predicted=pd.DataFrame(dict(actual=test["target"], predicted=preds))
#This shows how many of each result was predicted
pd.crosstab(index=actual_vs_predicted["actual"], columns=actual_vs_predicted["predicted"])


predicted,0,1,3
actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,136,7,31
1,34,12,42
3,35,9,130


In [10]:
accuracy

0.6376146788990825

In [11]:
# we will import precision score
from sklearn.metrics import precision_score
#The precision is very similar to our accuracy
precision_score(test["target"], preds,average='micro')
group_results= df.groupby("team")
grouped_matches = df.groupby("team")
group = grouped_matches.get_group("Liverpool").sort_values("date")
group


Unnamed: 0.1,Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour,day_code
11912,1,2006-08-19,0,Matchweek 1,Sat,Away,1,1.0,1.0,Sheffield Utd,...,0.0,1,0.0,2015,Liverpool,1,0,32,0,5
11913,3,2006-08-26,0,Matchweek 3,Sat,Home,3,2.0,1.0,West Ham,...,0.0,0,0.0,2015,Liverpool,3,1,40,0,5
11914,4,2006-09-09,0,Matchweek 4,Sat,Away,0,0.0,3.0,Everton,...,0.0,0,0.0,2015,Liverpool,0,0,15,0,5
11915,6,2006-09-17,0,Matchweek 5,Sun,Away,0,0.0,1.0,Chelsea,...,0.0,0,0.0,2015,Liverpool,0,0,12,0,6
11916,7,2006-09-20,0,Matchweek 2,Wed,Home,3,2.0,0.0,Newcastle Utd,...,0.0,0,0.0,2015,Liverpool,3,1,26,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17,26,2023-12-23,17:30,Matchweek 18,Sat,Home,1,1.0,1.0,Arsenal,...,1.0,0,0.0,2024,Liverpool,1,1,0,17,5
18,27,2023-12-26,17:30,Matchweek 19,Tue,Away,3,2.0,0.0,Burnley,...,0.0,0,0.0,2024,Liverpool,3,0,9,17,1
19,28,2024-01-01,20:00,Matchweek 20,Mon,Home,3,4.0,2.0,Newcastle Utd,...,0.0,1,2.0,2024,Liverpool,3,1,26,20,0
20,31,2024-01-21,16:30,Matchweek 21,Sun,Away,3,4.0,0.0,Bournemouth,...,0.0,0,0.0,2024,Liverpool,3,0,6,16,6


In [12]:
#we will now be using rolling averages for the columns "gf", "ga", "sh", "sot", "pk" 
#with the aim to improve the quality of our predictions
columns = ["gf", "ga", "sh", "sot", "pk"]
columns
new_columns = [f"{c}_rolling" for c in columns]
new_columns


['gf_rolling', 'ga_rolling', 'sh_rolling', 'sot_rolling', 'pk_rolling']

In [13]:
# We will now be using the funtion rolling_averages with the values from group, columns and new_columns
def rolling_averages(group, columns, new_columns):
    group = group.sort_values("date")
    rolling_stats = group[columns].rolling(3, closed='left').mean()
    group[new_columns] = rolling_stats
    group = group.dropna(subset=new_columns)
    return group

rolling_averages(group, columns, new_columns)
df_rolling = df.groupby("team").apply(lambda x: rolling_averages(x, columns, new_columns))
df_rolling
df_rolling = df_rolling.droplevel('team')
df_rolling

df_rolling.index = range(df_rolling.shape[0])
df_rolling

Unnamed: 0.1,Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,...,target,venue_code,opp_code,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,pk_rolling
0,6,2006-09-17,0,Matchweek 5,Sun,Away,3,1.0,0.0,Manchester Utd,...,3,0,24,0,6,0.666667,1.000000,0.000000,0.000000,0.333333
1,7,2006-09-23,0,Matchweek 6,Sat,Home,3,3.0,0.0,Sheffield Utd,...,3,1,32,0,5,0.666667,0.666667,0.000000,0.000000,0.333333
2,9,2006-09-30,0,Matchweek 7,Sat,Away,3,2.0,1.0,Charlton Ath,...,3,0,11,0,5,1.666667,0.333333,0.000000,0.000000,0.333333
3,10,2006-10-14,0,Matchweek 8,Sat,Home,3,3.0,0.0,Watford,...,3,1,38,0,5,2.000000,0.333333,0.000000,0.000000,0.000000
4,12,2006-10-22,0,Matchweek 9,Sun,Away,3,4.0,0.0,Reading,...,3,0,31,0,6,2.666667,0.333333,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12462,19,2023-12-24,13:00,Matchweek 18,Sun,Home,3,2.0,1.0,Chelsea,...,3,1,12,13,6,0.666667,1.333333,10.333333,3.666667,0.000000
12463,20,2023-12-27,19:30,Matchweek 19,Wed,Away,3,4.0,1.0,Brentford,...,3,0,7,19,2,1.000000,1.666667,12.666667,4.333333,0.000000
12464,21,2023-12-30,15:00,Matchweek 20,Sat,Home,3,3.0,0.0,Everton,...,3,1,15,15,5,2.000000,1.666667,13.000000,4.666667,0.000000
12465,24,2024-01-22,19:45,Matchweek 21,Mon,Away,1,0.0,0.0,Brighton,...,1,0,8,19,0,3.000000,0.666667,12.333333,5.666667,0.000000


# Making Predictions

Applying the model to make predictions on unseen data and creating a final evaluation table.

In [14]:
def make_predictions(df, predictors):
    train = df[df["date"] < '2023-08-12']
    test = df[df["date"] >= '2023-08-11']
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    combined = pd.DataFrame(dict(actual=test["target"], predicted=preds), index=test.index)
    acc = precision_score(test["target"], preds,average='micro')
    return combined, acc

In [15]:
make_predictions(df, predictors)
#We will now be creating a combined column with the actual results and the predicted ones
combined, acc = make_predictions(df_rolling, predictors + new_columns)
acc
combined = combined.merge(df_rolling[["date", "team", "opponent", "result"]], left_index=True, right_index=True)
combined
combined.head(10)


Unnamed: 0,actual,predicted,date,team,opponent,result
605,3,0,2023-08-12,Arsenal,Nott'ham Forest,3
606,3,3,2023-08-21,Arsenal,Crystal Palace,3
607,1,3,2023-08-26,Arsenal,Fulham,1
608,3,3,2023-09-03,Arsenal,Manchester Utd,3
609,3,3,2023-09-17,Arsenal,Everton,3
610,1,3,2023-09-24,Arsenal,Tottenham,1
611,3,3,2023-09-30,Arsenal,Bournemouth,3
612,3,0,2023-10-08,Arsenal,Manchester City,3
613,1,0,2023-10-21,Arsenal,Chelsea,1
614,3,3,2023-10-28,Arsenal,Sheffield Utd,3


In [16]:
#The following is very important as we will be changing the names or abbreviations of some teams to just one common abreviation, this is to prevent inconsistencies as some teams have different names in the dataset
class MissingDict(dict):
    __missing__ = lambda self, key: key

map_values = {"Brighton and Hove Albion": "Brighton", "Manchester United": "Manchester Utd", "Newcastle United": "Newcastle Utd", "Tottenham Hotspur": "Tottenham", "West Ham United": "West Ham", "Wolverhampton Wanderers": "Wolves","Sheffield United": "Sheffield Utd", "Charlton Athletic": "Charlton Ath","West Bromwich Albion": "West Bromw", "Queens Park Rangers": "QPR", "Nottingham Forest": "Nott'ham Forest"} 
mapping = MissingDict(**map_values)

combined["new_team"] = combined["team"].map(mapping)

combined["predicted points"]=combined["predicted"]
combined["actual points"]=combined["actual"]
merged = combined.merge(combined, left_on=["date", "new_team"], right_on=["date", "opponent"])

In [17]:
new_df = combined[['team','predicted points','actual points']].copy()
new_df['predicted points'] = pd.to_numeric(new_df["predicted points"]) 
new_df['actual points'] = pd.to_numeric(new_df["actual points"]) 

In [18]:
final_table=new_df.groupby('team').sum() 

In [19]:
final_table.loc['Everton'] -= 10

In [20]:
# Sort the DataFrame based on 'predicted points' column in descending order
sorted_table = final_table.sort_values(by='predicted points', ascending=False)
# Sort the 'predicted points' column in descending order
predicted_points_table = sorted_table[['predicted points']].sort_values(by='predicted points', ascending=False)
# Sort the 'actual points' column in descending order
actual_points_table = sorted_table[['actual points']].sort_values(by='actual points', ascending=False)
# Display the sorted 'predicted points' table
print("Predicted Points Table:")
print(predicted_points_table)
# Display the sorted 'actual points' table
print("\nActual Points Table:")
print(actual_points_table)



Predicted Points Table:
                          predicted points
team                                      
Arsenal                                 55
Manchester City                         54
Liverpool                               53
Tottenham Hotspur                       51
Newcastle United                        43
Aston Villa                             43
Chelsea                                 42
Brentford                               39
Crystal Palace                          33
Everton                                 32
Brighton and Hove Albion                30
West Ham United                         28
Manchester United                       27
Bournemouth                             25
Wolverhampton Wanderers                 22
Fulham                                  18
Nottingham Forest                       16
Luton Town                              16
Sheffield United                         9
Burnley                                  5

Actual Points Table:
        

# Conclusion

Summarizing the notebook's findings and discussing potential next steps for further analysis.