<a href="https://www.kaggle.com/code/sabahao/pl-randomforest?scriptVersionId=147010955" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Predicting Premier League Football Matches with Random Forest**

Federico Sabatini (马浩) - East China University of Science and Technology

In the world of sports, predicting match outcomes is an enticing challenge. With the vast amount of data available, machine learning algorithms provide a promising approach to unravel patterns and make accurate predictions. In this project, I delve into the thrilling world of the Premier League and harness the power of the Random Forest algorithm to predict football match results.

In [1]:
# Import the packages I will need to develop my machine learning algorithm

import numpy as np # Linear algebra
import pandas as pd # Data processing

# Import the data set I downloaded and uploaded to Kaggle ("Premier league results for 2022-2023 season")
# After uploading the file, I ask the program to print its path so that I can import it.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/premierleague/matches.csv


**Showing the Data from the DataSet for the first time**

After importing the dataset into my notebook, I use the .head() method to show its first 5 rows. This also gives a first look at the column names, which makes developing the algorithm easier.

In [2]:
# Load the data set into the matches variable and show the first 5 rows
matches = pd.read_csv("/kaggle/input/premierleague/matches.csv")
matches.head()

Unnamed: 0.1,Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
0,1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,...,Match Report,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City
1,2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,...,Match Report,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
2,3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,...,Match Report,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City
3,4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,...,Match Report,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City
4,6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,...,Match Report,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City


**Understanding the shape of the DataSet**

After a first look at the dataset we can check its shape, that is, how many rows and columns it has. In this case we start working with a dataset of 1389 rows and 28 columns.

In [3]:
# Show how many rows and columns are in the dataset
matches.shape

(1389, 28)

**Fixing the DataTypes**

Most machine learning algorithms can only work with numeric data types, and the Random Forest is no exception. So we have to convert some of the columns from the object type into integers, or into some other numeric type. First of all we check which data types we have in our dataset, then we choose the columns we want to work with, and if they are not numeric already we convert them.

In [4]:
# Check what DataType we have in our DataSet
matches.dtypes

Unnamed: 0        int64
date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf              float64
ga              float64
opponent         object
xg              float64
xga             float64
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh              float64
sot             float64
dist            float64
fk              float64
pk              float64
pkatt           float64
season            int64
team             object
dtype: object

**Converting the Date type**

Here I decided to convert only the date column, but in the future more columns may be converted and used in the algorithm in order to increase the accuracy.

In [5]:
# Replace the "date" column of type object with the same data converted to datetime64
matches["date"] = pd.to_datetime(matches["date"])
matches.dtypes

Unnamed: 0               int64
date            datetime64[ns]
time                    object
comp                    object
round                   object
day                     object
venue                   object
result                  object
gf                     float64
ga                     float64
opponent                object
xg                     float64
xga                    float64
poss                   float64
attendance             float64
captain                 object
formation               object
referee                 object
match report            object
notes                  float64
sh                     float64
sot                    float64
dist                   float64
fk                     float64
pk                     float64
pkatt                  float64
season                   int64
team                    object
dtype: object

**Creating the Predictors**

In the Random Forest algorithm, predictors, also known as features or independent variables, are the variables used to make predictions. These predictors are the characteristics or attributes of the data that are input into the algorithm to train the model and subsequently used to make predictions on new, unseen data.

In the context of predicting Premier League football matches, our predictors are the venue, the opponent, the kick-off hour, the day of the week and the goals scored (gf). To increase the accuracy, further predictors could include team statistics (e.g., previous match results, goal difference, number of goals scored), player performance metrics (e.g., goals, assists, yellow cards), historical data (e.g., head-to-head records, past performance in similar scenarios), and various other factors that might influence the outcome of a match.

**How do we use the Predictors?**

Random Forest is an ensemble learning algorithm that combines the predictions of multiple decision trees. Each decision tree in the Random Forest is trained on a bootstrap sample of the training rows (bagging), and at each split it considers only a random subset of the available predictors. This second step is often called "feature bagging" or "feature sampling." By considering different rows and different subsets of predictors, the Random Forest algorithm reduces the correlation between individual trees and ensures a diverse set of predictions.

In [6]:
# Create a new column representing the Home or Away value
matches["venue_code"] = matches["venue"].astype("category").cat.codes

# Create a new column representing a code for each opponent
matches["opp_code"] = matches["opponent"].astype("category").cat.codes

# Create a new column representing the time as an integer value
matches["hour"] = matches["time"].str.replace(":.+", "", regex=True).astype("int")

# Create a new column representing the code for each day of the week
matches["day_code"] = matches["date"].dt.dayofweek


# Create the target column: 1 if the team won the game, 0 otherwise (draw or loss)
matches["target"] = (matches["result"] == "W").astype("int")
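
To make the encodings above easier to follow, here is a minimal sketch on a tiny toy DataFrame. The rows and the name `toy` are made up for illustration and are not taken from matches.csv; only the column names follow the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows, only the column names follow matches.csv)
toy = pd.DataFrame({
    "venue": ["Away", "Home", "Home", "Away"],
    "time": ["16:30", "15:00", "12:30", "20:00"],
    "date": pd.to_datetime(["2021-08-15", "2021-08-21", "2021-08-28", "2021-09-13"]),
    "result": ["L", "W", "D", "W"],
})

# Categories are sorted alphabetically, so "Away" becomes 0 and "Home" becomes 1
toy["venue_code"] = toy["venue"].astype("category").cat.codes

# Keep only the hour part of "HH:MM" and store it as an integer
toy["hour"] = toy["time"].str.replace(":.+", "", regex=True).astype("int")

# dayofweek counts from Monday = 0 to Sunday = 6
toy["day_code"] = toy["date"].dt.dayofweek

# 1 for a win, 0 for a draw or a loss
toy["target"] = (toy["result"] == "W").astype("int")

print(toy[["venue_code", "hour", "day_code", "target"]])
```

Note that the category codes are only labels: the model sees opponent 18 as "larger" than opponent 3, even though the order is just alphabetical.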

In [7]:
matches

Unnamed: 0.1,Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,...,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
0,1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,...,1.0,0.0,0.0,2022,Manchester City,0,18,16,6,0
1,2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,...,1.0,0.0,0.0,2022,Manchester City,1,15,15,5,1
2,3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,...,0.0,0.0,0.0,2022,Manchester City,1,0,12,5,1
3,4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,...,0.0,0.0,0.0,2022,Manchester City,0,10,15,5,1
4,6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,...,1.0,0.0,0.0,2022,Manchester City,1,17,15,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1384,38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0.0,4.0,...,0.0,0.0,0.0,2021,Sheffield United,0,18,19,6,0
1385,39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0.0,2.0,...,1.0,0.0,0.0,2021,Sheffield United,1,6,15,5,0
1386,40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1.0,0.0,...,0.0,0.0,0.0,2021,Sheffield United,0,7,19,6,1
1387,41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0.0,1.0,...,1.0,0.0,0.0,2021,Sheffield United,0,14,18,2,0


**Training the model**

Once our basic predictors have been created and added to the dataset, we can train our model. Training is one of the most important parts, if not the most important part, of a machine learning algorithm. In my case, I decided to use all the matches played before 1 January 2022 as training data, and all the matches played after that date as testing data.

**Random Forest algorithm**

During the training phase of the Random Forest algorithm, each decision tree is trained to predict the target variable (in this case, whether the team won the Premier League match) from its own bootstrap sample of the training data, considering a random subset of the predictors at each split. The algorithm estimates the importance of each predictor by measuring how much its splits reduce the impurity (or increase the information gain) across the decision trees.

When making predictions with the trained Random Forest model, the algorithm combines the predictions of all the decision trees by averaging (for regression problems) or voting (for classification problems) to generate the final prediction. Each decision tree contributes its own prediction, and the final prediction is determined by the collective wisdom of the entire ensemble.

In summary, predictors in the Random Forest algorithm are the input variables that capture the relevant information and characteristics of the data used to train the model. These predictors are used by individual decision trees within the ensemble to make predictions, and the algorithm combines the predictions of all the decision trees to produce the final prediction.
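
The combination step can be checked directly in scikit-learn. The following is a small sketch on synthetic data (generated with `make_classification`, so it has nothing to do with the football dataset; the name `forest` is just for this example). It shows that the forest's class probabilities are the average of the probabilities of its individual trees, and that the predictor importances sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=1)

forest = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
forest.fit(X, y)

# Every fitted tree is available in forest.estimators_
per_tree = np.array([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# scikit-learn's classifier averages the class probabilities of the trees
print(np.allclose(averaged, forest.predict_proba(X)))

# One importance value per predictor, normalized to sum to 1
print(forest.feature_importances_.round(3))
```

In scikit-learn the "vote" is therefore a soft vote: the predicted class is the one with the highest average probability.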

In [8]:
# Train the machine learning model
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 10, random_state = 1)

# Split the dataset chronologically into train and test sections
# Note: the strict comparisons leave out any match played exactly on 2022-01-01
train = matches[matches["date"] < '2022-01-01']
test = matches[matches["date"] > '2022-01-01']

# A list of all the predictors
# Note: "gf" is the number of goals scored in the match itself,
# so it is only known after the match has been played
predictors = ["venue_code", "opp_code", "hour", "day_code", "gf"]

# Fit the random forest model on the predictors to predict the target
rf.fit(train[predictors], train["target"])
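
The date-based split can be tried out on a few made-up rows. This is only a sketch with a hypothetical `toy` DataFrame; it shows how the strict `<` and `>` comparisons treat a match played exactly on the boundary date, and how `>=` would keep it in the test set.

```python
import pandas as pd

# Three hypothetical matches around the split date
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02"]),
    "target": [1, 0, 1],
})

# Same comparisons as in the cell above
train = toy[toy["date"] < "2022-01-01"]
test = toy[toy["date"] > "2022-01-01"]

# Alternative: >= keeps the boundary day in the test set
test_inclusive = toy[toy["date"] >= "2022-01-01"]

print(len(train), len(test), len(test_inclusive))
```

Splitting by date rather than at random matters here, because a random split would let the model train on matches played after the ones it is tested on.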

**Generating the first prediction**

After successfully training the model, we are ready to generate some predictions based on the previous results.
As soon as we have our first predictions we can calculate the accuracy of the model.

In [9]:
# Generate prediction, passing the test data and the predictors
preds = rf.predict(test[predictors])

# Import the metric used to measure the accuracy
from sklearn.metrics import accuracy_score

# Calculate the accuracy passing the test data, with the predictors and the prediction
acc = accuracy_score(test["target"], preds)

**Accuracy**

Our first accuracy looks pretty good, but we have to remember that it is not just the accuracy of the predicted wins: it also counts the matches correctly predicted as not won (losses and draws). We will have to change the way we measure the quality of the predictions.

In [10]:
acc

0.8115942028985508

**Combining the DataFrames and showing clearer results**

We have now predicted some matches, but we still haven't seen the results. By combining the actual and the predicted values we can create a new DataFrame and display it, so that we can see our first raw predictions.

In [11]:
# Create a new DataFrame to combine the actual values and the predicted values 
combined = pd.DataFrame(dict(actual=test["target"], prediction=preds))

# Show a table with the new DataFrame
pd.crosstab(index = combined["actual"], columns = combined["prediction"])

actual \ prediction,0,1
0,147,25
1,27,77


**Revising the accuracy metric**

As discussed before, we have to revise the metric. Here we take into consideration only the matches where we predicted a win, which is called the precision. As we can see, the precision is lower than the overall accuracy.

In [12]:
# Revise the accuracy metric

# Show the precision: the share of predicted wins that were actual wins
from sklearn.metrics import precision_score
precision_score(test['target'], preds)

0.7549019607843137
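
Both scores can be recomputed by hand from the four counts in the crosstab above. The sketch below copies those counts, applies the usual definitions of accuracy and precision, and then rebuilds label arrays with the same counts to compare against scikit-learn. The variable names `tn`, `fp`, `fn` and `tp` are my own shorthand for the four cells.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Counts copied from the crosstab: rows are actual values, columns are predictions
tn, fp = 147, 25   # actual 0: predicted 0, predicted 1
fn, tp = 27, 77    # actual 1: predicted 0, predicted 1

# Accuracy: all correct predictions over all matches
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: correct win predictions over all win predictions
precision = tp / (tp + fp)

# Rebuild label arrays with the same counts and compare with scikit-learn
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
```

The two printed pairs agree with the values shown in the notebook (about 0.81 for the accuracy and about 0.75 for the precision).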

**Having a clearer look at the predictions**

We can now work again on our new DataFrame, the prediction DataFrame. To make it easier to read, we can add some information to it by combining other data, such as the date of the match, both teams and, most importantly, the result.

In [13]:
# Show the predicted and actual result of every match, with more information
combined = pd.DataFrame(dict(actual = test["target"], predicted = preds), index = test.index)
combined = combined.merge(matches[["date", "team", "opponent", "result"]], left_index = True, right_index = True)
combined 

Unnamed: 0,actual,predicted,date,team,opponent,result
21,1,0,2022-01-15,Manchester City,Chelsea,W
22,0,1,2022-01-22,Manchester City,Southampton,D
23,1,1,2022-02-09,Manchester City,Brentford,W
24,1,1,2022-02-12,Manchester City,Norwich City,W
25,0,1,2022-02-19,Manchester City,Tottenham,L
...,...,...,...,...,...,...
624,0,0,2022-03-13,Norwich City,Leeds United,L
625,0,0,2022-04-02,Norwich City,Brighton,D
626,1,1,2022-04-10,Norwich City,Burnley,W
627,0,0,2022-04-16,Norwich City,Manchester Utd,L


**Increasing the accuracy of our algorithm**

One thing we can do to gain more confidence in our predictions is to look at the model's predictions for both sides of each match. Right now every row describes a match from one team's point of view, so by merging the DataFrame with itself we can put the two predictions for the same match side by side.
But first we have to map the team names in our dataset, because the same team can appear under different names in the team and opponent columns.

In [14]:
# Fix and normalize the names of the teams

# A dict that returns the key itself when the key is missing,
# so team names without a mapping are left unchanged
class MissingDict(dict):
    def __missing__(self, key):
        return key

map_values = {
    "Brighton and Hove Albion": "Brighton",
    "Manchester United": "Manchester Utd",
    "Newcastle United": "Newcastle Utd",
    "Tottenham Hotspur": "Tottenham",
    "West Ham United": "West Ham",
    "Wolverhampton Wanderers": "Wolves"
}
mapping = MissingDict(**map_values)
mapping["West Ham United"]  # Quick check: evaluates to "West Ham"

# Use the mapping with the pandas map method to create the normalized team name
combined["new_team"] = combined["team"].map(mapping)
combined


Unnamed: 0,actual,predicted,date,team,opponent,result,new_team
21,1,0,2022-01-15,Manchester City,Chelsea,W,Manchester City
22,0,1,2022-01-22,Manchester City,Southampton,D,Manchester City
23,1,1,2022-02-09,Manchester City,Brentford,W,Manchester City
24,1,1,2022-02-12,Manchester City,Norwich City,W,Manchester City
25,0,1,2022-02-19,Manchester City,Tottenham,L,Manchester City
...,...,...,...,...,...,...,...
624,0,0,2022-03-13,Norwich City,Leeds United,L,Norwich City
625,0,0,2022-04-02,Norwich City,Brighton,D,Norwich City
626,1,1,2022-04-10,Norwich City,Burnley,W,Norwich City
627,0,0,2022-04-16,Norwich City,Manchester Utd,L,Norwich City
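
The reason for using a dict subclass instead of a plain dict can be seen in a short sketch. With a plain dict, pandas `map` turns every team without an entry into NaN; with a subclass that defines `__missing__`, the original name is kept. The two-team Series below is made up for illustration.

```python
import pandas as pd

class MissingDict(dict):
    # Called by dict lookups when the key is not present
    def __missing__(self, key):
        return key

teams = pd.Series(["West Ham United", "Chelsea"])

# Plain dict: "Chelsea" has no entry and becomes NaN
plain = teams.map({"West Ham United": "West Ham"})

# MissingDict: "Chelsea" falls back to its own name
kept = teams.map(MissingDict({"West Ham United": "West Ham"}))

print(plain.tolist())
print(kept.tolist())
```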


**Showing the new results with more accuracy**

Now, after merging the DataFrame with itself, we can show the new DataFrame and have more confidence in our predicted data.

In [15]:
# Use the new column to merge the DataFrame with itself, so we can check whether the predictions agree on both sides
merged = combined.merge(combined, left_on = ["date", "new_team"], right_on = ["date", "opponent"])
merged

Unnamed: 0,actual_x,predicted_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,predicted_y,team_y,opponent_y,result_y,new_team_y
0,1,0,2022-01-15,Manchester City,Chelsea,W,Manchester City,0,0,Chelsea,Manchester City,L,Chelsea
1,0,1,2022-01-22,Manchester City,Southampton,D,Manchester City,0,0,Southampton,Manchester City,D,Southampton
2,1,1,2022-02-09,Manchester City,Brentford,W,Manchester City,0,0,Brentford,Manchester City,L,Brentford
3,1,1,2022-02-12,Manchester City,Norwich City,W,Manchester City,0,0,Norwich City,Manchester City,L,Norwich City
4,0,1,2022-02-19,Manchester City,Tottenham,L,Manchester City,1,1,Tottenham Hotspur,Manchester City,W,Tottenham
...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,0,0,2022-03-13,Norwich City,Leeds United,L,Norwich City,1,1,Leeds United,Norwich City,W,Leeds United
258,0,0,2022-04-02,Norwich City,Brighton,D,Norwich City,0,0,Brighton and Hove Albion,Norwich City,D,Brighton
259,1,1,2022-04-10,Norwich City,Burnley,W,Norwich City,0,0,Burnley,Norwich City,L,Burnley
260,0,0,2022-04-16,Norwich City,Manchester Utd,L,Norwich City,1,1,Manchester United,Norwich City,W,Manchester Utd
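
How the self-merge pairs the rows is easier to see on a tiny example. In this sketch the first two rows mirror the Manchester City and Chelsea match from the table above, while the third row is a hypothetical match whose other side is missing from the data. Since `merge` performs an inner join by default, that unmatched row is dropped, which would explain why the merged table has fewer rows than the prediction table.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-15", "2022-01-15", "2022-01-22"]),
    "new_team": ["Manchester City", "Chelsea", "Arsenal"],
    "opponent": ["Chelsea", "Manchester City", "Burnley"],  # third row is made up
    "actual": [1, 0, 0],
    "predicted": [0, 0, 0],
})

# Match each row with the row that describes the same match from the other side
merged = toy.merge(toy, left_on=["date", "new_team"], right_on=["date", "opponent"])

# _x columns come from the left copy, _y columns from the right copy
print(merged[["date", "new_team_x", "predicted_x", "new_team_y", "predicted_y"]])
```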


**Another prediction**

One more thing we can do to gain more confidence is to check where the two predictions for the same match are coherent. If for a match we predicted team one to win and team two not to win, then the prediction is more consistent.

In [16]:
# Look at the matches where one team was predicted to win and the other was predicted not to win, i.e. where the algorithm is more confident
merged[(merged["predicted_x"] == 1 ) & (merged["predicted_y"] == 0)]["actual_x"].value_counts()

1    65
0     5
Name: actual_x, dtype: int64
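
The same filter can be followed step by step on a handful of made-up rows. This is a sketch with a hypothetical five-row `merged_toy` table: it keeps the rows where the two sides agree on a win for the first team, counts the actual outcomes, and computes the share of those predictions that were correct.

```python
import pandas as pd

# Hypothetical merged predictions, one row per match side
merged_toy = pd.DataFrame({
    "predicted_x": [1, 1, 1, 0, 1],
    "predicted_y": [0, 0, 1, 0, 0],
    "actual_x":    [1, 0, 1, 0, 1],
})

# Keep the rows where team x is predicted to win and team y is predicted not to win
confident = merged_toy[(merged_toy["predicted_x"] == 1) & (merged_toy["predicted_y"] == 0)]

# Count how often team x actually won in those rows
counts = confident["actual_x"].value_counts()
print(counts)

# Share of the consistent win predictions that were correct
share = (confident["actual_x"] == 1).mean()
print(share)
```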

**Calculating the new precision**

With these new data we can calculate the new precision by dividing the correct predictions (65) by all the matches selected above (65 + 5 = 70). As we can see, the precision rises from about 75% to about 93%.

In [17]:
# Check the precision by dividing the correct predictions by all the selected matches
65 / (65 + 5)

0.9285714285714286

**Improve the accuracy even more**

There is a lot more we can do to improve the accuracy even further, for example enlarging the dataset, adding more predictors, or using rolling averages of each team's recent performance.
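
As an illustration of the rolling average idea, here is a minimal sketch on made-up data. It assumes a window of 3 matches and uses the goals scored (`gf`) as the statistic; both choices, and the names `toy` and `gf_rolling`, are only for this example. The `shift(1)` makes sure each row only sees matches played before it, so the new predictor is available before kick-off.

```python
import pandas as pd

# Hypothetical matches for two teams
toy = pd.DataFrame({
    "team": ["A"] * 5 + ["B"] * 4,
    "date": pd.to_datetime([
        "2021-08-01", "2021-08-08", "2021-08-15", "2021-08-22", "2021-08-29",
        "2021-08-01", "2021-08-08", "2021-08-15", "2021-08-22",
    ]),
    "gf": [1.0, 3.0, 2.0, 0.0, 4.0, 2.0, 2.0, 2.0, 5.0],
})

# Rolling averages only make sense in chronological order within each team
toy = toy.sort_values(["team", "date"])

# Average of the previous 3 matches of the same team, excluding the current one
toy["gf_rolling"] = toy.groupby("team")["gf"].transform(
    lambda s: s.shift(1).rolling(3).mean()
)

# The first 3 matches of each team have no full history, so they are dropped
usable = toy.dropna(subset=["gf_rolling"])
print(usable)
```

A column like `gf_rolling` could then be added to the predictors list in place of the raw match statistics.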