aziztitu/football-match-predictor

Football Match Predictor

Overview

This project uses Machine Learning to predict the outcome of a football match when given some stats from half time.

You can check out the demo here: https://football-predictor.projects.aziztitu.com/

Data Source

After doing some research, I landed on this site: https://datahub.io/collections/football, which contained structured datasets for a variety of football competitions ranging from national leagues to world cups.

For this project, I decided to select the datasets for the top 5 European Leagues that contained the match results for the last 9 years.

Here are the links to the datasets that I used:

Data Pre-Processing

Since the data that I obtained was already structured, this part was a whole lot easier. But there was still a lot of work to be done.

Dataset Preview:
Note: The output column is FTR [H = Home Win, D = Draw, A = Away Win].

       index Div        Date     HomeTeam    AwayTeam  FTHG  FTAG FTR  HTHG  ...  BbAvAHH BbMxAHA BbAvAHA   PSH   PSD    PSA  PSCH  PSCD   PSCA
0          0  E0  2009-08-15  Aston Villa       Wigan     0     2   A   0.0  ...     1.22    4.40    3.99   NaN   NaN    NaN   NaN   NaN    NaN
1          1  E0  2009-08-15    Blackburn    Man City     0     2   A   0.0  ...     2.38    1.60    1.54   NaN   NaN    NaN   NaN   NaN    NaN
2          2  E0  2009-08-15       Bolton  Sunderland     0     1   A   0.0  ...     1.61    2.33    2.23   NaN   NaN    NaN   NaN   NaN    NaN
3          3  E0  2009-08-15      Chelsea        Hull     2     1   H   1.0  ...     1.02   17.05   12.96   NaN   NaN    NaN   NaN   NaN    NaN
4          4  E0  2009-08-15      Everton     Arsenal     1     6   A   0.0  ...     2.20    1.73    1.63   NaN   NaN    NaN   NaN   NaN    NaN
...      ...  ..         ...          ...         ...   ...   ...  ..   ...  ...      ...     ...     ...   ...   ...    ...   ...   ...    ...
17875    375  I1  26/05/2019        Inter      Empoli     2     1   H   0.0  ...     2.05    1.85    1.81  1.39  5.35   7.81  1.27  6.36  10.94
17876    376  I1  26/05/2019         Roma       Parma     2     1   H   1.0  ...     1.85    2.10    2.01  1.20  7.50  14.07  1.17  8.59  16.35
17877    377  I1  26/05/2019    Sampdoria    Juventus     2     0   H   0.0  ...     1.96    1.95    1.90  3.92  3.98   1.93  3.06  3.55   2.40
17878    378  I1  26/05/2019         Spal       Milan     2     3   A   1.0  ...     2.02    1.89    1.84  6.25  4.51   1.54  5.41  4.30   1.63
17879    379  I1  26/05/2019       Torino       Lazio     3     1   H   0.0  ...     2.03    1.88    1.84  2.34  3.76   3.01  2.36  3.56   3.12

Cleaning up the data

First, the dataset had a few missing values. For numeric features such as HomeGoals and AwayGoals, I replaced each missing value with the mean of that feature for the respective team. For discrete features such as HomeTeam, AwayTeam, or League, I decided the best option was simply to drop the affected rows. Since there were fewer than 20 such rows, dropping them was acceptable.

Rows with missing values (NaN):

home_encoded  away_encoded  HTHG   HTAG    HS    AS   HST  AST   HR   AR  FTR  ...

10585            16            95   NaN   NaN   NaN   NaN  NaN  NaN  NaN  NaN   A
15254            35           129   NaN   NaN   NaN   NaN  NaN  NaN  NaN  NaN   A
16757           132           121   NaN   NaN  13.0  15.0  3.0  5.0  0.0  0.0   A
...
...
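The cleaning step described above could look roughly like the following sketch. It assumes pandas; the helper name `clean`, the toy values, and the choice of `HTHG` as the numeric column are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

# Toy frame: column names follow the dataset preview, the values are made up.
matches = pd.DataFrame({
    "HomeTeam": ["Chelsea", "Chelsea", "Chelsea", None, "Roma"],
    "AwayTeam": ["Hull", "Wigan", "Bolton", "Milan", "Parma"],
    "HTHG": [1.0, np.nan, 3.0, 0.0, 2.0],
})

def clean(df, team_col, numeric_cols, discrete_cols):
    """Drop rows with missing discrete values, then fill numeric gaps
    with the mean of that column for the same team (hypothetical helper)."""
    out = df.dropna(subset=discrete_cols).copy()
    for col in numeric_cols:
        # transform("mean") ignores NaN and returns one value per row
        team_mean = out.groupby(team_col)[col].transform("mean")
        out[col] = out[col].fillna(team_mean)
    return out

cleaned = clean(matches, "HomeTeam", ["HTHG"], ["HomeTeam", "AwayTeam"])
print(cleaned)
```

A team whose values are all missing for a column would still be left with NaN after this step, so such rows would need separate handling.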

Data Analysis

Feature Selection:

Now that the data was clean, it was time to find out which features contributed the most towards the match results. The dataset had 62 different stats for each match, but I had to choose the right ones that had the highest impact.

I started out by visualizing the distribution of some of the features that I thought were useful.

Home/Away Distribution:

Analyzing the Home/Away distribution, it was obvious that the match results favor the Home teams way more than the Away teams.

Shots:

Two other features that I expected to be very important were 'Home Shots' and 'Away Shots'. On further exploration, I found that these had very little impact, if any, on the final results. What did have a massive impact were 'Home Shots on Target' and 'Away Shots on Target'.

Yellow/Red Cards:

The number of yellow cards seemed to have little to no impact on the result. The number of red cards, however, had a tremendous impact.

Statistical Tests:

After exploring some of the features manually, I went on to perform some statistical tests to see if these features were truly important.

This is a common problem in applied machine learning where you have to determine whether certain input features are relevant to the outcome.

In the case of classification problems where input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent or independent of the input variables. If independent, then the input variable is a candidate for a feature that may be irrelevant to the problem and can possibly be removed from the dataset.

One such test is Pearson’s Chi-Squared test of independence. This was the result of the Chi-Squared analysis:

HC is NOT an important predictor
AC is NOT an important predictor
HY is NOT an important predictor
AY is NOT an important predictor
AF is IMPORTANT for Prediction
AR is IMPORTANT for Prediction
AS is IMPORTANT for Prediction
AST is IMPORTANT for Prediction
HC is IMPORTANT for Prediction
HF is IMPORTANT for Prediction
HR is IMPORTANT for Prediction
HS is IMPORTANT for Prediction
HST is IMPORTANT for Prediction
HTAG is IMPORTANT for Prediction
HTHG is IMPORTANT for Prediction
ht_label is IMPORTANT for Prediction
at_label is IMPORTANT for Prediction
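As a rough illustration of how such a verdict can be produced, here is a minimal sketch using `pandas.crosstab` and `scipy.stats.chi2_contingency`. The helper name `is_dependent`, the 0.05 significance level, and the toy data are assumptions for the example; the thresholds actually used in this project are not stated here.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def is_dependent(df, feature, target="FTR", alpha=0.05):
    """Chi-squared test of independence between one feature and the target.
    Returns (dependent?, p-value). alpha=0.05 is an assumed threshold."""
    table = pd.crosstab(df[feature], df[target])  # observed counts
    _chi2, p_value, _dof, _expected = chi2_contingency(table)
    return bool(p_value < alpha), float(p_value)

# Made-up data: HR (home red card) is strongly tied to the result,
# while HY is constructed to be independent of it.
records = []
for hr, ftr, n in [(0, "H", 40), (0, "A", 10), (1, "H", 10), (1, "A", 40)]:
    for i in range(n):
        records.append({"HR": hr, "HY": i % 2, "FTR": ftr})
toy = pd.DataFrame(records)

verdicts = {}
for feature in ["HR", "HY"]:
    dependent, p_value = is_dependent(toy, feature)
    verdicts[feature] = dependent
    label = "IMPORTANT for Prediction" if dependent else "NOT an important predictor"
    print(f"{feature} is {label} (p = {p_value:.3g})")
```

A small p-value means the hypothesis that the feature and the result are independent is rejected, so the feature is kept as a candidate predictor.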

Another problem is collinearity, the state where two variables are highly correlated and contain similar information about the variance within a given dataset. When more than two features are collinear with each other, we get multicollinearity.

One of the methods we can use to check for multicollinear variables is calculating the Variance inflation factor (VIF). A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.

After calculating the VIF on this dataset, I found the following variables to have high VIF:

HS: 9
AS: 4
HF: 5
AF: 2

This test confirms the earlier observation regarding Home and Away shots, and it also reveals more variables that are collinear.

After dropping the unnecessary features, these are the ones I selected:

home_encoded    non-null float64
away_encoded    non-null float64
HTHG            non-null float64
HTAG            non-null float64
HST             non-null float64
AST             non-null float64
HR              non-null float64
AR              non-null float64

Training & Testing the Models:

Since I'm using Python for this project, it is very easy to test multiple models and compare their performance.

For this project, I selected the following 3 models:

  • Naive Bayes:
    • This is based on the famous Bayes’ Theorem, which gives the probability of an event occurring given the probability of another event that has already occurred.
    • The naive assumption made in this particular classifier is that all the features are independent of each other. This makes it easy to make the prediction, but that is exactly why the predictions are quite naive.
    • But in practice, there are quite a few real-world use cases of this type of classifier, namely document classification and spam-filtering among many others.
  • Random Forest
    • Random Forests are simply an ensemble of Decision Trees, where a large number of decision trees spit out a prediction of their own, and the prediction with the most votes becomes the model's prediction.
    • A decision tree, which is the building block of a Random Forest, is exactly what the name suggests. It is a tree-like structure in which the model makes a yes/no decision at each node to traverse the tree and ultimately reaches one of the leaf nodes where it makes a prediction.
  • Logistic Regression
    • Logistic regression is named for the function used at the core of the method, the logistic function or the sigmoid function.
    • It uses an equation as the representation, very much like linear regression, where the inputs are combined linearly using weights or coefficient values to predict an output value.
    • On their own, logistic regressions are only binary classifiers, meaning they cannot handle output with more than two classes. In our case we have 3 classes for our output (H, D, A).
    • However, there are clever extensions to logistic regression to do just that. In one-vs-rest logistic regression (OVR), which is what I used here, a separate model is trained for each class to predict whether an observation belongs to that class or not (thus making it a binary classification problem). It assumes that each classification problem (e.g. class H or not) is independent.
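The two decision rules described in the list (majority voting in a Random Forest and picking the highest-scoring class in one-vs-rest logistic regression) can be sketched in plain Python. All names, weights, and inputs below are made up for illustration; a real model would learn the weights from data.

```python
import math
from collections import Counter

def forest_vote(tree_predictions):
    """Majority vote over the labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def ovr_predict(x, models):
    """One-vs-rest: score x with one binary logistic model per class and
    return the class with the highest score. `models` maps each label to
    a (weights, bias) pair."""
    scores = {
        label: sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for label, (weights, bias) in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical weights over two features: [half-time home goals, half-time away goals]
toy_models = {
    "H": ([1.2, -1.2], 0.2),
    "D": ([-0.3, -0.3], -0.5),
    "A": ([-1.2, 1.2], -0.2),
}

print(forest_vote(["H", "D", "H", "A", "H"]))
print(ovr_predict([2, 0], toy_models)[0])
print(ovr_predict([0, 2], toy_models)[0])
```

Because each binary model is trained separately, the three scores need not sum to 1; only their ranking matters for the final prediction.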

I split the dataset into a 4:1 ratio for training and testing. After the first run, these were the results:

Logistic Regression one vs All Classifier
--------------------
Model trained in 0.804028 seconds

Training Info:
F1 Score:0.6612824278022515
Accuracy:0.6612824278022515
Made Predictions in 0.001457 seconds

Test Metrics:
F1 Score:0.6633109619686801
Accuracy:0.6633109619686801
Made Predictions in 0.000830 seconds

Gaussain Naive Bayes Classifier
--------------------
Model trained in 0.016375 seconds

Training Info:
F1 Score:0.6289769946157612
Accuracy:0.6289769946157612
Made Predictions in 0.007448 seconds

Test Metrics:
F1 Score:0.6061856823266219
Accuracy:0.6061856823266219
Made Predictions in 0.001650 seconds

Random Forest Classifier
--------------------
Model trained in 1.326726 seconds

Training Info:
F1 Score:0.9999300748199427
Accuracy:0.9999300748199427
Made Predictions in 0.243328 seconds

Test Metrics:
F1 Score:0.6554809843400448
Accuracy:0.6554809843400448
Made Predictions in 0.111957 seconds

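As you can see, it was not too bad for a first run. Logistic Regression and Random Forest reach around 65-66% test accuracy, and Naive Bayes around 61%. Note that Random Forest scores almost 100% on the training data but only about 66% on the test data, which suggests it is overfitting.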
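A training and evaluation loop of this kind might look like the sketch below, assuming scikit-learn. The data is synthetic (`make_classification`), and the hyperparameters are assumptions rather than the project's settings. It also reports a micro-averaged F1 score: for single-label multiclass problems this always equals accuracy, which would be consistent with the identical F1 and accuracy values in the output, although the averaging mode actually used is not stated.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for eight features and a 3-class result label.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# 4:1 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "Logistic Regression (one-vs-rest)": OneVsRestClassifier(
        LogisticRegression(max_iter=1000)),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    start = perf_counter()
    model.fit(X_train, y_train)
    train_seconds = perf_counter() - start

    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1_micro": f1_score(y_test, predictions, average="micro"),
        "train_seconds": train_seconds,
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"f1_micro={results[name]['f1_micro']:.3f}, "
          f"trained in {train_seconds:.3f}s")
```

Comparing training and test scores for each model in the same loop is a quick way to spot a model that memorizes the training set.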

After playing around with it for a while, I found that adding 'Home Shots' and 'Away Shots' back actually helped increase the accuracy a little bit.

After a lot of tweaking, here are the final results:

Logistic Regression one vs All Classifier
--------------------
Model trained in 0.714509 seconds

Training Info:
F1 Score:0.6815789473684211
Accuracy:0.6815789473684211
Made Predictions in 0.001436 seconds

Test Metrics:
F1 Score:0.7092105263157895
Accuracy:0.7092105263157895
Made Predictions in 0.000857 seconds

Gaussain Naive Bayes Classifier
--------------------
Model trained in 0.013822 seconds

Training Info:
F1 Score:0.6398026315789473
Accuracy:0.6398026315789473
Made Predictions in 0.007501 seconds

Test Metrics:
F1 Score:0.6526315789473685
Accuracy:0.6526315789473685
Made Predictions in 0.002132 seconds

Random Forest Classifier
--------------------
Model trained in 1.527298 seconds

Training Info:
F1 Score:0.999671052631579
Accuracy:0.999671052631579
Made Predictions in 0.267743 seconds

Test Metrics:
F1 Score:0.6907894736842105
Accuracy:0.6907894736842105
Made Predictions in 0.124963 seconds

Logistic Regression performed best with about 71% test accuracy, closely followed by Random Forest at about 69%, while Naive Bayes reached about 65%.

Next Steps

Initially, I did not think I was going to get around 70% accuracy with these models, so it is really cool to see it in action. Still, there are a few things I'd like to improve from here.

Team skill & strategy

One of the drawbacks at the moment is that the teams themselves don't have a huge impact on the predicted outcome. In practice, team strength plays a huge role. A first-division team has a much higher chance of winning a game against a third-division team, even if the match is played at the third-division team's home ground.

The other thing I want the model to take into account is the ability of a team to bounce back. There are certain teams in football that play defensively in the first half and more aggressively in the second half, or vice versa.

In order for the model to take these things into account, I plan to pre-compute these values for each team and store them locally. I can re-train the models with these features and during prediction, I can use the respective team's pre-computed values as supplemental features which should help it make better predictions.
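One possible shape for these pre-computed values is sketched below, assuming pandas. The feature names `home_win` and `second_half_goal_diff`, the helper `home_team_profile`, and the toy rows are hypothetical; the sketch only looks at home matches and is not an implemented part of the project.

```python
import json

import pandas as pd

# Made-up matches using the dataset's column names.
matches = pd.DataFrame({
    "HomeTeam": ["Inter", "Inter", "Roma", "Roma"],
    "FTHG": [2, 1, 2, 0],   # full-time home goals
    "FTAG": [1, 1, 1, 2],   # full-time away goals
    "HTHG": [0, 1, 1, 0],   # half-time home goals
    "HTAG": [1, 0, 1, 0],   # half-time away goals
})

def home_team_profile(df):
    """Per-team averages over home matches: win rate and second-half
    goal difference (a rough proxy for the ability to bounce back)."""
    second_half_diff = (df["FTHG"] - df["HTHG"]) - (df["FTAG"] - df["HTAG"])
    per_match = pd.DataFrame({
        "team": df["HomeTeam"],
        "home_win": (df["FTHG"] > df["FTAG"]).astype(float),
        "second_half_goal_diff": second_half_diff.astype(float),
    })
    return per_match.groupby("team").mean()

profile = home_team_profile(matches)

# Serialize once, then look values up by team name at prediction time.
serialized = profile.to_json(orient="index")
lookup = json.loads(serialized)
print(lookup["Inter"])
```

The serialized string could be written to a local file and loaded when the model is served, so the supplemental features do not have to be recomputed for every prediction.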

Team Roster / Player Skills

I'd like the model to also take the players on the pitch into consideration when making the prediction. In practice, a team has a higher chance of winning the game when its star players are on the pitch.

Live Prediction

This is more of a long shot. As of now, the model makes the prediction based on the half-time stats. Eventually I'd like the model to predict the results for a live match all the way from minute 0 to minute 90. To do this, it must learn to account for the current match time. But training the model to account for this is going to be extremely hard.
