In this notebook I will look to train multiple models from scikit learn and similar packages to evaluate their performance at predicting the outcome of a football match. In this instance, the idea will be to classify the result of a match (home, draw, away) as well as the number of goals scored per team to then get the match result for a particular game. We will test our models based on three seasons (19/20, 20/21, 21/22) of premier league data from: https://www.football-data.co.uk/.

The models we will look to execute are the following:
* Poisson
* Decision trees
* Random Forests
* SVM (SVR)

The data we will use is 3 seasons of premier league data from https://www.football-data.co.uk/. Note that data from football data is relatively basic but the purpose of this notebook is to demonstrate the functionality of the SVM functions offered by scikit learn. 

# Imports

In [1]:
import os
os.chdir("..")
%autosave 0

Autosave disabled


In [73]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report,confusion_matrix
from sklearn import linear_model
from src.football_data.etl.fetch import (
    get_football_data_seasons
)
pd.set_option('display.max_columns', None)

# Config

Decide the three seasons we will look to train our models for. 

In [3]:
season_name_list = ["2019_2020", "2020_2021", "2021_2022"]

# Cleaning training data.

The first step is to grab the data needed to train and test these models.

Let's grab a dataframe which has the data for our interested seasons concatenated.

In [4]:
cleaned_seasons_football_data_df = get_football_data_seasons(season_name_list)
cleaned_seasons_football_data_df.head(3)

Unnamed: 0,league_code,season_name,date,time,kickoff,hometeam,awayteam,fthg,ftag,ftr,hthg,htag,htr,b365h,b365d,b365a,hs,as,hst,ast,hf,af,hc,ac,hy,ay,hr,ar
0,E0,2019_2020,09/08/2019,20:00,2019-09-08 20:00:00,Liverpool,Norwich,4,1,H,4,0,H,1.14,10.0,19.0,15,12,7,5,9,9,11,2,0,2,0,0
1,E0,2019_2020,10/08/2019,12:30,2019-10-08 12:30:00,West Ham,Man City,0,5,A,0,1,A,12.0,6.5,1.22,5,14,3,9,6,13,1,1,2,2,0,0
2,E0,2019_2020,10/08/2019,15:00,2019-10-08 15:00:00,Bournemouth,Sheffield United,1,1,D,0,0,D,1.95,3.6,3.6,13,8,3,3,10,19,3,4,2,1,0,0


Now we will look to break the data above into the features used to train and test our models and define out target column(s).

NB that classification and regression feature columns will be the same but target columns will be different where classification will have have the result of the match (H, D, A) and regression the number of goals for both home and away team

Define classification features and target column

In [5]:
# define feature columns
numeric_column_names = cleaned_seasons_football_data_df.select_dtypes(include=['int64', 'float64']).columns
feature_column_names = [column for column in numeric_column_names if column not in ['fthg','ftag','hthg','htag']]
classification_features = cleaned_seasons_football_data_df[feature_column_names]

# define target column
classification_target = cleaned_seasons_football_data_df['ftr']

Define regression features and target column

In [6]:
cleaned_seasons_football_data_df.select_dtypes(include=['int64', 'float64'])

# define regression features
regression_features = classification_features

# define regression target
regression_target_home_goals = cleaned_seasons_football_data_df.fthg
regression_target_away_goals = cleaned_seasons_football_data_df.ftag

# Train test split data

Now that we have cleaned our data we will look to split our data into a training set and test set. We will do this for classification and regression separately.

Training and testing for classification

In [7]:
x_train_class, x_test_class, y_train_class, y_test_class = train_test_split(
    classification_features, 
    np.ravel(classification_target), 
    test_size=0.30, 
    random_state=101
    )

Training and testing for regression (home and away goals)

In [8]:
x_train_reg_hg, x_test_reg_hg, y_train_reg_hg, y_test_reg_hg = train_test_split(
    regression_features, 
    regression_target_home_goals, 
    test_size=0.30, 
    random_state=101
    )



In [9]:
x_train_reg_ag, x_test_reg_ag, y_train_reg_ag, y_test_reg_ag = train_test_split(
    regression_features, 
    regression_target_away_goals, 
    test_size=0.30, 
    random_state=101
    )


# Run sci-kit learn models (via classification)

Lets first define each of our classifiers.

In [120]:
scikit_classification_classifiers_list = [
    (SVC(),'SVM_C'),
    (DecisionTreeClassifier(),'DT_C'),
    (RandomForestClassifier(), 'RF_C')
]

Now let's run a loop around the sci-kit learn models of interest where we will need to fit our training data and make our predictions. We will also calculate the precision of our predictions for home, draw and away.

NB: for confusion matrix: 
* recall - TP/(TP+FN)
* precision - TP/(TP+FP)

In [121]:
metric_table_rows_classification = []

for classifier, classifier_name in scikit_classification_classifiers_list:
    classifier.fit(x_train_class,y_train_class)
    y_pred_class = classifier.predict(x_test_class)
    # confusion amtrix
    confusion_matrix_array = confusion_matrix(y_test_class, y_pred_class)
    # precision
    precision_home = confusion_matrix_array[0][0]/ sum(confusion_matrix_array[:, 0])
    precision_draw = confusion_matrix_array[1][1]/ sum(confusion_matrix_array[:, 1])
    precision_away = confusion_matrix_array[2][2]/ sum(confusion_matrix_array[:, 2])
    # accuracy
    no_correct_predictions = sum(np.diagonal(confusion_matrix_array))
    no_predictions = len(y_test_class)
    overall_precision = no_correct_predictions/no_predictions
    # recall
    recall_home = confusion_matrix_array[0][0]/ sum(confusion_matrix_array[0, :])
    recall_draw = confusion_matrix_array[1][1]/ sum(confusion_matrix_array[1, :])
    recall_away = confusion_matrix_array[2][2]/ sum(confusion_matrix_array[2, :])
    
    # append data
    metric_table_rows_classification.append([
        classifier_name,
        overall_precision,
        precision_home,
        precision_draw,
        precision_away,
        recall_home,
        recall_draw,
        recall_away
    ])
    


Now let's look at how our models compare when applying classification techniques

In [122]:
pd.DataFrame( 
    metric_table_rows_classification, 
    columns=[
        'classifier', 
        'accuracy', 
        'precision_home', 
        'precision_draw', 
        'precision_away',
        'recall_home',
        'recall_draw',
        'recall_away'
    ]
)

Unnamed: 0,classifier,overall_precision,precision_home,precision_draw,precision_away,recall_home,recall_draw,recall_away
0,SVM_C,0.622807,0.664336,0.363636,0.606383,0.748031,0.050633,0.838235
1,DT_C,0.459064,0.538462,0.238095,0.524823,0.496063,0.253165,0.544118
2,RF_C,0.584795,0.626984,0.3125,0.603261,0.622047,0.126582,0.816176


# Run sci-kit learn models (via regression)

Lets first define each of our classifiers.

In [128]:
scikit_regression_classifiers_list = [
    (SVR(),'SVM_R'),
    (DecisionTreeRegressor(),'DT_R'),
    (RandomForestRegressor(), 'RF_R'),
    (linear_model.PoissonRegressor(),'POI_R')
]

Now let's run a loop around the sci-kit learn models of interest where we will need to fit our training data and make our predictions. We are treating our problem here as a regression problem by predicting the number of goals scored per team and then we will work out the outcome of the game. We will also calculate the precision of our predictions for home, draw and away.

In [131]:
metric_table_rows_reg = []

for classifier, classifier_name in scikit_classifiers_list:
    # home goals
    home_goals_classifier = classifier
    home_goals_classifier.fit(x_train_reg_hg, y_train_reg_hg)
    y_pred_reg_hg = home_goals_classifier.predict(x_test_reg_hg)
    y_pred_reg_hg = [round(prediction) for prediction in y_pred_reg_hg]
    
    # away goals
    away_goals_classifier = classifier
    away_goals_classifier.fit(x_train_reg_ag, y_train_reg_ag)
    y_pred_reg_ag = away_goals_classifier.predict(x_test_reg_ag)
    y_pred_reg_ag = [round(prediction) for prediction in y_pred_reg_ag]
    
    # list of goals to series
    y_pred_reg_hg = pd.Series(y_pred_reg_hg)
    y_pred_reg_ag = pd.Series(y_pred_reg_ag)

    # grabbing outcome of match 
    blank_series = pd.Series(range(len(y_pred_reg_hg)))
    home_mask = y_pred_reg_hg.gt(y_pred_reg_ag)
    draw_mask = y_pred_reg_hg.eq(y_pred_reg_ag)
    away_mask = y_pred_reg_hg.lt(y_pred_reg_ag)

    y_pred_reg_classification = np.array(
        blank_series.where(~home_mask, 'H').where(~draw_mask, 'D').where(~away_mask, 'A')
    )
    # confusion amtrix
    confusion_matrix_array = confusion_matrix(y_test_class, y_pred_reg_classification)
    
    # precision
    precision_home = confusion_matrix_array[0][0]/ sum(confusion_matrix_array[:, 0])
    precision_draw = confusion_matrix_array[1][1]/ sum(confusion_matrix_array[:, 1])
    precision_away = confusion_matrix_array[2][2]/ sum(confusion_matrix_array[:, 2])
    # accuracy
    no_correct_predictions = sum(np.diagonal(confusion_matrix_array))
    no_predictions = len(y_test_class)
    overall_precision = no_correct_predictions/no_predictions
    # recall
    recall_home = confusion_matrix_array[0][0]/ sum(confusion_matrix_array[0, :])
    recall_draw = confusion_matrix_array[1][1]/ sum(confusion_matrix_array[1, :])
    recall_away = confusion_matrix_array[2][2]/ sum(confusion_matrix_array[2, :])
    
    
    # append data
    metric_table_rows_reg.append([
        classifier_name,
        overall_precision,
        precision_home,
        precision_draw,
        precision_away,
        recall_home,
        recall_draw,
        recall_away
    ])

Now let's look at how our models compare when applying regression techniques

In [132]:
pd.DataFrame( 
    metric_table_rows_reg, 
    columns=[
        'classifier', 
        'accuracy', 
        'precision_home', 
        'precision_draw', 
        'precision_away',
        'recall_home',
        'recall_draw',
        'recall_away'
    ]
)

Unnamed: 0,classifier,overall_precision,precision_home,precision_draw,precision_away,recall_home,recall_draw,recall_away
0,SVM_R,0.593567,0.718447,0.311828,0.684932,0.582677,0.367089,0.735294
1,DT_R,0.497076,0.583333,0.172414,0.54878,0.551181,0.126582,0.661765
2,RF_R,0.596491,0.765957,0.309278,0.675497,0.566929,0.379747,0.75
3,POI_R,0.567251,0.759036,0.316901,0.735043,0.496063,0.56962,0.632353


# Evaluation

In [150]:
evaluation_df = pd.DataFrame( 
    metric_table_rows_classification + metric_table_rows_reg, 
    columns=[
        'classifier', 
        'accuracy', 
        'precision_home', 
        'precision_draw', 
        'precision_away',
        'recall_home',
        'recall_draw',
        'recall_away'
    ]
)

evaluation_df

Unnamed: 0,classifier,accuracy,precision_home,precision_draw,precision_away,recall_home,recall_draw,recall_away
0,SVM_C,0.622807,0.664336,0.363636,0.606383,0.748031,0.050633,0.838235
1,DT_C,0.459064,0.538462,0.238095,0.524823,0.496063,0.253165,0.544118
2,RF_C,0.584795,0.626984,0.3125,0.603261,0.622047,0.126582,0.816176
3,SVM_R,0.593567,0.718447,0.311828,0.684932,0.582677,0.367089,0.735294
4,DT_R,0.497076,0.583333,0.172414,0.54878,0.551181,0.126582,0.661765
5,RF_R,0.596491,0.765957,0.309278,0.675497,0.566929,0.379747,0.75
6,POI_R,0.567251,0.759036,0.316901,0.735043,0.496063,0.56962,0.632353


From the above metrics we can see that models using regression techniques had better precision generally compared to when classification techniques were used for predicting home or away wins. 

Recall on the other hand for predicting home or away wins was higher for models using classification techniques but recall for draws was higher when regression techniques were used.

Accuracy was similar for both types of models.

# Conclusion

In this notebook I have successfully demonstrated a collection of predictive models available from scikit learn for classification and regression problems. We then used these models in our attempt to classify the outcome of a football match. This was achieved with regression techniques by first predicting the number of goals for each team and then working out the outcome from those two values. The accuracy of the models varied dependent on whether it was a classifcation or regression model used. We found that accuracy was pretty similar but precision was higher for classification models but lower for in regards to recall.


The models executed above are very basic and was only used to show the functionality of the models available from sci-kit learn. To improve on these models, I would look to find better data to be used for training and testing. Parameter tuning will also be important to try to maximise the performance of each model where an option could be to write the models from scratch.