In [1]:
import joblib
model = joblib.load('model.joblib')

After the training outlined in the previous notebook, the best model, loaded above, is a RandomForestClassifier. It achieved an accuracy of 53% on the testing data and 47.8% on the data for the 2022 season. The code below reads the 2022 season data, in which results are encoded as 2 for a win, 1 for a draw, and 0 for a loss.

In [2]:
import pandas as pd

svm_cols = ['Season', 'Teams_in_League', 'Home_Team_Goals_For_This_Far',
            'Home_Team_Goals_Against_This_Far', 'Away_Team_Goals_For_This_Far',
            'Away_Team_Goals_Against_This_Far', 'Home_Team_Points',
            'Away_Team_Points', 'Away_Team_Winning_Streak',
            'Home_Team_Unbeaten_Streak', 'Away_Team_Unbeaten_Streak', 'Elo_home',
            'Elo_away', 'Home_Wins_This_Far', 'Home_Draws_This_Far',
            'Home_Losses_This_Far', 'Away_Draws_This_Far',
            'Home_Wins_This_Far_at_Home', 'Home_Draws_This_Far_at_Home',
            'Home_Losses_This_Far_at_Home', 'Home_Draws_This_Far_Away',
            'Away_Wins_This_Far_at_Home', 'Away_Draws_This_Far_at_Home',
            'Away_Losses_This_Far_at_Home', 'Away_Wins_This_Far_Away',
            'Away_Draws_This_Far_Away', 'Capacity', 'Home_Yellow',
            'Away_Team_Yellows_This_Far', 'Away_Red', 'Home_Points_Per_Game',
            'Home_Goals_Per_Game', 'Home_Goals_Against_Per_Game',
            'Away_Points_Per_Game', 'Away_Goals_Per_Game',
            'Away_Goals_Against_Per_Game', 'Away_Cards_Per_Game', 'Pitch_Match',
            'League']

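def read_2022_data(columns=svm_cols):
    # Load the 2022 season and keep only the feature columns the model expects.
    test_data = pd.read_csv('cleaned_results.csv')
    # Copy the selection so the assignment below does not modify a view.
    test_data = test_data[columns].copy()
    # Encode the league as an integer category code.
    test_data['League'] = test_data['League'].astype('category').cat.codes
    return test_data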

def get_2022_results():
    test_data = pd.read_csv('cleaned_results.csv')
    return test_data['Result']

Unnamed: 0,Season,Teams_in_League,Home_Team_Goals_For_This_Far,Home_Team_Goals_Against_This_Far,Away_Team_Goals_For_This_Far,Away_Team_Goals_Against_This_Far,Home_Team_Points,Away_Team_Points,Away_Team_Winning_Streak,Home_Team_Unbeaten_Streak,...,Away_Red,Home_Points_Per_Game,Home_Goals_Per_Game,Home_Goals_Against_Per_Game,Away_Points_Per_Game,Away_Goals_Per_Game,Away_Goals_Against_Per_Game,Away_Cards_Per_Game,Pitch_Match,League
0,2022,20,1.0,0.0,0.0,3.0,3.0,0.0,0.0,1.0,...,1.0,3.000000,1.000000,0.000000,0.000000,0.000000,3.000000,3.000000,1,3
1,2022,20,2.0,3.0,1.0,4.0,3.0,0.0,0.0,0.0,...,0.0,1.500000,1.000000,1.500000,0.000000,0.500000,2.000000,1.000000,1,3
2,2022,20,5.0,10.0,6.0,10.0,6.0,2.0,0.0,0.0,...,0.0,1.500000,1.250000,2.500000,0.500000,1.500000,2.500000,1.000000,1,3
3,2022,20,9.0,17.0,11.0,11.0,9.0,10.0,1.0,0.0,...,1.0,1.500000,1.500000,2.833333,1.666667,1.833333,1.833333,1.333333,1,3
4,2022,20,11.0,20.0,16.0,11.0,12.0,16.0,2.0,0.0,...,0.0,1.500000,1.375000,2.500000,2.000000,2.000000,1.375000,2.000000,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5446,2022,20,56.0,30.0,40.0,19.0,54.0,57.0,0.0,8.0,...,0.0,1.862069,1.931034,1.034483,1.965517,1.379310,0.655172,2.517241,1,9
5447,2022,20,60.0,32.0,26.0,43.0,60.0,28.0,0.0,10.0,...,0.0,1.935484,1.935484,1.032258,0.903226,0.838710,1.387097,2.935484,1,9
5448,2022,20,61.0,33.0,29.0,54.0,63.0,32.0,1.0,1.0,...,0.0,1.909091,1.848485,1.000000,0.969697,0.878788,1.636364,3.030303,1,9
5449,2022,20,65.0,35.0,41.0,38.0,69.0,43.0,1.0,3.0,...,0.0,1.971429,1.857143,1.000000,1.228571,1.171429,1.085714,2.457143,1,9


The code below generates a confusion matrix to help understand how the model performed.

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix

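def scale_array(df):
    # Rescale every feature to the [0, 1] range.
    # Note: the scaler is fitted on the 2022 data itself. If the training
    # features were scaled with their own fitted scaler, reusing that scaler
    # here would keep the features consistent with training.
    scaler = MinMaxScaler()
    return scaler.fit_transform(df)

def accuracy(cm):
    # Correct predictions lie on the diagonal of the confusion matrix.
    return cm.trace() / cm.sum()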

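X_test = scale_array(read_2022_data())
y_real = get_2022_results()
y_pred = model.predict(X_test)

# Predictions are passed first, so rows are predicted classes and columns are
# actual results, both in the order [win, draw, loss].
cm = confusion_matrix(y_pred, y_real, labels=[2, 1, 0])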
print(cm)
print(accuracy(cm))

[[1891 1063 1006]
 [  56   54   57]
 [ 343  321  660]]
0.4778939644102
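
Overall accuracy hides how unevenly the three classes are handled. As a minimal sketch, the per-class precision and recall can be read from the matrix printed above, assuming (as in the call above) that rows are predictions and columns are actual results. The matrix is copied in by hand here so the snippet runs on its own.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are predictions, columns are actual results, both ordered [win, draw, loss].
cm = np.array([[1891, 1063, 1006],
               [  56,   54,   57],
               [ 343,  321,  660]])

labels = ['win', 'draw', 'loss']
precision = cm.diagonal() / cm.sum(axis=1)  # correct / all predictions of that class
recall = cm.diagonal() / cm.sum(axis=0)     # correct / all actual matches of that class

for name, p, r in zip(labels, precision, recall):
    print(f'{name:>4}: precision = {p:.3f}, recall = {r:.3f}')
```

The draw row stands out: the model predicts a draw for only 167 of the 5451 matches and recovers 54 of the 1438 actual draws, so most of its accuracy comes from wins and losses.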


In [10]:
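# Complete mismatches: predicted win but actual loss (cm[0, 2]),
# plus predicted loss but actual win (cm[2, 0]).
complete_mismatches = cm[0, 2] + cm[2, 0]
print(f'Chance of a complete mismatch = {complete_mismatches / cm.sum()}')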

Chance of a complete mismatch = 0.2474775270592552


These figures show that, on the available data, the model has approximately a 48% chance of classifying a new match correctly and roughly a 25% chance of a complete mismatch, i.e. predicting a win as a loss or vice versa. This is a clear improvement over random guessing, which would be correct about 33% of the time with three possible outcomes, but it is still rather low for the model's intended purpose. This is partly due to the difficulty of multiclass classification, and partly due to the difficulty of predicting match results from relatively little data. Constructing a better model would likely require much more information on players and club staff, and perhaps even on a club's financial dealings, such as transfer budgets. Since different players play in each match and their form is a large factor, a better model could use information on team sheets, injury lists, and player statistics such as goals scored, clean sheets, and individual bookings.
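
To put the comparison with chance in context, the sketch below sets the model's accuracy against two simple baselines derived from the same confusion matrix: uniform random guessing, and always predicting the most common actual result. The majority-class baseline is not part of the original analysis; it is added here only as an illustrative reference point, and the matrix is again copied in by hand.

```python
import numpy as np

# Confusion matrix from the 2022 evaluation.
# Rows are predictions, columns are actual results, both ordered [win, draw, loss].
cm = np.array([[1891, 1063, 1006],
               [  56,   54,   57],
               [ 343,  321,  660]])

total = cm.sum()
actual_counts = cm.sum(axis=0)  # number of actual wins, draws, and losses

model_accuracy = cm.trace() / total
uniform_baseline = 1 / 3                           # guessing each outcome equally often
majority_baseline = actual_counts.max() / total    # always predicting the most common result

print(f'Model accuracy:          {model_accuracy:.3f}')
print(f'Uniform random guessing: {uniform_baseline:.3f}')
print(f'Majority-class baseline: {majority_baseline:.3f}')
```

Under these assumptions the majority-class baseline comes to about 42%, so the model's margin over it is just under 6 percentage points, noticeably smaller than its margin over uniform guessing.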