In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

## Loading data

For this notebook we again use the data set prepared for plotting, since the same data is used for modelling in this section.

In [2]:
data = pd.read_csv('plotting_data', index_col = [0])
data.head()

Unnamed: 0,year,round,division,local_goals,visitor_goals,points_local,wins_local,draws_local,losses_local,gf_local,...,pos_local,points_visitor,wins_visitor,draws_visitor,losses_visitor,gf_visitor,ga_visitor,avg_visitor,pos_visitor,match_winner
10,2016,2,1,3,1,1,0.0,1.0,0.0,1,...,7,3,1.0,0.0,0.0,1,0,1.0,5,0
11,2016,2,1,0,0,1,0.0,1.0,0.0,0,...,10,1,0.0,1.0,0.0,0,0,0.0,13,1
12,2016,2,1,1,0,3,1.0,0.0,0.0,1,...,4,1,0.0,1.0,0.0,0,0,0.0,9,0
13,2016,2,1,3,0,3,1.0,0.0,0.0,2,...,2,1,0.0,1.0,0.0,0,0,0.0,11,0
14,2016,2,1,5,0,1,0.0,1.0,0.0,0,...,12,1,0.0,1.0,0.0,1,1,0.0,6,0


The goal columns must be dropped before modelling. They hold the goals scored by each team in the match itself, so they reveal its result and would leak the target to the model.

In [3]:
data = data.drop(['local_goals', 'visitor_goals'], axis=1)
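
To see why these columns leak the result, note that the target can be rebuilt from them alone. The sketch below uses the goal and `match_winner` values of the five rows shown by `data.head()`. It assumes the encoding 0 = local win, 1 = draw, 2 = visitor win, which is inferred from those rows and from the `target_names` order used later, not stated in the data itself.

```python
import numpy as np

# Goal columns and target of the five rows printed by data.head()
local_goals = np.array([3, 0, 1, 3, 5])
visitor_goals = np.array([1, 0, 0, 0, 0])
match_winner = np.array([0, 1, 0, 0, 0])

# Assumed encoding: 0 = local win, 1 = draw, 2 = visitor win
rebuilt = np.where(local_goals > visitor_goals, 0,
                   np.where(local_goals == visitor_goals, 1, 2))

# The goals alone reproduce the target, so a model that sees them
# could score perfectly without learning anything about the teams
print((rebuilt == match_winner).all())
```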

## Modelling



We start by fitting models on this data before doing any feature engineering. Comparing the classifiers shows which one predicts best and gives a baseline result that we will then try to improve.

For each model we obtain the classification report and the confusion matrix.

The classification report shows precision, recall and F1-score for each class, together with the overall accuracy. Models are compared mainly on F1-score and accuracy.

The confusion matrix shows the hits and misses broken down by predicted result. The column indicates the model's prediction and the row indicates the actual result. The main diagonal holds the correctly predicted home wins, draws and away wins, in that order. Off the diagonal, the first column contains the matches predicted as a home win that ended otherwise: for example, the entry in the first column and second row is the number of matches predicted as a home win that turned out to be a draw. The remaining entries of the matrix are read in the same way.
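
The row and column convention can be checked on a toy example. The labels below are made up for illustration and assume the encoding 0 = home win, 1 = draw, 2 = away win.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical results and predictions for five matches
y_true = [0, 0, 1, 2, 1]
y_pred = [0, 1, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row = actual result, column = prediction:
# cm[1, 0] counts matches predicted as a home win that ended in a draw
print(cm[1, 0])
```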

### Logistic Regression

In [4]:
features = data.values[:, :-1]
target = data.values[:, -1]
X, y = features, target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=30)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LogisticRegression(max_iter = 500)
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)

y_pred = model.predict(X_test_scaled)

np.mean(y_pred == y_test)

0.5074946466809421

In [5]:
confusion_matrix(y_test, y_pred)

array([[191,   5,  22],
       [108,   9,  14],
       [ 79,   2,  37]], dtype=int64)

In [6]:
target_names = ['Local win', 'Draw', 'Visitor win']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   Local win       0.51      0.88      0.64       218
        Draw       0.56      0.07      0.12       131
 Visitor win       0.51      0.31      0.39       118

    accuracy                           0.51       467
   macro avg       0.52      0.42      0.38       467
weighted avg       0.52      0.51      0.43       467
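
Each figure in this report can be recovered from the confusion matrix of the previous cell, which makes explicit what precision, recall and F1-score measure. A minimal sketch with NumPy, with the matrix values copied by hand from that output:

```python
import numpy as np

# Logistic Regression confusion matrix (rows = actual, columns = predicted)
cm = np.array([[191, 5, 22],
               [108, 9, 14],
               [79, 2, 37]])

hits = np.diag(cm)                   # correct predictions per class
recall = hits / cm.sum(axis=1)       # share of each actual class that was found
precision = hits / cm.sum(axis=0)    # share of each predicted class that was right
f1 = 2 * precision * recall / (precision + recall)
accuracy = hits.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(accuracy, 2))
```

The low recall for 'Draw' comes directly from the second row: only 9 of the 131 drawn matches were predicted as draws.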



### Decision Tree

In [7]:
features = data.values[:, :-1]
target = data.values[:, -1]
X, y = features, target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=30)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = DecisionTreeClassifier()
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)

# Predict on the scaled test set, matching the scaled data used for training
y_pred = model.predict(X_test_scaled)

np.mean(y_pred == y_test)

0.3747323340471092

In [8]:
confusion_matrix(y_test, y_pred)

array([[86, 77, 55],
       [44, 42, 45],
       [40, 31, 47]], dtype=int64)

In [9]:
target_names = ['Local win', 'Draw', 'Visitor win']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   Local win       0.51      0.39      0.44       218
        Draw       0.28      0.32      0.30       131
 Visitor win       0.32      0.40      0.35       118

    accuracy                           0.37       467
   macro avg       0.37      0.37      0.37       467
weighted avg       0.40      0.37      0.38       467



### Random Forest

In [10]:
features = data.values[:, :-1]
target = data.values[:, -1]
X, y = features, target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=30)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)

y_pred = model.predict(X_test_scaled)

np.mean(y_pred == y_test)

0.45610278372591007

In [11]:
confusion_matrix(y_test, y_pred)

array([[155,  35,  28],
       [ 85,  28,  18],
       [ 63,  25,  30]], dtype=int64)

In [12]:
target_names = ['Local win', 'Draw', 'Visitor win']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   Local win       0.51      0.71      0.60       218
        Draw       0.32      0.21      0.26       131
 Visitor win       0.39      0.25      0.31       118

    accuracy                           0.46       467
   macro avg       0.41      0.39      0.39       467
weighted avg       0.43      0.46      0.43       467



### KNeighbors

In [13]:
from sklearn.neighbors import KNeighborsClassifier
features = data.values[:, :-1]
target = data.values[:, -1]
X, y = features, target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=30)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = KNeighborsClassifier()
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)

y_pred = model.predict(X_test_scaled)

np.mean(y_pred == y_test)

0.45610278372591007

In [14]:
confusion_matrix(y_test, y_pred)

array([[147,  51,  20],
       [ 80,  35,  16],
       [ 66,  21,  31]], dtype=int64)

In [15]:
target_names = ['Local win', 'Draw', 'Visitor win']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   Local win       0.50      0.67      0.58       218
        Draw       0.33      0.27      0.29       131
 Visitor win       0.46      0.26      0.34       118

    accuracy                           0.46       467
   macro avg       0.43      0.40      0.40       467
weighted avg       0.44      0.46      0.44       467



The best result is obtained by one of the simplest models, Logistic Regression, which predicts 51% of the results correctly. It also achieves the highest F1-scores for 'Local win' (0.64) and 'Visitor win' (0.39), but it performs very poorly on 'Draw', with an F1-score of only 0.12 and a recall of 0.07.

For every classifier, 'Draw' is the hardest category to predict. KNeighbors, which ties with Random Forest for the second-best accuracy (46%), handles it comparatively well, with an F1-score of 0.29 against 0.26 for Random Forest and 0.12 for Logistic Regression.

The correct predictions of these models are concentrated on home wins. The KNeighbors classifier is worth noting: it matches the accuracy of Random Forest (46%) while predicting a home win less often (293 times against 303). Their confusion matrices show where the difference comes from: KNeighbors gets more draws right (35 against 28) and about as many away wins (31 against 30).
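
A useful reference point for these accuracies is a classifier that always predicts a home win. The sketch below computes its accuracy from the class supports in the reports (218, 131 and 118 matches) and compares it with the accuracies printed above. The rounded accuracies are copied by hand from those outputs.

```python
# Number of test matches per actual result, from the classification reports
support = {'Local win': 218, 'Draw': 131, 'Visitor win': 118}

# Test accuracies printed by each model's cell, rounded to three decimals
accuracy = {
    'Logistic Regression': 0.507,
    'Decision Tree': 0.375,
    'Random Forest': 0.456,
    'KNeighbors': 0.456,
}

# Accuracy of always predicting the most frequent class
baseline = max(support.values()) / sum(support.values())
print(f"Always 'Local win': {baseline:.3f}")

for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f} ({acc - baseline:+.3f} vs. baseline)")
```

On this test split only Logistic Regression clears that reference point, which is consistent with the correct predictions being concentrated on home wins.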

One plausible explanation is that the dataset is not very large and therefore does not contain enough examples to learn more complex patterns. This would explain why the simplest model, logistic regression, gives the best results.
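
With a small dataset, a single 10% test split also gives a noisy estimate, so one way to compare the four classifiers more robustly is cross-validation. The following is only a sketch: it uses synthetic three-class data from `make_classification` as a stand-in for the match data, and the sample size, fold count and `random_state` values are arbitrary assumptions. Wrapping the scaler and the model in a pipeline keeps the scaler fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match data (three classes, assumed sizes)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=500),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'KNeighbors': KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    # Scaling happens inside each fold, so no test data reaches the scaler
    pipeline = make_pipeline(MinMaxScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}")
```

To apply this to the notebook's data, `X` and `y` would be replaced by the `features` and `target` arrays built in the cells above.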

